


INSTALL - Swish-e Installation Instructions


[ TOC ]

OVERVIEW

This document describes how to download, build and install Swish-e. Also described is how to build Swish-e with optional, yet recommended libraries that extend and enhance Swish-e.

This document also provides instructions on how to get help installing and using Swish-e (and the important information you should provide when asking for help).

Also, below is a basic overview of using Swish-e to index documents, with pointers to other more advanced examples.

For those in a hurry, see Quick Start for the Impatient.

Also, please read the Swish-e FAQ (SWISH-FAQ), as it answers many frequently asked questions.

[ TOC ]


SYSTEM REQUIREMENTS

Swish-e is written in C, and, up to this time, it has been tested on a number of platforms. Any current Linux distribution should have no problem building Swish-e. Swish-e has also been used on these platforms:

 
    Intel - FreeBSD 4.6-STABLE
    gcc version 2.95.4 20020320 [FreeBSD]

 
    Sun Enterprise 220R UltraSPARC-based server - loaded with Solaris 8
    gcc version 2.95.3 20010315 (release)

 
    Sun Enterprise 3500 - Solaris 2.6
    gcc version 2.95.1 19990816 (release)

 
    IBM RS/6000 43P-150 - Debian 3.0
    gcc version 2.95.4 20011002 (Debian prerelease)

 
    Apple Dual G4 QuickSilver - MacOS X 10.1.4  
    Apple Computer, Inc. version gcc-932.1, based on gcc version 2.95.2 19991024 (release)

 
    Mac OS X 10.1.2 on an iBook.
    
    DEC Alpha - Debian 3.0
    gcc version 2.95.4 20011002 (Debian prerelease)

 
    Sun Ultra 60 UltraSPARC II workstation - Debian GNU/Linux 3.0
    gcc version 2.95.4 20011002 (Debian prerelease)

 
    OpenVMS  VAX or Alpha version 7.1 or later
    DECC compiler version 6.0 or later
    
    Win32 platforms

Unless you are using the Win32 binary distribution, a C compiler is needed. Pretty much any standard compiler should do, although you will probably have best luck with a current version of gcc. If you are using something else (such as HP-UX or AIX) you may see more warnings during the build process. Any problems should be sent to the Swish-e discussion list after searching the list archives.

Swish-e can use the optional libxml2 library to parse HTML and XML files. Instructions for installing and enabling the library are described below.

The XML parser that is included with Swish-e uses James Clark's Expat library (included with the Swish-e distribution). For HTML, libxml2 offers a much better parser than Swish-e's older html.c parser, so its use is recommended.

Currently, setting a content type (IndexContents or DefaultContents) of ``HTML'' uses Swish-e's html.c parser, whereas a setting of ``HTML2'' uses the libxml2 parser. This may change in future releases.

zlib compression

http://www.gzip.org/zlib/

Swish-e can make use of zlib to compress document properties. This is recommended if you are using StoreDescription.

A Swish-e program built with zlib will read an index from a version of Swish-e that was not built with zlib. But, if you are searching an index that was compressed with zlib then you will need to use a version of Swish-e built with zlib. Therefore, it's recommended to always include zlib support.

Memory

Swish needs quite a bit of memory while indexing. How much depends on what you are indexing. The index is portable between platforms (that use the same basic data type sizes), so you can index on a machine that has lots of memory available and move the index files to another machine for searching. Use the -e switch if you are short on memory.

Perl modules

http://www.cpan.org

http://search.cpan.org

Swish-e uses a Perl script for spidering web sites. The script requires the LWP bundle of modules (see http://search.cpan.org/search?dist=libwww-perl ). (Note: depending on your Perl installation, you might need to install additional modules required by LWP; for requirements and downloads check http://www.cpan.org or http://search.cpan.org). The Perl helper script was tested with perl 5.005, 5.6.0, and 5.6.1, although it should probably work with any version 5 release. Do note that the LWP, HTTP, and HTML modules are updated often for bug fixes and such -- do check for upgrades, and don't expect that your system admin has been keeping up with bug fixes.

[ TOC ]


Platform Specific Information

A configure script is used to determine platform specific details for building swish. Please contact the Swish-e discussion list if you notice any platform specific problems while building Swish-e.

Specific information for various platforms can be found in subdirectories of the src directory. For example, the Win32 files can be found in src/win32, and instructions for building under VMS can be found in src/vms.

The Windows binary is distributed as a separate package from the source distribution. See http://Swish-e.org for download information.

Swish-e indexes are not portable between 32 and 64 bit platforms, but should be portable between machines with different ``endian'' types.

[ TOC ]


INSTALLATION

Instructions below are for installing Swish-e from source. Installing from source is recommended, but you should also check the Swish-e web site for binary distributions for your platform.

Windows binary distributions are available from the Swish-e site.

[ TOC ]


Brief Instructions

 
    ./configure
    make
    make test
    su root
    make install

Swish uses a configure script to generate a Makefile for your platform. The configure script should detect and use optional libraries if found on your system.

[ TOC ]


Building Swish-e with libxml2

Swish-e can optionally use the libxml2 library for parsing HTML and XML documents.

Building with libxml2 is recommended, especially if you are parsing HTML. As mentioned above, libxml2 works well. The HTML parser in Swish-e has been in use for years, but libxml2 offers more features (including more features for parsing XML) and is more accurate. If you are running Linux, it may already be installed.

The library may also be available through your distribution's package system (for example, the Debian package system). Check your distribution's web site for more information, as this is a very easy way to install this library.

When building, you may need to specify the location of zlib. This can happen (for example, on Solaris) when the ./configure script finds the zlib header files, but the compiler and linker do not know to look in /usr/local/lib for the library. You may see an error like:

 
    ld: fatal: library -lz: not found
    *** Error code 1

In this case, try specifying where zlib can be found. For example, if libz is located in /usr/local/lib, you would use this when building:

 
    ./configure --with-zlib=/usr/local

To build Swish-e without zlib support:

 
    ./configure --without-zlib

If you build an optional library yourself and do not have root access, you can install it under your home directory by passing a prefix to that library's configure script:

 
    ./configure --prefix=$HOME/local

This will install the headers and library files in $HOME/local/include and $HOME/local/lib. You will need to inform the Swish-e build process of this non-standard directory location (explained below).

 
    make
    make install

When a shared library is installed in a non-standard location, Swish-e also needs to be told where that library is at run time. There are a number of ways to do this. First, you can set the environment variable LD_RUN_PATH *before* running make to create Swish-e. This will add the path directly to the Swish-e executable file.

For example, under Bourne type shells:

 
    LD_RUN_PATH=$HOME/local/lib make

Other shells (like csh and tcsh) may require:

 
    setenv LD_RUN_PATH $HOME/local/lib
    make

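Another option is to use the LD_LIBRARY_PATH environment variable. This is a list of directories to search for libraries when a program is run. See your system's dynamic linker man page (for example, ld.so(8) on Linux) for more info.

Whichever method you use, the library must remain where Swish-e expects to find it at run time, so it should not be deleted or moved.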
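
As a minimal sketch of the LD_LIBRARY_PATH approach, the lines below prepend a library directory to the variable for the current shell session while keeping any value that is already set. The directory $HOME/local/lib is only the example location used above; adjust it to wherever the library was actually installed.

```shell
# Directory holding the shared library (assumed: the example location above).
libdir="$HOME/local/lib"

# Prepend it and keep any existing entries. The ${VAR:+...} form avoids
# a trailing colon when LD_LIBRARY_PATH was previously unset or empty.
LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH

echo "LD_LIBRARY_PATH is now: $LD_LIBRARY_PATH"
```

The setting lasts only for the current shell, so it would need to be repeated (or placed in a shell startup file) wherever Swish-e is run.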

[ TOC ]


Building Swish-e with zlib

Building with zlib is similar to building Swish-e with libxml2. The configure script will look for zlib and, if it is found, link Swish-e with the zlib library.

zlib is common on many systems, but may be out of date, and versions prior to 1.1.4 have a known security issue. You should run at least version 1.1.4. To link with zlib in a non-standard location use, for example:

 
    ./configure --with-zlib=$HOME/zlib

You may also need to set the LD_RUN_PATH or LD_LIBRARY_PATH variables. See above for more details.
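
One way to check the installed zlib against the 1.1.4 minimum is to read the ZLIB_VERSION definition from the zlib.h header. The sketch below is only an illustration: the helper name version_ge, the ZLIB_HEADER override variable, and the default header path /usr/include/zlib.h are assumptions, and the header may live elsewhere on your system.

```shell
# version_ge A B: succeed when dotted version A is at least version B.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            xi = (i <= n) ? x[i] + 0 : 0   # missing fields count as 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0                             # versions are equal
    }'
}

# Assumed header location; override with ZLIB_HEADER=/path/to/zlib.h
zlib_header="${ZLIB_HEADER:-/usr/include/zlib.h}"

if [ -r "$zlib_header" ]; then
    # zlib.h carries a line of the form: #define ZLIB_VERSION "x.y.z"
    found=$(sed -n 's/^#define ZLIB_VERSION "\([0-9.]*\).*/\1/p' "$zlib_header")
    if [ -z "$found" ]; then
        echo "could not read a version from $zlib_header"
    elif version_ge "$found" 1.1.4; then
        echo "zlib $found: meets the 1.1.4 minimum"
    else
        echo "zlib $found: older than 1.1.4, upgrade recommended"
    fi
else
    echo "zlib header not found at $zlib_header"
fi
```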

[ TOC ]


Downloading and unpacking and building Swish-e

If you are reading this INSTALL document, then you probably already have downloaded and unpacked the distribution. But just in case...

Make sure you are using the current release from http://Swish-e.org. If you have any questions about which version to use, please ask on the Swish-e discussion list.

How you download Swish-e is up to you: lynx, lwp-download, wget are all common methods.

  1. Uncompress the distribution file

     
       gzip -dc swish-e.x.x.tar.gz | tar xof -

    or on some versions of tar, simply

     
       tar -zxof swish-e.x.x.tar.gz

    Uncompressing should create the following directories:

     
       swish-e-x.x/            configure script and top-level Makefile
       swish-e-x.x/pod/        Swish-e documentation
       swish-e-x.x/html/       HTML version of the documentation
       swish-e-x.x/src/        source code
       swish-e-x.x/conf/       example configuration files and stopword files
       swish-e-x.x/example/    working example CGI scripts
       swish-e-x.x/filters/    perl module(s) to make filtering documents easy
       swish-e-x.x/filter-bin/ filter samples for use with FileFilter feature
       swish-e-x.x/prog-bin/   a web spider and other examples for use with -S prog
       swish-e-x.x/perl/       perl interface to the Swish-e C library
       swish-e-x.x/src/expat/  James Clark's Expat XML parser
       swish-e-x.x/src/win32/  win32 binary and build files
       swish-e-x.x/src/vms/    files required for building under VMS
       swish-e-x.x/tests/      tests used for running "make test"
       swish-e-x.x/doc/        directory used for building the documentation

  2. Make any needed changes in src/config.h

    Compile-time configuration settings are adjusted in the file src/config.h. Most of the settings may also be specified in the configuration file used during indexing.

    You probably will not need to change this file, but it's helpful to become familiar with the default compiled-in settings.

  3. Build Swish-e

    Building Swish-e on most systems is a simple procedure. In the swish-e-x.x/ top-level directory, type the following commands:

     
       ./configure
       make
       make test

    You should build swish as a normal user (i.e. not as ``root'').

    If Swish-e needs to find optional libraries in non-standard locations, see above for the required configure options.

    The above will create the Swish-e executable src/swish-e and test that the executable is working correctly. make test will generate an index file in the tests directory and run a number of searches against this index. At this time, the tests really just make sure that swish-e was compiled correctly and runs.

    You may optionally ``build'' the swish-search executable. This is a version of Swish-e that cannot write to the index file. This version may provide somewhat improved security in a CGI environment. The binaries swish-e and swish-search are the same files -- the additional security is enabled when the binary is named swish-search. swish-search is not a substitute for good file system and CGI security. Please review the many CGI security papers available on-line.

    Again, this is an optional step:

     
       make swish-search

    which simply copies the file swish-e to swish-search.

  4. Install Swish-e

    Move the swish-e (and/or swish-search) executable to its final location (normally /usr/local/bin). You may simply copy the program anywhere you see fit, or you may use the make install command to install it to the location defined by the configure script.

    You may need superuser privileges:

     
       su root
       make install
       exit

    IMPORTANT: Do not run swish-e as the superuser (root).

    The bin directory may be set when first running ./configure. For example:

     
       ./configure --bindir=$HOME/bin

    sets the installation directory to $HOME/bin and make install will install the program in that location.

[ TOC ]


Join the Swish-e discussion list

The Swish-e discussion list is the place to ask questions about installing and using Swish-e, see or post bug fixes or security announcements, and a place where you can offer help to others.

The list is typically very low traffic, so it won't overload your inbox. Please take time to subscribe. See http://Swish-e.org.

If you are using Swish-e on a public site, please let the list know so it can be added to the list of sites that use Swish-e!

Please review QUESTIONS AND TROUBLESHOOTING before posting a question to the Swish-e list.

[ TOC ]


Installing the Swish-e C Library (optional)

Swish 2.2 creates the C library libswish-e.a during the build. Install this library if you wish to embed Swish-e into another application. For example, the library should be installed before using the high level Perl SWISH modules located on CPAN. http://search.cpan.org/search?mode=module&query=SWISH

This is an *optional* step. Most users will not need to install the library.

To install the library, issue the following commands (again, you may need superuser privileges):

 
   su root
   make install-lib
   exit

By default this will install the library in /usr/local/lib, but this directory can be set when running ./configure with the --libdir option. For example:

 
   ./configure --bindir=$HOME/bin --libdir=$HOME/lib

So make install will install the swish-e binary in $HOME/bin and make install-lib will install the libswish-e.a library in $HOME/lib.

Note: You may wish to run make realclean before running ./configure again.

[ TOC ]


Creating PDF and Postscript documentation (optional)

The Swish-e documentation in HTML format was created with Pod::HtmlPsPdf, a package of Perl modules written and/or modified by Stas Bekman to automate the conversion of documents in pod format (see perldoc perlpod) to HTML, Postscript, and PDF. A slightly modified version of this package is included with the Swish-e distribution and used for building the HTML.

If your system has the necessary tools to build Postscript and the converter ps2pdf installed, you may be able to build the Postscript and PDF versions of the documentation. After you have run ./configure, type from the top-level directory of the distribution:

 
    make pdf

And with any luck you will end up with these two files in the top-level directory:

 
    swish-e_documentation.pdf
    swish-e_documentation.ps

Most people find reading the documentation in HTML most convenient.

[ TOC ]


Installing the Swish-e documentation as man(1) pages (optional)

Part of the included Swish-e documentation can be installed as system man(1) pages. Only the reference related pages are installed (it's assumed that you don't need to install the README or INSTALL documents as man pages). You must have the pod2man program installed on your system (which you probably do if you have Perl).

To build the man pages and install them into your system, type from the top-level directory (after running ./configure):

 
    su root
    make install-man
    exit

You will need to su root if you do not have write access to the man directory.

The man pages are installed in the system man directory. This directory is determined by running ./configure and can be set by passing the directory when running ./configure.

For example,

 
    ./configure --mandir=/usr/local/doc/man

Information on running ./configure can be found by typing:

 
    ./configure --help

The pod source files used to create the man files were written running under perl 5.6.1. Older versions of Perl may complain slightly about the formatting of the pod files. This shouldn't be a problem, but please let the Swish-e list know if otherwise. Then upgrade your version of perl. ;)

[ TOC ]


QUESTIONS AND TROUBLESHOOTING

Please search the Swish-e list archive before posting a question, and check the SWISH-FAQ to see if your question has already been answered.

Support for installation, configuration and usage is available via the Swish-e discussion list. Visit http://swish-e.org for information. Do not contact developers directly for help -- always post your question to the list.

Before posting, use the available tools to narrow down the problem.

Swish-e has the -T, -v, and -k switches that may help resolve issues. If possible find a single document that shows the problem, then index with -T INDEXED_WORDS and watch the exact words that are indexed. Use -H 9 when searching and look at Parsed Words: to make sure you are searching the correct words.

You can also use programs like gdb to help find segfaults and other run-time errors, and programs like truss or strace can often provide interesting information, if you are adventurous.

[ TOC ]


When posting please provide the following information:

  • The exact version of Swish-e that you are using. Running Swish-e with the -V switch will print the version number. Also, supply the output from uname -a or similar command that identifies the operating system you are running on. If you are running an old version of swish be prepared for a response to your question of ``upgrade.''

  • A summary of the problem. This should include the commands issued (e.g. for indexing or searching) and their output, and why you don't think it's working correctly. Please cut-n-paste the exact commands and their output instead of retyping to avoid errors.

  • Include a copy of the configuration file you are using, if any. Swish-e has reasonable defaults so in many cases you can run it without using a configuration file. But, if you need to use a configuration file, reduce it down to the absolute minimum number of commands required to demonstrate your problem. Again, cut-n-paste.

  • A small copy of a source document that demonstrates the problem.

    If you are having problems spidering a web server, use lwp-download or wget to copy the file locally to make sure you can index the document using the file system method.

    If you do need help with spidering, don't post fake URLs, as it makes it impossible to help. If you don't want to expose your web page to the people on the Swish-e list, find some other site to test spidering on. If that works, but you still cannot spider your own site then post your real URL if you want help.

  • If you are having trouble building Swish-e please cut-n-paste the output from make (or from ./configure if that's where the problem is).
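
The version and platform details requested above can be collected in one step. The following is a small sketch rather than an official tool: the function name swish_report is made up for this example, and it only calls swish-e when the program is found on the PATH.

```shell
# Print the details requested above. swish-e is only invoked when it is
# found on PATH, so the function can be run safely on any machine.
swish_report() {
    echo "== Operating system =="
    uname -a
    echo "== Swish-e version =="
    if command -v swish-e >/dev/null 2>&1; then
        swish-e -V
    else
        echo "swish-e not found on PATH"
    fi
}

swish_report
```

Redirecting the output to a file (for example, swish_report > report.txt) makes it easy to paste into a message to the list along with the exact commands and configuration file.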

[ TOC ]


BASIC CONFIGURATION AND USAGE

This section should give you a basic overview of indexing and searching with Swish-e. Other examples can be found in the conf directory, which will step you through a number of different configurations. Also, please review the SWISH-FAQ.

Swish-e reads a configuration file (see SWISH-CONFIG) for directives that control what and how Swish-e indexes files. Then running Swish-e is controlled by command line arguments (see SWISH-RUN).

Swish-e does not require a configuration file, but most people need to change the default behavior by placing settings in a configuration file.

To try the examples below change to the tests subdirectory of the distribution. The tests will use the *.html files in this directory when creating the test index. You may wish to review these *.html files to get an idea of the various native file formats that Swish-e supports.

[ TOC ]


Step 1: Create a Configuration File

The configuration file controls what and how Swish-e indexes. The configuration file consists of directives, comments, and blank lines. The configuration file can be any name you like.

This example will work with the documents in the tests directory. You may wish to review the tests/test.config configuration file used for the make test tests.

For example, a simple configuration file (swish-e.conf):

 
    # Example Swish-e Configuration file

 
    # Define *what* to index
    # IndexDir can point to directories and/or files

 
    # Here it's pointing to the current directory
    IndexDir .

 
    # But only index the .html files
    IndexOnly .html

 
    # Show basic info while indexing
    IndexReport 1

And that's a simple configuration file. It says to index all the .html files in the current directory, and provide some basic output while indexing.

The complete list of configuration file directives is described in SWISH-CONFIG.

[ TOC ]


Step 2: Index your Files

Now, make sure you are in the tests directory and save the above example configuration file as swish-e.conf. Then run Swish-e using the -c switch to specify the name of the configuration file.

 
    ../src/swish-e -c swish-e.conf

 
    Indexing Data Source: "File-System"
    Indexing "."
    Removing very common words...
    no words removed.
    Writing main index...
    Sorting words ...
    Sorting 55 words alphabetically
    Writing header ...
    Writing index entries ...
      Writing word text: Complete
      Writing word hash: Complete
      Writing word data: Complete
    55 unique words indexed.
    Writing file list ...
    Property Sorting complete.                                         
    Writing sorted index ...
    5 files indexed.  1252 total bytes.
    Elapsed time: 00:00:00 CPU time: 00:00:00
    Indexing done!

This created the index file index.swish-e. This is the default index file name unless the IndexFile directive is specified in the configuration file:

 
    IndexFile ./website.index
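
Steps 1 and 2 can be combined into a short script. The sketch below writes the example configuration (with the optional IndexFile directive added) and then runs the indexer. It assumes it is run from the tests directory and that the freshly built binary is at ../src/swish-e; if the binary is not there, only the configuration file is written.

```shell
# Write the example configuration, adding the optional IndexFile directive.
cat > swish-e.conf <<'EOF'
# Example Swish-e Configuration file
IndexDir .
IndexOnly .html
IndexReport 1
IndexFile ./website.index
EOF

# Location of the freshly built binary relative to the tests directory
# (an assumption; point this at your installed swish-e if you have one).
swish=../src/swish-e

if [ -x "$swish" ]; then
    "$swish" -c swish-e.conf
else
    echo "swish-e binary not found at $swish; configuration written only"
fi
```

Because the index is no longer written to the default index.swish-e, later searches need the -f switch to name the index file, for example ../src/swish-e -w sample -f ./website.index.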

[ TOC ]


Step 3: Search

You specify your search terms with the -w switch. For example, to find the files that contain the word sample you would issue the command:

 
    ../src/swish-e -w sample

This example assumes that you are in the tests directory, and the Swish-e binary is in the ../src directory. Swish-e returns in response to that command the following:

 
    ../src/swish-e -w sample

 
    # SWISH format: 2.2
    # Search words: sample
    # Number of hits: 2
    # Search time: 0.000 seconds
    # Run time: 0.005 seconds
    1000 ./test.html "If you are seeing this, the test was successful!" 437
    .

So the word sample was found in two documents. The first number shown is the relevance or rank of the search term, followed by the file containing the search term, the title of the document, and finally the length of the document.

The period (``.'') alone at the end marks the end of results.

Much more information may be retrieved while searching by using the -x and -H switches (see SWISH-RUN) and by using Document Properties (see SWISH-CONFIG).
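
Because the default output is line oriented, it is easy to post-process in a script. The following sketch prints only the rank and file name of each hit: it skips the header lines that begin with #, and stops at the lone period. The function name list_hits is made up for this example, and the sketch assumes that file names contain no spaces.

```shell
# list_hits: read Swish-e search output on stdin and print
# "rank<TAB>path" for each result line.
list_hits() {
    awk '
        /^#/      { next }   # skip header lines
        $0 == "." { exit }   # a lone period marks the end of results
        NF >= 2   { print $1 "\t" $2 }
    '
}

# Demonstration using the sample output shown above.
list_hits <<'EOF'
# SWISH format: 2.2
# Search words: sample
# Number of hits: 2
# Search time: 0.000 seconds
# Run time: 0.005 seconds
1000 ./test.html "If you are seeing this, the test was successful!" 437
.
EOF
```

In normal use the function would read from the search itself, for example ../src/swish-e -w sample | list_hits.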

[ TOC ]


Phrase Searching

To search for a phrase in a document use double-quotes to delimit your search terms. (The phrase delimiter is set in src/swish.h.)

You must protect the quotes from the shell.

For example, under Unix:

 
    swish-e -w '"this is a phrase" or (this and that)'
    swish-e -w 'meta1=("this is a phrase") or (this and that)'

Or under the Windows command.com shell:

 
    swish-e -w \"this is a phrase\" or (this and that)

The phrase delimiter can be set with the -P switch.
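
To see what "protecting the quotes" means in a Bourne-type shell, it can help to print the argument exactly as the program would receive it. This is a small illustration only; the variable name query is arbitrary.

```shell
# Single quotes pass the inner double quotes (and the parentheses)
# through to the program unchanged.
query='"this is a phrase" or (this and that)'

# Show exactly what would be handed to Swish-e as the -w argument.
printf 'query argument: %s\n' "$query"

# Typical invocation, quoting the variable so it stays one argument:
#   swish-e -w "$query"
```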

[ TOC ]


Boolean Searching

You can use the Boolean operators and, or, and not in searching. Without these Boolean operators, Swish-e will assume you're anding the words together.

Here are some examples:

 
    ../src/swish-e -w 'apples oranges'
    ../src/swish-e -w 'apples and oranges'  ( Same thing )

 
    ../src/swish-e -w 'apples or oranges'

 
    ../src/swish-e -w 'apples or oranges not juice' -f myIndex 

retrieves first the files that contain either of the words ``apples'' or ``oranges''; then, among those, the ones that do not contain the word ``juice''. The -f switch names the index file to search (here, myIndex).

A few others to ponder:

 
    ../src/swish-e -w 'apples and oranges or pears'
    ../src/swish-e -w '(apples and oranges) or pears'  ( Same thing )
    ../src/swish-e -w 'apples and (oranges or pears)'  ( Not the same thing )

See SWISH-SEARCH for more information.

[ TOC ]


Context Searching

The -t option in the search command line allows you to search for words that exist only in specific HTML tags. Each character in the string you specify in the argument to this option represents a different tag in which the word is searched; that is you can use any combinations of the following characters:

 
    H means all <HEAD> tags
    B stands for <BODY> tags
    t is all <TITLE> tags
    h is <H1> to <H6> (header) tags
    e is emphasized tags (this may be <B>, <I>, <EM>, or <STRONG>)
    c is HTML comment tags (<!-- ... -->)

For example:

 
    # Find only documents with the word "linux" in the <TITLE> tags.
    ./swish-e -w linux -t t

 
    # Find the word "apple" in titles or comments
    ./swish-e -w apple -t tc

[ TOC ]


META Tags

For the last example we will instruct Swish-e to use META tags to define fields in your documents.

META names are a way to define ``fields'' in your documents. You can use the META names in your queries to limit the search to just the words contained in that META name of your document. For example, you might have a META tagged field in your documents called subjects and then you can search your documents for the word ``foo'' but only return documents where ``foo'' is within the subjects META tag.

Document Properties are somewhat related to meta tags: Properties allow the contents of a META tag in a source document to be stored within the index, and that text to be returned along with search results.

META tags can have two formats in your documents.

 
    <META NAME="keyName" CONTENT="some Content">

And in XML format

 
    <keyName>
        Some Content
    </keyName>

 
    <html>
    <body>
        Hello swish users!
        <keyName>
            this is meta data
        </keyName>.
    </body>

The last example embeds the XML format inside an HTML document. This, of course, is invalid HTML.

To continue with our sample swish-e.conf file, add the following lines:

 
    # Define META tags
    MetaNames meta1 meta2 meta3

Reindex to include the changes:

 
    ../src/swish-e -c swish-e.conf

Now search, but this time limit your search to META tag ``meta1'':

 
    ../src/swish-e -w 'meta1=metatest1'

Again, please see SWISH-RUN and SWISH-CONFIG for complete documentation of the various indexing and searching options.

[ TOC ]


Additional Examples

The above example indexes local files using the file system access method -S fs. You may also index files that are located on a local or remote web server by using the HTTP access method -S http, or via the prog input method -S prog. These are described in SWISH-RUN and example configuration files for using these methods can be found in the conf directory of the Swish-e distribution.

The -S prog access method can be used to index any type of document, such as documents stored in a database (RDBMS), or documents that need to be processed before they can be indexed. Examples for using the -S prog method are shown in the prog-bin directory.

Swish-e can also use filters to convert documents as they are processed by Swish-e. For example, MS-Word or PDF documents can be converted and indexed by Swish-e by using filters. See the section on filters in SWISH-CONFIG, and the examples shown in the filters and filter-bin directories.

[ TOC ]


QUICK START FOR THE IMPATIENT

[ TOC ]


Installation

Here are the steps required on most platforms for downloading and installing Swish-e.

 
    ~ $ wget http://swish-e.org/<path to current swish-e version>.tar.gz
    ~ $ gzip -dc <path to current swish-e version>.tar.gz | tar xof -
    ~ $ cd swish-e-2.2  (this directory will depend on the version of Swish-e)
    
    ~/swish-e-2.2 $ ./configure
    ~/swish-e-2.2 $ make
    ~/swish-e-2.2 $ make test
    ...
     ** All tests completed! **

Not required, but once built you can install the src/swish-e program.

 
    ~/swish-e-2.2 $ su root
    (password)
    ~/swish-e-2.2 # make install
    ~/swish-e-2.2 # exit

This installs to /usr/local/bin by default.

[ TOC ]


Indexing Examples

Here are three examples of using Swish-e. The first example shows how to use a very simple configuration file to index the Swish-e documentation. The second example shows spidering using the built-in HTTP method. The third example uses an external program for indexing and shows use of a CGI script for searching.

[ TOC ]


Example 1 - simple file system indexing

First, create a configuration file off your home directory:

 
    ~ $ mkdir test
    ~ $ cd test
    ~/test $ cat conf
    IndexDir ../swish-e/html
    IndexOnly .html

Now create the index:

 
    ~/test $ swish-e -c conf
    Indexing Data Source: "File-System"
    Indexing "../swish-e/html"
    Removing very common words...
    no words removed.
    Writing main index...
    Sorting words ...
    Sorting 2764 words alphabetically
    Writing header ...
    Writing index entries ...
      Writing word text: Complete
      Writing word hash: Complete
      Writing word data: Complete
    2764 unique words indexed.
    4 properties sorted.                                              
    13 files indexed.  430611 total bytes.  42343 total words.
    Elapsed time: 00:00:00 CPU time: 00:00:00
    Indexing done!

And search:

 
    ~/test $ swish-e -w IndexDir
    # SWISH format: 2.2rc1-dev
    # Search words: IndexDir
    # Number of hits: 4
    # Search time: 0.001 seconds
    # Run time: 0.043 seconds
    1000 ../swish-e/html/SWISH-CONFIG.html "SWISH-Enhanced: SWISH-CONFIG - Configuration File Directives" 132887
    898 ../swish-e/html/SWISH-RUN.html "SWISH-Enhanced: SWISH-RUN - Running Swish-e and Command Line Switches" 48037
    856 ../swish-e/html/SWISH-FAQ.html "SWISH-Enhanced: The Swish-e FAQ - Answers to Common Questions" 69709
    834 ../swish-e/html/INSTALL.html "SWISH-Enhanced: INSTALL - Swish-e Installation Instructions" 58647
    .

[ TOC ]


Example 2 - spidering

This example uses Swish-e's ``HTTP'' method of spidering. This method is deprecated due to its lack of features. Spidering uses a Perl helper script. You must have the Perl package LWP (libwww-perl) installed on your system.

 
    ~/test $ cat config.http
    Delay 1

 
    # Only index.html and top-level docs
    MaxDepth 2

 
    # Location of the "swishspider" helper program
    SpiderDirectory ../swish-e/src

 
    EquivalentServer http://swish-e.org http://www.swish-e.org

Index:

 
    ~/test $ swish-e -c config.http -S http -i http://swish-e.org/2.2/docs/
    Indexing Data Source: "HTTP-Crawler"
    Indexing "http://swish-e.org/2.2/docs/"
    Removing very common words...
    no words removed.
    Writing main index...
    Sorting words ...
    Sorting 2915 words alphabetically
    Writing header ...
    Writing index entries ...
      Writing word text: Complete
      Writing word hash: Complete
      Writing word data: Complete
    2915 unique words indexed.
    4 properties sorted.                                              
    16 files indexed.  468568 total bytes.  45051 total words.
    Elapsed time: 00:00:18 CPU time: 00:00:00
    Indexing done!

And search:

 
    ~/test $ swish-e -w IndexDir
    # SWISH format: 2.2rc1-dev
    # Search words: IndexDir
    # Number of hits: 4
    # Search time: 0.001 seconds
    # Run time: 0.043 seconds
    1000 http://swish-e.org/2.2/docs/SWISH-CONFIG.html "SWISH-Enhanced: SWISH-CONFIG - Configuration File Directives" 132887
    898 http://swish-e.org/2.2/docs/SWISH-RUN.html "SWISH-Enhanced: SWISH-RUN - Running Swish-e and Command Line Switches" 48037
    856 http://swish-e.org/2.2/docs/SWISH-FAQ.html "SWISH-Enhanced: The Swish-e FAQ - Answers to Common Questions" 69709
    834 http://swish-e.org/2.2/docs/INSTALL.html "SWISH-Enhanced: INSTALL - Swish-e Installation Instructions" 58647
    .

[ TOC ]


Example 3 - using an external program

This is a more advanced example that spiders a web site using the included prog-bin/spider.pl program and uses the included example/swish.cgi script for searching the index via a web interface.

This example assumes that libxml2 and zlib are installed on the system, that you have a current version of Perl and current versions of the LWP, HTML::*, and HTTP::* modules installed, and that Apache is installed and operating.

If you have any trouble with these instructions please read the detailed installation instructions above, and see the documentation included with the swish.cgi script and the spider.pl programs. Please don't ask for help without reading the ``real'' documentation first.

  1. Make a working directory and copy files:

     
        ~ $ mkdir ~/swishtest
        ~ $ cd ~/swishtest

     
        ~/swishtest $ cp ~/swish-e-2.2/src/swish-e .             
        ~/swishtest $ cp ~/swish-e-2.2/prog-bin/spider.pl .
        ~/swishtest $ cp ~/swish-e-2.2/example/swish.cgi .           
        ~/swishtest $ cp -rp ~/swish-e-2.2/example/modules/ .
        ~/swishtest $ chmod 755 swish.cgi spider.pl
        ~/swishtest $ chmod 644 modules/*

  2. Create the index:

    You must create a swish configuration file and a spider configuration file. Here's the Swish-e configuration file:

     
        ~/swishtest $ cat swish.conf
        
        # Program to read documents
        IndexDir ./spider.pl

     
        # Define the config file for the spider to use
        SwishProgParameters spider.conf     

     
        # Use libxml2 for parsing documents
        DefaultContents HTML2
        IndexContents TXT2 txt

     
        # Cache document contents in the index for context display
        StoreDescription HTML2 <body>

    Here's the configuration file for the spider program. Run perldoc prog-bin/spider.pl for details.

     
        ~/swishtest $ cat spider.conf

     
        # Example spider configuration file to index the 
        # split version of the swish-e documentation

     
        @servers = (
            {

     
                base_url        => 'http://swish-e.org/2.2/docs/split/index.html',
                same_hosts      => [ qw/www.swish-e.org/ ],
                email           => 'swish-impatient@domain.invalid',
                delay_min       => .0001,

     
                # Define call-back functions to fine-tune the spider

     
                test_url        => sub {
                    my $uri = shift;

     
                    # Skip requesting files that are probably not text
                    return if $uri->path =~ m[\.(?:gif|jpeg|png)$]i;

     
                    # Limit spidering to the /2.2/docs/split/ path
                    return unless $uri->path =~ m[/2\.2/docs/split/];

     
                    return 1;  # otherwise, ok to search
                },

     
                # Only index text/html or text/plain
                test_response   => sub {
                    my ( $uri, $server, $response ) = @_;

     
                    return $response->content_type =~ m[(?:text/html|text/plain)];
                },
            },
        );
        1;

  3. Begin indexing:

     
        ~/swishtest $ ./swish-e -S prog -c swish.conf -v 2
        Indexing Data Source: "External-Program"
        Indexing "./spider.pl"
        ./spider.pl: Reading parameters from 'spider.conf'
        Processing http://swish-e.org/2.2/docs/split/index.html...
        Processing http://swish-e.org/2.2/docs/split/index_long.html...
        Processing http://swish-e.org/2.2/docs/split/search.cgi...
        ...
        2566 unique words indexed.
        5 properties sorted.                                              
        155 files indexed.  609775 total bytes.  49962 total words.
        Elapsed time: 00:00:33 CPU time: 00:00:01
        Indexing done!

  4. Test swish-e from the command line:

     
        ~/swishtest $ ./swish-e -w foo -m 1
        # SWISH format: 2.1-dev-25
        # Search words: foo
        # Number of hits: 18
        # Search time: 0.000 seconds
        # Run time: 0.038 seconds
        1000 http://swish-e.org/2.2/docs/split/SWISH-CONFIG/Document_Contents_Directives.html "SWISH-CONFIG/Document Contents Directives" 57466
        .

  5. Test the CGI script from the command line:

     
        ~/swishtest $ ./swish.cgi | head
        Content-Type: text/html; charset=ISO-8859-1

     
        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
        <html>
            <head>
               <title>
                  Search our site
               </title>
            </head>
            <body>

    Refer to the swish.cgi documentation if you have any problems with running the CGI script.

  6. Configure Apache:

     
        ~/swishtest $ su -c "ln -s $HOME/swishtest /usr/local/apache/htdocs/swishdocs"
        Password: *********

     
        ~/swishtest $ cat .htaccess
        # Deny everything by default
        Deny From All

     
        # But allow just the CGI script
        <files swish.cgi>
           Options ExecCGI
           Allow From All
           SetHandler cgi-script
        </files>

  7. Test from the command line:

    This uses the GET program that is part of the LWP Perl library. You may also test with the "wget" program, for example.

     
        ~/swishtest $ GET 'http://localhost/swishdocs/swish.cgi?query=install' | head
        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
        <html>
            <head>
               <title>
                  43 Results for [install]
               </title>
            </head>
            <body>
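
The same request can also be made from a script. The sketch below assumes the CGI location set up in step 6 and the query parameter used in step 7. The names build_search_url and first_lines are hypothetical helpers, and the ISO-8859-1 decoding follows the Content-Type header shown in step 5. Only the URL building runs here, because fetching needs the local Apache server from this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the script is reachable at the location configured in step 6.
CGI_URL = "http://localhost/swishdocs/swish.cgi"


def build_search_url(words, base=CGI_URL):
    """Return the CGI URL with the search words safely encoded."""
    return base + "?" + urlencode({"query": words})


def first_lines(url, count=10):
    """Fetch url and return its first lines, much like `GET url | head`."""
    with urlopen(url) as response:
        text = response.read().decode("iso-8859-1")
    return text.splitlines()[:count]


print(build_search_url("install"))
print(build_search_url("install guide"))  # the space is encoded as "+"
```

With the server from this example running, first_lines(build_search_url("install")) should return the start of the results page shown above.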

Now you are ready to search.
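
Before a long crawl it can help to check which URLs the spider callbacks from step 2 would accept. The following is only a Python mirror of the test_url and test_response logic in spider.conf, not code that spider.pl runs. The function names should_fetch and should_index are hypothetical, and the dots in the path pattern are escaped here so that they match literally.

```python
import re
from urllib.parse import urlsplit

# Patterns modelled on the Perl callbacks in spider.conf.
NON_TEXT_RE = re.compile(r"\.(?:gif|jpeg|png)$", re.IGNORECASE)
SPLIT_DOCS_RE = re.compile(r"/2\.2/docs/split/")
TEXT_TYPE_RE = re.compile(r"(?:text/html|text/plain)")


def should_fetch(url):
    """Mirror of test_url: would the spider request this URL?"""
    path = urlsplit(url).path
    if NON_TEXT_RE.search(path):
        return False  # probably an image, skip the request
    if not SPLIT_DOCS_RE.search(path):
        return False  # outside the /2.2/docs/split/ path
    return True


def should_index(content_type):
    """Mirror of test_response: only text/html or text/plain is indexed."""
    return bool(TEXT_TYPE_RE.search(content_type))


for url in (
    "http://swish-e.org/2.2/docs/split/index.html",
    "http://swish-e.org/2.2/docs/split/logo.PNG",
    "http://swish-e.org/2.2/docs/INSTALL.html",
):
    print(url, should_fetch(url))
```

The first URL passes both path checks, while the other two are rejected by the image check and the path limit respectively.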

[ TOC ]


Document Info

$Id: INSTALL.pod,v 1.22 2002/09/11 00:54:08 whmoseley Exp $

[ TOC ]