How indexer walks through hypertext links
=========================================

When indexer is about to insert a new URL into the database, or to index an existing one, it first checks whether the URL has a corresponding Server record in indexer.conf. To find the Server command that corresponds to a URL, indexer compares the first bytes of the URL with the start URL given as the argument of each Server command. During startup indexer sorts all servers by the length of their start URLs, so the longest one is found first.

This scheme makes it possible to give different parameters to, for example, a whole server and one of its subsections. Imagine a server with a subdirectory that contains news articles. Those articles should be reindexed more often than the rest of the server. The following combination is useful in such a case:

    Period 600000
    Server http://www/

    Period 200000
    Server http://www/news/

These commands give the /news/ subdirectory a different reindexing period than the rest of the server. indexer will choose the second Server record for http://www/news/page1.html because the sort order places that record first in memory.

There are actually three different types of indexer behavior when it decides whether to index a URL or not.

1) Default rules

By default indexer follows only those links that have a corresponding Server command in the indexer.conf file. It also jumps between servers if both of them are listed in indexer.conf. For example, imagine that we have two Server commands:

    Server http://www/
    Server http://web/

When indexing http://www/page1.html indexer WILL follow the link http://web/page2.html if it finds one. Note that these pages are on different servers, but BOTH of them have a corresponding Server record. If one of the Server commands is deleted, indexer will remove all expired URLs of that server during the next reindexing.

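The Server lookup described earlier in this section (sort by start URL length, then take the first record whose start URL is a prefix of the URL) can be sketched in a few lines of Python. This is only an illustration of the idea and not indexer's actual code: the function names sort_servers and find_server and the dictionary layout are hypothetical.

```python
def sort_servers(servers):
    """Order Server records so that the longest start URL comes first."""
    return sorted(servers, key=lambda s: len(s["url"]), reverse=True)


def find_server(url, sorted_servers):
    """Return the first record whose start URL is a prefix of url, or None."""
    for server in sorted_servers:
        if url.startswith(server["url"]):
            return server
    return None


# The two Server records from the Period example in the text.
servers = sort_servers([
    {"url": "http://www/", "period": 600000},
    {"url": "http://www/news/", "period": 200000},
])

news = find_server("http://www/news/page1.html", servers)
other = find_server("http://www/about.html", servers)
print(news["period"])   # 200000, the more specific /news/ record wins
print(other["period"])  # 600000, only the whole-server record matches
```

Because the longer start URL is tried first, the /news/ record shadows the whole-server record for every URL below http://www/news/, and a URL that matches no record at all yields None.
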
2) Using "FollowOutside yes"

The first way to change the default behavior described above is to use the "FollowOutside yes" command in indexer.conf. indexer will then follow ANY URL it finds and will jump between different servers. Theoretically, it will index the whole Internet in this case if there are no hardware limits :-)

When the "FollowOutside yes" command is specified, indexer simply adds one server record with an empty start URL to memory while loading indexer.conf. According to the sort order, this empty server is found only when no other Server record with a longer start URL matches.

3) Using "DeleteNoServer no"

The second way to change the default behavior is to use the "DeleteNoServer no" command. This command means that URLs which are already in the database will not be deleted even if they have no corresponding Server command. "DeleteNoServer no" is implemented by adding one empty server, just like "FollowOutside yes". The difference between the two commands is that with "DeleteNoServer no" indexer follows ONLY links INSIDE a server and does not jump between different servers. Imagine this sequence of commands:

    DeleteNoServer no
    Server http://www/
    Server http://web/

When indexing http://www/page1.html indexer WILL follow the link http://www/page2.html but WILL NOT follow http://web/page2.html, because http://www/page1.html and http://web/page2.html are on different servers.

4) Using "indexer -f"

The third scheme is very useful when running "indexer -i -f url.txt". You can maintain the list of required servers in url.txt. When a new URL is added to url.txt, indexer will index the server of that URL during the next startup. It does not matter whether you pass the root URL of the server (http://www/) or one of its internal pages (http://www/path/to/some/page.html): indexer will index the whole server http://www/. Note that if you delete a URL from the list in url.txt while using the scheme with "DeleteNoServer no", indexer WILL NOT delete the URLs of that server.
Imagine that you have removed http://www/ from url.txt. To remove all URLs of this server from the database you'll have to run "indexer -C -u http://www/%".
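
The link-following rules of the three schemes can be summarized in a small Python model. It is a rough sketch of the behavior described in this document, not indexer's implementation: the function names, the mode strings, and the assumption that "same server" means the same scheme and host are all hypothetical.

```python
from urllib.parse import urlsplit


def same_server(url_a, url_b):
    """Assumption: two URLs are on one server if scheme and host match."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)


def should_follow(source_url, target_url, servers, mode="default"):
    """Decide whether a link found on source_url should be followed.

    servers is the list of start URLs from the Server commands.
    mode is a hypothetical label for one of the three schemes:
    "default", "follow_outside" or "delete_no_server_no".
    """
    if mode == "follow_outside":
        # "FollowOutside yes": any URL found is followed.
        return True
    if mode == "delete_no_server_no":
        # "DeleteNoServer no": only links inside the same server.
        return same_server(source_url, target_url)
    # Default rules: the target needs a matching Server command.
    return any(target_url.startswith(start) for start in servers)


servers = ["http://www/", "http://web/"]
page = "http://www/page1.html"

# Default: jumps to http://web/ because both servers are listed.
print(should_follow(page, "http://web/page2.html", servers))
# "DeleteNoServer no": the same link is not followed.
print(should_follow(page, "http://web/page2.html", servers,
                    "delete_no_server_no"))
```

The model leaves out how the schemes interact when both commands are given at once, since the text does not describe that case.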