Character sets
==============

Supported character sets
------------------------

mnoGoSearch supports the following character sets:

  Cyrillic group:        koi8-r, windows-1251, iso-8859-5, cp866, x-mac-cyrillic
  Western group:         iso-8859-1
  Central Europe group:  windows-1250, iso-8859-2
  Arabic group:          windows-1256
  Greek group:           windows-1253, iso-8859-7
  Hebrew group:          iso-8859-8, windows-1255
  Baltic group:          iso-8859-4, iso-8859-13, windows-1257
  Turkish group:         iso-8859-9, windows-1254

Recoding
--------

indexer recodes all documents to the character set specified in the
"LocalCharset" indexer.conf command. Recoding is available only within a
character set group. It is currently implemented for the "Cyrillic",
"Central Europe", "Greek" and "Baltic" groups. The Hebrew iso-8859-8 and
windows-1255 character sets are letter-compatible, as are the Turkish
iso-8859-9 and windows-1254 character sets, i.e. no recoding is required
for the Hebrew and Turkish character sets.

indexer never recodes between character sets from different groups, for
example from Cyrillic koi8-r into Western iso-8859-1.

Character set aliases
---------------------

Web servers can return the same charset in different notations. For
example, iso-8859-2, iso8859-2 and latin2 all name the same charset. The
search engine understands the following charset name aliases:

1. Aliases for all ISO charsets (using iso-8859-2 as an example):
   iso-8859-2, iso8859-2, iso8859.2, iso-8859.2, iso_8859-2:1988,
   iso_8859-2, iso_8859.2

2. Aliases for all MS charsets (using windows-1250 as an example):
   windows-1250, cp-1250, cp1250, windows1250, x-cp1250

3. Aliases for Cyrillic koi8-r:
   koi8-r, koi8r, koi-8-r, koi8, koi-8, koi

4. Aliases for x-mac-cyrillic:
   x-mac-cyrillic, mac

5. Aliases for DOS cp-866 Cyrillic:
   cp-866, cp866, csibm866, 866, ibm866, x-cp866, x-ibm866, alt

6. Aliases for some latin character sets:
   latin1 for iso-8859-1
   latin2 for iso-8859-2
   latin4 for iso-8859-4
   latin5 for iso-8859-9
   latin7 for iso-8859-13

Document charset detection
--------------------------

indexer detects the document character set in this order:

  1) "Content-type: text/html; charset=xxx"
  2)
  3) Defaults from the "Charset" indexer.conf command (user preferences)

Automatic charset guesser
-------------------------

There is also an automatic Cyrillic charset guesser, which is not compiled
by default. You can activate it with the "--with-charset-guesser"
configure argument. If the automatic character set guesser was built at
installation time, the three detection methods above are used only when
automatic guessing fails.

Default Language
----------------

You can set the default language for Servers with the DefaultLang
indexer.conf variable. This is useful when restricting a search by URL
language.
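
As an illustration only, the charset alias lists and the within-group recoding rule described earlier in this document can be sketched in Python. This is not mnoGoSearch code: the names GROUPS, canonical() and recode_action() are hypothetical, and two points are assumptions the text does not state, namely that names are matched case-insensitively and that the ":1988" year suffix generalizes to any four-digit year.

```python
import re

# Character set groups, copied from the list of supported character sets.
GROUPS = {
    "Cyrillic": ["koi8-r", "windows-1251", "iso-8859-5", "cp866", "x-mac-cyrillic"],
    "Western": ["iso-8859-1"],
    "Central Europe": ["windows-1250", "iso-8859-2"],
    "Arabic": ["windows-1256"],
    "Greek": ["windows-1253", "iso-8859-7"],
    "Hebrew": ["iso-8859-8", "windows-1255"],
    "Baltic": ["iso-8859-4", "iso-8859-13", "windows-1257"],
    "Turkish": ["iso-8859-9", "windows-1254"],
}
CHARSET_TO_GROUP = {cs: group for group, members in GROUPS.items() for cs in members}

# Groups for which the text says recoding is implemented.
RECODING_GROUPS = {"Cyrillic", "Central Europe", "Greek", "Baltic"}

# Aliases that the text enumerates one by one.
FIXED_ALIASES = {}
for target, aliases in {
    "koi8-r": ["koi8-r", "koi8r", "koi-8-r", "koi8", "koi-8", "koi"],
    "x-mac-cyrillic": ["x-mac-cyrillic", "mac"],
    "cp866": ["cp-866", "cp866", "csibm866", "866", "ibm866",
              "x-cp866", "x-ibm866", "alt"],
    "iso-8859-1": ["latin1"],
    "iso-8859-2": ["latin2"],
    "iso-8859-4": ["latin4"],
    "iso-8859-9": ["latin5"],
    "iso-8859-13": ["latin7"],
}.items():
    for alias in aliases:
        FIXED_ALIASES[alias] = target

# Patterns generalized from the iso-8859-2 and windows-1250 examples.
# Assumption: any four-digit year may follow the colon.
ISO_RE = re.compile(r"^iso[-_]?8859[-.](\d+)(?::\d{4})?$")
MS_RE = re.compile(r"^(?:windows-?|cp-?|x-cp)(\d{4})$")


def canonical(name):
    """Return the canonical supported charset name, or None if unknown."""
    key = name.strip().lower()  # assumption: matching is case-insensitive
    candidate = FIXED_ALIASES.get(key)
    if candidate is None:
        match = ISO_RE.match(key)
        if match:
            candidate = "iso-8859-" + match.group(1)
        else:
            match = MS_RE.match(key)
            if match:
                candidate = "windows-" + match.group(1)
    # Only charsets from the supported list count as known.
    return candidate if candidate in CHARSET_TO_GROUP else None


def recode_action(document_charset, local_charset):
    """Say what the within-group rule implies for a document/LocalCharset pair."""
    src, dst = canonical(document_charset), canonical(local_charset)
    if src is None or dst is None:
        return "unknown charset"
    if src == dst:
        return "no recoding needed"
    group = CHARSET_TO_GROUP[src]
    if group != CHARSET_TO_GROUP[dst]:
        return "not supported"  # never recoded across groups
    if group in RECODING_GROUPS:
        return "recode"
    return "no recoding needed"  # Hebrew and Turkish are letter-compatible


if __name__ == "__main__":
    print(canonical("iso_8859-2:1988"))
    print(recode_action("koi8", "cp1251"))
    print(recode_action("koi8-r", "latin1"))
```

In this sketch a charset outside the supported list, such as a windows code page the document does not mention, resolves to None rather than being guessed at.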