IExtract
========

Utility to extract the description out of certain file-types

Copyright (C) 2002 - 2007 Markus Schwab

-----------------------------------------------------------------------

At the moment the following documents are supported (unless disabled
while configuring):

  - PNG: PNG images can contain text-chunks with keyword-value pairs.
    The content of the keywords "Title", "Author" and "Description"
    is extracted.
  - GIF: GIF images can contain a "comment extension". The contents of
    these block(s) are extracted. There are no title or author entries.
  - JPEG: The program recognizes three types of comments:
      - After a JPEG comment marker (0xfffe)
      - In an APP1-Exif marker (as used by Windows XP)
      - In an APPD marker (as written by Photoshop)
  - HTML: The text in between <title> and </title> and/or the contents
    of the meta tags (supported are the tags according to HTML 4.0 and
    Dublin Core) are extracted.
  - PDF: The content of the "document information dictionary" is
    extracted (the content of the "Subject" key is returned as
    comment). Note that encrypted information is not decrypted!
  - OpenOffice documents (*.sxw, *.sxc, *.sxd, *.sdi, *.sxm): The
    entries of the properties dialog are extracted.
  - OpenOffice 2 documents (*.odt, *.ods, *.odp, *.odg): The entries of
    the properties dialog are extracted.
  - StarOffice documents (*.sdw, *.sdc, *.sdd, *.sda): The entries of
    the properties dialog are extracted.
  - MS Office documents (*.doc, *.xls, *.ppt): The entries of the
    properties dialog are extracted. Thanks to the Apache Jakarta POI
    project for publishing their insights about the MS Office
    format(s) (see http://jakarta.apache.org/poi/index.html for
    details).
  - MP3 files: Extracts the contents of the ID3 tag.
  - Ogg Vorbis files: Extracts the contents of the comment header.
  - RTF documents: Extracts the contents of the \info section.
  - Abiword documents (*.abw): Extracts the contents of the property
    dialog.
  - Other types can easily be added by the use of plugins (see below).

Feel free to report or send me any documents that do not work, or to
suggest other file types which you would like to be supported.

The results found can be written in one of the following formats:

  - (human-readable) text (separated by spaces),
  - quoted comma-separated text (to be machine-interpreted, e.g.
    imported into a database or a spreadsheet),
  - HTML (table),
  - XML (defaults to XHTML) or
  - LaTeX (tabular).

Note that some extra formatting of the text might be done (like
checking for LaTeX or HTML special characters).

The extraction of the description can be performed with threads. Note
that this can actually cause the program to be slower, if the thread
searching for files doesn't find enough files to be processed. And it
definitely does *not* speed things up on single-processor systems!


Plugins
-------

IExtract can use plugins to support further file-formats. A plugin is
a shared library (DLL), which must contain a function called
"processFile" that extracts the description from a file. If the type
of a file is determined by its contents, the plugin must also contain
a function called "getFileType" that checks the file.

An example can be found in src/Plugins/Text.cpp.

These shared libraries are added by a "Handler"-section in an
INI-file:

    [Handler]
    txt=libText
    db=libDB


Installation
------------

See the file INSTALL.


Windows
-------

Windows is supported in principle, though because of missing standards
this is not that easy (at least if you have a different setup than
me). See the (end of the) file INSTALL.


Documentation
-------------

Can be found in the doc subdirectory (in HTML-format).


How to report bugs and/or send patches
--------------------------------------

Bug reports and patches should be sent to the e-mail address of the
author (g17m0@lycos.com). Also feel free to send comments.
If you report a bug, please be sure to add anything which might be of
use, such as:

  - The version of the utility.
  - The version of the used libYGP library.
  - How to reproduce the bug; a file provoking it would be great.
  - In case of a crash there should also be a stack dump in your
    system log, which might help in localising the bug.
  - Anything else you think might be helpful.
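

Example: reading a JPEG comment
-------------------------------

To illustrate the first JPEG case listed under the supported documents
(text after the JPEG comment marker 0xfffe), here is a minimal C++
sketch. It is not taken from the IExtract sources and does not claim to
show how IExtract does it. The function name firstJpegComment is
hypothetical, and the sketch assumes that the whole file is already in
memory and that every marker before the image data carries a length
field.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of IExtract): stores the text of the
// first JPEG comment (COM, 0xFFFE) segment in `comment`.
// Returns false if the data is not a JPEG stream or has no comment.
bool firstJpegComment(const std::vector<unsigned char>& data,
                      std::string& comment) {
    // A JPEG stream starts with the SOI marker 0xFFD8.
    if (data.size() < 4 || data[0] != 0xFF || data[1] != 0xD8)
        return false;

    std::size_t pos = 2;
    while (pos + 4 <= data.size()) {
        if (data[pos] != 0xFF)
            return false;                 // not positioned on a marker
        const unsigned char marker = data[pos + 1];
        if (marker == 0xFF) {             // fill byte before a marker
            ++pos;
            continue;
        }
        if (marker == 0xD9 || marker == 0xDA)
            return false;                 // end of image or start of scan

        // The 2-byte length is big-endian and includes its own two bytes.
        const std::size_t len =
            (static_cast<std::size_t>(data[pos + 2]) << 8) | data[pos + 3];
        if (len < 2 || pos + 2 + len > data.size())
            return false;                 // truncated or corrupt segment

        if (marker == 0xFE) {             // COM segment: payload is the text
            comment.assign(data.begin() + static_cast<std::ptrdiff_t>(pos + 4),
                           data.begin() + static_cast<std::ptrdiff_t>(pos + 2 + len));
            return true;
        }
        pos += 2 + len;                   // skip to the next segment
    }
    return false;
}
```

The sketch stops at the start of the image data (0xFFDA), so a comment
stored after the scan data would be missed. Comments in APP1-Exif or
APPD markers need format-specific parsing and are not covered here.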