Welcome. Clara OCR is a free OCR program, written for systems
supporting the C library and the X Window System. Clara OCR is
intended for the cooperative OCR of books. Some screenshots are
available at http://www.claraocr.org/.
This documentation is extracted automatically from the comments
of the Clara OCR source code. It is known as "The Clara OCR
Tutorial". There is also an advanced manual known as "The Clara
OCR Advanced User's Manual" (man page clara-adv(1), also
available in HTML format). Developers must read "The Clara OCR
Developer's Guide" (man page clara-dev(1), also available in HTML
format).
This section is a tutorial on the basic OCR features offered by
Clara OCR. Clara OCR is not simple to use: a basic knowledge of
how it works is required. Most complex features are not covered
by this tutorial. If you need to compile Clara from the source
code, read the INSTALL file and check (if necessary) the
compilation hints in the Clara OCR Advanced User's Manual.
So let's try it. Of course we need a scanned page to do so. Clara
OCR requires pages in the PBM or PGM graphic formats (TIFF and
other formats must be converted; the netpbm package contains
various conversion tools). The Clara distribution package contains
one small PBM file that you can use for a first test. The name of
this file is imre.pbm. If you cannot locate it, download it or
other files from http://www.claraocr.org/. Alternatively, you can
produce your own 600-dpi PBM or PGM files by scanning any printed
document (hints for scanning pages and converting them to PBM are
given in the section "Scanning books" of the Clara OCR Advanced
User's Manual).
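Before loading a file, it can be handy to confirm that it really is
a PBM or PGM bitmap. The sketch below is not part of Clara OCR; the
function name netpbm_kind is ours. It relies only on the two-byte
magic number that netpbm files start with (P1 or P4 for PBM, P2 or
P5 for PGM).

```python
# Illustrative helper (not part of Clara OCR): classify a file by the
# netpbm magic number stored in its first two bytes.
MAGIC = {
    b"P1": "PBM (plain)",
    b"P4": "PBM (raw)",
    b"P2": "PGM (plain)",
    b"P5": "PGM (raw)",
}


def netpbm_kind(path):
    """Return a description of the format, or None if the file is
    neither PBM nor PGM (for instance a TIFF that must be converted)."""
    with open(path, "rb") as f:
        return MAGIC.get(f.read(2))
```

A file for which netpbm_kind returns None would have to be converted
(for instance with the netpbm tools) before Clara can load it.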
Once you have a PBM or PGM file to try, cd to the directory where
the file resides and fire up Clara. Example:
$ cd /tmp/clara
$ clara &
To run OCR tests, Clara needs to write files to that directory,
so write permission and some free space are required.
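The two requirements above can be checked in advance. This is a
minimal sketch, not something Clara does for you; the function name
can_work_in and the default threshold of 10 MB are arbitrary
assumptions (the tutorial does not say how much space Clara needs).

```python
import os
import shutil


def can_work_in(directory, min_free_bytes=10 * 1024 * 1024):
    """Return True if `directory` is writable by the current user and
    its filesystem has at least `min_free_bytes` bytes free.
    The 10 MB default is an arbitrary illustrative value."""
    writable = os.access(directory, os.W_OK)
    free = shutil.disk_usage(directory).free
    return writable and free >= min_free_bytes
```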
Obs. As of version 0.9.9, the Clara OCR heuristics are tuned
to handle 600 dpi bitmaps. When using a different resolution,
inform Clara of it using the -y switch.
Then a window with menus and buttons will appear on your X
display:
+-----------------------------------------------+
| File Edit OCR ...                             |
+-----------------------------------------------+
| +--------+   +----+ +--------+ +-------+      |
| |  zoom  |   |page| |patterns| | tune  |      |
| +--------+ +-+    +-+        +-+       +-+    |
| +--------+ | +-------------------------+ |    |
| |  zone  | | |                         | |    |
| +--------+ | |                         | |    |
| +--------+ | |                         | |    |
| |  OCR   | | |       WELCOME TO        | |    |
| +--------+ | |                         | |    |
| +--------+ | |    C L A R A   O C R    | |    |
| |  stop  | | |                         | |    |
| +--------+ | |                         | |    |
|     .      | |                         | |    |
|     .      | |                         | |    |
|            | |                         | |    |
|            | |                         | |    |
|            | +-------------------------+ |    |
|            +-----------------------------+    |
|                                               |
| (status line)                                 |
+-----------------------------------------------+
Welcome aboard! The rectangle with the welcome message is called
"the plate". As you already guessed, the small rectangles with
the labels "zoom", "OCR", "stop", etc, are "the buttons". The
"tabs" are those flaps labelled "page", "patterns"
and "tune". On the menu bar you'll find the File menu, the Edit
menu, and so on. Pop up the "Options" menu and change the current
font size for better visualization, if required.
Press "L" to read the GPL, or select the "page" tab and then
select, on the plate, the imre.pbm page (or any other PBM or PGM
file, if there is one). The OCR will load that file, showing the
progress of the operation on the status line at the bottom of
the window.
Note: the "page" tab is the flap labelled "page". It is
unrelated to the "tab" key.
When the load operation completes, Clara will display the
page. Press the OCR button and wait a bit. The letters will
become grayed and the plate will split into three windows. Move
the pointer along the plate and you'll see the tab label follow
the current window: "page", "page (output)" or "page
(symbol)". Move the pointer along the entire application window,
and, for most components, you'll see a short context help message
on the status line when the pointer reaches it (the buttons, for
instance). Dialogs (user confirmations) also use the status line
(like Emacs), instead of dialog boxes.
You can resize either the Clara application window or each of the
three windows currently on the plate ("page", "page (output)" and
"page (symbol)"). To resize the windows, select any point between
two of them and drag the mouse. The scrollbars can be hidden
(use the "hide scrollbars" item on the View menu).
When the tab label is "page", press the "zoom" button using
mouse button 1 and the scanned image will zoom out. If you use
mouse button 3, the image will zoom in (the behaviour of the
"zoom" button depends on the current window).
Now try selecting the "page" tab several times, and you will
cycle through the various display modes shared by this tab. These
modes are referred to as "PAGE", "PAGE (fatbits)" and
"PAGE (list)". Each display mode may have one or more windows.
We've chosen this uncommon approach because an excess of tabs
turns them into a useless decoration. The other tabs also
offer various modes, some of which are presented later in this
tutorial.
1.2 A few command-line switches
Besides the -y option mentioned in the last subsection, Clara
accepts many others, documented in the Clara OCR Advanced User's
Manual. For now, of the various ways to start Clara, we'll limit
ourselves to a few examples:
In the first case, Clara is just started. In the second, it will
display a short help and exit.
clara -f path
clara -f path -w workdir
The option -f specifies the relative or absolute path of a scanned
page or of a directory with scanned pages (PBM or PGM files). The
option -w specifies the relative or absolute path of a work
directory (where Clara will create the output and data files).
clara -i -f path -w workdir
clara -b -f path -w workdir
The option -i activates dead keys emulation for the composition
of accents and characters. The -b switch is for batch
processing: Clara will automatically perform one OCR run on the
file given through -f (or on all files found, if it is the
path of a directory) and exit without displaying its window.
Clara will start with the smallest possible window size.
A full reference of command-line switches is given in the section
"Reference of command-line switches" of the Clara OCR Advanced
User's Manual.
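Batch runs are easy to drive from a script. The sketch below only
assembles the switches described in this subsection (-b, -f and -w);
the function names are ours, it assumes the clara binary is on the
PATH, and whether Clara reports failures through its exit status is
not stated in this tutorial.

```python
import subprocess


def clara_batch_command(path, workdir):
    """Build the argument list for one batch OCR run:
    -b (batch), -f (page or directory of pages), -w (work directory)."""
    return ["clara", "-b", "-f", path, "-w", workdir]


def run_batch(path, workdir):
    """Run Clara in batch mode (assumes `clara` is on the PATH).
    check=True raises CalledProcessError on a non-zero exit status."""
    subprocess.run(clara_batch_command(path, workdir), check=True)
```

For example, clara_batch_command("/tmp/clara/imre.pbm", "/tmp/clara")
yields the same command line as "clara -b -f /tmp/clara/imre.pbm -w
/tmp/clara".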
Yes, Clara OCR must be trained. Training is a tedious procedure,
but it's a must for those who need a customizable OCR, apt to
adapt to a perhaps uncommon printing style.
Before training, a process called segmentation must be
performed. Press the right button of the mouse over the OCR
button, select "Segmentation" on the menu that pops up and
wait for the operation to complete.
Now, on the "page" tab, observe the image of the document
presented on the top window. You'll see the symbols greyed,
because the OCR currently does not know their
transliterations. Try to select one symbol using the mouse (click
the mouse button 1 over it). A black elliptic cursor will appear
around that symbol. This cursor is called the "graphic
cursor". You can move the graphic cursor around the document
using the arrow keys.
Now observe the bottom window on the "page" tab. That window
presents some detailed information on the current symbol (that
one identified by the graphic cursor). When the "show web clip"
option on the "View" menu is selected, a clip of the document
around the current symbol is displayed too. In some cases, this
clip is useful for better visualization. It is called "web clip"
because this same image is exported to the Clara OCR web
interface when cooperative training and revision through the
Internet is being performed.
To inform the OCR about the transliteration of one symbol, just
type the corresponding key. For instance, if the current symbol
is a letter "a", just type the "a" key. Observe that the trained
symbol becomes black. Each symbol trained will be learned by the
OCR, its bitmap will be called a "pattern", and it will be used
as such when trying to deduce the transliteration of unknown
symbols.
Obs. In our test, the user chose the symbol to be trained. However,
Clara OCR can choose by itself the symbols to be trained. This feature
is called "build the bookfont automatically" (found on the "tune"
tab). To use it, select the corresponding checkbox and classify the
symbols as explained later.
Finally, when the transliteration cannot be entered with a
single keystroke or composition (for instance when you wish to
give a TeX macro as the transliteration of the current
symbol), write down the transliteration using the text input
field on the bottom window (select it with the mouse first).
Before going further, it's important to know how to save your
work. The "File" menu contains one item labelled "save
session". When selected, it will create or overwrite three files
in the working directory: "patterns", "acts" and "page.session",
where "page" is the name of the file currently loaded, without
the "pbm" or "pgm" extension (in our example, "imre"). So, to
remove all data produced by OCR sessions, manually remove the
files "*.session", "patterns" and "acts".
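The cleanup just described can be scripted. This is a minimal
sketch under the file names given above; the function name
remove_session_data is ours, and scanned pages (PBM or PGM files)
in the same directory are left untouched.

```python
import glob
import os


def remove_session_data(workdir):
    """Delete "*.session", "patterns" and "acts" from `workdir`.
    Returns the sorted names of the files actually removed."""
    targets = glob.glob(os.path.join(workdir, "*.session"))
    targets += [os.path.join(workdir, name) for name in ("patterns", "acts")]
    removed = []
    for path in targets:
        if os.path.isfile(path):  # skip files that do not exist
            os.remove(path)
            removed.append(os.path.basename(path))
    return sorted(removed)
```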
Note that the files "patterns" and "acts" are shared by all PBM
or PGM pages, so a symbol trained from one page is reused on the
other pages. The ".session" files, however, are per-page. Pages
with the same graphic characteristics, and only those, must be put
in the same directory, in order to share the same patterns.
When the "quit" option of the "File" menu is selected, the OCR
prompts the user to save the session (answer by pressing the key
"y" or "n"), unless there are no unsaved changes.
The OCR process is divided into various steps, for instance
"classification", "build", etc. These steps are accessible by
clicking mouse button 3 over the OCR button. Each one can be
started independently and/or repeated at any moment. In fact, the
more you know about these steps, the better you'll use them.
Clicking the "OCR" button with mouse button 1 starts all steps
in sequence. The "OCR" button remains in the "selected" state
while some step is running.
Although we won't cover these steps in detail in this tutorial,
a basic knowledge of what each step performs is required for
fine-tuning Clara OCR. Tuning is an interactive effort in which
the use of the heuristics alternates with training and revision,
guided by the user's experience and judgment.
After training some symbols, we're ready to apply the just
acquired knowledge to deduce the transliteration of non-trained
symbols. For that, Clara OCR will compare the non-trained symbols
with those trained ("patterns"). Clara OCR offers nice visual
modes to present the comparison of each symbol with each
pattern. To activate the visual modes, enter the View menu and
select (for instance) the "show comparisons" option.
Now start the "classification" step (click the mouse button 3
over the OCR button and select the "classification" item) and
observe what happens. Depending on your hardware and on the size
of the document, this operation may take long to complete
(e.g. 5 minutes). Hopefully it'll be much faster (say, 30
seconds).
When the classification finishes, observe that some non-trained
symbols became black. Each such symbol was found similar to some
pattern. Select one black symbol, and Clara will draw a gray
ellipse around each class member (except the selected symbol,
identified by the black graphic cursor). You can switch off this
feature by unselecting the "Show current class" item on the "View"
menu.
In some cases, Clara will classify incorrectly some symbols. For
instance, a defective "e" may be classified as "c". If that
happens, you can inform Clara about the correct transliteration
of that symbol training it as explained before (in this example,
select the symbol and press "e"). This action will remove that
symbol from its current class, and will define a new class,
currently unitary and containing just that symbol.
1.7 Note about how Clara OCR classification works
The usual meaning of "classification" for OCRs is to deduce, for
each symbol, whether it is a letter "a", a letter "b", a digit
"1", etc. As the total number of different symbols is small (some
tens), there will be a small number of classes.
However, instead of classifying each symbol as being the letter
"a", or the digit "1", or whatever, Clara OCR builds classes of
symbols with similar shapes, not necessarily assigning a
transliteration to each symbol. Since the bitmap comparison
heuristics sometimes consider two true letters "a" dissimilar
(due to printing differences or defects), the Clara OCR
classifier will break the set of all letters "a" into various
untransliterated subclasses.
Therefore, the classification result may be a much larger number
of classes (thousands or more), not only because of those small
differences or defects, but also because the classification
heuristics are currently unable to scale symbols or to "boldfy"
or "italicize" a symbol.
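The effect described above can be seen in a toy model. The sketch
below is NOT Clara's classifier: it represents each bitmap as a set
of (row, column) pixels, uses a plain pixel-difference count as a
stand-in for the real comparison heuristics, and groups symbols
greedily. The names distance and classify and the threshold
parameter are all assumptions made for illustration.

```python
def distance(a, b):
    """Number of pixels set in exactly one of the two bitmaps
    (each bitmap is a set of (row, col) pairs)."""
    return len(a ^ b)


def classify(symbols, threshold):
    """Greedy shape classification: a symbol joins the first class whose
    representative differs by at most `threshold` pixels; otherwise it
    starts a new class. No transliteration is involved at all."""
    classes = []  # lists of symbol indexes; the first index is the representative
    for i, bitmap in enumerate(symbols):
        for members in classes:
            if distance(symbols[members[0]], bitmap) <= threshold:
                members.append(i)
                break
        else:
            classes.append([i])
    return classes


# Three instances of the "same" letter: the second has one extra pixel,
# the third is badly printed and shares only one pixel with the first.
good = {(0, 0), (0, 1), (1, 0), (1, 1)}
noisy = good | {(2, 2)}
defective = {(0, 0), (5, 5), (6, 6), (7, 7)}
print(classify([good, noisy, defective], threshold=2))
```

With a threshold of 2 the defective instance ends up alone in a
second class, which would have to be trained separately; a looser
threshold would merge it but would also risk mixing different
letters.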
Note that each untransliterated subclass of letters "a" depends
on a specific human revision effort to become transliterated
(trained). This is not an absurd strategy, because the revision
of each subclass corresponds to part of the unavoidable human
revision effort required by any real-life digitization
project. This is one of the principles that make it possible to
see Clara OCR not as a traditional OCR, but as a productivity
tool able to reduce costs. Anyway, we expect future improvements
to the Clara OCR classifier to lessen the number of subclasses
created.
Now we're ready to build the OCR output. Just start the
"build" step. The action performed will be basically
to detect text words and lines, and output the transliterations,
trained or deduced, of all symbols. The output will be presented
on the "PAGE (output)" window.
Each character on the "PAGE (output)" window behaves like an
HTML hyperlink. Click it to select the current symbol both
on the "PAGE" window and on the "PAGE (symbol)" window. Note
that the transliteration of unknown symbols is replaced by
their internal IDs (for instance "[133]").
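The substitution rule for unknown symbols can be sketched as
follows. This is an illustration only: the function name
build_output and the (id, transliteration) pairs are assumptions,
and the word and line detection performed by the real "build" step
is left out.

```python
def build_output(symbols):
    """`symbols` is a list of (internal_id, transliteration) pairs in
    reading order; transliteration is None for unknown symbols."""
    parts = []
    for symbol_id, transliteration in symbols:
        if transliteration is None:
            # Unknown symbol: show its internal ID between brackets.
            parts.append("[%d]" % symbol_id)
        else:
            parts.append(transliteration)
    return "".join(parts)
```

For instance, if symbol 133 is the only untrained one among four
symbols, build_output([(131, "c"), (132, "a"), (133, None),
(134, "a")]) gives "ca[133]a".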
The result of the word detection heuristic can be visualized
by checking the "show words" item on the "View" menu.
1.9 Handling broken symbols
Obs. As of version 0.9.9 the merging heuristics are only
partially implemented, and in most cases they won't produce any effect.
The build heuristics also try to merge the pieces of broken
symbols, such as the "u", the "h" and the "E" in the figure
(observe the absent pixels). Some letters have thin parts, and
depending on the paper and printing quality, these parts will
break more or less frequently.
               XXX          XXXXXXXXXXX
                XX            XXX     X
                XX            XXX
                XX            XXX
XXX   XXX       XX XXX        XXX    X
 XX    XX       XXX   X       XXX XXXX
 XX    XX       XX    XX      XXX    X
 XX    XX       XX    XX      XXX
 XX    XX       XX    XX      XXX
 XX    XX       XX    XX      XXX     X
 XX    XXXX    XXXX  XXX    XXXXXXXXXXX
Clara OCR offers three symbol merging heuristics:
geometric-based, recognition-based and learned. Each one may be
activated or deactivated using the "tune" tab.
Geometric merging applies to fragments on the interior of the
symbol bounding box, like the "E" on the figure, and to some other
cases too.
The recognition merging searches unrecognized
symbols and, for each one, tries to merge it with some
neighbour(s), and checks if the result becomes similar to some
pattern.
Finally, learned merging will try to reproduce the
cases trained by the user. To train merging, just select the
symbol using the mouse button 1
(say, the left part of the "u" on the figure), click the mouse
button 3 on the fragment (the right part of the "u"), and select
the "merge with current symbol" entry. On the other hand, the
"disassemble" entry may be used to break a symbol into its
components.
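The idea behind recognition-based merging can be sketched in a few
lines. This is a toy model, not Clara's implementation: bitmaps are
sets of (row, column) pixels already expressed in the same
coordinates as the patterns, similarity is a plain pixel-difference
count, and the name try_merge and its threshold parameter are
assumptions.

```python
def try_merge(fragment, neighbour, patterns, threshold):
    """Merge two fragments and compare the result with each pattern.
    `patterns` is a list of (transliteration, bitmap) pairs. Returns the
    transliteration of the first pattern that differs from the merged
    bitmap by at most `threshold` pixels, or None if there is none."""
    merged = fragment | neighbour  # union of the two pixel sets
    for transliteration, pattern in patterns:
        if len(merged ^ pattern) <= threshold:
            return transliteration
    return None


# A tiny "u": two stems joined at the bottom row.
u = {(r, 0) for r in range(4)} | {(r, 3) for r in range(4)} | {(3, 1), (3, 2)}
left = {(r, 0) for r in range(4)}              # left stem only
right = {(r, 3) for r in range(4)} | {(3, 2)}  # right stem, pixel (3, 1) absent

print(try_merge(left, right, [("u", u)], threshold=1))   # merged pieces match
print(try_merge(left, set(), [("u", u)], threshold=1))   # a lone stem does not
```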
Obs. Do not merge the "i" dot with the "i" stem. See the
subsection "Handling accents" for details.
1.10 Handling accents
Now let's talk about accents.
As a general rule, Clara OCR does not consider accents as parts
of letters, so merging does not apply to accents. Accents are
considered individual symbols, and must be trained
separately. The "i" dot is handled as an accent. Clara OCR will
compose accents with the corresponding letters when generating
the output. The exception is when the accent is graphically
joined to the letter:
                           XXX
      XX                  XXX
     XX                  XX
                        XX
   XXXX                XXXX
 XX    XX            XX    XX
XX      XX          XX      XX
XXXXXXXXXX          XXXXXXXXXX
XX                  XX
XX                  XX
 XX    XX            XX    XX
   XXXX                XXXX
In the figure we have two samples of the letter "e" with an acute
accent. In the first one, the accent is graphically separated
from the letter. So the accent transliteration will be trained or
deduced as being "'", and the letter transliteration
will be trained or deduced as being "e". When generating the output,
Clara OCR will compose them as the macro "\'e" (or as the ISO
character 233, as soon as we provide this alternative behaviour).
In the second case the accent isn't graphically separable from
the letter, so we'll need to train the accented character as the
corresponding ISO character (code 233) or as the macro "\'e". As
the generation of accented characters depends on the local X
settings, the "Emulate deadkeys" item on the "Options" menu may
be useful in this case. It enables the composition of accents
and letters performed directly by Clara OCR (like the Emacs
iso-accents-mode feature).
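The two composition targets mentioned above (a TeX macro or an ISO
character code) can be illustrated as follows. This is a sketch, not
Clara's code: the ACCENTS table and the function names are ours, and
the ISO composition is obtained here through Unicode normalization,
which is simply a convenient way to find the precomposed character.

```python
import unicodedata

# Illustrative table: accent transliteration -> (TeX accent macro,
# Unicode combining character used to look up the composed form).
ACCENTS = {
    "'": ("\\'", "\u0301"),  # acute
    "`": ("\\`", "\u0300"),  # grave
    "^": ("\\^", "\u0302"),  # circumflex
    "~": ("\\~", "\u0303"),  # tilde
}


def compose_tex(accent, letter):
    """Compose an accent and a letter as a TeX macro (acute + e -> \\'e)."""
    return ACCENTS[accent][0] + letter


def compose_latin1(accent, letter):
    """Return the ISO 8859-1 code of the accented letter, or None when
    no single precomposed character exists in that range."""
    composed = unicodedata.normalize("NFC", letter + ACCENTS[accent][1])
    if len(composed) == 1 and ord(composed) < 256:
        return ord(composed)
    return None
```

With these definitions, compose_tex("'", "e") gives the macro \'e
and compose_latin1("'", "e") gives 233, the two forms cited in the
text.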
1.11 Browsing the book font
As explained earlier, trained symbols become patterns (unless you
mark them "bad"). The collection of all patterns is called the
"book font" (the term "book" is to distinguish it from the GUI
font). Clara OCR stores all patterns in the "patterns" file in the
work directory when the "save session" entry on the "File" menu
is selected.
Clara OCR itself can choose the patterns and populate the book
font. To do so, just select the "Build the font automatically"
item on the "tune" tab, and classify the symbols.
To browse the patterns, click the "patterns" tab one or more times
to enter the "PATTERN (list)" window. The "PATTERN (list)" mode
displays the bitmap and the properties
of each pattern in a (perhaps very long) form.
Click the "zoom" button to
adjust the size of the pattern bitmaps. Use the scrollbar or
the Next (Page Down) or Previous (Page Up) keys to navigate. Use
the sort options on the "Edit" menu to change the presentation order.
Now press the "patterns" tab again to reach the "PATTERN" window. It
presents the "current" pattern with detailed properties. Try
activating the "show web clip" option on the "View" menu to
visualize the pattern context. The left and
right arrows move to the previous and to the next patterns. To
train the current pattern (the one shown on the "PATTERN" window),
just press the key corresponding to its transliteration (Clara will
automatically move to the next pattern) or fill in the
input field. There is no need to press ENTER to submit the input
field contents.
If the GUI becomes trashed or blank, press C-l to redraw it.
For now, the GUI does not support cut-and-paste. To save to a file
the contents of the "PAGE (list)" window, use the "Write report"
item on the "File" menu.
The "OCR" button will enter the "pressed" state in some unexpected
situations, like during dialogs. This behaviour will be fixed
soon.
The "STOP" button does not immediately stop the OCR operation in
progress (e.g. classification). Clara OCR only stops the operation
in progress at "safe" points, where all data structures are
consistent.
The OCR output is automatically saved to the file page.html (or
page.txt if the option -o was used), where "page" is the name of
the currently loaded page, without the "pbm" or "pgm" extension.
This file is created by the "generate output" step on the menu that
appears when mouse button 3 is pressed over the OCR button.
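The naming rule for the output file can be written down as a small
function. This is only an illustration of the rule stated above;
the name output_file and its text_output flag (standing for the -o
option) are ours.

```python
import os


def output_file(page_file, text_output=False):
    """Map a page file such as "imre.pbm" to the output file name:
    "imre.html" by default, "imre.txt" when text_output is True
    (the flag models the -o option mentioned in the text)."""
    base, ext = os.path.splitext(os.path.basename(page_file))
    if ext not in (".pbm", ".pgm"):
        raise ValueError("expected a PBM or PGM file: %r" % page_file)
    return base + (".txt" if text_output else ".html")
```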
Some OCR steps are currently unfinished and perform no
action at all.
Clara OCR "fun codes" are similar to videogame "codes" (for those
who have never heard about that, videogame "codes" are special
sequences of mouse or key clicks that make your player
invulnerable, or obtain maximum energy, or perform an unexpected
action, etc).
The difference is that Clara OCR "fun codes" are not secret
(videogame "codes" are normally secret and very hard to discover
by chance). Clara OCR contains no secret feature. Fun codes are
intended to be used during public presentations. For now there is
only one fun code: just click one or more times the banner on the
welcome window to make it scroll.
Clara OCR is free software. Its source code is distributed under
the terms of the GNU GPL (General Public License), and is
available at http://www.claraocr.org/. If you don't know what the GPL is,
please read it and check the GPL FAQ at
http://www.gnu.org/copyleft/gpl-faq.html. You should have
received a copy of the GNU General Public License along with this
software; if not, write to the Free Software Foundation, Inc., 59
Temple Place - Suite 330, Boston, MA 02111-1307, USA. The Free
Software Foundation can be found at http://www.fsf.org.
Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati
wrote the internal preprocessor. Clara OCR includes bugfixes
produced by other developers. The Changelog
(http://www.claraocr.org/CHANGELOG) acknowledges all of them (see
below). Imre Simon contributed high-volume tests, discussions
with experts, selection of bibliographic resources, publicity
and many ideas on how to make the software more useful.
Ricardo authored various free materials, some included (at least)
in Conectiva, Debian, FreeBSD and SuSE (the verb conjugator
"conjugue", the ispell dictionary br.ispell and the proxy
axw3). He recently ported the EiC interpreter to the Psion 5
handheld and patched the Xt-based vncviewer to scale framebuffers
and compute image diffs. Ricardo works as an independent
developer and instructor. He received no financial aid to develop
Clara OCR. He's not an employee of any company or organization.
Imre Simon promotes the usage and development of free
technologies and information from his research, teaching and
administrative labour at the University.
Roberto Hirata Junior and Marcelo Marcilio Silva contributed
ideas on character isolation and recognition. Richard Stallman
suggested improvements on how to generate HTML output. Marius
Vollmer is helping to add Guile support. Jacques Le Marois helped
with the announcement process. We acknowledge Mike O'Donnell and
Junior Barrera for their good criticism. We acknowledge Peter
Lyman for his remarks about the Berkeley Digital Library, and
Wanderley Antonio Cavassin, Janos Simon and Roberto Marcondes
Cesar Junior for some web and bibliographic pointers. Bruno
Barbieri Gnecco provided hints and explanations about GOCR (main
author: Jorg Schulenburg). Luis Jose Cearra Zabala (author of
OCRE) is kindly supporting our attempts to use portions of his
code. Adriano Nagelschmidt Rodrigues and Carlos Juiti Watanabe
carefully tried the tutorial before the first announcement.
Eduardo Marcel Macan
packaged Clara OCR for Debian and suggested some
improvements. Mandrakesoft is hosting claraocr.org. We
acknowledge Conectiva and SuSE for providing copies of their
outstanding distributions. Finally, we acknowledge the late Jose
Hugo de Oliveira Bussab for his interest in our work.
The fonts used by the "view alphabet map" feature came from
Roman Czyborra's "The ISO 8859 Alphabet Soup" page at
http://czyborra.com/charsets/iso8859.html.
The names cited by the CHANGELOG and not cited before follow
(small patches, bug reports, specfiles, suggestions,
explanations, etc).
Brian G.,
Bruce Momjian,
Charles Davant (server admin),
Daniel Merigoux,
De Clarke,
Emile Snider (preprocessor, to be released),
Erich Mueller,
groggy,
Harold van Oostrom,
Ho Chak Hung,
Jeroen Ruigrok,
Laurent-jan,
Nathalie Vielmas,
Romeu Mantovani Jr (packager),
Ron Young,
R P Herrold,
Sergei Andrievskii,
Stuart Yeates,
Terran Melconian,
Thomas Klausner (packager),
Tim McNerney,
Tyler Akins.