
This is the Clara OCR glossary. It's somewhat specific to Clara
OCR. The entries that do not name an author were written by
Ricardo Ueda Karpischek. Send new entries or suggestions to
claraocr@claraocr.org. This glossary is part of the Clara OCR
documentation. Clara OCR is distributed under the terms of the
GNU GPL.
a well defined procedure. The term "algorithm" is usually
reserved for procedures whose properties can be assured,
generally through a rigorous mathematical proof. For instance,
the procedure learned by children to multiply two numbers from
their multi-digit decimal representations is an algorithm (see
heuristic).
the conversion from color or grayscale (PGM) to
black-and-white. The Clara OCR classification heuristics
currently available require black-and-white input, so when the
input is grayscale (PGM), Clara OCR needs to convert it to
black-and-white before OCR. Note that to binarize an image, some
choice must be made on how to map colors or graylevels to either
black or white. More importantly, the OCR results depend
strongly on that choice.
The Clara OCR documentation tries to use the term "bitmap" to
mean only rectangular, black-and-white digital images. Grayscale
rectangular digital images are called "graymaps" (see also
pixel).
any method intended to decide if two given bitmaps are
similar. Clara OCR implements three such methods: skeleton
fitting, border mapping and pixel distance.
the line formed by the bitmap black pixels that have white
neighbours. Note that the definition of "neighbour" may
vary. Clara OCR generally considers that the neighbours of one
pixel are all 8 pixels contiguous to it (top left, top, top
right, left, right, bottom left, bottom, bottom right).
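
The border definition above can be illustrated with a short sketch.
It is not taken from the Clara OCR sources: the function name
border_pixels, the list-of-rows representation (1 for black, 0 for
white) and the choice to treat positions outside the bitmap as white
are all assumptions made for this example.

```python
def border_pixels(bitmap):
    """Return the set of (row, col) black pixels with a white neighbour.

    bitmap is a list of rows; 1 means black and 0 means white.
    Positions outside the bitmap count as white (an assumption).
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    # The 8 neighbours: top left, top, top right, left, right,
    # bottom left, bottom, bottom right.
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1),           (0, 1),
               (1, -1),  (1, 0),  (1, 1)]

    def is_white(r, c):
        inside = 0 <= r < height and 0 <= c < width
        return not inside or bitmap[r][c] == 0

    border = set()
    for r in range(height):
        for c in range(width):
            if bitmap[r][c] == 1 and any(
                is_white(r + dr, c + dc) for dr, dc in offsets
            ):
                border.add((r, c))
    return border


# A 3x3 black square centred in a 5x5 white bitmap.
square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
square_border = border_pixels(square)
```

In this example only the central pixel (2, 2) is surrounded by black
pixels on all 8 sides, so it is the only black pixel left out of the
border.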
a bitmap comparison technique that builds a mapping from the
border pixels of one bitmap to the border pixels of another
bitmap. If this mapping is found to satisfy certain mathematical
properties, the bitmaps are considered similar.
Cooperative Lightweight Recognizer. "Clara" is also a personal
name: Clara (Latin, Portuguese, Spanish), "Chiara" (Italian),
Claire (English).
the process that recognizes a given bitmap as being the letter
"a" or the digit "5", etc. Instead of saying that the bitmap was
"recognized" as a letter "a", it's common to say that it was
"classified" as a letter "a". All Clara OCR classification
methods are currently based on bitmap comparison techniques.
see dpi.
the number of bits available to store the color of each pixel.
Black-and-white images have depth 1. Graymaps usually have depth
8 (256 graylevels). The larger the depth, the larger the
amount of disk or RAM space required to store a digital image.
For instance, an image of size 100x100 and depth 8 requires
100*100*8 = 80000 bits = 10000 bytes to be stored.
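
The storage arithmetic can be checked with a small sketch. The
function name image_storage_bytes is hypothetical, and the result is
only the size of the pixel data: file headers and any per-row padding
that a given graphic format may add are ignored here.

```python
def image_storage_bytes(width, height, depth):
    """Bytes needed to hold width*height pixels of `depth` bits each."""
    bits = width * height * depth
    # Round up to a whole number of bytes (8 bits per byte).
    return (bits + 7) // 8


graymap_bytes = image_storage_bytes(100, 100, 8)  # 80000 bits
bitmap_bytes = image_storage_bytes(100, 100, 1)   # 10000 bits
```

For the 100x100 example, depth 8 gives 80000 bits, that is 10000
bytes, while depth 1 gives 10000 bits, that is 1250 bytes.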
see pixel.
dots-per-inch. A measure of linear image density. Example:
scanning an A4 (210x297mm) page at 300 dpi results in an image of
size about 2480x3508 (remember that 1 inch equals 25.4 millimeters). In
most cases, all relevant visual details from printed characters
can be conveniently captured at 600dpi (in some cases, 300dpi
suffices). Some file formats, like TIFF or JPEG, include density
information. Others, like PBM, PGM or PPM, don't. So when
converting from TIFF to PGM, remember that the density
information is dropped. So if, for instance, you ask SANE to scan
a page creating a TIFF file, and subsequently convert it to PPM,
and from PPM to TIFF again, the last file will not be equal to
the first one. Density information is usually irrelevant when
displaying images on the computer monitor, because in this case a
1-1 mapping between image pixels and display pixels is
assumed. However, density information is quite important when
printing an image on paper, or when performing OCR. Clara OCR
expects to be informed explicitly about the image density
(default 600 dpi).
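
The relation between paper size, density and image size can be
sketched as follows. The names pixels_for_length and scan_size are
hypothetical, and rounding to the nearest pixel is an assumption:
a real scanner or driver may round differently, so its result can
differ by one pixel.

```python
MM_PER_INCH = 25.4


def pixels_for_length(length_mm, dpi):
    """Pixels covering length_mm at the given density (nearest pixel)."""
    return round(length_mm / MM_PER_INCH * dpi)


def scan_size(width_mm, height_mm, dpi):
    """Return (width, height) in pixels of a page scanned at dpi."""
    return (pixels_for_length(width_mm, dpi),
            pixels_for_length(height_mm, dpi))


a4_at_300 = scan_size(210, 297, 300)
a4_at_600 = scan_size(210, 297, 600)
```

Doubling the density doubles each linear dimension, so the number of
pixels (and the storage required) grows roughly by a factor of four.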
a rule that assigns, for each given element, another element, in
a unique fashion. For instance, the equation y = x+1 defines a
function that assigns to each number x the number x+1. A 2d
digital image may be seen as a function that assigns to each dot,
given by its horizontal and vertical coordinates, a color
("black", "white", "green", etc). Functions are also called
"mappings".
A standardised way to store the color of each pixel from a digital
image in a disk file. The graphic format may include other
information, like density and image annotations. Some graphic
formats include a provision to compress the data. In some cases,
this compression, if used, may change the color of some pixels
or regions to colors close to the original ones, but different.
So using some graphic formats may cause data loss.
Examples of graphic formats are TIFF, JPEG, GIF, BMP, PNM, etc.
see bitmap.
a procedure whose properties are not assured. Heuristics are
generally the expression of some more or less vague feeling, or a
naive, initial approach to a complex problem. If a heuristic can
be proven to satisfy some interesting property, then it can be
referred to as an algorithm (with regard to that property). Some
experts say that OCR is an engineering field, not a mathematical
field. Perhaps we can express this same idea by saying that, by its
own nature, OCR is a field where nothing other than heuristics can
be stated.
As a digital image is usually a rectangular matrix of pixels, its
size in pixels can be conveniently described by giving the rectangle
width and height, usually in the form WxH. For instance, a 200x100
image is a rectangle of pixels having width 200 and height 100.
see function.
Optical Character Recognition. Some people find it hard to
understand properly what OCR is, due to a lack of knowledge
of how computers store and process text and image data. Most
users think of OCR as a required step before editing and
spell-checking documents obtained from the scanner (which is not
wrong, though).
a scanned document. The Clara OCR documentation tries to avoid
using terms like "document", "image" or "file" to signify a
scanned document. "Page" is used instead.
in the Clara OCR context, it's a letter, digit or accent
instance, used to classify the page symbols through bitmap
comparison. Clara OCR builds a set of patterns based on manual
training or automatic selection, and uses it to classify all page
symbols.
each one of the individual dots that compose a digital image
(quite frequently, the term "pixel" is used to refer only to the
non-white dots of an image). A digital image is usually a
rectangular matrix of dots. To each one it's possible to assign
one of many available colors, in order to form an image. If the
available colors are only "black" and "white", the image thus
formed is a "black-and-white image". As the representation of one
of two possible values may be done using a bit, and the
assignment of geometrically well positioned dots to colors may be
seen as a function or mapping, a black-and-white image is also
called a "bitmap". Similarly, if the colors available are only
gray levels, usually from 0 (black) to 255 (white), then the
image is a "grayscale image" or a graymap, and a generic
assignment of pixels to colors is called a "pixmap".
a bitmap comparison technique that builds a mapping from all
pixels of one bitmap to the pixels of another bitmap. If this
mapping is found to satisfy certain mathematical properties, the
bitmaps are considered similar.
see pixel.
see PNM.
see PNM.
Portable aNyMap. PNM is a generic reference to the graphic file
formats PBM, PGM and PPM defined by Jef Poskanzer. In other
words, to say that a program supports PNM means that it handles
PBM, PGM and PPM. PBM (Portable BitMap) files are black-and-white
images, 1 bit per pixel. PGM (Portable GrayMap) files are
grayscale images, 8 bits per pixel. PPM (Portable PixMap) files
are color images, 24 bits per pixel. Currently Clara OCR accepts
PBM and PGM files only. A scanned page stored in some format
other than PBM or PGM can be converted to PBM or PGM using the
netpbm tools, ImageMagick or others. PNM files may be "raw" or
"plain". The plain versions are rarely used. Clara OCR supports
neither plain PBM nor plain PGM.
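
To make the raw format more concrete, here is a minimal sketch of a
reader for raw PGM data. It is not the Clara OCR reader: the function
name read_raw_pgm is hypothetical, and the sketch only handles 8-bit
graymaps. It relies on the PGM header layout: the magic number "P5"
for raw files, then width, height and the maximum gray value as
decimal text, separated by whitespace and optional "#" comments,
followed by exactly one whitespace byte and the pixel bytes.

```python
def read_raw_pgm(data):
    """Parse raw PGM bytes; return (width, height, maxval, pixels)."""
    pos = 0

    def next_token():
        nonlocal pos
        # Skip whitespace and '#' comments between header fields.
        while True:
            while pos < len(data) and data[pos:pos + 1].isspace():
                pos += 1
            if data[pos:pos + 1] == b"#":
                while pos < len(data) and data[pos:pos + 1] != b"\n":
                    pos += 1
            else:
                break
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        return data[start:pos]

    if next_token() != b"P5":
        raise ValueError("not a raw PGM file (magic number P5 expected)")
    width = int(next_token())
    height = int(next_token())
    maxval = int(next_token())
    if maxval > 255:
        raise ValueError("this sketch handles only 8-bit graymaps")
    pos += 1  # one whitespace byte separates the header from the pixels
    pixels = data[pos:pos + width * height]
    if len(pixels) != width * height:
        raise ValueError("truncated pixel data")
    return width, height, maxval, pixels


# A 3x2 graymap built in memory, with a comment line in the header.
sample = b"P5\n# tiny example\n3 2\n255\n" + bytes([0, 128, 255, 10, 20, 30])
width, height, maxval, pixels = read_raw_pgm(sample)
```

A plain PGM file (magic number "P2") stores the gray levels as
decimal text instead of bytes, so this sketch rejects it.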
see PNM.
this term is used throughout the Clara OCR documentation to refer
either to the image size (for instance: 640x480 pixels) or to the image
density (for instance: 300 pixels per inch).
ideally, it's a minimal structural bitmap. From an algorithmic
standpoint, the skeleton of a symbol is the bitmap obtained by
clearing a number of its peripheral pixels whose removal does
not destroy the symbol shape.
a bitmap comparison technique that decides that two given bitmaps
are similar if and only if the skeleton of each one fits into the
other.
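
The "fits into" test can be sketched under strong simplifying
assumptions. This is an illustration only, not the Clara OCR
implementation: the skeletons are supplied by the caller rather than
computed, both bitmaps are assumed to have the same size and to be
already aligned, and "fits" is read as "every black pixel of the
skeleton falls on a black pixel of the other bitmap". The names
fits_into and similar_by_skeleton are hypothetical.

```python
def fits_into(skeleton, bitmap):
    """True if every black pixel of skeleton lies on a black pixel of bitmap."""
    return all(
        bitmap[r][c] == 1
        for r, row in enumerate(skeleton)
        for c, value in enumerate(row)
        if value == 1
    )


def similar_by_skeleton(bitmap_a, skeleton_a, bitmap_b, skeleton_b):
    """Similar if and only if each skeleton fits into the other bitmap."""
    return fits_into(skeleton_a, bitmap_b) and fits_into(skeleton_b, bitmap_a)


# Two thick vertical bars sharing the middle column, and a horizontal bar.
# The skeletons below are hand-picked for the example (1 = black).
bar_a = [[0, 1, 1], [0, 1, 1], [0, 1, 1]]
bar_b = [[1, 1, 0], [1, 1, 0], [1, 1, 0]]
vertical_skeleton = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
horizontal_bar = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]

bars_match = similar_by_skeleton(bar_a, vertical_skeleton,
                                 bar_b, vertical_skeleton)
bar_vs_dash = similar_by_skeleton(bar_a, vertical_skeleton,
                                  horizontal_bar, horizontal_bar)
```

Here the two vertical bars are considered similar even though their
pixels differ, while the vertical and the horizontal bar are not,
because neither skeleton fits into the other bitmap.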
an instance of a letter or digit in a page. So if the word
"classical" occurs in a page, all its letters ("c", "l", "a",
"s", "s", "i", "c", "a", "l") are individual symbols. At the
source code level, things that are neither letters nor digits are
sometimes called symbols (for instance, pieces of broken symbols,
dots, accents, noise, etc).
a simple binarization method. It decides to map each pixel from a
graymap to either black or white simply by testing if its gray level
is smaller or larger than a given threshold. So, if the threshold
is, say, 171, then all gray levels from 0 to 170 are mapped to 0
(black) and all graylevels from 171 to 255 are mapped to 255
(white). The thresholding is said to be global if one fixed
(per-page) binarization threshold is used to decide the mapping
of all page pixels. The thresholding is said to be local if the
threshold is allowed to vary along the page, due to irregular
printing intensity.
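
Global thresholding as described above can be sketched in a few
lines. The function name threshold_global and the list-of-rows
graymap representation are assumptions of this example, not part of
Clara OCR.

```python
def threshold_global(graymap, threshold):
    """Map each gray level to 0 (black) or 255 (white).

    Levels smaller than threshold become black; all others become white.
    The same threshold is applied to every pixel of the page.
    """
    return [
        [0 if level < threshold else 255 for level in row]
        for row in graymap
    ]


row_of_levels = [[0, 170, 171, 255]]
binarized = threshold_global(row_of_levels, 171)
```

With the threshold 171 used in the text, the levels 0 and 170 become
black and the levels 171 and 255 become white. A local method would
use the same comparison but choose a different threshold for each
region of the page.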
the low-level, standard X Window System library. It offers
basic graphic primitives, similar to others found on most graphic
environments, like "draw line", "draw pixel", "get next event",
etc, as well as services more specific to the X Window System way of
doing things, like "connect to an X display", properties
(resources) handling, etc. The Xlib does not include facilities
to create menus, buttons, etc. Application programs usually take
these facilities from "toolkits" like Xt, GTK, Qt and
others. Clara OCR creates the few facilities it needs using
the Xlib primitives.