clara-dev

Language: en

Version: 110736 (mandriva - 01/05/08)

Section: 1 (User commands)

NAME

clara - a cooperative OCR

SYNOPSIS

clara [options]

DESCRIPTION

Welcome. Clara OCR is a free OCR, written for systems supporting the C library and the X Window System. Clara OCR is intended for the cooperative OCR of books. There are some screenshots available at http://www.claraocr.org/.

This documentation is extracted automatically from the comments of the Clara OCR source code. It is known as "The Clara OCR Developer's Guide". It's currently unfinished. First-time users are invited to read "The Clara OCR Tutorial". There is also an advanced manual known as "The Clara OCR Advanced User's Manual".

CONTENTS

1. Introducing the source code


    1.1 Language and environment
    1.2 Modularization
    1.3 The memory allocator
    1.4 Security notes
    1.5 Runtime index checking
    1.6 Background operation
    1.7 Global variables
    1.8 Path variables
    1.9 Bitmaps
    1.10 Execution model
    1.11 Return codes

2. Internal representation of pages


    2.1 Closures
    2.2 Symbols
    2.3 The sdesc structure and the mc array
    2.4 The preferred symbols
    2.5 Font size
    2.6 Symbol alignment
    2.7 Words and lines
    2.8 Acts and transliterations
    2.9 Symbol transliterations
    2.10 Transliteration preference
    2.11 Transliteration class computing
    2.12 The zones

3. Heuristics


    3.1 Skeleton pixels
    3.2 Symbol pairing
    3.3 The build step
    3.4 Resetting
    3.5 Synchronization
    3.6 The function list_cl

4. The GUI


    4.1 Main characteristics
    4.2 Geometry of the application window
    4.3 Geometry of windows
    4.4 Scrollbars
    4.5 Displaying bitmaps
    4.6 HTML windows overview
    4.7 Graphic elements
    4.8 XML support
    4.9 Auto-submission of forms

5. The Clara API


    5.1 Redraw flags
    5.2 OCR statuses
    5.3 The function setview
    5.4 The function redraw (to be written)
    5.5 The function show_hint
    5.6 The function start_ocr

6. How to change the source code (examples)


    6.1 How to add a bitmap comparison method
    6.2 How to write a bitmap comparison function
    6.3 How to add an application button

7. Bugs and TODO list

8. AVAILABILITY

9. CREDITS

1. Introducing the source code

This Guide is a collection of entry points to the Clara OCR source code. Some notes explain specific details of how this or that feature was implemented. Others are higher-level descriptions of how an entire subsystem works.

1.1 Language and environment

Clara OCR is written in ANSI C (with some GNU extensions) and requires the services of the C library and Xlib. Development uses 32-bit Intel GNU/Linux (various distributions), GCC, GNU Make, Bash, XFree86 and Perl 5 (required for producing the documentation).

1.2 Modularization

The Clara source code started as a single file named clara.c. At some point we divided it into smaller pieces. Currently there are 18 files:


  book.c     .. Documentation only
  build.c    .. The function build
  clara.c    .. Startup and OCR run control
  cml.c      .. ClaraML dumper and recover
  common.h   .. Common declarations
  consist.c  .. Consistency tests
  event.c    .. GUI initialization and event handler
  gui.h      .. Declarations that depend on X11
  html.c     .. HTML generation and parse
  pattern.c  .. Book font stuff
  pbm2cl.c   .. Import PBM
  pgmblock.c .. grayscale loading and blockfinding
  preproc.c  .. internal preprocessor
  redraw.c   .. The function redraw
  revision.c .. Revision procedures
  skel.c     .. Skeleton computation
  symbol.c   .. Symbol stuff
  welcome.c  .. Welcome stuff

Throughout this document we refer not to these files but to identifiers (names of functions and variables).

Note that there are only two headers: common.h and gui.h. It's complex to maintain one header for each module. Most functions are not prototyped, but we intend to prototype all of them in the near future.

1.3 The memory allocator

Clara OCR relies on the memory allocator both for allocating or resizing some large blocks used by the main data structures and for allocating a large number of small arrays. Currently Clara OCR does not include or use a special memory allocator, but implements an interface to realloc. The alloca function is also used in some places in the code, generally to allocate buffers for sorting arrays.

The interface is the function c_realloc. The function c_free must be used to free the blocks allocated or resized by c_realloc. In the near future, c_realloc will build a list of the currently allocated blocks, their sizes and some bits more in order to help to trace flaws.

1.4 Security notes

Concerning security, the following criteria are used:

1. String operations are generally performed using services that accept a size parameter, like snprintf or strncpy, except when the code itself is simple and guarantees that an overflow won't occur.

2. The CGI clara.pl invokes write privileges through sclara, a program specially written to perform only a small set of simple operations required for the operation of the Clara OCR web interface.

The following should be done:

1. Memory blocks should be cleared before calling free().

1.5 Runtime index checking

A naive support for runtime index checking is provided through the macro checkidx. This checking is performed only if the code is compiled with the macro MEMCHECK defined and the command-line switch

In fact, only those points in the source code where the macro checkidx is explicitly used will perform index checking. We've added calls to checkidx in some critical functions because of their complexity, or because segfaults were already detected there.

1.6 Background operation

Clara OCR decides at runtime if the GUI will be used or not. So even when using Clara OCR in batch mode (-b command-line switch), linking with the X libraries is required.

When the -b command-line switch is used, Clara OCR just won't make calls to X services. The source code tests the flag "batch_mode" before calling X services. So it won't create the application window on the X display, and automatically starts a full OCR operation on all pages found (as if the "OCR" button was pressed with the "work on all pages" option selected). Upon completion, Clara OCR will exit.

1.7 Global variables

Clara OCR uses a lot of global variables. Large data structures, flags, paths, etc., are stored in global variables. In some cases we use a naming strategy to make the code more readable. The important cases are:

a. The main data structures of Clara OCR are global arrays that grow as required. The following convention was created for the names associated with these arrays:


    structure    type    array    top    size
   --------------------------------------------
    act          adesc   act      topa   actsz
    closure      cldesc  cl       topcl  clsz
    symbol       sdesc   mc       tops   ssz
    pattern      pdesc   pattern  topp   psz
    link         ldesc   lk       toplk  lksz
    ptype        ptdesc  pt       toppt  ptsz

The "top" is the last used entry (initial value -1). The "size" is the total size of the allocated memory block for that array (initial value 0). So the relation (top < size) must always be true.
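
The top/size convention can be sketched as follows. This is only an illustration under stated assumptions, not the Clara OCR code: the element type item, the names arr, top and sz, the helper push_item and the growth policy are all hypothetical, and plain realloc stands in for c_realloc.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical element type standing in for adesc, cldesc, sdesc, etc. */
typedef struct { int value; } item;

static item *arr = NULL; /* the array (plays the role of act, cl, mc, ...) */
static int top = -1;     /* last used entry, initial value -1 */
static int sz = 0;       /* allocated entries, initial value 0 */

/* Append one entry, growing the block when it is full.
   Returns the index of the new entry, or -1 if memory is exhausted. */
static int push_item(int value)
{
    if (top + 1 >= sz) {
        int newsz = (sz == 0) ? 16 : 2 * sz;   /* assumed growth policy */
        item *p = realloc(arr, (size_t)newsz * sizeof *p);
        if (p == NULL)
            return -1;
        arr = p;
        sz = newsz;
    }
    arr[++top].value = value;
    assert(top < sz);    /* the invariant stated in the text */
    return top;
}
```

Because entries are only appended, the index returned by push_item stays valid even after the block is moved by realloc, which is why indexes rather than pointers are the natural way to refer to entries of such arrays.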

b. Menus are referred to by their registration indexes. These indexes are stored in variables named CM_X. The registration indexes of menu items are stored in variables named CM_X_SOMETHING (all uppercase). If the item has an associated flag, the flag is named cm_x_something (all lowercase).

1.8 Path variables

Most path variables are computed from the path given through the -f command-line option. The variable "pagename" is the filename of the PBM image of the page being processed, not including any path specified through the -f switch. For instance, if the OCR is started with


    clara -f mydocs/test.pbm

Then the value of the variable "pagename" will be just "test.pbm". The variable pagebase is pagename without the suffix ".pbm" ("test", in the example).

Clara stores on the variable pagelist the null-separated list of all names of pbm files found on this directory. Even in this case, the variable pagename will store the filename of the page being processed (at any moment Clara will be processing one and only one page).

The directory that contains the pbm files that Clara will process is stored on the variable pagesdir. In the example above, the value of the variable pagesdir is "mydocs/".
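
The derivation of pagesdir, pagename and pagebase from the -f argument can be sketched as below. The helper split_page_path and its signature are hypothetical; the real code may compute these variables differently, and the behaviour for an argument without a directory part (empty pagesdir) is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Split the argument of -f into directory, file name and base name.
   Each output buffer must hold at least n bytes. Hypothetical helper. */
static void split_page_path(const char *arg, char *pagesdir,
                            char *pagename, char *pagebase, size_t n)
{
    const char *slash = strrchr(arg, '/');
    const char *name = (slash != NULL) ? slash + 1 : arg;
    size_t dirlen = (size_t)(name - arg);
    size_t baselen = strlen(name);

    assert(n > 0);

    /* pagesdir keeps the trailing slash, e.g. "mydocs/" */
    snprintf(pagesdir, n, "%.*s", (int)dirlen, arg);
    snprintf(pagename, n, "%s", name);

    /* pagebase is pagename without a trailing ".pbm" */
    if (baselen >= 4 && strcmp(name + baselen - 4, ".pbm") == 0)
        baselen -= 4;
    snprintf(pagebase, n, "%.*s", (int)baselen, name);
}
```

Called with "mydocs/test.pbm", this yields "mydocs/", "test.pbm" and "test", matching the example in the text.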

The variable workdir stores the path of the directory where Clara will create the files *.html, *.session, "patterns" and "acts". This path is assumed to be equal to pagesdir, unless another path is given through the -w switch. The variable doubtsdir will be the concatenation of workdir with the string "doubts/" (doubtsdir is ignored if -W is not used).

1.9 Bitmaps

Clara stores bitmaps in a linear array of bytes, following closely the raw pbm format. The first line of a bitmap with width w is stored in the first (w/8)+((w%8)!=0) bytes of the array. The remaining bits of the last byte of the line (if any) are left blank. The second line is stored in the following bytes, and so on. The leftmost bit in each byte is the most significant one (black, or "on", is 1, and white, or "off", is 0). An example follows:


       76543210765432
      +--------------+
      |              | 00000000 00000000 =  0   0
      |   XX XXXX    | 00011011 11000000 = 27 192
      |    XX   XX   | 00001100 01100000 = 12  96
      |    XX   XX   | 00001100 01100000 = 12  96
      |    XX   XX   | 00001100 01100000 = 12  96
      |    XX   XX   | 00001100 01100000 = 12  96
      |              | 00000000 00000000 =  0   0
      +--------------+


      stored as: 0 0 27 192 12 96 12 96 12 96 12 96 0 0

Note that the array of bytes that encodes one bitmap does not contain the bitmap width nor the height. So bitmaps must be stored together with other data. This is done by structures where the bitmap is one field and the geometric information is stored on other fields. There are two such structures: bdesc and cldesc.
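
The row layout described above can be sketched with two small helpers. The names row_bytes and get_pixel are hypothetical and not taken from the Clara sources; width and pixel coordinates are passed explicitly because the byte array itself carries no geometry.

```c
#include <assert.h>

/* Bytes used by one row of a bitmap of width w (the formula from the text). */
static int row_bytes(int w)
{
    return (w / 8) + ((w % 8) != 0);
}

/* Return 1 if pixel (x, y) is black, 0 if it is white.
   The leftmost pixel of each byte is its most significant bit. */
static int get_pixel(const unsigned char *bm, int w, int x, int y)
{
    const unsigned char *row;

    assert(x >= 0 && x < w && y >= 0);
    row = bm + (long)y * row_bytes(w);
    return (row[x / 8] >> (7 - (x % 8))) & 1;
}
```

For the 14-pixel-wide example, row_bytes(14) is 2, so line y starts at byte 2*y of the array.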

1.10 Execution model

Clara does not use multiple threads. Instead, to allow the GUI to refresh the application window while an OCR run is in progress, the main function alternates calls to xevents() to receive input and to continue_ocr() to perform OCR. As the OCR operations may take a long time to complete, a very simple model was implemented to allow the OCR services to execute only partially.

Such services (for instance load_page()) accept a "reset" parameter to allow resetting all static data, and they're expected to return 0 when finished, or nonzero otherwise. Therefore, a call to such services must loop until completion. The continue_ocr() calls the OCR steps using this model, and some OCR steps call other services (like load_page()) that implement this model too.
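
A minimal sketch of this model follows. The functions poll_events, count_rows and run_to_completion are hypothetical stand-ins (for xevents(), for a resumable service such as load_page(), and for the main loop); only the calling convention (a reset parameter, 0 when finished, nonzero otherwise) comes from the description above.

```c
#include <assert.h>

static int events_polled = 0;

/* Stand-in for xevents(): here it only counts how often it ran. */
static void poll_events(void) { events_polled++; }

/* A resumable service: does one small slice of work per call.
   Returns 0 when finished, nonzero while work remains. */
static int count_rows(int reset, int nrows)
{
    static int next = 0;   /* progress kept between calls */

    if (reset)
        next = 0;
    assert(nrows >= 0);
    if (next < nrows)
        next++;            /* process one row, then yield */
    return next < nrows;
}

/* Main-loop shape: alternate event handling and one slice of work. */
static void run_to_completion(int nrows)
{
    int pending = count_rows(1, nrows);   /* first call resets */

    while (pending) {
        poll_events();
        pending = count_rows(0, nrows);
    }
}
```

The static variable is what makes the reset parameter necessary: without it, a second run would resume from the state left by the first.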

1.11 Return codes

When Clara OCR exits, the exit code reports the termination status:


  0 clean
  1 data inconsistency
  2 buffer overflow
  3 invalid field
  4 internal error
  5 memory exhausted
  6 X error
  7 I/O error
  8 bad input

2. Internal representation of pages

Even for non-developers, a knowledge of the internal data structures used by Clara OCR is required for fine tuning and to make simple diagnostics.

The basic elements stored are the "closures". Sets of one or more closures are called "symbols". Symbols are arranged in lists forming "words". The words are arranged in lists forming "lines".

2.1 Closures

Closures of black pixels by contiguity are a first attempt to identify the atomic symbols of the document. The name "closure" is of course due to the consideration of the contiguity as a relation (in the mathematical sense of the word). Starting (for instance) from (i,j), we compute the set of black pixels ("X" and "*" in the figure). The limits (l,r,t,b) define the bounding box of the closure.


          l i    r
      +---+-+----+---+
      |              |
    t +   XX XXXX    |
      |    XX   XX   |
    j +    X*   XX   |
      |    XX   XX   |
    b +    XX   XX   |
      |              |
      +--------------+

When loading a document, the OCR computes all its closures and uses an array to store them. When the session file is written, the closures are stored in CML format. Note that, if required, the closures may be recomputed from the document, because the document and the closure computing algorithm determine the index that each closure will have in the array.
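
Computing one closure and its bounding box from a starting pixel can be sketched as a flood fill. This is an illustration only: the names box and closure_at, the fixed image size, and the use of one byte per pixel are hypothetical. It also assumes 8-neighbour contiguity, which is consistent with the bounding box drawn in the figure; the rule actually used must be checked in the source code.

```c
#include <assert.h>

#define W 14
#define H 7

typedef struct { int l, r, t, b, npix; } box;

/* Collect the black pixels connected to (i, j); img[y][x] is 1 for black.
   Visited pixels are marked 2 in place. Assumes 8-neighbour contiguity. */
static box closure_at(unsigned char img[H][W], int i, int j)
{
    int stack[W * H][2], sp = 0;
    box c = { i, i, j, j, 0 };

    assert(i >= 0 && i < W && j >= 0 && j < H && img[j][i] == 1);
    img[j][i] = 2;
    stack[sp][0] = i; stack[sp][1] = j; sp++;

    while (sp > 0) {
        int x, y, dx, dy;

        sp--;
        x = stack[sp][0]; y = stack[sp][1];
        c.npix++;
        if (x < c.l) c.l = x;
        if (x > c.r) c.r = x;
        if (y < c.t) c.t = y;
        if (y > c.b) c.b = y;
        for (dy = -1; dy <= 1; dy++)
            for (dx = -1; dx <= 1; dx++) {
                int nx = x + dx, ny = y + dy;

                if (nx < 0 || nx >= W || ny < 0 || ny >= H) continue;
                if (img[ny][nx] != 1) continue;
                img[ny][nx] = 2;   /* mark when pushed, so no duplicates */
                stack[sp][0] = nx; stack[sp][1] = ny; sp++;
            }
    }
    return c;
}
```

Since every pixel is pushed at most once, a stack of W*H entries is always large enough.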

2.2 Symbols

As one character of the document may be composed of two or more closures (for instance when it's broken), it's convenient to work not with closures, but with sets of closures. So we define the concept of "symbol" as being a set of one or more closures. Initially, the OCR generates one unitary symbol for each closure. Subsequent steps may define new symbols composed of two or more closures.

For instance, let's present three closures that do not correspond to atomic symbols: "a" and "i" linked (one closure) and a broken "u" (two closures). As a principle, Clara OCR does not try to break closures into smaller closures. Instead, the classification heuristic tries to compose various patterns to resolve symbols like the "ai" in the figure. Concerning the "u", the classification heuristic is expected to merge the two closures into one symbol and apply a "u" pattern to resolve it.


            l            r     l r l    r
      +-----+------------+-----+-+-+----+--+
      |                                    |
    t +                XX                  |
      |                XX                  |
      |                                    |
      |      XXXXX    XXX      XXX   XXX   + t
      |     X     XX   XX       XX    XX   |
      |           XX   XX       XX    XX   |
      |      XXXXXXX   XX       XX    XX   |
      |     X     XX   XX       XX    XX   |
      |     X     XX   XX       XX    XX   |
    b +      XXXXX XXXXXXX       XX  XXXX  + b
      |                                    |
      +------------------------------------+

As a principle, Clara OCR won't merge dots and accents into characters. So an "i" will generally be formed by two individual symbols (the dot and the body). The heuristics that build the OCR output are expected to compose these two symbols into one ASC character. The same applies to "j" and the accents (acute, grave, tilde, etc.) found in various European languages.


          l  r
      +---+--+-------+
      |              |
    t +    XX        |
      |    XX        |
      |              |
      |   XXX        |
      |    XX        |
      |    XX        |
      |    XX        |
      |    XX        |
    b +   XXXX       |
      |              |
      +--------------+

2.3 The sdesc structure and the mc array

Each symbol is stored in an sdesc structure. Those structures form the mc array. Once created, a symbol is never deleted. So its index in the mc array identifies it (this is important for the web-based revision procedure). Note that closures and symbols are numbered on a document-related basis. The set of closures that define one symbol never changes. So the symbol bounding box and the total number of black pixels won't change either.

So two different entries of the mc array never have the same set of closures. The entries of the mc array are created by the new_mc service. When some procedure tries to create a new symbol from a list of closures for which a symbol already exists, the service new_mc detects it and returns to the caller not the index of a newly created symbol, but the index of the existing one.

2.4 The preferred symbols

The same closure may belong to more than one symbol. This is important in order to allow various heuristic trials. For instance, the left closure of the "u" in the preceding section could be identified as the body of an "i". In this case however we would not find its dot. So the heuristic could try another solution, for instance to join it with the nearest closure (in that case, the right closure of the "u") and try to match it with some pattern of the font.

So the OCR will need to choose, from all symbols that contain a given closure, the one to be preferred. In fact, Clara OCR dynamically maintains a partition of the set of closures into "preferred" symbols. This is the ps array. Some manual operations, like fragment merging and symbol disassembling (activated by the context menu on the page tab), change that partition dynamically, as do some automatic procedures, like the merge step of the OCR run.

2.5 Font size

The font size is important for classifying all book symbols on pattern "types". For instance, books generally use smaller letters for footnotes. This classification is performed automatically by Clara OCR and presented by the "PATTERN (types)" window.

Clara OCR generally uses millimeters for presenting sizes, but we'll soon express sizes in "points". Let's see an example. One inch corresponds to 72.27 printer's points (pt) (The METAFONTbook, p. 21, note). So at 600 dpi, each pt corresponds to 600/72.27 = 8.3 pixels. For 10 point roman characters, Knuth defines the height of lowercase letters as 155/36 pt, which is 35.7 pixels for us. Therefore, to compute the font size (f) from the height in pixels (h) of one lowercase letter, the formula is f = 10*h/35.7.
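
The arithmetic above generalizes to other resolutions. The following sketch uses a hypothetical function font_size_pt (not a Clara OCR identifier) and takes the resolution as a parameter instead of fixing it at 600 dpi.

```c
#include <assert.h>

/* Printer's points per inch, and the height in points of lowercase
   letters for 10 point roman characters (values quoted in the text). */
#define PT_PER_INCH 72.27
#define XHEIGHT_10PT (155.0 / 36.0)

/* Font size in points from the pixel height h of one lowercase letter
   scanned at the given resolution (dots per inch). */
static double font_size_pt(double h, double dpi)
{
    double px_per_pt, xheight_px;

    assert(dpi > 0.0);
    px_per_pt = dpi / PT_PER_INCH;           /* about 8.3 at 600 dpi */
    xheight_px = XHEIGHT_10PT * px_per_pt;   /* about 35.7 at 600 dpi */
    return 10.0 * h / xheight_px;
}
```

At 600 dpi a lowercase letter 35.7 pixels high gives a font size very close to 10 pt, as expected.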

2.6 Symbol alignment

The vertical alignment of symbols is important for various heuristics. For instance, the vertical line from a broken "p" matches an "l", but using alignment tests we're able to refuse this match.

The current Clara OCR alignment support was developed for the Latin alphabet, and is being adapted for other alphabets. Four vertical alignment positions are considered. Three of them are referred to by their usual names (ascent, baseline and descent). For the fourth we use Knuth's identifier "x_height", the height of lowercase letters without ascenders.


  A XXX                     XXXXXXXXX         
     XX                      XX      X         

     XX                      XX      XX       
     XX                      XX      XX       
  X  XX XXXXX   XX  XXXXX    XX      X      XXXX
     XXX     X   XXX     X   XXXXXXXX     XX    XX
     XX      XX  XX      XX  XX      X   XX      XX
     XX      XX  XX      XX  XX      XX  XXXXXXXXXX
     XX      XX  XX      XX  XX      XX  XX   
     XX      XX  XX      XX  XX      XX  XX   
     XXX     X   XXX     X   XX      X    XX    XX  XX
  B  XX XXXXX    XX XXXXX   XXXXXXXXX       XXXX    XXX
                 XX                                   X
                 XX                                   X
                 XX                                  X
  D             XXXX                          


  A (0) .. ascent (Knuth asc_height)
  X (1) .. x_height
  B (2) .. baseline
  D (3) .. descent (Knuth desc_depth)

So in the figure we say that the alignment of "b" and "B" is 02, the alignment of "p" is 13, the alignment of "e" is 12, and the alignment of the comma is 23. A period has alignment 22. The dot of an "i" and accents have alignment 00. In fact, positions 1 and 2 are usually well defined: all lowercase letters have the same height, and all symbols use the same baseline. However, positions 0 and 3 are not so well defined. For instance, in some printed books "t" and "l" have different heights.

2.7 Words and lines

Clara OCR applies the concept of "symbol" to atomic symbols like letters, digits or punctuation signs. Words (such as "house" or "peace") are handled by Clara OCR as sequences of symbols.

It's very important to compute the words of the page. They provide a context both to the OCR and to the reviewer. For instance, if the known symbols of some word were identified as bold, then Clara will automatically turn the bold button on when someone tries to review the unknown symbols of that word. The same applies to preferring the recognition of one symbol as the digit "1" instead of the character "l" if the known symbols of the "word" are digits. Words are also the basis for revision based on spelling. Each word is stored in a wdesc structure in the "word" array.

When building the OCR output, Clara will combine words into lines. Each line is a sequence of words (that is, wdesc structures). The array "line" is the sequence of the heads of the detected lines. Each entry of this array is an lndesc structure. The left and right limits of words must be carefully computed and compared so that the OCR can partition them into columns when dealing with multi-column pages.

2.8 Acts and transliterations

The "acts" or "revision acts" are the human interventions for training a symbol, merging a fragment to one symbol, etc.

As the human interventions are the most precious source of information, Clara logs all revision acts, in order to be able to reuse them.

The transliterations are obtained from the revision acts, so each transliteration refers to one (or more) revision acts, and also inherits some properties from that act (or those acts).

The acts are on the book scope, and not on the page scope. The acts are stored on the file "acts" on the work directory.

Each act stores some data about the reviewer and also the submission date. As we plan to reuse revision data, each act also stores some data about the "original reviewer" and the "original submission date". These fields are meaningful only for reused acts.

2.9 Symbol transliterations

Clara OCR maintains a list of 0 or more proposed or deduced transliterations for each symbol. Along the OCR process, each transliteration receives "votes" from reviewers (REVISION votes) or from machine deduction heuristics, based on shape similarity (SHAPE votes) or on spelling (SPELLING votes).

So the choice of the "best" transliteration is performed through election. Votes are stored on structures of type vdesc, and transliterations are stored on structures of type trdesc. Each symbol stores a pointer for a (possibly empty) list of transliterations and each transliteration stores a pointer for a (possibly empty) list of votes.

So, for instance, when one classifier deduces that one symbol is "a", it generates a "vote" for the transliteration of that symbol to be "a". At the same time, another heuristic could generate another vote for the transliteration to be, say, "o". The diagram illustrates this situation:


   sdesc  ---> trdesc ("a")  ---> trdesc ("o")
                 |                  |
                 +-vdesc            + vdesc
                 |
                 +-vdesc

In this case, the transliteration "a" has two votes, one from the classifier and another from, say, revision and the transliteration "o" has one vote.
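
The linked structure in the diagram can be sketched as below. The type and field names (vote, translit, symbol, votes, next, tr) are hypothetical simplifications; the real vdesc, trdesc and sdesc structures hold more data, and their field names must be checked in the source code.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { REVISION, SHAPE, SPELLING } vote_kind;

typedef struct vote {            /* plays the role of vdesc */
    vote_kind kind;
    struct vote *next;
} vote;

typedef struct translit {        /* plays the role of trdesc */
    const char *text;
    vote *votes;                 /* possibly empty list of votes */
    struct translit *next;
} translit;

typedef struct {                 /* plays the role of sdesc */
    translit *tr;                /* possibly empty list of transliterations */
} symbol;

/* Number of votes received by one transliteration. */
static int count_votes(const translit *t)
{
    int n = 0;
    const vote *v;

    assert(t != NULL);
    for (v = t->votes; v != NULL; v = v->next)
        n++;
    return n;
}
```

With the situation of the diagram, count_votes gives 2 for "a" and 1 for "o".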

As the total stored information about one symbol may be large, Clara maintains for each symbol its "transliteration class", used by the heuristics to categorize each symbol and also to test the current transliteration status (is it known? is it dubious?), frequently used along the source code.

2.10 Transliteration preference

The election process used to choose the "best" transliteration for one symbol (from those obtained through human revision or heuristics based on shape similarity or spelling) consists in computing the "preference" of each transliteration and choosing the one with maximum preference.

The transliteration preference is the integer


    UTSEAN

where

U is 1 if the transliteration was confirmed by the arbiter, or 0 otherwise.

T is 0 if this transliteration was confirmed by no trusted source, 1 if it was confirmed by some trusted source.

S is 0 if this transliteration was not shape-deduced from trusted input, or 1 if it was shape-deduced from trusted input.

E is 1 if this transliteration was deduced from spelling, or 0 otherwise.

A is 0 if this transliteration was confirmed by no anon source, 1 if it was confirmed by some anon source.

N is 0 if this transliteration was not shape-deduced from anon input, or 1 if it was shape-deduced from anon input.
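
One way to read the integer UTSEAN is as six flags packed with U as the most significant digit, so that comparing two preferences compares the flags in the order U, T, S, E, A, N. The sketch below assumes a binary packing; the struct tr_flags and the function preference are hypothetical names, and the source code may use a different encoding (the resulting ordering is the same for any positional encoding).

```c
#include <assert.h>

/* One flag per letter of UTSEAN; each must be 0 or 1. */
typedef struct { int u, t, s, e, a, n; } tr_flags;

/* Pack the flags with U as the most significant bit. */
static int preference(tr_flags f)
{
    /* every flag is 0 or 1 exactly when no bit above bit 0 is set */
    assert(((f.u | f.t | f.s | f.e | f.a | f.n) >> 1) == 0);
    return (f.u << 5) | (f.t << 4) | (f.s << 3) |
           (f.e << 2) | (f.a << 1) | f.n;
}
```

With this encoding, a transliteration confirmed by a trusted source (T = 1) always outranks one supported only by spelling and anonymous input (E = A = N = 1), whatever the number of weaker flags set.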

2.11 Transliteration class computing

Once we have computed the "best" transliteration, we can compute its transliteration class, which is important for various heuristics. From the transliteration class it's possible to test things like "do we know the transliteration of this symbol?" or "is it an alphanumeric character?" or "concerning dimension and vertical alignment could it be an alphanumeric character?", and others.

There are two moments where the transliteration class is computed. The first is when a transliteration is added to the symbol, and the second is when the CHAR class is propagated.

The first uses the following criteria to compute the transliteration class:

1. If the symbol has no transliteration at all, its class is UNDEF.

2. On all other cases, the transliteration with largest preference will be classified as DOT, COMMA, NOISE, ACCENT and others. This search is implemented by the classify_tr function in a straightforward way.

Just before the distribution of all symbols into words we propagate CHARs. All CHAR symbols are searched, and for each one we look at the neighbours that seem to form one word with it. Such neighbours, if untransliterated, will be classified as SCHARs.

2.12 The zones

Clara OCR allows the creation of "zones". Zones are usually used to identify one text block in the page. For instance, a page containing two text columns should use one zone to delimit each column. The zone limits are given by the "limits" array. The top left is (limits[0],limits[1]), as shown in the figure:


    +---------------------------+
    | (0,1)       (6,7)         |
    |  +-----------+            |
    |  |this is a  |            |
    |  |text block |            |
    |  |identified |            |
    |  |by a       |            |
    |  |rectangular|            |
    |  |zone.      |            |
    |  +-----------+            |
    | (2,3)       (4,5)         |
    |                           |
    +---------------------------+

Multiple zones are supported simultaneously, and each one is handled separately when building words and lines and generating the output. The limits of the second zone are limits[8..15], and so on. Also, non-rectangular zones are supported, in order to cover nonrectangular (skewed) text blocks.
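
Indexing into the limits array can be sketched as follows. The helpers zone_x, zone_y and zone_bbox are hypothetical, and the sketch assumes that limits is an array of int holding, for each corner, the horizontal coordinate followed by the vertical one, in the corner order shown in the figure.

```c
#include <assert.h>

/* Corner order inside one zone, following the figure:
   top left, bottom left, bottom right, top right. */
enum { TL = 0, BL = 1, BR = 2, TR = 3 };

/* x and y coordinates of corner c of zone z (8 entries per zone). */
static int zone_x(const int *limits, int z, int c) { return limits[8 * z + 2 * c]; }
static int zone_y(const int *limits, int z, int c) { return limits[8 * z + 2 * c + 1]; }

/* Bounding box of zone z; also usable for skewed zones. */
static void zone_bbox(const int *limits, int z,
                      int *l, int *r, int *t, int *b)
{
    int c;

    assert(z >= 0);
    *l = *r = zone_x(limits, z, TL);
    *t = *b = zone_y(limits, z, TL);
    for (c = BL; c <= TR; c++) {
        int x = zone_x(limits, z, c), y = zone_y(limits, z, c);

        if (x < *l) *l = x;
        if (x > *r) *r = x;
        if (y < *t) *t = y;
        if (y > *b) *b = y;
    }
}
```

For a rectangular zone the bounding box coincides with the zone itself; for a skewed zone it is only an enclosing rectangle.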

3. Heuristics

3.1 Skeleton pixels

The first method implemented by Clara OCR for symbol classification was skeleton fitting. Two symbols are considered similar when each one contains the skeleton of the other.

Clara OCR implements five heuristics to compute skeletons. The heuristic to be used is selected through the command-line option -k as the SA parameter. The value of SA may be 0, 1, 2, 3 or 4.

Heuristics 0, 1 and 2 consider a pixel as being a skeleton pixel if it is the center of a circle inscribed within the closure, and tangent to the pattern boundary in more than one point.

The discrete implementation of this idea is as follows: for each pixel p of the closure, compute the minimum distance d from p to some boundary pixel. Now try to find two pixels on the closure boundary such that the distance from each of them to p does not differ too much from d (must be less than or equal to RR). These pixels are called "BPs".

To make the algorithm faster, the maximum distance from p to the boundary pixels considered is RX. In fact, if there exists a square of size 2*BT+1 centered at p, then p is considered a skeleton pixel.

As this criterion alone produces fat skeletons and isolated skeleton pixels along the closure boundary, two other conditions are imposed: the angular distance between the radii from p to each of those two pixels must be "sufficiently large" (larger than MA), and a small path joining these two boundary pixels (built only with boundary pixels) must not exist (the "joined" function computes heuristically the smallest boundary path between the two pixels, and that distance is then compared to MP).

The heuristics 1 and 2 are variants of heuristic 0:

1. (SA = 1) The minimum linear distance between the two BPs is specified as a factor (ML) of the square of the radius. This will avoid the conversion from rectangular to polar coordinates and may save some CPU time, but the results will be slightly different.

2. (SA = 2) No minimum distance checks are performed, but a minimum of MB BPs is required to exist in order to consider the pixel p a skeleton pixel.

The heuristic 3 is very simple. It computes the skeleton removing BT times the boundary.

The heuristic 4 uses "growing lines". For angles varying in steps of approximately 22 degrees, a line of length RX pixels is drawn from each pixel. The heuristic checks whether the line can be drawn entirely using black pixels. Depending on the results, it decides whether the pixel is a skeleton pixel. For instance, if all lines can be drawn, then the pixel is the center of an inscribed circle, so it's considered a skeleton pixel. All cases considered can be found in the source code.

The heuristic 5 computes the distance from each pixel to the border, for some definition of distance. When the distance is at least RX, it is considered a skeleton pixel. Otherwise, it will be considered a skeleton pixel if its distance to the border is close to the maximum distance around it (see the code for details).

All parameters for skeleton computation are passed to Clara through the -k command-line option, as a list in the following order: SA,RR,MA,MP,ML,MB,RX,BT. For instance:


    clara -k 2,1.4,1.57,10,3.8,10,4,4

The default values and the valid ranges for each parameter must be checked on the source code (see the declaration of the variables SA, RR, MA, MP, ML, MB, RX, and BT, and the function skel_parms). Note that BT must be at most RX.
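
Parsing such a parameter list can be sketched with sscanf. The struct skel_args and the function parse_skel_args are hypothetical, and the choice of int or double for each parameter is an assumption based only on the example values above; the actual types, defaults and ranges are those of skel_parms in the source code.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    int sa;          /* heuristic selector */
    double rr, ma;   /* distance tolerance, minimum angle */
    int mp;          /* boundary path threshold */
    double ml;       /* linear distance factor */
    int mb, rx, bt;  /* minimum number of BPs, RX, BT */
} skel_args;

/* Parse "SA,RR,MA,MP,ML,MB,RX,BT". Returns 1 on success, 0 on error. */
static int parse_skel_args(const char *s, skel_args *p)
{
    int n;

    assert(s != NULL && p != NULL);
    n = sscanf(s, "%d,%lf,%lf,%d,%lf,%d,%d,%d",
               &p->sa, &p->rr, &p->ma, &p->mp,
               &p->ml, &p->mb, &p->rx, &p->bt);
    if (n != 8)
        return 0;            /* missing or malformed fields */
    return p->bt <= p->rx;   /* BT must be at most RX */
}
```

The only range check made here is the BT/RX relation stated in the text; all other validation is left out because the valid ranges are not given in this document.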

3.2 Symbol pairing

Pairing applies to letters and digits. We say that the symbols a and b (in this order) are paired if the symbol b follows the symbol a within one same word. For instance, "h" and "a" are paired on the word "that", "3" and "4" are paired on "12345", but "o" and "b" are not paired on "to be" (because they're not on the same word).

The function s_pair tests symbol pairing, and returns the following diagnostics:

  0 .. the symbols are paired
  1 .. insufficient vertical intersection
  2 .. one or both symbols above ascent
  3 .. one or both symbols below descent
  4 .. maximum horizontal distance exceeded
  5 .. incomplete data
  6 .. different zones

If p is nonzero, then store the inferred alignment for each symbol (a and b) on the va field of these symbols, except when these symbols have the va field already defined.

If rd is non-null, returns the dot diameter in *rd. If an estimate for the dot diameter cannot be computed, does not change *rd.

3.3 The build step

The "build" OCR step, implemented by the "build" function, distributes the symbols on words (analysing the distance, heights and relative position for each pair of symbols), and the words on lines (analysing the distance, heights and relative position for each pair of words). Various important heuristics take effect here.

1. Preparation

The first step of build is to distribute the symbols on words. This is achieved by:

a. Undefining the next-symbol ("E" field) and previous-symbol ("W" field) links for each symbol, the surrounding word ("sw" field) of each symbol, and the next signal ("sl" field) for each symbol.

Remark: The next-symbol and previous-symbol links are used to build the list of symbols of each word. For instance, in the word "goal", "o" is the next for "g" and the previous for "a", "g" has no previous and "l" has no next.

b. Undefining the transliteration class of SCHARs and the uncertain alignment information.

2. Distributing symbols on words

The second step is, for each CHAR not in any word, define a unitary word around it, and extend it to right and left applying the symbol pairing test.

3. Computing the alignment using the words

Some symbols do not have a well-defined alignment by themselves. For instance, a dot may be baseline-aligned (a final dot) or 0-aligned (the "i" dot). So when computing their alignments, we need to analyse their neighborhoods. This is performed in this step.

4. Validating the recognition

Shape-based recognitions must be validated by special heuristics. For instance, the left column of a broken "u" may be recognized as the body of an "i" letter. A validation heuristic may refuse this recognition for instance because the dot was not found. These heuristics are per-alphabet.

5. Creating fake words for punctuation signs

To produce a clean output, symbols that do not belong to any word are not included on the OCR output. So we need to create fake words for punctuation signs like commas or final dots.

6. Aligning words

Words need to be aligned in order to detect the page text lines. This is performed as follows:

a. Undefine the next-word and previous-word links for each word. These are links for the previous and next word within lines. For instance, on the line "our goal is", "goal" is the next for "our" and the previous for "is", "our" has no previous and "is" has no next.

b. Distribution of the words on lines. This is just a matter of computing, for each word, its "next" word. So for each pair of words, we test if they're "paired" in the sense of the function w_pair. If they are, we make the left word point to the right word as its "next" and the right word point to the left one as its "previous".

The function w_pair does not test the existence of intermediary words. So on the line "our goal is" that function will report pairing between "our" and "is". So after detecting pairing, our loop also checks whether the newly detected pairing is "better" than those already detected.

c. Sort the lines. The lines are sorted based on the comparison performed by the function "cmpln".

7. Computing word properties

Finally, word properties can be computed once we have detected the words. Some of these properties are applied to untransliterated symbols. The properties are:

1. The baseline left and right ordinates.

2. The italic and bold flags.

3. The alphabet.

4. The word bounding box.

All these properties are computed by the function wprops.
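The next-symbol ("E") and previous-symbol ("W") links reset in the preparation step form a doubly linked list per word. The sketch below illustrates how such a list can be walked. The struct is a reduced, hypothetical stand-in for the real symbol structure, and the marker NOLINK and both helper functions are assumptions made for the example.

```c
#define NOLINK (-1)   /* hypothetical marker for "no neighbour" */

/* Simplified stand-in for a symbol: only the links are kept. */
struct sym_links {
    int E;   /* next symbol in the word, or NOLINK */
    int W;   /* previous symbol in the word, or NOLINK */
};

/* Follow the W links from symbol k back to the first symbol of
   its word. */
int first_of_word(const struct sym_links *s, int k)
{
    while (s[k].W != NOLINK)
        k = s[k].W;
    return k;
}

/* Count the symbols of the word containing k by walking the E
   links from the first symbol. */
int word_length(const struct sym_links *s, int k)
{
    int n = 0;

    for (k = first_of_word(s, k); k != NOLINK; k = s[k].E)
        ++n;
    return n;
}
```

For the word "goal" stored on indexes 0..3, first_of_word returns the index of "g" whatever the starting symbol, and word_length returns 4.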

3.4 Resetting

3.5 Synchronization

3.6 The function list_cl

The function list_cl lists all closures that intersect the rectangle of height h and width w with top left (x,y). The result will be available on the global array list_cl_r, on indexes 0..list_cl_sz-1. This service is used to clip the closures or symbols (see list_s) currently visible on the PAGE window. It's also used by OCR operations that require locating the neighbours of one closure or symbol (see list_s).

The parameter reset must be zero on all calls, except on the very first call of this function after loading one page.

Every time a new page is loaded, this service must be called informing a nonzero value for the reset parameter. In this case, the other parameters (x, y, w and h) are ignored, and the effect will be preparing the page-specific static data structures used to speed up the operation.

Closures are located by list_cl from the static lists of closures clx and cly, ordered by leftmost and topmost coordinates. Small and large closures are handled separately. The number of closures with width larger than FS is counted on nlx. The number of closures with height larger than FS is counted on nly.

The clx array is divided in two parts. The left one contains (topcl+1)-nlx indexes for the closures with width not larger than FS, sorted by the leftmost coordinate. The right one contains the other indexes, in descending order.

The cly array is divided in two parts. The left one contains (topcl+1)-nly indexes for the closures with height not larger than FS, sorted by the topmost coordinate. The right one contains the other indexes, in descending order.

So the small closures on the rectangle (x,y,w,h) may be located through a combination of binary searches on both axes. The large closures are located by a brute-force linear loop. As nlx and nly are expected to be very small, this brute-force loop won't waste CPU time.
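The binary search on one axis can be sketched as follows. This is not the code of list_cl: it works on a plain ascending array of leftmost coordinates (the hypothetical parameter left) instead of the index array clx, and the function names are assumptions. It only shows why the bound FS on the width of the small closures makes the search possible.

```c
/* Index of the first element of the ascending array a[0..n-1]
   that is >= key, or n if there is none (classic lower bound). */
int lower_bound(const int *a, int n, int key)
{
    int lo = 0, hi = n;

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;

        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* left[0..n-1]: leftmost coordinates of the small closures, in
   ascending order. A closure no wider than fs can only reach
   column x if its leftmost coordinate is at least x-fs+1, and it
   must start at or before column x+w-1. Stores the candidate
   range [*first, *last) and returns its size. */
int small_candidates(const int *left, int n, int fs, int x, int w,
                     int *first, int *last)
{
    *first = lower_bound(left, n, x - fs + 1);
    *last = lower_bound(left, n, x + w);
    return *last - *first;
}
```

The candidates found this way still need an exact intersection test, and the same search must be repeated on the vertical axis.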

4. The GUI

4.1 Main characteristics

1. Clara OCR GUI uses only 5 colors: white, gray, darkgray, verydarkgray and black. The RGB value for each one is customizable at startup (-c command-line option). On truecolor displays, graymaps are displayed using more graylevels than the 5 listed above.

2. The X I/O is not buffered. Buffered X I/O is implemented but it's not being used.

3. Only one X font is used for all needs (button labels, menu entries, HTML rendering, and messages).

4. Asynchronous refresh. The OCR operations just set the redraw flags (redraw_button, redraw_wnd, redraw_int, etc) and let the redraw() function do its work.

5. No toolkit is used. The graphic code is very specific to Clara, and it was not written to be reusable. So it's very small. The disadvantage of this approach is that Clara's look and behaviour will be slightly different from the typical ones found on popular environments like GNOME or KDE.

4.2 Geometry of the application window

The source code frequently refers to some global variables that define the position and size of the main components (the plate, buttons, etc). Most of these variables are set by comp_wnd_size. The variables are:


    WH  .. application window height
    WW  .. application window width
    PH  .. plate height
    PW  .. plate width
    BW  .. button width
    BH  .. button height
    MRF .. maximum reduction factor
    TW  .. tab width
    TH  .. tab height
    PM  .. plate horizontal margin
    PT  .. plate top margin
    RW  .. scrollbar width
    MH  .. menubar height

MRF applies to the scanned document and to the web clip.

4.3 Geometry of windows

The current window is informed through the CDW global variable (set by the setview function). The variable CDW is an index for the dw array of dwdesc structs. Some macros are used to refer the fields of the structure dw[CDW]. The list of all them can be found on the headers under the title "Parameters of the current window".

The portion of the document being displayed is defined by the macros X0, Y0, HR and VR, where (X0,Y0) is the top left and HR and VR are the width and height, measured in pixels (graphic documents) or characters (text documents):


         X0  X0+HR-1
         |     |
    +----+-----+--+
    |             |
    |             |
    |    +-----+  +- Y0
    |    |     |  |
    |    |     |  |
    |    |     |  |
    |    +-----+  +- Y0+VR-1
    |             |
    |             |
    |             |
    |             |
    |             |
    |             |
    +-------------+
     The document

Regarding the application window, the document window is a portion of the plate, defined by DM, DT, DW and DH, where (DM,DT) is the top left and DW and DH are the width and height measured in display pixels.


          DM              DM+DW-1
          |                 |
    +-----+-----------------+----+
    |                            |
    |                            |
    |                            |
    |     +-----------------+    +- DT
    |     |                 | |  |
    |     |                 | X  |
    |     |                 | X  |
    |     |    Document     | X  |
    |     |     window      | |  |
    |     |                 | |  |
    |     |                 | |  |
    |     |                 | |  |
    |     |                 | |  |
    |     +-----------------+    +- DT+DH-1
    |      -----XXXXXXXXXXX-     |
    |                            |
    |                            |
    +----------------------------+
         Application window

The rectangle (X0,Y0,HR,VR) from the document is exhibited into the display rectangle (DM,DT,DW,DH). When displaying the scanned page, the reduction factor RF applies. Each square RFxRF of pixels from the document will be mapped to one display pixel. When displaying the scanned page in fat bit mode, each document pixel will be mapped to a square ZPSxZPS of display pixels, and a grid will be displayed too.
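The mapping in reduced mode can be written down as a small function. This is a sketch under assumptions: the struct view and the function doc_to_display are hypothetical, X0 and Y0 are assumed to be measured in document pixels, and the pixel passed in is assumed to lie inside the visible rectangle.

```c
/* Hypothetical snapshot of the window parameters named in the text. */
struct view {
    int x0, y0;   /* top left of the visible portion, document pixels */
    int dm, dt;   /* top left of the document window, display pixels */
    int rf;       /* reduction factor (>= 1) */
};

/* Map the document pixel (x,y) to the display pixel that shows it.
   All pixels of one RFxRF square map to the same display pixel.
   The caller must ensure x >= x0 and y >= y0. */
void doc_to_display(const struct view *v, int x, int y, int *dx, int *dy)
{
    *dx = v->dm + (x - v->x0) / v->rf;
    *dy = v->dt + (y - v->y0) / v->rf;
}
```

With RF=4, the document columns X0+8 to X0+11 all land on the display column DM+2.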

4.4 Scrollbars

The scrollbars inform the relative portion of the document being exhibited. The viewable region of the document (in the sense just defined) is defined by X0, Y0, HR and VR:


              X0    X0+HR-1


         +----+-------+-------+ - 0
         |                    |
      Y0 +    +-------+       |
         |    |       |       |
         |    |       |       |
         |    |       |       |
         |    |       |       |
 Y0+VR-1 +    +-------+       |
         |                    |
         |                    |
         |                    |
         |                    |
         +--------------------+ - GRY-1


         |                    |
         0                   GRX-1

The variables GRX and GRY contain the total width and height of the full document, measured in pixels. The interpretation of the contents of the variables X0, Y0, HR and VR is not simple. In some cases, they will contain values measured in pixels. In other cases, in characters. The variables HR and VR define the size of the window. However, in some cases this size is the size from the viewpoint of the document and, in others, of the display (the difference is a reduction factor).


            +------------+  -
            |            |  |
            |            |  |
            |            |  X
            |            |  X
            |            |  X
            |            |  |
            |            |  |
            +------------+  -


            |---XXXX-----|

Note that the parameters X0, Y0, HR, VR, GRX and GRY are macros that refer to the corresponding fields of the structure dw[CDW], which stores the parameters of the current DW.
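The proportion shown by a scrollbar can be computed as in the sketch below. The function thumb_geometry and its parameter track (the scrollbar length in display pixels) are hypothetical; the sketch also assumes that the start and the visible size are measured in the same unit as the total size, which, as noted above, is not always the case for HR and VR.

```c
/* Given the visible range [start, start+visible-1] of a document
   of total size total, compute the position and length of the
   thumb of a scrollbar whose track is track display pixels long. */
void thumb_geometry(int start, int visible, int total, int track,
                    int *pos, int *len)
{
    if (total <= 0 || visible >= total) {
        /* the whole document is visible */
        *pos = 0;
        *len = track;
        return;
    }
    *pos = (int)((long)start * track / total);
    *len = (int)((long)visible * track / total);
    if (*len < 1)
        *len = 1;   /* keep the thumb visible */
}
```

The horizontal scrollbar would be computed from X0, HR and GRX, and the vertical one from Y0, VR and GRY.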

4.5 Displaying bitmaps

Bitmaps on HTML windows and on the PAGE window are exhibited in "reduced" fashion (a square RFxRF of pixels from the bitmap is mapped to one display pixel). If RF=1, then each bitmap pixel will map to one display pixel.

The windows PATTERN, PAGE_FATBITS, and PAGE_MATCHES exhibit bitmaps in "zoomed" mode (one bitmap pixel maps to a ZPSxZPS square of display pixels). In this case a grid is displayed to make it easier to distinguish each pixel. The variables GW and GS contain the grid width and the "grid separation" (GS=ZPS+GW).


                   ZPS     GS              GW
                |<---->|<----->|   --->||<---


               ++------++------++------++----
               ++------++------++------++----
               ||      ||      ||      ||
               ||      ||      ||      ||
               ||      ||      ||      ||
               ++------++------++------++----
               ++------++------++------++----
               ||      ||      ||      ||
               ||      ||      ||      ||
               ||      ||      ||      ||

Note that the parameters RF, GS and GW are macros that refer to the corresponding fields of the structure dw[CDW], which stores the parameters of the current DW.
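The relation GS=ZPS+GW gives the position of each zoomed pixel. The sketch below assumes, as the diagram suggests, that every cell is preceded by a grid line of width GW; both function names are hypothetical, and offsets are relative to the origin of the zoomed bitmap on the display.

```c
/* Display offset of the first display pixel of the zoomed cell
   that shows bitmap column (or row) k. Cells repeat every
   gs = zps + gw display pixels, each preceded by a grid line. */
int cell_offset(int k, int zps, int gw)
{
    int gs = zps + gw;

    return k * gs + gw;
}

/* Inverse mapping: bitmap column (or row) shown at display
   offset d, or -1 if d is negative or falls on a grid line. */
int cell_index(int d, int zps, int gw)
{
    int gs = zps + gw;

    if (d < 0 || d % gs < gw)
        return -1;
    return d / gs;
}
```

The inverse mapping is the kind of computation needed to find which bitmap pixel was clicked on a zoomed window.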

4.6 HTML windows overview

Clara is able to read a piece of HTML code, render it, display the rendered code, and attend events like selection of an anchor, filling a text field, or submitting a form. Note that anchor selection and form submission activate internal procedures, and won't call external customizable CGI programs.

Most windows displayed by Clara are produced using this HTML support. When the "show HTML source" option on the "View" menu is selected, Clara will display unrendered HTML, and it will become easier to identify the HTML windows. Note that all HTML is produced by Clara itself. Clara won't read HTML from files or through HTTP.

Perhaps you are asking why Clara implements these things. Clara does not intend to be a web browser. Clara supports HTML because we needed a forms interface, and HTML forms were already there, ready to be used, and extensively proven in practice as an easy and effective solution. Note that we're not trying to achieve completeness. Clara HTML support is partial. There is only one font available, tables cannot be nested and most options are unavailable, PBM is the only graphic format supported, etc. However, it meets our needs, and the code is surprisingly small.

Let's understand how the HTML windows work. First of all, note that there is an html flag on the structure that defines a window (structure dwdesc). For instance, this flag is on for the window OUTPUT (initialization code at function setview).

When the function redraw is called and the window OUTPUT is visible on the plate, the service draw_dw will be called informing OUTPUT through the global variable CDW (Current Window). However, before making that, redraw will test the flag RG to check if the HTML contents for the OUTPUT window must be generated again, calling a function specific to that window. For instance, when a symbol is trained, this flag must be set in order to notify asynchronously the need to recompute the window contents, and render it again.

HTML rendering is performed by the function html2ge. It creates an array of graphic entities. Each such entity is a structure informing the geometric position (x,y,width,height) of something, and what this something is (a piece of text, a button and its label and state, a PBM image, etc). Finally, the function draw_dw searches for the elements currently visible on the portion of the document clipped by the window, and displays them.
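The search for visible elements amounts to a rectangle intersection test. The sketch below is not the code of draw_dw: struct ge_box is a reduced, hypothetical version of a graphic entity that keeps only the geometry, and the function names are assumptions.

```c
/* Hypothetical, reduced graphic entity: geometry only. */
struct ge_box {
    int x, y;   /* top left, document coordinates */
    int w, h;   /* width and height */
};

/* Nonzero if the entity intersects the visible rectangle
   (x0, y0, hr, vr), i.e. the portion clipped by the window. */
int ge_visible(const struct ge_box *g, int x0, int y0, int hr, int vr)
{
    return g->x < x0 + hr && g->x + g->w > x0 &&
           g->y < y0 + vr && g->y + g->h > y0;
}

/* Count the visible entities of ge[0..n-1]. */
int count_visible(const struct ge_box *ge, int n,
                  int x0, int y0, int hr, int vr)
{
    int i, c = 0;

    for (i = 0; i < n; ++i)
        if (ge_visible(&ge[i], x0, y0, hr, vr))
            ++c;
    return c;
}
```

A real drawing loop would render each visible entity instead of counting it.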

4.7 Graphic elements

The rendering of each element on the HTML page creates one graphic element ("GE" for short).

Free text is rendered to one GE of type GE_TEXT per word. This is a "feature". The rendering procedures are currently unable to put more than one text word per GE.

IMG tags are rendered to one GE of type GE_IMG. Note that the value of the SRC attribute cannot be the name of a file containing an image, but must be "internal" or "pattern/n". These keywords refer to the web clip and to the bitmap of the pattern "n". The value of the SRC attribute is stored on the "txt" field of the GE.

INPUT tags with TYPE=TEXT are rendered to one GE of type GE_INPUT. The predefined value of the field (attribute VALUE) is stored on the field "txt" of the GE. The name of the field (attribute NAME) is stored on the field "arg" of the GE.

The Clara OCR HTML support added INPUT tags with TYPE=NUMBER. They're rendered like TYPE=TEXT, but two steppers are added for faster selection. So such tags will create three GEs (left stepper, input field, and right stepper).

INPUT tags with TYPE=CHECKBOX are rendered to one GE of type GE_CBOX. The variable name (attribute NAME) is stored on the "arg" field. The argument to VALUE is stored on the field "txt". The status of the checkbox is stored on the "iarg" field (2 means "checked", 0 means "not checked").

INPUT tags with TYPE=RADIO are rendered just like CHECKBOX. The only difference is the type GE_RADIO instead of GE_CBOX.

SELECT tags (starting a SELECT element) are rendered to one GE of type SELECT. In fact, the entire SELECT element is stored on only one GE. Each SELECT element originates one standard context menu, as implemented by the Clara GUI. The "iarg" field stores the menu index. The free text on each OPTION element is stored as an item label on the context menu. The implementation of the SELECT element is currently buggy: (a) for each rendering, one entry on the array of context menus will be allocated, and will never be freed, and (b) the attribute NAME of the SELECT won't be stored anywhere.

INPUT tags with TYPE=SUBMIT are rendered to one GE of type GE_SUBMIT. The value of the attribute VALUE is stored on the "txt" field. The value of the ACTION attribute is stored on the field "arg". The field "a" will store HTA_SUBMIT.

TD tags are rendered to one GE of type GE_RECT. The value of the BGCOLOR attribute is stored on the "bg" field as a code (only the colors known by the Clara GUI are supported: WHITE, BLACK, GRAY, DARKGRAY and VDGRAY). The coordinates of the cell within the table are stored on the fields "tr" and "tc".

All other supported tags do not generate GEs.

4.8 XML support

We decided to use XML because of the facilities of using non-binary encodings to store, analyse, change and transmit information, and also because XML is a standard. Currently we do not have DTDs, and until now we didn't try to load, using the Clara parser, XML code not produced by Clara itself.

4.9 Auto-submission of forms

The Clara OCR GUI tries to apply immediately all actions taken by the user. So the HTML forms (e.g. the PATTERN window) do not contain SUBMIT buttons, because they're not required (some forms contain a SUBMIT button disguised as a CONSIST facility, but this is just for the user's convenience).

The editable input fields make auto-submission mechanisms a bit harder, because we cannot apply consistency tests and process the form before the user finishes filling the field, so auto-submission must be triggered on selected events. The triggers must be a bit smart, because some events must be attended before submission (for instance toggle a CHECKBOX), while others must be attended after submission (for instance changing the current tab). So auto-submission must be carefully studied. The current strategy follows:

a. When the window PAGE (symbol) or the window PATTERN is visible, auto-submit just after attending the buttons that change the current symbol/pattern data (buttons BOLD, ITALIC, ALPHABET or PTYPE).

b. When the window PAGE (symbol) or the window PATTERN is visible, auto-submit just before attending the left or right arrows.

c. When the user presses ENTER and an active input field exists, auto-submit.

d. Auto-submit as the first action taken by the setview service, in order to flush the current form before changing the current tab or tab mode.

e. Auto-submit just after opening any menu, in order to flush data before some critical action like quitting the program or starting some OCR step.

f. Auto-submit just after attending CHECKBOX or RADIO buttons.

Auto-submission happens when the service auto_submit_form is called, so it's easy to locate all triggering points (just search the string auto_submit_form). This service takes no action when the current form is unchanged.

5. The Clara API

This section describes the variables and functions exported by Clara OCR for extensibility purposes. Note that Clara OCR currently does not have an interface for extensions. The first such interface planned to be added will use the Guile interpreter, available from the GNU Project.

5.1 Redraw flags

The redraw flags inform the function redraw about which portions of the application window must be redrawn. The precise meaning of each flag depends on the implementation of the redraw function, which can be analysed directly on the source code.


    redraw_button .. one specific button or all buttons
    redraw_bg     .. redraw background
    redraw_grid   .. the grid on fatbits windows
    redraw_stline .. the status line
    redraw_dw     .. all visible windows
    redraw_inp    .. all text input fields
    redraw_tab    .. tabs and their labels
    redraw_zone   .. rectangle that defines the zone
    redraw_menu   .. menu bar and currently open menu
    redraw_j1     .. redraw junction 1 (page tab)
    redraw_j2     .. redraw junction 2 (page tab)
    redraw_pbar   .. progress bar
    redraw_map    .. alphabet map
    redraw_flea   .. the flea

An individual button may be redrawn to reflect a status change (on/off). The junction 1 is the junction of the top and middle windows on the page tab, and the junction 2 is the junction of the middle and bottom window on the page tab. The corresponding flags are used when resizing some window on the page tab.

If redraw_menu is 2, the menu is entirely redrawn. If redraw_menu is 1, then the draw_menu function will redraw only the last selected item and the newly selected item, except if the menu is being drawn for the first time.

The progress bar is displayed on the bottom of the window to reflect the progress of some slow operation. By now, the progress bar is unused.
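The asynchronous use of the redraw flags can be modelled as below. This is a reduced sketch, not the real redraw function: only two flags are kept, the counters stand in for actual drawing code, and clearing each flag once its request is served is an assumption about the implementation.

```c
/* Reduced model of the redraw flags: only two of them. */
static int redraw_stline = 0;   /* nonzero: status line pending */
static int redraw_menu = 0;     /* 0: nothing, 1: changed items,
                                   2: the entire menu */

/* Hypothetical counters standing in for real drawing code. */
static int stline_draws = 0;
static int full_menu_draws = 0;
static int partial_menu_draws = 0;

/* Simplified stand-in for the redraw function: serve each
   pending request once, then clear its flag (an assumption). */
void redraw_sketch(void)
{
    if (redraw_stline) {
        ++stline_draws;
        redraw_stline = 0;
    }
    if (redraw_menu == 2)
        ++full_menu_draws;
    else if (redraw_menu == 1)
        ++partial_menu_draws;
    redraw_menu = 0;
}
```

The code that changes the state only sets the flags. The drawing itself happens later, when the redraw function runs, so several requests for the same component collapse into one drawing operation.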

5.2 OCR statuses

The OCR run in course (if any) stores various statuses on global variables. For instance, the ocring macro will be nonzero if one OCR run is in course. The GUI informs the OCR control routines about what to do along the OCR run using various global variables. Some of them drive the classification procedures:


  justone      .. Classify only one symbol
  this_pattern .. Use only one pattern to classify
  recomp_cl    .. Ignore current classes

The first two are used for testing purposes, for instance when checking why the classification routines classified some symbol in an unexpected way.

The stop_ocr variable is set by the GUI when the STOP button is pressed. Its status will be tested by the routines that control the OCR run in course. Note that the variable cannot_stop may be set by the current OCR step in course. Its effect is to inhibit the GUI from setting the stop_ocr status. It's used by routines that cannot be stopped, otherwise the data structures they're handling would be left in an unrecoverable inconsistent state.
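The interplay between stop_ocr and cannot_stop can be sketched as follows. The two variables are declared locally so that the example is self-contained, and the handler request_stop is a hypothetical name; the real GUI code may be organized differently.

```c
static int stop_ocr = 0;      /* set by the GUI, tested by the
                                 OCR control routines */
static int cannot_stop = 0;   /* set by an OCR step that must
                                 not be interrupted */

/* Hypothetical handler for the STOP button: returns 1 if the
   stop request was registered, 0 if cannot_stop inhibited it. */
int request_stop(void)
{
    if (cannot_stop)
        return 0;
    stop_ocr = 1;
    return 1;
}
```

An uninterruptible routine would set cannot_stop on entry and reset it on exit, once its data structures are consistent again.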

The OCR control routines handle the following statuses:


  ocr_all  .. OCR all pages
  starting .. continue_ocr was not called until now
  onlystep .. run only this OCR step

The buttons CLASSIFY, BUILD, etc, start one specific OCR step. The OCR step to be executed is stored on onlystep. The to_ocr variable stores the page where the OCR run will be executed.

The other to_* variables together with nopropag store information about the revision operation requested from the GUI:


  to_tr    .. the transliteration to submit to the current symbol
  to_rev   .. the type of revision
  nopropag .. propagation flag for the result
  to_arg   .. integer argument to revision operation

The types of revision are: transliteration submission (1), fragment merging (2), symbol disassemble (3) and word extension (4).

The to_arg variable stores the fragment to merge to the current symbol or the symbol to add to the current word.
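The revision types can be given symbolic names, as in the sketch below. The constant names and the helper rev_uses_arg are hypothetical; only the numeric values and the description of to_arg come from the text above.

```c
/* Hypothetical names for the values stored on to_rev. */
enum {
    REV_TRANSLIT = 1,      /* transliteration submission */
    REV_MERGE = 2,         /* fragment merging */
    REV_DISASSEMBLE = 3,   /* symbol disassemble */
    REV_EXTEND = 4         /* word extension */
};

/* Nonzero if, according to the description above, the revision
   type takes its operand from to_arg (the fragment to merge or
   the symbol to add to the current word). */
int rev_uses_arg(int to_rev)
{
    return to_rev == REV_MERGE || to_rev == REV_EXTEND;
}
```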

The variable ocr_other stores which operation to perform by the OCR_OTHER step. This step is reserved to operations that are outside the OCR run main stream, but require the control provided by the continue_ocr function.

The variable text_switch redirects (if nonzero) the DEBUG window output to an internal array.

5.3 The function setview

As each window is displayed on only one mode and each mode belongs to only one tab, in order to set a given mode or a given tab, just call setview informing one window present on that mode as parameter. That is the only parameter received by setview. The geometry of each window will be re-computed by setview, so setview is not called only to change the current mode, but also after operations that change the geometry of the windows, just like resizing the application X window or hiding the scrollbars, or resizing the PAGE window, etc.

5.4 The function redraw (to be written)

5.5 The function show_hint

Messages are put on the status line (on the bottom of the application X window) using the show_hint service. The show_hint service receives two parameters: an integer f and a string (the message).

If f is 0, then the message is "discardable". It won't be displayed if a permanent message is currently being displayed.

If f is 1, then the message is "fixed". It won't be erased by a subsequent show_hint call informing as message the empty string (in practical terms, the pointer motion won't clear the message).

If f is 2, then the message is "permanent" (the message will be cleared only by another fixed or permanent message).

If f is 3, clear any current message.
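The rules for the parameter f can be modelled as a small state machine. The sketch below is one reading of the description above, not the code of show_hint: the names hint_msg, hint_type, HINT_MAX and show_hint_model are hypothetical, and it assumes that a nonempty discardable message may replace a fixed one.

```c
#include <string.h>

#define HINT_MAX 128   /* hypothetical buffer size */

static char hint_msg[HINT_MAX] = "";   /* message being displayed */
static int hint_type = 0;   /* 0 discardable, 1 fixed, 2 permanent */

/* Model of the rules above. Returns 1 if the status line
   changed, 0 if the request was ignored. */
int show_hint_model(int f, const char *s)
{
    if (f == 3) {
        /* clear any current message */
        hint_msg[0] = '\0';
        hint_type = 0;
        return 1;
    }
    if (f == 0) {
        /* a discardable message never replaces a permanent one */
        if (hint_type == 2)
            return 0;
        /* the empty string does not erase a fixed message */
        if (hint_type == 1 && s[0] == '\0')
            return 0;
    }
    strncpy(hint_msg, s, HINT_MAX - 1);
    hint_msg[HINT_MAX - 1] = '\0';
    hint_type = (f == 1 || f == 2) ? f : 0;
    return 1;
}
```

For instance, after a permanent message is displayed, a discardable message is ignored until a fixed or permanent message, or a call with f equal to 3, replaces the permanent one.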

5.6 The function start_ocr

Starts a complete OCR run or some individual OCR step on one given page, or on all pages. For instance, start_ocr is called by the GUI when the user presses the "OCR" button or when the user requests loading of one specific page.

In fact, almost every user-requested operation is performed as an "OCR step" in order to take advantage of the execution model implemented by the function continue_ocr. So start_ocr is the starting point for attending almost all user requests.

If p is -1, process all pages; if p < -1, process only the current page (cpage); otherwise process only the page p. If s >= 0 then run only step s, otherwise run all steps.

If the flag r is nonzero, will ignore the current classes (if any) and recompute them again (this is meaningful only to the symbol classification step).
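The interpretation of the parameters p and s can be sketched as below. Both functions are hypothetical helpers, not part of start_ocr. The current page is passed explicitly instead of being read from the global cpage, npages is an assumed name for the number of pages, and pages are assumed to be numbered from 0.

```c
/* Resolve the page argument p as described above. Stores in
   *first and *last the inclusive range of pages to process. */
void pages_to_process(int p, int cpage, int npages,
                      int *first, int *last)
{
    if (p == -1) {
        /* all pages */
        *first = 0;
        *last = npages - 1;
    } else if (p < -1) {
        /* only the current page */
        *first = *last = cpage;
    } else {
        /* only the page p */
        *first = *last = p;
    }
}

/* Nonzero if the given step must run for the step argument s:
   a negative s selects all steps, otherwise only step s. */
int step_selected(int s, int step)
{
    return s < 0 || s == step;
}
```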

6. How to change the source code (examples)

6.1 How to add a bitmap comparison method

It's not hard to add a bitmap comparison method to Clara OCR. This may become very important when the available heuristics are unable to classify the symbols of some book, so a new heuristic must be created. In order to exemplify that, we'll add a naive bitmap comparison method. It'll just compare the number of black pixels on each bitmap, and consider that the bitmaps are similar when these numbers do not differ too much.

Please remember that the code added or linked to Clara OCR must be GPL.

In order to add the new bitmap comparison method, we need to write a function that compares two bitmaps returning how similar they are, add this function as an alternative to the Options menu, and call it when classifying the page symbols. We'll perform all these steps adding a naive comparison method, step by step. The most difficult one is to write the bitmap comparison function. This step is covered on the subsection "How to write a bitmap comparison function".

Let's present the other two steps. To add the new method to the Options menu, we need:

a. Declare the macro CL_NBP:


  #define CL_NBP 2

b. Include the new classifier on the tune form (function mk_tune):


  C = (classifier==CL_NBP) ? "CHECKED" : "";
  ..

Now add the call to this new method on the classifier. This is just a matter of adding one more item on the function selbc:


  else if (classifier == CL_NBP)
      r = classify(c,bmpcmp_nbp,1);

Where bmpcmp_nbp is the function that will be discussed on the subsection "How to write a bitmap comparison function".

To use the new method, recompile the sources, start Clara OCR and select the new method on the tune tab.

6.2 How to write a bitmap comparison function

The bitmap comparison function required for the example we're presenting has the following prototype:


    int bmpcmp_nbp(int c,int st,int k,int d)

The first parameter (c) is the symbol being compared, the second (st) is the current status, the third (k) is the current pattern and the fourth (d) will be discussed later.

Clara OCR will call bmpcmp_nbp once informing status 1 every time a new symbol c is chosen, so bmpcmp_nbp will be able to buffer symbol data on static areas. Note that to classify each symbol, Clara will perform various calls to the bitmap comparison function, because it will check X events (like the STOP button), and, when visual modes are enabled, Clara will need to refresh the screen displaying the progress of the classification.

The block of bmpcmp_nbp corresponding to status 1 will merely store on the static variable nbp the value of the nbp field of the symbol structure.


    nbp = mc[c].nbp;

Before trying to classify the symbol c as similar to the pattern k, Clara allows the bitmap comparison method to apply simple heuristics to filter bad candidates in order to save CPU cycles. This is done informing status 2. The bitmap comparison function is expected, in this case, to return 1 if the pattern was accepted for further processing, or 0 if it was rejected. For simplicity, the bmpcmp_nbp function will return 1 in all cases.

When Clara OCR wants to effectively ask if the pattern k matches the symbol c, it calls the bitmap comparison function informing status 3. The function must return a similarity index ranging from 0 (no similarity) to 10 (identity). Now we must take care of the fourth parameter (d). It informs whether Clara is asking for a direct (d == 1) or an indirect (d == 0) comparison. This applies to asymmetric comparison methods. For instance, when using skeleton fitting, "direct" means that the pattern skeleton fits the symbol, and "indirect" means that the symbol skeleton fits the pattern. Clara will make both calls to avoid false positives.

In our example, in both cases we'll return 10 if the test 5 * abs(nbp-m) <= (nbp+m) results true, where m is the number of black pixels of the pattern, and 0 otherwise.


    m = pattern[k].nbp;
    if ((5*abs(nbp-m)) <= (nbp+m))
        return(10);
    else
        return(0);

Finally, every bitmap comparison method is expected to produce a graphic image of the current status of the comparison when called with status 0. That image must be an FSxFS bitmap where each pixel may assume the color WHITE, BLACK or GRAY. This bitmap must be stored on the cfont array of bytes. The pixel on line i and column j must be put on cfont[i+j*FS]. In our case we'll just call the services copy_mc and bm2byte. The effect is to copy the symbol bitmap to the cfont array:


    unsigned char mcbm[BMS];
    [...]
    copy_mc(mcbm,c);
    bm2byte(cb,mcbm);
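The fragments of this subsection can be assembled into one function. The sketch below is self-contained and therefore replaces the Clara OCR data structures by minimal stand-ins: sym_stub and pat_stub are hypothetical and model only the nbp field, and the array sizes are arbitrary. The value returned for status 1 and the omission of the status 0 image are also assumptions of this sketch.

```c
#include <stdlib.h>

/* Minimal stand-ins for the Clara OCR structures: only the nbp
   field (number of black pixels) is modelled. */
struct sym_stub { int nbp; };
struct pat_stub { int nbp; };

static struct sym_stub mc[8];
static struct pat_stub pattern[8];

int bmpcmp_nbp(int c, int st, int k, int d)
{
    static int nbp = 0;   /* black pixels of the current symbol */
    int m;

    (void)d;   /* the test is symmetric: direct and indirect
                  comparisons give the same answer */
    switch (st) {
    case 1:    /* new symbol chosen: buffer its data */
        nbp = mc[c].nbp;
        return 1;
    case 2:    /* pre-filter: accept every pattern */
        return 1;
    case 3:    /* the comparison itself */
        m = pattern[k].nbp;
        return (5 * abs(nbp - m) <= nbp + m) ? 10 : 0;
    default:   /* status 0 (progress image) omitted here */
        return 0;
    }
}
```

With a symbol of 100 black pixels, a pattern of 120 black pixels is accepted (5*20 <= 220) and a pattern of 200 black pixels is rejected (5*100 > 300).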

6.3 How to add an application button

These are the steps to add a new button:

1. Create a new button macro after those already existing (bzoom, balpha, etc). Note that each button macro is defined as a unique integer (0, 1, 2, etc).


  #define bzoom 0
  [...]
  #define bfoo 13

2. Register the new button at init_ds(), together with its label. Multi-state buttons have multiple labels, specified as "state1:state2:state3":


    register_button(bzoom,"zoom");
    [...]
    register_button(bfoo,"foo");

The current state of the new button is stored by button[bfoo]. When the state is nonzero, the button is drawn using a dark background.

3. Add a new block to attend this button on mactions_b1 and, if desired, on mactions_b2 (just copy one existing block and adapt it). It's mandatory to attend help requests. On/off and multi-state buttons must circulate the acceptable values of the respective entry of the array "button" in order to change the current state, and set the redraw_button flag.


  if (i == bfoo) {
      if (help) {
          show_hint(0,"This is the FOO button");
          return;
      }
      show_hint(0,"You pressed the FOO button");
  }

There is no need to inform the type of the button (on/off, multi-state or event catcher). The behaviour is defined by the label and by the attending block. If the attending block changes the button state, it must request redraw. Example:


      button[bfoo] = 1 - button[bfoo];
      redraw_button = bfoo;
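For a multi-state button, circulating the acceptable values can be done with a modulo operation. The sketch below declares its own button array and redraw_button flag so that it is self-contained; NBUTTONS, the initial value -1 of the flag and the helper cycle_button are assumptions made for the example.

```c
#define bfoo 13       /* as in the example above */
#define NBUTTONS 14   /* hypothetical size of the button array */

static int button[NBUTTONS];
static int redraw_button = -1;   /* assumption: -1 stands for
                                    "nothing pending" */

/* Advance a button with nstates states to its next state,
   wrapping around, and request its redraw. */
void cycle_button(int b, int nstates)
{
    button[b] = (button[b] + 1) % nstates;
    redraw_button = b;
}
```

With two states this is equivalent to the on/off toggle shown above.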

7. Bugs and TODO list

(Some) Major tasks

1. Vertical segmentation (partially done).

2. Heuristics to merge fragments.

3. Spelling-generated transliterations

4. Geometric detection of lines and words

5. Finish the documentation

6. Simplify the revision acts subsystem

Minor tasks

1. Change sprintf to snprintf.

2. Fix asymmetric behaviour of the function "joined".

3. Optimize bitmap copies to copy words, not bits, where possible (partially done).

4. Support Multiple OCR zones (partially done).

5. Make sure that the access to the data structures is blocked during OCR (all functions that change the data structures must check the value of the flag "ocring").

6. Use 64-bit integers for bitmap comparisons and support big-endian CPUs (partially done).

7. Clear memory buffers before freeing.

8. Allow the transliterations to refer multiple acts (partially done).

9. Rewrite composition of patterns for classification of linked symbols.

10. The flea stops but does not disappear when the window loses and regains focus.

11. Substitute various magic numbers by per-density and per-minimum-fontsize values.

12. Synchronization destroys the result of partial matching because partial matching assigns to the symbol only one pattern as its best match.

8. AVAILABILITY

Clara OCR is free software. Its source code is distributed under the terms of the GNU GPL (General Public License), and is available at http://www.claraocr.org/. If you don't know what the GPL is, please read it and check the GPL FAQ at http://www.gnu.org/copyleft/gpl-faq.html. You should have received a copy of the GNU General Public License along with this software; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. The Free Software Foundation can be found at http://www.fsf.org.

9. CREDITS

Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati wrote the internal preprocessor. Clara OCR includes bugfixes produced by other developers. The Changelog (http://www.claraocr.org/CHANGELOG) acknowledges all of them (see below). Imre Simon contributed high-volume tests, discussions with experts, selection of bibliographic resources, publicity and many ideas on how to make the software more useful.

Ricardo authored various free materials, some included (at least) in Conectiva, Debian, FreeBSD and SuSE (the verb conjugator "conjugue", the ispell dictionary br.ispell and the proxy axw3). He recently ported the EiC interpreter to the Psion 5 handheld and patched the Xt-based vncviewer to scale framebuffers and compute image diffs. Ricardo works as an independent developer and instructor. He received no financial aid to develop Clara OCR. He's not an employee of any company or organization.

Imre Simon promotes the usage and development of free technologies and information from his research, teaching and administrative labour at the University.

Roberto Hirata Junior and Marcelo Marcilio Silva contributed ideas on character isolation and recognition. Richard Stallman suggested improvements on how to generate HTML output. Marius Vollmer is helping to add Guile support. Jacques Le Marois helped with the announcement process. We acknowledge Mike O'Donnell and Junior Barrera for their good criticism. We acknowledge Peter Lyman for his remarks about the Berkeley Digital Library, and Wanderley Antonio Cavassin, Janos Simon and Roberto Marcondes Cesar Junior for some web and bibliographic pointers. Bruno Barbieri Gnecco provided hints and explanations about GOCR (main author: Jorg Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is kindly supporting our attempts to use portions of his code. Adriano Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried the tutorial before the first announcement. Eduardo Marcel Macan packaged Clara OCR for Debian and suggested some improvements. Mandrakesoft is hosting claraocr.org. We acknowledge Conectiva and SuSE for providing copies of their outstanding distributions. Finally, we acknowledge the late Jose Hugo de Oliveira Bussab for his interest in our work.

Adriano Nagelschmidt Rodrigues donated a 15" monitor.

The fonts used by the "view alphabet map" feature came from Roman Czyborra's "The ISO 8859 Alphabet Soup" page at http://czyborra.com/charsets/iso8859.html.

The names cited in the CHANGELOG and not mentioned above follow. Their contributions include small patches, bug reports, specfiles, suggestions and explanations.

Brian G. (win32), Bruce Momjian, Charles Davant (server admin), Daniel Merigoux, De Clarke, Emile Snider (preprocessor, to be released), Erich Mueller, Franz Bakan (OS/2), groggy, Harold van Oostrom, Ho Chak Hung, Jeroen Ruigrok, Laurent-jan, Nathalie Vielmas, Romeu Mantovani Jr (packager), Ron Young, R P Herrold, Sergei Andrievskii, Stuart Yeates, Terran Melconian, Thomas Klausner (NetBSD), Tim McNerney, Tyler Akins.