ocrodjvu

Language: en

Version: 05/24/2010 (ubuntu - 24/10/10)

Section: 1 (User commands)

NAME

ocrodjvu - OCR for DjVu files

SYNOPSIS

ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file
ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file
ocrodjvu --save-script script-file [option...] djvu-file
ocrodjvu --in-place [option...] djvu-file
ocrodjvu --dry-run [option...] djvu-file
ocrodjvu {--version | --help | -h | --list-engines | --list-languages}

DESCRIPTION

ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files.

The following OCR engines are supported:

• OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or rec-tess) command, so that ultimately Tesseract acts as the OCR backend);
• Cuneiform for Linux[2].

OPTIONS

OCR engine options

--engine=engine-id

Use this OCR engine. The default is 'ocropus' (OCRopus).

--list-engines

Print list of available OCR engines.

Options controlling output

It is mandatory to use exactly one of the following options:

-o, --save-bundled=output-djvu-file

Save OCR results as a bundled multi-page document into output-djvu-file.

-i, --save-indirect=index-djvu-file

Save OCR results as an indirect multi-page document. Use index-djvu-file as the index file name; put the component files into the same directory. The directory must exist and be writable.

--save-script=script-file

Save a djvused script with OCR results into script-file.

--in-place

Save OCR results in place.
(Use this option to retain compatibility with ocrodjvu < 0.2.)

--dry-run

Don't change any files; discard the OCR results.
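
The rule that exactly one output option must be given can be illustrated with a small Python sketch. It only assembles an argument list from the options documented above and runs nothing. The helper name build_command and the file names are hypothetical and are not part of ocrodjvu.

```python
def build_command(djvu_file, mode, target=None, extra=()):
    """Assemble an ocrodjvu argument list (hypothetical helper, not part of ocrodjvu)."""
    # Output options that take a file name, and those that do not.
    needs_target = ("--save-bundled", "--save-indirect", "--save-script")
    no_target = ("--in-place", "--dry-run")
    if mode in needs_target:
        if target is None:
            raise ValueError(mode + " requires a file name")
        output = [mode + "=" + target]
    elif mode in no_target:
        if target is not None:
            raise ValueError(mode + " does not take a file name")
        output = [mode]
    else:
        raise ValueError("unknown output option: " + mode)
    # Exactly one output option, then any other options, then the input file.
    return ["ocrodjvu"] + output + list(extra) + [djvu_file]


# Assumed file names, for illustration only.
print(build_command("book.djvu", "--save-bundled", "book-ocr.djvu",
                    ["--language=eng"]))
```

If ocrodjvu is installed, the resulting list could be passed to subprocess.call() to run the command.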

Text segmentation options

-t lines, --details=lines

Record location of every line. Don't record locations of particular words or characters.
This is the default for OCRopus 0.2.

-t words, --details=words

Record location of every line and every word. Don't record locations of particular characters.
This is the default for OCRopus ≥ 0.3.1 and for Cuneiform.
This option is ineffective with OCRopus 0.2.

-t chars, --details=chars

Record location of every line, every word and every character.
This option is ineffective with OCRopus 0.2.

--word-segmentation=simple

Consider each non-empty sequence of non-whitespace characters a single word.
This is the default, despite being linguistically incorrect.
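
As a rough illustration of this rule (not ocrodjvu's actual code), splitting on runs of whitespace in Python gives the same kind of result. The function name simple_words is an assumption made for this sketch, and Python's notion of whitespace may differ in detail from the one ocrodjvu uses.

```python
def simple_words(line):
    """Split a line into maximal runs of non-whitespace characters."""
    # str.split() with no arguments splits on runs of whitespace
    # and never returns empty strings.
    return line.split()


print(simple_words("Hello, world!  It's 5 p.m."))
# prints ['Hello,', 'world!', "It's", '5', 'p.m.']
```

Punctuation stays attached to the neighbouring letters, which is why the rule is described as linguistically incorrect.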

--word-segmentation=uax29

Use the Unicode Text Segmentation[3] algorithm to break lines into words.
This option breaks the assumption of some DjVu tools that words are separated by spaces, and is therefore not recommended.

Other options

--clear-text

Remove existing hidden text if present in the pages not selected for OCR.
(Use this option to retain compatibility with ocrodjvu < 0.2.)

--ocr-only

Don't save pages that were not processed.

--language=language-id

Set recognition language. language-id is typically an ISO 639-2 three-letter code.
For OCRopus, the default is 'eng' (English), unless the tesslanguage environment variable is set. For other OCR engines, the default is always 'eng'.
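
The default-language rule above can be sketched in Python as follows. The function name default_language and the environ parameter are hypothetical. The sketch also assumes that any value of tesslanguage, even an empty one, counts as "set", which the text above does not specify.

```python
import os


def default_language(engine, environ=None):
    """Return the language used when --language is not given (sketch only)."""
    if environ is None:
        environ = os.environ
    if engine == "ocropus":
        # Only OCRopus consults the deprecated tesslanguage variable.
        return environ.get("tesslanguage", "eng")
    # Every other engine always defaults to English.
    return "eng"


print(default_language("cuneiform", {"tesslanguage": "deu"}))
# prints eng
```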

--list-languages

Print list of available languages for the currently selected OCR engine.

--render=mask

Render only masks of page images.
This is the default.

--render=foreground

Render only foreground layers of page images.

--render=all

Render all layers of page images.
This option is necessary to OCR DjVu files with invalid foreground/background separation.

-p, --pages=page-range

Specifies pages to process. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages are numbered from 1.
The default is to process all pages.
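
The page-range syntax can be made concrete with a short Python sketch that expands a page-range string into page numbers. This is not ocrodjvu's own parser. The name parse_page_range is hypothetical, and rejecting reversed sub-ranges such as 42-37 is an assumption, since the text above does not say how they are handled.

```python
def parse_page_range(spec):
    """Expand a page-range string such as '17,37-42' into a list of pages."""
    pages = []
    for sub in spec.split(","):
        sub = sub.strip()
        if "-" in sub:
            # Contiguous range, e.g. 37-42.
            first, last = (int(part) for part in sub.split("-", 1))
        else:
            # Single page, e.g. 17.
            first = last = int(sub)
        # Pages are numbered from 1; reversed ranges are rejected here.
        if first < 1 or last < first:
            raise ValueError("invalid sub-range: %r" % sub)
        pages.extend(range(first, last + 1))
    return pages


print(parse_page_range("17,37-42"))
# prints [17, 37, 38, 39, 40, 41, 42]
```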

-j, --jobs=n

Start up to n OCR processes.

-D, --debug

To ease debugging, don't delete intermediate files.

--version

Output version information and exit.

-h, --help

Display help and exit.

ENVIRONMENT

The following environment variables affect ocrodjvu:

tesslanguage

Recognition language for Tesseract.
(Use of this variable is deprecated in favor of the --language option.)

TMPDIR

Directory for temporary files. The default is /tmp.

SEE ALSO

djvu(1), ocroscript(1), tesseract(1)

AUTHOR

Jakub Wilk <jwilk@jwilk.net>

Copyright © 2008, 2009, 2010 Jakub Wilk

NOTES

1.
OCRopus
http://ocropus.googlecode.com/
2.
Cuneiform for Linux
http://launchpad.net/cuneiform-linux
3.
Unicode Text Segmentation
http://unicode.org/reports/tr29/