ocrodjvu
Language: en
Version: 05/24/2010 (ubuntu - 24/10/10)
Section: 1 (User commands)
NAME
ocrodjvu - OCR for DjVu files
SYNOPSIS
- ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file
- ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file
- ocrodjvu --save-script script-file [option...] djvu-file
- ocrodjvu --in-place [option...] djvu-file
- ocrodjvu --dry-run [option...] djvu-file
- ocrodjvu {--version | --help | -h | --list-engines | --list-languages}
DESCRIPTION
- ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files.
The following OCR engines are supported:
- • OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or rec-tess) command, so that ultimately Tesseract acts as the OCR backend);
- • Cuneiform for Linux[2].
OPTIONS
OCR engine options
--engine=engine-id
- Use this OCR engine. The default is 'ocropus' (OCRopus).
--list-engines
- Print list of available OCR engines.
Options controlling output
It is mandatory to use exactly one of the following options:
-o, --save-bundled=output-djvu-file
- Save OCR results as a bundled multi-page document into output-djvu-file.
-i, --save-indirect=index-djvu-file
- Save OCR results as an indirect multi-page document. Use index-djvu-file as the index file name; put the component files into the same directory. The directory must exist and be writable.
--save-script=script-file
- Save a djvused script with OCR results into script-file.
--in-place
- Save OCR results in place.
(Use this option to retain compatibility with ocrodjvu < 0.2.)
--dry-run
- Don't change any files; discard the OCR results.
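The "exactly one output option" rule above can be sketched as a small Python helper that assembles an argument list for ocrodjvu. This is only an illustration: the function name build_command and its keyword parameters are hypothetical and not part of ocrodjvu; only the option names are taken from this page.

```python
def build_command(djvu_file, *, save_bundled=None, save_indirect=None,
                  save_script=None, in_place=False, dry_run=False):
    """Return an argv list for ocrodjvu (hypothetical helper, not part of ocrodjvu)."""
    chosen = []
    if save_bundled is not None:
        chosen.append("--save-bundled=" + save_bundled)
    if save_indirect is not None:
        chosen.append("--save-indirect=" + save_indirect)
    if save_script is not None:
        chosen.append("--save-script=" + save_script)
    if in_place:
        chosen.append("--in-place")
    if dry_run:
        chosen.append("--dry-run")
    # The manual requires exactly one of the output options.
    if len(chosen) != 1:
        raise ValueError("exactly one output option is required")
    return ["ocrodjvu", chosen[0], djvu_file]
```

The resulting list could then be passed to subprocess.run() on a system where ocrodjvu is installed.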
Text segmentation options
-t lines, --details=lines
- Record location of every line. Don't record locations of particular words or characters.
This is the default for OCRopus 0.2.
-t words, --details=words
- Record location of every line and every word. Don't record locations of particular characters.
This is the default for OCRopus ≥ 0.3.1 and for Cuneiform.
This option is ineffective with OCRopus 0.2.
-t chars, --details=chars
- Record location of every line, every word and every character.
This option is ineffective with OCRopus 0.2.
--word-segmentation=simple
- Consider each non-empty sequence of non-whitespace characters a single word.
This is the default, despite being linguistically incorrect.
--word-segmentation=uax29
- Use the Unicode Text Segmentation[3] algorithm to break lines into words.
This option breaks the assumption of some DjVu tools that words are separated by spaces, and it is therefore not recommended.
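To illustrate what the simple segmentation rule means in practice, here is a minimal Python sketch. It assumes that Python's notion of whitespace is close enough to the one ocrodjvu uses; the function name simple_words is hypothetical.

```python
def simple_words(line):
    """Split a line into maximal runs of non-whitespace characters."""
    # str.split() with no arguments splits on runs of whitespace
    # and never returns empty strings.
    return line.split()


print(simple_words("Don't stop, e.g. here."))
# prints ["Don't", 'stop,', 'e.g.', 'here.']
```

Note that punctuation stays attached to the neighbouring word, which is why this rule is described as linguistically incorrect.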
Other options
--clear-text
- Remove existing hidden text if present in the pages not selected for OCR.
(Use this option to retain compatibility with ocrodjvu < 0.2.)
--ocr-only
- Don't save pages that were not processed.
--language=language-id
- Set recognition language. language-id is typically an ISO 639-2 three-letter code.
For OCRopus, the default is 'eng' (English), unless the tesslanguage environment variable is set. For other OCR engines, the default is always 'eng'.
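The default-language rule can be sketched in Python as follows. This is an assumption-laden illustration, not ocrodjvu's actual code: the function name default_language is hypothetical, and how ocrodjvu treats an empty tesslanguage value is not stated here.

```python
import os


def default_language(engine, environ=None):
    """Return the default recognition language for an engine (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    if engine == "ocropus":
        # For OCRopus, tesslanguage overrides the built-in default when set.
        return environ.get("tesslanguage", "eng")
    # Other engines always default to English.
    return "eng"
```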
--list-languages
- Print list of available languages for the currently selected OCR engine.
--render=mask
- Render only masks of page images.
This is the default.
--render=foreground
- Render only foreground layers of page images.
--render=all
- Render all layers of page images.
This option is necessary to OCR DjVu files with invalid foreground/background separation.
-p, --pages=page-range
- Specifies pages to process. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages are numbered from 1.
The default is to process all pages.
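The page-range syntax can be made concrete with a short Python sketch. It is an illustration only: the function name parse_page_range is hypothetical, and rejecting reversed ranges or page numbers below 1 is an assumption about reasonable behaviour, not a statement of what ocrodjvu does.

```python
def parse_page_range(spec):
    """Expand a page-range string such as "17,37-42" into a list of page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Contiguous sub-range, e.g. "37-42".
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
        else:
            # Single page, e.g. "17".
            first = last = int(part)
        # Pages are numbered from 1; a reversed range is treated as an error here.
        if first < 1 or last < first:
            raise ValueError(f"invalid sub-range: {part!r}")
        pages.extend(range(first, last + 1))
    return pages


print(parse_page_range("17,37-42"))
# prints [17, 37, 38, 39, 40, 41, 42]
```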
-j, --jobs=n
- Start up to n OCR processes.
-D, --debug
- To ease debugging, don't delete intermediate files.
--version
- Output version information and exit.
-h, --help
- Display help and exit.
ENVIRONMENT
The following environment variables affect ocrodjvu:
tesslanguage
- Recognition language for Tesseract.
(Use of this variable is deprecated in favor of the --language option.)
TMPDIR
- Directory for temporary files. The default is /tmp.
SEE ALSO
djvu(1), ocroscript(1), tesseract(1)
AUTHOR
Jakub Wilk <jwilk@jwilk.net>
- Author.
COPYRIGHT
Copyright © 2008, 2009, 2010 Jakub Wilk
NOTES
- 1.
- OCRopus
- http://ocropus.googlecode.com/
- 2.
- Cuneiform for Linux
- http://launchpad.net/cuneiform-linux
- 3.
- Unicode Text Segmentation
- http://unicode.org/reports/tr29/