hocr2pdf

Language: en

Version: 253819 (debian - 07/07/09)

Section: 1 (User commands)

NAME

hocr2pdf - hOCR to PDF converter of the ExactImage library

SYNOPSIS

hocr2pdf [-i|--input FILE] [-o|--output FILE] [-n|--no-image] [-r|--resolution RESOLUTION] [-s|--sloppy-text] [-t|--text] < HOCR-FILE

hocr2pdf --help

DESCRIPTION

ExactImage is a fast C++ image processing library. Unlike ImageMagick, it operates natively in several color spaces and bit depths, resulting in much lower memory and computational requirements. Some optimized algorithms run in 1/20 of the time ImageMagick requires, and displaying large images can take as little as 1/10 of the time the "display" program takes.

hocr2pdf is a command-line front end to the image processing library. It creates searchable PDF files, with the text laid out to match the scanned page, from hOCR (annotated HTML) input obtained from an OCR system.

OPTIONS

-i|--input FILE
Input image filename.
-o|--output FILE
Output PDF filename.
-n|--no-image
Do not place the image over the text.
-r|--resolution RESOLUTION
Override the resolution.
-s|--sloppy-text
Sloppily place text, group words, do not draw single glyphs.
-t|--text
Extract text, including trying to remove hyphens.
-h|--help
Show summary of options.

EXAMPLES

Creating a Searchable PDF from hOCR input

The hOCR (annotated HTML) input must be provided on standard input, and the image data is read from the file named by the -i or --input argument. For example:

$ hocr2pdf -i scan.tiff -o test.pdf < cuneiform-out.hocr
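
When several pages have been scanned and recognized, the same invocation can be repeated once per page. The following is a minimal sketch, assuming each image NAME.tiff sits next to a matching NAME.hocr file. That naming scheme and the function name convert_all are assumptions made for illustration, not something hocr2pdf requires:

```shell
# Convert every NAME.tiff in a directory that has a matching NAME.hocr.
# The NAME.tiff / NAME.hocr pairing is an assumed naming convention.
convert_all() {
    dir=${1:-.}
    for img in "$dir"/*.tiff; do
        [ -e "$img" ] || continue          # the glob matched nothing
        base=${img%.tiff}
        if [ ! -f "$base.hocr" ]; then
            echo "skipping $img: no $base.hocr" >&2
            continue
        fi
        # The hOCR goes to standard input; the image is named with -i.
        hocr2pdf -i "$img" -o "$base.pdf" < "$base.hocr" || return 1
    done
}
```

Calling convert_all scans/ would then write scans/NAME.pdf for each pair found, and stop at the first page that hocr2pdf fails on.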

By default the text layer is hidden behind the real image data. The image can be left out with the -n or --no-image argument, so that only the text recognized by the OCR system is visible, for example for debugging or to save storage space:

$ hocr2pdf -i scan.tiff -n -o test.pdf < cuneiform-out.hocr

Too many gaps between letters in individual words

This can be caused by imprecise OCR data or by justified text with large gaps. ExactImage includes a special mode, activated with the command-line argument -s or --sloppy-text, that groups the glyphs between whitespace into words. This can help PDF viewers produce better results when cutting and pasting text:

$ hocr2pdf -i scan.tiff -s -o test.pdf < cuneiform-out.hocr
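
To judge whether sloppy mode helps for a given scan, it can be useful to build both variants from the same input and compare how text copies out of each in a PDF viewer. A minimal sketch follows; the function name build_variants and the output file names are hypothetical, and only the options documented above are used:

```shell
# Build a default and a sloppy-text PDF from the same page for comparison.
# Usage: build_variants IMAGE HOCR PREFIX
build_variants() {
    img=$1 hocr=$2 prefix=$3
    # Default placement: single glyphs are drawn individually.
    hocr2pdf -i "$img" -o "$prefix-default.pdf" < "$hocr" || return 1
    # Sloppy placement: glyphs between whitespace are grouped into words.
    hocr2pdf -i "$img" -s -o "$prefix-sloppy.pdf" < "$hocr" || return 1
}
```

For example, build_variants scan.tiff cuneiform-out.hocr test would write test-default.pdf and test-sloppy.pdf.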

SEE ALSO

exactimage(7)

bardecode(1)

e2mtiff(1)

econvert(1)

edentify(1)

empty-page(1)

optimize2bw(1)

HOMEPAGE

More information about hocr2pdf and the ExactImage project can be found at <http://www.exactcode.de/site/open_source/exactimage/>.

AUTHOR

ExactImage was written by ExactCODE GmbH <http://www.exactcode.de/>.

This manual page was written by Daniel Baumann <daniel@debian.org>, for the Debian project (but may be used by others).