hocr2pdf
Language: en
Version: 253819 (debian - 07/07/09)
Section: 1 (User commands)
NAME
hocr2pdf - hOCR to PDF converter of the ExactImage library
SYNOPSIS
hocr2pdf [-i|--input FILE] [-o|--output FILE] [-n|--no-image] [-r|--resolution RESOLUTION] [-s|--sloppy-text] [-t|--text]
hocr2pdf --help
DESCRIPTION
ExactImage is a fast C++ image processing library. Unlike ImageMagick, it operates natively in several color spaces and bit depths, resulting in much lower memory and computational requirements. Some optimized algorithms run in 1/20 of the time ImageMagick requires, and displaying large images can take as little as 1/10 of the time the "display" program takes.
hocr2pdf is a command-line front end to the image processing library that creates precisely laid-out, searchable PDF files from hOCR (annotated HTML) input obtained from an OCR system.
OPTIONS
-i|--input FILE
    Input image filename.
-o|--output FILE
    Output PDF filename.
-n|--no-image
    Do not place the image over the text.
-r|--resolution RESOLUTION
    Override the resolution.
-s|--sloppy-text
    Place text sloppily: group glyphs into words instead of drawing single glyphs.
-t|--text
    Extract text, attempting to remove hyphenation.
-h|--help
    Show summary of options.
EXAMPLES
Creating a searchable PDF from hOCR input
The hOCR (annotated HTML) input must be provided on STDIN, and the image data is read from the file named by the -i or --input argument. For example:
$ hocr2pdf -i scan.tiff -o test.pdf < cuneiform-out.hocr
By default, the text layer is hidden by the real image data. Including the image data can be disabled with the -n or --no-image option, so that only the text recognized by the OCR is visible, e.g. for debugging or to save storage space:
$ hocr2pdf -i scan.tiff -n -o test.pdf < cuneiform-out.hocr
Too many gaps between letters in individual words
This can be caused by imprecise OCR data or by justified text with large gaps. ExactImage includes a special mode, activated with the -s or --sloppy-text command-line argument, that groups the glyphs between whitespace into words, which can help PDF viewers produce better results when copying and pasting text:
$ hocr2pdf -i scan.tiff -s -o test.pdf < cuneiform-out.hocr
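Converting several pages in one go
The invocations above handle one page at a time. The following is a minimal sketch of a batch wrapper, not part of ExactImage. It assumes that each scan NAME.tiff sits next to its OCR result NAME.hocr (a layout chosen for illustration), and it uses only the -i, -o and pass-through options documented here. The function name convert_scans and the HOCR2PDF variable, which swaps in another command for dry runs, are hypothetical.

```shell
#!/bin/sh
# Sketch: convert every NAME.tiff that has a matching NAME.hocr in a directory.
# Assumption (not from the manual): scans and hOCR files share a basename.
# HOCR2PDF is a hypothetical override for the command name, useful for dry runs.

convert_scans() {
    dir=$1
    shift                               # remaining arguments are passed through, e.g. -s or -n
    for img in "$dir"/*.tiff; do
        [ -e "$img" ] || continue       # the glob matched no .tiff files
        base=${img%.tiff}
        hocr=$base.hocr
        if [ ! -r "$hocr" ]; then
            echo "skipping $img: no readable $hocr" >&2
            continue
        fi
        # The hOCR goes to STDIN; the image is named with -i, the PDF with -o.
        "${HOCR2PDF:-hocr2pdf}" -i "$img" "$@" -o "$base.pdf" < "$hocr" || return 1
    done
}
```

For example, after sourcing the function, `convert_scans ./scans -s` would write ./scans/NAME.pdf for each pair using the sloppy-text mode, and `convert_scans ./scans -n` would produce text-only PDFs. Scans without a matching hOCR file are skipped with a message on STDERR.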
SEE ALSO
exactimage(7)
bardecode(1)
e2mtiff(1)
econvert(1)
edentify(1)
empty-page(1)
optimize2bw(1)
HOMEPAGE
More information about hocr2pdf and the ExactImage project can be found at <http://www.exactcode.de/site/open_source/exactimage/>.
AUTHOR
ExactImage was written by ExactCODE GmbH <http://www.exactcode.de/>.
This manual page was written by Daniel Baumann <daniel@debian.org>, for the Debian project (but may be used by others).