djvu2hocr

Language: en

Version: 05/24/2010 (ubuntu - 24/10/10)

Section: 1 (User commands)

NAME

djvu2hocr - DjVu to hOCR converter

SYNOPSIS

djvu2hocr [option...] djvu-file
djvu2hocr {--version | --help | -h}

DESCRIPTION

djvu2hocr converts hidden text from a DjVu file to the hOCR[1] format.

OPTIONS

Text segmentation options

--word-segmentation=simple

Use the same word segmentation as found in the DjVu file.
This is the default.

--word-segmentation=uax29

Use the Unicode Text Segmentation[2] algorithm to break lines into words, possibly fixing the word segmentation found in the DjVu file.

Other options

--version

Output version information and exit.

-h, --help

Display help and exit.
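
As an illustration of how the options above combine with the synopsis, the following minimal Python sketch builds a djvu2hocr command line and runs it. The helper names build_djvu2hocr_command and convert are hypothetical and exist only for this example. The sketch also assumes that djvu2hocr writes the hOCR document to standard output, which this page does not state explicitly.

```python
import subprocess

# The two segmentation modes documented under "Text segmentation options".
WORD_SEGMENTATION_MODES = ("simple", "uax29")


def build_djvu2hocr_command(djvu_file, word_segmentation="simple"):
    """Build an argument list following the synopsis: djvu2hocr [option...] djvu-file."""
    if word_segmentation not in WORD_SEGMENTATION_MODES:
        raise ValueError("unknown word segmentation: " + repr(word_segmentation))
    return ["djvu2hocr", "--word-segmentation=" + word_segmentation, djvu_file]


def convert(djvu_file, word_segmentation="simple"):
    """Run djvu2hocr and return the bytes it writes to standard output.

    Assumption: the hOCR document is written to standard output.
    """
    command = build_djvu2hocr_command(djvu_file, word_segmentation)
    result = subprocess.run(command, check=True, stdout=subprocess.PIPE)
    return result.stdout
```

For example, convert("book.djvu", "uax29") would run djvu2hocr --word-segmentation=uax29 book.djvu, where book.djvu is a placeholder file name.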

PORTABILITY

djvu2hocr uses a custom extension to hOCR to retain characters that cannot be directly represented in an HTML/XML document. For example, the control character BEL (^G, U+0007) is converted into the following HTML chunk: <span class="djvu_char" title="#x07"> </span>
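
The conversion described above can be sketched in Python as follows. This is only an illustration of the output format, not the tool's own code. The function name escape_for_hocr is hypothetical, and the set of characters handled here (the C0 control characters that XML 1.0 disallows, that is, everything below U+0020 except tab, line feed and carriage return) is an assumption, since this page gives only BEL as an example.

```python
import re
from html import escape

# Assumption: the characters that need the extension are the C0 controls
# that XML 1.0 forbids (tab, LF and CR are allowed and left untouched).
_CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def escape_for_hocr(text):
    """Return text as an HTML fragment with control characters wrapped in djvu_char spans."""
    parts = []
    pos = 0
    for match in _CONTROL_RE.finditer(text):
        # Ordinary text before the control character gets normal HTML escaping.
        parts.append(escape(text[pos:match.start()], quote=False))
        # The control character becomes a span whose title holds its code point in hex.
        parts.append(
            '<span class="djvu_char" title="#x{:02x}"> </span>'.format(ord(match.group()))
        )
        pos = match.end()
    # Remaining text after the last control character.
    parts.append(escape(text[pos:], quote=False))
    return "".join(parts)
```

With this sketch, escape_for_hocr("a\x07b") yields a<span class="djvu_char" title="#x07"> </span>b, matching the BEL example above.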

SEE ALSO

djvu(1)

AUTHOR

Jakub Wilk <jwilk@jwilk.net>

Author.

Copyright © 2009, 2010 Jakub Wilk

NOTES

1.
hOCR
http://docs.google.com/View?docid=dfxcv4vc_67g844kf
2.
Unicode Text Segmentation
http://unicode.org/reports/tr29/