OCR engine

SPU and Vobsub subtitles conversion

Since release 0.6.3, encode2mpeg supports the conversion of graphic subtitles (SPU streams and Vobsub) to text subtitles (SubViewer). To achieve this, encode2mpeg needs a good ocr engine. Currently the following ocr engines are supported: tesseract, gocr and ocrad. Each of them has its strengths and weaknesses, and none of them can, at the moment, totally replace the others.
Subtitle conversion is enabled with the option -ocr, and the list of subtitles to convert is selected with the option -addsub. For SPU subtitles, specify the source stream (DVD, SVCD, mpeg); for vobsub subtitles, use the option -vobsubsrc. With the option -addsla you can override the automatic detection of the language id. Subtitle conversion is possible only in Avi Mode, therefore the encode2mpeg command line must include the option -avionly.

Example:

encode2mpeg -o movie dvd://1 -avionly -ocr -addsub 0,1

This will create the files movie.tesseract_0_en.srt and movie.tesseract_1_de.srt, assuming that tesseract was used and that the language ids of the two subtitles were English and German.

The following options can be used for subtitle conversion:

-ocropts I:II:III:IV:V:VI:VII:VIII:IX

I - select the ocr engine: auto (default), tesseract, gocr, ocrad, ocradi (ocrad with inverted image levels)
II - text to append to the name specified with -o, the default is %o_%i_%l (see below)
III - format of the subtitle file created: unix (default), dos
IV - add ;1 at the end of the subtitle file name: off (default), on
V - extension of the subtitle file name, the default is srt
VI - edit each subtitle that the ocr engine has converted: off (default), on
VII - terminal used for subtitle editing: auto (default), urxvt (rxvt-unicode), rxvt, Eterm, aterm, konsole, gnome-terminal
VIII - preferred size and position for the terminal used for subtitle editing, the default is do not set a preferred size and position
IX - trim the source subtitle image: off (default), on

If the first suboption is auto, encode2mpeg will look for a supported ocr engine in the following order: tesseract, gocr, ocrad, and will use the first one it finds (ocradi is never selected automatically). Ocradi is ocrad with inverted image levels. It appears that most of the time this option is required for ocrad to work properly.
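The lookup order described above can be illustrated with a small shell sketch. This is not encode2mpeg's actual code: the helper name find_first is hypothetical, and the sketch assumes that the engines are installed as executables named tesseract, gocr and ocrad somewhere in PATH.

```shell
# Hypothetical helper: print the first of the given commands that exists
# in PATH and return 0; return 1 if none of them is installed.
find_first() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Same order as the "auto" setting; ocradi is deliberately left out
# because it is never selected automatically.
engine=$(find_first tesseract gocr ocrad) || echo "no supported ocr engine found" >&2
```

A check like this can be run before a long encode to see which engine the auto setting is likely to pick on your system.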
The second suboption defines a text string to be added to the subtitle file name before the suffix. In the text string, some character combinations are expanded as follows:

%o is replaced with the name of the ocr engine used
%i is replaced with the index of the subtitle converted: first subtitle is 0, second is 1 and so on
%l is replaced with the iso 639 language code as automatically detected by encode2mpeg or selected with -addsla
%% is replaced with a single % character
all the other characters are left unchanged
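As an illustration of the expansion rules above, here is a minimal shell sketch. The function expand_suffix and its argument order are hypothetical and are not part of encode2mpeg; it only mimics the rules listed above, so that you can preview what a given template would produce.

```shell
# Hypothetical helper: expand a suffix template.
# $1 = template, $2 = ocr engine name, $3 = subtitle index, $4 = language code
# The body runs in a subshell so its variables do not leak to the caller.
expand_suffix() (
    template=$1 engine=$2 index=$3 lang=$4 out=
    while [ -n "$template" ]; do
        case $template in
            %o*) out="$out$engine"; template=${template#"%o"} ;;
            %i*) out="$out$index";  template=${template#"%i"} ;;
            %l*) out="$out$lang";   template=${template#"%l"} ;;
            %%*) out="$out%";       template=${template#"%%"} ;;
            *)   # any other character is copied unchanged
                 rest=${template#?}
                 out="$out${template%"$rest"}"
                 template=$rest ;;
        esac
    done
    printf '%s\n' "$out"
)

expand_suffix '%o_%i_%l' tesseract 0 en   # prints tesseract_0_en
```

Judging from the example file name movie.tesseract_0_en.srt, the expanded text is placed between the name given with -o and the extension, separated by dots.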

Because the ocr engine is not 100% error free, you may want to check and/or edit each converted subtitle before the final subtitle file is created. Use the sixth suboption to activate subtitle editing. During subtitle editing, a new terminal window running vi is opened. At the top of the terminal window is the text file created by the ocr engine, ready to be edited with vi; below it, the graphic subtitle image used by the ocr engine for the conversion is shown for reference. See an example here. When finished, exit vi with the usual commands: :wq or ZZ.
If the seventh suboption is auto, encode2mpeg will look for a terminal compiled with support for backgroundPixmap in the following order: urxvt, rxvt, Eterm, aterm, konsole, gnome-terminal, and will use the first one it finds.
During subtitle editing, the terminal is opened at a position chosen by the window manager. You can override this, and also specify a different size, by providing a geometry argument as the eighth suboption. If the SPU subtitles are encoded in full size (720x576), the graphic text in the terminal window may fall outside the window boundaries; instead of resizing the window, use the ninth suboption to make the text visible.

You can omit the default values from the argument of -ocropts. For example:

-ocropts ocrad::dos:on

will use ocrad as the ocr engine, terminate the lines with CR/LF and end the file name with ;1. The other parameters keep their default values.
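To see how omitted suboptions fall back to their defaults, the following shell sketch splits a -ocropts argument on the colons and fills the gaps. It is only an approximation of the behaviour described above, not encode2mpeg's own parser; the function name parse_ocropts and the labels it prints (engine, suffix, and so on) are made up for this illustration.

```shell
# Hypothetical helper: print the nine suboptions of a -ocropts argument,
# one per line, substituting the documented default for every suboption
# that is empty or missing.
parse_ocropts() (
    given=$1
    defaults='auto:%o_%i_%l:unix:off:srt:off:auto::off'
    for name in engine suffix format semicolon1 extension edit terminal geometry trim; do
        value=${given%%:*}        # first remaining field supplied by the user
        default=${defaults%%:*}   # matching default value
        printf '%s=%s\n' "$name" "${value:-$default}"
        # drop the field just consumed; an exhausted list becomes empty
        case $given in *:*) given=${given#*:} ;; *) given= ;; esac
        case $defaults in *:*) defaults=${defaults#*:} ;; *) defaults= ;; esac
    done
)

parse_ocropts 'ocrad::dos:on'
```

For the argument above, the sketch reports ocrad, dos and on for the first, third and fourth suboptions, and the default values for all the others (the geometry default is empty, meaning no preferred size and position).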

OCR engines

When encode2mpeg 0.6.3 was released, the status of the supported ocr engines was the following:

tesseract usually recognises 97% or more of the input and understands italic text, but only supports ASCII characters (this should change quite soon).
gocr has a higher error rate, and understands italic and non-ASCII text.
ocrad has a high error rate and understands non-ASCII text, but it has more trouble with italics.

My recommendation is to use tesseract unless you need support for non-ASCII text.
If tesseract does not seem to work, use ./configure --with-libtiff=no during tesseract's compilation/installation process.
