Training Tesseract 4 on Windows
Created: 29 Apr 2021, Modified: 22 Jul 2021
This article covers training Tesseract 4.1.1 for a particular font.
Put an otf or ttf file of the font in a folder and make a note of its path.
Add the build folder with all the Tesseract executables to your PATH environment variable if it is not there already.
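If you work in an sh-compatible shell (Git Bash or MSYS2, for instance, which is an assumption on my part), a minimal sketch for adding the folder for the current session could look like the following. The path /c/tesseract/build is a hypothetical placeholder, and add_to_path is just a helper name chosen for illustration.

```shell
# Prepend a directory to PATH for this session, unless it is already there.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                          # already present, nothing to do
    *) PATH="$1:$PATH"; export PATH ;;
  esac
}

# Hypothetical location of the Tesseract build folder; adjust to your setup.
add_to_path "/c/tesseract/build"

# Report whether the shell can now find one of the training tools.
command -v text2image || echo "text2image not found on PATH"
```

This only affects the current shell session; a permanent change still goes through the Windows environment variable settings.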
Training scripts
The src/training subdirectory of the Tesseract source has a script, tesstrain.sh, that already does most of the training steps for us. You can copy it elsewhere if you wish, but you also need the other scripts it depends on: tesstrain_utils.sh and language-specific.sh.
Tesstrain arguments
You can run tesstrain.sh on its own to print the usage text (if that does not work, check the source: the text is in a string right at the top, after the licence). The important arguments are:
1. --fonts_dir FONTS_PATH
A path to a directory that contains the otf or ttf file of the font you want to train for.
2. --fontlist FONTS
The name or list of names of the font(s) you want to train for. This is not necessarily the name of the file, nor even the ‘Font name’ Windows shows when you open it. One way to find out is to run:
text2image --list_available_fonts --fonts_dir=i:/path/to/folder --fontconfig_tmpdir=i:/any/path/doesnt/matter
My font is SofiaPro-Black, and this command gives me the output 0: Sofia Pro Heavy. So ‘Sofia Pro Heavy’ is the font name to use.
3. --lang LANG_CODE
The language you are training for, e.g. eng.
4. --langdata_dir DATADIR
From the langdata_lstm repository you need the folder of the language you are training for (e.g. eng) and radical-stroke.txt. You can download the whole repository, but note it is over 1 GB.
5. [OPTIONAL] --tessdata_dir TESSDATADIR
This is only needed if you want to refine an existing model, rather than training from scratch. Download the .traineddata file for the language you are training for, e.g. eng.traineddata. You can get it from tessdata, tessdata_best, tessdata_fast, or other places (you may be able to find people who share the models they’ve trained).
6. [OPTIONAL] --training_text TEXTFILE
By default, text2image makes the training images from the training text for your language that comes with langdata_lstm (e.g. eng.training_text). If you like, you can provide your own text instead, for example if you need only particular characters or only digits. You need quite a lot of text (the default text has almost 200,000 lines), so you will probably want to write a simple script that generates the text into a file, then provide that file with this option.
7. [OPTIONAL] --maxpages N
As mentioned, the training text tends to be very long, so you can use this option to limit the number of pages text2image will generate. ‘Pages’ here means the pages of the tif image that is used for training. It seems to fit 55 lines per page, but I am not sure that is always the case.
8. --linedata_only
We need this flag to produce data for Tesseract 4 LSTM training rather than the legacy box training that was used in Tesseract 3.
9. [OPTIONAL] --save_box_tiff
If you provide this flag, it will save the tif image that is used for training in the output folder, so you can see what it was using.
10. --output_dir OUTPUTDIR
The path to the folder that the output will be written to.
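For the custom training text mentioned under option 6, a small generator is usually enough. The sketch below is one possible approach for a digits-only model, not something taken from the Tesseract scripts: it writes lines of random digit groups to a file. The file name digits.training_text, the line count, and the group sizes are all arbitrary choices for illustration.

```shell
# Write LINES_WANTED lines of random digit groups to a custom training text.
# digits.training_text is a hypothetical file name.
LINES_WANTED=1000
OUT_FILE=digits.training_text

awk -v n="$LINES_WANTED" 'BEGIN {
  srand(42)                                  # fixed seed so runs are repeatable
  for (i = 0; i < n; i++) {
    line = ""
    sep = ""
    groups = 3 + int(rand() * 6)             # 3 to 8 groups per line
    for (g = 0; g < groups; g++) {
      len = 1 + int(rand() * 6)              # 1 to 6 digits per group
      word = ""
      for (c = 0; c < len; c++) word = word "" int(rand() * 10)
      line = line sep word
      sep = " "
    }
    print line
  }
}' > "$OUT_FILE"
```

The resulting file can then be passed to tesstrain.sh with the training text option. Raise LINES_WANTED if you want something closer to the size of the default text.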
Optional: Customise text2image output
You can modify tesstrain_utils.sh to pass different arguments to text2image. I didn’t want it to rotate the images, because I will be using my model on screenshots, which are never rotated. On line 283 I added the following arguments to the text2image call: --rotate_image=false --degrade_image=false.
Step 1: Initial tesstrain command
You need to preface tesstrain.sh with sh if you are running the script in cmd or PowerShell.
Example command:
sh tesstrain.sh --fonts_dir fonts --fontlist 'Sofia Pro Heavy' --lang eng --linedata_only --langdata_dir langdata_lstm --tessdata_dir tessdata --save_box_tiff --maxpages 10 --output_dir train
With the following subfolders in the working directory:
- fonts: containing the otf/ttf of the font
- langdata_lstm: containing the aforementioned needed data from langdata_lstm
- tessdata: containing the eng.traineddata model to refine
- train: empty folder for the output
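Before running the command, it can save a failed run to check that the folders and inputs are in place. The following is a rough sketch that assumes the folder names from the example command and the eng language; check_inputs is a helper name made up for this illustration.

```shell
# Create the four working folders if they do not exist yet.
mkdir -p fonts langdata_lstm tessdata train

# Report which expected inputs are still missing.
# Returns 0 when everything is present, 1 otherwise.
check_inputs() {
  missing=0
  for f in langdata_lstm/eng langdata_lstm/radical-stroke.txt tessdata/eng.traineddata; do
    if [ ! -e "$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  # Loose font check: any .otf or .ttf file in fonts/ counts.
  found=0
  for f in fonts/*.otf fonts/*.ttf; do
    if [ -e "$f" ]; then found=1; fi
  done
  if [ "$found" -eq 0 ]; then
    echo "missing: an otf or ttf file in fonts/"
    missing=1
  fi
  return "$missing"
}

check_inputs || echo "Add the files listed above before running tesstrain.sh"
```

The tessdata check only matters when you are refining an existing model.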
Step 2: Extract lstm from model
The next step is to extract the lstm portion of eng.traineddata (or whatever .traineddata file you are refining). You can skip this step if you are not refining an existing model.
combine_tessdata -e tessdata/eng.traineddata train/eng.lstm
The first path must point to the .traineddata file of the model you are refining. The second path can point anywhere; this is where the lstm portion is written. The -e stands for ‘extract’.
Step 3: Combine
Again, you can skip this step if you are not refining an existing model.
lstmtraining --continue_from train/eng.lstm --model_output output/Sofia --traineddata tessdata/eng.traineddata --train_listfile train/eng.training_files.txt --debug_interval -1 --max_iterations 500
- the first path (--continue_from) is to the lstm file we just extracted
- the second path (--model_output) can point anywhere; this is where to output your model and checkpoints of it during the training
- the third path (--traineddata) is to the model you are refining
- the fourth path (--train_listfile) is to a txt file that will have been created in Step 1 in the output folder you specified
Other things to note:
--debug_interval -1 is a workaround.
Step 4: Finalise
This last step is to stop the training and produce our final .traineddata file, which is the new model we’ve produced.
lstmtraining --stop_training --continue_from output/Sofia_checkpoint --traineddata tessdata/eng.traineddata --model_output output/Sofia.traineddata
- the first path (--continue_from) is to the latest checkpoint of your model. Change output/Sofia to whatever you set the second path to in Step 3
- the second path (--traineddata) is to the model you are refining (optional)
- the third path (--model_output) is where to output your finished model
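Because the same paths recur across Steps 2 to 4, it can help to keep them in variables so they stay consistent. The sketch below only prints the three commands, so you can inspect them before running anything. The variable and function names are my own invention; the flags are the ones used in the commands above.

```shell
LANG_CODE=eng        # language of the model being refined
MODEL=Sofia          # name for the new model
TESSDATA=tessdata    # folder holding the model to refine
TRAIN=train          # output folder from Step 1
OUT=output           # folder for checkpoints and the final model

# Each function prints one command instead of running it (a dry run).
step2_cmd() {
  echo "combine_tessdata -e $TESSDATA/$LANG_CODE.traineddata $TRAIN/$LANG_CODE.lstm"
}
step3_cmd() {
  echo "lstmtraining --continue_from $TRAIN/$LANG_CODE.lstm --model_output $OUT/$MODEL --traineddata $TESSDATA/$LANG_CODE.traineddata --train_listfile $TRAIN/$LANG_CODE.training_files.txt --debug_interval -1 --max_iterations 500"
}
step4_cmd() {
  echo "lstmtraining --stop_training --continue_from $OUT/${MODEL}_checkpoint --traineddata $TESSDATA/$LANG_CODE.traineddata --model_output $OUT/$MODEL.traineddata"
}

step2_cmd
step3_cmd
step4_cmd
```

With the values shown, the printed lines match the example commands in this article; change MODEL or OUT in one place and all three follow.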
Using your new model
Your new model will be at whatever path you provided in Step 4: output/Sofia.traineddata in my case. You can also see the checkpoints there and continue the training whenever you wish.
Tesseract uses the .traineddata models in the path given by the TESSDATA_PREFIX environment variable, or in the path you provide with --tessdata-dir (an argument you can give tesseract when you call it). If you want to use your new model for an existing language, for example English, rename it eng.traineddata and place it there. Otherwise, you can use it by passing the name of your traineddata file as the language, for example -l Sofia in my case.
You can see the languages it detects by running tesseract --list-langs.
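As a rough sketch of the copy-and-rename step, a small helper could place the finished model into a tessdata folder under a chosen language name. install_model is a hypothetical helper name, and the paths and screenshot.png in the commented example are placeholders.

```shell
# Copy a finished model into a tessdata folder under a chosen language name.
# Usage: install_model MODEL_FILE TESSDATA_DIR NAME
install_model() {
  src=$1
  dest_dir=$2
  name=$3
  if [ ! -f "$src" ]; then
    echo "no such model: $src" >&2
    return 1
  fi
  mkdir -p "$dest_dir" && cp "$src" "$dest_dir/$name.traineddata"
}

# Example (placeholder paths, not run here):
#   install_model output/Sofia.traineddata "$TESSDATA_PREFIX" Sofia
#   tesseract --list-langs
#   tesseract screenshot.png stdout -l Sofia
```

Note that this helper overwrites any existing file of the same name, so if you install under eng you may want to back up the stock eng.traineddata first.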