OCR models for Occitan (standard spelling)

2019-12-18T15:41:37Z (GMT) by Marianne Vergez-Couret

This dataset provides trained Tesseract (https://github.com/tesseract-ocr/tesseract) and Jochre (https://github.com/urieli/jochre) OCR models for Occitan (for the standard spelling and two dialects, Gascon and Lengadocian). These models were developed in the context of the RESTAURE project, funded by the French ANR.

Two models are provided. They were presented in the article available at https://hal.archives-ouvertes.fr/hal-01252241 and were re-evaluated during the creation of another corpus, described at https://www.openscience.fr/Constitution-et-annotation-d-un-corpus-ecrit-de-contes-et-recits-en-occitan

The first model, JOCHRE_2015, was trained for Jochre 1.1.2b. The training images and corresponding texts were manually annotated on a Jochre online platform (excerpts from 7 different printed works, totalling about 20,000 words).

The second model, TESS_2015, was trained for Tesseract using the jTessBoxEditor tool (http://vietocr.sourceforge.net/training.html), version 1.4 (2 May 2015), on images generated automatically from the training texts (the same texts used for Jochre). The images were generated at a 36pt font size in two fonts (Arial and Times New Roman), each in its normal and italic variants. The Tesseract model can be used with Tesseract 3.0x.
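
The font setup described above amounts to four rendering configurations per training text. The short Python sketch below only enumerates them. It is an illustration, not the script actually used for training, and the function name rendering_configs is hypothetical.

```python
from itertools import product

FONTS = ("Arial", "Times New Roman")  # fonts named in the description
STYLES = ("normal", "italic")         # variants named in the description
FONT_SIZE_PT = 36                     # font size named in the description


def rendering_configs():
    """Return one (font, style, size) tuple per font/variant combination."""
    return [(font, style, FONT_SIZE_PT) for font, style in product(FONTS, STYLES)]


# Print each configuration a training text would be rendered in.
for font, style, size in rendering_configs():
    print(f"{font} {style} {size}pt")
```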

A word list was also used for both trainings. We merged Occitan words found in several lexicons, dictionaries and corpora for the two dialects, Gascon and Lengadocian:

  • Lexicon extracted from 60 literary works (from 29 different authors) gathered in the BaTelÒc project.
  • Dictionary entries from Dictionnaire Français/Occitan Gascon Toulousain by Nicolau Rei Bèthvéder, 2004, IEO Edicions.
  • Dictionary entries from Dictionnaire Français/Occitan by Cristian Laus, 2004, IEO/IDECO.
  • Dictionary entries from Dictionnaire Français/Occitan (Gascon) by Miquèu Grosclaude, Gilabèrt Nariòo and Patric Guilhemjoan, 2007, Per Noste Edicions.
  • Conjugated forms from Verb’Òc, designed by the Congrès permanent de la lenga occitana (http://www.locongres.org).
  • List of proper nouns extracted from the Apertium (free/open-source machine translation platform) Occitan lexicon.
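
Merging several sources into one word list can be sketched as follows. This is a minimal Python illustration, not the procedure actually used: the function name merge_word_lists and the sample words are hypothetical, and the Unicode normalisation step is an assumption added so that the same accented word encoded in two ways is kept only once.

```python
import unicodedata


def merge_word_lists(sources):
    """Merge several iterables of words into one sorted, de-duplicated list.

    Each word is stripped of surrounding whitespace and normalised to
    Unicode NFC. Case is preserved, so proper nouns keep their capital.
    """
    merged = set()
    for words in sources:
        for word in words:
            cleaned = unicodedata.normalize("NFC", word.strip())
            if cleaned:
                merged.add(cleaned)
    return sorted(merged)


# Toy inputs standing in for the real lexicons (hypothetical sample data).
literary_lexicon = ["ostal", "lenga", "pa\u00eds"]      # precomposed accent
dictionary_entries = ["lenga", "pai\u0301s", "camin"]   # combining accent
proper_nouns = ["Tolosa", " ostal "]

word_list = merge_word_lists([literary_lexicon, dictionary_entries, proper_nouns])
print(word_list)
```

Sorting here is by code point, so capitalised entries come first; a locale-aware sort could be substituted if alphabetical order matters.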

The Jochre model can be used with the Jochre software (https://github.com/urieli/jochre). See also the Jochre wiki (https://github.com/urieli/jochre/wiki).

The Tesseract model can be used, for instance, with the gImageReader tool (https://github.com/manisandro/gImageReader), which provides a graphical user interface for Tesseract.
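
Tesseract can also be called from the command line, in the form "tesseract IMAGE OUTPUTBASE -l LANG". The Python sketch below only builds that command. It assumes the trained data file from this dataset has been placed in the tessdata directory that Tesseract searches. The names "page.png", "page" and "oci" are placeholders: the language name must match the name of the .traineddata file provided here, without its extension.

```python
import shlex


def build_tesseract_command(image_path, output_base, lang):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", lang]


# Placeholder values: replace with the real image path and language name.
command = build_tesseract_command("page.png", "page", "oci")
print(" ".join(shlex.quote(part) for part in command))

# To actually run it (requires Tesseract to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

With these placeholder values, Tesseract would write the recognised text to page.txt.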

When evaluated against the same test corpus (four extracts from four different authors, covering the two dialects, Gascon and Lengadocian), the Jochre model outperforms the Tesseract model.
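
This description does not state which metric was used for the comparison. As an illustration only, the Python sketch below computes a character error rate, one common OCR measure, from the Levenshtein edit distance between a reference transcription and the OCR output. The function names and the sample strings are hypothetical.

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions) from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a 17-character reference.
reference = "la lenga occitana"
ocr_output = "la lenqa occitana"
print(character_error_rate(reference, ocr_output))
```

A lower value means fewer character-level corrections are needed to turn the OCR output into the reference text.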