Geeking Out: Cloning the Tesseract OCR Engine

What is Tesseract and why am I cloning it?

Strictly speaking, a tesseract is the "4 Dimensional Analog of a Cube"; the Wikipedia article on the tesseract has more.

In this context, Tesseract is the name of an Optical Character Recognition (OCR) engine, originally developed at HP between 1985 and 1994 (per the README quoted below), later enhanced by Google, and released under the Apache License 2.0. The code lives on GitHub at https://github.com/tesseract-ocr/tesseract.

Here is some of the formal documentation from the README.md that you get once you clone the repository:

=============================================================

History
=======

The engine was developed at Hewlett-Packard Laboratories Bristol and
at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows
with VC++2010. The C++ code makes heavy use of a list system using macros.
This predates stl, was portable before stl, and is more efficient than stl
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug.

The most recent change is that Tesseract can now recognize 39 languages,
including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants,
is fully UTF8 capable, and is fully trainable. See TrainingTesseract for
more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
Results were available on https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf.
With Tesseract 2.00, scripts were included to allow anyone to reproduce
some of these tests. See TestingTesseract for more details.

About the Engine
================

This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple
OUTPUT FORMATTING (txt, hocr/html), and NO UI.

Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.

As of 3.01, Tesseract is fully unicode (UTF-8) enabled, and can recognize 39
languages "out of the box." Code and documentation is provided for the brave
to train in other languages.
See [Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract)
for more information on training. Additional [code and extracted documentation](http://tesseract-ocr.github.io/) was generated by Doxygen.

===============================================

My goal is to build an app on top of this engine, so I'm cloning the code as shown below. I'm hoping to write the wrapper in Python (another item on my learning list), hence the "pytesseract" directory name.

Anils-MacBook-Air:Projects anilmurty$ mkdir pytesseract
Anils-MacBook-Air:Projects anilmurty$ cd pytesseract/
Anils-MacBook-Air:pytesseract anilmurty$ git clone https://github.com/tesseract-ocr/tesseract.git
Cloning into 'tesseract'...
remote: Counting objects: 11607, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 11607 (delta 6), reused 0 (delta 0), pack-reused 11579
Receiving objects: 100% (11607/11607), 32.35 MiB | 1.38 MiB/s, done.
Resolving deltas: 100% (9073/9073), done.
Checking connectivity... done.
Anils-MacBook-Air:pytesseract anilmurty$
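As a rough idea of where the Python wrapper could start, here is a minimal sketch that shells out to the command-line program rather than binding to the C++ library. It assumes a `tesseract` binary has already been built or installed and is on the PATH (cloning alone does not give you one), and it relies on the CLI's basic form `tesseract <image> <outputbase> -l <lang>`, which writes plain text to `<outputbase>.txt`. The function names `build_command` and `image_to_text` are my own placeholders, not part of Tesseract.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_command(image_path, output_base, lang="eng"):
    """Return the argv list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def image_to_text(image_path, lang="eng"):
    """Run the tesseract binary on one image and return the recognized text.

    Hypothetical helper: assumes `tesseract` is on the PATH and that the
    language data for `lang` is installed.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")

    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "result"
        # tesseract appends ".txt" to the output base for plain-text output
        subprocess.run(
            build_command(image_path, output_base, lang),
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

With a scanned page saved as, say, `scan.png` (a made-up file name), calling `print(image_to_text("scan.png"))` should print whatever text the engine recognizes. Going through the CLI keeps the wrapper tiny; a real binding to the library would be a bigger project.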


Saturday, October 31, 2015
