Saturday, October 31, 2015

Cloning the Tesseract OCR Engine

What is Tesseract and why am I cloning it?

Strictly speaking, a tesseract is the four-dimensional analog of a cube; Wikipedia has more on that.

In this context, Tesseract is the name of an Optical Character Recognition (OCR) engine. It was originally developed at HP between 1985 and 1994, open-sourced by HP in 2005, and has been developed further with Google's sponsorship since 2006. It is released under the Apache License 2.0, and the source is on GitHub at https://github.com/tesseract-ocr/tesseract.

Here is the background from the README.md that you get once you clone the repository:

=============================================================
History
=======
The engine was developed at Hewlett-Packard Laboratories Bristol and
at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows
with VC++2010. The C++ code makes heavy use of a list system using macros.
This predates stl, was portable before stl, and is more efficient than stl
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug.

The most recent change is that Tesseract can now recognize 39 languages,
including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants, 
is fully UTF8 capable, and is fully trainable. See TrainingTesseract for
more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. 
Results were available on https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf.
With Tesseract 2.00, scripts were included to allow anyone to reproduce 
some of these tests. See TestingTesseract for more details. 


About the Engine
================
This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple
OUTPUT FORMATTING (txt, hocr/html), and NO UI. 
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.
As of 3.01, Tesseract is fully unicode (UTF-8) enabled, and can recognize 39
languages "out of the box." Code and documentation is provided for the brave
to train in other languages. 
See [Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) 
for more information on training. Additional [code and extracted documentation](http://tesseract-ocr.github.io/) was generated by Doxygen.

===============================================

My goal is to build an app that uses this engine, so I'm "checking out" the code as shown below. I'm hoping to write the wrapper in Python (another item on my learning list), hence the "pytesseract" directory name.



Anils-MacBook-Air:Projects anilmurty$ mkdir pytesseract
Anils-MacBook-Air:Projects anilmurty$ cd pytesseract/
Anils-MacBook-Air:pytesseract anilmurty$ git clone https://github.com/tesseract-ocr/tesseract.git
Cloning into 'tesseract'...
remote: Counting objects: 11607, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 11607 (delta 6), reused 0 (delta 0), pack-reused 11579
Receiving objects: 100% (11607/11607), 32.35 MiB | 1.38 MiB/s, done.
Resolving deltas: 100% (9073/9073), done.
Checking connectivity... done.
Anils-MacBook-Air:pytesseract anilmurty$
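
As a first step toward the Python wrapper mentioned above, here is a minimal sketch of what it might look like. It is not the design I have settled on, just one simple approach: shell out to the tesseract command-line program and capture what it prints. It assumes the cloned source has already been built and installed so that a `tesseract` binary is on the PATH, and that the installed version accepts `stdout` as the output base. The function names `build_command` and `image_to_text` are made up for illustration.

```python
import shutil
import subprocess


def build_command(image_path, lang="eng", executable="tesseract"):
    """Return the argument list for one OCR run that prints text to stdout.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Passing "stdout" as the output base asks the tesseract CLI to print
    # the recognized text instead of writing <outputbase>.txt.
    # -l selects the language data to use (assumed to be installed).
    return [executable, image_path, "stdout", "-l", lang]


def image_to_text(image_path, lang="eng"):
    """Run the tesseract binary on one image and return the recognized text."""
    # Assumption: the engine has been built and installed from the clone.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,  # collect stdout/stderr instead of echoing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

With the engine installed, `image_to_text("page.png")` would return the recognized text as a string. Keeping the command construction in its own function means it can be checked without having the engine installed at all.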