Tuesday, November 3, 2015

Testing OCR with Tesseract

Moment of truth: let's see how well the "stock" Tesseract install performs.

For reference, here are the versions of the various libraries I am testing with:

Anils-MacBook-Air:tesseract-test anilmurty$ tesseract -v
tesseract 3.02.02
 leptonica-1.71
  libgif 4.2.3 : libjpeg 9a : libpng 1.6.18 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
Anils-MacBook-Air:tesseract-test anilmurty$ 
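
If you want to record these versions from a script, here is a minimal sketch that parses the text above into a dictionary. The function name `parse_tesseract_version` is my own invention, and the parsing rules are an assumption based only on the output format shown here; other Tesseract builds may print something different.

```python
import re


def parse_tesseract_version(output):
    """Parse text in the format printed by `tesseract -v` into {name: version}."""
    versions = {}
    for line in output.splitlines():
        # The image libraries share one line, separated by " : ".
        for part in line.split(" : "):
            # Accept "name version" (tesseract 3.02.02) or "name-version" (leptonica-1.71).
            match = re.match(r"^([A-Za-z0-9]+)[ -]([0-9][0-9A-Za-z.]*)$", part.strip())
            if match:
                versions[match.group(1)] = match.group(2)
    return versions


# Sample text copied from the terminal session above.
sample = """tesseract 3.02.02
 leptonica-1.71
  libgif 4.2.3 : libjpeg 9a : libpng 1.6.18 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
"""

versions = parse_tesseract_version(sample)
print(versions["tesseract"], versions["leptonica"], versions["libpng"])
```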

For testing the OCR capabilities, I searched Google and found a few sample images to read. My ultimate goal is to be able to read receipts and invoices, but I figured I'd start with something more basic:

TEST #1: A PNG file with lots of special characters, but none of the complex formatting you would find on a bill or an invoice

OUTPUT: Pretty impressive. It mostly stumbled on accented characters (for example, "über" came out as "fiber" and "rápida" as "rzipida").
Anils-MacBook-Air:tesseract-test anilmurty$ tesseract /Users/anilmurty/Desktop/ocr-test-image-1.png test-png-1
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Anils-MacBook-Air:tesseract-test anilmurty$ cat test-png-1.txt 
The (quick) [brown] {fox} jumps!
Over the $43,456.78 #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer@website.com is spam.
Der ,,schnelle” braune Fuchs springt
fiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marrén répido salta sobre el perro
perezoso. A raposa marrom rzipida
salta sobre o cfio preguicoso.
Anils-MacBook-Air:tesseract-test anilmurty$ 
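
To spot misreads like these without eyeballing the output, one option is to diff the OCR text against what the image is supposed to say. This is only a sketch: `misread_words` is a hypothetical helper, and the "expected" strings are my assumption of the source text, inferred from the German and Spanish sentences above.

```python
import difflib


def misread_words(expected, actual):
    """Return (expected_words, ocr_words) pairs wherever the OCR output differs."""
    exp_words = expected.split()
    act_words = actual.split()
    matcher = difflib.SequenceMatcher(None, exp_words, act_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join each differing run of words so adjacent misreads stay together.
            pairs.append((" ".join(exp_words[i1:i2]), " ".join(act_words[j1:j2])))
    return pairs


# Assumed ground truth vs. the line Tesseract actually produced.
expected = "über den faulen Hund. Le renard brun"
actual = "fiber den faulen Hund. Le renard brun"
print(misread_words(expected, actual))
```

Comparing word by word keeps the report readable; comparing character by character would give a finer-grained accuracy score, at the cost of noisier output.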



TEST #2: A PNG of my blog's logo:



OUTPUT: Totally messed up the tagline!
Anils-MacBook-Air:tesseract-test anilmurty$ tesseract /Users/anilmurty/Desktop/Geeking-Out.png Geeking-out
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Anils-MacBook-Air:tesseract-test anilmurty$ cat Geeking-out.txt 
Geeking Out

’caz Fm sun a geek at man .)

Anils-MacBook-Air:tesseract-test anilmurty$ 

Monday, November 2, 2015

Installing Tesseract using Macports

Follow these steps to install Tesseract using Macports:

1. Install the Tesseract dependencies: autoconf, automake, libtool, libpng (with support for jpeg and tiff) and leptonica.
2. Install Tesseract. (I installed just the English language support.)
3. Set the TESSDATA_PREFIX environment variable to the parent directory of the "tessdata" folder, which contains the eng.traineddata file (you may need to run "find" to locate this file and get the correct path).
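
As an alternative to running "find" by hand in step 3, here is a small sketch that walks a directory tree looking for tessdata/eng.traineddata and returns the directory that TESSDATA_PREFIX should point to. The helper name `find_tessdata_prefix` is hypothetical, and `/opt/local/share` is simply the location that worked on my MacPorts install; yours may differ.

```python
import os


def find_tessdata_prefix(search_root, lang="eng"):
    """Return the parent of the tessdata folder holding <lang>.traineddata, or None."""
    wanted = lang + ".traineddata"
    for dirpath, dirnames, filenames in os.walk(search_root):
        if os.path.basename(dirpath) == "tessdata" and wanted in filenames:
            # TESSDATA_PREFIX is the directory that contains "tessdata".
            return os.path.dirname(dirpath)
    return None


# Prints None if nothing is found under this root (for example, on a non-MacPorts machine).
print(find_tessdata_prefix("/opt/local/share"))
```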



Last login: Mon Nov  2 10:37:56 on ttys000
Anils-MacBook-Air:~ anilmurty$ 
Anils-MacBook-Air:~ anilmurty$ 
Anils-MacBook-Air:~ anilmurty$ sudo port install autoconf
--->  Computing dependencies for autoconf
--->  Cleaning autoconf
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ sudo port install automake
--->  Cleaning automake
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ sudo port install libtool
--->  Cleaning libtool
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ sudo port install jpeg tiff libpng
--->  Cleaning jpeg
--->  Computing dependencies for tiff
--->  Cleaning tiff
--->  Computing dependencies for libpng
--->  Cleaning libpng
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ sudo port install leptonica
--->  Computing dependencies for leptonica
--->  Cleaning leptonica
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ sudo port selfupdate
--->  Updating MacPorts base sources using rsync
MacPorts base version 2.3.4 installed,
MacPorts base version 2.3.4 downloaded.
--->  Updating the ports tree
--->  MacPorts base is already the latest version

The ports tree has been updated. To upgrade your installed ports, you should run
  port upgrade outdated
Anils-MacBook-Air:~ anilmurty$ port upgrade outdated
Nothing to upgrade.
Anils-MacBook-Air:~ anilmurty$ 


Anils-MacBook-Air:~ anilmurty$ sudo port install tesseract-eng
--->  Computing dependencies for tesseract-eng
--->  Dependencies to be installed: tesseract
--->  Fetching archive for tesseract
--->  Attempting to fetch tesseract-3.02.02_2.darwin_14.x86_64.tbz2 from http://packages.macports.org/tesseract
--->  Attempting to fetch tesseract-3.02.02_2.darwin_14.x86_64.tbz2.rmd160 from http://packages.macports.org/tesseract
--->  Installing tesseract @3.02.02_2
--->  Activating tesseract @3.02.02_2
--->  Cleaning tesseract
--->  Fetching archive for tesseract-eng
--->  Attempting to fetch tesseract-eng-3.02_1.darwin_14.noarch.tbz2 from http://packages.macports.org/tesseract-eng
--->  Attempting to fetch tesseract-eng-3.02_1.darwin_14.noarch.tbz2.rmd160 from http://packages.macports.org/tesseract-eng
--->  Installing tesseract-eng @3.02_1
--->  Activating tesseract-eng @3.02_1
--->  Cleaning tesseract-eng
--->  Updating database of binaries
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ 

Anils-MacBook-Air:/ anilmurty$ export TESSDATA_PREFIX="/opt/local/share"



Quick Test

Anils-MacBook-Air:/ anilmurty$ tesseract
Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.

Single options:
  -v --version: version info
  --list-langs: list available languages for tesseract engine
Anils-MacBook-Air:/ anilmurty$ 
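
Based on the usage text above, here is a rough sketch of building a tesseract command line from Python, with the language and page segmentation mode as optional arguments. The function name `build_tesseract_command` is made up for illustration, and the single-dash `-psm` flag is taken from this 3.02 usage output, so other versions may spell it differently. Whether a mode like 7 (single text line) actually helps on a given image is something I have not tested.

```python
def build_tesseract_command(image, output_base, lang=None, psm=None):
    """Build an argument list matching the tesseract 3.02 usage text."""
    cmd = ["tesseract", image, output_base]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        # The usage text lists pagesegmode values 0 through 10.
        if not 0 <= psm <= 10:
            raise ValueError("psm must be between 0 and 10")
        cmd += ["-psm", str(psm)]
    return cmd


# -l and -psm come after the image and output base, before any config file.
print(build_tesseract_command("Geeking-Out.png", "Geeking-out", lang="eng", psm=7))
```

The resulting list can be handed to `subprocess.check_call` to actually run the OCR.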

Saturday, October 31, 2015

Installing Xcode, Python, MacPorts, Pip and more

By now I've figured out that if I'm going to be developing a Python app on a Mac and leveraging a lot of open source code, I need the following (I'll add to this list as I go):

1. Xcode: Apple's development environment, which features a bunch of dev tools (including Swift) and an IDE. Install from here.
2. Python: I am running Mac OS X Yosemite, so Python 2.7.10 is already installed. It is the latest in the 2.x line and is more than sufficient for what I'm doing. I may consider upgrading to 3.x later, but this is good for now.
3. MacPorts: An open source initiative that makes it easy to install open source software in the Mac OS X environment. Download and install from here.
4. Pip: The PyPA-recommended tool for installing Python packages.


If you are running Python 2.7.9 or newer (or 3.4 or newer) installed from python.org, you will already have pip by default. Check by typing “pip” at the command line. If you don’t already have pip, follow these steps to install:

1. Copy the code from https://bootstrap.pypa.io/get-pip.py into a local file named get-pip.py
Anils-MacBook-Air:Python anilmurty$ touch get-pip.py
Anils-MacBook-Air:Python anilmurty$ tw get-pip.py
Anils-MacBook-Air:Python anilmurty$ 
2. Install as follows:
Anils-MacBook-Air:Python anilmurty$ sudo python get-pip.py

WARNING: Improper use of the sudo command could lead to data loss
or the deletion of important system files. Please double-check your
typing when using sudo. Type "man sudo" for more information.

To proceed, enter your password, or type Ctrl-C to abort.
Password:
The directory '/Users/anilmurty/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/anilmurty/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting pip
  Downloading pip-7.1.2-py2.py3-none-any.whl (1.1MB)
    100% |████████████████████████████████| 1.1MB 291kB/s
Collecting wheel
  Downloading wheel-0.26.0-py2.py3-none-any.whl (63kB)
    100% |████████████████████████████████| 65kB 3.5MB/s
Installing collected packages: pip, wheel
Successfully installed pip-7.1.2 wheel-0.26.0
Anils-MacBook-Air:Python anilmurty$ 

3. Confirm Installation:

Anils-MacBook-Air:Python anilmurty$ pip

Usage:
  pip [options]

Commands:
  install                     Install packages.
  uninstall                   Uninstall packages.
  freeze                      Output installed packages in requirements format.
  list                        List installed packages.
  show                        Show information about installed packages.
  search                      Search PyPI for packages.
  wheel                       Build wheels from your requirements.
  help                        Show help for commands.

General Options:
  -h, --help                  Show help.
  --isolated                  Run pip in an isolated mode, ignoring environment variables and user
                              configuration.
  -v, --verbose               Give more output. Option is additive, and can be used up to 3 times.
  -V, --version               Show version and exit.
  -q, --quiet                 Give less output.
  --log <path>                Path to a verbose appending log.
  --proxy <proxy>             Specify a proxy in the form [user:passwd@]proxy.server:port.
  --retries <retries>         Maximum number of retries each connection should attempt (default 5 times).
  --timeout <sec>             Set the socket timeout (default 15 seconds).
  --exists-action <action>    Default action when a path already exists: (s)witch, (i)gnore, (w)ipe,
                              (b)ackup.
  --trusted-host <hostname>   Mark this host as trusted, even though it does not have valid or any HTTPS.
  --cert <path>               Path to alternate CA bundle.
  --client-cert <path>        Path to SSL client certificate, a single file containing the private key and
                              the certificate in PEM format.
  --cache-dir <dir>           Store the cache data in <dir>.
  --no-cache-dir              Disable the cache.
  --disable-pip-version-check
                              Don't periodically check PyPI to determine whether a new version of pip is
                              available for download. Implied with --no-index.
Anils-MacBook-Air:Python anilmurty$
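
The same check can be done from inside Python. This is a small sketch written for Python 3 (an assumption on my part, since the session above uses the system Python 2.7, where `importlib.util.find_spec` and `shutil.which` are not available). The name `pip_status` is hypothetical.

```python
import importlib.util
import shutil


def pip_status():
    """Report whether pip is importable and whether a pip command is on the PATH."""
    return {
        "module": importlib.util.find_spec("pip") is not None,
        "command": shutil.which("pip") is not None,
    }


print(pip_status())
```

The two answers can differ: the module may be importable by this interpreter while the `pip` command on the PATH belongs to a different Python install.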

Cloning the Tesseract OCR Engine

What is Tesseract and why am I cloning it?

The real definition of a tesseract is a "4 Dimensional Analog of a Cube" -- read more about it at this Wikipedia page.

In this context, Tesseract is the name of an Optical Character Recognition (OCR) engine, originally developed at HP between 1985 and 1994 and later enhanced by Google and released under the Apache License 2.0. Here is the GitHub page for it.

Here is some formal documentation from the README.md, once you clone and unpack it:

=============================================================
History
=======
The engine was developed at Hewlett-Packard Laboratories Bristol and
at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows
with VC++2010. The C++ code makes heavy use of a list system using macros.
This predates stl, was portable before stl, and is more efficient than stl
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug.

The most recent change is that Tesseract can now recognize 39 languages,
including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants, 
is fully UTF8 capable, and is fully trainable. See TrainingTesseract for
more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. 
Results were available on https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf.
With Tesseract 2.00, scripts were included to allow anyone to reproduce 
some of these tests. See TestingTesseract for more details. 


About the Engine
================
This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple
OUTPUT FORMATTING (txt, hocr/html), and NO UI. 
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.
As of 3.01, Tesseract is fully unicode (UTF-8) enabled, and can recognize 39
languages "out of the box." Code and documentation is provided for the brave
to train in other languages. 
See [Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) 
for more information on training. Additional [code and extracted documentation](http://tesseract-ocr.github.io/) was generated by Doxygen.

===============================================

My goal is to try and build an app that utilizes this engine and hence, I'm "checking out" the code as below. I'm hoping to write the wrapper in Python (another item on my learning list), hence the "pytesseract" reference.



Anils-MacBook-Air:Projects anilmurty$ mkdir pytesseract
Anils-MacBook-Air:Projects anilmurty$ cd pytesseract/
Anils-MacBook-Air:pytesseract anilmurty$ git clone https://github.com/tesseract-ocr/tesseract.git
Cloning into 'tesseract'...
remote: Counting objects: 11607, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 11607 (delta 6), reused 0 (delta 0), pack-reused 11579
Receiving objects: 100% (11607/11607), 32.35 MiB | 1.38 MiB/s, done.
Resolving deltas: 100% (9073/9073), done.
Checking connectivity... done.
Anils-MacBook-Air:pytesseract anilmurty$ 
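
To give an idea of the kind of Python wrapper I have in mind, here is a minimal sketch that shells out to the tesseract command line and reads back the text file it writes. It is written for Python 3 and does not use any existing wrapper library; `ocr_image` and its `runner` parameter are hypothetical names, and the only behavior it relies on is that tesseract 3.02 writes its result to the output base name with ".txt" appended.

```python
import os
import subprocess
import tempfile


def ocr_image(image_path, lang="eng", runner=subprocess.check_call):
    """Run the tesseract CLI on image_path and return the recognized text.

    `runner` is injectable so the function can be exercised without
    tesseract installed (for example, in a unit test).
    """
    with tempfile.TemporaryDirectory() as workdir:
        output_base = os.path.join(workdir, "ocr-output")
        # tesseract writes its result to "<output_base>.txt".
        runner(["tesseract", image_path, output_base, "-l", lang])
        with open(output_base + ".txt", encoding="utf-8") as handle:
            return handle.read()
```

With tesseract installed and on the PATH, calling `ocr_image("/path/to/image.png")` should return the recognized text as a string. Passing the command runner in as a parameter is a design choice to keep the wrapper testable; a real app would probably also want to handle a missing binary and non-zero exit codes.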

Installing and setting up Git for Mac OSX

Having been out of the weeds on the technical stuff for a couple of years, I decided to document my journey in building an OCR reader app. The first step was to find a code repository to store everything as I build it out.
I've used Subversion, CVS and Perforce for work and Git for fun (on Linux) in the past. I figured I'd go with Git, but since I moved to Mac OS X a few years ago, here is a dump of what it takes to get set up:

1. Git comes preinstalled on a Mac running OSX 10.6 or newer. Type “git” in a terminal to confirm or to install.
2. Check the version with “git --version” and compare against the latest here: http://git-scm.com/download/mac
3. If older, then download the .dmg and install it.
4. Restart the terminal and run “git --version” again to confirm the new version.
5. Set up your identity for commits:
    Last login: Sat Oct 31 11:32:25 on ttys002
Anils-MacBook-Air:~ anilmurty$ git --version
git version 2.6.2
Anils-MacBook-Air:~ anilmurty$ git config --global user.name "Anil Murty"
Anils-MacBook-Air:~ anilmurty$ git config --global user.email anil.codemonkey@gmail.com

6. Set up the default text editor. On a Mac, Vim, Emacs and TextWrangler are all good options. I've used vi and emacs on Linux, so I decided to try something new and use TextWrangler this time. TextWrangler does not have command line tools by default, so you may have to install them. Alternatively, if you just want to open files from the command line and then use the GUI, you can add an alias.
Add this to the .bash_profile file in your user directory on Mac OS X (e.g. “/Users/anilmurty”):
# Type 'tw' on the terminal to open TextWrangler
alias tw='open -a /Applications/TextWrangler.app'
Then set tw as the default editor for git:
Anils-MacBook-Air:~ anilmurty$ git config --global core.editor tw

7. Check all your config settings:

Anils-MacBook-Air:~ anilmurty$ git config --list
core.excludesfile=~/.gitignore
core.legacyheaders=false
core.quotepath=false
core.pager=less -r
mergetool.keepbackup=true
push.default=simple
color.ui=auto
color.interactive=auto
repack.usedeltabaseoffset=true
alias.s=status
alias.a=!git add . && git status
alias.au=!git add -u . && git status
alias.aa=!git add . && git add -u . && git status
alias.c=commit
alias.cm=commit -m
alias.ca=commit --amend
alias.ac=!git add . && git commit
alias.acm=!git add . && git commit -m
alias.l=log --graph --all --pretty=format:'%C(yellow)%h%C(cyan)%d%Creset %s %C(white)- %an, %ar%Creset'
alias.ll=log --stat --abbrev-commit
alias.lg=log --color --graph --pretty=format:'%C(bold white)%h%Creset -%C(bold green)%d%Creset %s %C(bold green)(%cr)%Creset %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative
alias.llg=log --color --graph --pretty=format:'%C(bold white)%H %d%Creset%n%s%n%+b%C(bold blue)%an <%ae>%Creset %C(bold green)%cr (%ci)' --abbrev-commit
alias.d=diff
alias.master=checkout master
alias.spull=svn rebase
alias.spush=svn dcommit
alias.alias=!git config --list | grep 'alias\.' | sed 's/alias\.\([^=]*\)=\(.*\)/\1\ => \2/' | sort
include.path=~/.gitcinclude
include.path=.githubconfig
include.path=.gitcredential
diff.exif.textconv=exif
credential.helper=osxkeychain
user.name=Anil Murty
user.email=anil.codemonkey@gmail.com
core.editor=tw

Anils-MacBook-Air:~ anilmurty$