Monday, November 2, 2015

Installing Tesseract using MacPorts

Follow these steps to install Tesseract using MacPorts:

1. Install the Tesseract dependencies: autoconf, automake, libtool, libpng (with support for jpeg and tiff) and leptonica.
2. Install Tesseract. (I installed it with just the English language support.)
3. Set the TESSDATA_PREFIX environment variable to the parent directory of the "tessdata" folder, which contains the eng.traineddata file (you may need to run "find" to locate this file and get the correct path).
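
Step 3 can also be done from Python instead of "find". The following is only a minimal sketch, assuming the language data lives somewhere under the default MacPorts prefix /opt/local; the function name find_tessdata_prefix is my own and not part of Tesseract or MacPorts.

```python
import os


def find_tessdata_prefix(search_root, lang="eng"):
    """Return the directory to use as TESSDATA_PREFIX, or None if not found.

    Walks search_root looking for <lang>.traineddata inside a folder named
    "tessdata" and returns the parent of that folder.
    """
    wanted = lang + ".traineddata"
    for dirpath, dirnames, filenames in os.walk(search_root):
        if os.path.basename(dirpath) == "tessdata" and wanted in filenames:
            return os.path.dirname(dirpath)
    return None


if __name__ == "__main__":
    # Assumption: /opt/local is the MacPorts prefix; adjust if yours differs.
    prefix = find_tessdata_prefix("/opt/local")
    if prefix is None:
        print("eng.traineddata not found under /opt/local")
    else:
        print('export TESSDATA_PREFIX="%s"' % prefix)
```

If the file is found, the script prints an export line that can be pasted into the terminal or into .bash_profile.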



Anils-MacBook-Air:~ anilmurty$ sudo port install autoconf
--->  Computing dependencies for autoconf
--->  Cleaning autoconf
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ sudo port install automake
--->  Cleaning automake
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ sudo port install libtool
--->  Cleaning libtool
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ sudo port install jpeg tiff libpng
--->  Cleaning jpeg
--->  Computing dependencies for tiff
--->  Cleaning tiff
--->  Computing dependencies for libpng
--->  Cleaning libpng
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ sudo port install leptonica
--->  Computing dependencies for leptonica
--->  Cleaning leptonica
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ sudo port selfupdate
--->  Updating MacPorts base sources using rsync
MacPorts base version 2.3.4 installed,
MacPorts base version 2.3.4 downloaded.
--->  Updating the ports tree
--->  MacPorts base is already the latest version

The ports tree has been updated. To upgrade your installed ports, you should run
  port upgrade outdated
Anils-MacBook-Air:~ anilmurty$ port upgrade outdated
Nothing to upgrade.
Anils-MacBook-Air:~ anilmurty$ 


Anils-MacBook-Air:~ anilmurty$ sudo port install tesseract-eng
--->  Computing dependencies for tesseract-eng
--->  Dependencies to be installed: tesseract
--->  Fetching archive for tesseract
--->  Attempting to fetch tesseract-3.02.02_2.darwin_14.x86_64.tbz2 from http://packages.macports.org/tesseract
--->  Attempting to fetch tesseract-3.02.02_2.darwin_14.x86_64.tbz2.rmd160 from http://packages.macports.org/tesseract
--->  Installing tesseract @3.02.02_2
--->  Activating tesseract @3.02.02_2
--->  Cleaning tesseract
--->  Fetching archive for tesseract-eng
--->  Attempting to fetch tesseract-eng-3.02_1.darwin_14.noarch.tbz2 from http://packages.macports.org/tesseract-eng
--->  Attempting to fetch tesseract-eng-3.02_1.darwin_14.noarch.tbz2.rmd160 from http://packages.macports.org/tesseract-eng
--->  Installing tesseract-eng @3.02_1
--->  Activating tesseract-eng @3.02_1
--->  Cleaning tesseract-eng
--->  Updating database of binaries
--->  Scanning binaries for linking errors
--->  No broken files found.
Anils-MacBook-Air:~ anilmurty$ 

Anils-MacBook-Air:/ anilmurty$ export TESSDATA_PREFIX="/opt/local/share"



Quick Test

Anils-MacBook-Air:/ anilmurty$ tesseract
Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.

Single options:
  -v --version: version info
  --list-langs: list available languages for tesseract engine
Anils-MacBook-Air:/ anilmurty$ 
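
Since the plan is to drive the engine from Python, here is a rough sketch of how the usage line above could be turned into an argument list. It only builds the command and does not run it. The function name and the file names in the usage note are hypothetical, and the -psm spelling is taken from the 3.02 usage output shown above.

```python
def build_tesseract_command(image, outputbase, lang=None, psm=None, configs=()):
    """Build the argument list for the tesseract CLI, following the usage line:
    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
    """
    if psm is not None and psm not in range(0, 11):
        raise ValueError("pagesegmode must be between 0 and 10")
    command = ["tesseract", image, outputbase]
    if lang is not None:
        command += ["-l", lang]
    if psm is not None:
        command += ["-psm", str(psm)]
    # -l and -psm must occur before any config files, so these go last.
    command += list(configs)
    return command
```

For example, build_tesseract_command("page.png", "page", lang="eng", psm=6) gives a list that can be passed to subprocess.check_call, which should leave the recognized text in page.txt.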

Saturday, October 31, 2015

Installing Xcode, Python, MacPorts, Pip and more

By now I've figured out that if I'm going to be developing a Python app using a Mac and leveraging a lot of opensource code, I need the following (I'll add to this list as I go):

1. Xcode: Apple's development environment, which features a bunch of dev tools (including Swift) and an IDE. Install from here.
2. Python: I am running Mac OS X Yosemite, which comes with Python 2.7.10 preinstalled. That is the latest in the 2.x line and more than sufficient for what I'm doing. I may consider upgrading to 3.x later, but this is good for now.
3. MacPorts: An open source initiative that makes it easy to install open source software in the Mac OS X environment. Download and install from here.
4. Pip: The PyPA-recommended tool for installing Python packages.


If you are running Python 2.7.9 or newer, you should already have pip by default. Check by typing “pip” at the command line. If you don’t already have pip, follow these steps to install it:

1. Copy the code from https://bootstrap.pypa.io/get-pip.py into a local file:
Anils-MacBook-Air:Python anilmurty$ touch get-pip.py
Anils-MacBook-Air:Python anilmurty$ tw get-pip.py
Anils-MacBook-Air:Python anilmurty$ 
2. Install as follows:
Anils-MacBook-Air:Python anilmurty$ sudo python get-pip.py

WARNING: Improper use of the sudo command could lead to data loss
or the deletion of important system files. Please double-check your
typing when using sudo. Type "man sudo" for more information.

To proceed, enter your password, or type Ctrl-C to abort.
Password:
The directory '/Users/anilmurty/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/anilmurty/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting pip
  Downloading pip-7.1.2-py2.py3-none-any.whl (1.1MB)
    100% |████████████████████████████████| 1.1MB 291kB/s
Collecting wheel
  Downloading wheel-0.26.0-py2.py3-none-any.whl (63kB)
    100% |████████████████████████████████| 65kB 3.5MB/s
Installing collected packages: pip, wheel
Successfully installed pip-7.1.2 wheel-0.26.0
Anils-MacBook-Air:Python anilmurty$ 

3. Confirm Installation:

Anils-MacBook-Air:Python anilmurty$ pip

Usage:
  pip [options]

Commands:
  install                     Install packages.
  uninstall                   Uninstall packages.
  freeze                      Output installed packages in requirements format.
  list                        List installed packages.
  show                        Show information about installed packages.
  search                      Search PyPI for packages.
  wheel                       Build wheels from your requirements.
  help                        Show help for commands.

General Options:
  -h, --help                  Show help.
  --isolated                  Run pip in an isolated mode, ignoring environment variables and user
                              configuration.
  -v, --verbose               Give more output. Option is additive, and can be used up to 3 times.
  -V, --version               Show version and exit.
  -q, --quiet                 Give less output.
  --log <path>                Path to a verbose appending log.
  --proxy <proxy>             Specify a proxy in the form [user:passwd@]proxy.server:port.
  --retries <retries>         Maximum number of retries each connection should attempt (default 5 times).
  --timeout <sec>             Set the socket timeout (default 15 seconds).
  --exists-action <action>    Default action when a path already exists: (s)witch, (i)gnore, (w)ipe,
                              (b)ackup.
  --trusted-host <hostname>   Mark this host as trusted, even though it does not have valid or any HTTPS.
  --cert <path>               Path to alternate CA bundle.
  --client-cert <path>        Path to SSL client certificate, a single file containing the private key and
                              the certificate in PEM format.
  --cache-dir <dir>           Store the cache data in <dir>.
  --no-cache-dir              Disable the cache.
  --disable-pip-version-check
                              Don't periodically check PyPI to determine whether a new version of pip is
                              available for download. Implied with --no-index.
Anils-MacBook-Air:Python anilmurty$
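
Another way to confirm the installation is to ask the Python interpreter itself whether it can import pip. This is just a small sketch; has_module is a helper name I made up, and it only tells you about the interpreter you run it with.

```python
import importlib


def has_module(name):
    """Return True if the named module can be imported by this interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("pip available: %s" % has_module("pip"))
```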

Cloning the Tesseract OCR Engine

What is Tesseract and why am I cloning it?

The real definition of a tesseract is a "4 Dimensional Analog of a Cube" -- read more about it at this Wikipedia page.

In this context, Tesseract is the name of the Optical Character Recognition (OCR) engine, originally developed at HP between 1985 and 1994 and then later on enhanced by Google and released under the Apache License 2.0. Here is the GitHub page for it.

Here is some formal documentation from the README.md, once you clone and unpack it:

=============================================================
History
=======
The engine was developed at Hewlett-Packard Laboratories Bristol and
at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows
with VC++2010. The C++ code makes heavy use of a list system using macros.
This predates stl, was portable before stl, and is more efficient than stl
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug.

The most recent change is that Tesseract can now recognize 39 languages,
including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants, 
is fully UTF8 capable, and is fully trainable. See TrainingTesseract for
more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. 
Results were available on https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf.
With Tesseract 2.00, scripts were included to allow anyone to reproduce 
some of these tests. See TestingTesseract for more details. 


About the Engine
================
This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple
OUTPUT FORMATTING (txt, hocr/html), and NO UI. 
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.
As of 3.01, Tesseract is fully unicode (UTF-8) enabled, and can recognize 39
languages "out of the box." Code and documentation is provided for the brave
to train in other languages. 
See [Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) 
for more information on training. Additional [code and extracted documentation](http://tesseract-ocr.github.io/) was generated by Doxygen.

===============================================

My goal is to try and build an app that utilizes this engine and hence, I'm "checking out" the code as below. I'm hoping to write the wrapper in Python (another item on my learning list), hence the "pytesseract" reference.



Anils-MacBook-Air:Projects anilmurty$ mkdir pytesseract
Anils-MacBook-Air:Projects anilmurty$ cd pytesseract/
Anils-MacBook-Air:pytesseract anilmurty$ git clone https://github.com/tesseract-ocr/tesseract.git
Cloning into 'tesseract'...
remote: Counting objects: 11607, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 11607 (delta 6), reused 0 (delta 0), pack-reused 11579
Receiving objects: 100% (11607/11607), 32.35 MiB | 1.38 MiB/s, done.
Resolving deltas: 100% (9073/9073), done.
Checking connectivity... done.
Anils-MacBook-Air:pytesseract anilmurty$ 

Installing and setting up Git for Mac OSX

Having been out of the weeds on the technical stuff for a couple of years, I decided to document my journey in building an OCR reader app. The first step was to find a code repository to store all the stuff as I build it out.
I've used Subversion, CVS and Perforce for work, and Git for fun (on Linux). I figured I'd go with Git, but since I moved to Mac OS X a few years ago, here is a dump of what it takes to get set up:

1. Git comes preinstalled on a Mac running OSX 10.6 or newer. Type “git” in a terminal to confirm or to install.
2. Check the version with “git --version” and compare it against the latest here: http://git-scm.com/download/mac
3. If yours is older, download the .dmg and install it.
4. Restart the terminal and run “git --version” again to confirm the new version.
5. Set up your identity for commits:
Anils-MacBook-Air:~ anilmurty$ git --version
git version 2.6.2
Anils-MacBook-Air:~ anilmurty$ git config --global user.name "Anil Murty"
Anils-MacBook-Air:~ anilmurty$ git config --global user.email anil.codemonkey@gmail.com

6. Set up a default text editor. On a Mac, Vim, Emacs or TextWrangler are good options. I've used vi and emacs on Linux, so I decided to try something new and use TextWrangler this time. TextWrangler does not install its command line tools by default, so you may have to install them. Alternatively, if you just wish to open files from the command line and then use the GUI, add the following lines to the .bash_profile file in your home directory on Mac OS X (e.g. "/Users/anilmurty"):
# Type 'tw' on the terminal to open TextWrangler
alias tw='open -a /Applications/TextWrangler.app'
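Then set tw as the default editor for git (git launches the editor itself and does not expand shell aliases, so the editor will only open if tw is also available as an executable on your PATH):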
Anils-MacBook-Air:~ anilmurty$ git config --global core.editor tw

7. Check all your config settings:

Anils-MacBook-Air:~ anilmurty$ git config --list
core.excludesfile=~/.gitignore
core.legacyheaders=false
core.quotepath=false
core.pager=less -r
mergetool.keepbackup=true
push.default=simple
color.ui=auto
color.interactive=auto
repack.usedeltabaseoffset=true
alias.s=status
alias.a=!git add . && git status
alias.au=!git add -u . && git status
alias.aa=!git add . && git add -u . && git status
alias.c=commit
alias.cm=commit -m
alias.ca=commit --amend
alias.ac=!git add . && git commit
alias.acm=!git add . && git commit -m
alias.l=log --graph --all --pretty=format:'%C(yellow)%h%C(cyan)%d%Creset %s %C(white)- %an, %ar%Creset'
alias.ll=log --stat --abbrev-commit
alias.lg=log --color --graph --pretty=format:'%C(bold white)%h%Creset -%C(bold green)%d%Creset %s %C(bold green)(%cr)%Creset %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative
alias.llg=log --color --graph --pretty=format:'%C(bold white)%H %d%Creset%n%s%n%+b%C(bold blue)%an <%ae>%Creset %C(bold green)%cr (%ci)' --abbrev-commit
alias.d=diff
alias.master=checkout master
alias.spull=svn rebase
alias.spush=svn dcommit
alias.alias=!git config --list | grep 'alias\.' | sed 's/alias\.\([^=]*\)=\(.*\)/\1\ => \2/' | sort
include.path=~/.gitcinclude
include.path=.githubconfig
include.path=.gitcredential
diff.exif.textconv=exif
credential.helper=osxkeychain
user.name=Anil Murty
user.email=anil.codemonkey@gmail.com
core.editor=tw

Anils-MacBook-Air:~ anilmurty$ 
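
The listing above is one key=value pair per line, and a key such as include.path can appear more than once. As a rough sketch of how that output could be read from Python, the snippet below collects every value for each key into a list. parse_git_config is a hypothetical helper, the sample lines are copied from my listing, and values that contain newlines are not handled.

```python
def parse_git_config(listing):
    """Parse `git config --list` output into a dict of {key: [values, ...]}."""
    settings = {}
    for line in listing.splitlines():
        if not line.strip():
            continue
        # Split on the first "=" only, since alias values may contain "=".
        key, _, value = line.partition("=")
        settings.setdefault(key, []).append(value)
    return settings


sample = """user.name=Anil Murty
core.editor=tw
include.path=~/.gitcinclude
include.path=.githubconfig
alias.a=!git add . && git status"""

config = parse_git_config(sample)
print(config["user.name"])
print(config["include.path"])
```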





Sunday, August 22, 2010

Android for MIPS

I recently learnt about this effort to port Android to the MIPS platform at this website and decided to investigate what it was about.

Before I could check out any source code, I had to install Curl and Git. For Git install see my previous post. Curl can be installed as follows:
sudo apt-get install curl

Next, install repo, which is a Google tool (read the Google blog post for details):
curl http://android.git.kernel.org/repo >/home/am/repo

Once repo is set up, run repo sync to check out the generic Android for MIPS source.

Git Install

Git is an open source distributed version control system that is used to host a lot of open source code (including its own). So installing Git on your Linux machine is a good idea, because you will likely end up using it to download something pretty soon.

Follow these steps to install Git (I did this on Ubuntu 10.04):
1. Download the latest tar or gz file from here.
2. Untar it to a local directory.
3. cd to the git directory.
4. Run ./configure (or sudo ./configure, if you are not running as root).
5. Run make.
6. If make fails with an error like: fast-import.c:2848: error: ‘Z_BEST_COMPRESSION’ undeclared (first use in this function).... then you are missing zlib1g-dev. Install it as follows:
sudo apt-get install zlib1g-dev
and then run make again. It should work this time.
7. Run make install.
Done!
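
The configure, make and make install steps above can be sketched as a small Python script that stops at the first step that fails. This is only an illustration under my own assumptions: run_build and BUILD_STEPS are made-up names, and make install may need sudo depending on the install prefix.

```python
import subprocess

# The build steps from the list above, in order.
BUILD_STEPS = [
    ["./configure"],
    ["make"],
    ["make", "install"],  # may need sudo depending on the install prefix
]


def run_build(source_dir, steps=BUILD_STEPS, runner=subprocess.check_call):
    """Run each build step inside source_dir, stopping at the first failure.

    Returns the list of steps that completed. With the default runner, a
    failing step raises subprocess.CalledProcessError.
    """
    completed = []
    for step in steps:
        runner(step, cwd=source_dir)
        completed.append(step)
    return completed
```

The runner argument is there so the sequence can be tried out with a stand-in function before pointing it at a real source directory.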

VirtualBox For Windows 7 and Ubuntu 10.04

So, I decided to investigate setting up a virtual machine so I don't have to constantly reboot as I go between Windows and Linux.
After doing some quick research on all the various options out there (VMWare, VirtualBox, Microsoft Virtual PC,...) I decided to go with Sun VirtualBox.

The setup was fairly easy:
1. Decide on a "host OS" that you want to use (in my case it was Windows 7).
2. Download and install VirtualBox from http://www.virtualbox.org/wiki/Downloads
3. Decide on the guest OS (or OSes) you want to install (in my case Ubuntu 10.04)
4. Run the VirtualBox console.
5. Select New Machine - this brings up the wizard.
6. Give it a name and select the operating system type from the dropdown.
7. Browse to the ISO image to install
8. Install the OS.

There is one quirk I've encountered so far: when I run my guest OS (Ubuntu) and try to switch to full screen, the window goes full screen but the OS itself stays at the same screen size. I found the solution to this problem at this link.
This post was particularly useful:
Start the guest
  • Open a terminal window
  • sudo -i press enter
  • apt-get install build-essential
  • apt-get install linux-headers-generic
Leave the terminal open you are going to need it again.

When this is done, open the Devices menu at the top left of the guest window (called a VM, which stands for virtual machine) and click Install Guest Additions.
This will put a CD on the desktop. Double click to open and then select the guest additions for your machine (see note below) and right click-drag and drop on the terminal window.
Click once in the terminal to get focus and then hit enter. Wait for everything to finish and then reboot the machine. Watch for any errors and if you have any report back here with the exact error message.

Note: Select VBoxLinuxAdditions-amd64.run for 64 bit or VBoxLinuxAdditions-x86.run for 32 bit
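
The choice in the note can be sketched in Python. This is only a rough illustration: guest_additions_installer is a made-up helper, the list of machine names is my assumption about what platform.machine() commonly reports, and a 32 bit system running on a 64 bit kernel may still report a 64 bit machine type.

```python
import platform


def guest_additions_installer(machine=None):
    """Return the Guest Additions installer name for a machine type."""
    if machine is None:
        machine = platform.machine()
    if machine in ("x86_64", "amd64", "AMD64"):
        return "VBoxLinuxAdditions-amd64.run"
    if machine in ("i386", "i486", "i586", "i686", "x86"):
        return "VBoxLinuxAdditions-x86.run"
    raise ValueError("unrecognized machine type: %r" % machine)
```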

After you have the Guest Additions installed, you can use the mouse to resize the screen to the size you want, or use the (Host+F) toggle for full screen or the (Host+L) toggle for seamless mode. The Host key is the right Ctrl key, which a Mac does not have, so to use this feature you will need to go into the main VirtualBox program, click File at the top left, then Preferences, and then Input. Click once where it says Right Control and then press the left Ctrl key on your Mac keyboard. Close and then start the guest again for this to take effect.