Saturday, October 31, 2015

Installing Xcode, Python, MacPorts, Pip and more

By now I've figured out that if I'm going to develop a Python app on a Mac and leverage a lot of open source code, I need the following (I'll add to this list as I go):

1. Xcode: Apple's development environment, which features a bunch of dev tools (including Swift) and an IDE. Install from here.
2. Python: I am running Mac OS X Yosemite, which already ships with Python 2.7.10. That is the latest in the 2.x line and is more than sufficient for what I'm doing. I may consider upgrading to 3.x later, but this is good for now.
3. MacPorts: An open source initiative that makes it easy to install open source software in the Mac OS X environment. Download and install from here.
4. Pip: The PyPA-recommended tool for installing Python packages.
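
A quick way to see which of these tools are already on the PATH is a few lines of Python. This is only a sketch under some assumptions: the executable names ("xcodebuild" for Xcode, "port" for MacPorts) are my guesses at what a typical install provides, and shutil.which needs Python 3.3 or newer, so it will not run on the stock 2.7.

```python
import shutil

# Executable names are assumptions about a typical setup:
# "xcodebuild" comes with Xcode and "port" comes with MacPorts.
TOOLS = ["xcodebuild", "python", "port", "pip"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(TOOLS).items():
        print("{:<12} {}".format(name, path or "not found"))
```

Anything reported as "not found" is a candidate for the install steps in this post.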

If you are running Python 2.7.9 or newer, you should already have pip by default. Check by typing “pip” at the command line. If you don’t already have pip, follow these steps to install:

1. Copy code from into your local file
Anils-MacBook-Air:Python anilmurty$ touch
Anils-MacBook-Air:Python anilmurty$ tw
Anils-MacBook-Air:Python anilmurty$ 
2. Install as follows:
Anils-MacBook-Air:Python anilmurty$ sudo python

WARNING: Improper use of the sudo command could lead to data loss
or the deletion of important system files. Please double-check your
typing when using sudo. Type "man sudo" for more information.

To proceed, enter your password, or type Ctrl-C to abort.
The directory '/Users/anilmurty/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/anilmurty/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting pip
  Downloading pip-7.1.2-py2.py3-none-any.whl (1.1MB)
    100% |████████████████████████████████| 1.1MB 291kB/s
Collecting wheel
  Downloading wheel-0.26.0-py2.py3-none-any.whl (63kB)
    100% |████████████████████████████████| 65kB 3.5MB/s
Installing collected packages: pip, wheel
Successfully installed pip-7.1.2 wheel-0.26.0
Anils-MacBook-Air:Python anilmurty$ 
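
The two cache warnings above appear because the install ran under sudo: the process runs as root, while the pip cache under my home directory still belongs to my regular user, so pip skips the cache. sudo's -H flag sets HOME to the target user's home directory, which avoids the mismatch. The sketch below shows the kind of ownership check involved. It is not pip's actual code, and the function name is made up for illustration.

```python
import os


def owned_by_current_user(path):
    """Return True if path, or its nearest existing parent, belongs to the current user."""
    path = os.path.abspath(os.path.expanduser(path))
    # Walk up until something exists, mirroring "or its parent directory"
    # in the warning text.
    while not os.path.lexists(path):
        parent = os.path.dirname(path)
        if parent == path:
            break
        path = parent
    return os.stat(path).st_uid == os.geteuid()


if __name__ == "__main__":
    print(owned_by_current_user("~/Library/Caches/pip"))
```

Run as a normal user this should print True for a cache directory in your own home; run under plain sudo it would print False, which is the situation the warning describes.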

3. Confirm Installation:

Anils-MacBook-Air:Python anilmurty$ pip

Usage:
  pip <command> [options]

Commands:
  install                     Install packages.
  uninstall                   Uninstall packages.
  freeze                      Output installed packages in requirements format.
  list                        List installed packages.
  show                        Show information about installed packages.
  search                      Search PyPI for packages.
  wheel                       Build wheels from your requirements.
  help                        Show help for commands.

General Options:
  -h, --help                  Show help.
  --isolated                  Run pip in an isolated mode, ignoring environment variables and user
                              configuration.
  -v, --verbose               Give more output. Option is additive, and can be used up to 3 times.
  -V, --version               Show version and exit.
  -q, --quiet                 Give less output.
  --log <path>                Path to a verbose appending log.
  --proxy <proxy>             Specify a proxy in the form [user:passwd@]proxy.server:port.
  --retries <retries>         Maximum number of retries each connection should attempt (default 5 times).
  --timeout <sec>             Set the socket timeout (default 15 seconds).
  --exists-action <action>    Default action when a path already exists: (s)witch, (i)gnore, (w)ipe,
                              (b)ackup.
  --trusted-host <hostname>   Mark this host as trusted, even though it does not have valid or any HTTPS.
  --cert <path>               Path to alternate CA bundle.
  --client-cert <path>        Path to SSL client certificate, a single file containing the private key and
                              the certificate in PEM format.
  --cache-dir <dir>           Store the cache data in <dir>.
  --no-cache-dir              Disable the cache.
  --disable-pip-version-check
                              Don't periodically check PyPI to determine whether a new version of pip is
                              available for download. Implied with --no-index.
Anils-MacBook-Air:Python anilmurty$
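
The confirmation step can also be scripted by reading the output of "pip --version". Here is a minimal sketch, assuming the output starts with "pip" followed by a dotted version number. The function names are hypothetical, and the sample path in the usage note below is made up.

```python
import re
import subprocess
import sys


def parse_pip_version(output):
    """Turn a line like 'pip 7.1.2 from ...' into a tuple such as (7, 1, 2)."""
    match = re.match(r"pip (\d+(?:\.\d+)*)", output.strip())
    if match is None:
        raise ValueError("unrecognized pip version output: %r" % output)
    return tuple(int(part) for part in match.group(1).split("."))


def installed_pip_version():
    """Ask the pip that belongs to the running interpreter for its version."""
    output = subprocess.check_output([sys.executable, "-m", "pip", "--version"])
    return parse_pip_version(output.decode("utf-8"))
```

For example, parse_pip_version("pip 7.1.2 from /some/site-packages (python 2.7)") gives (7, 1, 2), which can be compared directly against another version tuple.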

Cloning the Tesseract OCR Engine

What is Tesseract and why am I cloning it?

The real definition of a tesseract is a "four-dimensional analog of a cube"; read more about it on this Wikipedia page.

In this context, Tesseract is the name of an Optical Character Recognition (OCR) engine, originally developed at HP between 1985 and 1994, later improved by Google, and released under the Apache License 2.0. Here is the GitHub page for it.

Here is some formal documentation that comes with the code once you clone and unpack it:

The engine was developed at Hewlett-Packard Laboratories Bristol and
at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows
with VC++2010. The C++ code makes heavy use of a list system using macros.
This predates stl, was portable before stl, and is more efficient than stl
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug.

The most recent change is that Tesseract can now recognize 39 languages,
including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants, 
is fully UTF8 capable, and is fully trainable. See TrainingTesseract for
more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. 
Results were available on
With Tesseract 2.00, scripts were included to allow anyone to reproduce 
some of these tests. See TestingTesseract for more details. 

About the Engine
This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple
OUTPUT FORMATTING (txt, hocr/html), and NO UI. 
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.
As of 3.01, Tesseract is fully unicode (UTF-8) enabled, and can recognize 39
languages "out of the box." Code and documentation is provided for the brave
to train in other languages. 
See the Tesseract Training wiki
for more information on training. Additional code and extracted documentation was generated by Doxygen.


My goal is to build an app that uses this engine, so I'm "checking out" the code as shown below. I'm hoping to write the wrapper in Python (another item on my learning list), hence the "pytesseract" reference.

Anils-MacBook-Air:Projects anilmurty$ mkdir pytesseract
Anils-MacBook-Air:Projects anilmurty$ cd pytesseract/
Anils-MacBook-Air:pytesseract anilmurty$ git clone
Cloning into 'tesseract'...
remote: Counting objects: 11607, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 11607 (delta 6), reused 0 (delta 0), pack-reused 11579
Receiving objects: 100% (11607/11607), 32.35 MiB | 1.38 MiB/s, done.
Resolving deltas: 100% (9073/9073), done.
Checking connectivity... done.
Anils-MacBook-Air:pytesseract anilmurty$ 
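
To give an idea of what the Python wrapper could look like, here is a minimal sketch that shells out to the tesseract command line tool. It assumes the engine has already been built and that a "tesseract" executable is on the PATH, and it is written for Python 3. The function names and the file names in the usage note are hypothetical, not part of the Tesseract project.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run tesseract on an image and return the recognized text.

    Tesseract writes plain-text results to <output_base>.txt, so the text
    is read back from that file once the command has finished.
    """
    subprocess.check_call(build_tesseract_command(image_path, output_base, lang))
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

A call such as ocr_image("page.png", "page_out") would leave the text in page_out.txt and return it as a string. Keeping the command construction in its own function makes it easy to test without having the engine installed.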

Installing and setting up Git for Mac OSX

Having been out of the weeds on the technical stuff for a couple of years, I decided to document my journey in building an OCR reader app. The first step was to find a code repository to store everything as I build it out.
I've used Subversion, CVS and Perforce for work, and Git for fun (on Linux). I figured I'd go with Git, but since I moved to Mac OS X a few years ago, here is a dump of what it takes to get set up:

1. Git comes preinstalled on a Mac running OSX 10.6 or newer. Type “git” in a terminal to confirm or to install.
2. Check the version with “git --version” and compare it against the latest here:
3. If yours is older, download the .dmg and install it.
4. Restart the terminal and run “git --version” again to confirm the new version.
5. Set up your identity for commits (the user.name and user.email settings):
Last login: Sat Oct 31 11:32:25 on ttys002
Anils-MacBook-Air:~ anilmurty$ git --version
git version 2.6.2
Anils-MacBook-Air:~ anilmurty$ git config --global "Anil Murty"
Anils-MacBook-Air:~ anilmurty$ git config --global
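
Git reads the commit identity from two settings, user.name and user.email. As a rough sketch, the same setup could be driven from Python like this. The function names are hypothetical, and the email address in the usage note is a placeholder rather than a real one.

```python
import subprocess


def identity_commands(name, email):
    """Return the two git commands that set the global commit identity."""
    return [
        ["git", "config", "--global", "user.name", name],
        ["git", "config", "--global", "user.email", email],
    ]


def set_identity(name, email):
    """Run both commands; this changes the current user's global git config."""
    for command in identity_commands(name, email):
        subprocess.check_call(command)
```

Calling set_identity("Anil Murty", "anil@example.com") would write both values to the global configuration, and "git config --list" should then show them.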

6. Set up a default text editor. On a Mac, Vim, Emacs or TextWrangler are good options. I've used vi and Emacs on Linux, so I decided to try something new and use TextWrangler this time. TextWrangler does not install its command line tools by default, so you may have to install them. Alternatively, if you just wish to open files from the command line and then use the GUI, add the following to the .bash_profile file in your home directory on Mac OS X (e.g. “/Users/anilmurty”):
# Type 'tw' on the terminal to open TextWrangler
alias tw='open -a /Applications/'
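Then set tw as the default editor for git. Keep in mind that git starts the editor from a non-interactive shell, where aliases from .bash_profile are usually not expanded, so a small tw script on your PATH may be needed for this to work: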
Anils-MacBook-Air:~ anilmurty$ git config --global core.editor tw

7. Check all your config settings:

Anils-MacBook-Air:~ anilmurty$ git config --list
core.pager=less -r
alias.a=!git add . && git status!git add -u . && git status
alias.aa=!git add . && git add -u . && git status
alias.c=commit -m --amend!git add . && git commit
alias.acm=!git add . && git commit -m
alias.l=log --graph --all --pretty=format:'%C(yellow)%h%C(cyan)%d%Creset %s %C(white)- %an, %ar%Creset'
alias.ll=log --stat --abbrev-commit
alias.lg=log --color --graph --pretty=format:'%C(bold white)%h%Creset -%C(bold green)%d%Creset %s %C(bold green)(%cr)%Creset %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative
alias.llg=log --color --graph --pretty=format:'%C(bold white)%H %d%Creset%n%s%n%+b%C(bold blue)%an <%ae>%Creset %C(bold green)%cr (%ci)' --abbrev-commit
alias.master=checkout master
alias.spull=svn rebase
alias.spush=svn dcommit
alias.alias=!git config --list | grep 'alias\.' | sed 's/alias\.\([^=]*\)=\(.*\)/\1\ => \2/' | sort
credential.helper=osxkeychain Murty core.editor=tw

Anils-MacBook-Air:~ anilmurty$
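
Since "git config --list" prints one key=value pair per line, the listing is easy to load into Python for checking individual settings. The sketch below is a hypothetical helper, not part of git. It splits each line on the first "=" only, because alias values such as the log formats above contain "=" themselves, and it keeps a list per key because git allows a key to appear more than once.

```python
def parse_git_config(listing):
    """Parse 'git config --list' output into a dict of key -> list of values."""
    config = {}
    for line in listing.splitlines():
        # Split on the first "=" only; the value may contain more of them.
        key, separator, value = line.partition("=")
        if not separator:
            continue  # skip blank lines or anything that is not key=value
        config.setdefault(key.strip(), []).append(value)
    return config


if __name__ == "__main__":
    sample = "core.pager=less -r\nalias.master=checkout master\ncore.editor=tw"
    print(parse_git_config(sample)["core.editor"])
```

With the sample lines taken from the listing above, the lookup for core.editor returns a one-element list containing "tw".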