Tuesday, November 3, 2015

Testing OCR with Tesseract

Moment of truth: let's see how well the "stock" Tesseract install performs.

For reference, here are the versions of the various libraries I am testing with:

Anils-MacBook-Air:tesseract-test anilmurty$ tesseract -v
tesseract 3.02.02
 leptonica-1.71
  libgif 4.2.3 : libjpeg 9a : libpng 1.6.18 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
Anils-MacBook-Air:tesseract-test anilmurty$ 

To test the OCR capabilities, I went on Google and found a few sample files to read. My ultimate goal is to be able to read receipts and invoices, but I figure I'll start with something more basic:

TEST #1: A PNG file with lots of special characters, but without the crazy formatting you would find on a bill or an invoice

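OUTPUT: Pretty impressive. The English lines came through clean. The most obvious miss is "über", which came out as "fiber", and a few other accented words are garbled too (see "marrén répido", "rzipida" and "cfio" below).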
Anils-MacBook-Air:tesseract-test anilmurty$ tesseract /Users/anilmurty/Desktop/ocr-test-image-1.png test-png-1
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Anils-MacBook-Air:tesseract-test anilmurty$ cat test-png-1.txt 
The (quick) [brown] {fox} jumps!
Over the $43,456.78 #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer@website.com is spam.
Der ,,schnelle” braune Fuchs springt
fiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marrén répido salta sobre el perro
perezoso. A raposa marrom rzipida
salta sobre o cfio preguicoso.
Anils-MacBook-Air:tesseract-test anilmurty$ 
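
Every test here is the same two steps: run tesseract on an image, then cat the .txt file it writes. Those steps could be wrapped in a small shell function. This is only a sketch; the name ocr_cat is made up for illustration and is not part of tesseract.

```shell
# ocr_cat IMAGE OUTPUT_BASE
# Hypothetical helper (not part of tesseract): OCR an image, then print the text.
ocr_cat() {
    image=$1
    base=$2
    # tesseract adds the .txt extension to the output base name itself,
    # so "test-png-1" ends up as "test-png-1.txt".
    tesseract "$image" "$base" || return 1
    cat "$base.txt"
}
```

With that defined, the first test would be run as: ocr_cat /Users/anilmurty/Desktop/ocr-test-image-1.png test-png-1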



TEST #2: A PNG of my blog's logo:



OUTPUT: It got "Geeking Out" right but totally messed up the tagline!
Anils-MacBook-Air:tesseract-test anilmurty$ tesseract /Users/anilmurty/Desktop/Geeking-Out.png Geeking-out
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Anils-MacBook-Air:tesseract-test anilmurty$ cat Geeking-out.txt 
Geeking Out

’caz Fm sun a geek at man .)

Anils-MacBook-Air:tesseract-test anilmurty$
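
Eyeballing the output works for two images, but comparing it against a hand-typed copy of the expected text would make misses like the tagline easier to spot. A minimal sketch, assuming a hypothetical file expected.txt that you type up yourself; ocr_check is an illustrative name, and diff is the standard Unix tool.

```shell
# ocr_check EXPECTED ACTUAL
# Hypothetical helper: compare OCR output against a hand-typed reference file.
ocr_check() {
    expected=$1
    actual=$2
    if diff "$expected" "$actual" > /dev/null; then
        echo "match"
    else
        echo "mismatch"
        # Show the lines that differ so the misread words are easy to find.
        diff "$expected" "$actual"
        return 1
    fi
}
```

For the logo test this would be called as: ocr_check expected.txt Geeking-out.txt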