## OCR for PDF in Ubuntu

By admin | July 8, 2008

To do OCR in Ubuntu, you can use the open-source Tesseract OCR engine. However, it can only perform OCR on images in the TIFF format. To make it work with a PDF, we also need the Evince PDF reader, which lets us export a page to TIFF and feed that to Tesseract.

Install Tesseract in Ubuntu:

    $ sudo apt-get install tesseract-ocr

Now get the TIFF from Evince: right-click the page, click “Save image as…”, and export the file. For good OCR results, the image should then be converted to monochrome. Next, run Tesseract:

    $ tesseract foo.tif bar
    Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/eng.unicharset

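If you hit the error above, fix it by installing the English language data:

    $ sudo apt-get install tesseract-ocr-eng

This is caused by a nasty bug in Ubuntu. Finally, run Tesseract again, giving the output name without the .txt extension (otherwise you will end up with bar.txt.txt):

    $ tesseract foo.tif bar

That completes the OCR; the recognized text is written to bar.txt.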
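The steps above can be sketched as a small shell script. This is only a rough sketch under assumptions: the post does not say which tool to use for the monochrome conversion, so ImageMagick's `convert` is assumed here, and the function names `output_base`, `mono_name`, and `ocr_page` are hypothetical helpers made up for illustration.

```shell
#!/bin/sh
# Sketch of the workflow described in the post.
# Assumption: ImageMagick's `convert` is installed (not mentioned in the post).
# All function names below are hypothetical.

# Strip a trailing .txt from the output name, because tesseract appends
# .txt itself and would otherwise write bar.txt.txt.
output_base() {
    printf '%s\n' "${1%.txt}"
}

# Derive a file name for the monochrome copy: foo.tif -> foo-mono.tif
mono_name() {
    printf '%s\n' "${1%.tif}-mono.tif"
}

# Convert the exported page to monochrome, then run OCR on the result.
ocr_page() {
    in=$1
    out=$(output_base "$2")
    mono=$(mono_name "$in")
    # -monochrome reduces the image to black and white; -compress none
    # keeps the TIFF uncompressed, assumed to be the safest input for
    # older Tesseract versions.
    convert "$in" -monochrome -compress none "$mono" || return 1
    tesseract "$mono" "$out"
}

# Example call (needs convert and tesseract installed):
#   ocr_page foo.tif bar.txt
```

With this sketch, calling `ocr_page foo.tif bar.txt` and `ocr_page foo.tif bar` should both leave the text in bar.txt, since the extension is stripped before Tesseract sees the name.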


### 4 Responses to “OCR for PDF in Ubuntu”

June 11th, 2009 at 05:06

Thanks mate, exactly what I was searching for.

2. Julian Burgess Says:
July 26th, 2009 at 17:23

I just got a 1 byte file 🙁 any ideas?

3. StoneCut Says:
February 9th, 2010 at 13:22

Thanks for the awesome bugfix for Ubuntu. I was banging my head over what was wrong.