OCR for PDF in Ubuntu
By admin | July 8, 2008
To do OCR in Ubuntu, you can use the open-source Tesseract OCR engine. However, it can only read images in the TIFF format. To OCR a PDF, we also need the Evince PDF reader, which lets us export a page as a TIFF to feed to Tesseract.
Install Tesseract in Ubuntu:
$ sudo apt-get install tesseract-ocr
Now get the TIFF from Evince: right-click the page, click “Save image as…”, and export the file. Then, for good OCR results, convert the image to monochrome before running Tesseract.
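The post does not show a command for the monochrome step. One possible way, assuming ImageMagick is installed (it is not mentioned in this article, so treat it as an assumption), is its convert tool with the -monochrome option. The helper names mono_name and to_mono are made up for this sketch:

```shell
#!/bin/sh
# Hypothetical helper: derive the output name by stripping a trailing .tif,
# e.g. foo.tif -> foo-mono.tif
mono_name() {
    printf '%s-mono.tif\n' "${1%.tif}"
}

# Convert a TIFF to black and white with ImageMagick's convert (assumed installed),
# then print the name of the file that was written.
to_mono() {
    out=$(mono_name "$1")
    convert "$1" -monochrome "$out" && printf '%s\n' "$out"
}

# Example: to_mono foo.tif   (writes foo-mono.tif next to the original)
```

Writing to a new file keeps the original export around in case the black-and-white version turns out worse for OCR.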
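The first run of Tesseract may fail with an error like this:
$ tesseract foo.tif bar
Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/eng.unicharset
If that happens, install the English language data:
$ sudo apt-get install tesseract-ocr-eng
This is caused by a nasty packaging bug in Ubuntu: the tesseract-ocr package does not pull in the English data it needs.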
Finally, run the OCR:
$ tesseract foo.tif bar  # give the output name without the .txt extension, or you will end up with bar.txt.txt
The recognized text is written to bar.txt.
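To guard against the bar.txt.txt mistake, a small wrapper can strip a trailing .txt from the output name before calling tesseract. This is only a minimal sketch; ocr_base and ocr_page are hypothetical names, not part of Tesseract:

```shell
#!/bin/sh
# Hypothetical helper: drop a trailing .txt from the requested output name,
# because tesseract appends .txt to the output base on its own.
ocr_base() {
    printf '%s\n' "${1%.txt}"
}

# Run tesseract on a TIFF with the cleaned-up output base.
ocr_page() {
    tesseract "$1" "$(ocr_base "$2")"
}

# Example: ocr_page foo.tif bar.txt   (runs: tesseract foo.tif bar)
```

With this, both "ocr_page foo.tif bar" and "ocr_page foo.tif bar.txt" end up writing bar.txt.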
June 11th, 2009 at 05:06
Thanks mate, exactly what I was searching for.
July 26th, 2009 at 17:23
I just got a 1 byte file 🙁 any ideas?
February 9th, 2010 at 13:22
Thanks for the awesome bugfix for Ubuntu. I was banging my head trying to figure out what was wrong.