Compdigitec Labs


OCR for PDF in Ubuntu

By admin | July 8, 2008

To do OCR in Ubuntu, you can use the open-source Tesseract OCR engine. However, it only accepts TIFF images as input. To OCR a PDF, we also use the Evince PDF reader to export a page as a TIFF image that Tesseract can read.

Install tesseract in Ubuntu:

$ sudo apt-get install tesseract-ocr

Now get the TIFF from Evince: right-click the page, choose “Save image as…”, and export the file. For good OCR results, convert the image to monochrome before running Tesseract.
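The post does not show a command for the monochrome conversion. One possible way, assuming ImageMagick is installed (it is not part of the steps above; sudo apt-get install imagemagick), is to convert to grayscale and apply a threshold. The function names to_mono and mono_name, the -mono suffix, and the 50% threshold are illustrative choices, not something from the original article.

```shell
#!/bin/sh
# Sketch only: assumes ImageMagick's `convert` is on PATH.

# mono_name FILE: derive the output name, e.g. foo.tif -> foo-mono.tif
mono_name() {
    printf '%s-mono.tif\n' "${1%.*}"
}

# to_mono FILE: write a black-and-white copy next to the original.
# 50% is an arbitrary starting threshold; tune it for your scans.
to_mono() {
    if ! command -v convert >/dev/null 2>&1; then
        echo "convert not found; install imagemagick first" >&2
        return 1
    fi
    convert "$1" -colorspace Gray -threshold 50% "$(mono_name "$1")"
}
```

With this loaded, to_mono foo.tif would produce foo-mono.tif, which can then be passed to tesseract in place of foo.tif.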
Then run Tesseract. On a fresh install the first attempt may fail:

$ tesseract foo.tif bar
Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/eng.unicharset

If you see this error, install the English language data:

$ sudo apt-get install tesseract-ocr-eng

This is caused by a packaging bug in Ubuntu: installing tesseract-ocr does not pull in the English language data it needs.
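To check for the missing data before running Tesseract, a small test like the following may help. It is a sketch that assumes the data directory shown in the error message above; the function name has_eng_data is made up for illustration, and newer Tesseract packages may lay out their data differently.

```shell
#!/bin/sh
# has_eng_data [DIR]: succeed if eng.unicharset exists in DIR.
# The default DIR is the path taken from the error message in this article.
has_eng_data() {
    [ -f "${1:-/usr/share/tesseract-ocr/tessdata}/eng.unicharset" ]
}

if has_eng_data; then
    echo "English data present"
else
    echo "English data missing; run: sudo apt-get install tesseract-ocr-eng"
fi
```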

Finally, run:

$ tesseract foo.tif bar    # give the output name without the .txt extension, or you will end up with bar.txt.txt

Tesseract writes the recognized text to bar.txt.
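If you tend to type the .txt extension out of habit, a thin wrapper can strip it before calling Tesseract. This is only a sketch; out_base and ocr_to are hypothetical helper names, not part of Tesseract.

```shell
#!/bin/sh
# out_base NAME: drop a trailing .txt so tesseract does not create NAME.txt.txt
out_base() {
    printf '%s\n' "${1%.txt}"
}

# ocr_to IMAGE OUTPUT: run tesseract with a normalized output base name
ocr_to() {
    tesseract "$1" "$(out_base "$2")"
}
```

Both ocr_to foo.tif bar and ocr_to foo.tif bar.txt would then ask Tesseract for the output base bar.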


Responses to “OCR for PDF in Ubuntu”

  1. Mr Surbade Says:
    June 11th, 2009 at 05:06

    Thanks mate, exactly what I was searching for.

  2. Julian Burgess Says:
    July 26th, 2009 at 17:23

    I just got a 1 byte file 🙁 any ideas?

  3. StoneCut Says:
    February 9th, 2010 at 13:22

    Thanks for the awesome bugfix for Ubuntu. I was banging my head what was wrong.
