{"id":3,"date":"2008-07-08T20:53:42","date_gmt":"2008-07-09T00:53:42","guid":{"rendered":"http:\/\/www.compdigitec.com\/labs\/?p=3"},"modified":"2008-07-08T20:53:42","modified_gmt":"2008-07-09T00:53:42","slug":"ocr-for-pdf-in-ubuntu","status":"publish","type":"post","link":"http:\/\/www.compdigitec.com\/labs\/2008\/07\/08\/ocr-for-pdf-in-ubuntu\/","title":{"rendered":"OCR for PDF in Ubuntu"},"content":{"rendered":"<p>To get OCR in <a href=\"http:\/\/www.ubuntu.com\/\">Ubuntu<\/a>, you can use the open-source Tesseract OCR engine. However, it can only perform OCR on TIFF images. To OCR a PDF, we also need the Evince PDF reader, which lets us export a page as a TIFF to feed to Tesseract.<\/p>\n<p>Install tesseract in Ubuntu:<br \/>\n<code><br \/>\n$ sudo apt-get install tesseract-ocr<br \/>\n<\/code><\/p>\n<p>Now get the TIFF from Evince: right-click the page, click &#8220;Save image as&#8230;&#8221;, and export the file. For good OCR results, the image should then be converted to monochrome.<\/p>\n<p>Running Tesseract at this point may fail with an error:<\/p>\n<p><code><br \/>\n$ tesseract <i>foo<\/i>.tif <i>bar<\/i><br \/>\nUnable to load unicharset file \/usr\/share\/tesseract-ocr\/tessdata\/eng.unicharset<br \/>\n<\/code><\/p>\n<p>If the above error occurs, install the English language data:<br \/>\n<code><br \/>\n$ sudo apt-get install tesseract-ocr-eng<br \/>\n<\/code><br \/>\nThe error is caused by a <a href=\"https:\/\/bugs.launchpad.net\/ubuntu\/+source\/tesseract\/+bug\/224264\">nasty bug in Ubuntu<\/a>.<\/p>\n<p>And now, finally, run:<br \/>\n<code><br \/>\n$ tesseract <em>foo<\/em>.tif <em>bar<\/em> # give the output name without the .txt extension, or you will end up with <em>bar<\/em>.txt.txt<br \/>\n<\/code><\/p>\n<p>That completes the OCR; the recognized text is saved to <em>bar<\/em>.txt.<\/p>","protected":false},"excerpt":{"rendered":"<p>To get OCR in Ubuntu, you can use the open-source Tesseract OCR engine. However, it can only perform OCR on TIFF images. To OCR a PDF, we also need the Evince PDF reader, which lets us export a page as a TIFF to feed to Tesseract. 
Install tesseract in [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"http:\/\/www.compdigitec.com\/labs\/wp-json\/wp\/v2\/posts\/3"}],"collection":[{"href":"http:\/\/www.compdigitec.com\/labs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.compdigitec.com\/labs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.compdigitec.com\/labs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.compdigitec.com\/labs\/wp-json\/wp\/v2\/comments?post=3"}],"version-history":[{"count":0,"href":"http:\/\/www.compdigitec.com\/labs\/wp-json\/wp\/v2\/posts\/3\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.compdigitec.com\/labs\/wp-json\/wp\/v2\/media?parent=3"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.compdigitec.com\/labs\/wp-json\/wp\/v2\/categories?post=3"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.compdigitec.com\/labs\/wp-json\/wp\/v2\/tags?post=3"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}