Tutorial: Command-line OCR on a Mac

The best OCR engine out there is also free: it’s called Tesseract, and it should do a pretty good job converting your pdfs to readable text. But you have to install it from the command line. Here are the steps that takes on a Mac.

Step 1: Install Xcode

To install most open-source programs you need compilers, which turn human-readable (sort of) code into programs your computer can run. Apple doesn’t include these in basic OS distributions anymore, so you have to download Xcode from Apple if you don’t have it already. In more recent versions of OS X you can install it from the App Store; if you downloaded it instead, run the installer.
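If you’re not sure whether the compilers are already on your machine, a quick check from the terminal looks something like the sketch below. It assumes a Mac recent enough to ship Apple’s “xcode-select” tool; the function name “have_dev_tools” is made up for this example.

```shell
# Returns success if xcode-select reports an installed developer directory.
# "have_dev_tools" is a name invented for this sketch.
have_dev_tools() {
  command -v xcode-select >/dev/null 2>&1 && xcode-select -p >/dev/null 2>&1
}

if have_dev_tools; then
  echo "Developer tools found at: $(xcode-select -p)"
else
  # On recent versions of OS X this opens Apple's installer dialog:
  #   xcode-select --install
  echo "Developer tools not found; try: xcode-select --install"
fi
```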

Step 2: Install Homebrew

Homebrew is a program that lets you build and install free programs on your Mac; it uses the compiler in Xcode, and it keeps you up-to-date with the freshest versions of everything it installs.

Rather than download it, you’ll install it using the command line. Open up a terminal window (search for the application “Terminal” and open it) and type in the one-line installation command given on the Homebrew website.

Things should churn for a bit, and then it’s installed! If that doesn’t happen, note that Homebrew seems to change the installation command every few months, and these instructions may not keep up. Just go to the Homebrew website for the latest instructions.
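To see whether the install worked, you can ask the shell where “brew” lives. The sketch below does that and, if Homebrew is missing, prints the installer command as the Homebrew site has listed it recently. That URL is an assumption on my part and may well have changed again, so check the website before running it.

```shell
# Installer URL as recently listed on the Homebrew site (an assumption; verify first).
installer_url="https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh"

if command -v brew >/dev/null 2>&1; then
  echo "Homebrew is installed at: $(command -v brew)"
else
  echo "Homebrew not found. The install command has recently looked like:"
  echo "  /bin/bash -c \"\$(curl -fsSL $installer_url)\""
fi
```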

Step 3: Install Tesseract

Tesseract is the best OCR engine out there, but it has to be run on the command line. You’ll also need a program for working with pdfs called Ghostscript. Now that you’ve installed Homebrew, that’s easy: just type a command into a terminal window that tells Homebrew to install Tesseract and Ghostscript.
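With Homebrew in place, the install is typically a one-liner, shown as a comment in the sketch below; “tesseract” and “ghostscript” are the formula names as I understand them, so treat them as assumptions if Homebrew complains. The rest of the sketch just checks that the two programs (“tesseract” and Ghostscript’s “gs”) ended up where the shell can find them.

```shell
# The install itself (formula names assumed to be current):
#   brew install tesseract ghostscript

# Afterwards, confirm both command-line tools are on the PATH.
missing=""
for tool in tesseract gs; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found $tool at $(command -v "$tool")"
  else
    echo "missing $tool"
    missing="$missing $tool"
  fi
done
```

If anything shows up as missing, that usually means the install step did not finish.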

Step 4: Convert your pdf

Tesseract only works with certain kinds of image files, not with raw pdfs. You need to break a pdf up into one image file per page before you can perform OCR. This is a pain, but I’ve already written a little program that does it.

Download this file (“TesseractPDF.sh”) and put it in the same folder as your pdf; then open up a terminal window again. (Or use the one you already have open.)
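If you’re curious what a script like this has to do, here is a minimal sketch of the idea. It is not the contents of TesseractPDF.sh: the function name “ocr_pdf”, the 300 dpi resolution, the grayscale TIFF output, and the file-naming pattern are all my assumptions for illustration.

```shell
# Hypothetical helper for illustration; not the downloaded script.
# Usage: ocr_pdf path/to/file.pdf
ocr_pdf() {
  pdf="$1"
  base=$(basename "$pdf" .pdf)
  mkdir -p images texts

  # Ghostscript renders each page to its own grayscale TIFF at 300 dpi;
  # %04d in the output name becomes the zero-padded page number.
  gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 \
     -sOutputFile="images/${base}_%04d.tif" "$pdf" || return 1

  # Tesseract adds ".txt" to whatever output name it is given.
  for img in images/"${base}"_*.tif; do
    [ -e "$img" ] || continue
    page=$(basename "$img" .tif)
    tesseract "$img" "texts/$page"
  done
}
```

You would call it as “ocr_pdf pdfs/myfile.pdf”, where “myfile.pdf” stands in for your own file.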

Inside the terminal, you need to switch to the proper directory. To do this, use the program “cd” (for “change directory”) to change your location. The easiest way is to type “cd” and a space, then grab the folder icon at the top of the Finder window, drag it into the terminal, and release; on new versions of OS X, this will just deposit the folder’s path. You should end up with a “cd” command followed by the path to your folder, with your own username in it.
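Typed out by hand, the command looks something like the sketch below. The folder “Documents/ocr” is purely hypothetical, and “$HOME” is the shell’s shorthand for your own user folder; substitute whatever path dragging your folder produces.

```shell
# Hypothetical location; use the path that dragging your own folder produces.
project_dir="$HOME/Documents/ocr"

# The quotes keep a folder name with spaces in one piece.
if cd "$project_dir" 2>/dev/null; then
  echo "Now working in: $(pwd)"
else
  echo "No such folder: $project_dir"
fi
```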

You’re almost ready to run the program you’ve downloaded (“sh” will interpret the shell program as code). First, though, you need to create a new subfolder called “pdfs” in that directory and put your pdf files in it.

It’s probably easiest to do that in the Finder, but you could also do it with two commands in the terminal: one to make the folder and one to move the pdfs into it.
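A sketch of the terminal route, assuming you have already changed into the folder that holds your pdfs. The loop is only there so that nothing complains when no pdf is present; with pdfs in the folder it boils down to “mkdir pdfs” followed by “mv *.pdf pdfs”.

```shell
# Make the subfolder; -p keeps mkdir quiet if it already exists.
mkdir -p pdfs

# Move every pdf in the current folder into it, skipping the case where none exist.
for f in *.pdf; do
  if [ -e "$f" ]; then
    mv "$f" pdfs/
  fi
done
```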

To reiterate: you should now have a folder that contains the file “TesseractPDF.sh” and a subfolder called “pdfs” with your pdf files in it. In the Finder, that will look like this:

[Screenshot of the folder in the Finder, 2013-10-15]

If you type “ls” in the terminal, it should list both “TesseractPDF.sh” and “pdfs”.

In the terminal, you’ll then want to run the script with “sh”.
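As a sketch, the run step could look like the following. Whether the script expects any arguments isn’t spelled out here (one commenter below passed it a pdf filename), so the bare invocation is an assumption; “ready_to_run” is a name invented for this example.

```shell
# Success only if the script and the pdfs folder are both in the current directory.
# "ready_to_run" is a name invented for this sketch.
ready_to_run() {
  [ -f TesseractPDF.sh ] && [ -d pdfs ]
}

if ready_to_run; then
  sh TesseractPDF.sh
else
  echo "Expected TesseractPDF.sh and a pdfs folder in $(pwd)"
fi
```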

And now you should have two new folders: one, “images,” contains a picture file of each page; the other, “texts,” contains one text file for each page, nicely OCRed, that you can open and read.

Optionally, if you want a single text file from all the individual ones, you can lump them together with a single command in the terminal.
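One way to do the lumping is with “cat”, which prints files one after another; the sketch below wraps it in a small function. The function name “combine_texts” and the output filename in the usage note are my own placeholders. The shell expands “*.txt” in alphabetical order, so the pages only come out in sequence if their filenames sort that way (zero-padded page numbers do).

```shell
# Hypothetical helper: join every .txt file in a folder into one file.
#   $1: folder holding the per-page text files
#   $2: output file (keep it outside that folder)
combine_texts() {
  cat "$1"/*.txt > "$2"
}
```

Run from the folder that contains “texts”, that would be “combine_texts texts combined.txt”.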

6 thoughts on “Tutorial: Command-line OCR on a Mac”

  1. I just wanted to say thanks for this article! I have a boatload of non-OCR’d PDFs to convert and this is making magic happen. I did run into one issue with the shell script: when I ran it I’d get an ‘unexpected end of file’ error. I just had to add ‘done’ to the end of the do-while loop, and it worked great. Thanks again!

  2. I run the following in terminal yielding result listed below:
    sh TesseractPDF.sh ESP.pdf

    Unsupported image type.
    Tesseract Open Source OCR Engine v3.02.02 with Leptonica
    Error in findTiffCompression: function not present
    Error in pixReadStreamTiff: function not present
    Error in pixReadStream: tiff: no pix returned
    Error in pixRead: pix not read
    Unsupported image type.

    Oddly, the tiffs appear in images, but no files in the text folders?

  3. Hi! Thanks for this. I almost get there, but get the following message in Terminal:

    TesseractPDF.sh: line 25: [: missing `]'
    TesseractPDF.sh: line 29: gs: command not found

    In my directory, I now have the following structure:

    Both myFile folders are empty. Thoughts?
