2007-11-27

From Paper To PDF

At the request of the wife, here is a quick OCR how-to.
  1. The first thing we're going to get is some OCR software. There are lots and lots of different possibilities here, but we're going to go with something that also has a TWAIN interface to handle our scanning. We'd also like something that has fairly accurate text recognition. My choice: FreeOCR, which uses Google's open source Tesseract OCR engine. Download and install the software.
  2. Using FreeOCR is fairly simple. Make sure your scanner is set up properly. Click the Scan button to scan a page of your document; the scanned image will show up for verification in the left panel. If you like what you see, click the OCR button and the software will in due course display your text results in the right panel.
  3. There is an option in the File menu to save the results as a text file, or you can copy them to the clipboard.
Not so hard to get some text onto the computer without spending any money. Other possible software choices include Microsoft's MODI, which most current Office users will already have, or SimpleOCR. Of course, if you're going to be trying to translate complex documents with lots of columns or layout issues, or perhaps something in Japanese, you'll need more specialized (and commercial) software.
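
If you'd rather skip the clicking, the same scan-then-OCR idea can be scripted against the Tesseract engine directly. What follows is only a rough sketch, not how FreeOCR works internally: it assumes the `tesseract` command-line tool is installed and on your PATH, and that you already have a scanned page saved as an image. The function names and the `page1.png` file name are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image_path, output_base, language="eng"):
    """Build the argument list for the Tesseract command-line tool.

    Tesseract adds the ".txt" extension to output_base by itself.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def ocr_to_text(image_path, output_base, language="eng"):
    """Run Tesseract on one scanned page and return the recognized text."""
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_ocr_command(image_path, output_base, language), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Hypothetical usage, assuming a scan saved as page1.png:
# text = ocr_to_text("page1.png", "page1")
# print(text)
```

The command is built in its own function so you can print it and try it by hand in a terminal before letting the script run it.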

But what if you need to scan something -- let's say a veterinary journal article -- and your goal is for it to end up as an editable PDF document? Well. There are a number of commercial packages that claim to do something like this. For money. All are very thin on examples of their effectiveness (and one detailed example confirmed my suspicion that no OCR actually takes place in its PDF conversion), but except for Acrobat they offer trial downloads. Since I don't happen to have a scanner connected to my computer, I'm forced to stop here. Sorry, wife!
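
For anyone who does have a scanner handy, here is one untested-by-me direction to try. Reasonably recent versions of the Tesseract command-line tool accept a `pdf` config name that writes a PDF instead of plain text. As far as I know the result is a searchable PDF (the page image with a hidden text layer you can select and copy), which is not the same thing as a freely editable document. The sketch below assumes such a Tesseract version is installed and on your PATH; the function names and `article.png` are hypothetical.

```python
import subprocess


def build_pdf_command(image_path, output_base, language="eng"):
    """Build a Tesseract command that writes output_base + ".pdf"."""
    # The trailing "pdf" is a Tesseract config name, not a file name.
    return ["tesseract", str(image_path), str(output_base), "-l", language, "pdf"]


def scan_to_searchable_pdf(image_path, output_base, language="eng"):
    """Run Tesseract on one scanned page and return the PDF file name."""
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_pdf_command(image_path, output_base, language), check=True)
    return f"{output_base}.pdf"


# Hypothetical usage, assuming a scan saved as article.png:
# pdf_name = scan_to_searchable_pdf("article.png", "article")
```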
