Scan Text Documents With a Camera

This is a quick guide for scanning your personal books or documents. The process takes JPEG photos of text or images, flattens and resizes them around the text area, outputs bi-level TIFF images, converts the TIFFs to a multipage PDF with JBIG2 compression, and OCRs the final PDF. The end result is a small, portable PDF of any black-and-white text document.

To keep things legal, this should only be used on your personally owned books and documents. That is, of course, unless you’re Google.

Most of this process was inspired by information from here.

Required Items

Software (for Windows, but the process can be reproduced on Linux)
- Scan Tailor
- Ant Renamer
- pdfbeads
  - Or install manually as indicated by this with RubyInstaller 1.8.6 and RMagick-2.12.0-ImageMagick-6.5.6-8-Q8
- JBIG2
- RMagick-2.12.0-ImageMagick-6.5.6-8-Q8
  - 2014-10-26
    This link is bad. The original file contained ImageMagick-6.5.6-8-Q8-windows-dll.exe and rmagick-2.12.0-x86-mswin32.gem. I'm not sure where to get them or if they can be substituted by newer versions.
- PDF-XChange Viewer

Hardware
- Camera
- Good lighting
- Tripod (if scanning the whole book)
- Remote shutter switch (if scanning the whole book)
- Something to anchor the book: tape, heavy object, … (if scanning the whole book)

1. Set the Stage

Scanning a whole book takes a while, so set up the lights and camera in a convenient location. Lights should have a high color temperature, around 5000 K.

2. Prepare the Camera

Set the camera to a reasonably high zoom level, which helps flatten the page. Make sure the battery is fully charged and the SD card is empty or has enough free space for the photos. It’s probably best to shoot at the highest resolution available.

If using a DSLR or equivalent camera, use the manual settings so the camera does not have to refocus or choose shutter speed, white balance, etc. for every shot.

3. Prepare the Book

It's probably best to scan the book one side at a time: first all the even-numbered pages, then all the odd-numbered pages.

Scan Tailor can actually recognize, split, and flatten a photo of a book lying open with both pages showing. Although this method can be quicker, it requires a better book stand (or a book willing to lie flat), increases the need for flattening and deskewing, and halves the resolution per page.

You will need to measure the DPI of the photo. One way is to use a ruler to measure the length in inches of the area the photo covers, then divide the photo's length in pixels along the same direction by that measurement. Another method is to find two points on a page that are exactly one inch apart, then use a tool like GIMP to count the pixels between them; that count is the DPI. About 6 lines of text usually span an inch.
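
The DPI arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and the sample numbers are made up for the example, not taken from any tool mentioned in this guide.

```python
def dpi_from_measurement(pixels: float, inches: float) -> float:
    """Return dots per inch: the pixel length divided by the physical length in inches."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


# Hypothetical numbers: a photo 4000 pixels tall that covers 10 inches of the page.
print(dpi_from_measurement(4000, 10))  # 400.0
```

The same function covers the one-inch method: if two points exactly one inch apart are 300 pixels apart in the photo, the result is 300 DPI.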

Secure the book so that it won't move around while flipping pages.

4. Scan and Prepare the Photos

Now start scanning! When finished, place the even-page and odd-page photos in different folders, then use Ant Renamer to rename them so that the two sets merge into a single sequence in page order.
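
If you would rather script the merge than use Ant Renamer, the idea is a simple interleave. The sketch below is an alternative under several assumptions: each folder holds only that side's photos, the files sort by name in shooting order, the odd folder starts with the first page, and the folder names and the page_0001.jpg naming scheme are hypothetical.

```python
import shutil
from itertools import zip_longest
from pathlib import Path


def interleave(odd_pages, even_pages):
    """Alternate the two sequences: odd[0], even[0], odd[1], even[1], ..."""
    merged = []
    for odd, even in zip_longest(odd_pages, even_pages):
        if odd is not None:
            merged.append(odd)
        if even is not None:
            merged.append(even)
    return merged


def list_photos(folder):
    """Return the JPEG files in a folder, sorted by file name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    )


def merge_folders(odd_dir, even_dir, out_dir):
    """Copy both sets into out_dir as page_0001.jpg, page_0002.jpg, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = interleave(list_photos(odd_dir), list_photos(even_dir))
    for number, src in enumerate(pages, start=1):
        shutil.copy2(src, out / f"page_{number:04d}.jpg")
    return len(pages)


if __name__ == "__main__":
    # Hypothetical folder names; adjust to wherever the photos were saved.
    merge_folders("odd", "even", "merged")
```

The files are copied rather than moved, so the original photos stay untouched if the order turns out to be wrong.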

5. Run Scan Tailor

Scan Tailor is easy to use, but this demonstration video explains the finer details.

6. Convert TIFF to PDF

Use pdfbeads to convert the TIFF files to a single PDF. JBIG2 compression makes the resulting PDF far smaller than the bi-level TIFF files it is built from. For example, a 230-page book that was 19 MB of 500 DPI .tif files became a 1.4 MB PDF.
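
To put those numbers in perspective, here is a small sketch of the arithmetic. The function name is made up for the example; the sizes are the ones quoted above.

```python
def compression_ratio(size_before_mb: float, size_after_mb: float) -> float:
    """Return how many times smaller the output is than the input."""
    return size_before_mb / size_after_mb


# 19 MB of TIFF files versus the 1.4 MB PDF from the example above.
print(round(compression_ratio(19, 1.4), 1))  # 13.6
```

In other words, the PDF in this example is roughly one fourteenth the size of the TIFF files.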

pdfbeads runs through the command line, so take the following steps:

- Copy jbig2.exe and pdfbeads.exe to C:\Windows
- Install ImageMagick-6.5.6-8-Q8-windows-dll.exe. You may need to restart the computer after this step.
- Open a command prompt
- Run the command pdfbeads your\path\to\*.tif > your\path\to\output.pdf

You will likely see errors in the command prompt, but the PDF should still be created and readable.
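
The same call can be scripted, for example when converting several books in a row. The following is a minimal sketch, not a tested recipe: it assumes pdfbeads is on the PATH, that it expands the *.tif wildcard itself (the script passes the pattern through unchanged), and the helper names and paths are hypothetical.

```python
import subprocess


def build_pdfbeads_command(tiff_pattern):
    """Return the argument list for the pdfbeads call used in this guide."""
    return ["pdfbeads", tiff_pattern]


def run_pdfbeads(tiff_pattern, output_pdf):
    """Run pdfbeads and write its standard output to output_pdf, like `> output.pdf`."""
    with open(output_pdf, "wb") as pdf:
        # check=False because pdfbeads may report errors and still produce a readable PDF.
        result = subprocess.run(
            build_pdfbeads_command(tiff_pattern), stdout=pdf, check=False
        )
    return result.returncode


if __name__ == "__main__":
    # Placeholder paths, mirroring the command shown in the steps above.
    code = run_pdfbeads(r"your\path\to\*.tif", r"your\path\to\output.pdf")
    print("pdfbeads exit code:", code)
```

Because a non-zero exit code does not necessarily mean failure here, open the resulting PDF to confirm it is readable instead of relying on the return value alone.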

7. OCR the PDF

Select Document, then OCR Pages, in PDF-XChange Viewer. Make sure to save the PDF again after adding the text layer. The previously mentioned 230-page, 1.4 MB file was 2 MB after adding the text layer.

Published: 2013-06-01
Last Updated: 2014-10-26