Scan Text Documents With a Camera
This is a quick guide for scanning your personal books or documents with a camera. The process takes JPEG photos of text or images, flattens and resizes them around the text area, outputs bi-level TIFF images, converts the TIFFs to a multipage PDF file with JBIG2 compression, and runs OCR on the final PDF. The end result is a small, portable PDF of any black-and-white text document.
To keep things legal, this should only be used on your personally owned books and documents. That is, of course, unless you’re Google.
Most of this process was inspired by information from here.
Software (for use on a Windows OS, but can be duplicated in Linux)
- Scan Tailor
- Ant Renamer
- pdfbeads, or install it manually as indicated by this with RubyInstaller 1.8.6 and RMagick-2.12.0-ImageMagick-6.5.6-8-Q8
  Note: this link is bad. The original file contained rmagick-2.12.0-x86-mswin32.gem. I'm not sure where to get these files or whether newer versions can be substituted.
- PDF-Xchange Viewer
Hardware
- Good lighting
- Tripod (if scanning the whole book)
- Remote shutter switch (if scanning the whole book)
- Something to anchor the book: tape, a heavy object, … (if scanning the whole book)
1. Set the Stage
Scanning a whole book will take a while, so set up the lights and camera in a convenient location. The lights should have a high color temperature, around 5000 K.
2. Prepare the Camera
Make sure the camera is at a reasonably high zoom level. This helps flatten the page. Make sure it’s fully charged and the SD card is empty or has enough memory for the photos. It’s probably best to shoot with the highest resolution available.
If using a DSLR or equivalent camera, use the manual settings so the camera does not have to refocus or choose the shutter speed, white balance, etc. for every shot.
3. Prepare the Book
It's probably best to scan the book one side at a time. This means all even-numbered pages are scanned in one pass, then all odd-numbered pages in another.
Scan Tailor can actually recognize, split, and flatten a photo of an open book showing both pages. Although this method can be quicker, it requires a better book stand (or a book willing to lie flat), increases the need for flattening and deskewing, and halves the resolution per page.
You will need to measure the DPI of the photo. One way is to use a ruler to measure the length, in inches, of the area the photo covers, then divide the photo's length in pixels along that same direction by the measured inches. Another method is to find two points on a page that are exactly one inch apart, then use a tool like GIMP to measure the distance in pixels between those two points; that pixel count is the DPI. As a rule of thumb, about 6 lines of text span one inch.
Secure the book so that it won't move around while flipping pages.
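The DPI calculation is just pixels divided by inches. A minimal sketch, where the function name and the sample measurements (4000 px, 9.5 in, 415 px) are made up for illustration and are not from any real scan:

```python
def dpi_from_span(pixels: float, inches: float) -> float:
    """Return dots per inch for a distance known in both pixels and inches."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


# Method 1 (hypothetical numbers): the photo is 4000 px along its long edge
# and that edge covers 9.5 in of the page as measured with a ruler.
print(round(dpi_from_span(4000, 9.5)))  # 421

# Method 2 (hypothetical numbers): two points one inch apart on the page
# are 415 px apart in the photo.
print(round(dpi_from_span(415, 1.0)))  # 415
```

Both methods should give roughly the same number for the same camera position; a large disagreement suggests a measuring mistake.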
4. Scan and Prepare the Photos
Now start scanning! When finished, place the even-page and odd-page photos in different folders. Use Ant Renamer to rename them so that the two sets merge together in page order.
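If you would rather script the merge than use Ant Renamer, the interleaving can be sketched in Python. This is only an illustration under a few assumptions: both folders sort into ascending page order by file name, the first odd photo is page 1, and the function name and the page_0001.jpg naming scheme are invented here. If the even pages were shot back to front, reverse that list first.

```python
import shutil
from pathlib import Path


def interleave(odd_dir, even_dir, out_dir, pattern="*.jpg"):
    """Copy photos into out_dir as page_0001, page_0002, ... alternating odd/even."""
    odd = sorted(Path(odd_dir).glob(pattern))
    even = sorted(Path(even_dir).glob(pattern))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    copied = []
    for i in range(max(len(odd), len(even))):
        # The i-th odd photo is page 2i+1, the i-th even photo is page 2i+2.
        for batch, page in ((odd, 2 * i + 1), (even, 2 * i + 2)):
            if i < len(batch):
                src = batch[i]
                dst = out / f"page_{page:04d}{src.suffix.lower()}"
                shutil.copy2(src, dst)  # copy, so the originals stay untouched
                copied.append(dst)
    return copied
```

A call such as interleave("odd", "even", "merged") would then fill the hypothetical "merged" folder with files that sort in reading order. Note that on Linux the "*.jpg" pattern does not match files ending in ".JPG".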
5. Run Scan Tailor
Scan Tailor is easy to use but this demonstration video explains the finer details.
6. Convert TIFF to PDF
Use pdfbeads to convert the TIFF files to a single PDF. The JBIG2 compression algorithm produces extremely small PDFs from bi-level TIFFs. For example, a 230-page PDF that came to 1.4MB started out as 19MB of 500 dpi .tif files.
pdfbeads runs through the command line, so take the following steps:
- Copy jbig2.exe and pdfbeads.exe to C:\Windows so that both are on the PATH
- Install ImageMagick-6.5.6-8-Q8-windows-dll.exe. You may need to restart the computer after this step.
- Open a command prompt
- Run the command:
pdfbeads your\path\to\*.tif > your\path\to\output.pdf
You will likely see errors in the command prompt, but the PDF should still be created and be readable.
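The same command can be driven from a script, which avoids retyping paths for each book. The sketch below is an assumption-laden illustration, not part of pdfbeads: the function names are invented, it assumes pdfbeads is on the PATH, and it relies only on the behavior shown above (file names as arguments, PDF written to standard output). Because errors are expected, it returns the error text instead of stopping.

```python
import subprocess
from pathlib import Path


def build_command(tiff_dir):
    """Return the pdfbeads argument list for every .tif in tiff_dir, sorted by name."""
    tiffs = sorted(str(p) for p in Path(tiff_dir).glob("*.tif"))
    if not tiffs:
        raise FileNotFoundError(f"no .tif files in {tiff_dir}")
    return ["pdfbeads", *tiffs]


def run_pdfbeads(tiff_dir, output_pdf):
    """Run pdfbeads and save its standard output (the PDF) to output_pdf."""
    with open(output_pdf, "wb") as pdf:
        result = subprocess.run(
            build_command(tiff_dir),
            stdout=pdf,              # same effect as "> output.pdf"
            stderr=subprocess.PIPE,  # keep the error messages for inspection
        )
    return result.returncode, result.stderr.decode(errors="replace")
```

Sorting by name matters here because the page order of the PDF follows the order of the arguments, so the TIFF file names need to sort in reading order.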
7. OCR the PDF
In PDF-Xchange Viewer, select Document, then OCR Pages. Make sure to resave the PDF after adding the text layer. The previously mentioned 230-page, 1.4MB file grew to 2MB after adding the text layer.
Last Updated: 2014-10-26