Monday, September 28, 2009

Google Docs now performs OCR!!

Google is constantly working on new features for its online services such as Gmail and Google Docs. The latest one, OCR (optical character recognition), is currently available only as a demonstration and is not yet integrated into Google Docs. The Google Docs OCR demonstration accepts three image formats: JPG, PNG and GIF. Google lists the following limitations that are currently in place:

  • Files must be fairly high-resolution; the rule of thumb is a character height of at least 10 pixels.
  • Maximum file size: 10 MB; maximum resolution: 25 megapixels.
  • The larger the file, the longer the OCR operation takes (500 KB: ~15 s, 2 MB: ~40 s, 10 MB: forever).
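
The limits above can be checked locally before uploading. The following is only a rough sketch: the function name `check_ocr_limits` is made up for illustration, "10MB" is assumed to mean 10 × 1024 × 1024 bytes, and `.jpeg` is assumed to be accepted alongside `.jpg`. The caller supplies the image dimensions, so no imaging library is needed.

```python
import os

# Limits quoted from Google's list; the byte and pixel conversions are assumptions.
MAX_BYTES = 10 * 1024 * 1024   # "10MB", assumed to mean 10 MiB
MAX_PIXELS = 25_000_000        # "25 mega pixel"
MIN_CHAR_HEIGHT = 10           # rule-of-thumb character height, in pixels
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}  # .jpeg is an assumption


def check_ocr_limits(filename, size_bytes, width, height, char_height=None):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append("unsupported format: " + (ext or "(none)"))
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 10 MB")
    if width * height > MAX_PIXELS:
        problems.append("resolution above 25 megapixels")
    # Character height is optional because it usually has to be estimated by eye.
    if char_height is not None and char_height < MIN_CHAR_HEIGHT:
        problems.append("characters shorter than 10 pixels")
    return problems
```

For example, `check_ocr_limits("HTTP.jpg", 500_000, 1000, 800)` returns an empty list, while a 6000 × 5000 pixel image would be flagged for exceeding 25 megapixels.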

You can upload an image containing typewritten or printed text (such as a fax or a scanned newspaper clipping) to your Google Docs account, and it will turn that image into editable text. The result depends largely on the quality of the image, so it is usually necessary to read through the text and correct errors made during character recognition. Google Docs helps by underlining unknown words in red, but correcting the errors still takes some time.

Below are the steps to get started, along with a sample image provided by Google. Google offers a sample form for uploading scanned images to your Google account; the server then tries to extract text from them automatically, provided the image resolution is good and the text in the images uses a Latin character set.

HTTP.jpg provided by Google for testing

Unfortunately, when I tried it, I received the following error:

Error processing document:
Expected response code 200, got 400 GData InvalidEntryException Could not convert document.

 

Invalid file

Please chose a JPG, GIF or PNG file not larger than 10 MB/25 mega pixel.

The OCR feature can also extract text from noisy images (such as a WSJ clipping), though the recognized text is not very accurate and the document formatting is lost.

If you are a developer, you can add the ocr=true parameter to your upload request and Google Docs will automatically scan the uploaded image for text. You can also upload images to Google Docs without the OCR parameter, but in that case the image will be converted into a new Word document with no OCR applied.
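
As a rough illustration of adding that parameter, the sketch below appends ocr=true to an upload URL's query string. The endpoint `https://docs.example.com/upload` and the helper name `build_upload_url` are placeholders made up for this example; only the ocr=true parameter itself comes from the description above, and the real upload URL, authentication and request headers are not covered here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical endpoint, used only for illustration; not the real upload URL.
UPLOAD_URL = "https://docs.example.com/upload"


def build_upload_url(base_url, ocr=True):
    """Return base_url with ocr=true in its query string when OCR is requested."""
    parts = urlsplit(base_url)
    # Drop any existing ocr parameter so it is never duplicated.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != "ocr"]
    if ocr:
        query.append(("ocr", "true"))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )
```

Calling `build_upload_url(UPLOAD_URL)` yields the placeholder URL with `?ocr=true` appended, while passing `ocr=False` leaves the URL without the parameter, which corresponds to the plain upload with no text recognition.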

Like Google Docs, Google Search also includes OCR features. The difference is that Google Docs can extract text from images, whereas the OCR in Google Search works only with scanned PDF files.
