Records Management Blog | Practical Records Management

Google Expands OCR Capabilities for Document Scanning

Posted by Michael Thomas on Tue, Jun 29, 2010 @ 12:34 PM
Google is Now Offering Free OCR Services for Scanned Documents. Google announced yesterday that there is a new feature available in Google Docs to allow users to import Scanned Documents. The feature, described as "Convert Text from PDF or image files to Google Docs Documents," allows users to import a Scanned PDF or Image File (JPEG, GIF, or PNG).

The release raises some questions: whether it signals an intention by Google to broaden the scope of the Google Docs platform, how the new OCR functionality works, and what it actually provides. These questions include:

What OCR Engine is Google using in the Google Docs Platform?

The OCR Engine used by Google in this process is not immediately clear. Google does Sponsor an Open Source OCR Engine and Document Analysis Platform called OCRopus, but Google hasn't publicly acknowledged that this is the technology being used by any of their services, including Google Books or the new Google Docs OCR Functionality.

Does Google Docs OCR Work with TIF Files?

During our testing, we noticed that the OCR functionality didn't work for one of the most standard image formats that we find clients using, TIF Images. TIF, or TIFF (Tagged Image File Format), Images are widely considered an Industry Standard for Scanning Paper Documents, so I found the absence of this functionality to be surprising.

For those looking to convert TIF images, you may want to use Adobe Acrobat or another utility to convert TIF files to PDF, or check out ABBYY FineReader Online. For organizations looking to convert large volumes of information, I would recommend using alternate document capture software to run OCR on your images.

How Well Does Google Docs OCR Work?

The technology is still a bit new, as it was only released yesterday, but Ars Technica did some testing and was nice enough to summarize their experiences. Their results were about the same as the results we experienced during our testing, and they summarized their findings: "There are still cases where this OCR would be better than nothing." Not quite the ringing endorsement that you'd hope to see attached to a Google Service, but the offering is still new.

Because of the way the import mechanism is configured, Google Docs OCR may not be the best document scanning solution for every business case, especially if you're looking to convert a large volume of paper documents to digital images. For ad-hoc, low-volume OCR requirements, however, the Google Docs OCR functionality serves as a solid utility for converting paper into usable text.

Have you tried the Google Docs OCR tool yet? What have your experiences been? Have you had better success with other services or software? Share your experiences in the Comments!

Tags: Scanning, Document Scanning, Google

No Simple Tools for the Paperless Office

Posted by Michael Thomas on Tue, May 11, 2010 @ 06:14 PM

There’s a great post over on Mashable – 5 Simple Tools for a Paperless Office. It’s really a great cheat sheet on how to use some really cool, useful software and online services to reduce paper and become more efficient, but I found myself reading it and thinking about the clients that I’ve worked with over the past ten years. These are all very cool tools, but I don’t think that they’re going to really take you any closer to a paperless office.

More importantly, not one of them is going to ensure that you’ll actually be able to FIND something when you go to look for it. And therein lies the problem facing businesses today. Document Management tools have some fantastic bells, whistles, and doodads, but do any of them actually do anything useful? Honestly, out of the whole list presented, the mention of Google Apps for Business in the footnote is probably the most useful of all of the items.

Google Apps for Business is missing one key component, however. They still haven’t added the viewer. Plug Google's fancy document viewer (you know, the one they use for Google Books) into Google Apps and bam! Now you’ve really got a solution for a Paperless Office. You’d be able to tag metadata, search, and view documents from anywhere that there’s a browser.

Except… There are no audit trails. There are still no annotation capabilities. There’s no really great, easy way to efficiently scan large volumes of documents to Google Apps. Perhaps most importantly, as these tools become more consumer-like, the biggest thing that’s missing isn’t feature functionality; it’s guidance and best practices. Scanning files isn’t hard. It’s preparing them for scanning, indexing them, and making sure that people can find them that makes it difficult.

Tags: Paperless Office, Document Management, Document Scanning, Google

Document Management Innovations - Classification Technology Update

Posted by Michael Thomas on Thu, Apr 15, 2010 @ 11:31 AM

There's a fantastic post on Gary Rylander's Compliance Guy Blog today about new innovations and success rates regarding Automated Document Classification Technology. While this information doesn't interest many in the Records and Document Management Space, it's extremely interesting from a technology perspective.

The story features a terrific comparison between Human Document Review and review by the Machines, and the results are very interesting. When reviewing, keep in mind that Intelligent Document Classification requires that content exist in a digital format, meaning that Paper Documents need to be OCR'd in order to benefit from these technologies.

From a Solutions perspective, we continue to keep our ear to the ground on these trends. Companies like IBM and one of our Partners, Attensity, offer great solutions to Document Classification challenges, but they're all driven by a very precise ROI, making them inaccessible for many firms.

What really gets me interested in this, however, is the impact of these tools on projects like those that Google is undertaking with Google Books, and the value that they have as we continue to march toward the Semantic Web. I believe that is where all of this advanced AI technology will prove its value at the consumer and mid-market levels.

Tags: Document Management, Enterprise Search, Google

Three Keys to Going Paperless

Posted by Michael Thomas on Mon, Mar 08, 2010 @ 07:27 AM

The unending effort by business managers and executives to "go paperless" is often looked at as a losing battle. In reality, however, there are at least three things that everyone can do in their daily work to reduce the amount of paper being created and stored in their daily workflow.

1) Consolidate the Inboxes! - Few daily tasks matter more to the average professional than checking our inboxes. The key to the paperless office, therefore, is to consolidate the number of inboxes that we have. Most of us have seven inboxes before we even give it a moment's thought - Office Email, Personal Email, Office Voicemail, Personal Voicemail, Office Paper Mail, Personal Paper Mail and Office Fax. When we add in the checking of the Personal Calendar and Office Calendar so we don't schedule an after-hours business meeting to conflict with the Kids' Soccer Game, we're already up to nine different inboxes that we're monitoring on a daily basis.

Before we can even think of going "Paperless," we need to consolidate the number of inboxes. If you currently have more than two email addresses, consolidate. The truth is that most people don't need more than two email addresses, and great tools like Google Voice (from the efficiency-master - Google) can go a long way toward helping consolidate the voicemail boxes.

2) Sign up for Online Billing & Banking - It's still amazing how many people don't choose the paperless billing options offered by their utilities and vendors. Not only is there an immediate benefit of being able to reduce the amount of incoming paper, but you'll help save the environment and reduce your risk of identity theft at the same time.

Many people are not aware of the risks associated with Paper Mail, but the numbers tell the story, and the truth is that most identity theft is done by Friends and Neighbors and most originates through paper mail. More information about the hard numbers can be found here: http://www.consumeraffairs.com/news04/2006/01/id_theft_survey.html , but suffice it to say that Online is the safer bet for the ID-minded customer.

3) Scan, Store & Retrieve - Paper is one of the media types that we deal with most frequently, and it happens to be one of the most space-consuming and inefficient. If legally possible, scanning documents can be a great alternative to maintaining hard copy records.

With the cost of desktop scanners dropping every day, and most offices having networked multi-function devices that allow for quick and easy capture of documents to network drives, scanning documents is both easier and more accessible than ever before. Take caution in the process, however, as relying only on OCR (or Optical Character Recognition) for retrieval is still not foolproof, and you'll likely want to tag your documents with some useful keywords to help you locate them when needed.
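As a rough illustration of the keyword tagging suggested above, here is a minimal Python sketch of a tag index for scanned files. The `TagIndex` class and the file paths are hypothetical, invented for this example; a real setup would more likely rely on the tagging features of your document management tool.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical keyword index: maps each tag to the scanned files carrying it."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, path, *keywords):
        # Normalize keywords so "Lease" and "lease " land in the same bucket.
        for kw in keywords:
            self._by_tag[kw.strip().lower()].add(path)

    def find(self, *keywords):
        # Return the sorted paths that carry ALL of the given keywords.
        if not keywords:
            return []
        sets = [self._by_tag.get(kw.strip().lower(), set()) for kw in keywords]
        return sorted(set.intersection(*sets))


# Example paths are made up for illustration.
index = TagIndex()
index.tag("scans/2010-03-lease.pdf", "lease", "office", "2010")
index.tag("scans/2010-02-invoice.pdf", "invoice", "2010")
print(index.find("lease", "2010"))
```

Even a handful of consistent keywords per document gives you a retrieval path that doesn't depend on how well the OCR read the page.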

What are your best tips for going paperless? What are the biggest obstacles you've faced? Let us know in the comments!

Tags: Paperless Office, Scanning, Enterprise Search, File Storage, Records Management, Google

Enterprise Search - To Search...or To Find?

Posted by Michael Thomas on Thu, Feb 25, 2010 @ 12:27 PM
For years, I've spoken with clients about making all of their companies' information available to users just like a "Google" search. Recently, I've begun hearing from clients and prospects, however, that they're beginning to see tools like Google Search Appliance (GSA) as an alternative solution to true Enterprise Content Management. This question is extremely valid, and to be honest, I couldn't put my finger on exactly what the right answer was.

In early March 2009, I had an opportunity to pose this question to Miguel Zubizarreta, CTO of Hyland Software, during their Team OnBase Conference. Miguel answered the question for me in a way that finally made it clear to me how ECM Technology and Enterprise Search Technology are different, and perhaps just as importantly, how they need to co-exist. "The Google, or Enterprise Search Paradigm is different from that of ECM," said Zubizarreta. "When you do a search online, in Google, for example, the user is looking for content that relates to a specific search term. What is returned is a series of results which have a relationship with the query. If a user has their question answered or issue resolved, or if they find something relevant to that query, they're satisfied and the search was a success. The question, however, is does that query provide ALL applicable results, or does it even provide the correct result? That is the purpose of ECM."

ECM technology is built on the premise that documents and content are indexed with certain, specific pieces of metadata, or keywords. This allows for a very high degree of specificity when conducting a search - Dates, Amounts, and other Identification are specifically tagged to that object. When searching within an Enterprise Search Platform, however, the results are largely based on an algorithmic scoring and interpretation of the text within that content. This means that the search may locate the item that you're searching for, but it also may neglect certain documents that were relevant to the query, but which were not defined to have a direct relationship with that document via common metadata or keywords.
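To make the contrast concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not how any particular ECM product or search appliance works: the sample documents, the field names, and the term-counting score are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


# Hypothetical sample records, invented for illustration only.
DOCS = [
    Document("inv-001", "Invoice for Acme Corp, payment due on receipt",
             {"type": "invoice", "vendor": "Acme"}),
    Document("inv-002", "Statement of charges, remit within 30 days",
             {"type": "invoice", "vendor": "Acme"}),
    Document("memo-01", "Memo: the Acme invoice process is changing",
             {"type": "memo", "vendor": "Acme"}),
]


def metadata_search(docs, **criteria):
    """ECM-style lookup: every document whose metadata matches all criteria."""
    return [d.doc_id for d in docs
            if all(d.metadata.get(k) == v for k, v in criteria.items())]


def full_text_search(docs, query):
    """Search-style lookup: rank by query-term occurrences in the text."""
    terms = query.lower().split()
    scored = []
    for d in docs:
        words = d.text.lower().replace(",", " ").replace(":", " ").split()
        score = sum(words.count(t) for t in terms)
        if score > 0:  # documents with no matching terms are dropped
            scored.append((score, d.doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


print(metadata_search(DOCS, type="invoice", vendor="Acme"))
print(full_text_search(DOCS, "acme invoice"))
```

In this toy data set, the metadata lookup returns both Acme invoices, while the text query returns the first invoice and the memo but misses the second invoice, whose text never mentions "Acme" or "invoice".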

Therefore, the real difference between the technologies reflects the two purposes for their use. If you want to find ANY answers to a given query, then Enterprise Search is likely an acceptable strategy. If, however, you want to be able to find ALL applicable answers to that query, then ECM technology is likely to be the better choice. This is especially important for CTOs and other technology decision makers to understand as they invest in systems, because without accurately capturing ALL of the possible results, there could be dire consequences for the business.

In addition to the difference in the search paradigm, other functionality, such as version control, audit trails, and workflow, is often inherent to ECM systems. This makes ECM a critical component for a company's regulatory and audit compliance. This doesn't make these two technologies mutually exclusive, but rather quite complementary to each other. The ability to provide full-text search capabilities within an ECM System is often very useful for eDiscovery Processes, or when someone remembers an obscure fact about a given document, but may not remember where they read it. Likewise, the ability to conduct a broad Enterprise Search and view all of the ECM Results within the same window can provide a great method of federating the retrieval of content that may not be stored within the ECM System. So while neither ECM nor Enterprise Search alone is the panacea for all that ails enterprise information management, together the two technologies can go a long way toward improving the end-user experience and helping users find the information that they need to make better, more informed business decisions.

Tags: ECM, Enterprise Search, OnBase, Google