Sunday, 28 December 2014

SQL Theory: Structured documents vs Unstructured documents

Understanding the Differences Between Structured and Unstructured Documents



Differences Between the Two Document Types



What is the difference between structured and unstructured documents? With a structured document, certain information always appears in the same location on the page. For example, in an employment application the applicant's name always appears in the same box in the same place on the document. In contrast, an unstructured document has the opposite characteristic: information can appear in unexpected places on the document. Examples include a hand-written note or a whitepaper.


Some documents, such as invoices, share the characteristics of both types. For example, a single supplier's invoices feel like structured documents because they have a consistent appearance from one billing period to the next. However, when viewed in aggregate by an accounts payable department that receives thousands of invoices daily in a myriad of different formats, they seem more like unstructured documents.



What About Template-Based OCR Systems?



Some document imaging systems advocate template-based OCR (optical character recognition) to capture the information needed to identify the document for later retrieval. They call this pixie dust, where you don’t need to do anything with the documents other than load the automatic document feeder. Unfortunately, this solution only works well with structured documents, and it is not 100% accurate even under the best conditions. (For more information on the accuracy of OCR, read our whitepaper on that subject.)
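
To make the limitation concrete, here is a minimal sketch of what template-based capture amounts to once the OCR engine has turned a page into lines of text. It is an illustration under stated assumptions, not how any particular product works: the template coordinates, the field names, and the extract_fields helper are all hypothetical.

```python
# Hypothetical template: field name -> (line index, start column, end column).
# These coordinates are invented for illustration only.
TEMPLATE = {
    "name": (0, 6, 26),
    "position": (1, 10, 30),
}


def extract_fields(page_lines, template):
    """Read each field from its fixed zone in the recognized page text."""
    fields = {}
    for field, (row, start, end) in template.items():
        # A missing line yields an empty value instead of an error.
        line = page_lines[row] if row < len(page_lines) else ""
        fields[field] = line[start:end].strip()
    return fields


# A structured page: every value sits where the template expects it.
structured_page = [
    "Name: Jane Doe",
    "Position: Analyst",
]

# The same information as free-form text: the zones no longer line up.
unstructured_page = [
    "Hello, my name is Jane Doe and",
    "I would like to apply for the Analyst role.",
]

print(extract_fields(structured_page, TEMPLATE))
print(extract_fields(unstructured_page, TEMPLATE))
```

On the structured page the zones return the applicant's name and position cleanly. On the free-form letter the same zones return fragments of sentences, which is why a fixed template cannot index unstructured documents.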


Needless to say, you will need a different method to capture the key information needed to retrieve unstructured documents. In many organizations, unstructured documents represent the majority of the documents that will be imaged with a document imaging system.
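
The post does not say which method to use instead. One possibility, shown here purely as an assumption, is to search the whole recognized text for a pattern rather than reading a fixed position. The pattern and the find_invoice_number helper below are hypothetical, and a real system would likely need many such rules or manual indexing.

```python
import re

# Hypothetical pattern: the word "invoice", an optional "no."/"number"/"#",
# then an identifier that must contain at least one digit.
INVOICE_PATTERN = re.compile(
    r"invoice\s*(?:no\.?|number|#)?\s*[:\-]?\s*([A-Z0-9\-]*\d[A-Z0-9\-]*)",
    re.IGNORECASE,
)


def find_invoice_number(text):
    """Return the first invoice identifier found anywhere in the text, or None."""
    match = INVOICE_PATTERN.search(text)
    return match.group(1) if match else None


print(find_invoice_number("Please find attached invoice no. INV-2041 for December."))
print(find_invoice_number("Ref: Invoice #77813, due in 30 days"))
print(find_invoice_number("Thanks for lunch yesterday."))
```

Because the search does not depend on where the value sits on the page, it tolerates the varied layouts of invoices received from many suppliers, at the cost of occasional misses and false matches.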



Characteristics of Structured and Unstructured Documents




Type of Document        Structured                                            Unstructured
Characteristics         Familiar data appears in the same place every time.   Data appears in unexpected places in the document.
Examples                Insurance claim form; employment application          A letter; a hand-written note
Used by Organizations   Low volume operations; internally created invoices    High volume operations; invoices received from outside the organization




Conclusion




Every organization will have both structured and unstructured documents to contend with. It is generally a good idea to purchase a document imaging system that offers the maximum capabilities to deal with both types of documents, rather than a system that caters to only a single document type.
