Sunday 28 December 2014

SQL Theory: Structured documents vs Unstructured documents

Understanding the Differences Between Structured and Unstructured Documents



Differences Between the Two Document Types



What is the difference between structured and unstructured documents? With a structured document, certain information always appears in the same location on the page. For example, in an employment application the applicants name always appear in the same box in the same place on the document. In contrast, an unstructured document has the opposite characteristics – information can appear in unexpected places on the document. An example would be in a hand written note or a whitepaper.


Some documents share the characteristics of both types of documents, such as invoices. For example, suppliers’ invoices feel like a structured document because they have a consistent appearance from one billing period to the next. However, when viewed in aggregate by an accounts payable department that receives thousands of invoices daily in a myriad of different formats; they seem more like structured documents.



What About Template-Based OCR Systems



Some document imaging systems advocate template-based OCR (optical character recognition) to capture the information needed to identify the document for later retrieval. They call this pixy dust, where you don’t need to do anything with the documents other than to load the automatic document feeder. Unfortunately this solution only works well with structured documents, and it is not 100% accurate even under the best conditions. (For more information on the accuracy of OCR, read our whitepaper on that subject).


Needless to say, you will need to have a different method to capture the key information needed to retrieve documents that are unstructured. In many organizations unstructured documents represent the majority of the documents that will be imaged with a document imaging system.



Characteristics of Structured and Unstructured
Documents




Type of Document
Structured
Unstructured
Characteristics:
Familiar data appears in the same place every time.
Data appears in unexpected places in the document.
Examples:
Insurance claim form
Employment application
A letter
A hand-written note
Used by Organizations:
Low volume operations
Internally created invoices
High volume operations
Invoices received from outside the organization




Conclusion




Every organization will have both structured and unstructured document with which to contend. It is generally a good idea to purchase a document imaging system that offers the maximum capabilities to deal with both types of documents, rather than purchasing a system that caters only to a single document type.

Saturday 27 December 2014

SQL Theory: FileTables (SQL Server)

FileTables (SQL Server)

SQL Server 2014
2 out of 3 rated this helpful Rate this topic
The FileTable feature brings support for the Windows file namespace and compatibility with Windows applications to the file data stored in SQL Server. FileTable lets an application integrate its storage and data management components, and provides integrated SQL Server services - including full-text search and semantic search - over unstructured data and metadata.
In other words, you can store files and documents in special tables in SQL Server called FileTables, but access them from Windows applications as if they were stored in the file system, without making any changes to your client applications.
The FileTable feature builds on top of SQL Server FILESTREAM technology. To learn more about FILESTREAM, see FILESTREAM (SQL Server).


Benefits of the FileTable Feature


The goals of the FileTable feature include the following:
  • Windows API compatibility for file data stored within a SQL Server database. Windows API compatibility includes the following:
    • Non-transactional streaming access and in-place updates to FILESTREAM data.
    • A hierarchical namespace of directories and files.
    • Storage of file attributes, such as created date and modified date.
    • Support for Windows file and directory management APIs.
  • Compatibility with other SQL Server features including management tools, services, and relational query capabilities over FILESTREAM and file attribute data.
Thus FileTables remove a significant barrier to the use of SQL Server for the storage and management of unstructured data that is currently residing as files on file servers. Enterprises can move this data from file servers into FileTables to take advantage of integrated administration and services provided by SQL Server. At the same time, they can maintain Windows application compatibility for their existing Windows applications that see this data as files in the file system.

What Is a FileTable?


SQL Server provides a special table of files, also referred to as a FileTable, for applications that require file and directory storage in the database, with Windows API compatibility and non-transactional access. A FileTable is a specialized user table with a pre-defined schema that stores FILESTREAM data, as well as file and directory hierarchy information and file attributes.
A FileTable provides the following functionality:
  • A FileTable represents a hierarchy of directories and files. It stores data related to all the nodes in that hierarchy, for both directories and the files they contain. This hierarchy starts from a root directory that you specify when you create the FileTable.
  • Every row in a FileTable represents a file or a directory.
  • Every row contains the following items. For more information about the schema of a FileTable, see FileTable Schema.
    • file_stream column for stream data and a stream_id (GUID) identifier. (The file_stream column is NULL for a directory.)
    • Both path_locator and parent_path_locator columns for representing and maintaining the file and directory hierarchy.
    • 10 file attributes such as created date and modified date that are useful with file I/O APIs.
    • A type column that supports full-text search and semantic search over files and documents.
  • A FileTable enforces certain system-defined constraints and triggers to maintain file namespace semantics.
  • When the database is configured for non-transactional access, the file and directory hierarchy represented in the FileTable is exposed under the FILESTREAM share configured for the SQL Server instance. This provides file system access for Windows applications.
Some additional characteristics of FileTables include the following:
  • The file and directory data stored in a FileTable is exposed through a Windows share for non-transactional file access for Windows API based applications. For a Windows application, this looks like a normal share with its files and directories. Applications can use a rich set of Windows APIs to manage the files and directories under this share.
  • The directory hierarchy surfaced through the share is a purely logical directory structure that is maintained within the FileTable.
  • Calls to create or change a file or directory through the Windows share are intercepted by a SQL Server component and reflected in the corresponding relational data in the FileTable.
  • Windows API operations are non-transactional in nature, and are not associated with user transactions. However, transactional access to FILESTREAM data stored in a FileTable is fully supported, as is the case for any FILESTREAM column in a regular table.
  • FileTables can also be queried and updated through normal Transact-SQL access. They are also integrated with SQL Server management tools, and features such as backup.

Additional Considerations for Using FileTables



Administrative Considerations

About FILESTREAM and FileTables
  • You configure FileTables separately from FILESTREAM. Therefore you can continue to use the FILESTREAM feature without enabling non-transactional access or creating FileTables.
  • There is no non-transactional access to FILESTREAM data except through FileTables. Therefore, when you enable non-transactional access, the behavior of existing FILESTREAM columns and applications is not affected.
About FileTables and non-transactional access
  • You can enable or disable non-transactional access at the database level.
  • You can configure or fine-tune non-transactional access at the database level by turning it off, or by enabling read only or full read/write access.

FileTables Do Not Support Memory-Mapped Files

FileTables do not support memory-mapped files. Notepad and Paint are two common examples of applications that use memory-mapped files. You cannot use these applications on the same computer as SQL Server to open files that are stored in a FileTable. However you can use these applications from a remote computer to open files that are stored in a FileTable, because in these circumstances the memory-mapping feature is not used.

Related Tasks


Enable the Prerequisites for FileTable
Describes how to enable the prerequisites for creating and using FileTables.
Create, Alter, and Drop FileTables
Describes how to create a new FileTable, or alter or drop an existing FileTable.
Load Files into FileTables
Describes how to load or migrate files into FileTables.
Work with Directories and Paths in FileTables
Describes the directory structure in which the files are stored in FileTables.
Access FileTables with Transact-SQL
Describes how Transact-SQL data manipulation language (DML) commands work with FileTables.
Access FileTables with File I\O APIs
Describes how file system I/O works on a FileTable.
Manage FileTables
Describes common administrative tasks for managing FileTables.

Related Content

SQL Theory: Filestream Overview, Enable Filestream, Filestream Database, Filestream Table

FILESTREAM Overview

SQL Server 2008 R2
35 out of 41 rated this helpful Rate this topic
Much data is unstructured, such as text documents, images, and videos. This unstructured data is often stored outside the database, separate from its structured data. This separation can cause data management complexities. Or, if the data is associated with structured storage, the file streaming capabilities and performance can be limited.
FILESTREAM integrates the SQL Server Database Engine with an NTFS file system by storing varbinary(max) binary large object (BLOB) data as files on the file system. Transact-SQL statements can insert, update, query, search, and back up FILESTREAM data. Win32 file system interfaces provide streaming access to the data.
FILESTREAM uses the NT system cache for caching file data. This helps reduce any effect that FILESTREAM data might have on Database Engine performance. The SQL Server buffer pool is not used; therefore, this memory is available for query processing.
NoteNote
FILESTREAM data is not encrypted even when transparent data encryption is enabled.
For a walkthrough that shows how to use FILESTREAM, see Getting Started with FILESTREAM Storage.

When to Use FILESTREAM


In SQL Server, BLOBs can be standard varbinary(max) data that stores the data in tables, or FILESTREAM varbinary(max) objects that store the data in the file system. The size and use of the data determines whether you should use database storage or file system storage. If the following conditions are true, you should consider using FILESTREAM:
  • Objects that are being stored are, on average, larger than 1 MB.
  • Fast read access is important.
  • You are developing applications that use a middle tier for application logic.
For smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.

FILESTREAM Storage


FILESTREAM storage is implemented as a varbinary(max) column in which the data is stored as BLOBs in the file system. The sizes of the BLOBs are limited only by the volume size of the file system. The standard varbinary(max) limitation of 2-GB file sizes does not apply to BLOBs that are stored in the file system.
To specify that a column should store data on the file system, specify the FILESTREAM attribute on a varbinary(max) column. This causes the Database Engine to store all data for that column on the file system, but not in the database file.
FILESTREAM data must be stored in FILESTREAM filegroups. A FILESTREAM filegroup is a special filegroup that contains file system directories instead of the files themselves. These file system directories are called data containers. Data containers are the interface between Database Engine storage and file system storage. 
When you use FILESTREAM storage, consider the following:
  • When a table contains a FILESTREAM column, each row must have a nonnull unique row ID.
  • FILESTREAM data containers cannot be nested.
  • When you are using failover clustering, the FILESTREAM filegroups must be on shared disk resources.
  • FILESTREAM filegroups can be on compressed volumes.

Integrated Security

In SQL Server, FILESTREAM data is secured just like other data is secured: by granting permissions at the table or column levels. If a user has permission to the FILESTREAM column in a table, the user can open the associated files.
NoteNote
Encryption is not supported on FILESTREAM data.
Only the account under which the SQL Server service account runs is granted NTFS permissions to the FILESTREAM container. We recommend that no other account be granted permissions on the data container.

Integrated Management

Because FILESTREAM is implemented as a varbinary(max) column and integrated directly into the Database Engine, most SQL Server management tools and functions work without modification for FILESTREAM data. For example, you can use all backup and recovery models with FILESTREAM data, and the FILESTREAM data is backed up with the structured data in the database. If you do not want to back up FILESTREAM data with relational data, you can use a partial backup to exclude FILESTREAM filegroups.

Getting Started with FILESTREAM Storage

SQL Server 2008 R2
1 out of 3 rated this helpful Rate this topic
 FILESTREAM is not automatically enabled when you install or upgrade SQL Server. You must enable FILESTREAM by using SQL Server Configuration Manager and SQL Server Management Studio. To use FILESTREAM, you must create or modify a database to contain a special type of filegroup. Then, create or modify a table so that it contains a varbinary(max) column with the FILESTREAM attribute. After you complete these tasks, you can use Transact-SQL and Win32 to manage the FILESTREAM data.

How to: Enable FILESTREAM

SQL Server 2008 R2
10 out of 11 rated this helpful Rate this topic
Before you can start to use FILESTREAM, you must enable FILESTREAM on the instance of the SQL Server Database Engine. This topic describes how to enable FILESTREAM by using SQL Server Configuration Manager.
NoteNote
You cannot enable FILESTREAM on a 32-bit version of SQL Server running on a 64-bit operating system.

To enable and change FILESTREAM settings

  1. On the Start menu, point to All Programs, point to Microsoft SQL Server 2008 R2, point to Configuration Tools, and then click SQL Server Configuration Manager.
  2. In the list of services, right-click SQL Server Services, and then click Open.
  3. In the SQL Server Configuration Manager snap-in, locate the instance of SQL Server on which you want to enable FILESTREAM.
  4. Right-click the instance, and then click Properties.
  5. In the SQL Server Properties dialog box, click the FILESTREAM tab.
  6. Select the Enable FILESTREAM for Transact-SQL access check box.
  7. If you want to read and write FILESTREAM data from Windows, click Enable FILESTREAM for file I/O streaming access. Enter the name of the Windows share in the Windows Share Name box.
  8. If remote clients must access the FILESTREAM data that is stored on this share, select Allow remote clients to have streaming access to FILESTREAM data.
  9. Click Apply.
  10. In SQL Server Management Studio, click New Query to display the Query Editor.
  11. In Query Editor, enter the following Transact-SQL code:
    EXEC sp_configure filestream_access_level, 2
    RECONFIGURE
    
  12. Click Execute.
  13. Restart the SQL Server service.

How to: Create a FILESTREAM-Enabled Database

SQL Server 2008 R2
5 out of 6 rated this helpful Rate this topic
This topic shows how to create a database that supports FILESTREAM. Because FILESTREAM uses a special type of filegroup, when you create the database, you must specify the CONTAINS FILESTREAM clause for at least one filegroup.

To create a FILESTREAM-enabled database

  1. In SQL Server Management Studio, click New Query to display the Query Editor.
  2. Copy the Transact-SQL code from the following example into the Query Editor. This Transact-SQL code creates a FILESTREAM-enabled database called Archive.
    NoteNote
    For this script, the directory C:\Data must exist.
  3. To build the database, click Execute.

Example


The following code example creates a database that is named Archive. The database contains three filegroups: PRIMARYArch1, and FileStreamGroup1.PRIMARY and Arch1 are regular filegroups that cannot contain FILESTREAM data. FileStreamGroup1 is the FILESTREAM filegroup.
CREATE DATABASE Archive 
ON
PRIMARY ( NAME = Arch1,
    FILENAME = 'c:\data\archdat1.mdf'),
FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM( NAME = Arch3,
    FILENAME = 'c:\data\filestream1')
LOG ON  ( NAME = Archlog1,
    FILENAME = 'c:\data\archlog1.ldf')
GO


For a FILESTREAM filegroup, FILENAME refers to a path. The path up to the last folder must exist, and the last folder must not exist. In this example, c:\datamust exist. However, the filestream1 subfolder cannot exist when you execute the CREATE DATABASE statement. For more information about the syntax, seeCREATE DATABASE (Transact-SQL).
After you run the previous example, a filestream.hdr file and an $FSLOG folder appears in the c:\Data\filestream1 folder. The filestream.hdr file is a header file for the FILESTREAM container.
Important noteImportant
The filestream.hdr file is an important system file. It contains FILESTREAM header information. Do not remove or modify this file.
For existing databases, you can use the ALTER DATABASE statement to add a FILESTREAM filegroup.

How to: Create a Table for Storing FILESTREAM Data

SQL Server 2008 R2
2 out of 4 rated this helpful Rate this topic
This topic shows how to create a table for storing FILESTREAM data.
NoteNote
This topic requires the Archive database that is created in How to: Create a FILESTREAM-Enabled Database.

To create a table to store FILESTREAM data

  1. In SQL Server Management Studio, click New Query to display the Query Editor.
  2. Copy the Transact-SQL code from the following example into the Query Editor. This Transact-SQL code creates a FILESTREAM-enabled table called Records.
  3. To create the table, click Execute.

Example


When the database has a FILESTREAM filegroup, you can create or modify tables to store FILESTREAM data. To specify that a column contains FILESTREAM data, you create a varbinary(max) column and add the FILESTREAM attribute.
The following code example shows how to create a table that is named Records. The Id column is a ROWGUIDCOL column and is required to use FILESTREAM data with Win32 APIs. The SerialNumber column is a UNIQUE INTEGER. The Chart column is a FILESTREAM column and is used to store the Chart in the file system.
CREATE TABLE Archive.dbo.Records
(
 [Id] [uniqueidentifier] ROWGUIDCOL NOT NULL UNIQUE, 
 [SerialNumber] INTEGER UNIQUE,
 [Chart] VARBINARY(MAX) FILESTREAM NULL
)
GO