Saturday 27 December 2014

SQL Theory: Full-Text Semantic Search

Problem
Sometimes it is necessary to search for specific content inside documents stored in a SQL Server database. Is it possible to do this in SQL Server? Can I run T-SQL queries and find content inside Microsoft Word files?  Yes, now with SQL Server 2012 you can do a semantic search. You can create a T-SQL query and look for information inside a Word document or other documents.  In this tip, we will look at how this works.
Solution
SQL Server 2012 introduces a new search feature beyond the relational platform called semantic search. This feature extends the semantic full-text search capabilities to external files.

Installation

The Semantic Language Statistics Database is a required component for the Statistical Semantic Search feature in Microsoft® SQL Server® 2012 Semantic Language Statistics.

The Semantic Search engine can be installed when you do the initial installation of SQL Server or it can be added later.
Here is a screenshot of the option when you do a normal installation:

install full text and semantic search

If you did not install this when you did the initial installation you can run the SQL Server installer for the SemanticLanguageDatabase as shown below. This can be found in the SQL Server setup folder.

Open the SQL Server Management Studio

Attach the semantic language statistics database after installation:

You have to run SQL Server Management Studio as an administrator, and it works. Attach the database to the instance of SQL Server by using Management Studio or by calling CREATE DATABASE with the FOR ATTACH syntax. For more information, see Detaching and Attaching Databases. By default, the name of the database is semanticsdb. You can optionally give the database a different name when you attach it. You have to provide this name when you register the database in the following step:

CREATE DATABASE semanticsdb

ON (FILENAME = 'C:\Microsoft Semantic Language Database\semanticsdb.mdf') 

FOR ATTACH; 

Important

When the semantic language statistics database is extracted, restricted permissions are assigned to the database file and log file in the file system. As a result, the user may not have permission to attach the database. If an error is raised when you try to attach the database, check and fix file system permissions as appropriate.

Register the semantic language statistics database: 

Call the stored procedure sp_fulltext_semantic_register_language_statistics_db and provide the name that you gave to the database when you attached it:

EXEC sp_fulltext_semantic_register_language_statistics_db @dbname = N'semanticsdb';

GO

Verify Features Are Installed

To verify if you have Full Text and Semantic Search installed you can run this query:
SELECT SERVERPROPERTY('IsFullTextInstalled')

The following query s used to verify that the semantic database is installed and registered:
SELECT * FROM sys.fulltext_semantic_language_statistics_database

Create Sample Database

To use this new feature we are going to use the FileTable feature in SQL Server 2012.  First I am going to create a folder where we will store our database files.  In this example I have created a folder called "C:\FileTable".

After this folder is created, open SQL Server Management Studio and create a new database called "FileSearchTest".  This is created just like any other database except we are going to also enable FILESTREAM for this database as well.

The script below creates the database:
CREATE DATABASE FileSearchTest
ON PRIMARY
(
    NAME = N'FileSearchTest',
    FILENAME = N'C:\FileTable\FileSearchTest.mdf'
),
FILEGROUP FilestreamFG CONTAINS FILESTREAM
(
    NAME = MyFileStreamData,
    FILENAME= 'C:\FileTable\Data'
)
LOG ON
(
    NAME = N'FileSearchTest_Log',
    FILENAME = N'C:\FileTable\FileSearchTest_log.ldf'
)
WITH FILESTREAM
(
    NON_TRANSACTED_ACCESS = FULL,
    DIRECTORY_NAME = N'FileTable'
)

Create Sample Table

Create a new table as a FileTable inside the database.
USE FileSearchTest
go
CREATE TABLE DocumentSemantics AS FileTable
WITH
(
    FileTable_Directory = 'DocumentSemantics',
    FileTable_Collate_Filename = database_default
);
GO
We just created a FileTable! This is a special table that is used to store non-relational data like Word files, text files, ppt files, etc... The below image shows this new feature in SSMS.
Open the SQL Server Management Studio

There is also a feature in SQL Server 2012 to explore the FileTable directory from within SSMS.  In SSMS right click on the new FileTable you can see the "Explore FileTable Directory" option.  This will show you the files that exist in this folder.

Open the SQL Server Management Studio

You can now copy files into the folder "C:\FileTable\Data", using Windows Explorer and then if you run a SELECT against the DocumentSemantics FileTable you can see a list of these files.
You can just run this command to see all of the details:
SELECT * FROM dbo.DocumentSemantics


Open the SQL Server Management Studio

Deleting Files

You can also delete files from the folder using T-SQL as shown below:
USE FileSearchTest
DELETE FROM dbo.DocumentSemantics
WHERE dbo.DocumentSemantics.file_type <> 'docx'

Searching Content Within Files

In this example we are going to search for text and find the files with that content:
USE FileSearchTest
GO
SELECT name
FROM dbo.DocumentSemantics
WHERE FREETEXT (file_stream, 'After climbing a very high mountain, we discovered that there are many other mountains to climb')

The query will show the name of the documents that contain the following text: "After climbing a very high mountain, we discovered that there are many other mountains to climb". 
The results displayed are the following:

Open the SQL Server Management Studio








If we open these documents in WordPad we can see this text as outlined below.

Open the SQL Server Management Studio

Finding Documents By Filename

Another option we have is to find filenames that include some text.  In this example we are going to search for the word "system".
These are the files in folder "C:\FileTable\Data" that include the word "system":

Open the SQL Server Management Studio

We can run this T-SQL script to find the first 5 files that contain the word "system" in the filename:
 USE FileSearchTest
SELECT TOP (5)
 D.name,
 D.cached_file_size,
 KEYP_TBL.score
FROM dbo.DocumentSemantics D
INNER JOIN SEMANTICKEYPHRASETABLE
 (
 dbo.DocumentSemantics, 
 (name, file_stream)
 ) AS KEYP_TBL
 ON D.path_locator = KEYP_TBL.document_key
WHERE KEYP_TBL.keyphrase ='system'
ORDER BY KEYP_TBL.score DESC;

The results displayed by the query are as follows:

Open the SQL Server Management Studio
Next Steps
Semantic Search combined with FileTables is a powerful feature that will let us handle external files in SQL Server easily and we can query the content easily using T-SQL.
For more information, you can read these artilces:


Last Update: 10/4/2012 

0 Comments:

Post a Comment

Subscribe to Post Comments [Atom]

<< Home