NOTES | Querying Full-Text Data

NOTES | Querying Full-Text Data

Creating Full-Text Catalogs and Indexes

Full-text search allows approximate searches in SQL Server 2012 databases. Before you start using full-text predicates and functions, you must create full-text indexes inside full-text catalogs.
After you create full-text indexes over character columns in your database, you are able to search for:

  • Simple terms—that is, one or more specific words or phrases.
  • Prefix terms, which are terms the words or phrases begin with.
  • Generation terms, meaning inflectional forms of words.
  • Proximity terms, or words or phrases close to another word or phrase.
  • Thesaurus terms, or synonyms of a word.
  • Weighted terms, which are words or phrases that use values with your custom weight.
  • Statistical semantic search, or key phrases in a document.
  • Similar documents, where similarity is defined by semantic key phrases.

Full-Text Search Components

In order to start using full-text search, you have to understand full-text components. For a start, you can check whether Full-Text Search is installed by using the following query.

SELECT SERVERPROPERTY('IsFullTextInstalled');

If Full-Text Search is not installed, you must re-run the setup.

You can create full-text indexes on columns of type CHAR, VARCHAR, NCHAR, NVARCHAR, TEXT, NTEXT, IMAGE, XML, and VARBINARY(MAX). Besides using full-text indexes on SQL Server character data, you can store whole documents in binary or XML columns, and use full-text queries on those documents. Columns of data type VARBINARY(MAX), IMAGE, or XML require an additional type column in which you store the file extension (such as .docx, .pdf, or .xlsx) of the document in each row.

You need appropriate filters for documents. Filters, called ifilters in full-text terminology, extract the textual information and remove formatting from the documents. You can check which filters are installed in your instance by using the following query.

EXEC sys.sp_help_fulltext_system_components 'filter';

Word breakers and stemmers perform linguistic analysis on all full-text data. Because rules differ from language to language, word breakers and stemmers are language specific. A word breaker identifies individual words (or tokens). Tokens are inserted in a full-text index in compressed format. The stemmer generates inflectional forms of a word based on the rules of a language.
You can use the following query to check which languages are supported in SQL Server.

SELECT lcid, name
FROM sys.fulltext_languages
ORDER BY name;

Stemmers are language specific. If you use a localized version of SQL Server, SQL Server Setup sets the default full-text language to the language of your instance, if the language is supported on your instance.

You can prevent indexing noise words (like “SQL Server”) by creating stoplists of stopwords.
You can check current stopwords and stoplists in your database by using the following queries:

SELECT stoplist_id, names
FROM sys.fulltext_stoplists;
SELECT stoplist_id, stopword, language
FROM sys.fulltext_stopwords;

Full-text queries can search not only for words you provide in a query; they can search for synonyms as well. The location of the thesaurus files for a default instance is SQL_Server_install_path\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\FTDATA.
You can manually edit each thesaurus file and configure the following elements:

  • diacritics_sensitive Set the value of this element to 0 if the language is accent insensitive, or to 1 if it is accent sensitive.
  • expansion Use this element to add expansion words for a word. For example, you can add the expansion word “author” to the word “writer” in order to search for “author” as well when an end user searches for the word “writer.”
  • replacement Use this element to define replacement words or terms for a specific word or term. For example, “Windows 2008” could be a replacement for “Win 2k8.” In such an example, SQL Server would search for “Windows 2008,” even though “Win 2k8” was used in a search term.

After you edit the thesaurus file for a specific language, you must load it with the following system procedure call.

EXEC sys.sp_fulltext_load_thesaurus_file 1033;

Full-text queries can search on document properties as well. Which properties can be searched for depends on the document filter. You can create a search property list to define searchable properties for your documents. You can include properties that a specific filter can extract from a document.

Exam Tip
Although full-text search is not on the list of the exam objectives, an indirect question about it could appear. Remember that full text predicates can also be a part of the WHERE clause of a query.

Creating and Managing Full-Text Catalogs and Indexes

Full-text indexes are stored in full-text catalogs. A full-text catalog is a virtual object, a container for full-text indexes. As a virtual object, it does not belong to any filegroup.

The last option mentioned, the STATISTICAL_SEMANTICS option, deserves deeper explanation. Statistical semantic search gives you deeper insight into documents by extracting and indexing statistically relevant key phrases. Full-text search uses these key phrases to identify and index documents that are similar or related. You query these semantic indexes by using three T-SQL rowset functions to retrieve the results as structured data. You use these functions in the practices in this chapter. Semantic search extends full-text search functionality. It enables you to query the meaning of the documents. For example, you can query the index of key phrases to build the taxonomy of documents. You can query the document similarity index to identify résumés that match a job description. Semantic search gives you the possibility to create your own text-mining solution. Semantic search could be especially interesting in conjunction with text-mining components of SQL Server Integration Services (SSIS).

In order to use the Semantic Search feature, you have to have Full-Text Search installed.

Using the CONTAINS and FREETEXT Predicates

SQL Server supports the CONTAINS and FREETEXT predicates which are very powerful for limiting the result set of a query by using full-text indexes. Both of them support various term searching. SQL Server also supports two table-valued functions for full-text searches, and three table-valued functions for semantic searches.

The CONTAINS Predicate

With the CONTAINS predicate, you can search for the following:

  • Words and phrases in text
  • Exact or fuzzy matches
  • Inflectional forms of a word
  • Text in which a search word is close to another search word
  • Synonyms of a searched word
  • A prefix of a word or a phrase only

The FREETEXT predicate

The FREETEXT predicate performs a best match of the specified words and phrases by finding rows in which the terms and phrases match the meaning, rather than the exact wording of the query. This predicate is an optional part of the optional WHERE clause of the SELECT statement.

Note
It is not valid to place NOT before a FREETEXT predicate that begins with a WHERE clause.


Exam Tip
The FREETEXT predicate is less selective than the CONTAINS predicate, and thus it usually returns more rows than the CONTAINS predicate


Using the Full-Text and Semantic Search Table-Valued Functions

Using the Full-Text Search Functions

Leave a Reply