THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Hilary Cotter

SQL Full-text search anomolies

On the newsgropes recently I have encountered some posters who are perplexed by the results of their full-text searches.

The first poster was puzzled as to why searches on f.b.i. would not find F.B.I. in his content.  SQL FTS and Microsoft Search products index uppercased tokens with periods in them differently than other lower or upper cased tokens. During indexing most language word breakers will index upper cased tokens with periods in them as the token, and then the token with all the periods stripped out. So F.B.I. is indexed as F.B.I. and FBI. All tokens are stored upper cased in the index with the exception of path names (for cosmetic reasons - e.g. in Indexing services a path like c:\TeMp is stored as c:\temp as lowercase was considered to look better than uppercase).

There is no length cut off as far as I am aware. For example, C.R.E.E.P. (Committee to Re-Elect the President - the unfortunately and what we can only hope was unintentional acronym for President Richard Nixon's second election campaign) is indexed as CREEP and C.R.E.E.P. I've tried up to 10 letter combinations.

Here is an example:

Create table testAcronymTable(pk int not null identity constraint testAcronymTablePK primary key, charcol char(20))

GO

insert into  testAcronymTable(charcol) values('F.B.I.')

insert into  testAcronymTable(charcol) values('f.b.i.')

insert into  testAcronymTable(charcol) values('fbi')

insert into  testAcronymTable(charcol) values('FBI') 

insert into  testAcronymTable(charcol) values('F B I') 

insert into  testAcronymTable(charcol) values('f b i') 

GO

create fulltext catalog testacronymcatalog

GO
--indexing using the neutral word breaker
create fulltext index on testAcronymTable(charcol language 0) key index testAcronymTablePK on testAcronymCatalog
GO
--returns f.b.i., F.B.I., F B I and f b i  - expected, SQL FTS
--throws away the .'s
--you get the same results if f, b, and i are removed from the noise list
drop fulltext index on testAcronymTable
GO
--indexing using the American English Word breaker word breaker
create fulltext index on testAcronymTable(charcol language 1033) key index testAcronymTablePK on testAcronymCatalog
GO
select * from  testAcronymTable where contains(*,'fbi')
GO
--returns fbi F.B.I, and FBI - fbi, and FBI are expected, unexpected is
--F.B.I, F.B.I is indexed as F.B.I. and FBI.
select * from  testAcronymTable where contains(*,'FBI')
GO
--returns fbi F.B.I, and FBI - fbi, and FBI are expected, unexpected is
--F.B.I, F.B.I is indexed as F.B.I. and FBI.
select * from  testAcronymTable where contains(*,'F.B.I.')
GO
--returns fbi F.B.I, and FBI - F.B.I is expected, unexpected is
--FBI and fbi, F.B.I is indexed as F.B.I. and FBI.
select * from  testAcronymTable where contains(*,'f.b.i.')
GO
--returns f.b.i., F B I, and f b i -if your noise word list does not contain
--f, b, and i
--returns nothing -if your noise word list does contain
--f, b, and i

Note that this works for most languages but does NOT work for the neutral, Chinese (both types),  Japanese, and Thai word breakers. They will index F.B.I. as F, B, and I - in other words as three separate letters. Korean will index F.B.I. as F.B.I

The next post covers some of the unusual behaviors while searching on C#, C+, C++, and F#, etc

 

Published Thursday, March 08, 2007 2:11 PM by Hilary Cotter

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Kevin said:

I've found a strange anomaly related to full text indexing.  I am indexing a table which contains both an image column (storing mostly office files as well as image files - jpgs, bmps, etc.) as well as nvarchar columns.  We needed to add an IFilter so that we could index Office 2007 files, so I installed the IFilter from Microsoft, ran the command "sp_fulltext_service 'load_os_resources', 1" as it states to do in this article: "http://support.microsoft.com/default.aspx?scid=kb;en-us;945934", and after I rebuilt the index, all the records with image files of jpgs, and bmps no longer get the nvarchar columns indexed.  Looking at the full text index log files, it says it failed to index the image rows ("Error '0x8004170c: The document format is not recognized by the filter.").

If I run the command "sp_fulltext_service 'load_os_resources', 0" and rebuild the index, the nvarchar columns mentioned above get properly indexed, but the Office 2007 image files do not.  I haven't found anyone reporting a similar issue, maybe we are one of the few people using SQL Server this way?

September 17, 2008 7:01 PM

Leave a Comment

(required) 
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement