On the newsgropes recently I have encountered some posters who are perplexed by the results of their full-text searches.
The first poster was puzzled as to why searches on f.b.i. would not find F.B.I. in his content. SQL FTS and Microsoft Search products index uppercased tokens with periods in them differently than other lower or upper cased tokens. During indexing most language word breakers will index upper cased tokens with periods in them as the token, and then the token with all the periods stripped out. So F.B.I. is indexed as F.B.I. and FBI. All tokens are stored upper cased in the index with the exception of path names (for cosmetic reasons - e.g. in Indexing services a path like c:\TeMp is stored as c:\temp as lowercase was considered to look better than uppercase).
There is no length cut off as far as I am aware. For example, C.R.E.E.P. (Committee to Re-Elect the President - the unfortunately and what we can only hope was unintentional acronym for President Richard Nixon's second election campaign) is indexed as CREEP and C.R.E.E.P. I've tried up to 10 letter combinations.
Here is an example:
Create table testAcronymTable(pk int not null identity constraint testAcronymTablePK primary key, charcol char(20))
GO
insert into testAcronymTable(charcol) values('F.B.I.')
insert into testAcronymTable(charcol) values('f.b.i.')
insert into testAcronymTable(charcol) values('fbi')
insert into testAcronymTable(charcol) values('FBI')
insert into testAcronymTable(charcol) values('F B I')
insert into testAcronymTable(charcol) values('f b i')
GO
create fulltext catalog testacronymcatalog
GO
--indexing using the neutral word breaker
create fulltext index on testAcronymTable(charcol language 0) key index testAcronymTablePK on testAcronymCatalog
GO
--returns f.b.i., F.B.I., F B I and f b i - expected, SQL FTS
--throws away the .'s
--you get the same results if f, b, and i are removed from the noise list
drop fulltext index on testAcronymTable
GO
--indexing using the American English Word breaker word breaker
create fulltext index on testAcronymTable(charcol language 1033) key index testAcronymTablePK on testAcronymCatalog
GO
select * from testAcronymTable where contains(*,'fbi')
GO
--returns fbi F.B.I, and FBI - fbi, and FBI are expected, unexpected is
--F.B.I, F.B.I is indexed as F.B.I. and FBI.
select * from testAcronymTable where contains(*,'FBI')
GO
--returns fbi F.B.I, and FBI - fbi, and FBI are expected, unexpected is
--F.B.I, F.B.I is indexed as F.B.I. and FBI.
select * from testAcronymTable where contains(*,'F.B.I.')
GO
--returns fbi F.B.I, and FBI - F.B.I is expected, unexpected is
--FBI and fbi, F.B.I is indexed as F.B.I. and FBI.
select * from testAcronymTable where contains(*,'f.b.i.')
GO
--returns f.b.i., F B I, and f b i -if your noise word list does not contain
--f, b, and i
--returns nothing -if your noise word list does contain
--f, b, and i
Note that this works for most languages but does NOT work for the neutral, Chinese (both types), Japanese, and Thai word breakers. They will index F.B.I. as F, B, and I - in other words as three separate letters. Korean will index F.B.I. as F.B.I
The next post covers some of the unusual behaviors while searching on C#, C+, C++, and F#, etc