THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

Counting occurrences of a substring within a string

I have absolutely no idea why anyone wants to do this, but I keep answering the same question in forums: "How do I count the occurrences of a substring [note: usually comma] within a string?"

 

In an effort to thwart carpal tunnel syndrome, I have created the Ultimate Substring Occurrence Counting UDF.

... And here it is:

 

CREATE FUNCTION dbo.GetSubstringCount
(
@InputString TEXT,
@SubString VARCHAR(200),
@NoisePattern VARCHAR(20)
)
RETURNS INT
WITH SCHEMABINDING
AS
BEGIN
RETURN
(
SELECT COUNT(*)
FROM dbo.Numbers N
WHERE
SUBSTRING(@InputString, N.Number, LEN(@SubString)) = @SubString
AND PATINDEX(@NoisePattern, SUBSTRING(@InputString, N.Number + LEN(@SubString), 1)) = 0
AND 0 =
CASE
WHEN @NoisePattern = '' THEN 0
ELSE PATINDEX(@NoisePattern, SUBSTRING(@InputString, N.Number - 1, 1))
END
)
END

First note: You need (regular readers, you guessed it) a numbers table.

Okay, so what's it do? Simply put, it returns the number of times @SubString appears within @InputString. But wait! -- Act now and you will receive an additional bonus feature at no extra cost! Can you feel the love?

The @NoisePattern parameter allows the user to put the UDF into "exact match" mode.

For instance, let's say you have a big string containing some text about automobile manufacturers, and for some reason (again, I have no clue why people need this functionality -- fill me in if you do!) you want to count the number of occurrences of the word "auto", but not the number of occurrences of other forms of the word, e.g. "automobile" or .... some word that ends in "auto" (if such a word exists).

By specifying a pattern for @NoisePattern of characters that shouldn't be adjacent to your word, you're telling the UDF that any other characters are safe. Leaving the parameter empty means that all occurrences of the substring will be counted. Examples:

 

SELECT dbo.GetSubstringCount('We like the autos, the autos that go boom.', 'auto', '')
-- Returns 2

SELECT dbo.GetSubstringCount('Autos are fun. I like to drive my auto.', 'auto', '')
-- Also returns 2

SELECT dbo.GetSubstringCount('Autos are fun. I like to drive my auto.', 'auto', '%[a-z]%')
-- Only returns 1 -- The exact match must not have adjacent alphabetic characters

Published Wednesday, July 12, 2006 10:22 PM by Adam Machanic
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

beckham said:

Good one..

May 7, 2007 12:59 AM
 

Jereme Guenther said:

I am looking for this functionality, however I would prefer to do it without a Numbers table.

My need is that of a search engine sproc.  I need to be able to weight the returned records and one of the items in the weighting system is to check how many times the search pattern occurs in the record.

June 17, 2008 11:12 AM
 

Pickles said:

I need this too. I have a column that basically contains sql and I need to pull out field_ids from it, but each row has a different number of field_ids in the sql. Is there a different(better) solution?

July 16, 2008 6:28 PM
 

Adam Machanic said:

I don't know why you would want to do it without the numbers table.  It makes the solution an order of magnitude faster.

July 23, 2008 2:30 PM
 

Craig Hathaway said:

Hi Adam, I have tried this code but even the included examples do not work!

SELECT dbo.GetSubstringCount('We like the autos, the autos that go boom.', 'auto', '')

-- returns 0   <-- should be 2

SELECT dbo.GetSubstringCount('Autos are fun.  I like to drive my auto.', 'auto', '')

-- returns 0   <-- should be 2

SELECT dbo.GetSubstringCount('Autos are fun.  I like to drive my auto.', 'auto', '%[a-z]%')

-- returns 1   <-- should be 1

Is there a SQL setting / version dependency that O could be missing?

Craig

February 22, 2009 9:06 PM
 

Adam Machanic said:

Hi Craig,

Thanks for pointing this out.  It is indeed a version issue.  I wrote this back in the bad old days of SQL Server 2000, and apparently there was a change to the way PATINDEX works between 2000 and 2005. I just tested:

SELECT PATINDEX('', ' ')

SQL Server 2000 returns 0, whereas SQL Server 2005 returns 1.  This is breaking the third predicate in the WHERE clause within the function, which checks to see if the target string is prepended by anything that matches the input pattern.

I'll have to think about how to fix this, but as a temporary workaround if you don't want to use exact-match mode, you could pass in some character that you know can't possibly exist in the target string:

SELECT dbo.GetSubstringCount('We like the autos, the autos that go boom.', 'auto', CHAR(255))

--Returns 2, even in SQL Server 2005 or 2008

In the meantime, I'm wondering if this "new" behavior makes sense?  I'm not sure, but I'm leaning towards SQL Server 2000's answer.  An empty pattern shouldn't, in my opinion, match on anything at all...

February 23, 2009 11:18 AM
 

Funmarkaz said:

Great work dude!

Can i use it on MyISAM?

October 6, 2009 1:25 AM
 

Adam Machanic said:

Funmarkaz, yes, it should work with a bit of modification; MySQL's CREATE FUNCTION syntax isn't quite the same as SQL Server's.

October 6, 2009 2:27 PM
 

Mike Schafer said:

I found this post because I was searching for a function for  a project where I need to find out how many delimiters exist in a string.  This implementation does not require a numbers table or anything other than the function itself.  There is no "additional bonus" feature in this version but it will count occurences without any additional db objects. Happy Querying!

CREATE FUNCTION dbo.GetSubStringCount (

      @InputString NVARCHAR(4000),

      @SearchString VARCHAR(255)

)RETURNS INT WITH SCHEMABINDING AS

BEGIN

      DECLARE @occurences AS BIGINT

             ,@position AS BIGINT

      SET @occurences = 0

      SET @position = 0

      WHILE @position < LEN(@InputString)

      BEGIN

            IF CHARINDEX(@SearchString, @InputString, @position) > 0

            BEGIN

               SET @occurences = @occurences + 1

               SET @position = CHARINDEX(@SearchString, @InputString, @position)

            END

            SET @position = @position+1

      END

      RETURN @occurences

END

December 29, 2009 12:51 PM
 

Adam Machanic said:

Mike,

Thanks for sharing. A simpler and more efficient way to solve the problem (if you don't want the "bonus" feature) is to do:

SELECT LEN(@InputString) - REPLACE(@InputString, @SearchString, '')

I suspect that the numbers table will provide better performance than a WHILE loop, and both will be less efficient than the above solution, but I'll leave that testing as an exercise for anyone interested in taking this a bit further.

Even better would be to inline the Numbers table version (search my blog for my post on that topic), and a SQLCLR solution would probably be fastest of all. I would personally definitely keep the "bonus" around as it's been quite useful in a few projects I've worked on.

December 30, 2009 11:47 AM
 

Brian Lewis said:

My reason for wanting this functionality: to count line breaks in sys.syscomments in order to measure how many lines of T-SQL there are in the project. The bonus feature will allow blank lines to be excluded from the count.

April 28, 2010 4:57 PM
 

Adam Machanic said:

Okay, the 2005/2008 bug is fixed.

May 1, 2010 11:49 AM
 

Jason said:

Adam, thanks for posting this.   Very handy use of the numbers table.  

I'm troubleshooting a system that is experiencing tempdb meta-data contention and trying to identify queries that are creating temp objects within ad-hoc sql statements.  I have traces of this activity and am using this to identify the 'worst offenders'. Initially I was just counting each query where TextData was like '%table%' until I found that some batches created 10-15 table variables, so I needed a way to count them.

Works great.  thx.

June 10, 2011 4:47 PM
 

Ron said:

Adam,

Is there a modification I can make to get:

SELECT dbo.GetSubstringCount('ROUTINE MEDICAL EXAMINATION', 'NORMAL ROUTINE HISTORY AND PHYSICAL', '%[a-z]%')

to return 1 (ROUTINE)

and

SELECT dbo.GetSubstringCount('REFLUX, ESOPHAGEAL', 'ESOPHAGEAL REFLUX', '%[a-z]%')

to return 2 (ESOPHAGEAL & REFLUX)

February 19, 2013 8:35 AM
 

Adam Machanic said:

Ron: Sure, it's doable, but it would be a completely different function. What you want to do is split both strings on any non-alpha character, then intersect the results. You can search my blog for a string splitter (the CLR version would probably be best), and then just use the INTERSECT operator to get your final answer.

--Adam

February 19, 2013 9:50 AM
 

Rayliner said:

I'm not sure if it is faster or slower, but how about this:  a single line, no external tables needed.

String: 'The dog and the cat broke the plate before the farmer got home'

find the length of the string (63)

upshift the entire string (and the search string), and replace the search string with nothing (''). Find the length of that new string (51)

Subtract the new length from the old length and divide by the length of the search string. The result will be the number of occurrences of the search string in the original line.

select

length('The cat and the dog broke the plate before the farmer got home.') as orig_length,

length(replace(upper('The cat and the dog broke the plate before the farmer got home.'),upper('the'),'')) as newlength

,length('the') as substringlength

,(length('The cat and the dog broke the plate before the farmer got home.') - length(replace(upper('The cat and the dog broke the plate before the farmer got home.'),upper('the'),'')))/length('the') as NumberOfOccurrences

from dual

or with a table that has a line of chars in colC, and you need to search for 'x':

select (length(colC) - length(replace(upper(colC),upper('x')))/length('x') as countByLine

from myTable;

October 28, 2014 7:20 PM
 

Rayliner said:

oops, looks like Adam mentioned this in one of the posts already.  sorry to duplicate your solution.   I was interested in having the count so when we were lookign for a variable name throughout our code, and we were looking to see how it was used, we wouldn't need to walk through the entire line if it only existed once, or we would know to look for the extra occurrences.  (We also bolded each occurrence so they would stand out, but the occurrence counter in addition to the bold helped out).

October 28, 2014 7:46 PM
 

Find a accurate occurrence of a substring in a twine SQL SERVER 2008 | My WordPress Website said:

November 23, 2014 5:52 PM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement