|
|
|
|
-
The other day I wrote a blog post about my initial trials with SQL Storage Compress 5.0. For the most part my experience with the product was fantastic, however there were a few areas where I felt bit uneasy - and wanted to address those concerns.
Recoverability
Since I am a troublemaker by nature, one of the first things I tried in my initial round of testing was to access a compressed database while the HyperBac service was stopped. I simply stopped the service through the HyperBac Configuration Manager, and quickly discovered that my compressed databases were no longer accessible. Assuming there is no data loss, this situation is easy enough to recover from, by following these steps:
- Take the database offline (you can right-click the database in Object Explorer, or use ALTER DATABASE);
- Start the HyperBac service;
- Bring the database online.
You can also use a more tedious technique, which decompresses each individual mdfx/ldfx file into its equivalent mdf/ldf file, and then simply attach those files as if it were a normal database. This is similar to the technique you would have to follow if you were using the product and your license expired, or if you decommissioned a SQL Server and only had the .mdfx/.ldfx files available:
- Launch a command prompt in the HyperBac\bin directory (in my case, C:\Program Files (x86)\Red Gate\HyperBac\bin\);
- Run the following for each mdfx/ldfx file:
|
hyperutil.exe /S"<path>\<file>.mdfx" /O"<path>\<file>.mdf" /E hyperutil.exe /S"<path>\<file>.ldfx" /O"<path>\<file>.ldf" /E
|
- Since these are no longer compressed mdfx/ldfx files but standard mdf/ldf files, use the following syntax to attach the database:
|
CREATE DATABASE <db_name> ON PRIMARY (FILENAME = '<path>\<file>.mdf') LOG ON (FILENAME = '<path>\<file>.ldf') FOR ATTACH; |
Of course in this case you will no longer be able to automatically use compression against these databases, unless you backup and restore them as mdfx and/or ldfx files again.
In a thread on the Red Gate Forum, Jeffrey Aven agreed that it might be beneficial to warn users of the impact when pressing the Stop button in the HyperBac Configuration Manager; so, they are going to consider addressing this in a future update to the product. While this doesn't help people who might (even inadvertently) stop the service via Task Manager, the Services applet or NET STOP, at least the casual goof might be prevented.
Log Size / Duration
In my first tests, the log file for the HyperBac compressed database was roughly 10X larger than the log file for native or non-compressed data. I checked multiple times that for each test everything was the same - identical workload, same isolation level, same autogrowth settings, same recovery model. While I had no problem creating this scenario multiple times on the first day, I've had trouble re-creating it since the first time I stopped and re-started the HyperBac service. Today I ran some tests with a little more data being pumped through to see if I could re-create it, and thought I would share more pretty graphs. I conducted four different tests instead of three: - No compression
- Native compression
- Red Gate compression on both data and log files
- Red Gate compression on data file only
Here were the comparisons of time required to perform data load and update workload tasks:  Time, in seconds, required to perform workload tasks
From this we can see that, when the background service has to compress both data file and log file information, the time required to run the workload is dramatically affected. Next are the comparisons to data, log and overall size using the various compression methods:  Size, in MB, of data and log files after workload
While the log is obviously larger when we don't have HyperBac compress it, it seems that for volatile workloads (at least in typical OLTP applications) it may be worth considering only compressing the data file(s) and leaving the log file in the default mode. The duration for the workload when both log and data were being compressed by the HyperBac service was more than 3X longer than compressing just the data file. And while the uncompressed log file was much larger in that case, the overall size was not much larger than native compression, and the time required was much more in line as well. (I'll be sharing my workload scripts with Red Gate, and can publish them here on request as well.)
Performance Overhead
Again I did not perform any tracing, but I did pay a little more attention to spot checking Task Manager to see the impact on CPU. During INSERT/UPDATE operations the CPU was running pretty hot; the biggest consumers were (predictably) sqlservr.exe and HyperBacSrv.exe:

I then discovered that the HyperBacSrv.exe process utilizes a substantial amount of CPU even when there is no activity against SSC compressed databases; it utilized anywhere from 12 to 60% of the CPU while a workload was running against a non-compressed database. I'm not saying this is alarming, just something to watch for when running your own tests. The CPU overhead of native compression and how it affects sqlservr.exe still appears to be in the 5% range for my scenario - but again I will need to run more elaborate trace and perfmon tests to provide more accurate overhead analysis.
Conclusion
SQL Storage Compress 5.0 seems to be a candidate worthy of testing. In environments with adequate CPU resources and where disk space and/or I/O are your primary bottlenecks, or where you don't have the ability to use native data compression, the product can fill a tangible gap. But as always, to determine if a product will be beneficial to your
environment, I will always recommend that you perform your own
benchmarks and are aware of the risks and overhead.
|
-
Recently, Red Gate Software released a new version of their product called SQL Storage Compress 5.0. This is an alternative to using the native Data Compression and Backup Compression features introduced in SQL Server 2008. In order to utilize Data Compression, you must be running Enterprise Edition or above (and yes, for the nit-pickers, I know this includes Developer and Evaluation, but these aren't real production editions). For Backup Compression, this was relaxed in SQL Server 2008 R2 to also include Standard Edition, however Data Compression remains a big ticket item. So certainly on the surface, SQL Storage Compress seems to be a decent solution for people who:
(a) run SQL
Server Standard Edition, and can't take advantage of the native
data compression features; (b) have a mix of editions, and want to use the same compression solution everywhere; or, (c) commonly perform backup/restore operations between different editions.
Setup
Setting it up is quite easy - you run the installer, it installs a service and a configuration tool, and basically you only have to change a few things in your normal routines in order to take advantage of compression from the product.
- If you want to compress data (resulting in smaller MDF/NDF files), you simply have to create or restore databases with .mdfx / .ldfx / .ndfx extensions instead of .mdf / .ldf / .ndf. The background service will actively compress these data files in real time.
- If you want to compress backups (resulting in smaller backup files), you simply have to change your native backup statements to save to .hbc or .hbc2 filenames instead of the traditional .BAK extension. (Alternatively, you can configure the service to also compress .BAK files, however I don't recommend this as it could become confusing in your environment.) For anyone who has used HyperBac, some of this may sound familiar, and that is because Red Gate acquired HyperBac earlier this year. In fact much of the branding remains:

- SQL Storage Compress also supports encryption options, but I'll leave those for another day.
At first I was surprised that nothing really happened at the end of setup, except that the activation screen was raised. To get to the configuration tool, go to Start > Programs > Red Gate > HyperBac > HyperBac Configuration Manager. This is where you can change settings, most commonly which extensions should actively be compressed by the service:

On the other tabs, you can also start and stop the HyperBac service, inspect its status, and configure locations for log and index files. Red Gate has published a detailed walk-through which helps you learn how to utilize the product once you have it up and running.
The Tests
To keep things simple, I wanted to create new databases from scratch, mark them as compressed, and then populate them with data. I wanted to measure the following aspects:
- speed of initial load operations
- speed of read operations
- speed of write operations
- size of data and log files
- size of backup files
Of course I wanted to compare these metrics using both native and Red Gate compression, and use a baseline where no compression was used at all.
Rather than repeat all of the boring test setup and sample data, I will point you to this post of mine from last year:
http://sqlblog.com/blogs/aaron_bertrand/archive/2009/08/23/sql-server-2008-r2-digging-deeper-into-unicode-compression.aspx
The tests I ran this week were essentially identical to the tests from the above blog post;there were a few exceptions, for example to set up the Red Gate compressed database, I had to restore an existing backup using WITH MOVE to . I haven't yet tested whether the service works equally on a database where the CREATE DATABASE statement uses the modified extensions; I am not 100% clear on whether the service only listens for RESTORE statements and doesn't pay attention to CREATE DATABASE. More tests later.
The machine, also, is the same - an 8GB VM with 4 cores. The hardware isn't an essential detail of the tests, because they're all running on the same box - but keep in mind that these are local virtual disks, so the I/O certainly doesn't represent what you might be dealing with on a real server with a decent disk subsystem. As always, I recommend you do your own performance tests in your own environment!
The Results
For the speed tests, I divided the timing of the results into the three major operational categories: initial load, read, and write, and then stacked these to show overall duration. From the below graph you can see that, as you might expect, the fastest overall speed for the workload was in the database with no compression. For the database using SQL Storage Compress, the initial data load was slower than native compression, but subsequent operations were faster.
 Duration, in seconds, of identical workload replay
Next I compared the data and log file sizes. SQL Storage Compress certainly squeezed the same data into the smallest data file, by almost 50% compared to native data compression. However, this came at quite a cost in log space: the same workload produced almost 300 MB of log compared to about 40 MB of log in both the native compressed and non-compressed databases:
 Size, in MB, of compressed and non-compressed data and log files
Finally, I ran some backups to test eventual backup size. I ran four different tests: simple backup with no compression, a simple backup with native compression, and the two HyperBac compression modes (normal and fast). The timings for these backups were irrelevant, because they all completed within one or two seconds (just not enough data to perform a meaningful comparison). But the resulting sizes had a few surprises:
Backup size, in MB, of compressed and non-compressed data
Note that even though the original size of the SQL Storage Compressed database was the smallest, taking a normal, non-compressed backup actually yields a much larger file - relatively equivalent to a non-compressed backup of non-compressed data. As of SQL Server 2008 R2, backup compression is now included in Standard Edition and above, so whether or not you have SQL Storage Compress installed, most of you will be able to compress your backups in one way or another - so this is unlikely to hit anyone in practice. Without thinking too deeply about what is in this backup file, the result is peculiar to me. Other than that, the chart speaks for itself - for compressed or non-compressed data, native backup compression and HyperBac compression provide almost identical compression ratios. Native compression has the edge if you are in a situation where you need to restore databases to servers without the SQL Storage Compress service running. I hope to do more tests in the future comparing speed and resource usage as well as size on much larger databases.
Caveats
As mentioned before, this was a pretty basic test on low-end hardware. This is in a virtual machine with a single C:\, so the lack
of I/O distribution probably impacted both solutions negatively when
compared to a real-world scenario. Could the SQL Storage Compress performance been optimized?
Possibly, especially if I had consulted Red Gate. In general, though, the test was
apples-to-apples... and it was quite true to what a typical user might experience when trying out the product for the first time. I'll have to put some thought into making some
higher-octane hardware available for further and more realistic performance tests.
Questions
As I was running the tests, some questions occurred to me. Three I have answers for; one I do not.
- What if I only want to compress some of my data?
In the native Data Compression solution, you can pick and choose which objects you want to use compression, and you can turn these on and off more or less ad hoc as well. You can do this with SQL Storage Compress as well, but you need to be much better prepared up front: since the Red Gate solution works on a file basis, you would have to place objects on specific physical files within different filegroups in order to compress some objects but not others. This is another thing I plan to play with in future tests.
- What if my trial expires?
Here they explain how to recover your database if you have compressed it and the trial has expired:
http://www.red-gate.com/SQLStorageCompress/ExpiredTrial
- What if the service stops?
If the background service stops, your database will become inaccessible. Personally I think there should be a much bigger warning on the Stop button in the Configuration Manager. If you stop the service and re-start it, you will get this error message if you try to use the database:
|
Msg 945, Level 14, State 2, Line 1 Database 'RGCompression' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details. |
Management Studio still shows the database in the tree, but it is offline (not explicitly, but for all intents and purposes):
 A compressed database in SSMS if the SQL Storage Compress service has stopped
And as the error message suggests, there are plenty of errors in the SQL Server error log, all of which - at least on first glance - point to "database gone, no paddle" (click to embiggen):
 Error log after SQL Storage Compress service stops (click to embiggen)
This aspect looks pretty brittle to me so far - I'll have a follow-up post that explains how to recover your databases should this happen to you.
- Does the HyperBac service need to be a dependency in a cluster?
I would expect so, and if it isn't a requirement, it is probably still a good idea, especially given the above. But I did not have a cluster available to run these initial tests. If you have any information about this, please let me know.
Conclusions
I've either realized or further cemented a few ideas in the process of conducting these tests. In no particular order:
- Whether or not you use native or 3rd party compression for your data files, you should always use backup compression if possible. You get tremendous space savings, at a very low cost of additional overhead - and in all the tests I've performed with various products, almost always faster. Again it is not really important whether you use the native backup compression solution or a 3rd party product, as long as you can restore the database on all the required instances.
- SQL Trace can be a real pain sometimes. I set up several traces so that I could include more pretty graphs, but they all ended up with 2 useless rows. I obviously boneheaded something, but was too tired to start over. I should probably have SQL Sentry installed in this VM. Again, something to work out for more elaborate tests in the future.
- SQL Storage Compress 5.0 can provide some advantages over native compression methods. So, it could be a viable alternative for those running Standard Edition and lower, and even for those with the ability to use native compression, depending on priorities. I'll mention again that you shouldn't take my graphs above as gospel; if you want to evaluate how the solution works for you, I strongly recommend you download the 14-day trial edition and conduct your own tests. For backup compression specifically, you should also consider the following products:
- The two issues with this product that stand out to me are:
- Why are the log files so big?
- How do we recover a database if the service has stopped?
EDIT: I address these questions to some degree in a follow-up post.
|
-
Maybe I'm a little late to the party, but I thought I would share with you the most recent release of Microsoft's Best Practices Analyzer for SQL Server 2008 R2 (you can also read the original blog announcement from CSS). Let me be clear: this tool is not just for people who are already on SQL Server 2008 R2. It will also happily analyze your SQL Server 2008 instances, so don't be fooled by the label.
That said, getting the thing installed was an exercise in patience. Like most people, I clicked the download link:
http://www.microsoft.com/downloads/details.aspx?FamilyID=0fd439d7-4bff-4df7-a52f-9a1be8725591
And I was off to the races. Or so I thought. I still haven't developed a healthy habit of READING before CLICKING. The first time I ran the installer, I got this complaint about not having the Baseline Configuration Analyzer installed (I'm typing out the error message so that users experiencing the same thing can find this post, and I'm showing a screen shot to show how asinine it is to put a URL in a dialog but not have the ability to click or copy it):
|
Microsoft Baseline Configuration Analyzer 2.0 is not installed. Please install it from http://www.microsoft.com/downloads/details.aspx?FamilyID=1b6e9026-f505-403e-84c3-a5dea704ec67
|

To help out a bit, here is the link to download the Configuration Analyzer (which, I will admit, was much easier to obtain from a Google search than by transcribing the link from the dialog) :
http://www.microsoft.com/downloads/details.aspx?FamilyID=1B6E9026-F505-403E-84C3-A5DEA704EC67
(If you're not on Windows 7 or Windows Server 2008 R2, you have the potential to receive error messages about missing PowerShell 2.0.)
I got a little bit further into the installation once I had the Configuration Analyzer installed. But then, I discovered in a very round-about way, that there is an issue with the installer when you try to put the BPA on a machine that is not connected to a domain (e.g. just about every virtual machine I own). When it tried to run the PowerShell commands during setup, I saw a PS window briefly, along with some scary-looking red text, and then it disappeared and setup returned this error message (again typed out for SEO goodness):
|
There is a problem with this Windows Installer package. A program run as part of the setup did not finish as expected. Contact your support personnel or package vendor.
|

Which is great to know that it wasn't BPA's fault, but something else's. But would it have been that hard to consume the error message that obviously came back from PowerShell? Or better yet, return an error message like this:
|
Google "BPA 2.0 error powershell" |
Because as it turns out, there is a very easy fix to this, and it was the first hit for the search above. If you want to install BPA onto a machine that is not domain-savvy, instead of double-clicking the MSI file, run this command line (replacing <path> with the folder where the MSI file is located, and changing <platform> either to 32 or 64):
|
msiexec /i <path>\SQL2008R2BPA_Setup<platform>.msi SKIPCA=1 |
(Thanks to Adam Saxton for pointing this out on the CSS SQL blog.)
Once I got the BPA running, it did quickly point out several issues I was aware of (including bad partition alignment on a non-essential drive, some databases that were intentionally offline, and the fact that the error logs have a lot of history), as well as a couple of new issues:
The one I've expanded is one that Bob Ward talked about in a blog post yesterday as one of the key reasons to use the new BPA:
Obviously it finds a bunch of things that you may already have alerting for, and probably some things that you don't. In any case, this can be a very quick way to assess a SQL Server instance without having to rummage around for all of this information manually, and even with the stumbling you may have getting it configured, it should pay off rather quickly. There are plenty of things it doesn't look for, and you can get that information from other places as well - for example, check out the BLITZ! takeover script from Brent Ozar (blog | twitter):
http://www.brentozar.com/sql/blitz-minute-sql-server-takeovers/
Anyway, I haven't really offered any input that you can't see elsewhere, but I hope it is useful nonetheless.
|
-
In a lot of my projects, the issue of naming/coding conventions comes up. I am a big fan of PascalCase/plural for tables and views (e.g. dbo.CustomerAddresses, dbo.OrderDetails), PascalCase for column names (e.g. FirstName, HomePhone); and for stored procedures, I am partial to the Entity_Action naming convention (e.g. dbo.Customer_Create, dbo.Customer_Update). I know some like to separate out their words with underscores, but with the exception of stored procedures, I find this tedious.
One of the stickier points in several cases has been how to case short suffixes, such as the ID in CustomerID. Let's go back to what the ID is; an identifier. I often liken it to an identification card, which usually contains either a visible or a bar-coded identifier of some sort (SSN, driver's license #, etc). While sometimes this card is called an I.D. and sometimes an ID, I have yet to see it called an Id or an id.

For short suffixes that are actually acronyms, I prefer all upper-case. Most developers I've come across seem to like to lower-case the remainder of the acronym. And in some cases they have to; e.g. in JavaScript you refer to element.innerHtml; if you try element.innerHTML you get an error message. I can still tell what Html and Xml are when they aren't in upper-case, but it can be cumbersome to read that way, particularly with the Id / ID suffix. Because often "CustomerId" can look like "Customerld" - the difference is only slightly more clear in a fixed width, code-ish font:
CustomerId vs. Customerld |
Among other things, using an upper-case ID ensures that everyone knows we aren't talking about the customer's "pleasure principle" (according to Freud). And isn't this easier to read?
The "I dee" part rings out much better for me when the acronym is upper case. And when I find queries written by others where they have used their convention, I always re-write to conform to my convention. And while it is as weak an argument as writing standards-compliant code for fear that you will ever port to a different RDBMS, I find it hard to justify leaving code as is when it will fail if we ever have to use a case sensitive collation.
Even when I will always name the database entities in this way, if you want to call it CustomerId in the application code, feel free. Just be consistent. And don't expect me to leave it that way if you hand control of the code over to me. What are your thoughts? Are you a proponent of CustomerId? In the application code only, or in both the app and the database? If so, why?
|
-
Ola's Index Maintenance scripts
On August 3rd, Ola Hallengren (blog) released a substantial update to his Backup, Integrity Check and Index Optimization scripts - allowing you to more finely tune the object selection process. Over this past weekend he also updated the error handling to get around some blocking issues. For information about Ola's free solution, go here: http://ola.hallengren.com/Versions.html
SQL Server 2005 SP3 Cumulative Update #11
Last night, Microsoft released a new cumulative update for SQL Server 2005 SP3. This will bring you to build 9.00.4309. You can read about the fixes and request the download in KB #2258854:
http://support.microsoft.com/kb/2258854
SQL Server 2008 R2 Cumulative Update #3 Last night Microsoft also released the third update for R2, which brings the build number up to 10.50.1734. Check out the fixes and obtain the bits in KB #2261464: http://support.microsoft.com/kb/2261464
Mladen Prajdić's SSMS Tools Pack
Mladen (blog | twitter) quietly announced today on twitter that next week we will see the 1.9 release of his popular SSMS Tools Pack, a free add-in for Management Studio 2005 and above. In Mladen's own words, the update will contain "mostly a lot of bug fixes. window coloring by regexes, snippet printing, rewritten regions and debug sections, that's about it :)" In the meantime, you can download version 1.8 here: http://www.ssmstoolspack.com/
|
-
-
It's pretty simple: buy the penny auction software that QuiBids* uses, get a hosting account somewhere, and go to town.
* or Swoopo, or BigDeal, or BidCactus... the list goes on and on and
on...
The money these sites are making is absolutely mind-boggling. Let's take a recent auction (click to embiggen):
So the guy saved $556 and change. Pretty good for "doubleagent," whoever that person may be (more on this in a bit). But let's dig a bit deeper. He bid 214 times. At 60 cents a bid, that is $128.40 on top of the $243.44 he had to pay outright. QuiBids' loss so far: $427.16. Doesn't seem like a sound business model, right? (While they are probably buying enough that they are getting volume discounts, it can't be more than 5 or 10%.) Let's keep digging. Remember that the auction starts at one penny, and each bid increments the price by 1 penny. How many bids does it take to get the price up to $243.44? 24,344. This means that 24,130 times (24,344 - 214), some poor soul spent $0.60 bidding on a product he or she would not ultimately win. This is where the math gets interesting, and the revenue model starts to make a bit more sense. Before taking these losing bids into account, QuiBids was down by about $427. But then add the revenue from the bids: 24,130 * $0.60 = $14,478. So now QuiBids is UP by about $14,051. Let me repeat that profit value, so it is crystal clear:
So they sold a $799 iPad at an amazing profit margin of over 1,750%. No, that is not a misplaced decimal, and it did not take place during a period of hyperinflation. And while most people paid only a small portion of this, it is appalling that there were likely other users wasting just as many bids as "doubleagent" -- but did not win. And this is just one product - they sell between 3 and 4 iPads every day, and hundreds - if not thousands - of other products. Let's do some speculative and conservative math on a typical daily income statement for QuiBids, with some assumptions:
- there are too many items to take a good catalog, but the retail prices of the auction items range from $10 gift cards to $2,000+ MacBook Pros. Let's say the average product retails for $250. I was going to say $300 (about half the cheapest iPad), but I'm adding a $50 fudge factor so that I don't have to bother with calculating transaction fees - which can vary widely depending on how people purchase their bids. I'm also allowing for a bulk volume discount.
- we can assume that the average product sells for 40% of the retail cost (which I think is modest) - this means the typical $250 product actually sells for $100 (and this means 10,000 bids took place).
- we'll stick to the single penny auctions - there are some that list at 2 cent seed and 2 cent increment, but they'll just make this math more cumbersome.
- let's assume a nice round number of 10,000 products sold per day - I'm sure it is more, but that will just make this more nauseating.
- we'll assume the dedicated hosting to handle the traffic is $3,000 a month. Probably over-shooting this a bit, but even at a ridiculous figure of $30,000 a month, I doubt it will figure in to the astronomical numbers.
So what do we have here, assuming a 30 day month, for an average day?
Revenue
Product Sales: 10,000 products per day @ $100 = $1,000,000 Bid Sales: 10,000 products per day @ (10,000 * $0.60) = $60,000,000
Costs
Cost of Goods Sold: 10,000 products a day @ $250 = cost: $2,500,000 Expenses: Monthly hosting fee = cost: $3,000 / 30 = $100
Those numbers don't look huge:
Revenue: $61,000,000 Cost: $2,500,100 ----------------------- Profit: $58,499,900 (2,340%)
But remember: that is one day. Let's blow that profit up over a year and just multiply everything by 365:
Revenue: $22,265,000,000 Cost: $912,536,500 -----------------------
Obviously I've been generous about either the number of products they are selling, or the number of bids. If they are really selling 10,000 products a day and this many people
are spending all this money on bids, then the economy is not in as bad a
shape as everyone is making us believe.
And just to add insult to injury, let's dig a little deeper still. The veil of privacy really prevents me from knowing who "doubleagent" is, and I respect that I probably shouldn't be able to figure it out. But if you watch one of these bidding wars going on in real time, you will notice a very peculiar thing: roughly 99% of the usernames are made up usernames, and the avatars are never real pictures. On most other sites like this that I've experienced, you get a good mix of "cool" and "real" usernames. It makes me wonder what percentage of the users in any bidding war are, in fact, real people making real decisions. I'm sure that setting up a large pool of shill bidders (and even fake winners) and having the software "compete" with real people is not that hard, and potentially drives the bids a lot higher than they would have gone naturally.
On eBay I've always wondered how many sellers set up a separate account merely for the purpose of bidding against their bidders, especially in auctions with a reserve. Of course in that case it can backfire in a few ways; if the shill bidder wins on QuiBids, then the iPad still belongs to them, and they can offer it up for auction again. So not only are they $14,051 to the good, they still have the product! All they've paid for are the hosting / bandwidth costs, and fractional transaction fees when they charge users' credit cards for more bids. Circling back to the privacy shield, this is certainly not easy to
prove; but, you have to admit, it is not all that hard to envision,
either.
I'm just as much for capitalism as the next guy - but this is the kind of thing that makes me wonder if socialism has it right after all. Or at least, these are the kind of people that give capitalism a bad name. I have no problem seeing how they convinced venture capital to front them some cash... especially given the figures above. You just have to figure out which (if not all) do not have any qualms helping you screw most of your customers.
All in all, I guess I'm not really adding anything of value to the discussion. I'm
just amazed at the potential revenue involved here, and even more so
that it is legal. And finally, maybe a bit jealous that I didn't think
of this first - while my moral code would have ultimately prevented me
from following through, it hurts to think that I could own my own island
like these guys must by now.
You can read a lot more about this controversy here (and I forgive whoever sent that link in a reply to a tweet of mine from last month - I looked through my @mentions history and could not find the original pointer):
http://notmytribe.com/2010/penny-auctions-quibids-swipebids-et-al-are-more-than-scams-theyre-fraud-818056.html
Not surprisingly, a few other people have talked about this also:
http://www.delhandley.com/ripoff_report/bid-rivals-swoopo-quibids-and-other-ripoff-auction-websites/
http://wiki.answers.com/Q/What_are_opinions_on_the_website_QuiBids
http://www.ripoffreport.com/internet-fraud/quibids-llc/quibids-llc-a-cunning-ripoff-4468c.htm
While obviously all of these articles are slanted against the penny auction model, there really aren't any articles out there in favor of the business (at least none that aren't obvious or clever plants).
|
-
|
When I started my technical career, the only way to learn about programming was from books and magazines. I remember writing my first HTML page and finding it quite similar to the Vic20 days, where you would transcribe code for hours; when you made a mistake, that meant it was time to start over.
The Internet was around, but it wasn't really all that useful just yet -
there certainly was no MSDN Library, never mind blogs like this site, or interactive communities like StackOverflow and Twitter. Oh
sure, Usenet has been around forever, and in fact my career change was due largely in part to a technical mailing list where folks shared solutions to problems in the realm of HTML and JavaScript. But they pale in comparison to the thriving communities and other online resources we have today.
|
|
With the Internet comes
responsibility, however; reading something online does not make it true. This is why I prefer communities and interactive blogs (where you get more insight than from just one "author"), over paper books or passing some certification exam. Those things have their merits of course; while I have never taken an MS cert exam, I still have a bookshelf -- and this isn't even the whole thing -- that represents a small forest, from which I glean tidbits all the time:
In fact, just last week, I had three of these books out on my desk with about 20 new dog-eared pages.
I also think that a lot can be said about being in an environment where there are other smart people to bounce ideas off of. This morning, for example, I learned a boatload of stuff (in comparison to what I
already knew) by sitting with a co-worker for 5 minutes and going over some C# code I was re-tooling.
But in my opinion, the interactive nature of the growing SQL Server community is where people in our corner of the technical world can learn the most. Having the perspective of so many great minds at your disposal can be an intoxicating thing... you just have to get over the nerves of asking questions of folks like, well, you know who you are. (I don't want this to turn into an ass-kiss-fest, but there is an unprecedented number of smart folks bending over backwards to help this community.) And it doesn't stop online ... there are also free events like SQL Saturdays, paid resources such as the PASS Summit, and training courses that, while not free, can be worth every penny (I wrote briefly about Paul and Kimberly's course earlier this year). I don't think you can say any of this about the PHP, Oracle or Flash
communities - at least not with a straight face.
In the end, this installation of T-SQL Tuesday has me thinking of all the ways we learn, and how impressive it has been that the way we learn has evolved so much, even in my own relatively short career. And with the caliber of folks donating their time, energy and brainpower into making the community better, it can only be -- as Radioactive Man would say -- Up and Atom!
|
|
-
-
Last August, I wrote a lengthy post about how I handle splitting up a list of integers that are passed into a procedure as a comma-separated list.
I am still using this method today. But since almost a year has passed, and because there have been plenty of articles written about splitting in the meantime, I've decided to try out a couple of other methods -- to make sure I was still doing "the right thing." First I wanted to test out Adam Machanic's CLR approach. I also took a great interest to the XML splitting methods publicized via a T-SQL tennis match between MVPs Brad Schulz and Adam Haines (part I and part II). And I wanted to test Peter Larsson's derived table approach. I also thought it only fair to compare a few of the methods I've used in the past before coming across these faster and/or more scalable solutions, if for nothing else, to illustrate how far we've come.
I'm not going to revisit why you may want to process a list of integers instead of handling each integer individually. I'm going to assume that, like me, you have cases where you simply don't know how many values might be passed in a parameter list... causing you to ditch the approach by Celko where he has n individual parameters so he can enjoy type safety at the cost of code that is long and horribly hard to maintain (see one of his illogical rants on this).
The Functions
First, I will outline the 9 methods I'm going to compare, in no particular order:
RBAR 1
This is the solution I see most often out in the wild. What does RBAR mean? "Row by agonizing row." I'm not sure where I first picked up that term, but because it sounds so much like "FUBAR," I really like it. While most solutions do use loops of some sort, this is the worst kind... take a chunk off the list, do some stuff with it, insert it into the table, assign a new value to the list, then move on to the next part of the list. Here is the function:
CREATE FUNCTION dbo.SplitInts_RBAR_1 ( @List VARCHAR(MAX), @Delimiter CHAR(1) ) RETURNS @Items TABLE ( Item INT ) AS BEGIN DECLARE @Item VARCHAR(12), @Pos INT;
WHILE LEN(@List)>0 BEGIN SET @Pos = CHARINDEX(@Delimiter, @List);
IF @Pos = 0 SET @Pos = LEN(@List)+1;
SET @Item = LEFT(@List, @Pos-1);
INSERT @Items SELECT CONVERT(INT, LTRIM(RTRIM(@Item)));
SET @List = SUBSTRING(@List, @Pos + LEN(@Delimiter), LEN(@List)); IF LEN(@List) = 0 BREAK; END RETURN; END
|
RBAR 2
This is a very slight variation on RBAR1 (can you spot the differences?), but results in about twice as many reads when just outputting the data (the reads are roughly the same when using the data for more practical purposes, like joining with existing tables). For those wishing to stick with RBAR for whatever reason (remember what it sounds like), you may want to compare RBAR 1 and RBAR 2 and see if you can ensure that your functions are not suffering from performance problems you might not know about.
CREATE FUNCTION dbo.SplitInts_RBAR_2 ( @List VARCHAR(MAX), @Delimiter CHAR(1) ) RETURNS @Items TABLE ( Item INT ) AS BEGIN DECLARE @Item VARCHAR(12), @Pos INT; SELECT @Pos = 1; WHILE @Pos > 0 BEGIN SELECT @Pos = CHARINDEX(@Delimiter, @List);
IF @Pos > 0 SELECT @Item = LEFT(@List, @Pos - 1); ELSE SELECT @Item = @List; INSERT @Items SELECT CONVERT(INT, LTRIM(RTRIM(@Item))); SELECT @List = RIGHT(@List, LEN(@List) - @Pos);
IF LEN(@List) = 0 BREAK; END RETURN; END
|
CTE 1
This function uses a recursive CTE, with the first element and the remainder functioning as the anchor query, whittling down each element in turn. It inserts into the @table variable until you reach the last element.
CREATE FUNCTION dbo.SplitInts_CTE_1 ( @List VARCHAR(MAX), @Delimiter CHAR(1) ) RETURNS @Items TABLE ( Item INT ) AS BEGIN WITH ints(item, remainder) AS ( SELECT item = SUBSTRING(@List, 1, CHARINDEX(@Delimiter, @List)-1), remainder = LTRIM(RTRIM(SUBSTRING(@List, CHARINDEX(@Delimiter, @List) + 1, LEN(@List)))) UNION ALL SELECT item = SUBSTRING(remainder, 1, CHARINDEX(@Delimiter, remainder)-1), remainder = LTRIM(RTRIM(SUBSTRING(remainder, CHARINDEX(@Delimiter, remainder) + 1, LEN(remainder)))) FROM ints WHERE CHARINDEX(@Delimiter, remainder) > 0 ) INSERT @Items SELECT item FROM ints OPTION (MAXRECURSION 0); RETURN; END
|
CTE 2
This function is similar to the above recursive query, however it uses the positions in the string and fewer LEN() calls to extract each value.
CREATE FUNCTION dbo.SplitInts_CTE_2 ( @List VARCHAR(MAX), @Delimiter CHAR(1) ) RETURNS @Items TABLE (Item INT) AS BEGIN DECLARE @Len INT = LEN(@List) + 1;
WITH a AS ( SELECT [start] = 1, [end] = COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, 1), 0), @Len), [value] = LTRIM(RTRIM(SUBSTRING(@List, 1, COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, 1), 0), @Len)-1))) UNION ALL SELECT [start] = CONVERT(INT, [end]) + 1, [end] = COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, [end] + 1), 0), @Len), [value] = LTRIM(RTRIM(SUBSTRING(@List, [end] + 1, COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, [end] + 1), 0), @Len)-[end]-1))) FROM a WHERE [end] < @len ) INSERT @Items SELECT [value] FROM a WHERE LEN([value]) > 0 OPTION (MAXRECURSION 0);
RETURN; END
|
Numbers
This is the solution I am currently using in my production instances of SQL Server 2005 and SQL Server 2008. It uses a numbers table and, up until this week, was the fastest method I had tested to date.
CREATE FUNCTION dbo.SplitInts_Numbers ( @List VARCHAR(MAX), @Delimiter CHAR(1) ) RETURNS TABLE AS RETURN ( SELECT Item = CONVERT(INT, LTRIM(RTRIM( SUBSTRING(@List, Number, CHARINDEX(@Delimiter, @List + @Delimiter, Number) - Number)))) FROM dbo.Numbers WITH (NOLOCK) WHERE Number <= CONVERT(INT, LEN(@List)) AND SUBSTRING(@Delimiter + @List, Number, 1) = @Delimiter );
|
Inline 1
This is a similar approach to the numbers table solution (first spotted as written by Peso but with some minor adjustments), without requiring access to a numbers table. This does populate a local table variable in the body of the function, in order to make use of a less complex CTE to generate a sufficient set of integers.
CREATE FUNCTION dbo.SplitInts_Inline_1 ( @List VARCHAR(MAX), @Delimiter CHAR(1) ) RETURNS @Items TABLE ( Item INT ) AS BEGIN WITH v0 AS ( SELECT n = 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12 UNION ALL SELECT 13 UNION ALL SELECT 14 UNION ALL SELECT 15 ), v1 AS ( SELECT 16 * v0.n AS n FROM v0 ), v2 AS ( SELECT 256 * v0.n AS n FROM v0 ), v3 AS ( SELECT 4096 * v0.n AS n FROM v0 ), v4 AS ( SELECT 65536 * v0.n AS n FROM v0 WHERE n < 2 ) INSERT @Items SELECT Item = CONVERT(INT, (SUBSTRING(@Delimiter + @List + @Delimiter, w.n + 1, CHARINDEX(@Delimiter, @Delimiter + @List + @Delimiter, w.n + 1) - w.n - 1))) FROM ( SELECT n = v0.n + v1.n + v2.n + v3.n + v4.n FROM v0, v1, v2, v3, v4 ) AS w WHERE w.n = CHARINDEX(@Delimiter, @Delimiter + @List + @Delimiter, w.n) AND w.n < LEN(@Delimiter + @List); RETURN; END
|
Inline 2
Similar to the one above, but this time without the local table variable. You'll see that a lot more numbers had to be hard-coded into the function definition, but this will potentially pay off in terms of reads (stay tuned).
CREATE FUNCTION dbo.SplitInts_Inline_2 ( @List VARCHAR(MAX), @Delimiter CHAR(1) ) RETURNS TABLE AS RETURN ( SELECT Item = CONVERT(INT, (SUBSTRING( @Delimiter + @List + @Delimiter, w.n + 1, CHARINDEX(@Delimiter, @Delimiter + @List + @Delimiter, w.n + 1) - w.n - 1 ))) FROM (
SELECT n = v0.n + v1.n + v2.n + v3.n FROM ( SELECT n = 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12 UNION ALL SELECT 13 UNION ALL SELECT 14 UNION ALL SELECT 15 ) AS v0, ( SELECT n = 0 UNION ALL SELECT 16 UNION ALL SELECT 32 UNION ALL SELECT 48 UNION ALL SELECT 64 UNION ALL SELECT 80 UNION ALL SELECT 96 UNION ALL SELECT 112 UNION ALL SELECT 128 UNION ALL SELECT 144 UNION ALL SELECT 160 UNION ALL SELECT 176 UNION ALL SELECT 192 UNION ALL SELECT 208 UNION ALL SELECT 224 UNION ALL SELECT 240 ) AS v1, ( SELECT n = 0 UNION ALL SELECT 256 UNION ALL SELECT 512 UNION ALL SELECT 768 UNION ALL SELECT 1024 UNION ALL SELECT 1280 UNION ALL SELECT 1536 UNION ALL SELECT 1792 UNION ALL SELECT 2048 UNION ALL SELECT 2304 UNION ALL SELECT 2560 UNION ALL SELECT 2816 UNION ALL SELECT 3072 UNION ALL SELECT 3328 UNION ALL SELECT 3584 UNION ALL SELECT 3840 ) AS v2, ( SELECT n = 0 UNION ALL SELECT 4096 UNION ALL SELECT 8192 UNION ALL SELECT 12288 UNION ALL SELECT 16384 UNION ALL SELECT 20480 UNION ALL SELECT 24576 UNION ALL SELECT 28672 UNION ALL SELECT 32768 UNION ALL SELECT 36864 UNION ALL SELECT 40960 UNION ALL SELECT 45056 UNION ALL SELECT 49152 UNION ALL SELECT 53248 UNION ALL SELECT 57344 UNION ALL SELECT 61440 UNION ALL SELECT 65536 UNION ALL SELECT 69632 UNION ALL SELECT 73728 UNION ALL SELECT 77824 UNION ALL SELECT 81920 UNION ALL SELECT 86016 UNION ALL SELECT 90112 UNION ALL SELECT 94208 UNION ALL SELECT 98304 UNION ALL SELECT 102400 UNION ALL SELECT 106496 UNION ALL SELECT 110592 UNION ALL SELECT 114688 UNION ALL SELECT 118784 UNION ALL SELECT 122880 UNION ALL SELECT 126976 UNION ALL SELECT 131072 UNION ALL SELECT 135168 UNION ALL SELECT 139264 UNION ALL SELECT 143360 UNION ALL SELECT 147456 ) v3 ) w WHERE w.n = CHARINDEX(@Delimiter, @Delimiter + @List + @Delimiter, w.n) AND w.n < LEN(@Delimiter + @List) );
|
CLR
This was an adaptation of Adam Machanic's CLR approach (as
mentioned above), with very minor modifications to return a table of integers instead of strings. I won't re-list his CLR function but will gladly share my modifications. You may want to play with Erland's CLR version as well, which is a lot more straightforward and will potentially scale equally well, at least to moderately-sized strings. Here is the wrapper to the function (I created a StringHelper.dll with several user-defined functions, including SplitInts()):
CREATE FUNCTION dbo.SplitInts_CLR ( @List NVARCHAR(MAX), @Delimiter CHAR(1) ) RETURNS TABLE ( Item INT ) EXTERNAL NAME [StringHelper].[UserDefinedFunctions].[SplitInts]; GO
|
XML
Again, this solution was thrown into the mix due to some heavy touting and promise indicated by Brad Schulz. I've always been wary of XML solutions for splitting / concatenating because of the overhead of constructing and deconstructing the XML around the data. Soon we'll see if my concerns are justified.
CREATE FUNCTION dbo.SplitInts_XML ( @List VARCHAR(MAX), @Delimiter CHAR(1) ) RETURNS TABLE AS RETURN ( SELECT Item = CONVERT(INT, Item) FROM ( SELECT Item = x.i.value('(./text())[1]', 'INT') FROM ( SELECT [XML] = CONVERT(XML, '<i>' + REPLACE(@List, @Delimiter, '</i><i>') + '</i>').query('.') ) AS a CROSS APPLY [XML].nodes('i') AS x(i) ) AS y WHERE Item IS NOT NULL );
|
I'm leaving out error handling for brevity, and I'm also leaving out
any validation that the list consists of only integers -- I'm going to
assume that you can perform that kind of checking much more efficiently
in your application code, and can add it to the T-SQL side if you see
fit. I'm also not worrying at this stage about removing duplicates or about preserving the order that the values were originally defined in the list.
Finally, I wanted to ensure that each solution could handle 50,000
entries in the list; since your needs may be a lot more modest than this, you may want to pare down your entries to improve
performance (for example, you don't need as many entries in "Inline 2", and your Numbers table can contain fewer rows - for some solutions, you'll need a numbers table covering 14 * <max number of elements> since an INT can be up to 12 digits and the comma and perhaps a space). In this situation, since I knew all of my values would be <= 2 characters and I didn't have any embedded spaces, I didn't really need a Numbers table quite so large, and I didn't really need to define so many permutations in the "Inline 2" function.
The Testing Process
First, since a few of the solutions require a numbers or tally table, I prepared the following table:
SET NOCOUNT ON; DECLARE @UpperLimit INT; SET @UpperLimit = 256000;
WITH n AS ( SELECT x = 1 UNION ALL SELECT x = x + 1 FROM n WHERE x < @UpperLimit ) SELECT [Number] = x INTO dbo.Numbers FROM n WHERE x <= 256000 OPTION (MAXRECURSION 0); GO CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]);
|
Then I started a trace with a filter of TextData LIKE '%dbo.SplitInts%' and with the following columns: TextData, Reads, Writes, CPU, Duration. Once the trace was running, I ran the following code block three times, giving me a good sample of how each solution performs:
DECLARE @List VARCHAR(MAX) = '1,2,3,4,5,6,7,8,9,10,', @Delimiter CHAR(1) = ',';
DECLARE c CURSOR FOR SELECT List FROM ( SELECT r = 1, List = REPLICATE(@List, 10) UNION ALL SELECT r = 2, List = REPLICATE(@List, 50) UNION ALL SELECT r = 3, List = REPLICATE(@List, 100) UNION ALL SELECT r = 4, List = REPLICATE(@List, 500) UNION ALL SELECT r = 5, List = REPLICATE(@List, 1000) UNION ALL SELECT r = 6, List = REPLICATE(@List, 2500) UNION ALL SELECT r = 7, List = REPLICATE(@List, 5000) ) AS x ORDER BY r;
OPEN c;
FETCH NEXT FROM c INTO @List;
WHILE @@FETCH_STATUS = 0 BEGIN SELECT Item FROM dbo.SplitInts_CLR (@List, @Delimiter); SELECT Item FROM dbo.SplitInts_RBAR_1 (@List, @Delimiter); SELECT Item FROM dbo.SplitInts_RBAR_2 (@List, @Delimiter); SELECT Item FROM dbo.SplitInts_Numbers (@List, @Delimiter); SELECT Item FROM dbo.SplitInts_XML (@List, @Delimiter); SELECT Item FROM dbo.SplitInts_CTE_1 (@List, @Delimiter); SELECT Item FROM dbo.SplitInts_CTE_2 (@List, @Delimiter); SELECT Item FROM dbo.SplitInts_Numbers_1(@List, @Delimiter); SELECT Item FROM dbo.SplitInts_Numbers_2(@List, @Delimiter); FETCH NEXT FROM c INTO @List; END
DEALLOCATE c;
|
I ran the test 10 times, then pulled averages from the trace tables for each function and element count.
The Results
In terms of duration, CLR won every time. Heck, CLR won every time in every category. (This might not come as a big surprise to you. After all, just yesterday, I let the cat out of the bag by declaring that I'd
finally given in and am letting the CLR into my life.)
But if you are one of many who are not able to implement CLR functions in your production environments, you may want to pay attention to some of the other methods. In the chart below I've highlighted the winners and losers as follows: longest duration at each interval (red), shortest duration at each interval (light green), and the shortest duration other than CLR (even lighter green). Note that "CTE 1" for 50,000 elements, at over 26,000ms total duration, is way off the chart.

 Duration, in milliseconds, for splitting delimited strings of varying numbers of elements
We can see that duration for the CLR method and a few others creep up very slowly, while several of the remaining methods increase at a much more rapid pace. There is also an inexplicable bump for the XML method at 500 and 1,000 elements, where duration is longer than found with 25,000 elements. I thought this may be an anomaly or due to other pressure on the box, but this was repeatable time and time again. We can see that the "CTE 1" method is the absolute worst performer at 5,000 rows and above, and that aside from the CLR method, the numbers table was the winner up to 10,000 elements; beyond that, the "Inline 2" method took over -- but granted, not by much.
As mentioned before, I examined reads, writes, CPU and duration to compare these different splitting techniques. CPU correlated almost identically to duration, so I'm not going to bother repeating a similar chart as above. Writes did not seem to factor in; the only method that performed any writes to speak of was the "CTE 1" function, and based on duration alone we are discarding that as an alternative in any case. Here are the reads (again I've highlighted the winners in green and the losers in red):
 Logical reads, in 8K pages, for splitting delimited strings of varying numbers of elements
I didn't graph the reads because, due to the range of values involved, I'd have to do it logarithmically to even present it on your screen, and that would take away some of the "punch" you get with these numbers. :-)
What I've Learned
If you are going to be mainly dealing with < 100 items in your lists, then the difference between the various methods was so minuscule, it was not even worth mentioning the actual figures. Even at 1000 items, the only real loser was the XML method (even though the reads on the CTE methods might make you think otherwise). Once you get over 10,000 items in the list, you start to see a greater separation between the men and the boys; there is clearly a set of options that yield unacceptable performance, and a set of options that aren't all that different. To a large degree, the duration figures correlate almost directly to the reads, but even though several methods provide the same 0-level of reads, the CLR always wins in duration. You may nit-pick and find little things that will make various methods a little bit faster, but I don't think there is anything ground-breaking that will change any function's performance by an order of magnitude.
The major lessons to learn here are as follows:
- You should test all available
solutions both with varying numbers of parameters. In this case if I had tested only
against a smaller list, I might have ended up with a solution that would
have been a real performance problem should I need to deal with thousands of elements -- and without proper testing scenarios in place, this would most likely happen at an
inconvenient time.
- While this stands true for any type of performance test, I couldn't leave it out: If you are running tests on a local machine or a test server, you should test on the destination hardware as well, since there are many factors that could change performance in that environment. Without belaboring the point too much, I had slightly different results on a local machine with less RAM and fewer processors that might have led me in a different direction if I had been less thorough.
- The usage of LEN() or DATALENGTH(), combined with using a @table variable inside the function to hold intermediate results, can increase the reads and overall duration in a major way; it is clear that certain functions above are not optimized for this type of implementation.
Conclusion
Based on these tests, I will be proceeding with introducing CLR functions to my primary systems to help with certain string functionality. For now, I will be limiting this to splitting, concatenation, parsing, and RegEx matching. For splitting specifically, as time goes on, we are introducing more and more bits of functionality that require accepting application data in a list (permissions, network associations, etc.), and I am looking only slightly ahead to a day where we will need to process thousands of items in some of these lists. While the overall duration and read savings compared to a numbers table was quite modest, centralizing this functionality for further extensibility and without the limitations of T-SQL UDFs seems to be a worthwhile investment to me.
Soon I hope to perform some tests revolving around file system access, since we have several C# command-line apps, as well as inline T-SQL using xp_cmdshell, all for the purpose of processing files. These files are almost always processed in the same type of way; the biggest difference is either the location or the purpose of the file.
|
-
I've always been very reluctant to use the CLR. On a completely unrelated tangent recently, I begrudgingly threw CLR into the ring of a string splitting performance test (blog post forthcoming). Actually, the reason I was performing the test at all was because of a couple of blog posts by Brad Schulz, where he shows some great enhancements to typical XML splitting mechanisms. (Articles are here and here.)
The results were astounding to me. Again, I will provide far more details later, but I was convinced that the mere overhead of the CLR would make it slower than other solutions, particularly on smaller strings. (Think about how hard it is to shave 5 minutes off of a 20 minute drive.) I was surprised that this was not the case at all, and as a result, I will be testing various ways to implement the CLR for various string functionality, such as splitting strings, and of course validating e-mail addresses and other RegEx necessities. I will also be interested to test it for file system operations, such as browsing a directory list, and queuing files up for BULK INSERT / archive / destruction.
So, CLR, you haven't yet found a place in my heart, but you have found a place in my databases. And it is due in large part to Adam Machanic, whose post last April on splitting strings has become one of my new favorites.
I hope this will be enough motivation for all of you other holdouts out there to give the CLR a shot!
|
-
-
I find myself categorizing a subset of SSMS features ranging from "must have" features to "cringe when they are mentioned." I'll try to summarize the standouts for you here, then ask you these questions: Which parts of SSMS do you rely on? Which do you wish were not there at all?
The "I almost always use it to get the task done." variety
- SQL Server Agent's Job / Job Step Editor. While not pretty, the wizard does make quick work of creating jobs with several steps, multiple schedules, and various execution paths. (Let's just pretend the 2008 R2 version isn't a no-op due to a debilitating bug.)
- Profiler. Similar to the job procedures, I am unlikely to ever get the hang of scripting a server-side trace from memory. So, I use the profiler UI to set up the trace details, then save the script for future use. If I have to run a slightly different server-side trace, I can use the same script with minimal changes; but in many cases it is just as quick to use the UI again and start from scratch. (Technically, I guess this isn't part of SSMS, but it is certainly a management UI that has had its share of criticism over the years.)
- Right-click object > Modify, or Script as ALTER to New Window. This is by far the quickest way for me to get at the *current* version that is deployed, and when I am done testing my revisions against the dev server, I can then move the changes to source control. I see a lot of people using sp_helptext or manually selecting from syscomments / sys.sql_modules / OBJECT_DEFINITION(), but this can involve several more steps to get the code into suitable (and more importantly, maintainable) shape.
- Registered servers. These are a god-send, especially since I've learned to export carefully constructed nodes and share them across all machines. I wish the tool in general was more consistent at remembering passwords depending on how I've initiated or switch a connection, but that's another story I suppose.
The "I will use it if I have to, but usually feel icky and prefer better alternatives." variety
- Table Designer - occasionally I will use this tool (right-click a table, Design), but only at the beginning of projects, where I want to slightly adjust the column order in a new table. Even in that case, more often than not, I'll drop and re-create the table. For tables already in use, I always use DDL commands like ALTER TABLE, knowing that any new columns end up at the end of the table. And that's okay with me.
- SQL Server Agent > Job History. This dialog has some real quirkiness; in our environment, it is always confused because it is being run on a desktop in one time zone against servers in another. I have been complaining about this since before SQL Server 2005 was released (though first officially documented in 2006; see Connect #124841). I also find the interface cumbersome to use and drill down - usually I get more complete and consumable information, and more immediately, from SQL Sentry's Event Manager - which also has the ability to store more history in the repository without bogging down production instances of msdb.
- Select Top N Rows. This is a quick way to grab an arbitrary 1000 rows (or whatever number you customize) from a table. Sadly, it just blindly executes the query without giving you the opportunity to modify the column list, where clause or isolation level (personally, I prefer Script > As SELECT To > New Query Editor Window). Oh, and it won't necessarily put you in the context of the right database; instead, it uses a three-part name for the table, which can be frustrating if you try to add a simple join or subquery and forget which database you're really in.
The "I will not touch it with a 10-foot pole"! variety:- Activity Monitor. While this replacement of the useless view of the same name in previous iterations of the tool shows a lot of potential, and its
motivation is in the right spot, it is in its infancy and has a lot of quirky behaviors. If I want a good
look at what is going on in my server, I will use SQL
Sentry's Performance Advisor - which does a great job of warning me about issues even when I'm not paying attention. For a really quick pulse of current
activity or in environments where I haven't convinced them to use
Performance Advisor, then I turn to Adam Machanic's sp_whoIsActive. For a really quick look, yes, Activity Monitor can be useful... but for very little incremental effort you can get to a lot more information without memorizing all of the DMVs.
- This will be ridiculously shocking, but any of the features that allow you to shrink a database or a file. I acknowledge that there are some cases where shrinking a file is necessary, and I don't want to get into a religious battle about it. I just wish that the task was much more difficult and thought-provoking to complete. Rather than one button or checkbox, imagine if the process to shrink a file were as convoluted as setting up SQL Server? There is one case where I'd applaud a longer and more cumbersome set of steps to accomplish a simple task.
- Most other designers / wizards. Including, but not limited to:
- View Designer
- Query Designer
- Edit Top N Rows (formerly Open Table)
- New/Edit Database
- New/Edit Login
Go ahead, search Connect. There are dozens of bugs against these dialogs, and few if any will ever be fixed. I can quote myself on twitter from earlier tonight: "Never send a UI to do a DDL's job." As I've explained above, this isn't always true; but in most cases the DDL is going to be more predictable and also infinitely more repeatable. The "Wait, why isn't that feature there?" variety
I'm cheating here a little bit, but there are some noticeable omissions from Management Studio that I can't even complain about because they never bothered to create them. UIs to support features that are cumbersome to set up using code alone, such as Service Broker and Extended Events (they are gradually getting to others such as Mirroring and of course the new DACPAC stuff). Or things that are long overdue but are otherwise covered by great 3rd party add-ins, like Mladen Prajdic's SSMS Tools Pack.
The "Oh yeah, I didn't even think of that one!" variety
It is a pretty vast application and there are probably several features I didn't even think to mention. So again, I'll ask: which parts of SSMS can you not live without? Which ones would you throw off a pier if you were in control?
|
-
Since I am still afraid of SSIS, and because I am dealing mostly with CSV files and table structures that are relatively simple and require only one of the three letters in the acronym "ETL," I find myself using BULK INSERT a lot. I have been meaning to switch to using CLR, since I am doing a lot of file system querying using xp_cmdshell, but I haven't had the chance to really explore it yet. I know, a lot of you are probably thinking, wow, look at all those bad habits. But for every person thinking that way, I am sure there is someone thinking, yes, I am still using BULK INSERT as well. So, let me tell you about two little problems I came across recently, and how I worked around them.
The Phantom Carriage Return
In one system, we get log files from a bunch of application servers running Linux. The other day we built a new log file output mechanism using the same code (we are removing, as much as we can, the applications' dependency on the database). However, the same code outputting the same type of log file (just containing different data) was generating the following errors when I tried to BULK INSERT any file with more than one row:
Msg 4863, Level 16, State 1, Line 1 Bulk load data conversion error (truncation) for row 1, column <last column number> (<last column name >). |
I asked the developer if anything had changed with the constant he was using for carriage returns, and he said no, "\n" was still in place. No matter how I changed the code, I was still getting the error:
... (WITH ROWTERMINATOR = '\r\n', ... ); ... (WITH ROWTERMINATOR = '\n\r', ... ); ... (WITH ROWTERMINATOR = '\n\n', ... ); ... (WITH ROWTERMINATOR = '\r', ... ); |
Nothing would work. I had half a mind to go and look at the Perl code on the Linux box, but before I got to that point, and on a total lark, I tried this version of the code instead:
DECLARE @b NVARCHAR(2048); SET @b = 'BULK INSERT ... WITH (ROWTERMINATOR = ''' + CHAR(10) + ''', ... );' |
Lo and behold; if I used the CHAR() code from SQL embedded in the string, it worked! For some of you, the addition of dynamic SQL will make you wag your finger at me even more. But for anyone who is polling for new files to load, you are probably already using dynamic SQL to construct these statements anyway. I haven't really dug into why this worked but "\n" didn't, I just knew it was working and could move onto the next most important task.
Using BULK INSERT Against a Linked Server
When you try to use BULK INSERT against a four-part name, e.g.:
BULK INSERT [OtherServer].[MyDatabase].[dbo].[foo] ... ;
|
You get the following error:
Msg 208, Level 16, State 82, Line 1 Invalid object name 'foo'. |
I complained about the misleading error message in Connect #525287: BULK INSERT to linked server gives misleading error message. In that Connect item, I mentioned that I had an easy workaround, and had a follow-up from "Pbaggett" asking me about my workaround (though I just noticed the comment today), so I thought it was a perfect opportunity to make this post a little meatier.
In the system I am working in, there is a central table that describes various tables that exist on various systems in the environment. Most of the tables that are a target for BULK INSERT are local, but a few are remote. So, this central table contains information about the server if it exists; otherwise it is NULL. As described above, all BULK INSERT statements are constructed dynamically anyway, since the target tables aren't always inserted to, and the file names and number of files are volatile. Anyway, the trick here is to execute a SQL statement *on* the remote server when necessary (using a dynamically built, and then nested, sp_executesql call). I'll illustrate what I mean here with some sample code (for brevity, I've left placeholders for error handling, loops, etc.):
DECLARE @LinkedServer VARCHAR(32), @Database NVARCHAR(128), @Table NVARCHAR(255), @FilePath VARCHAR(2048), @sql NVARCHAR(MAX);
-- this would actually be in two loops: one to get all of the target tables, -- and the other to get all of the relevant files for each target table.
SELECT @LinkedServer = '[OtherServer]', @Database = '[MyDB]', @Table = '[dbo].[foo]', @FilePath = '\\FileServer\file.csv';
SET @sql = N'BULK INSERT ' + @Database + '.' + @Table + ' FROM ''' + @FilePath + ''' ' + 'WITH (FIELDTERMINATOR = ''|'', ROWTERMINATOR = ''\n'');';
SET @sql = N'sp_executesql N''' + REPLACE(@sql, '''', '''''') + ''';';
BEGIN TRY -- sp_testlinkedserver goes here
SET @sql = N'EXEC ' + COALESCE(@Server + '...', '') + @sql;
PRINT @sql;
-- I've commented out the EXEC here; PRINT should -- give you the gist of what's going on.
--EXEC master..sp_executesql @sql; END TRY BEGIN CATCH -- error handling END CATCH |
Of course there are some complications here... doubling up the apostrophes and trying to debug the final SQL statement can be a pain; you also need to be sure that all of the remote servers are able to see the same network shares using the context that the linked server uses. But so far this workaround has allowed me to keep my log polling / loading process central, even while the distribution of data has scaled out to other servers.
|
-
Well, this is the first time in a long time that I've blogged about cumulative updates for two different versions of SQL Server on the same day. Yesterday Microsoft released a cumulative update for SQL Server 2008 SP1 (bringing you to 10.0.2775), and a corresponding cumulative update for SQL Server 2008 R2 RTM (bringing you from 10.50.1600 to 10.50.1702). You can read more about these updates here: Cumulative Update #1 for SQL Server 2008 R2
RTM (KB #981355)
Cumulative Update #8 for SQL Server 2008
Service Pack 1 (KB #981702)
Note that the cumulative update for SQL Server 2008 R2 is merely a rollup of the updates found in cumulative updates 5, 6 and 7 for 2008 SP1 - it does not include several of the fixes from CU #8 for 2008 SP1. I would expect to see a continued pattern of bugs relevant to both versions being fixed in one branch, and not rolled into the other until the subsequent update (and where it appears first will depend on where it was discovered).
|
|
|
|
|
|