During all phases of development, particularly when building a completely new system, it is very common to have a need for data to test with. As important as this is for a UI system, the need is doubled is when you are working on a reporting system. While Everyone knows that the process of getting test data can be a fickle thing during the development process. All programmers, relational and functional alike, when we test our code, we tend to create some meaningless data, usually only a row or two, to validate whatever table constraints, triggers, procedures, form, screen, batch process, etc that we are building, but that data, generally speaking of course, likely bears little resemblance to "real" data. But it is fine for the development process.
To write queries, everyone (including myself, if I can be honest with you) want some reasonable data in a table to write queries against. Consider that the overarching mantra of the forum/newsgroup question answerer is: "provide me a table and some test data..." so we can test/write queries against it. Without some data, answering a question for a person can be like target shooting with little idea of where the target is actually located.
The important question is, what does "some" mean, as in "I need some data". I have worked on this mentally for years, as I struggled to put a label on the data correctness factor, because it it such a hard process balancing the needs of the programmers building the database and code that uses it with the programmers building reporting. The process is a bit like having builders building the foundation of a house while another group of folks setting up for a full dinner party on the floor above. Never going to happen in most any industry other than ours.
Lately, on a large reporting project I have been working on, this frustration has reached dizzying heights. We have data conversions from an existing system going on all of the time, working to get the final data right for the final cutover. We have gone over and over throughout the process of the entire project with each "new" set of data it was questioned how "right" the data was. Each of these sets of data was good enough to use in some capacity, but rarely if you need fine, verifiable detail going back to the original system, generally to match up to some existing report. The whole process can be really frustrating, though somewhat necessary.
So in order to label this data, I will attempt to coin a few names for some of the ways our data ends up in the system during the process, if for no other reason than I can begin to label data in our systems in this manner for developers who are stuck using data in various forms, even if it does drive them a bit nutty. The terms I am tossing out to define test data quality are: Scribble, Reasonable, Representative and Live/Real. I will define these as:
Scribble - Just data that is entered with no concern for other users. probably very fake looking, including values like 'ahdflka', or 'aafdfiduah' or even 'alsjdflaj;' (that's kind of fun...) No real value other than being there for you to mess around with to try updates, deletes, etc. Reporting value is almost nil, even for the person who thinks very abstractly.
Reasonable - Valid data based on business rules, but likely may not resemble reality. The purpose of this type of data is often to test one area of functionality, like to see some data on the screen. It might just be a set of rows that have been hand manufactured to look like possible data, even if it is not a combination of values that has ever, or will ever exist. This kind of data is built relatively easy, either using the user interface, or just by creating data using random number generation (like for keys based on numeric surrogates). You must think abstractly when using this data, since nothing is likely to be correct. Random numbers are certainly no great shakes when it comes to correctness, and a person typing data in for testing is very often going to get very lazy and start entering gibberish (like 'aslkfal', or 'askjakjcb' (that never loses it's fun factor))
Representative Subset - A set of data that is incomplete, but resembles, or is a subset of, reality. This is the type of data that we all want as developers. It looks good, feels good, and other than magnitude, it is real data. It is also far harder to create. For existing systems, it might be converted data, or for completely new systems, it is probably hand entered, likely for the QA team.
Real - Real live data that, if you weren't in a development area would in fact be usable for business. It is likely a snapshot up to some point in time, but as of a given point in time, this is what could be in production. This data is more likely in a system that you are building as an upgrade to another system. If this is a wholly new system, there is some chance that your system will start out empty. At best clerks might be hired to copy data that is currently held in file folders (paper, remember it?) into the online system, but they might not do that until after the "go live" date.
Maybe by labeling data as "reasonable" it would be clear to folks that there is data to browse, try out, etc, but there is no way possible that they can validate this data against anything other than the business rules. This is especially an issue with a reporting source like a data warehouse. It is expected in the minds of all users that data warehouse data will be based on a verifiable source, though when in the development phase, this clearly might not be the case.
Have better ideas? Better names? Want to show me up and look cool to your nerdy friends? Please leave comments :)
Crossposted to http://drsql.spaces.live.com