How advanced do you have to be to take advantage of your Refactoring SQL Applications book? Can a beginner also benefit from this book?
It depends on how you define “beginner”. If you mean “first programming job” or “first DBA job”, I'm afraid that the book may be a little hard to swallow for a complete novice. But if you have spent several months working on an application and have performance issues, if you consider yourself a professional, if not quite a seasoned one, there are probably many points from the book that you can pick up easily. The basic ideas are not complicated.
Is there a 'safe' way to refactor databases? Are there any tools to help?
First of all, the book is about refactoring applications much more than refactoring the database (even if I touch on the topic), precisely because it is safer and usually much more efficient to modify the code rather than the structure. There are of course many cases where the database design is here and there rather questionable. If you can rewrite all your programs to take advantage of a good design, that's fine. Otherwise, you risk making performance worse rather than better if you try to disguise a clean design with views, materialized or not, so that it looks like the bad old design to the applications you have no time to rewrite.
If there were a safe, automated way to refactor SQL applications, I would have written (long ago) a software product instead of a book, and become rich. What I like in the concept of refactoring is that it acknowledges that you can, and should, review the code. Too often, the ideas that people have about improving performance are limited to changing database parameters, adding indexes, reorganizing data and upgrading hardware. Actually, improving the code is probably the safest and most efficient method of all if you have a decent knowledge of SQL. Database parameters have side effects; adding an index may speed up a few queries but also break a nightly insertion batch that must run between the moment when New Yorkers head home and Tokyoites arrive at the office – because maintaining an extra index means extra work and extra time. Reorganizations are disruptive, sometimes impossible, and their effects, when there are effects, rarely last. Even a hardware upgrade can be risky; you may have to pay much more for your software licenses for an improvement that is usually modest, and sometimes faster processors just increase contention and make matters worse. This isn't to say, of course, that you should run your database server out of the box with default parameters on the cheapest hardware available and index no table; there are times when parameters must be adjusted, indexes created on columns other than primary keys, and hardware upgraded.
But too often people just proceed by trial and error, and don't challenge the most sensitive part of the system: the application code. If you master it (a big if, sometimes, with legacy applications) you are on safe ground when you review it with a critical eye. There isn't much to lose, and a lot to gain.
Concerning tools, you are probably thinking of the various “wizards”, “advisors” and other “assistants” for SQL tuning? Actually, they work well (I have seen the SQL Server Database Engine Tuning Advisor be impressively precise in its predictions), but their scope is exactly what they say: tuning. For me, tuning is doing exactly the same thing a little faster, while refactoring is finding a way to get the same result faster; I find refactoring much more interesting. Generally speaking, it's not tuning that will make performance really jump. My problem with tuning tools is not, actually, with the tools themselves, but with the marketing fluff around them, which lets people imagine that tools will auto-magically solve all performance issues for them. A bit like making people believe that a little signal processing will make an apprentice violinist sound like Itzhak Perlman.
There is a wide gap between making a DBMS run better and improving program performance. A Texan friend of mine recently told me of a program that vigorous rewriting brought from a runtime of 18 hours down to 2 minutes; not long ago, I had a case myself of 13 hours down to 10 minutes, and another one from 8,000 down to 40 seconds only last Monday. You don't see that every day, but improving by a factor of 4 or 5 is common, and I very rarely add indexes. Profilers are much more useful than tuning tools; they tell you where time is spent. Then you can rethink the process. I usually query the dm_ views (in SQL Server) or their Oracle equivalents to try to get an idea of what is keeping the DBMS busy. Just seeing how many times a statement has been executed and the number of logical I/Os it has performed (pages it has accessed in memory) tells a lot about how applications were coded.
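SQLite has no dm_ views, but the same diagnostic can be sketched client-side. Here is an assumption-heavy illustration (the table and queries are invented, and sqlite3's trace callback stands in for server-side execution statistics) of counting how often each statement shape runs – the kind of count that betrays a chatty application:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer {i}") for i in range(100)])

# No DMVs here, so count executions on the client side:
# set_trace_callback fires once per statement actually executed.
# Stripping the WHERE clause groups executions by statement shape.
stats = Counter()
conn.set_trace_callback(
    lambda sql: stats.update([sql.split("WHERE")[0].strip()]))

# A chatty application: the same single-row lookup, 100 times over.
for i in range(100):
    conn.execute("SELECT name FROM customers WHERE id = ?", (i,)).fetchone()

for statement, count in stats.most_common(1):
    print(count, statement)  # the lookup dominates: executed 100 times
```

On a real server the equivalent information comes from the DBMS itself (execution counts and logical reads per cached statement), but the conclusion is read the same way: one tiny statement executed an enormous number of times points straight at the application code.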
If someone could only read one chapter of your Refactoring SQL Applications book, which chapter would this person get the most value out of?
There are few chapters in this book, only seven. I would say the 6th one perhaps, which deals with algorithms, getting rid of cursor loops and that kind of thing. Increasingly, that's what makes performance sink.
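To make "getting rid of cursor loops" concrete, here is a minimal sketch – using Python's sqlite3 and a made-up accounts table, not an example from the book – of the classic rewrite: a fetch-and-update loop replaced by a single set-based statement that produces exactly the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")

def load():
    # (Re)load the same five rows so both versions start from identical data.
    conn.execute("DELETE FROM accounts")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [(i, 40.0 + 20.0 * i) for i in range(5)])

# Cursor-loop style: fetch every row, decide in the client,
# send one UPDATE per qualifying row.
load()
for account_id, balance in conn.execute(
        "SELECT id, balance FROM accounts").fetchall():
    if balance >= 50.0:
        conn.execute("UPDATE accounts SET balance = ? WHERE id = ?",
                     (balance * 2, account_id))
loop_result = list(conn.execute(
    "SELECT id, balance FROM accounts ORDER BY id"))

# Set-based style: push the whole operation into one statement
# and let the engine do a single pass.
load()
conn.execute("UPDATE accounts SET balance = balance * 2 "
             "WHERE balance >= 50.0")
set_result = list(conn.execute(
    "SELECT id, balance FROM accounts ORDER BY id"))

assert loop_result == set_result  # same outcome, one statement instead of N
```

With five rows the difference is invisible; with millions of rows the loop version pays the statement overhead once per row while the set-based version pays it once.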
Why is refactoring so hard when dealing with databases?
From a technical point of view, it's not so hard. Once you have understood a couple of key concepts, with a little SQL skill you look like a Hogwarts graduate. The real, practical difficulties sometimes come from unexpected corners. I have devoted a full chapter to the generation of test data and the comparison of result sets, because these are topics that can make you waste a lot of time. The particularity of SQL applications is that, as you are often working with enormous datasets, there is a huge leverage effect, and an insignificant change is rarely insignificant – one way or the other.
What is perhaps harder to deal with is ... face. When you increase throughput by anything between 30% and, say, a factor of 3, people are happy with your work. When you reach factors of 10 and above, people who were involved in the initial program or the last optimization efforts begin to feel uncomfortable – even when you work closely with them (which I try to do whenever I can) – because it puts them in an awkward position with their management or their customers. The problem isn't that developers are bad; you sometimes see code that was obviously written by someone who took the first job available to feed a large family, but usually it's just lack of database experience that shows.
Some developers don't think about performance and large data sets until it is too late; how much more difficult is it to refactor when dealing with a lot of data?
I'd say that there are several aspects. I have already mentioned getting usable data and comparing result sets. There is also speed of iteration: when you try to refactor a process that takes 4 or 5 hours, you usually have to run it once to get a reference. Meanwhile you rework the code, and perhaps you'll make it run in 2 hours or so. Add various interruptions, and that's a day's work before you know how your first draft compares to the reference version – on top of perhaps a day to prepare a suitable environment and do some prior monitoring. If you discover that a composite index should be rebuilt with the columns in a different order, the rebuild can take hours and you'd rather execute it during the night, hoping that it will not fail for some unexpected reason. Ditto with any physical reorganization such as partitioning. As you very justly said, performance considerations often come too late: in the best of cases a few weeks before the scheduled shipping to production, in the worst case when the application is already being used by thousands of users. Time is usually critical, and there may be a lot of pressure. Trial and error isn't an option, and you have to know what you are doing.
What is the primary reason to refactor databases: is it performance, maintenance headaches, or to take advantage of new features?
In practice, I have always been called in on performance issues, sometimes triggered by a software or hardware upgrade (a new DBMS version that doesn't behave exactly like the previous one, a new storage unit, or the introduction of synchronous disk mirroring to a distant disaster recovery site). That's when you discover a lot of interesting things. Usually you get an e-mail that says “help! this procedure runs in 3 minutes when it is written that way, and when we change this criterion it takes two hours”. Implicitly, people are asking you to help them make it run in 3 minutes in all cases. Very often, after digging into the code, you discover that in all cases it could run in 20 seconds. But 3 minutes, almost 10 times longer, was considered satisfactory. I quoted, in “The Art of SQL”, a case where the problem was that a query that took 4 minutes on the bigger production database was taking 11 minutes on the much smaller development environment, and the question that was posed was “how can we get the same execution plan?” (they were different). After rewriting, times were under 15 seconds in production, and something like 2 seconds in development – still with different execution plans. But nobody would have challenged the 4 minutes without the 11-minute problem. What worries me most is that very often it's a kind of threshold effect that triggers a refactoring effort, and many applications stay below the radar.
I have been trying to sell, to one of my customers, the idea of setting up a team of experts that would help developers refactor their code as a kind of permanent background task. It's a hard sell. People associate experts with crisis management.
When procedural programmers start writing SQL, they usually pick a row-by-row solution instead of a set-based solution; why are set-based solutions important?
Procedural programmers work row by row, and object-oriented programmers column by column ... What most developers who have lacked exposure to databases fail to perceive is that the apparent simplicity of the SQL language hides a full, complex subsystem. The execution of every SQL query means crossing layers and layers of software. If each round trip costs you 2 milliseconds (a random figure) and you do it one million times, you add more than 30 minutes to your process for nothing. It's as simple as that. Moreover, when your program says, in effect, to the server, “give me this little bit ... and this little bit ... and this little bit ...”, issuing simple query after simple query, all that the optimizer can do for you is try to get each little bit as fast as possible, for which options are usually limited. You cannot do much to improve 'select ... from ... where primary_key_value = ...'. But when you step back and notice that it is deeply nested in loops and that a single query could do the same work, it's a different matter. What you ask of the server should look like a kid's letter to Santa.
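That arithmetic can be made concrete. Below is a minimal sketch, assuming Python's sqlite3 and an invented orders/customers schema (in-process, so the per-call overhead is far smaller than a real client/server round trip, but the pattern is the same): a primary-key lookup buried in a loop versus a single join that lets the engine see the whole problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer {i}") for i in range(1000)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 1000, float(i)) for i in range(10000)])

# Row by row: one extra query (one round trip, in a client/server
# setting) per order just to fetch the customer name.
def totals_row_by_row():
    totals = {}
    for customer_id, amount in conn.execute(
            "SELECT customer_id, amount FROM orders"):
        (name,) = conn.execute(
            "SELECT name FROM customers WHERE id = ?",
            (customer_id,)).fetchone()
        totals[name] = totals.get(name, 0.0) + amount
    return totals

# Set-based: a single statement; the join and the aggregation
# both happen inside the engine.
def totals_set_based():
    return dict(conn.execute("""
        SELECT c.name, SUM(o.amount)
        FROM orders o JOIN customers c ON c.id = o.customer_id
        GROUP BY c.name"""))

assert totals_row_by_row() == totals_set_based()
```

The row-by-row version executes 10,001 statements to the set-based version's one; at 2 ms per round trip that difference alone would be about 20 seconds here, and half an hour at a million rows.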
Are user defined functions and views misused/overused by most developers?
User-defined look-up functions are a very common source of trouble; once again, people apply what they think are “good practices” but are bad SQL practices. You end up calling functions over and over with (very often) a small set of input parameters. In SQL, if you want a standard, sharable way to retrieve some particular data, you create a view, not a function. But views, in turn, are frequently misused, because people combine views that perform a lot of work that is useless in the context of their query. The SQL Server and Oracle optimizers are rather clever at pushing conditions down to the core of views (I have an example in the book), but you don't need enormous complexity to 'lose' an optimizer. Join two views that are the union of aggregates and contain subqueries, and you can be almost certain that the query will take much more time than it would using the base tables.
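Here is a hedged illustration of the look-up-function trap, using Python's sqlite3 and an invented currencies/invoices schema. The Python counter stands in for what would, in a real stored look-up function, be one extra query fired per call; the join version asks for everything in one statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE currencies (code TEXT PRIMARY KEY, rate REAL);
    CREATE TABLE invoices (id INTEGER PRIMARY KEY,
                           currency TEXT, amount REAL);
""")
conn.executemany("INSERT INTO currencies VALUES (?, ?)",
                 [("EUR", 1.0), ("USD", 0.5), ("GBP", 1.25)])
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                 [(i, ("EUR", "USD", "GBP")[i % 3], 100.0)
                  for i in range(9)])

# Stand-in for a stored look-up function: in a real application each
# call would typically run its own query against the reference table.
calls = {"n": 0}
rates = dict(conn.execute("SELECT code, rate FROM currencies"))
def lookup_rate(code):
    calls["n"] += 1          # count per-row invocations
    return rates[code]
conn.create_function("lookup_rate", 1, lookup_rate)

# Function version: one call (hence, in practice, one extra query) per row.
udf_total = conn.execute(
    "SELECT SUM(amount * lookup_rate(currency)) FROM invoices"
).fetchone()[0]

# Join version: one statement, no per-row call-outs,
# and the optimizer is free to pick the access path.
join_total = conn.execute("""
    SELECT SUM(i.amount * c.rate)
    FROM invoices i JOIN currencies c ON c.code = i.currency
""").fetchone()[0]

assert calls["n"] == 9       # the function fired once per invoice row
assert udf_total == join_total
```

With nine rows this is harmless; with millions of rows and a handful of distinct parameter values, the function version repeats the same tiny lookup endlessly while the join fetches each reference row once.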
In terms of performance and refactoring, what kind of change improves performance the most?
Not accessing the database when it's not needed, and limiting the number of round trips.
Can your book be used for all databases or is it geared toward a specific RDBMS like Oracle?
I have tried in this book, even more than in “The Art of SQL”, to keep a good balance between MySQL, Oracle and SQL Server; and what I say about them is applicable to Sybase, DB2 or Postgres as well. From time to time, I have used a single product because it allows several ways to implement a feature, which makes it easier to pinpoint differences between technical choices. But in most cases, when the sample code shown in the book is in T-SQL, you'll find the Oracle and MySQL equivalents as well in the downloadable code associated with the book (and vice versa).
Why do you write technical books?
Good question! I guess that the good, politically correct answer would be my love of mankind, but it's more complex than that. Certainly not for the money. There's the challenge of saying something new – or in a new way. There's the challenge of writing a book that is worth buying when there is so much free information on the web. There's the challenge of being able to “talk” to people of very different backgrounds. There's love of language. There's marketing, because when you are a consultant it doesn't hurt to have published a book (even if I don't make much of it: I have had my two main customers for more than 10 years and I'm not desperately trying to market myself). There's the fatigue of always encountering the same mistakes, and a desire to spread your ideas about what you think is right on a grand scale. There's the pleasure of starting with a blank screen and seeing your name on the cover of a real book several months later (which, after a while, makes you forget all the terrible moments in between). There's probably some thirst for recognition. There's a bit of masochism. There's the parable of the talents lingering in a corner of your head. There's chance: after I posted a rather challenging query on an Oracle mailing list, I was contacted by Jonathan Gennick, then with O'Reilly; Jonathan proposed that I participate in an SQL Cookbook (in the end, Anthony Molinaro wrote it – and did a good job). I acknowledge the value of cookbooks, but it's not what I wanted to write – I prefer giving the broad picture rather than the details. Whenever I have been given the details before the broad picture, I have been lost. I made a counter-proposal with “The Art of SQL”, which, after some wavering, was accepted. “Refactoring SQL Applications” was an idea that Jonathan passed on to me shortly before “The Art of SQL” was published. I liked the idea, but it took me some time to decide to take the plunge again.
How many different databases do you work with on a regular basis?
Most of my billing is related to Oracle work; I recently had to audit an SQL Server application, and I do more and more MySQL work for personal projects. I find it very interesting to compare the various products, because all have their strengths, and if Oracle has long had a technical lead, SQL Server has caught up fast (there are even areas, such as auto-parameterization, where I find what it does subtler than what Oracle does) and there isn't much difference today; their behavior is very similar. MySQL is still a little behind; it's mostly the monitoring facilities that feel a little rough compared to the other products. But that shouldn't last.
What database books are on your bookshelf?
What I have on my bookshelf isn't necessarily meaningful, because there are complimentary books I have received, books in which I have been involved as a technical reviewer (the last one was Clare Churcher's “Beginning Database Design”, which I have also prefaced) and all the translations of “The Art of SQL” I can get hold of. The last database book I bought was Pascal's “Practical Issues in Database Management”, and that was quite a while ago. The last IT book I bought before that must have been Brooks' classic “The Mythical Man-Month”, which had been on my 'to read' list for years.
Of all the computer books I have at home, the ones I open on the most regular basis must be the 1st and 3rd volumes of Knuth, which I bought in London 20 years ago. But I spend a lot of time in reference manuals, on the web usually.
Where can we see or hear Stéphane Faroult this year? Are there any conferences, seminars or webinars coming up where you will be present?
I am not much of a conference animal. I have talked at some conferences in the past. I almost spoke at a conference this year (I had submitted a paper that had been accepted), but the idea of a 9-hour jet lag for a 3-day conference (I live in France, the conference was in California) weakened my resolve. I was thinking of extending a 20-minute presentation on SQL best practices I had done for a customer (a kind of mission impossible), and instead I translated it as is and posted it on YouTube in 3 parts. After 2 months, it has been seen by several thousand people. The last time I talked at a conference, I had about 50 people in my room (that was before “The Art of SQL”; perhaps I'd have 100 today). I think I'll do more videos. Not as good as conferences for networking, but you reach people who aren't even given the opportunity to go to a conference. Several years ago, I wrote a column on a website entitled “Aunt Augusta”, supposedly written by a very Victorian agony aunt who was a mix of the Lady Bracknell of “The Importance of Being Earnest”, the Aunt Agatha of the P. G. Wodehouse novels, and the heroines of Barbara Pym. Aunt Augusta answered technical questions while gossiping about her friends, the local vicar and so on. It was very funny to write (an English friend who shared my weird sense of humor polished it) and quite a number of people sent questions to “Aunt Augusta”. I once got an email from a guy with an impossible name (for me) who was stuck with an Oracle installation on Windows. It started with “I am in Ulan-Bator ...”. I think of him from time to time. Given the choice, I prefer reaching Ulan-Bator, and La Paz, and Mysore, and Nairobi, and Chengdu, as well as London, San Diego or Amsterdam.
When will your Refactoring SQL Applications book be published?
It is announced for the second half of July. The production process should start in May.