
Merrill Aldrich

Trick Question – Part Två

This post is part two of an effort to examine, in brief, the decisions entailed in designing the interface between database and application. We’re in a strange place at this moment, where both the possibilities and the liabilities of techniques like ORM have come to a head, and it seems vital to go into data tier design at least equipped with all the information to make sensible choices around security, extensibility, and so on.

The landscape going into the design of a business application practically always looks like this: there’s the need to create an object-oriented application, where the structure of in-memory objects serves the purposes of the app, and where those objects have a relatively short lifespan in memory. Then there’s the need to store the data for the system over the longer term, in a durable, consistent, flexible and safe storage medium.

In practically all existing systems this results in two “models:” a model defining the application’s in-memory object graph, and a second, similar but not identical relational table model for the database. The tension between these two models has been the subject of a lot of study and debate, but if you’ve worked on a system of any complexity, you have no doubt faced the fallout from the friction between these two fundamentally different ways of thinking.

So, if you are lucky (crazy?) enough to embark on a new, from-scratch design, what are the first data-related design decisions? Here’s a swing at that, from my admittedly limited experience.

Though perhaps this is counterintuitive, I think the first impactful decision is probably whether the application will have direct write access to the tables in the database, along with the implications this has for security, for the modularity of the whole system, and for what the data access layer will look like. (There are those who say this should never be done, and those who say it should always be done, but it’s been my experience that most real systems just “fall” into one technique or the other without forethought, so it would be useful to have some framework for choosing based on facts, real risks, and real benefits.)

It’s helpful to frame this with an analogy: a fundamental tenet of OO design is data hiding, where variables of simple types (integers, strings) are made private, and “wrapped” by accessor methods that allow the developer to establish rules about what values are valid or not valid, and when and how the data is allowed to change “inside” the object. Any change to the private data is made through the public methods. Having private data and a public set of accessor methods allows the enforcement of more specific data types -- a phone number rather than a string, for example, or dates constrained by rules, numbers in ranges, uniqueness of values, etc. This is in contrast to a system (or “abomination” in many people’s view :-) where all the simple types are public, and can be accessed and modified from any other code module in an uncontrolled fashion. Hard lessons from older procedural systems have generally led to a consensus that data hiding is “good” and helps improve the quality of software. It’s also generally agreed that the modularity resulting from this idea helps systems remain more flexible, and enables developers to fix problems with less chance that the fixes will have unintended side-effects.

When connecting an application to a database, in the typical fashion where the app is the primary client for the data, this becomes a fundamental question: should all the data in the data tables be “public” from the point of view of the app? The data belongs to the application, after all, and maybe, since the database only “persists” private variables that are already encapsulated within the OO model, that’s enough encapsulation and more is just wasted effort. Or should the raw data in tables be treated in a fashion similar to private variables, where there’s a layer in between? The analogy may be imperfect, but the underlying idea is important.

One side note: don’t entertain the illusion that the one, golden application will be the only client that will have access to the database, and can act as sole owner of the data. No real systems work that way; people use SQL Server as an integration platform, and they will be in the data or will expect other tools (BI, for example) to be able to use it. It’s only sensible defensive programming to design the database assuming people will use it.

It’s not too difficult to examine the implications of both designs, paired up point by point below:

Path A: the data is encapsulated (private).
Path B: the data is not encapsulated (public).

Implications of each path:

Path A: There will be at least two security principals, and it’s best to set them up right from the start: one will be the owner of the database, and a second, restricted account will provide the application with access (if using the trusted application model), or a collection of restricted end-user accounts will provide application access (if using impersonation).

Path B: There might be just one security principal: the account that owns the database also provides unrestricted access from the application. This is faster and more convenient for development, but less secure.

Path A: A public layer, distinct from the private tables, will provide access to the data in the database. The public layer will consist of views, functions and stored procedures that present the data to outside consumers, and allow encapsulation at the database level (a minimal sketch of what this can look like follows the list below).

  1. There’s some opportunity to revise the database, whether for performance or to fix design issues, entirely independently of the application’s code. All that is required is to keep the interface consistent.
  2. Rules and logic for data manipulation can be enforced for multiple clients. There will be multiple clients – it’s a certainty.
  3. Data types can be managed via check constraints, but also in the stored procedures that are used to perform data manipulation.
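
Just to make that concrete, here’s a minimal sketch of such a public layer. The schema, table, column and procedure names are only placeholders, not a prescription:

    -- Private storage: the application is never granted rights on this table directly.
    CREATE TABLE dbo.Customer
    (
        CustomerID  int IDENTITY(1,1) PRIMARY KEY,
        Name        nvarchar(100) NOT NULL,
        Phone       varchar(20)   NOT NULL,
        CONSTRAINT CK_Customer_Phone CHECK (Phone LIKE '[0-9]%')  -- crude example of a rule
    );
    GO

    -- Public layer: a view and a procedure in a separate schema form the interface.
    CREATE SCHEMA api;
    GO
    CREATE VIEW api.Customer
    AS
        SELECT CustomerID, Name, Phone
        FROM dbo.Customer;
    GO
    CREATE PROCEDURE api.AddCustomer
        @Name  nvarchar(100),
        @Phone varchar(20)
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Rules that go beyond simple check constraints can live here.
        IF LEN(@Phone) < 7
        BEGIN
            RAISERROR('Phone number is too short.', 16, 1);
            RETURN;
        END;
        INSERT dbo.Customer (Name, Phone) VALUES (@Name, @Phone);
        SELECT SCOPE_IDENTITY() AS CustomerID;
    END;
    GO

The tables can then be refactored behind api.Customer and api.AddCustomer without the application ever noticing, as long as the view’s columns and the procedure’s parameters stay the same.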

Path B: The app can access tables directly, analogous to “public” data in the OO world. This is again faster and more convenient for development in the short term, but it implies that:

  1. The database cannot be treated as a separate module and tuned or fixed independently of the application’s data access layer, beyond basics like indexes.
  2. Little in the way of rules around data modification can be enforced at the database level outside constraints and primitive data types. Note here that the app will not be the sole consumer of the data in the database – that’s wishful thinking in any real-world system.
  3. Data type enforcement in the database is restricted, for practical purposes, to what can be managed with check constraints; that is, types are simple/primitive.

Path A: You might, now that schemas are more flexible in SQL Server 2005 and later, consider explicitly reserving the dbo schema for “private” data and restricting permissions at the schema level. One or more additional schemas can be added for the “public” views, functions and stored procs, to simplify management of permissions and make the distinction readily visible.

Path B: Everything can be in dbo, unless there’s some other purpose you’d like to apply schemas to.

Path A: It’s possible to help prevent SQL injection vulnerabilities at the database level, because the application will run with a restricted set of permissions that prohibits the execution of ad-hoc data manipulation (that is, the app might have read-only rights to tables, or no rights to tables, but in either case can perform no modification of data or schema other than executing the provided stored procs). The app obviously should also be written safely, but this enhances security.

Path B: The database server cannot provide any protection from SQL injection vulnerabilities in the app. The application must have data modification rights, so any injection vulnerability in the app passes straight through to the database. 100% of the protection from SQL injection must exist in the app code. It is possible to do this correctly, but you’re working the injection trapeze without a net.
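
Here’s a rough sketch of the permission setup behind each path, again with placeholder names for the login, user and schema:

    -- Path A: the application principal can only use the public layer.
    CREATE LOGIN app_login WITH PASSWORD = 'placeholder';
    CREATE USER app_user FOR LOGIN app_login;
    GRANT SELECT  ON SCHEMA::api TO app_user;   -- views and functions
    GRANT EXECUTE ON SCHEMA::api TO app_user;   -- stored procedures
    -- No grants at all on the dbo tables, so ad-hoc INSERT/UPDATE/DELETE simply fails.

    -- Path B: the application principal reads and writes the tables directly.
    EXEC sp_addrolemember 'db_datareader', 'app_user';
    EXEC sp_addrolemember 'db_datawriter', 'app_user';

With the Path A grants in place, even a successful injection attack is limited to what the views and procedures expose; it cannot modify data or schema outside them.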

Path A: Data manipulation must be written into stored procedures, and data access into views or functions. Some ORM-flavored frameworks might provide a bit of automation, but if the logic of these procedures is more complex than CRUD, then it’s a development and coordination effort, and you want someone with reasonable T-SQL skill to be the developer. This is the main disadvantage of this method.

Path B: Data manipulation can, to a certain extent, be auto-generated by a tool (an ORM such as Entity Framework, NHibernate, etc.) that maps objects to rows and collections to tables. That is, a table maps to an in-memory collection, each atomic object in the collection maps to a row, and the T-SQL for select, insert, update and delete can mostly be code-generated. This is the perceived advantage of ORM.

Path A: In theory, multiple apps could use the same interface at the T-SQL level to modify the data – stored procedures can be application-agnostic – though in practice this is rare.

Path B: If multiple apps were to share the data-access technique, they would have to do that through application-specific libraries or services.

Path A: The data-access layer, in effect, has two components: one application code component and one T-SQL code component, linked by an interface contract. The application’s data access layer has some independence from the details of the table schema, within practical limits. Note, however, that a table can’t really, in practice, be all that different from what the app is trying to store and still function sensibly.

Path B: The data access layer seems more self-contained, but it composes SQL on the fly that must match the database schema exactly, so table schema and app code have to be precisely coordinated. This is simpler at first, but you can get backed into a corner with regard to performance and scalability, or versioning and upgrades, where every simple database change demands a matching application code change.

Path A: Procedure caching / query plan re-use works equally well in either case if implemented correctly (here via views, functions and stored procs).

Path B: Procedure caching / query plan re-use works equally well in either case if implemented correctly (here via parameterized statements).
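
On the Path B side, plan re-use depends on the application (or its ORM) sending properly parameterized statements rather than SQL text with literal values concatenated in. A minimal illustration, using sp_executesql and the same hypothetical table as above:

    -- Concatenating literals typically yields a different cached plan per value,
    -- and invites injection if the value comes from user input:
    --   SELECT CustomerID, Name FROM dbo.Customer WHERE Name = N'Smith';

    -- A parameterized statement lets the same plan be re-used across values:
    EXEC sp_executesql
        N'SELECT CustomerID, Name, Phone FROM dbo.Customer WHERE Name = @Name;',
        N'@Name nvarchar(100)',
        @Name = N'Smith';

Well-behaved data access code and ORMs emit this kind of parameterized call, which is why the caching story can end up comparable on both paths.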

I have seen applications built both ways successfully, and both ways badly - but the main point is to go in with your eyes open and a full understanding of the tradeoffs. Agree? Did I miss something? Disagree? I want to hear from you!


 [Footnote - let's discuss in person at PASS next week! I'm psyched!]

Published Saturday, October 24, 2009 7:58 PM by merrillaldrich

Comments

 


Brian Feifarek said:

Hi Merrill, I am enjoying your trick question, Parts 1 and 2, and looking forward to future installments.  

I especially like your list of implications of either path.  In my systems to date I have taken the B public path (for ease of development-- clients don't have really deep pockets), but am increasingly interested in the more robust A private path.  

I think the main challenge for me in taking the (A) "private" path would be how best to present the data for BI and other 'consumers' of the data, and how to get the business owners to feel comfortable navigating the data model.  And in general my biggest pain point is trying to integrate/synchronize different software packages so that the different databases don't overlap or disagree with each other.  Other packages are 'predetermined', so in a practical sense that means my app and database have to jump through lots of hoops.

November 2, 2009 5:27 PM
 

merrillaldrich said:

Hi Brian. Thanks for your question - it's a common issue. I think if you have a small enough system to allow ad-hoc BI-type queries without the risk of disrupting the other functions of your app, then I would handle this by making a set of views to expose to BI users. The views can be "public" (in the sense of OO public, with security) and subject to ad-hoc queries. They can have some independence from the underlying tables, too, so, if you elect to, you can use them to present the data in a clearer way, or to provide an unchanging interface where the underlying tables might be complicated or might get refactored.

One simple method is just to make a one-for-one view for each table you want to expose, in a different schema, like "reporting.customers," "reporting.orders," etc. If tables change in future, you can simply use the view definitions to maintain consistency of each view's output, and the end users would be unaffected.
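
A minimal sketch of that pattern (the column names and the bi_readers role here are only examples):

    -- One-for-one reporting views in their own schema, so BI users never touch the tables.
    CREATE SCHEMA reporting;
    GO
    CREATE VIEW reporting.customers AS
        SELECT customer_id, name, phone
        FROM dbo.customers;
    GO
    CREATE VIEW reporting.orders AS
        SELECT order_id, customer_id, order_date, total_amount
        FROM dbo.orders;
    GO
    -- BI accounts get read access to the reporting schema only.
    GRANT SELECT ON SCHEMA::reporting TO bi_readers;

If dbo.customers is refactored later, only the view definition changes; the report writers keep querying reporting.customers exactly as before.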

I have seen several third party/ISV systems where they deliberately provide views like this and encourage users to rely on them instead of the tables. Some go so far as to say they will actively support the views, and try to keep their output consistent from version to version, while they will not support access directly to tables.

Obviously if you have a busy transactional system with a lot of users, you might have to copy the data out via some ETL process instead, to prevent blocking. Even in that case, views can provide a stable source for the ETL.

November 9, 2009 11:34 PM
 

