Is it better to move data to procedures or move procedures to data?
The answer is, of course, “it depends.” Let’s consider a scenario where you have two SQL Server instances: ServerA and ServerB, and you have a procedure on ServerB (call it procB), but need to access data on ServerA.
Three database solutions are common:
Solution 1: You can first copy the data from ServerA to ServerB (using any number of technologies such as replication or a simple query via a linked server), and then have procB access the data locally on ServerB, however many times.
Solution 2: You can have procB access the data on ServerA from ServerB through a linked server using a mechanism such as a distributed join or an openquery.
Solution 3: You can create a procedure procA on ServerA, and have procB make a remote procedure call to procA to retrieve its result, and then further process the returned data in procB on ServerB, if necessary.
In a given application, any of these three solutions may be optimal, depending on the characteristics of that application. In particular, it depends on (1) the amount of data the procedure needs to access, (2) how current the data must be, and (3) how often the procedure needs to access the data.
For instance, if the amount of data on ServerA is large, it is okay for procB to read from a copy of the data that may be slightly or even significantly lagging behind its source, and (3) it needs to be accessed frequently in procB, then some version of Solution 1 can be a good choice.
The key consideration is that moving data, especially a large amount of data, across servers can be expensive, and moving a piece of code is cheap.
It follows that if a process involves moving a large amount of data across server boundary for processing on a different server, it may be cheaper to move the code, or a relevant part of it, to where the data resides, and only send back the result, if the result is significantly smaller than the amount of data needed for processing.
Note that what is considered ‘a large amount’ should not be measured in terms of how many gigabytes, or even how many megabytes. Rather, it should be measured by the relative cost of moving the data versus that of the total process. So for instance, your procedure may be moving only 128KB of data from server A to server B, which generally speaking should be inexpensive, and the whole procedure executes quickly. But if this procedure is called very frequently, moving that 128KB constantly effectively ends up moving ‘a large amount’ of data, and can become rather expensive.
This sounds trivial. But unfortunately, I have seen too many cases in the real world where little thought is given to these considerations when creating procedures that access data across multiple servers.
In some instances, it is a case of abstraction abuse. By that, I mean people sometimes are too easy getting carried away with using other stored procedures and views (i.e. abstraction units), not knowing that they are incurring expensive cross-server data operations, or not exploring alternative options to cut down moving data across servers.
In other cases, programming convenience ends up trumping application performance. For instance, I have seen a case that goes something like the following:
The procedure has a three table join that includes two tables from a linked server and a very small local table. The result is inserted into a temp table. The temp table is further processed in a series of update statements. Some of the updates use yet another very small local table. The final result set from these updates is small, and is used in the rest of the procedure.
It turns out that the two remote tables are relative large, and this procedure is called frequently. In addition, since the distribution join touches a significant proportion of each table of these two remote tables, basically all the rows are copied over locally to process the join.
One way to optimize this procedure is to move the code segment mentioned above to the remote server and wrap it in a procedure there. This basically moves the entire process of preparing the final result set into the procedure on the remote server, and the local procedure now just needs to make a call to this new procedure, gets the small result set and move on with the rest of its processing. This is a perfect example where you want to move your procedures to your data instead of moving your data to your procedures.
So the bad practice is to not think through the trade-offs between moving data to procedures and moving procedures to data when architecting cross server solutions, and end up moving an unnecessary amount of data.