Linchi Shea

Checking out SQL Server via empirical data points

Linked servers: permissions and distributed query performance

In an earlier post, I highlighted that linked server security configuration can have a huge performance impact, complained that this was not properly documented in SQL Server Books Online, and filed a Connect item for it. The good news is that in Books Online for SQL Server 2008 R2, Microsoft has revised the documentation and included the following guideline for using distributed queries:

 

To create the best query plans when you are using a table on a linked server, the query processor must have data distribution statistics from the linked server. Users that have limited permissions on any columns of the table might not have sufficient permissions to obtain all the useful statistics, and might receive a less efficient query plan and experience poor performance. If the linked server is an instance of SQL Server, to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.

This is a major improvement. To drive it home, however, it would be nice to see the impact in action. In my presentation at the recent SQL Saturday #59 in NYC, I demonstrated with a simple example that the performance of a distributed query can be significantly impacted by the user's permissions. Unfortunately, due to time constraints, I could not go into great detail during the demo. Let me describe the example in this post.

To make the example work, some assumptions are in order:

·      You have two SQL Server 2005 (or 2008) instances: SQL01 and SQL02. There is a user database called TestDB on SQL02.
·      You have a SQL login called bob on SQL02, which is granted access to TestDB as a user in the public role only.
·      A linked server called SQLBOB is created on SQL01 as follows:

 

EXEC master.dbo.sp_addlinkedserver @server=N'SQLBOB',
     @srvproduct=N'',
     @provider=N'SQLNCLI',
     @datasrc=N'SQL02',
     @catalog=N'TestDB'
GO
EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'SQLBOB', -- the linked server name, not the data source
     @useself=N'False',
     @locallogin=NULL,
     @rmtuser=N'bob',
     @rmtpassword='########' -- replace with real password for bob
GO
EXEC master.dbo.sp_serveroption @server=N'SQLBOB',
     @optname=N'data access', @optvalue=N'true'
GO
EXEC master.dbo.sp_serveroption @server=N'SQLBOB',
     @optname=N'rpc', @optvalue=N'true'
GO
EXEC master.dbo.sp_serveroption @server=N'SQLBOB',
     @optname=N'rpc out', @optvalue=N'true'
GO

·      A second linked server SQLSA is created on SQL01 as follows:

EXEC master.dbo.sp_addlinkedserver @server=N'SQLSA',
     @srvproduct=N'',
     @provider=N'SQLNCLI',
     @datasrc=N'SQL02',
     @catalog=N'TestDB'
GO
EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'SQLSA',
     @useself=N'False',
     @locallogin=NULL,
     @rmtuser=N'sa',
     @rmtpassword='########' -- replace with real sa password
GO
EXEC master.dbo.sp_serveroption @server=N'SQLSA',
     @optname=N'data access', @optvalue=N'true'
GO
EXEC master.dbo.sp_serveroption @server=N'SQLSA',
     @optname=N'rpc', @optvalue=N'true'
GO
EXEC master.dbo.sp_serveroption @server=N'SQLSA',
     @optname=N'rpc out', @optvalue=N'true'
GO

 

The only difference between SQLBOB and SQLSA is that the former is configured to log into SQL02 with bob, whereas the latter is configured to log into SQL02 with sa.
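Before going further, a quick sanity check may be worthwhile. Since the rpc out option is enabled on both linked servers in the setup above, each of the following should return SQL02's server name (this is just a sketch; any statement bob is allowed to run would do):

-- Quick sanity check: each should come back with SQL02's @@SERVERNAME
EXEC ('select @@servername') AT SQLBOB
EXEC ('select @@servername') AT SQLSA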

Now, let’s create a table in TestDB on SQL02 and populate it with 100,000 rows using the following script:

use TestDB
go
drop table item
go
create table item(i int, j int, c char(200))
go
;with tmp(i, j, c) as (
   select 1, 1, replicate('a', 200)
   union all
   select i + 1, j + 1, replicate('a', 200)
     from tmp
    where i < 100000
)
insert item
select * from tmp
option (maxrecursion 0)
go
create clustered index ix_i on item(i)
go
create index ix_j on item(j)
go
grant select on item to bob
go

Then on SQL01, let’s create a temporary table and populate it with 20 rows.

drop table #tmp
go
create table #tmp(i int, c varchar(10))
go
;with tmp(i, c) as (
   select 1, cast('abc' as varchar(10))
   union all
   select i + 1, cast(i as varchar(10))
     from tmp
    where i < 20
)
insert #tmp
select * from tmp
option (maxrecursion 0)
go
create clustered index ix_c on #tmp(c)
go

 

Now, we are ready for our demo. On SQL01, run the following queries:

 

-- Query 1
select t1.*, t2.*
  from #tmp t1, SQLBOB.TestDB.dbo.item t2
 where t1.i = t2.j

-- Query 2
select t1.*, t2.*
  from #tmp t1, SQLSA.TestDB.dbo.item t2
 where t1.i = t2.j

Note that the only difference between Query 1 and Query 2 is that the former accesses the item table on SQL02 via linked server SQLBOB, whereas the latter does so via linked server SQLSA.

You should find that both queries come back relatively quickly, with no big difference in duration. However, if you capture their query plans, you'll see that the remote query for Query 1 is something similar to the following:

SELECT "Tbl1004"."i" "Col1010","Tbl1004"."j" "Col1011","Tbl1004"."c" "Col1012" FROM "TestDB"."dbo"."item" "Tbl1004" ORDER BY "Col1011" ASC

 

In other words, the optimizer on SQL01 has decided to bring all the rows of the table item from SQL02 to SQL01 in a single scan, and then process the query locally.

 

And if you capture the query plan for Query 2 or use SQL Profiler on SQL02, you'll see that the remote query looks like the following:

 

SELECT "Tbl1004"."i" "Col1010","Tbl1004"."j" "Col1011","Tbl1004"."c" "Col1012" FROM "TestDB"."dbo"."item" "Tbl1004" WHERE "Tbl1004"."j"=?

 

This time, SQL01 is using a different strategy: it fetches only those rows from SQL02 where column item.j matches the column i values from the local temporary table. If you look at the incoming traffic on SQL02 with SQL Profiler, you'll see the following:

 

declare @p1 int
set @p1=1
exec sp_prepexec @p1 output,N'@P1 int',N'SELECT "Tbl1004"."i" "Col1006","Tbl1004"."j" "Col1007","Tbl1004"."c" "Col1008" FROM "master"."dbo"."item" "Tbl1004" WHERE "Tbl1004"."j"=@P1',2
select @p1

 

And this is followed by 20 executions of sp_execute. Essentially, SQL01 has decided to parameterize the calls to SQL02 through the linked server SQLSA by preparing a parameterized statement on SQL02 (via sp_prepexec) and then executing it 20 times, once for each value of column i in #tmp.
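Each of those subsequent round trips has the shape below; the statement handle 1 is the value sp_prepexec returned in the trace above, and the parameter values shown are just illustrative:

-- one round trip per value of #tmp.i, reusing the prepared handle
exec sp_execute 1,1
exec sp_execute 1,2
-- ... and so on, through the 20 values in #tmp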

 

So what is the implication? Well, you'll see a huge performance difference if the item table on SQL02 is populated not with 100,000 rows but with 10,000,000 rows. In that case, 20 calls with an index seek will beat a scan of a large table any time. The larger the table on SQL02, the more significant the performance difference between Query 1 and Query 2.

 

Why is there such a performance difference between Query 1 and Query 2?

 

The root cause is explained by the quotation from SQL Server 2008 R2 Books Online (reproduced at the beginning of this post). With SQLBOB, the user accessing SQL02 is bob, who has only the SELECT permission on the item table and does not have permission to see the distribution statistics. With SQLSA, however, the user is sa, who has all the permissions to access everything on SQL02, including the distribution statistics for the item table.
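You can observe the permission difference directly on SQL02. On SQL Server 2005/2008, reading distribution statistics requires the level of permission the Books Online quote describes, so a sketch like the following (assuming the bob login from the setup) should fail under bob's context but succeed under a sysadmin:

use TestDB
go
-- Under bob's context this should fail with a permission error:
-- bob has only SELECT on the table, not rights to its statistics.
EXECUTE AS LOGIN = 'bob'
DBCC SHOW_STATISTICS ('dbo.item', ix_j)
REVERT
go
-- Run as a sysadmin (e.g. sa), the same command returns the histogram.
DBCC SHOW_STATISTICS ('dbo.item', ix_j)
go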

 

Note that the worst-case scenario for the Query 2 query plan would be scanning the item table 20 times if column item.j turns out not to be selective, whereas with Query 1, a single scan of the item table is the worst case. In a way, it makes sense to choose the query plan seen with Query 1 when the optimizer on SQL01 has no knowledge of the distribution statistics of the item table on SQL02. In terms of the worst case, it is actually the safer plan.

 

If you run this example in your environment, you may not see the exact behavior I have described. But if you play with different numbers of rows or different column widths in the test tables, you should eventually see the above two different plans.

Published Tuesday, November 23, 2010 12:58 AM by Linchi Shea


Comments

 

DM UNseen said:

What about connecting to a data source that is not SQL Server (e.g. Oracle)? Is there any effect there as well for not having sa rights on the Oracle instance?

November 23, 2010 1:43 PM
 

Linchi Shea said:

In case it's not clear, I was using sa for simplicity in the demo. As the SQL 2008 R2 BOL states, to be able to access the distribution stats for the table, the user needs to be the table owner or a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server. Now, this applies to the SQL Server data source. I do not know to what extent the SQL Server optimizer will go to retrieve and make use of the distribution stats of a non-SQL Server data source.

November 24, 2010 1:09 AM
 

Christina P said:

I am using linked servers for the data warehouse (SQL 2008), so I want the nightly ETL to be efficient, but I currently am using a SQL user set up on each of the linked servers that has read only permissions to source systems (SQL server databases in 2005 and 2008).  I don't want the data warehouse user to modify the production databases (which would only happen accidentally).  All of the database roles listed would grant the data warehouse SQL user way more power to modify the source system than I want.  So is there no way around it so that the data warehouse account is able to access the distribution stats for efficient loading of the data without the permission to modify/delete objects?

November 29, 2010 5:50 PM
 

Chuck Lathrope said:

We use stored procedures on the remote boxes for this very reason. May not be ideal on all situations, but works well.

November 29, 2010 7:37 PM
 

Christina P said:

Good idea, Chuck.

November 30, 2010 10:51 AM
 

Linchi Shea said:

Christina P;

Just want to note that the behavior is specific to distributed queries. If you can pass your queries through to the linked server and have them processed there, the problem does not apply. In an ETL situation where we are typically talking about a large amount of data, you probably should avoid doing much distributed queries, though I don't know your specific requirements and they may dictate running distributed queries.
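To illustrate the pass-through idea with the linked servers from this post (a sketch; OPENQUERY works here because the 'data access' server option is enabled):

-- The whole query text is shipped to SQL02, which compiles and runs it
-- with its own complete statistics; only the result set comes back.
select *
  from openquery(SQLBOB, 'select i, j, c from TestDB.dbo.item where j = 5')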

December 1, 2010 10:42 AM
 

David Walker said:

Wouldn't it be great if there was a ROLE that allowed JUST the reading of table statistics?

We have some linked servers that are defined using a read-only userid with read-only permissions. We do, in fact, often get terrible performance when using these linked servers. But we don't want to add ddladmin permissions to the "read-only" userid that we use for linking.

We can't really avoid doing distributed queries; we have different data in different databases (that are sometimes on different servers). We don't want to replicate every database to every server!

Adding a role that provided table statistics is the right way to solve this problem.  I will file a Connect suggestion.

December 1, 2010 11:12 AM
 

David Walker said:

I created a suggestion on Connect, which was assigned the interesting Connect item number 626262.  

Feel free to vote for it.  The proposed solution given in the BOL clarification is terrible, from a security standpoint.

https://connect.microsoft.com/SQLServer/feedback/details/626262/add-a-role-that-allows-statistics-to-be-read-performance

December 1, 2010 11:29 AM

