Linked servers and distributed queries

Linked servers allows to issue SQL commands against OLE DB providers. Because Microsoft also makes available an “OLE DB Provider for ODBC Drivers”, it is also possible to issue queries against a variety of ODBC drivers. With linked servers and distributed queries, you can query all sorts of data sources and merge them on the fly with your SQL Server database. Example of data sources includes Analysis Services (SSAS), Access, Excel, Text files, Oracle, MySQL as well as SQL Server instances and many, many other sources.

This article will focus on distributed queries over SQL Server instances. Organizations frequently uses linked server in order to use data that is available on other servers or instances. This flexible strategy eliminates the need of synchronizing data over several servers (for example with replication). It is not a bad choice but one must understand the pitfalls in order to make enlightened decisions.

Creating a linked server
The easiest way to start is to create a linked server that references your own server:

-- 'localhost' is the standard "loopback" hostname
-- It points to the local machine
sp_addlinkedserver 'localhost'

Once the “localhost” linked server is created, you can reference objects using a four-part name in the form linkedserver.catalog.schema.object, for example:

SELECT * FROM localhost.MY_DATABASE.dbo.MY_TABLE;

 Network Bandwidth and latency
The first obvious drawback of using a linked server is the network speed cost. For this single reason, linked server should not be used when we seek optimal performance (unless you need to scale your database on multiple servers but then again it’s not necessarily a good approach). There is a lot of overhead involved with SQL Server having to query the object metadata, the statistics (if possible) and send the query and results over the network. Note that we are not talking about end results here but intermediate query results so it doesn’t matter if your query return only one row. All this overhead makes remote query a lot more expensive than local query and joins between tables won’t be optimal. In general you want to use linked server when coupling is low, that is when you do not need to join intermediate results with the local database objects.

Transactions
There is also significantly more overhead involved in distributed transactions. All servers involved in a transaction must have MSDTC service (Distributed Transaction Coordinator) – which must be properly installed and configured. Avoid distributed transactions unless absolutely necessary.

Distribution statistics
The query processor uses statistics in order to produce the best possible query plan and SQL Server is able to use linked server statistics to optimize the query execution plan. However, the user running the query must have appropriate permissions on the remote server in order for the engine to use them. Awkwardly, for SQL Server, it turns out that the user running the query  must have the permission to run DBCC SHOW_STATISTICS. MSDN documentation states that :  “(…) to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.” (link: http://msdn.microsoft.com/en-us/library/ms189811.aspx). This is much, much more permissions that is needed to read a table. Let’s hope Microsoft will fix this flaw in the near future. You can vote for the Microsoft Connect suggestion here.

Collations
Collations are used by SQL Server to compare and order strings. When working with remote SQL Server instances, the engine will correctly compare and order strings based on the remote column collation.  Therefore, if remote and local columns have  different collations it will result in collation conflicts. When defining a linked server, you have the option of using remote or local collation (“Use Remote Collation” in Server Options). If that option is set to true, SQL Server will try to push the ORDER BY and the WHERE clauses to the remote server. If Use Remote Collation is set to false, SQL Server will use the default collation of the local server instance. If the default collation of the local server instance do not match with the remote server column collation, this will result in poor performance. The local server will have to filter and order the data, thus having to transfer each row beforehand. It is obviously much faster to filter and order the data on the remote server. Then again, deciding to use the remote collation could lead to incorrect results.

Moreover, it is not possible to join on columns that have a different collation. The workaround is to explicitly cast the collation when querying the remote server with the COLLATE clause. But this is an expensive operation if you must scan millions of rows, especially if you need to access the column frequently. In that case, you should manually transfer the data to a local table with the proper collation. This problem can also arise on the same local database since collations are defined at the column level.


Comments

3 responses to “Linked servers and distributed queries”

Leave your response
  1. blogr7791 says:

    Thanks. It’s interesting

  2. chagbert says:

    Gives insight to Linked Servers in SQl however very basic info is provided. Would have loved to see how to test the linked ‘localhost’ for connectivity using the cmds i.e. I find that many users will probably get stuck on connectivity thus unable to execute rest of cmds. Thanks anyway -: chagbert.

  3. Neeraj Mital says:

    Great insights !!