statistics – SQL Bad Practices

Linked servers and distributed queries

Francois — Sun, 23 Oct 2011 22:37:38 +0000

Linked servers let you issue SQL commands against OLE DB providers. Because Microsoft also provides an “OLE DB Provider for ODBC Drivers”, you can also query a wide variety of ODBC data sources. With linked servers and distributed queries, you can query all sorts of data sources and merge them on the fly with your SQL Server database. Examples of data sources include Analysis Services (SSAS), Access, Excel, text files, Oracle and MySQL, as well as other SQL Server instances and many, many other sources.

This article focuses on distributed queries over SQL Server instances. Organizations frequently use linked servers to access data that lives on other servers or instances. This flexible strategy eliminates the need to synchronize data across several servers (for example with replication). It is not a bad choice, but one must understand the pitfalls in order to make enlightened decisions.

Creating a linked server
The easiest way to start is to create a linked server that references your own server:

-- 'localhost' is the standard "loopback" hostname
-- It points to the local machine
EXEC sp_addlinkedserver @server = N'localhost';

Once the “localhost” linked server is created, you can reference objects using a four-part name in the form linkedserver.catalog.schema.object, for example:

SELECT * FROM localhost.MY_DATABASE.dbo.MY_TABLE;

Network Bandwidth and latency
The first obvious drawback of using a linked server is the cost of the network round trips. For this reason alone, linked servers should not be used when you seek optimal performance (unless you need to scale your database across multiple servers, and even then it is not necessarily a good approach). There is a lot of overhead involved: SQL Server has to query the remote object metadata and the statistics (if possible), then send the query and the results over the network. Note that we are not talking about end results here but about intermediate query results, so it doesn’t matter if your query returns only one row. All this overhead makes a remote query a lot more expensive than a local query, and joins between local and remote tables won’t be optimal. In general, you want to use a linked server when coupling is low, that is, when you do not need to join intermediate results with local database objects.
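One way to keep intermediate results off the network is to send the whole filtered query to the remote server with OPENQUERY, so that only the final rows come back. The following is a minimal sketch, not a benchmark: it reuses the localhost linked server and the MY_DATABASE.dbo.MY_TABLE placeholder from the previous section, and the ID and AMOUNT columns are hypothetical.

```sql
-- Four-part name: the local optimizer decides how much of the work is sent
-- to the remote server (ID and AMOUNT are hypothetical columns).
SELECT ID, AMOUNT
FROM localhost.MY_DATABASE.dbo.MY_TABLE
WHERE ID = 42;

-- OPENQUERY: the quoted text is executed as a pass-through query on the
-- linked server, and only its result set is returned to the local instance.
SELECT ID, AMOUNT
FROM OPENQUERY(localhost,
    'SELECT ID, AMOUNT FROM MY_DATABASE.dbo.MY_TABLE WHERE ID = 42');
```

Keep in mind that OPENQUERY does not accept variables for its arguments, so the remote query text has to be a literal (or the whole statement has to be built dynamically).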

Transactions
There is also significantly more overhead involved in distributed transactions. Every server involved in the transaction must run the MSDTC service (Microsoft Distributed Transaction Coordinator), which must be properly installed and configured. Avoid distributed transactions unless absolutely necessary.
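For reference, a distributed transaction looks like an ordinary transaction that happens to touch a linked server. This is only a sketch under assumptions: REMOTE_SRV is a hypothetical linked server pointing at another instance (a loopback linked server such as localhost cannot take part in a distributed transaction), and the tables and columns are placeholders.

```sql
-- XACT_ABORT must be ON for data modifications made through most OLE DB
-- providers, including SQL Server, inside a transaction.
SET XACT_ABORT ON;

BEGIN DISTRIBUTED TRANSACTION;

    -- Local change (hypothetical table and columns)
    UPDATE dbo.MY_TABLE
    SET AMOUNT = AMOUNT - 10
    WHERE ID = 1;

    -- Remote change through the hypothetical linked server REMOTE_SRV;
    -- MSDTC coordinates the commit across both instances.
    UPDATE REMOTE_SRV.MY_DATABASE.dbo.MY_TABLE
    SET AMOUNT = AMOUNT + 10
    WHERE ID = 2;

COMMIT TRANSACTION;
```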

Distribution statistics
The query processor uses statistics to produce the best possible query plan, and SQL Server is able to use linked server statistics to optimize the query execution plan. However, the user running the query must have appropriate permissions on the remote server for the engine to use them. Awkwardly, for SQL Server, it turns out that the user running the query must have permission to run DBCC SHOW_STATISTICS. The MSDN documentation states: “(…) to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.” (link: http://msdn.microsoft.com/en-us/library/ms189811.aspx). These are far more permissions than are needed to read a table. Let’s hope Microsoft will fix this flaw in the near future. You can vote for the Microsoft Connect suggestion here.
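A quick way to see whether a given login can read the statistics is to run the command yourself on the remote server, connected as the login that the linked server maps to. This is a sketch only: MY_DATABASE and dbo.MY_TABLE are the placeholders used earlier, and IX_MY_INDEX is a hypothetical index (or statistics) name.

```sql
-- Run on the remote server as the login used by the linked server.
USE MY_DATABASE;

-- IX_MY_INDEX is a hypothetical index or statistics name on dbo.MY_TABLE.
-- If this fails with a permission error, the remote statistics are not
-- available to queries issued through the linked server with that login.
DBCC SHOW_STATISTICS ('dbo.MY_TABLE', IX_MY_INDEX);
```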

Collations
Collations are used by SQL Server to compare and order strings. When working with remote SQL Server instances, the engine will correctly compare and order strings based on the remote column collation. Therefore, if remote and local columns have different collations, collation conflicts will result. When defining a linked server, you have the option of using the remote or the local collation (“Use Remote Collation” in Server Options). If that option is set to true, SQL Server will try to push the ORDER BY and WHERE clauses to the remote server. If Use Remote Collation is set to false, SQL Server will use the default collation of the local server instance. If the default collation of the local server instance does not match the remote column collation, performance will suffer: the local server has to filter and order the data itself, which means every row must be transferred first. It is obviously much faster to filter and order the data on the remote server. Then again, deciding to use the remote collation could lead to incorrect results.
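The same option can be set in T-SQL with sp_serveroption. The sketch below assumes the localhost linked server created earlier in this article.

```sql
-- Tell SQL Server to trust the collation of the remote columns,
-- which allows string filters and sorts to be evaluated remotely.
EXEC sp_serveroption
    @server   = N'localhost',
    @optname  = N'use remote collation',
    @optvalue = N'true';

-- Check the current collation-related settings of the linked server.
SELECT name, uses_remote_collation, collation_name, is_collation_compatible
FROM sys.servers
WHERE name = N'localhost';
```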

Moreover, it is not possible to join on columns that have a different collation. The workaround is to explicitly cast the collation when querying the remote server with the COLLATE clause. But this is an expensive operation if you must scan millions of rows, especially if you need to access the column frequently. In that case, you should manually transfer the data to a local table with the proper collation. This problem can also arise on the same local database since collations are defined at the column level.
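Both approaches can be sketched as follows. The table and column names (LOCAL_CUSTOMERS, REMOTE_CUSTOMERS, CUSTOMER_NAME) are hypothetical, and the sketch assumes the local column uses the database default collation.

```sql
-- Inline workaround: cast the remote column to the local collation.
-- The conversion is applied to every remote row that is compared.
SELECT l.CUSTOMER_NAME
FROM dbo.LOCAL_CUSTOMERS AS l
JOIN localhost.MY_DATABASE.dbo.REMOTE_CUSTOMERS AS r
    ON l.CUSTOMER_NAME = r.CUSTOMER_NAME COLLATE DATABASE_DEFAULT;

-- For frequent access: copy the data once into a local table whose
-- column already has the desired collation, then join against it.
SELECT CUSTOMER_NAME COLLATE DATABASE_DEFAULT AS CUSTOMER_NAME
INTO #REMOTE_CUSTOMERS
FROM localhost.MY_DATABASE.dbo.REMOTE_CUSTOMERS;

SELECT l.CUSTOMER_NAME
FROM dbo.LOCAL_CUSTOMERS AS l
JOIN #REMOTE_CUSTOMERS AS r
    ON l.CUSTOMER_NAME = r.CUSTOMER_NAME;
```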

Table variable for large tables (vs temporary tables)

Francois — Wed, 14 Sep 2011 02:45:16 +0000

The main reason Microsoft introduced table variables in SQL Server 2000 was to reduce stored procedure recompilations (a recompilation occurs when the stored procedure execution plan is recreated). Table variables also have a well-defined scope (the current procedure or function), and they induce less logging and locking (because their transactions last for a single SQL statement). These are great advantages when dealing with short, simple OLTP-style queries and processes.

However, there are huge drawbacks of using table variables when you process a lot of rows. For a large table, using a table variable is very often a bad practice…

Statistics
First, they do not have any statistics (statistics are used by the query optimizer to produce the most efficient query plan based on data distribution). The following example demonstrates that the query optimizer has no clue about how many rows a table variable has when building the query plan:

SET NOCOUNT ON
-- Declare table variable
DECLARE @TABLE_VARIABLE TABLE (ID INT PRIMARY KEY CLUSTERED)
DECLARE @I INT = 0

-- Insert 10K rows
BEGIN TRAN
WHILE @I < 10000
BEGIN
    INSERT INTO @TABLE_VARIABLE VALUES (@I)
    SET @I = @I + 1
END
COMMIT TRAN

-- Display all rows and output execution plan
set statistics profile on
SELECT * FROM @TABLE_VARIABLE
set statistics profile off

Result:

Rows	StmtText	…	EstimateRows
10000	|--Clustered Index Scan(OBJECT:(@TABLE_VARIABLE))	…	1

The optimizer does not recompile queries that use table variables. In our example, although SQL Server performs a clustered index scan, it assumes the index has only one row because the engine does not have access to statistics on the table variable or its clustered index. Of course, such an assumption can have a huge impact on performance when a suboptimal query plan is used on a large table. A workaround is to use the OPTION (RECOMPILE) hint.
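Here is the same test with the hint added, as a minimal sketch. With OPTION (RECOMPILE), the statement is compiled after the table variable has been populated, so the EstimateRows column should reflect the actual row count instead of 1.

```sql
SET NOCOUNT ON;

-- Same table variable as in the example above
DECLARE @TABLE_VARIABLE TABLE (ID INT PRIMARY KEY CLUSTERED);
DECLARE @I INT = 0;

-- Insert 10K rows
WHILE @I < 10000
BEGIN
    INSERT INTO @TABLE_VARIABLE VALUES (@I);
    SET @I = @I + 1;
END;

-- Recompile the statement now that the table variable is populated
SET STATISTICS PROFILE ON;
SELECT * FROM @TABLE_VARIABLE OPTION (RECOMPILE);
SET STATISTICS PROFILE OFF;
```

The hint only gives the optimizer the row count. The table variable still has no distribution statistics, so estimates for filtered queries can remain poor.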

Indexes
You can’t add indexes to a table variable, even though creating specific indexes obviously helps to improve query performance. The workaround is to specify constraints when declaring the table. Specifying PRIMARY KEY CLUSTERED will create a clustered index, and specifying a UNIQUE column will create a nonclustered index. However, constraints won’t always give you the flexibility you need. For example, it won’t be possible to create nonunique clustered indexes or nonclustered indexes with included columns.
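The constraint workaround can be sketched like this. The @ORDERS table and its columns are hypothetical. The last constraint illustrates a common trick, shown here as an assumption rather than a recommendation: appending the primary key to a nonunique column makes the combination unique, which yields an index that can be used for seeks on that column.

```sql
DECLARE @ORDERS TABLE (
     ORDER_ID    INT      NOT NULL PRIMARY KEY CLUSTERED  -- creates a clustered index
    ,ORDER_CODE  CHAR(10) NOT NULL UNIQUE                 -- creates a unique nonclustered index
    ,CUSTOMER_ID INT      NOT NULL
    -- CUSTOMER_ID alone is not unique; adding ORDER_ID makes the pair unique
    ,UNIQUE (CUSTOMER_ID, ORDER_ID)
);

INSERT INTO @ORDERS (ORDER_ID, ORDER_CODE, CUSTOMER_ID)
VALUES (1, 'A-0001', 100), (2, 'A-0002', 100), (3, 'A-0003', 200);

-- Can be answered with the index behind the composite UNIQUE constraint
SELECT ORDER_ID FROM @ORDERS WHERE CUSTOMER_ID = 100;
```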

Parallel plans
When executing INSERTs, UPDATEs or DELETEs on a table variable, SQL Server never generates a parallel execution plan. This is a huge handicap that heavily affects query performance when working with large datasets.
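For large row counts, the usual alternative is a temporary table, which has statistics and accepts any index definition. This is a minimal sketch with a hypothetical #SALES_STAGING table; it is meant to show the index options that table variables lack, not to prescribe a schema.

```sql
-- Hypothetical staging table used instead of a table variable
CREATE TABLE #SALES_STAGING (
     ID          INT            NOT NULL
    ,SALE_DATE   DATETIME       NOT NULL
    ,SALE_AMOUNT NUMERIC(28,10) NOT NULL
);

-- Nonunique clustered index: not possible on a table variable
CREATE CLUSTERED INDEX IX_STAGING_DATE
    ON #SALES_STAGING (SALE_DATE);

-- Nonclustered index with an included column: not possible either
CREATE NONCLUSTERED INDEX IX_STAGING_ID
    ON #SALES_STAGING (ID) INCLUDE (SALE_AMOUNT);

-- ... load and query the table here ...

DROP TABLE #SALES_STAGING;
```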

Using local variables in T-SQL queries

Francois — Fri, 15 Jul 2011 12:00:59 +0000

A query plan is a set of steps generated by the database engine to retrieve data. Query plans are produced by the query optimizer from SQL statements.

SQL Server automatically caches query plans and tries to reuse them whenever possible. For many applications (such as OLTP applications), plan reuse is a very good thing because it avoids needless compilations that would otherwise take time on every execution. SQL Server caches query plans (execution plans based on parameter assumptions) but not execution contexts (execution plans based on the actual parameter values). If you execute a query or stored procedure several times per second, you want to reuse the query plan as much as possible. However, when querying large tables, using the optimal plan is preferable since the queries may take several minutes to complete. In these cases, it is obviously better to save minutes (sometimes hours) with an optimal plan at the cost of an extra second of plan compilation.

In many cases you may end up with a suboptimal query plan because the queries are compiled before the actual parameter values are known. Such is the case when local variables are used. Let’s create a sales table with 1 million sales on July 1st and 5 sales on July 2nd, with an index on the sale date (tested on SQL Server 2008 R2):

SET NOCOUNT ON

-- Drop Sales Table
IF EXISTS(select 1 from INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='SALES_TABLE')
DROP TABLE SALES_TABLE

-- Create Sales Table
CREATE TABLE SALES_TABLE(
[SALES_ID] [int] NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED
,SALE_DATE [datetime] NOT NULL
,SALE_AMOUNT [numeric](28,10) NOT NULL
)

-- Insert Sales Data
DECLARE @I INT = 0
BEGIN TRAN
WHILE @I < 1000000
BEGIN
    INSERT INTO SALES_TABLE(SALE_DATE,SALE_AMOUNT) SELECT '20110701', RAND() * 100.0
    SET @I = @I + 1
END

SET @I = 0
WHILE @I < 5
BEGIN
    INSERT INTO SALES_TABLE(SALE_DATE,SALE_AMOUNT) SELECT '20110702', RAND() * 100.0
    SET @I = @I + 1
END
COMMIT TRAN

-- Create index on Sale Date
CREATE NONCLUSTERED INDEX [IX_SALE_DATE] ON [SALES_TABLE]
(
[SALE_DATE] ASC
)

Let’s summarize the sales for July 2nd:

-- Query with constant value
set statistics profile on
SELECT SUM([SALE_AMOUNT]) FROM SALES_TABLE WHERE [SALE_DATE]='20110702'
set statistics profile off

You will notice in the results that the engine does an Index Seek on the IX_SALE_DATE index. This is the optimal plan since there are only 5 sales on July 2nd. Now let’s declare the local variable @mydate and set it to July 2nd:

-- Parametrized query with local variable
declare @mydate datetime = '20110702'
set statistics profile on
SELECT SUM([SALE_AMOUNT]) FROM SALES_TABLE WHERE [SALE_DATE]=@mydate
set statistics profile off

SQL Server does not use the optimal plan (it does a clustered index scan instead). Why is that? The engine simply ignores the local variable’s value and compiles a plan based on general statistical assumptions. The compiled query plan is “good enough” for just about any value of @mydate (note that the query optimizer is sometimes way off; make sure that you have enough statistics and that they are up to date). When you use local variables in your queries and can afford to lose one or two seconds, you should force query recompilation using this syntax:

Use OPTION (RECOMPILE)

-- Parametrized query with local variable and OPTION (RECOMPILE)
declare @mydate datetime = '20110702'
set statistics profile on
SELECT SUM([SALE_AMOUNT]) FROM SALES_TABLE WHERE [SALE_DATE]=@mydate OPTION (RECOMPILE)
set statistics profile off

Note that the same principle applies to stored procedure compilations: when you want a procedure to be recompiled for the parameter values you provide, and thus avoid reusing a plan that was built for different values (parameter sniffing), you can use the WITH RECOMPILE argument of the EXEC statement.
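As a sketch, the stored procedure version could look like the following. The procedure name GET_SALES_TOTAL is hypothetical, and the example assumes the SALES_TABLE created above.

```sql
-- GO is the batch separator understood by SSMS and sqlcmd;
-- CREATE PROCEDURE must be the first statement of its batch.
CREATE PROCEDURE dbo.GET_SALES_TOTAL   -- hypothetical procedure name
    @mydate DATETIME
AS
BEGIN
    SELECT SUM([SALE_AMOUNT])
    FROM SALES_TABLE
    WHERE [SALE_DATE] = @mydate;
END
GO

-- Compile a fresh plan for this call only, using the value passed in
EXEC dbo.GET_SALES_TOTAL @mydate = '20110702' WITH RECOMPILE;
GO
```

If every call should get its own plan, the procedure can instead be created with the WITH RECOMPILE option, or the statement inside it can carry OPTION (RECOMPILE) as shown earlier.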