SQL Bad Practices

Speeding up aggregates with indexed views

Francois — Sun, 25 Mar 2012 19:52:12 +0000

Every so often, queries include aggregate data. In a data warehouse environment, aggregates are frequently used for metric and Key Performance Indicator (KPI) calculations. They are also used a lot for reporting purposes and for statistical computation.

When a query performs poorly, our first instinct is to add an index to speed it up. Creating a view and indexing that view is an often-overlooked solution. However, an indexed view might perform much better, with a smaller performance hit on INSERT, UPDATE and DELETE, than an index. The Enterprise edition of SQL Server is able to automatically use indexed views even when they are not referenced in the query, just like indexes.

Let’s create a typical web log table with two million entries. This table contains the cookie_id of the visitor, the visit date and the transaction value if the user made a purchase during the visit.

-- Create table
IF OBJECT_ID('WEBLOG','U') IS NULL
BEGIN
CREATE TABLE [WEBLOG](
[COOKIE_ID] [int] NOT NULL,
[VISIT_DATE] [date] NOT NULL,
[TRANSACTION_VALUE] [money] NOT NULL
) ON [PRIMARY]

CREATE CLUSTERED INDEX [IX_VISIT_DATE] ON [WEBLOG]
( [VISIT_DATE] ASC ) ON [PRIMARY]
END

-- Empty the table
DELETE FROM WEBLOG

-- Insert 2 million rows
DECLARE @I INT = 0
DECLARE @HITS INT = 2000000
BEGIN TRANSACTION

WHILE @I<@HITS
BEGIN

INSERT INTO WEBLOG(COOKIE_ID,VISIT_DATE,TRANSACTION_VALUE)
SELECT RAND()*@HITS/100 -- COOKIE_ID
,DATEADD(day,CONVERT(INT,(RAND()*365)),'20110101') -- VISIT DATE
,CASE WHEN RAND()<0.01
THEN RAND()*10.0 ELSE 0.0 END -- TRANSACTION_VALUE

SET @I=@I+1
END

COMMIT TRANSACTION

We wish to report the number of visits, the lifetime value (total purchases) and revenue per visit of each cookie:

SELECT COOKIE_ID
, COUNT(*) AS FREQUENCY
, SUM(TRANSACTION_VALUE) AS LIFETIME_VALUE
, AVG(TRANSACTION_VALUE) AS REVENUE_PER_VISIT
FROM [DBO].WEBLOG
GROUP BY COOKIE_ID

On average, this query takes 600.4 milliseconds on my system. We can speed it up using a nonclustered index:

CREATE NONCLUSTERED INDEX [IX_RFM] ON [WEBLOG] ([COOKIE_ID])
INCLUDE ( [TRANSACTION_VALUE]) ON [PRIMARY]

It now takes 336 milliseconds to run the query, a performance improvement of 44%. We can increase this performance gain by using an indexed view instead of creating the previous index:

IF OBJECT_ID('RFM','V') IS NOT NULL
DROP VIEW [RFM]
GO

-- Create view
CREATE VIEW [RFM] WITH SCHEMABINDING AS
SELECT COOKIE_ID, SUM(TRANSACTION_VALUE) AS MONETARY
,COUNT_BIG(*) AS FREQUENCY
FROM [DBO].WEBLOG
GROUP BY COOKIE_ID
GO

-- Create clustered index on view, making it an indexed view
CREATE UNIQUE CLUSTERED INDEX IDX_RFM_V ON [RFM] (COOKIE_ID);

The query now runs in 53.8 ms on average: a 91% performance gain. Fetching the data is almost instantaneous because the result of the view is materialized: the majority of the elapsed time is spent displaying the query results. Performance does not depend on the underlying table but rather on what can be fetched from the materialized results of the view (everything, in our case):

Query execution plan

Here, even though the average aggregate (AVG) is not defined in the view, the query optimizer is able to derive the result from the COUNT and the SUM aggregates. If the view gets big, you can also create nonclustered indexes on your view to speed up access to subsets of it. You get the best query performance gains when your underlying tables are large and your query results stay small (hence the benefit with aggregations).
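
On editions other than Enterprise, the optimizer does not match queries to indexed views automatically; the view has to be referenced by name with the NOEXPAND table hint. The following is a minimal sketch against the RFM view defined above. It derives the average by hand from the stored SUM and COUNT_BIG columns, and the column aliases are chosen only to mirror the original report query:

```sql
-- Query the indexed view directly. NOEXPAND asks the optimizer to read
-- the view's materialized index instead of expanding the view definition.
SELECT COOKIE_ID
, FREQUENCY
, MONETARY AS LIFETIME_VALUE
, MONETARY / FREQUENCY AS REVENUE_PER_VISIT -- AVG derived as SUM / COUNT
FROM [DBO].[RFM] WITH (NOEXPAND)
```

FREQUENCY can never be zero here because a grouped view only contains groups with at least one row.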

The improved query speed comes with additional overhead when modifying table data (just like indexes). The following table displays a summary of my test results:

Indexed views are also a great way to improve INNER JOIN performance. When two or more tables are prejoined in an indexed view, the query optimizer can choose to retrieve the materialized view data instead of performing a costly join operation.
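
As a sketch of the prejoined case, the script below joins WEBLOG to a second table inside an indexed view. The VISITORS table, its COUNTRY column and the view name RFM_BY_COUNTRY are hypothetical and are not part of the original example:

```sql
-- Hypothetical dimension table (an assumption made for illustration only)
CREATE TABLE [DBO].[VISITORS](
[COOKIE_ID] [int] NOT NULL PRIMARY KEY,
[COUNTRY] [varchar](50) NOT NULL
)
GO

-- The join and the aggregation are both materialized by the view
CREATE VIEW [DBO].[RFM_BY_COUNTRY] WITH SCHEMABINDING AS
SELECT V.COUNTRY
, SUM(W.TRANSACTION_VALUE) AS MONETARY
, COUNT_BIG(*) AS FREQUENCY
FROM [DBO].WEBLOG AS W
INNER JOIN [DBO].VISITORS AS V ON V.COOKIE_ID = W.COOKIE_ID
GROUP BY V.COUNTRY
GO

-- The unique clustered index is what turns the view into an indexed view
CREATE UNIQUE CLUSTERED INDEX IDX_RFM_BY_COUNTRY ON [DBO].[RFM_BY_COUNTRY] (COUNTRY)
```

Every modification of either base table now also maintains the view, so the write overhead discussed above applies to both tables.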

For more information on indexed views, see the following Microsoft article: http://technet.microsoft.com/en-us/library/cc917715.asp.

Windows Authentication: giving built-in or service accounts permissions to your database

Francois — Mon, 16 Jan 2012 11:02:04 +0000

Some days ago, I was asked to do a quick security check on a web application. After reviewing authentication best practices, we quickly looked at how the application was connecting to its databases. As with a lot of web apps, the application was connecting to the database through a single login.

Positive points first:

Windows Authentication mode was used (√)
The application is using Windows Authentication mode to connect to its databases. This is great since you don’t have to include usernames or passwords within your code. Moreover, Windows has all sorts of policies that you can enable to make your application more secure, such as the password complexity policy or the password expiration policy. For any SQL Server logins you still use, make sure SQL Server enforces these policies too (CHECK_POLICY=ON at the login level – see ALTER LOGIN).

Another benefit is that no clear text password is transferred through the communication protocol (as happens when a SQL Server password is used), preventing attackers from sniffing passwords. More details here.

Allow remote connections to this server was unchecked (√)
Since the web application is running on the same server as SQL Server, there is no need to accept remote connections, because the connections will always be handled by the web application. This prevents all intrusions from external applications or individuals.

SQL Server port was blocked by the Firewall (√)
By default, SQL Server listens on TCP port 1433 for remote connections. Blocking this port at the firewall level adds a layer of protection.

HOWEVER, there were two major security issues:

Too many permissions were given to the login (x)
An important principle in computer security is the principle of least privilege. In this case, it means that the login should have had only the minimal permissions needed by the application. However, the login was given administrative privileges (sysadmin server role) when it should have had only read/write permissions on some tables and execute permissions on some stored procedures.

The login used to connect to the database was the IIS anonymous built-in account (x)
As in a lot of web applications, IIS anonymous authentication was used and the application was connecting through a single login. By default, when IIS anonymous authentication mode is used, the IIS web application runs as the IUSR_YourComputerName Windows user (the IIS anonymous built-in account). So database permissions were given to that login.

This is a bad practice. If you grant this login permissions in SQL Server, you end up with a big security flaw: anyone who is able to run an IIS application will have access to your database (or worse, they could administer your whole SQL instance, as in this specific case). Other web applications won’t even need a password because IIS automatically impersonates this account!

The same is true if you have an application running as a Windows service and you decide that you want to connect to SQL Server using Windows authentication while your service runs under the Local System, Network Service or Local Service accounts. If you give these special accounts access to your database, any service (current or new) will be able to access your data without even providing a password…

So what should be done?

1- Create a local account:

2- Give this user minimal permissions to your database:

3- In IIS, modify the Anonymous Authentication credentials from IUSR to Application pool identity. More details here.

4- Create an IIS Application Pool for your web application and set its Identity to the local user you just created:

Your web application will now be running under your local account and will be able to connect to its database using Windows Authentication mode. You can do the same if your application is a Windows service: depending on your needs, you can modify the account identity under the Log On tab or impersonate your local account when you wish to interact with your database.
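
Step 2 above can be sketched in T-SQL roughly as follows. All names here (the MYSERVER\WebAppUser account, the MY_DATABASE database, the dbo.ORDERS table and the dbo.GET_ORDERS procedure) are hypothetical placeholders; the actual list of grants depends on what your application really needs:

```sql
-- Register the local Windows account as a SQL Server login
USE [master]
GO
CREATE LOGIN [MYSERVER\WebAppUser] FROM WINDOWS
GO

-- Map the login to a database user and grant only what the application uses
USE [MY_DATABASE]
GO
CREATE USER [WebAppUser] FOR LOGIN [MYSERVER\WebAppUser]
GRANT SELECT, INSERT, UPDATE, DELETE ON [dbo].[ORDERS] TO [WebAppUser]
GRANT EXECUTE ON [dbo].[GET_ORDERS] TO [WebAppUser]
GO
```

The login is deliberately not added to any server role, which keeps it in line with the principle of least privilege discussed earlier.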

Should you shrink your database in your maintenance plan?

Francois — Fri, 18 Nov 2011 00:38:51 +0000

Management Studio provides a neat GUI to define maintenance plans for your SQL databases. This tool is available under Management -> Maintenance Plans and provides many Maintenance Plan Tasks that you can configure easily. There are many useful tasks such as “Check Database Integrity Task”, “Back Up Database Task”, “Rebuild Index Task”, etc.

One of the tasks is “Shrink Database Task”, and it should never be used as part of a scheduled maintenance plan, essentially because it does nothing useful and does many awful things.

Let’s start by listing the cons of shrinking a database:

1- First, it takes time and a lot of I/O and CPU. If you are lucky enough to have a time window when you can perform heavy maintenance tasks, you should use this time to do useful tasks such as rebuilding some indexes, updating your statistics, verifying the integrity of your databases, backing up your databases, testing your backups, loading and warming your cubes, etc.

2- Second, a SHRINKDATABASE without the TRUNCATEONLY argument (which essentially calls SHRINKFILE for each of the database’s data and log files) will attempt to move data to the beginning of the file. In doing so, it induces index fragmentation: the data is no longer contiguous and index scans are slower than on a non-fragmented index.

3- Also, by shrinking your files you are releasing space back to Windows (unless you specify NOTRUNCATE, and I cannot imagine a case where specifying this argument would make sense). But is this a good thing? I have yet to see a database that shrinks over time, so you will eventually have to grow the data file back. File growth takes time, and recurrent file growth induces physical fragmentation, which increases disk I/O latency.

The advantage of shrinking a database:

1- Shrinking the database releases disk space. As I pointed out in the last bullet, this is not necessarily a good thing.

You should consider shrinking only when the following conditions are met:

You desperately need the space and can’t add disk space to the system
You deleted a lot of data (or grew the data file too much) and you don’t expect the data files to grow back. A good example would be to shrink files which are on a filegroup that will become read-only, such as with date-based partitions.
This is a manual operation and the shrink is not part of a recurrent maintenance task
You first tried SHRINKFILE with the TRUNCATEONLY argument. The operation won’t induce fragmentation and the SHRINKFILE will be very fast.
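
The last condition can be sketched as follows. MY_DATABASE and the logical file name MY_DATABASE_Data are hypothetical; the first query is there to look up the real logical names in your own database:

```sql
USE [MY_DATABASE]
GO

-- List the logical file names and current sizes (size is stored in 8 KB pages)
SELECT name, type_desc, size * 8 / 1024 AS size_mb
FROM sys.database_files
GO

-- Release only the unused space at the end of the file; no pages are moved,
-- so the operation does not fragment the indexes
DBCC SHRINKFILE (N'MY_DATABASE_Data', TRUNCATEONLY)
GO
```

If this does not free enough space, the free space is not at the end of the file, and only then is it worth weighing a full shrink against the drawbacks listed above.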

Great article by Paul S. Randal, who recommends creating a new filegroup instead of using SHRINKDATABASE: http://www.sqlskills.com/BLOGS/PAUL/post/Why-you-should-not-shrink-your-data-files.aspx.

Linked servers and distributed queries

Francois — Sun, 23 Oct 2011 22:37:38 +0000

Linked servers allow you to issue SQL commands against OLE DB providers. Because Microsoft also makes available an “OLE DB Provider for ODBC Drivers”, it is also possible to issue queries against a variety of ODBC drivers. With linked servers and distributed queries, you can query all sorts of data sources and merge them on the fly with your SQL Server database. Examples of data sources include Analysis Services (SSAS), Access, Excel, text files, Oracle, MySQL as well as SQL Server instances and many, many other sources.

This article will focus on distributed queries over SQL Server instances. Organizations frequently use linked servers in order to use data that is available on other servers or instances. This flexible strategy eliminates the need to synchronize data across several servers (for example with replication). It is not a bad choice, but one must understand the pitfalls in order to make informed decisions.

Creating a linked server
The easiest way to start is to create a linked server that references your own server:

-- 'localhost' is the standard "loopback" hostname
-- It points to the local machine
EXEC sp_addlinkedserver 'localhost'

Once the “localhost” linked server is created, you can reference objects using a four-part name in the form linkedserver.catalog.schema.object, for example:

SELECT * FROM localhost.MY_DATABASE.dbo.MY_TABLE;

Network Bandwidth and latency
The first obvious drawback of using a linked server is the network cost. For this reason alone, linked servers should not be used when we seek optimal performance (unless you need to scale your database across multiple servers, but then again it’s not necessarily a good approach). There is a lot of overhead involved: SQL Server has to query the object metadata and the statistics (if possible), and send the query and results over the network. Note that we are not talking about end results here but intermediate query results, so it doesn’t matter if your query returns only one row. All this overhead makes a remote query a lot more expensive than a local query, and joins between local and remote tables won’t be optimal. In general, you want to use linked servers when coupling is low, that is, when you do not need to join intermediate results with local database objects.
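
One way to limit the intermediate rows that cross the network is to send the whole statement to the remote server with OPENQUERY, so that filtering and aggregation run remotely and only the final rows come back. This is a sketch using the localhost linked server created above; the columns CUSTOMER_ID and AMOUNT of MY_TABLE are assumptions made for the example:

```sql
-- The quoted statement is executed as-is on the linked server;
-- only the aggregated rows travel back to the local instance
SELECT *
FROM OPENQUERY(localhost,
'SELECT CUSTOMER_ID, SUM(AMOUNT) AS TOTAL_AMOUNT
 FROM MY_DATABASE.dbo.MY_TABLE
 WHERE AMOUNT > 0
 GROUP BY CUSTOMER_ID')
```

The trade-off is that the remote statement is an opaque string to the local optimizer, so this suits low-coupling scenarios better than queries that join remote and local tables.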

Transactions
There is also significantly more overhead involved in distributed transactions. All servers involved in a transaction must run the MSDTC service (Distributed Transaction Coordinator), which must be properly installed and configured. Avoid distributed transactions unless absolutely necessary.

Distribution statistics
The query processor uses statistics in order to produce the best possible query plan, and SQL Server is able to use linked server statistics to optimize the query execution plan. However, the user running the query must have appropriate permissions on the remote server in order for the engine to use them. Awkwardly, for SQL Server, it turns out that the user running the query must have the permission to run DBCC SHOW_STATISTICS. The MSDN documentation states: “(…) to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.” (link: http://msdn.microsoft.com/en-us/library/ms189811.aspx). This is far more permission than is needed to read a table. Let’s hope Microsoft will fix this flaw in the near future. You can vote for the Microsoft Connect suggestion here.

Collations
Collations are used by SQL Server to compare and order strings. When working with remote SQL Server instances, the engine will correctly compare and order strings based on the remote column collation. Therefore, if remote and local columns have different collations, collation conflicts will result. When defining a linked server, you have the option of using the remote or the local collation (“Use Remote Collation” in Server Options). If that option is set to true, SQL Server will try to push the ORDER BY and the WHERE clauses to the remote server. If Use Remote Collation is set to false, SQL Server will use the default collation of the local server instance. If the default collation of the local server instance does not match the remote column collation, this will result in poor performance: the local server will have to filter and order the data itself, so every row must be transferred first. It is obviously much faster to filter and order the data on the remote server. Then again, deciding to use the remote collation could lead to incorrect results.
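
The “Use Remote Collation” option can also be set from T-SQL. This is a minimal sketch for the localhost linked server created earlier:

```sql
-- Use the collation of the remote columns, so that WHERE and ORDER BY
-- clauses can be evaluated on the remote server
EXEC sp_serveroption @server = 'localhost'
, @optname = 'use remote collation'
, @optvalue = 'true'

-- Check the current setting of every linked server
SELECT name, uses_remote_collation FROM sys.servers
```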

Moreover, it is not possible to join on columns that have a different collation. The workaround is to explicitly cast the collation when querying the remote server with the COLLATE clause. But this is an expensive operation if you must scan millions of rows, especially if you need to access the column frequently. In that case, you should manually transfer the data to a local table with the proper collation. This problem can also arise on the same local database since collations are defined at the column level.
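
Both workarounds can be sketched as follows. The local table dbo.LOCAL_CITIES, the CITY columns and the copy table dbo.REMOTE_CITIES_COPY are hypothetical names used only for illustration:

```sql
-- Workaround 1: cast the remote column to the local database collation
-- at query time (paid on every execution)
SELECT L.CITY
FROM dbo.LOCAL_CITIES AS L
INNER JOIN localhost.MY_DATABASE.dbo.MY_TABLE AS R
ON L.CITY = R.CITY COLLATE DATABASE_DEFAULT

-- Workaround 2: copy the remote data once into a local table
-- whose column already has the proper collation
SELECT CITY COLLATE DATABASE_DEFAULT AS CITY
INTO dbo.REMOTE_CITIES_COPY
FROM localhost.MY_DATABASE.dbo.MY_TABLE
```

The second form pays the conversion cost a single time, at the price of having to refresh the copy when the remote data changes.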

Table variable for large tables (vs temporary tables)

Francois — Wed, 14 Sep 2011 02:45:16 +0000

The main reason why Microsoft introduced table variables in SQL Server 2000 was to reduce stored procedure recompilations (a recompilation occurs when the stored procedure execution plan is recreated). Table variables also have a well-defined scope (the current procedure or function) and they induce less logging and locking (because their transactions last for a single SQL statement). These are great advantages when dealing with short, simple OLTP-style queries and processes.

However, there are huge drawbacks of using table variables when you process a lot of rows. For a large table, using a table variable is very often a bad practice…

Statistics
First, they do not have any statistics (statistics are used by the query optimizer to produce the most efficient query plan based on data distribution). The following example demonstrates that the query optimizer has no clue about how many rows a table variable has when building the query plan:

SET NOCOUNT ON
-- Declare table variable
DECLARE @TABLE_VARIABLE TABLE (ID INT PRIMARY KEY CLUSTERED)
DECLARE @I INT = 0

-- Insert 10K rows
BEGIN TRAN
WHILE @I < 10000
BEGIN
INSERT INTO @TABLE_VARIABLE VALUES (@I)
SET @I=@I+1
END
COMMIT TRAN

-- Display all rows and output execution plan
set statistics profile on
SELECT * FROM @TABLE_VARIABLE
set statistics profile off

Result:

Rows	StmtText	…	EstimateRows
10000	|--Clustered Index Scan(OBJECT:(@TABLE_VARIABLE))	…	1

The optimizer does not recompile queries that use table variables. In our example, although SQL Server performs a clustered index scan, it assumes the index has only one row because the engine does not have access to any statistics on the table variable or its clustered index. Of course, such an assumption can have a huge impact on performance when a suboptimal query plan is used on a large table. A workaround is to use the OPTION (RECOMPILE) hint.
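
The workaround can be tried on the same example. This sketch only adds the hint to the final SELECT; with a statement-level recompile the optimizer can see how many rows the table variable holds at that point, so the EstimateRows column should no longer show 1:

```sql
SET NOCOUNT ON
-- Same table variable as above, filled with 10K rows
DECLARE @TABLE_VARIABLE TABLE (ID INT PRIMARY KEY CLUSTERED)
DECLARE @I INT = 0

WHILE @I < 10000
BEGIN
INSERT INTO @TABLE_VARIABLE VALUES (@I)
SET @I=@I+1
END

-- Recompile the statement once the table variable is populated
set statistics profile on
SELECT * FROM @TABLE_VARIABLE OPTION (RECOMPILE)
set statistics profile off
```

The hint fixes the row count estimate only; the table variable still has no distribution statistics on its columns.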

Indexes
You can’t explicitly create indexes on a table variable, even though specific indexes obviously help query performance. The workaround is to specify constraints when declaring the table: a PRIMARY KEY CLUSTERED constraint creates a clustered index and a UNIQUE constraint creates a nonclustered index. However, you won’t always get the flexibility you need. For example, it is not possible to create non-unique clustered indexes or nonclustered indexes with included columns.
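
For comparison, a temporary table accepts the index types that a table variable cannot have. This is a sketch; the #SALES table and its columns are hypothetical:

```sql
CREATE TABLE #SALES (
ID INT NOT NULL,
CUSTOMER_ID INT NOT NULL,
AMOUNT MONEY NOT NULL)

-- Non-unique clustered index: not possible on a table variable
CREATE CLUSTERED INDEX IX_SALES_CUSTOMER ON #SALES (CUSTOMER_ID)

-- Nonclustered index with an included column: not possible either
CREATE NONCLUSTERED INDEX IX_SALES_ID ON #SALES (ID) INCLUDE (AMOUNT)

-- ... load and query #SALES here ...

DROP TABLE #SALES
```

Temporary tables also get statistics, which addresses the estimation problem described in the previous section.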

Parallel plans
When executing INSERTs, UPDATEs or DELETEs on a table variable, SQL Server never generates a parallel execution plan. This is a huge handicap and heavily affects query performance when working with large datasets.

Using NOT IN operator with null values

Francois — Fri, 19 Aug 2011 02:15:00 +0000

The IN operator compares a value with a list of values. It must, however, be used with care when we are dealing with nulls.

Let’s create a table containing three city names and a null value. The goal is to check whether a city is in the list or not.

-- By default ANSI_NULLS is on, so null comparisons follow the SQL-92 standard.
-- In a future version of SQL Server, it won't be possible to modify this setting.
SET ANSI_NULLS ON

IF EXISTS(select 1 from INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='CITIES')
DROP TABLE [CITIES]

CREATE TABLE [CITIES] (CITY [varchar](50) NULL)

INSERT INTO CITIES
SELECT 'PARIS' UNION ALL
SELECT 'MONTREAL' UNION ALL
SELECT 'NEW YORK' UNION ALL
SELECT NULL

The table now contains the following city names:

PARIS

MONTREAL

NEW YORK

NULL

Let’s use the IN operator to determine if Montreal is in the city table:

SELECT 'Found Montreal'
WHERE 'Montreal' IN (SELECT city from CITIES)

Montreal is found so everything is all right. Now let’s try the following query to find out if Sidney appears in the table:

SELECT 'Found Sidney'
WHERE 'Sidney' IN (SELECT city from CITIES)

We still get the right result: Sidney is not in the list so no row is returned. Now, to find out if Sidney is missing from the table, we would write something like this:

SELECT 'Sidney Not Found'
WHERE 'Sidney' NOT IN (SELECT city from CITIES)

However, something is definitely wrong here. Sidney is not in the list and still no row is returned. Let’s try a different approach:

SELECT 'Sidney Not Found'
WHERE 'Sidney' NOT IN ('Paris','Montreal','New York')

That one works. The null value affects the outcome of the NOT IN operator. This is because the operator compares each city in the list; the previous query is logically equivalent to the following query:

SELECT 'Sidney Not Found'
WHERE 'Sidney'<>'Paris'
AND 'Sidney'<>'Montreal'
AND 'Sidney'<>'New York'

We therefore get this logically equivalent query if we add a null value:

SELECT 'Sidney Not Found'
WHERE 'Sidney'<>'Paris'
AND 'Sidney'<>'Montreal'
AND 'Sidney'<>'New York'
AND 'Sidney'<>null

… and since, by default, “'Sidney' <> null” is UNKNOWN (neither true nor false), no row is returned, because every condition must be true for the AND operator to return TRUE. A similar counter-intuitive result happens with the IN operator, as in this example:

SELECT city from CITIES
WHERE city in (select city from CITIES)

Here the row with the null value has disappeared from the result, because the comparison NULL = NULL evaluates to UNKNOWN rather than TRUE.

When checking for existence, you should use the EXISTS operator if the columns involved are nullable. Using the IN operator might produce an inferior plan and can lead to misleading results if a null value is inserted in the table. In our example, we can rewrite our query as:

SELECT 'Sidney Not Found'
WHERE NOT EXISTS
(SELECT 1/0 FROM CITIES WHERE CITY = 'Sidney')

The EXISTS operator returns TRUE if the subquery returns at least one row and FALSE otherwise. Also note that the columns returned by the subquery are never evaluated because there is no need to. That is why the previous query didn’t throw a “Divide by zero” error.
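
If a NOT IN has to stay in place, another option is to filter the nulls out of the list explicitly. This sketch runs against the CITIES table created above:

```sql
-- With the null removed from the list, every comparison is either
-- TRUE or FALSE, so NOT IN behaves as expected and the row is returned
SELECT 'Sidney Not Found'
WHERE 'Sidney' NOT IN (SELECT CITY FROM CITIES WHERE CITY IS NOT NULL)
```

The NOT EXISTS form is still the safer habit, since it does not rely on remembering the extra filter.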

Using local variables in T-SQL queries

Francois — Fri, 15 Jul 2011 12:00:59 +0000

A query plan is a set of steps generated by the database engine to retrieve data. Query plans are produced by the query optimizer from SQL statements.

SQL Server automatically caches query plans and tries to reuse them whenever possible. For many applications (such as OLTP transactional applications), plan reuse is a very good thing since it avoids unneeded compilations that would otherwise take time on each execution of a query. SQL Server caches query plans (execution plans based on parameter assumptions) but not execution contexts (execution plans based on the actual parameter values). If you execute a query or stored procedure several times per second, you want to reuse the query plan as much as possible. However, when querying large tables, using the optimal plan is preferable since the queries may take several minutes to complete. In these cases, it is obviously better to save minutes (sometimes hours) with an optimal plan at the cost of an extra second of plan compilation.

In many cases you may end up with a sub-optimal query plan because the queries are compiled before the actual parameter values are known. Such is the case when local variables are used. Let’s create a sales table with 1 million sales on July 1st and 5 sales on July 2nd, with an index on the sale date (tested on SQL Server 2008 R2):

SET NOCOUNT ON

-- Drop Sales Table
IF EXISTS(select 1 from INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='SALES_TABLE')
DROP TABLE SALES_TABLE

-- Create Sales Table
CREATE TABLE SALES_TABLE(
[SALES_ID] [int] NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED
,SALE_DATE [datetime] NOT NULL
,SALE_AMOUNT [numeric](28,10) NOT NULL
)

-- Insert Sales Data
DECLARE @I INT = 0
BEGIN TRAN
WHILE @I < 1000000
BEGIN
INSERT INTO SALES_TABLE(SALE_DATE,SALE_AMOUNT) SELECT '20110701', RAND() * 100.0
SET @I=@I+1
END

SET @I=0
WHILE @I < 5
BEGIN
INSERT INTO SALES_TABLE(SALE_DATE,SALE_AMOUNT) SELECT '20110702', RAND() * 100.0
SET @I=@I+1
END
COMMIT TRAN

-- Create index on Sale Date
CREATE NONCLUSTERED INDEX [IX_SALE_DATE] ON [SALES_TABLE]
(
[SALE_DATE] ASC
)

Let’s summarize the sales for July 2nd:

-- Query with constant value
set statistics profile on
SELECT SUM([SALE_AMOUNT]) FROM SALES_TABLE WHERE [SALE_DATE]='20110702'
set statistics profile off

You will notice in the results that the engine does an Index Seek on the IX_SALE_DATE index. This is the optimal plan since there are only 5 sales on July 2nd. Now let’s declare the local variable @mydate and set it to July 2nd:

-- Parametrized query with local variable
declare @mydate datetime = '20110702'
set statistics profile on
SELECT SUM([SALE_AMOUNT]) FROM SALES_TABLE WHERE [SALE_DATE]=@mydate
set statistics profile off

SQL Server does not use the optimal plan (it does a clustered index scan instead). Why is that? The engine simply ignores the local variable value and compiles a plan based on general statistical assumptions. The compiled query plan is “good enough” for just about any value of @mydate (note that the query optimizer is sometimes way off; you must make sure that you have enough statistics and that they are up to date). When you are using local variables in your queries and you can afford to lose one or two seconds, you should force query recompilation using this syntax:

Use OPTION (RECOMPILE)

-- Parametrized query with local variable and OPTION (RECOMPILE)
declare @mydate datetime = '20110702'
set statistics profile on
SELECT SUM([SALE_AMOUNT]) FROM SALES_TABLE WHERE [SALE_DATE]=@mydate OPTION (RECOMPILE)
set statistics profile off

Note that the same principle applies to stored procedure compilation: when you want a procedure to be recompiled for the parameters provided, thus avoiding parameter sniffing, you can use the WITH RECOMPILE argument with the EXEC statement.
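
A sketch of the stored procedure case, using the SALES_TABLE created above. The procedure name GET_SALES_TOTAL is hypothetical:

```sql
-- Hypothetical procedure wrapping the query used in this article
CREATE PROCEDURE dbo.GET_SALES_TOTAL @SALE_DATE datetime
AS
SELECT SUM([SALE_AMOUNT]) FROM SALES_TABLE WHERE [SALE_DATE] = @SALE_DATE
GO

-- Compile a fresh plan for this call only, based on the value provided;
-- the plan produced for this execution is not kept in the cache
EXEC dbo.GET_SALES_TOTAL @SALE_DATE = '20110702' WITH RECOMPILE
```

Calls made without WITH RECOMPILE keep reusing whichever plan is cached for the procedure.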

Enabling Boost SQL Server priority Option

Francois — Sun, 12 Jun 2011 11:15:29 +0000

On occasion I have seen Database Administrators enabling the SQL Server “Boost SQL Server priority” option. This option is available on the Server Properties Window under Processors:

If you enable this option, SQL Server will run the sqlservr.exe process and its threads at High priority instead of the usual Normal priority. Hence, when the SQL Server service requests CPU, other processes in need of CPU time won’t be prioritized. In some scenarios this can lead to problems, and most of the time it won’t bring any benefit. Microsoft does not recommend enabling this feature; see this Microsoft Support article (search for Priority Boost).

Production servers generally fall into one of these two categories:

1. One server box handling everything. For example, SQL Server is installed on the same machine as a web server running a web application, with perhaps Analysis Services or SSIS processes that run once in a while. This is the worst scenario for enabling the priority boost because several jobs have to run at the same time, so you want all your processes to run smoothly together. One example would be launching a CPU-intensive query within SSIS while computing complex business rules or building Data Mining models. You certainly do not want SQL Server to take all the CPU resources here because other computations are running simultaneously.

2. Dedicated SQL Server machine. If SQL Server is the only service running, there won’t be other processes fighting for CPU time, so enabling the “Boost SQL Server priority” feature brings no benefit. Plus, you could get unpredictable results because core processes such as Windows processes or device drivers may not get enough resources when SQL Server runs CPU-intensive queries (for example, several queries performing hash joins between large tables).

To be fair, Microsoft mentions in this Microsoft Connect article that you might see some performance improvements in “high-end servers primarily with OLTP workloads”. My opinion is that even on the very rare occasions when you get a small improvement for SQL Server, your overall system performance may worsen.

Here’s an example demonstrating that the option is very dangerous. Do not try this if you are connected to a local SQL Server instance because your system will become completely unresponsive and you will have to manually shut down your computer. Do NOT try this on a production server…

1. Configure the priority boost Option (see How to: Configure the priority boost Option for the steps) and restart SQL Server as indicated.

2. Open a new query window in Management Studio and run the following script:

DECLARE @I INT
SELECT @I = 1
FROM sys.objects a
CROSS JOIN sys.objects b
CROSS JOIN sys.objects c
CROSS JOIN sys.objects d
CROSS JOIN sys.objects e
CROSS JOIN sys.objects f
CROSS JOIN sys.objects g

3. Repeat step #2 until the system is totally unresponsive…

4. Cancel all query executions and the system will come back to its normal state.
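
To check whether the option is enabled on an instance, and to turn it back off, sp_configure can be used instead of the Server Properties window. A minimal sketch:

```sql
-- 'priority boost' is an advanced option, so advanced options must be visible
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO

-- Display the configured and running values (0 = disabled, 1 = enabled)
EXEC sp_configure 'priority boost'
GO

-- Disable the option; the new value is applied when SQL Server restarts
EXEC sp_configure 'priority boost', 0
RECONFIGURE
GO
```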

Nice article on the priority boost option: http://blogs.msdn.com/b/arvindsh/archive/2010/01/27/priority-boost-details-and-why-it-s-not-recommended.aspx.

UPDATE: MSDN states in the SQL Server 2008 R2 documentation that the priority boost option will be removed in future versions of SQL Server: http://msdn.microsoft.com/en-us/library/ms180943.aspx .

Keeping Maximum Server Memory default value

Francois — Fri, 03 Jun 2011 01:00:01 +0000

The default value for SQL Server 2008 Maximum Server Memory setting is 2,147,483,647 MB (or 2.1 petabytes!). Therefore, by default, SQL Server will use all available memory for its own use:

If you don’t lower this setting, you will reduce the memory available for other services such as Integration Services (SSIS), Analysis Services (SSAS) and Reporting Services (SSRS), as well as other Windows services (you should also disable the services that you do not use).

I have run into cases where SQL Server 2008 had difficulties handling memory pressure and would throw the infamous “There is insufficient system memory to run this query” error.

Depending on the total amount of memory on your production server, and if you don’t use other memory-intensive services, you should reserve at least 1-2 GB for the Windows operating system (for example, by specifying 14000 MB on a 16 GB machine). If your server is 64-bit, you can easily verify in Windows Task Manager that the sqlservr.exe process never goes much higher than the value you specified (it can go a little higher). If you have a 32-bit version of SQL Server, the sqlservr.exe process will be limited to 2 or 3 GB depending on your server configuration.
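
The setting can be changed in the Server Properties window or with sp_configure. This sketch uses the 14000 MB figure from the example above; the right value for your server is an assumption you need to validate against your own workload:

```sql
-- 'max server memory (MB)' is an advanced option
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO

-- Cap SQL Server at 14000 MB, leaving roughly 2 GB to Windows on a 16 GB machine
EXEC sp_configure 'max server memory (MB)', 14000
RECONFIGURE
GO
```

The new limit applies after RECONFIGURE without restarting the instance.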

To allow SQL Server to use more than 3 GB of memory on 32-bit systems, you have to configure AWE memory allocation. Thereafter, you can verify the memory utilization by issuing the DBCC MEMORYSTATUS command or by using the SQL Server counters in Performance Monitor (in this case Windows Task Manager will not report the correct memory utilization). That being said, you should not enable AWE if your 32-bit server doesn’t have more than 4 GB of memory. Also, keep in mind that the AWE setting is ignored on 64-bit setups.

It goes without saying that if you are running multiple SQL Server instances or other memory-intensive processes such as Analysis Services on the same server, you should carefully configure each service’s memory settings. For example, Analysis Services provides the same memory control mechanism with the Memory \ TotalMemoryLimit server property.

SSIS Non cached Lookups without a covering index

Francois — Fri, 20 May 2011 02:28:20 +0000

The vast majority of the time, you will use the SSIS Lookup component in Full Cache mode. This mode is the fastest because it queries the database only once (before the data flow starts) and applies hashing in order to do high-performance comparisons.

Sometimes, however, you will have to use Non Cached lookups, for example if your reference table doesn’t fit in memory or if you wish to look up rows that you just inserted in your reference table at the beginning of your data flow. You might also run into cases where you need to do inequality lookups, or where you have very few rows at the source and wish to look up a table that has several million rows.

Because the Non Cached mode will query the database for each row, it has to be fast.

If your reference table is large and you neglect to create a covering index, you will get very poor performance. A covering index, by definition, is an index that contains every column used by your query. Note that only the columns in your WHERE clause (the Lookup join columns) must be key columns of the index; the other columns (the Lookup reference columns) only need to be in the “Included Columns” of the index. Including your reference columns in the index lets SQL Server return your reference values without going back to the table pages.

Let’s create a lookup table:

IF NOT EXISTS(select 1 from INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='LOOKUP_TABLE')
CREATE TABLE LOOKUP_TABLE(
[LOOKUP] [varchar](36) NULL,
[REFERENCE] [int] NOT NULL PRIMARY KEY CLUSTERED)

And add 1 million rows:

SET NOCOUNT ON

TRUNCATE TABLE LOOKUP_TABLE
DECLARE @I INT = 0
BEGIN TRAN

WHILE @I < 1000000
BEGIN
INSERT INTO LOOKUP_TABLE(LOOKUP,REFERENCE) SELECT NEWID(), @I
SET @I=@I+1
END

COMMIT TRAN

Select No Cache or Partial Cache in the SSIS Lookup Component dialog box:

The SSIS Lookup query should be defined like this:

And the SSIS Lookup Columns should be defined like this:

The covering index is then created this way:

CREATE NONCLUSTERED INDEX
[IX_NON_CLUSTERED_LOOKUP_INCLUDE_LOOKUP]
ON [LOOKUP_TABLE] ([LOOKUP] ASC)
INCLUDE ([REFERENCE])

We can appreciate the performance gain for 1000 rows:

Index	Elapsed Time
Clustered Index Scan	37.3 sec
Covering Index	0.109 sec

Another way of judging the performance of the lookup is to display SQL Server query statistics for a single query:

Force SQL Server to use the clustered index instead of the covering index:

-- Clear memory cache and show some stats
DBCC DROPCLEANBUFFERS
set statistics io on
set statistics time on

-- Use default SSIS Lookup syntax
select * from (
SELECT [LOOKUP],[REFERENCE] FROM [LOOKUP_TABLE]
WITH (INDEX (0)) -- ...but tell SQL to scan the table instead of using the covering index
) [refTable]
where [refTable].[LOOKUP] = '77EE7751-6BC1-4088-874E-D8F4440BB01A'

Results:
Table 'LOOKUP_TABLE'. Scan count 9, logical reads 6699, physical reads 152
CPU time = 296 ms, elapsed time = 624 ms.

Here SQL Server will use the covering index:

-- Clear memory cache and show some stats
DBCC DROPCLEANBUFFERS
set statistics io on
set statistics time on

-- Use default SSIS Lookup syntax
select * from (
SELECT [LOOKUP],[REFERENCE] FROM [LOOKUP_TABLE]
) [refTable]
where [refTable].[LOOKUP] = '77EE7751-6BC1-4088-874E-D8F4440BB01A'

Results:
Table 'LOOKUP_TABLE'. Scan count 1, logical reads 3, physical reads 2
CPU time = 0 ms, elapsed time = 31 ms.

How NOT to retrieve IDENTITY value

Francois — Fri, 06 May 2011 21:17:21 +0000

It is common to need the IDENTITY value that SQL Server has just generated for an inserted row. One way to deal with the problem is to use the system function @@IDENTITY. For example:

-- Create test table
IF NOT EXISTS(select 1 from INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='Identity_Test')
CREATE TABLE Identity_Test(id int IDENTITY, value int)

-- Insert row and retrieve IDENTITY value
INSERT INTO Identity_Test(value) VALUES(NULL);
SELECT @@IDENTITY;

The problem with this code is that you may not retrieve the identity value of the row that you inserted. @@IDENTITY returns the last identity value created in the session, whatever the scope. For example, if there is a trigger on the table performing an insert on another table that also has an identity column, you will get the identity value created by the trigger. Even if you never create any trigger, you may get skewed results with replicated tables since SQL Server creates its own replication triggers.

One way to deal with the problem is to use SCOPE_IDENTITY():

-- Create test table
IF NOT EXISTS(select 1 from INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='Identity_Test')
CREATE TABLE Identity_Test(id int IDENTITY, value int)

-- Insert row and retrieve IDENTITY value
INSERT INTO Identity_Test(value) VALUES(NULL);
SELECT SCOPE_IDENTITY();
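
To see the difference between @@IDENTITY and SCOPE_IDENTITY() when a trigger is involved, here is a small sketch. The table and trigger names (Demo_Orders, Demo_Audit, trg_Demo_Orders_Insert) are made up for the illustration, and the GO separators assume you run it from Management Studio or sqlcmd:

-- Hypothetical demo tables; the audit table starts its identity at 1000
CREATE TABLE Demo_Orders(id int IDENTITY(1,1), value int);
CREATE TABLE Demo_Audit(id int IDENTITY(1000,1), order_id int);
GO

-- The trigger inserts into a second table that has its own identity column
CREATE TRIGGER trg_Demo_Orders_Insert ON Demo_Orders AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
INSERT INTO Demo_Audit(order_id) SELECT id FROM inserted;
END
GO

INSERT INTO Demo_Orders(value) VALUES(42);

-- On freshly created tables, @@IDENTITY reports the audit row (1000)
-- while SCOPE_IDENTITY() reports the row we inserted ourselves (1)
SELECT @@IDENTITY AS AT_IDENTITY, SCOPE_IDENTITY() AS SCOPE_ID;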

In theory, that should always provide the last value that you inserted. However, there is a nasty bug in SQL Server affecting SCOPE_IDENTITY() results when a query plan involving parallelism is generated. This is not the case in our example because INSERT INTO … VALUES won’t generate a parallel plan but it still is a serious issue when dealing with INSERT INTO … SELECT queries.

The recommended way of retrieving identity values (and the only one that can retrieve multiple values) is to use the OUTPUT clause:

-- Create test table
IF NOT EXISTS(select 1 from INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='Identity_Test')
CREATE TABLE Identity_Test(id int IDENTITY, value int)

-- Insert two rows and retrieve identity values
DECLARE @IDs TABLE (id INT)
INSERT INTO Identity_Test(value) OUTPUT inserted.id INTO @IDs VALUES(NULL);
INSERT INTO Identity_Test(value) OUTPUT inserted.id INTO @IDs VALUES(NULL);
SELECT id FROM @IDs;
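
The same clause also works for a single statement that inserts several rows, which is the case where the parallelism issue can appear. A minimal sketch against the Identity_Test table above, with made-up values and a hypothetical @NewIDs table variable:

-- Capture every identity value generated by one multi-row insert
DECLARE @NewIDs TABLE (id INT, value INT)

INSERT INTO Identity_Test(value)
OUTPUT inserted.id, inserted.value INTO @NewIDs(id, value)
SELECT v FROM (VALUES (10),(20),(30)) AS src(v);

-- One row per inserted row, each with its generated id
SELECT id, value FROM @NewIDs;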

See Microsoft Knowledge Base Article: http://support.microsoft.com/default.aspx?scid=kb;en-US;2019779.

Heap Tables

Francois — Fri, 29 Apr 2011 00:00:12 +0000

Heap tables (tables without a clustered index) are generally not part of a good database design.

Unless you never actually query your table, you should always put a clustered index on it. Heap tables are generally slower on selects, updates and deletes. They are also generally slower on inserts if you decide to use a nonclustered index instead of a clustered index (which you shouldn’t do).

By default, Management Studio creates a clustered primary key, so heaps are sometimes created simply because no primary key has been defined on a table. A primary key is NOT the same thing as a clustered index. A primary key cannot have duplicate or null values but a clustered index certainly can.

Moreover, a heap table generally takes more space and tends to be heavily fragmented. Fragmentation also occurs on tables with a clustered index, but there you can at least rebuild the index to get rid of it.
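
As a rough sketch, fragmentation can be inspected with the sys.dm_db_index_physical_stats function and removed by rebuilding the index. The table name dbo.MY_TABLE and the index name IX_MY_TABLE_CLUSTERED are placeholders, not objects from this post:

-- Show fragmentation for every index (or the heap) of one table
SELECT index_type_desc, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MY_TABLE'), NULL, NULL, 'LIMITED');

-- Rebuild the clustered index to get rid of the fragmentation
ALTER INDEX IX_MY_TABLE_CLUSTERED ON dbo.MY_TABLE REBUILD;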

If you have a log table that almost never gets queried and you don’t care about disk space, then you could use a heap table. If you need to insert large sets of data into a table before creating a clustered index on it, you should load them into a temporary heap table for better insert performance. Otherwise, always define a clustered index on all your tables.
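
To find the tables in the current database that are still heaps, a query along these lines against the catalog views should do (a sketch; the column aliases are arbitrary):

-- List user tables that have no clustered index
SELECT SCHEMA_NAME(t.schema_id) AS TABLE_SCHEMA
, t.name AS TABLE_NAME
FROM sys.tables AS t
JOIN sys.indexes AS i ON i.object_id = t.object_id
WHERE i.type = 0 -- 0 = HEAP
ORDER BY TABLE_SCHEMA, TABLE_NAME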

Nice article on performance of heap+index versus clustered index: http://msdn.microsoft.com/en-us/library/cc917672.