
SQL Optimization

SQL optimization is the transformation of SQL statements into a form that produces exactly the same result as the original query, but runs faster. This process requires a lot of creativity and intuition; it is more an art than a science.

It is impossible to teach in a small article, so I won't even try. Instead, I will limit myself to classifying the types of optimization, so that when you are asked to optimize a database, you know what to ask and what to expect.

The result produced by the optimized code must be exactly the same as it was before: the second worst nightmare of a DBA (the first one is losing data) is someone running to him crying, 'You know, after your last optimization all customer balances became negative!' So I will describe the potential danger at every level.

For that reason, all the types of optimization below are sorted in order of increasing intrusiveness, and I begin with non-intrusive optimization.

0. Non-intrusive optimization

This means upgrading the SQL Server hardware and changing SQL Server parameters. Adding more memory or processor power won't do any harm, and you don't need to test your product again.

When upgrading hardware, you can spend a lot without any effect: if your server is actually suffering from locking, all the memory you add is in vain. There is an easy way to determine what, in general, is the most serious problem on the server: CPU power, IO, or locks. To do it, use Lakeside Upgrade Advisor.
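If you want to check this yourself, SQL Server 2005 and later expose cumulative wait statistics through a dynamic management view. A minimal sketch (the short list of 'idle' wait types filtered out below is only illustrative):

-- Rough wait breakdown since the last restart (SQL Server 2005+)
SELECT TOP 10
    wait_type,
    wait_time_ms,
    waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'WAITFOR')
ORDER BY wait_time_ms DESC
-- PAGEIOLATCH_* near the top suggests an IO bottleneck, LCK_M_* suggests
-- locking, and SOS_SCHEDULER_YIELD suggests CPU pressure.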

You can accelerate your IO by putting LOG and DATA on different devices and by separating indexes from data, but unfortunately, it is more and more difficult to apply these good old recommendations in the modern world. In many companies 'a server' is built to the same specification no matter whether it is a file server or a database server. One big RAID-5 drive is almost standard for small servers. If you ask for separate drives for LOG and DATA, you will get 2 separate LUNs on the same RAID-5 drive and on the same physical channel, which is absolutely useless.
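For what it is worth, here is how the classic recommendation looks in T-SQL, assuming D: and E: really are separate physical drives (the database name and paths are hypothetical):

CREATE DATABASE Sales
ON (NAME = Sales_data, FILENAME = 'D:\Data\Sales.mdf')    -- data on one drive
LOG ON (NAME = Sales_log, FILENAME = 'E:\Log\Sales.ldf')  -- log on another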

For bigger servers, more and more companies use big disk arrays such as EMC SANs. These arrays are very good for OLAP systems, but bad for OLTP.

Changing SQL Server parameters is also relatively safe, because you don't change the code. Unfortunately, there is no 'TURBO=TRUE' parameter, so you can play with these parameters, but in most cases SQL Server's self-tuning is quite good, and the effect of manual tuning can be too small to measure.
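For example, server-level parameters are inspected and changed with sp_configure. A sketch (the value 4 for 'max degree of parallelism' is purely illustrative):

EXEC sp_configure 'show advanced options', 1  -- expose advanced parameters
RECONFIGURE
EXEC sp_configure                             -- list current values
EXEC sp_configure 'max degree of parallelism', 4
RECONFIGURE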

On given hardware, applying all the recommendations you can find on the Internet may gain you 20-30%, or even less if nothing is really wrong with your server. So if your queries are running several times slower than expected, you have no choice but to change your T-SQL code.

Potential danger: None

1. Indexes

At this level of optimization, we change the database by creating and altering indexes, but we don't touch the code.

Of course, any DBA is very conservative about changing code, especially in production. Index modifications, however, often do not count as 'changes' in the conventional programming sense.

First of all, anything you do with indexes can affect the execution plan, but the result is guaranteed to be the same (unless there are bugs in SQL Server; Microsoft has spent much more time testing it than will ever be spent on any query you work with, so below we treat SQL Server as a 'bug-free' product, which is not true, but is almost true in comparison with the quality of our code).

You can create and drop indexes, change index types (clustered/non-clustered), and include or exclude columns from indexes. On SQL Server 2005, check the new index option INCLUDE. Don't try to index everything - you will slow down updates. If SQL Server does not use an index, there may be a very good reason behind that decision.
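A minimal sketch of a covering index with INCLUDE; the table and column names are hypothetical:

CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
ON Orders (CustomerId)        -- key column used for seeks
INCLUDE (OrderDate, Amount)   -- non-key columns stored at the leaf level
-- Now SELECT OrderDate, Amount FROM Orders WHERE CustomerId = 42
-- can be answered from the index alone, with no key lookups.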

Potential danger: Is it absolutely safe? For most databases, it is: SQL Server automatically adjusts execution plans when indexes change. However, if your database uses not only DML but also DDL, generating tables and indexes on the fly, then it is possible that some statement like

exec('drop index '+@indname)

might fail after the index it refers to has been renamed or dropped. So be careful.

2. Hints and options

To use a hint, you have to modify existing SQL code. Hence it is a modification, and theoretically you need to retest the whole system. But under the assumption that SQL Server is almost bug-free, a query with any hint produces exactly the same result as without it.

So check the execution plan and try the following (a sketch of the syntax appears after the list):

  • (INDEX = ...) - but be careful: usually, when SQL Server ignores an index, there is a very good reason why.
  • OPTION (HASH JOIN) - try the other join types as well.
  • OPTION (FORCE ORDER)
  • OPTION (FAST 1) - these three options are good candidates to try even without much analysis; sometimes one of them helps.
  • (NOEXPAND) for indexed views
  • WITH RECOMPILE for a stored procedure

       and some others...
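To make the syntax concrete, here is a sketch combining a table-level index hint with a query-level option (all the names are hypothetical):

SELECT o.OrderId, c.Name
FROM Orders AS o WITH (INDEX (IX_Orders_CustomerId))  -- force this index
JOIN Customers AS c ON c.CustomerId = o.CustomerId
OPTION (HASH JOIN)  -- force hash joins for every join in the query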

But if that does not help, there is no choice left but to change the code.

Potential danger:

  • Some hints can have a catastrophic impact on performance.
  • Some options can result in an error like 'Query processor could not produce a query plan because of the hints defined in this query.'
  • Some queries can fail on particular data sets; see OPTION (ROBUST PLAN) for details.

3. Syntax-level optimization

The most primitive type of optimization is based only on the text of a query. We don't even know the underlying database structure, so we are limited to the subset of syntax transformations that cannot change the result in any case. For example:

select 'Low',A,B from T
union
select 'High',A1,B1 from T1

UNION merges both result sets together and removes the duplicates. However, there can be no duplicates here, because the first column differs between the two sub-queries. Hence we can rewrite it as:

select 'Low',A,B from T
union all
select 'High',A1,B1 from T1

Another example.

update T set Flag='Y'
update T set Ref=0

Both statements affect all the records, so we can change both columns at the same time:

update T set Flag='Y',Ref=0

We can merge updates even in more complicated cases:

update T set Flag='Y'
update T set Ref=0 where Comment like '%copied%'

Both updates require a full table scan, so let's do both during the same pass:

update T set Flag='Y',
Ref = case when Comment like '%copied%'
then 0
else Ref -- keep the same
end

A trap:
This type of optimization can be dangerous. For example:

update T set Ref=0
update T set Amount=StoredValue

This is almost the same case as before, so we could rewrite it as:

update T set Ref=0,Amount=StoredValue

Right? Yes, this transformation is correct... in 99% of cases, but not always. The possible problem is that StoredValue can be a computed column that depends on Ref. For example, let's say it is defined as StoredValue = Ref + 1.

The original statements set Ref to 0 and Amount to 1. Our optimized query works differently: an UPDATE first evaluates all the right-hand sides and only then assigns them to the columns. Therefore, it will set Ref to 0, but Amount will be set to the old value of Ref plus 1.
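Here is a small script reproducing the trap, under the assumption that StoredValue is a computed column defined as Ref + 1:

CREATE TABLE T (Ref int, StoredValue AS Ref + 1, Amount int)
INSERT INTO T (Ref, Amount) VALUES (5, 0)
-- Two separate statements would set Ref to 0, then Amount = StoredValue = 1.
-- The merged statement evaluates the right-hand sides first, so Amount
-- gets the OLD StoredValue = 5 + 1 = 6 instead of 1:
UPDATE T SET Ref = 0, Amount = StoredValue
SELECT Ref, StoredValue, Amount FROM T  -- returns 0, 1, 6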

Potential danger: As demonstrated above, we must test the code after such a modification. This level and all the levels that follow are intrusive and require verification.

4. When we know the database schema

When we know the database schema and have a list of indexes, we can make more serious changes. For example, when we see:

update T set Flag='A' where Opt=1
update T set Flag='B' where Opt=2

We can check whether there is an index on the column Opt. Assuming that an index is created on a column with high selectivity, we can leave these statements as they are. However, if there is no index on Opt, we can rewrite them as:

update T
set Flag=case when Opt=1 then 'A' else 'B' end
where Opt in (1,2)

to perform both updates during the same scan.

Potential danger: Testing is required.

5. When we have access to the actual data and know its meaning

We can do this only when we have access to the actual data. In the example above, we could determine whether Opt is a column with high selectivity. To do it, we compare the results of 2 queries:

select count(*) from T
select count(distinct Opt) from T

If the second number is close to the first, Opt has high selectivity. What is more important, we can find values with irregular selectivity and handle them properly. Read more about irregular selectivity in this article: http://www.sqlsolutions.com/articles/articles/How_Values_with_Irregular_Selectivity_Impact_SQL_Server_Database_Performance.htm

But sometimes the table and column names alone help us understand the situation, even without looking at the actual data. In which of the following cases, do you think, does SQL Server use an index, and in which a full table scan?

update Patients set Flag='Y' where Gender='F'
update Sales set Flag='N' where Id=13848

In the first case, SQL Server uses a table scan, because the Gender column (M/F) has extremely low selectivity: each value matches about 50% of the rows. We can even say that Flag is a low-selectivity column in both cases, with the values 'Y', 'N' and probably a few others. But syntactically, both queries are identical.

Another example. Assuming that Gender is NOT NULL, does this statement update all the records?

update Patients set Flag=0 where Gender in ('M','F')

Yes. Now let's change the table name:

update Nouns set Flag=0 where Gender in ('M','F')

Here we update nouns, which have the grammatical category of gender, and that gender may be 'M', 'F' or neuter! As you can see, the only difference is the table name, so we have actually guessed the meaning of the data.

We can base our optimization on many real-world constraints, like:

  • age is less than 255 and non-negative,
  • mass is positive,
  • InternalId (an int identity) correlates with the date of the document,
  • SSN is unique.

Potential danger: Requires not only testing but also verification that the constraints hold now (this can be checked by querying the existing data; a sketch follows the list below) and will keep holding in the future (this cannot be checked, because the violating data may simply not have been inserted yet). You may find that:

  • a special value of -1 is used for age as some weird flag,
  • mass can be negative, because sometimes it is the delta between the actual and the expected value,
  • InternalId correlates with the date of the document, except for the documents imported from another database a year ago,
  • SSN is unique, except for the value 'N/A', used for all illegal immigrant workers.
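A sketch of such a verification against the existing data; Patients comes from the examples above, while Employees, Age and SSN are hypothetical names:

SELECT COUNT(*) FROM Patients
WHERE Age < 0 OR Age >= 255   -- expect 0 if the age assumption holds

SELECT SSN, COUNT(*) AS Cnt
FROM Employees
GROUP BY SSN
HAVING COUNT(*) > 1           -- expect no rows if SSN is really unique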

6. Actual or expected workload

Some solutions cut both ways: they accelerate data retrieval, for example, but slow down modifications. A very good example is the technique of 'indexed' or 'materialized' views. Indexed views can accelerate selects, but they slow down any updates significantly, especially massive updates.
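A minimal sketch of an indexed view (the names are hypothetical; Amount is assumed NOT NULL, which SUM in an indexed view requires):

CREATE VIEW dbo.SalesByCustomer
WITH SCHEMABINDING                   -- required before the view can be indexed
AS
SELECT CustomerId,
       COUNT_BIG(*) AS OrderCount,   -- COUNT_BIG is required with GROUP BY
       SUM(Amount)  AS TotalAmount
FROM dbo.Orders
GROUP BY CustomerId
GO
-- The unique clustered index materializes the view; from now on every
-- insert, update or delete on dbo.Orders must also maintain this index.
CREATE UNIQUE CLUSTERED INDEX IX_SalesByCustomer
ON dbo.SalesByCustomer (CustomerId)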

So before you make any modifications, you should know whether the data is relatively static or not, how many updates you expect per select, and so on.

On an actual workload, the best way to perform such an analysis is to use SQL Trace Analyzer.

Conclusion

You need to ask the right questions when you are asked to 'look at what is wrong with this query' or to 'optimize a stored procedure'. Before you start an optimization, you need to know what level of intrusiveness is allowed and what information you have as input.

Disclaimer:

In many articles you will find the execution times of different queries. Please do not interpret these results scientifically or as official benchmarks. You will not even find the configuration of the computer used for these experiments because it is rather unimportant.

What is important is that query A runs 2 times slower than query B. On your computer, it could be different: 1.5 or even 3 times slower.