|
SQL optimization is a transformation of SQL statements, which
produces exactly the same result, as an original query. This process
requires a lot of creativity and intuition; it is more an Art than a Science. The result, produced by the optimized code must be exactly the same as it was before: the second worst nightmare of a DBA (the first one is losing data) is someone running to him crying 'you know, after you last optimization all customer balances became negative!' So I will focus on Potential danger on every level. For that reason all types of optimization are sorted in order of increasing intrusiveness, and I begin from the non-intrusive optimization. 0. Non-intrusive optimization It is upgrading SQL server hardware and changing of SQL server parameters. Adding more memory or processor power wont harm, you don't need to test your product again. Upgrading hardware you can spend a lot without any effect, because it is possible that your server is suffering from Locks, while you will be adding more memory - for vain. There is an easy way to determine, what (in general) is the most serious problem on the server: CPU power, IO, or Locks. To do it, use Lakeside Upgrade Advisor. You can accelerate your IO by putting LOG and DATA on different devices, separating indexes from data, but unfortunately, it is more and more difficult to apply these old good recommendations in the modern world. In many companies 'a server' is built based on the same specification no matter if it is a file server or a database server. One big RAID-5 drive is almost standard for the small servers. If you ask them for separate drives for LOG and DATA, they will give you 2 separate LUNs on the same RAID-5 drive and on the same physical channel, which is absolutely useless. For the bigger servers more and more companies use big disk arrays like SAN EMC - these arrays are very good for the OLAP systems, but are bad for the OLTP. Changing SQL server parameters also relatively safe, because you don't change the code. Unfortunately, there is no parameter 'TURBO=TRUE', so you can play with these parameters, but in most cases SQL server self-tuning is quite good, and effects of tuning can be beyond measurement. For the given hardware, applying all the recommendations you can find in the Internet, you can gain 20-30%, even less if there is nothing really wrong with you server. So, if your queries are running several times slower then expected - then you don't have other choice but change your TSQL code. Potential danger: None 1. Indexes On this level of optimization, we change the database by creating and changing indexes, but we don't touch the code. Of course, any DBA is very conservative about changing the code, especially on the production. In some cases it is possible to make some modifications which are not really considered as 'changes' by the conventional programming. At first, all you can do with indexes, can affect the execution plan, but the result is guaranteed to be the same (unless there are bugs in the SQL server, but Microsoft spent much more time on testing it then it is spent on any query you work with, so below we can think about SQL server as a 'bug-free' product, which is not true, but is almost true in comparison with the quality of our code) You can create, drop indexes, change index types (clustered/non-clustered), include or exclude columns from indexes. On SQL 2005, check a new index option 'INCLUDING'. Don't try to index everything - you will slow down updates. If SQL server does not use an index, there may be a very good reason behind that decision. Potential danger: Is it absolutely safe? For most of the databases, it is safe. SQL server will automatically adjust plans based on new indexes. However, if your database uses not only DML but also a DDL, and it is generating tables and indexes on the fly, then it is possible that some exec('drop index '+@indname) would fail. So be careful. 2. Hints and options To use a hint you should modify an existing SQL code. Hence, it is a modification, and theoretically, you need to retest all the system. But with an assumption that SQL server is almost bug-free, a query with any hint produces exactly the same result as without a hint. So check the execution plan and check the following:
and some others... But if it does not help, there is no other choice then change the code Potential danger:
'Query processor could not produce a query plan because of the hints defined in this query.'
OPTION (ROBUST PLAN) for details 3. Syntax-level optimization The most primitive type of the optimization is based only on the text of a query. We don't know even the underlying database structure, so we are limited by the subset of syntax transformations, which do not change the result in any case. For example: select 'Low',A,B from T union select 'High',A1,B1 from T1 In this case we merge both datasets together and get rid of the duplicates. However, there are no duplicates because the first column is different for 2 sub-queries. Hence we can rewrite it as: select 'Low',A,B from T union all select 'High',A1,B1 from T1 Another example. update T set Flag='Y' update T set Ref=0 In both queries we affect all the records, so we can change both columns at the same time: update T set Flag='Y',Ref=0 We can merge together updates in even more complicated cases: update T set Flag='Y' update T set Ref=0 where Comment like '%copied%' Both updates use a full table scan, so lets do both updates during the same pass: update T set Flag='Y', Ref = case when Comment like '%copied%' then 0 else Ref -- keep the same end A Trap: update T set Ref=0 update T set Amount=StoredValue It is almost the same case as we had before. So we can rewrite it as update T set Ref=0,Amount=StoredValue Right? Yes, this transformation is correct... in 99% of cases, but not always. The possible problem is that StoredValue can be a calculated column, which depends on Ref. For example, lets say it is defined as StoredValue=Ref+1 The original statement sets Ref to 0, and Amount to 1. Our optimized query works differently, as UPDATE at first calculates all right parts and only then assigns them to the columns. Therefore, it will set Ref to 0, but Amount will be set to the old value of Ref plus 1. Potential danger: As it was demonstrated above, we should test the code after that modification. All levels above are intrusive and require verification. 4. When we know database schema When we know database scheme and we have a list of indexes we can make more serious changes. For example, when we see: update T set Flag='A' where Opt=1 update T set Flag='B' where Opt=2 We can check if there is an index on the column Opt. Assuming, that an index is created on a column with a high selectivity, we can leave there statements as is. However, if there is no index on Opt, we can rewrite it as: update T set Flag=case when Opt=1 then 'A' else 'B' end where Opt in (1,2) to make an update during the same scan. Potential danger: Testing is required 5. When we have an access to the actual data and know the meaning of the data We can do it only when we have an access to the actual data. In the example above, we could determine if Opt is a column with a high selectivity. To do it we should compare the results of 2 queries: select count(*) from T select count(distinct Opt) from T What is more important, we can find values with irregular selectivity and handle them properly. Check this article to read more about irregular selectivity: http://www.sqlsolutions.com/articles/articles/How_Values_with_Irregular_Selectivity_Impact_SQL_Server_Database_Performance.htm But sometimes table and column names help, so we understand the situation even without looking at the actual data. In what case, do you think, SQL server uses an index and where it uses a full table scan? update Patients set Flag='Y' where Gender='F' update Sales set Flag='N' where Id=13848 In the first case SQL server uses a table scan, because Gender column (M/F) has extremely low selectivity (about 50%). We can even say, that Flag is a low selectivity value in both cases with values 'Y','N' and probably few others. But syntaxically, both queries are identical. Another example. Assuming that Gender is NOT NULL, does this statement update all the records? update Patients set Flag=0 where Gender in ('M','F') Yes. Now let's change the table name: update Nouns set Flag=0 where Gender in ('M','F') We update a noun, which have a grammar category of Gender, which may be 'M', 'F' or Neutral! As you see, the only difference is a table name, so we actually guessed the meaning. We can base our optimization based on many real world constraints, like
Potential danger: Requires not only testing
but also verification, that constraints are not violated now (such type
of verification is possible by querying the existing data) and in the
future (not all data may be inserted). You can find that:
5. Actual or expected workload. Some solutions have 2 sides: they accelerate data retrieval, for example, but slow down the modification. Very good example is a technique of 'indexed' or 'materialized' views. Indexed views can accelerate selects, but they slow down any updates significantly, especially massive updates. So before you make any modifications, you should know if data is relatively static or not, how man by updates you expect per one select etc. On the actual data load the best way to make such analysis is to use SQL Trace Analyzer. Conclusion You need to ask right questions what you are asked to 'look, what is wrong with this query' or 'optimize a stored procedure'. Before you start an optimization, you need to know, what level of intrusiveness is allowed and what information do you have as input. |
|
Disclaimer:
In many articles you will find the execution times of different queries. Please do not interpret these results scientifically or as official benchmarks. You will not even find the configuration of the computer used for these experiments because it is rather unimportant. What is important is that query A is running 2 times slower then query B. On your computer, it could be different: 1.5 or even 3 times slower. Read the full disclaimer » | |
|
|