Contact  |  Site Map  |  Privacy      Call for more info
 
Latest articles

How Values with Irregular Selectivity Impact SQL Server Database Performance

Values with irregular selectivity might dramatically degrade database performance. Well, the term ëvalues with irregular selectivityí may sound too scientific, but actually it is quite simple. I would even say that it is a regular thing.

Letís make our own example. We have a table with Customers:

create table Customers (
    n int identity primary key,
    SSN char(9) not null,
    Amount money not null,
    Name varchar(128)
)
GO
create index IX_SSN on Customers (SSN)
GO

Now insert 100í000 fake lines in this table for our US customers. Of course, each US customer has a unique social security number (SSN):

set nocount on
declare @n int
set @n=100000
loop:
    insert into Customers (SSN,Name,Amount)
        select convert(varchar,@n),'This is our US customer N'+
            convert(varchar,@n),@n*100
    set @n=@n-1
    if @n>0 goto loop

The problem is we also have customers abroad, without SSNs. For all customers outside the US we put value ëN/Aí into this column:

declare @n int
set @n=10000
loop:
    insert into Customers (SSN,Name,Amount)
        select 'N/A','This is our foreign customer N'+convert(varchar,@n),@n*30
    set @n=@n-1
    if @n>0 goto loop

For that reason index on SSN is not unique.
Letís check how SQL server handles such values.
NOTE: SET STATISTICS IO ON must be present to show table statistics

select min(Amount) from Customers where SSN=í345?
Table ëCustomersí. Scan count 1, logical reads 6,
physical reads 0, read-ahead reads 0.

SQL server uses index and finds a record using only 6 logical reads.

Now do the same for the ëN/Aí:

select min(Amount) from Customers where SSN=íN/Aí
Table ëCustomersí. Scan count 1, logical reads 908,
physical reads 0, read-ahead reads 0.

There is no surprise that SQL server used 908 logical reads instead of 6: it needs to read 10,000 records (10% of a table). But is SQL server right that it does not use an index but uses full scan instead? Letís force it to use index using hints:

select min(Amount) from Customers (index=IX_SSN) where SSN=íN/Aí
Table ëCustomersí. Scan count 1, logical reads 31289,
physical reads 0, read-ahead reads 0.

Now it is much worse: SQL server has to read 10000 entries in index and to find corresponding values in data. So SQL server optimizer has made a right decision. Based on the statistics, it perfectly optimized ad-hoc queries with hardcoded values. But wait a minute, usually, value is not known in advance:

create procedure SelectMinAmount
    @ssn char(9)
as
    select min(Amount) from Customers where SSN=@ssn
GO
exec SelectMinAmount ë345?
Table ëCustomersí. Scan count 1, logical reads 6,
physical reads 0, read-ahead reads 0.

This is good, but then:

exec SelectMinAmount 'N/A'
Table 'Customers'. Scan count 1, logical reads 30023,
physical reads 0, read-ahead reads 0.

SQL server has prepared a plan based on a regular value(í345?), but this plan is not good for the irregular one(íN/Aí). . It is even worse than that.

Letís create another procedure with the exactly same text:

create procedure SelectMinAmount_Copy
    @ssn char(9)
as
    select min(Amount) from Customers where SSN=@ssn
GO

But the first call will be with ëN/Aí

exec SelectMinAmount_Copy 'N/A'
Table 'Customers'. Scan count 1, logical reads 908,
physical reads 0, read-ahead reads 0.
exec SelectMinAmount_Copy ë345?
Table ëCustomersí. Scan count 1, logical reads 908,
physical reads 0, read-ahead reads 0.

The execution plan, which was prepared, uses a table scan. It is now optimal to the irregular value(íN/Aí), but for the regular values(í345?) it does not use an index at all! So, the version of a plan, which is generated for the procedure, depends on the parameters of the very first call! Just imagine potential problems: after a restart, SQL server behavior can be absolutely different.

So, how should we handle values with irregular selectivity in programs? There are 3 solutions.

Solution 1. using Ad-hoc queries and dynamic SQL:

create procedure SelectMinAmount2
    @ssn char(9)
with recompile
as
    exec('select min(Amount) from Customers '+
        'where SSN='''+@ssn+'''')
GO

This solution would require extra CPU for parsing SQL statements, it can also be dangerous because of *SQL Injection* problem, so, it would work, but definitely, it is not a solution for this case. However, using dynamic SQL can be useful in some very complicated cases, where conditions are generated dynamically and are too different.

Solution 2. WITH RECOMPLE

We use the hint WITH RECOMPLE

create procedure SelectMinAmount1
    @ssn char(9)
with recompile
as
    select min(Amount) from Customers where SSN=@ssn
GO

It forces SQL server to recompile procedure every time. Again, it requires additional CPU cycles, but irregular values are handled automatically.

Solution 3. Handling it manually

Finally, we can handle irregular values on our own:

create procedure SelectMinAmount3
    @ssn char(9)
as
    if @ssn = 'N/A'
        select min(Amount) from Customers where SSN='N/A'
    else
        select min(Amount) from Customers(index=IX_SSN) where SSN=@ssn
GO

This procedure does not require a recompilation (hence it is a fastest solution, even a code looks a little bit weird), but inside it contains both plans:

Finally, how can we say if our table contains irregular selectivity values? It is very easy. You can use the query like:

select top 10 count(*),SSN from Customers
group by SSN order by count(*) desc

just replace ëCustomersí with your table name and ëSSNí with you column name. Values with irregular selectivity will be in the beginning.

Disclaimer:

In many articles you will find the execution times of different queries. Please do not interpret these results scientifically or as official benchmarks. You will not even find the configuration of the computer used for these experiments because it is rather unimportant.

What is important is that query A is running 2 times slower then query B. On your computer, it could be different: 1.5 or even 3 times slower. Read the full disclaimer