|

Values with irregular selectivity might dramatically degrade
database performance. Well, the term ‘values with irregular
selectivity’ may sound too scientific, but actually it is quite simple.
I would even say that it is a regular thing.
Let’s make our own example. We have a table with Customers:
create table Customers (
n int identity primary key,
SSN char(9) not null,
Amount money not null,
Name varchar(128)
)
GO
create index IX_SSN on Customers (SSN)
GO
Now insert 100’000 fake lines in this table for our US customers. Of
course, each US customer has a unique social security number (SSN):
set nocount on
declare @n int
set @n=100000
loop:
insert into Customers (SSN,Name,Amount)
select convert(varchar,@n),'This is our US customer N'+
convert(varchar,@n),@n*100
set @n=@n-1
if @n>0 goto loop
The problem is we also have customers abroad, without SSNs. For all
customers outside the US we put value ‘N/A’ into this column:
declare @n int
set @n=10000
loop:
insert into Customers (SSN,Name,Amount)
select 'N/A','This is our foreign customer N'+convert(varchar,@n),@n*30
set @n=@n-1
if @n>0 goto loop
For that reason index on SSN is not unique.
Let’s check how SQL server handles such values.
NOTE: SET STATISTICS IO ON must be present to show table statistics
select min(Amount) from Customers where SSN=’345?
Table ‘Customers’. Scan count 1, logical reads 6,
physical reads 0, read-ahead reads 0.

SQL server uses index and finds a record using only 6 logical reads.
Now do the same for the ‘N/A’:
select min(Amount) from Customers where SSN=’N/A’
Table ‘Customers’. Scan count 1, logical reads 908,
physical reads 0, read-ahead reads 0.

There is no surprise that SQL server used 908 logical reads instead
of 6: it needs to read 10,000 records (10% of a table). But is SQL
server right that it does not use an index but uses full scan instead?
Let’s force it to use index using hints:
select min(Amount) from Customers (index=IX_SSN) where SSN=’N/A’
Table ‘Customers’. Scan count 1, logical reads 31289,
physical reads 0, read-ahead reads 0.
Now it is much worse: SQL server has to read 10000 entries in index
and to find corresponding values in data. So SQL server optimizer has
made a right decision. Based on the statistics, it perfectly optimized
ad-hoc queries with hardcoded values. But wait a minute, usually, value
is not known in advance:
create procedure SelectMinAmount
@ssn char(9)
as
select min(Amount) from Customers where SSN=@ssn
GO
exec SelectMinAmount ‘345?
Table ‘Customers’. Scan count 1, logical reads 6,
physical reads 0, read-ahead reads 0.
This is good, but then:
exec SelectMinAmount 'N/A'
Table 'Customers'. Scan count 1, logical reads 30023,
physical reads 0, read-ahead reads 0.
SQL server has prepared a plan based on a regular value(’345?), but
this plan is not good for the irregular one(’N/A’). . It is even worse
than that.
Let’s create another procedure with the exactly same text:
create procedure SelectMinAmount_Copy
@ssn char(9)
as
select min(Amount) from Customers where SSN=@ssn
GO
But the first call will be with ‘N/A’
exec SelectMinAmount_Copy 'N/A'
Table 'Customers'. Scan count 1, logical reads 908,
physical reads 0, read-ahead reads 0.
exec SelectMinAmount_Copy ‘345?
Table ‘Customers’. Scan count 1, logical reads 908,
physical reads 0, read-ahead reads 0.
The execution plan, which was prepared, uses a table scan. It is now
optimal to the irregular value(’N/A’), but for the regular
values(’345?) it does not use an index at all! So, the version of a
plan, which is generated for the procedure, depends on the parameters
of the very first call! Just imagine potential problems: after a
restart, SQL server behavior can be absolutely different.
So, how should we handle values with irregular selectivity in programs? There are 3 solutions.
Solution 1. using Ad-hoc queries and dynamic SQL:
create procedure SelectMinAmount2
@ssn char(9)
with recompile
as
exec('select min(Amount) from Customers '+
'where SSN='''+@ssn+'''')
GO
This solution would require extra CPU for parsing SQL statements, it
can also be dangerous because of *SQL Injection* problem, so, it would
work, but definitely, it is not a solution for this case. However,
using dynamic SQL can be useful in some very complicated cases, where
conditions are generated dynamically and are too different.
Solution 2. WITH RECOMPLE
We use the hint WITH RECOMPLE
create procedure SelectMinAmount1
@ssn char(9)
with recompile
as
select min(Amount) from Customers where SSN=@ssn
GO
It forces SQL server to recompile procedure every time. Again, it
requires additional CPU cycles, but irregular values are handled
automatically.
Solution 3. Handling it manually
Finally, we can handle irregular values on our own:
create procedure SelectMinAmount3
@ssn char(9)
as
if @ssn = 'N/A'
select min(Amount) from Customers where SSN='N/A'
else
select min(Amount) from Customers(index=IX_SSN) where SSN=@ssn
GO
This procedure does not require a recompilation (hence it is a
fastest solution, even a code looks a little bit weird), but inside it
contains both plans:

Finally, how can we say if our table contains irregular selectivity values? It is very easy. You can use the query like:
select top 10 count(*),SSN from Customers
group by SSN order by count(*) desc
just replace ‘Customers’ with your table name and ‘SSN’ with you
column name. Values with irregular selectivity will be in the
beginning.
|