all bits considered data to information to knowledge

2Sep/110

When sum of parts is more than the whole: INDEX splitting

Not all indices are created equal; some are more useful than the others.

In general, having a useless index should not hurt query performance – as long as your RDBMS query optimizer chooses the right one to use… Applying simple logic cold help you to speed up your query performance without delving into black art of execution plans and query hints.

Consider a fairly common data piece such as address, say, “100 NE Elm Street, Laurel, Nebraska, 68745

The [city], [state] and [zip] are natural candidates to be placed into fields of their own; in some data models you could see a highly normalized schema where each data element in street address would also  be assigned its own field (and sometimes would be placed into different tables, too) – [100],[NE],[Elm],[Street]   But it is far more likely that the entire character string “100 NE Elm Street” will be stuffed into a single [ADDRESS] field like this:

adress

100 NE Elm Street

100 SE Oak Street

300 NE Elm Street

300 NW Aspen Street

700 SW Elm Street

An index on such a field would be fairly useless for pattern searches because the numbers in front of the address are treated as part of the address character string, and full table scan is likely to be performed for a query like this:

SELECT * FROM [table] WHERE address LIKE ‘%ELM%’

Creating a FULL TEXT index would help in this situation, but it is a fairly expensive solution for small text fields like this. One possible alternative would be to split the address into the separate fields of more appropriate data types (e.g. INTEGER for the street number) – or add these fields to the table (yes, it will result in increased maintenance and/or data duplication – so the trade-offs must be carefully considered)

street_num

street_geo

adress

100

NE

Elm Street

100

SE

Oak Street

300

NE

Elm Street

300

NW

Aspen Street

700

SW

Elm Street

Now indices created on the columns (fields) [STREET_NUM], [STREET_GEO] and [ADDRESS] – or combinations thereof - can be efficiently utilized by query optimizer.  

Comparative tests run in Microsoft SQL Server 2005 RDBMS on a table with a half million rows in it showed approximately 10-fold performance increase for the query with LIKE predicate.

As with every index, it has to be maintained and statistics for the table must be updated regularly to keep optimal performance.

Again, TANSTAAFL principle fully applies here – the increase in speed is paid for by increased complexity and maintenance (more fields to populate and keep in sync, more indices to create and maintain, etc.)

NB: the original idea for the post was formulated by Andrew Tappert - a senior developer with Oregon Health Authority.

Comments (0) Trackbacks (0)

No comments yet.


Leave a comment

No trackbacks yet.