When should I use a composite index?

84.4k views Asked by At
  1. When should I use a composite index in a database?
  2. What are the performance ramification by using a composite index)?
  3. Why should I use use a composite index?

For example, I have a homes table:

CREATE TABLE IF NOT EXISTS `homes` (
  `home_id` int(10) unsigned NOT NULL auto_increment,
  `sqft` smallint(5) unsigned NOT NULL,
  `year_built` smallint(5) unsigned NOT NULL,
  `geolat` decimal(10,6) default NULL,
  `geolng` decimal(10,6) default NULL,
  PRIMARY KEY  (`home_id`),
  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),
) ENGINE=InnoDB  ;

Does it make sense for me to use a composite index for both geolat and geolng, such that:

I replace:

  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),

with:

KEY `geolat_geolng` (`geolat`, `geolng`)

If so:

  • Why?
  • What is the performance ramification by using a composite index)?

UPDATE:

Since many people have stated it entirely dependent upon the queries I perform, below is the most common query performed:

SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???

UPDATE 2:

With the following database schema:

CREATE TABLE IF NOT EXISTS `homes` (
  `home_id` int(10) unsigned NOT NULL auto_increment,
  `primary_photo_group_id` int(10) unsigned NOT NULL default '0',
  `customer_id` bigint(20) unsigned NOT NULL,
  `account_type_id` int(11) NOT NULL,
  `address` varchar(128) collate utf8_unicode_ci NOT NULL,
  `city` varchar(64) collate utf8_unicode_ci NOT NULL,
  `state` varchar(2) collate utf8_unicode_ci NOT NULL,
  `zip` mediumint(8) unsigned NOT NULL,
  `price` mediumint(8) unsigned NOT NULL,
  `sqft` smallint(5) unsigned NOT NULL,
  `year_built` smallint(5) unsigned NOT NULL,
  `num_of_beds` tinyint(3) unsigned NOT NULL,
  `num_of_baths` decimal(3,1) unsigned NOT NULL,
  `num_of_floors` tinyint(3) unsigned NOT NULL,
  `description` text collate utf8_unicode_ci,
  `geolat` decimal(10,6) default NULL,
  `geolng` decimal(10,6) default NULL,
  `display_status` tinyint(1) NOT NULL,
  `date_listed` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `contact_email` varchar(100) collate utf8_unicode_ci NOT NULL,
  `contact_phone_number` varchar(15) collate utf8_unicode_ci NOT NULL,
  PRIMARY KEY  (`home_id`),
  KEY `customer_id` (`customer_id`),
  KEY `city` (`city`),
  KEY `num_of_beds` (`num_of_beds`),
  KEY `num_of_baths` (`num_of_baths`),
  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),
  KEY `account_type_id` (`account_type_id`),
  KEY `display_status` (`display_status`),
  KEY `sqft` (`sqft`),
  KEY `price` (`price`),
  KEY `primary_photo_group_id` (`primary_photo_group_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=8 ;

Using the following SQL:

EXPLAIN SELECT  homes.home_id,
                    address,
                    city,
                    state,
                    zip,
                    price,
                    sqft,
                    year_built,
                    account_type_id,
                    num_of_beds,
                    num_of_baths,
                    geolat,
                    geolng,
                    photo_id,
                    photo_url_dir
            FROM homes
            LEFT OUTER JOIN home_photos ON homes.home_id = home_photos.home_id
                AND homes.primary_photo_group_id = home_photos.home_photo_group_id
                AND home_photos.home_photo_type_id = 2
            WHERE homes.display_status = true
            AND homes.geolat BETWEEN -100 AND 100
            AND homes.geolng BETWEEN -100 AND 100

EXPLAIN returns:

id  select_type  table        type  possible_keys                                    key                  key_len  ref     rows  Extra
----------------------------------------------------------------------------------------------------------
1   SIMPLE       homes        ref   geolat,geolng,display_status                     display_status       1        const   2     Using where
1  SIMPLE        home_photos  ref   home_id,home_photo_type_id,home_photo_group_id   home_photo_group_id  4        homes.primary_photo_group_id   4  

I don't quite understand how to read the EXPLAIN command. Does this look good or bad. Right now, I am NOT using a composite index for geolat and geolng. Should I be?

10

There are 10 answers

4
Alexander On

Composite index can be useful when you want to optimise group by clause (check this article http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html). Please pay attention:

The most important preconditions for using indexes for GROUP BY are that all GROUP BY columns reference attributes from the same index, and that the index stores its keys in order (for example, this is a BTREE index and not a HASH index)

8
Mitch Wheat On

There is no Black and White, one size fits all answer.

You should use a composite (or multi-column) index, when your query work load would benefit from one.

You need to profile your query work load in order to determine this.

A composite index comes into play when queries can be satisfied entirely from that index: meaning all the columns required by the query are in (covered) by an index.

UPDATE (in response to edit to posted question): If you are selecting * from the table the composite index may be used, it may not. You will need to run EXPLAIN PLAN to be sure.

12
Emre Yazici On

Imagine you have the following three queries:

Query I:

SELECT * FROM homes WHERE `geolat`=42.9 AND `geolng`=36.4

Query II:

SELECT * FROM homes WHERE `geolat`=42.9

Query III:

SELECT * FROM homes WHERE `geolng`=36.4

If you have seperate index per column, all three queries use indexes. In MySQL, if you have composite index (geolat, geolng), only query I and query II (which is using the first part of the composit index) uses indexes. In this case, query III requires full table search.

On Multiple-Column Indexes section of manual, it is clearly explained how multiple column indexes work, so I don't want to retype manual.

From the MySQL Reference Manual page:

A multiple-column index can be considered a sorted array containing values that are created by concatenating the values of the indexed columns.

If you use seperated index for geolat and geolng columns, you have two different index in your table which you can search independent.

INDEX geolat
-----------
VALUE RRN
36.4  1
36.4  8
36.6  2
37.8  3
37.8  12
41.4  4

INDEX geolng
-----------
VALUE RRN
26.1  1
26.1  8
29.6  2
29.6  3
30.1  12
34.7  4

If you use composite index you have only one index for both columns:

INDEX (geolat, geolng)
-----------
VALUE      RRN
36.4,26.1  1
36.4,26.1  8
36.6,29.6  2
37.8,29.6  3
37.8,30.1  12
41.4,34.7  4

RRN is relative record number (to simplify, you can say ID). The first two index generated seperate and the third index is composite. As you can see you can search based on geolng on composite one since it is indexed by geolat, however it's possible to search by geolat or "geolat AND geolng" (since geolng is second level index).

Also, have a look at How MySQL Uses Indexes manual section.

1
Jim Ferrans On

I'm with @Mitch, depends entirely your queries. Fortunately you can create and drop indexes at any time, and you can prepend the EXPLAIN keyword to your queries to see if the query analyzer uses the indexes.

If you'll be looking up an exact lat/long pair this index would likely make sense. But you're probably going to be looking for homes within a certain distance of a particular place, so your queries will look something like this (see source):

select *, sqrt(  pow(h2.geolat - h1.geolat,  2) 
               + pow(h2.geolng - h1.geolng, 2) ) as distance
from homes h1, homes h2
where h1.home_id = 12345 and h2.home_id != h1.home_id
order by distance

and the index very likely won't be helpful at all. For geospatial queries, you need something like this.

Update: with this query:

SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???

The query analyzer could use an index on geolat alone, or an index on geolng alone, or possibly both indexes. I don't think it would use a composite index. But it's easy to try out each of these permutations on a real data set and then (a) see what EXPLAIN tells you and (b) measure the time the query really takes.

9
Mark Canlas On

You should use a composite index when you are using queries that benefit from it. A composite index that looks like this:

index( column_A, column_B, column_C )

will benefit a query that uses those fields for joining, filtering, and sometimes selecting. It will also benefit queries that use left-most subsets of columns in that composite. So the above index will also satisfy queries that need

index( column_A, column_B, column_C )
index( column_A, column_B )
index( column_A )

But it will not (at least not directly, maybe it can help partially if there are no better indices) help for queries that need

index( column_A, column_C )

Notice how column_B is missing.

In your original example, a composite index for two dimensions will mostly benefit queries that query on both dimensions or the leftmost dimension by itself, but not the rightmost dimension by itself. If you're always querying two dimensions, a composite index is the way to go, doesn't really matter which is first (most probably).

0
Will On

To do spacial searches, you need an R-Tree algorithm, which allows searching geographical areas very quickly. Exactly what you need for this job.

Some databases have spacial indexes built in. A quick Google search shows MySQL 5 has them (which looking at your SQL I'm guessing you're using MySQL).

0
Question Overflow On

There could be a misconception about what composite index does. Many people think that composite index can be used to optimise a search query as long as the where clause covers the indexed columns, in your case geolat and geolng. Let's delve deeper:

I believe your data on the coordinates of homes would be random decimals as such:

home_id  geolat  geolng
   1    20.1243  50.4521
   2    22.6456  51.1564
   3    13.5464  45.4562
   4    55.5642 166.5756
   5    24.2624  27.4564
   6    62.1564  24.2542
...

Since geolat and geolng values hardly repeat itself. A composite index on geolat and geolng would look something like this:

index_id  geolat  geolng
   1     20.1243  50.4521
   2     20.1244  61.1564
   3     20.1251  55.4562
   4     20.1293  66.5756
   5     20.1302  57.4564
   6     20.1311  54.2542
...

Therefore the second column of the composite index is basically useless! The speed of your query with a composite index is probably going to be similar to an index on just the geolat column.

As mentioned by Will, MySQL provides spatial extension support. A spatial point is stored in a single column instead of two separate lat lng columns. Spatial index can be applied to such a column. However, the efficiency could be overrated based on my personal experience. It could be that spatial index does not resolve the two dimensional problem but merely speed up the search using R-Trees with quadratic splitting.

The trade-off is that a spatial point consumes much more memory as it used eight-byte double-precision numbers for storing coordinates. Correct me if I am wrong.

0
Rick James On

Composite indexes are useful for

  • 0 or more "=" clauses, plus
  • at most one range clause.

A composite index cannot handle two ranges. I discuss this further in my index cookbook.

Find nearest -- If the question is really about optimizing

WHERE geolat BETWEEN ??? AND ???
  AND geolng BETWEEN ??? AND ???

then no index can really handle both dimensions.

Instead, one must 'think out of the box'. If one dimension is implemented via partitioning and the other is implemented by carefully picking the PRIMARY KEY, one can get significantly better efficiency for very large tables of lat/lng lookup. My latlng blog goes into the details of how to implement "find nearest" on the globe. It includes code.

The PARTITIONs are stripes of latitude ranges. The PRIMARY KEY deliberately starts with longitude so that the useful rows are likely to be in the same block. A Stored Routine orchestrates the messy code for doing order by... limit... and for growing the 'square' around the target until you have enough coffee shops (or whatever). It also takes care of the great-circle calculations and handling the dateline and poles.

More

I have written another blog; it compares 5 ways of doing lat/lng searches: http://mysql.rjweb.org/doc.php/latlng#representation_choices (It references the link given above as one of the 5.) One of the other ways is this, and it points out that they are optimal for the particular case:

INDEX(geolat, geolng),
INDEX(geolng, geolat)

That is, having both columns in two indexes, and not having single-column indexes on geolat and geolng is important.

0
TheGeeky On

Using a composite index in a database involves creating an index on multiple columns. The decision to use a composite index depends on various factors, including the types of queries your application performs, and there are implications for both performance and storage.

Reasons to Use a Composite Index:

1- Filtering and Sorting: If your queries involve conditions on both geolat and geolng, using a composite index (geolat_geolng) can improve query performance.

2- Covering Queries: If your queries can be satisfied using only the composite index without the need to access the actual table (covering index), it can reduce the number of disk I/O operations and improve performance.

3- Reduced Storage Overhead: While each index incurs some storage overhead, using a composite index on multiple columns may be more efficient than maintaining separate indexes on each individual column.

Performance Ramifications:

1- Query Performance: Composite indexes are effective for queries that involve conditions on both columns in the index. However, they might be less effective for queries that involve conditions on only one of the columns.

2- Insert and Update Overhead: While read performance can benefit from composite indexes, there might be a slight overhead for insert, update, and delete operations since the index structure needs to be maintained.

3- Index Size: Composite indexes generally have larger sizes compared to single-column indexes, as they store information about multiple columns. This can impact memory usage and disk space.

Decision Factors:

1- Query Patterns: Analyze the types of queries your application performs. If most queries involve conditions on both geolat and geolng, a composite index can be beneficial.

2- Covering Index Potential: Evaluate if the composite index can serve as a covering index for your queries, potentially reducing the need to access the underlying table.

3- Storage and Memory Considerations: Consider the impact on storage and memory, especially if the table is large and resources are a concern.

Example Modification:

-- Original
KEY `geolat` (`geolat`),
KEY `geolng` (`geolng`),

-- Modified with Composite Index
KEY `geolat_geolng` (`geolat`, `geolng`)

Conclusion: In the given example, if your queries often involve conditions on both geolat and geolng, creating a composite index (geolat_geolng) could be beneficial for improving query performance. However, it's important to analyze the specific query patterns and use cases in your application to make an informed decision. Monitoring and testing the impact of index changes on actual query performance are recommended.

1
ProfileTwist On

Composite indexes are very powerful as they:

  • Enforce structure integrity
  • Enable sorting on a FILTERED id

ENFORCE STRUCTURE INTEGRITY

Composite indexes are not just another type of index; they can provide NECESSARY structure to a table by enforcing integrity as the Primary Key.

Mysql's Innodb supports clustering and the following example illustrates why a composite index may be necessary.

To create a friends' tables (i.e. for a social network) we need 2 columns: user_id, friend_id.

Table Strcture

user_id (medium_int)
friend_id (medium_int)

Primary Key -> (user_id, friend_id)

By virtue, a Primary Key (PK) is unique and by creating a composite PK, Innodb will automatically check that no duplicates on user_id, friend_id exists when a new record is added. This is the expected behavior as no user should have more than 1 record (relationship link) with friend_id = 2 for instance.

Without a composite PK, we can create this schema using a surrogate key:

user_friend_id
user_id
friend_id

Primary Key -> (user_friend_id)

Now, whenever a new record is added we will have to check that a prior record with the combination user_id, friend_id does not already exist.

As such, a composite index can enforce structure integrity.

ENABLE SORTING ON A FILTERED ID

It is very common to sort a set of records by the post's time (timestamp or datetime). Usually, this means posting on a given id. Here is an example

Table User_Wall_Posts (think if Facebook's wall posts)

user_id (medium_int)
timestamp (timestamp)
author_id (medium_int)
comment_post (text)

Primary Key -> (user_id, timestamp, author_id)

We want to query and find all posts for user_id = 10 and sort the comment posts by timestamp (date).

SQL QUERY

SELECT * FROM User_Wall_Posts WHERE user_id = 10 ORDER BY timestamp DES

The composite PK enables Mysql to filter and sort the results using the index; Mysql will not have to use a temporary file or filesort to fetch the results. Without a composite key, this would not be possible and would cause a very inefficient query.

As such, composite keys are very powerful and suit more than the simple problem of "I want to search for column_a, column_b so I will use composite keys. For my current database schema, I have just as many composite keys as single keys. Don't overlook a composite key's use!