zen of coding

Speeding things up with materialized views and MySQL

First let’s see what Wikipedia says about materialized views:

A materialized view takes a different approach [from the regular sql view] in which the query result is cached as a concrete table that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of some data being potentially out-of-date. It is most useful in data warehousing scenarios, where frequent queries of the actual base tables can be extremely expensive.

This feature exists in some databases, but we do need a work-around for MySQL, still let’s see if it’s even worthwhile to consider.

All that being said (well quoted), there are a couple of ways to look at this situation…

1. You will have speed of access, but the data might be out of date
2. You perform a lot of queries for the same data, which does not change very often

I prefer to be more optimistic and look at it from the second point of view.

Let’s consider the following situation:

You are looking for a listing of the best Irish pubs in your area.
The greatest solution would be to query best_irish_pubs table with $this->BestIrishPub->find(‘all’); and be done with it…

While that would be an excellent solution, unfortunately our database is likely to be highly normalized and because of that a simple query is not going to cut it.
To follow along with the example, we can imagine that the following information might be required:

Pub type = Irish
Pool table = Yes/No
Bartender first name
Bartender last name
How many bartenders on shift from 4-8?
Out of the above how many are female?
Hours of operation?
Pub latitude
Pub longitude
Distance units = miles or kilometers
Pub owner’s name

The point of all this is that the data is likely to be scattered all over the database and can potentially require some complex calculations on the fly (using longitude and latitude to determine distance); attempting to squeeze all of these requirements into a single query is not impossible, but highly inefficient for the life of your database, web application… as well as your own.

To make the long story short, it becomes increasingly difficult to optimize a single complex query to squeeze out a few seconds of improved performance. If you’ve got some complex JOIN’s, an ORDER BY and other well known performance grinders, it is not unreasonable to see a single query take as much as… let’s say 30 seconds.
Even if you manage to make it 50% faster, it is still completely unacceptable by any web standards. You’d have to struggle to make it thousands times faster (literally) just to get to a reasonable speed.

1. What about query caching?
In many cases this will be your silver bullet, although in some cases it won’t be possible to implement. If our web app is going to show the best Irish pub in any city in the world with a pile of different conditions (options) not to mention any required calculations.., you can imagine that caching a specific query is going to be quite impossible.

2. What if we had the desired data in a single table?
That would certainly speed things up, but you don’t have to sacrifice your database design and ignore normalization in order to increase performance. Your data should be stored in such a way that your tables aren’t growing horizontally. For example, it is easy to be tempted to add a new field to some table rather than normalizing and separating the data into a few smaller and more manageable tables.
However, just as you like to organize your code using MVC to keep things from getting out of control ;) the database should be fundamentally well designed. One does not want to build on top of a crippled foundation.

3. Materialized views to the “rescue”…
The goal can be met half-way… Unfortunately MySQL does not have a materialized view capability. (One implementation I’ve found is here, but without trying it I am just going to proceed with a more manual approach).
Good news is that we know what sort of results are going to be required, therefore we can take our complicated query and make a regular view out of it. (It is quite simple to create a view once the SQL query is known, a little google’ing will have you up and running in a few minutes).
Hold on a second, wouldn’t that be the same exact query and now we throw in an underlying complexity of a MySQL view?! Shouldn’t things become even slower now?
Indeed the application would become even slower… if we just left everything as is.

4. Tables and cron to the rescue!
Here’s what’s going to happen next…
We will query our view and create a table that will hold all the relevant data (exactly like the view does). With one major difference: once the query is completed, the data in the table stays as is, regardless of what is going on in the rest of your database. Think of it as cached data ready to be queried, worked on calculated, etc. further.
The query can still take 30 seconds to execute but in our case an update to a pub doesn’t happen very often. If we create a cron, which updates our “cache table”, and runs every 3-4 hours it could very well work for this “pub scenario”. For some application the update could happen weekly for others, every few minutes. Bottom line is that there is always a little trade off for up-to-the-second data accuracy and a dramatic increase in performance. Running that same query against a live DB will certainly upset your users, even if they get the most up-to-date results.
That being said, there are ways to manually trigger updates of your “cache table” by using model’s afterSave() and afterDelete() callbacks, so “technically” you don’t have to worry about missing any updates. The frequency of these updates should be the best guide on how you’d implement this approach to properly balance any trade offs.

5. Recap and a little SQL
First, we identify our complex or slow query that should not be running against a live DB.
Secondly, based on this query we build our view to hold the essential data in a single place.
Next, we create a “cache table” to replicate the materialized view functionality.
The best and simplest implementation I’ve found would look something like this:
[sql]
DROP TABLE IF EXISTS `best_irish_pubs_new`;
CREATE TABLE `best_irish_pubs_new` SELECT * from `best_irish_pubs_view`;
ALTER TABLE `best_irish_pubs_new` ADD INDEX (`lat`)’;

— Continue to add as many indexes to the table as needed. The line above is for reference.
— You can also set your database engine and collation at this point.

RENAME TABLE `best_irish_pubs` = `best_irish_pubs_old`, `best_irish_pubs_new` = `best_irish_pubs`;
DROP TABLE IF EXISTS `best_irish_pubs_old`;
[/sql]
(Thanks to the original poster http://dev.mysql.com/doc/refman/5.0/en/create-view.html… see comments for more details).
Now we can perform calculations and querying on a single table where the results have already been thoroughly prepared to be extracted. For example, the distance calculation can be done since we have latitude and longitude in the table, we can order by bartender’s name since we brought all this info in form other tables via JOIN’s… but the best news, of course, is that we have dramatically improved performance.

A real world number is that by employing this technique we’ve cut down a 27.89 second query to under one second for a pretty complex resultset.

And there we have it… a long ass post, but hopefully a useful one nonetheless ;)

  • keymaster

    Great post.

    Basically, what you’re doing is creating a simple front end data warehouse for query purposes.

    Obviously this is not for apps with high update rates – might even make things worse due the the need to update the materialized view.

  • teknoid

    @keymaster

    Yes, there are certain trade offs and this should be used for cases where query caching, view caching and simple query optimization tend to fail.

    Overall if you already have a need to do something like this, chances are you’ve exhausted the basics…

  • Pingback: Interesujące z sieci – 2010/09/21 « webbricks()

  • That’s an insane approach to creating a materialized view. Your table rows will spend their entire life locked and your performance,.. well, you won’t have any.

    http://www.nicey.com/blogs/rnice/?p=24

  • teknoid

    @Robert Nice

    So far the increase in performance is dramatic. This approach was both recommended on #mysql, their web site and a few other places…

    Once improvement would be to create the table without creating the view.

    I am always open for new ideas, so thank you for sharing your approach.

  • Pingback: Speeding things up with materialized views and MySQL | DEEP IN PHP()

  • Hi,

    Keep in mind that Flexviews can incrementally refresh materialized views. It can handle millions of changes per day, support joins and aggregations, and it isn’t difficult to use.

    It also supports views similar to the ones you are creating (which are refreshed completely) except it has a data dictionary so no template view is required.

  • teknoid

    @Justin Swanhart

    Thanks, we are planning to experiment with FV in the near future, just haven’t got around to it. From what I’ve read and based on your post it is clearly a superior approach to the hack above.

  • Pingback: #1030 – Got error 1 from storage engine « Espresso+()

  • Great work, Thanks much.

%d bloggers like this: