As promised a couple of days ago, today I have spent half
an hour on the SQL queries emitted by a default WP installation. To do
so, I have logged the quries in mysql and written a simple PHP test
script. Here is the first run of the script, comparing 500 runs of
each of WP's and LP's queries (LP is missing two queries, one for
users as I have not enabled Jerome's AdminOptions plugin, and one for
static pages as they are not emitted by default):
WP queries ---------- 1. users: 0.090523 ms 2. options: 0.118955 ms 3. posts: 0.222328 ms 4. posts (count): 0.049981 ms 5. posts (categories): 0.190947 ms 6. posts (comments): 0.100817 ms 7. posts (meta): 0.053948 ms 8. categories (flat): 0.138303 ms 9. categories (flat, again): 0.137282 ms 10. categories (posts count): 0.125479 ms 11. archives: 0.060956 ms 12. archives (again): 0.063106 ms 13. static pages: 0.086774 ms 14. link categories: 0.073251 ms 15. links (first category): 0.262116 ms 16. links (second category): 0.289843 ms total: 2.064609 ms LP queries ---------- 1. posts: 0.192770 ms 2. posts (count): 0.049936 ms 3. posts (comments): 0.056783 ms 4. posts (categories): 0.190775 ms 5. categories: 0.168887 ms 6. archives: 0.062130 ms 7. links: 0.085729 ms total: 0.807010 ms
DB Layout
I am not a huge fan of WP's database layout: in a SELECT-driven system
like a blog (where there are lots of views and relatively few data
insertions or manipulations) it is best to contradict a few of the
golden rules of relational database design, and optimize tables for
SELECTS by reducing JOINS to a minimum.
One of the many possible examples is a numeric field in the posts
table to hold the number of comments for each post, very easy to
implement by adding one INSERT query in just a few places. Such a
field would save one query and some code every time the index page or
a category/archive page is viewed, and the same technique could be
applied to categories (number of posts) and link categories (number of
links).
Another, slightly more complex, example would be using SQL Trees for
categories (and link categories), so as to be able to retrieve a
complex nested hierarchy with a single, superfast SELECT without JOINs.
There are other issues with the WP DB layout, but to learn more about
those you will have to wait for our own backend to be finished.
Indexes
Before getting to the queries, let's briefly examine a few potentially
missing indexes. They are not very important, as everything works even
without indexes especially with small data sets, but adding them wont'
hurt, and may even help by adding unique constraints in the DB to some
fields.
The first missing index is on the options table, since the query to
retrieve them asks for all records having the
autoload
field set to 1. The options table is so small that an index won't make
much difference, but adding it won't hurt
create index idx_options_autoload on wp_options (autoload);
Another slightly more useful index is in the categories table: since
this table has an index on the
category_nicename
field,
it may well have another one on
category_name
, which will
be used when MySQL needs to retrieve a category from its short name,
as LP does in category pages. Creating an index on this field has an
additional benefit: by making the index unique we avoid duplicate
category shortnames in the DB, which are much worse than duplicate
descriptive names as category URLs use short names:
create unique index idx_cat_name on wp_categories (cat_name);
The last two missing indexes are in the posts table: every time you
look at a list of posts, MySQL has to examine their date to ensure
they are not scheduled for publication in the future, and to order
them. It makes sense then to add two indexes on the relevant date
fields:
date
, and
date_gmt
:
create index idx_post_date_gmt on wp_posts (post_date_gmt) create index idx_post_date on wp_posts (post_date);
If you run the tests after having added the indexes, you see we have
gained almost nothing in terms of speed. You'd need a much larger
database (I'm using a copy of a production DB with 1k posts) to notice
the effect of the indexes. I add them anyway to my installations,
since the storage space needed and the impact on INSERT operations are
minimal.
Queries
Without touching WP's code, let's examine of few of WP's queries to see if we can make
them a little bit faster, or at least deuglify them.
The first query we will examine is the base query used to retrieve post IDs
SELECT DISTINCT *
FROM wp_posts
WHERE
1=1 AND
post_date_gmt < = '2005-06-03 16:12:59' AND
(post_status = 'publish')
GROUP BY wp_posts.ID
ORDER BY post_date DESC
LIMIT 0, 40
There are a couple odd things in this query apart from the 1=1
monstruosity: the DISTINCT *, a bit weird as we are querying posts,
and they are already different, one for each row; the GROUP BY, as we
are not using aggregate functions, and so we have absolutely no need
for it. Deuglyfying this query, we get something like
SELECT * FROM wp_posts
WHERE
post_date_gmt < = '2005-06-03 16:12:59' AND
(post_status = 'publish')
ORDER BY post_date_gmt DESC
LIMIT 0, 40
Our simplified query yields the exact same results as the original one, and if we run both
queries with MySQL
EXPLAIN
, we see that our new query
saves MySQL the effort of building a temporary table (
using
temporary
) to store and analyze intermediate data. Let's see if
it runs faster
WP queries ---------- 3. posts: 0.286305 ms WP modified queries ---------- 3. posts: 0.219474 ms
The speed increase varies between 15% and 25%, and though it won't
make much difference to the overall performance of your site, it comes
with a simpler and more readable query than the default WP one.
A second query we might simplify a bit is the one used to gather category post counts
SELECT
cat_ID, COUNT(wp_post2cat.post_id) AS cat_count
FROM wp_categories
INNER JOIN wp_post2cat ON (cat_ID = category_id)
INNER JOIN wp_posts ON (ID = post_id)
WHERE
post_status = 'publish' AND
post_date_gmt < '2005-06-03 16:12:35'
GROUP BY category_id
The two JOINs are a bit odd, as we do not need the categories table
in this query since category IDs are already in post2cat, and we need
just those and their counts
SELECT
p2c.category_id as cat_ID, count(p2c.post_id) as cat_count
FROM wp_post2cat p2c
INNER JOIN wp_posts p ON (p.ID = p2c.post_id)
WHERE
p.post_status = 'publish' AND
p.post_date_gmt < '2005-06-03 16:12:35'
GROUP BY p2c.category_id
The speed difference is minimal, but again it makes no sense to keep a
more ugly, complex query when a simpler one yields the same results.
As many Open Source projects, WP is not only a piece of software but
also a textbook for novice programmers, and quality should matter.
Nice job, Ludo. And that doesn't even address the duplicated queries. Try loading a page in WP (static in LP) and see how many extra queries are run. IIRC, each time
get_the_permalink()is called, WP executes a query (or more).
Another note: on large datasets, the order of the WHERE clauses can affect performance. Normally you want the first clause to be the one that excludes the most records, followed by the second most exclusive, etc.
So the posts query above, it might be better to put the 'publish' condition first, followed by the future date condition (assuming that there are more unpublished posts than post-dated ones).
A more in-depth analysis needs to look at WP code, something which I don't have enough heart to do.
Maybe you can write WP Queries - Part II? :)
No, I get frustrated too easily with this topic. :) WP's code, queries and table designs collide and I'm left with one question: "Why?"
I'd rather focus my energy on making LightPress better. Someone has to write all those unit tests, after all! ;)
It's late here, but two notes:
I've done benchmarking on extremely large datasets here about a year ago and adding an index to the date fields makes no discernable difference whatsoever, and also has the potential to confuse the MySQL optimizer. Remember the query parser picks which index to use based on the query, just because an index is there doesn't mean it'll be used or help any.
Second, the parts you find odd about the main posts query are necessary due to all of the possible permutations created by the WP_Query class based on the query string arguments. Without the "ugly" bits WP would not work.
The dupe queries are something to take a look at, however most of WP's overhead comes from the size of the codebase, the DB queries are hardly a bottleneck and usually less than 15% of total generation time. I think optimization time would be better spent targetting some places we can do more selective loading of includes.
Matt: the main posts query can still be optimised; most of what ludo said is valid.
DISTINCT *isn't required as we're selecting entire rows.
GROUP BY IDis also a waste of time as the ID field is both
AUTO_INCREMENTand a primary key, so uniqueness is guaranteed. I can understand that
1=1is necessary, though.
While I agree that it's mainly the large number of includes that is causing WordPress to be slow, queries are still something to be looked at, especially when fixes are easy and performance gains are noticable. When I've got a bit more time I'll put together a bug report and a couple of diffs.
Matt, thanks for your comment.
MySQL has no use for the date indexes as the optimizer prefers the post_status field (it's a constant) and index. If you force the use of the post_date_gmt index in the query with
SELECT * FROM wp_posts USE INDEX (post_date_gmt)the index is used for the sorting, and EXPLAIN gives Using where instead of Using where; Using filesort. I guess that the larger the dataset, the more the post_date_gmt index is able to speed filtering and sorting data. Using the post_status index to weed out the non-publish posts still leaves MySQL with most of the work to do.
As for needing DISTINCT and GROUP BY in the posts query, I find it odd that the most used permutation of this query (i.e. the one used in the index page) is not optimized. GROUP BY clauses could be inserted later, even though I guess that with a careful refactoring you may simplify all the other permutations of this query.
I agree with you that the codebase is your real bottleneck, and this was the point of this whole post in response to some of the statements expressed in one of the WP support topics.
While I agree there could/should be an optimized case for the standard date-based query for the average homepage:
1. I'd be interested to know more info from Matt. Seems like he has some larger, real-world performance info to contribute.
2. The query system would have to be 'split', with optimized queries based upon the type of query being run.
3. #2 makes it difficult for plugins to modify the current query about to run… I guess the query system itself could be overhauled and made much more complex in code, just so the queries get simplified
4. However, with query caching, many of these issues go away, right? And it's in MySQL 4.0.1+ or something. Granted, on heavy, heavy hosted environments, it's a wash — on a more single-use machine, all your major queries should be in the cache already.
5. The performance issues seem more to have to do with running a dynamic site. That's not simply 'adding posts every day', or 'having lots of comments', that's also things like plugins that generate randomized output (sidebar OR post-content), or other output that is in fact time-dependent. In those cases, things like PreFormatted and WP-Cache do great performance optimization, but at the loss of proper dynamics. ;)
-d
Well the whole date thing is a hack anyway, if we could integrate something like WP-Cron and a better way of doing future posts we could eliminate the date clause that changes every minute and makes the query cache pretty much useless.
You Don’t Have To Be Pretty To Be Successful I think most people would agree that over the last year or so WordPress (WP) has been successful as an open source project. It’s been around longer than that, but it has certainly gained a lot more attention in the last 12 to 18 months. It was…
Uhm Matt, my humble opinion on the date thing is it's not a hack, it's the simplest way to the add "post to the future" feature. It works well, and an extra WHERE clause in SQL does not hurt (provided you do not have too many). I don't know the details, but adding a generalized scheduling solution and more code to manage future posts seems a bit overkill.
Scheduling is best left to the underlying OS, but then I understand you have a large userbase of (l)users who do not use Unix and don't know anything about cron/at, and who might need a WP scheduler…
It is the simplest way, which is why it was done, but it's far from the best. There are ways to do scheduling using cron if it's available and having other ways to kick off events if it's not, like they do in Drupal.
Matt, it's good to see you frequent this site. I was going to mention some of the issues addressed here on the wp-hackers list, but I guess there's no need since you're already here. Thanks to you and Ludo both. Cheers.