How Drupal's cron is killing you in your sleep + a simple cache warmer
A lot of what's written about performance tuning for Drupal is focused on large sites, and benchmarking is often done by requesting the same page over and over in an attempt to maximize the number of requests per second (à la ab). Unfortunately, this differs from the real world in two key ways:
- Most of our projects aren't regularly driving traffic at millions of hits per day.
- Our users request a lot of different pages.
In this post I'll model a site with 500 nodes, and 1200 hits per day. That's fewer than 1 request per minute, yet for many small businesses this would be a fairly healthy traffic flow. In this case, it might at first seem that high performance doesn't matter. A clever hacker could probably manage to install Linux on the office coffee maker and get acceptable HTTP throughput. However, latency still matters a great deal, even for small sites:
User impatience is measured in tenths of a second, starting at around 200 milliseconds.
Drupal's page cache is capable of delivering blazing fast response times, but only when the cache is warm. And the reality for most small to mid-sized sites is that the page cache is cleared out far more quickly than it is regenerated.


These graphs show what happens to a site that runs cron.php every six hours. Remember, system_cron() calls cache_clear_all(), and every time the cache is cleared the hit rate crashes to zero. The traffic follows a Pareto distribution: a few pages are very popular with a "long tail" of less common requests. Initially the hit rate jumps back up as the popular pages get cached. But the "long tail" pages never really gain an acceptable hit rate before it starts all over again. To make matters worse, I've seen site operators that run cron far more often – imagine if cron were run every fifteen minutes; there would be almost no page caching at all. So what can be done?
Run cron.php less often
Somewhere between six hours and one day is likely adequate for most sites. If you have tasks that need to run more often (such as notifications), consider breaking up your cron runs with Elysia cron or perhaps a drush script. Still, as the graph shows, it can take a long time for the page cache to kick in no matter what the frequency. Furthermore, the cache is also cleared during other operations such as node edits or comment posting.
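To get a feel for how the clear interval affects the hit rate, here is a rough simulation sketch. It is not the model behind the graphs above. It assumes Zipf-like page popularity (weight 1/rank), evenly spaced requests, and a cache that is emptied completely at every clear. The function name and its parameters are made up for illustration.

```python
import random
from itertools import accumulate


def simulate_hit_rate(clear_every_hours, pages=500, hits_per_day=1200,
                      days=30, skew=1.0, seed=1):
    """Fraction of requests served from a page cache that is emptied
    completely every `clear_every_hours`."""
    rng = random.Random(seed)
    # Assumed popularity model: the page at rank r has weight 1 / r**skew.
    cum_weights = list(accumulate(1.0 / rank ** skew
                                  for rank in range(1, pages + 1)))
    total = hits_per_day * days
    requests = rng.choices(range(pages), cum_weights=cum_weights, k=total)

    spacing = 86400.0 / hits_per_day          # seconds between requests
    clear_interval = clear_every_hours * 3600.0
    cached = set()
    next_clear = clear_interval
    hits = 0
    for i, page in enumerate(requests):
        now = i * spacing
        while now >= next_clear:              # a cron run wipes the cache
            cached.clear()
            next_clear += clear_interval
        if page in cached:
            hits += 1
        else:
            cached.add(page)                  # a miss regenerates the page
    return hits / total


if __name__ == "__main__":
    for hours in (0.25, 1, 6, 24):
        rate = simulate_hit_rate(hours)
        print(f"clear every {hours:>5} h: hit rate {rate:.0%}")
```

The exact percentages depend heavily on the assumed popularity curve, so treat the output as a way to compare intervals rather than as a prediction for a real site.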
Run a crawler
Almost every smart Drupal developer I've discussed this problem with has the same answer: run a crawler. I'm not thrilled about this solution because in some ways it seems very inefficient, but some testing shows the impact can be minimal. In my tests, 500 nodes took less than a minute to regenerate from a cold cache, and less than 5 seconds when fetching from the cache. The command below assumes you have the XML Sitemap module installed. While not required, a sitemap certainly makes wget's job easier.
wget --quiet http://example.com/sitemap.xml --output-document - |\
perl -n -e 'print if s#</?loc>##g' | wget -q --delete-after -i -
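The one-liner above works line by line, so it relies on each <loc> element sitting on a line of its own in the sitemap output. If that doesn't hold for your sitemap, a sketch along the following lines parses the XML instead. It assumes sitemap.xml is a plain URL set in the standard sitemaps.org namespace rather than a sitemap index. The function names `extract_urls` and `warm` are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip()
            for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def warm(sitemap_url, timeout=30):
    """Fetch the sitemap, then request each listed page and discard the body.

    Returns the number of URLs found and a list of (url, error) failures.
    """
    with urllib.request.urlopen(sitemap_url, timeout=timeout) as response:
        urls = extract_urls(response.read())
    failures = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as page:
                page.read()                   # anonymous request fills the page cache
        except OSError as error:              # URLError and HTTPError are OSError subclasses
            failures.append((url, error))
    return len(urls), failures
```

Calling `warm("http://example.com/sitemap.xml")` from a scheduled job shortly after cron.php runs would do roughly what the wget pipeline does, and it also reports which pages failed to load.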
Invent a preemptive cache handler
It seems the ideal cache handler would regenerate cache entries before they expire. Sadly, as far as I know, this doesn't exist yet for Drupal 6. (There's Pressflow Preempt for D5, but it seems it never made it to D6.) If you have ideas about this, please let me know!
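To make the idea concrete, here is a toy in-memory sketch of "refresh before expiry". It is not Drupal code and not how Pressflow Preempt works. The class name, its parameters, and the `build` callback are all hypothetical.

```python
import time


class RefreshAheadCache:
    """Toy cache whose entries can be rebuilt shortly before they expire."""

    def __init__(self, build, ttl, refresh_margin, clock=time.time):
        self.build = build                    # callable: key -> freshly rendered value
        self.ttl = ttl                        # seconds an entry stays valid
        self.refresh_margin = refresh_margin  # rebuild this long before expiry
        self.clock = clock
        self.entries = {}                     # key -> (value, expires_at)

    def _rebuild(self, key):
        value = self.build(key)
        self.entries[key] = (value, self.clock() + self.ttl)
        return value

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None or entry[1] <= self.clock():
            return self._rebuild(key)         # cold or expired: the visitor waits
        return entry[0]

    def refresh_expiring(self):
        """Rebuild entries that expire within refresh_margin; return their keys."""
        deadline = self.clock() + self.refresh_margin
        due = [key for key, (_, expires_at) in self.entries.items()
               if expires_at <= deadline]
        for key in due:
            self._rebuild(key)
        return due
```

A background job calling `refresh_expiring()` more often than `refresh_margin` would keep pages that have already been requested from going cold, so visitors only pay the rendering cost for pages nobody has asked for yet.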


Comments
But cache_clear_all does a check. If NULL is passed as the $cid then it does a check on some variables. It makes sure that cache_clear_all is only run on the tables once per "period" defined by the cache_lifetime settings.
Check your site's admin/settings/performance. Is the minimum cache lifetime set to "none"?
But even if it is set to none, it only deletes items which are expired...
Boost for 6.x is the answer!
It has a crawler built in so your site will never have an expired page. It can regenerate the cache before it expires. It solves all the issues you brought up. I can have a cache that lasts several weeks using Boost.
Please also get involved in getting finer grained cron controls into core
http://drupal.org/node/19173
Also consider adjusting the 'Minimum cache lifetime' setting at admin/settings/performance. That way, if you run cron every hour, it doesn't have to clear your caches on every single run.
Minimum cache lifetime isn't much help for small sites. What it's really good for is preventing the caches from being cleared too often in cases where you have large numbers of writes. (As in thousands of comments per day). In the site charted above, one would need to set the lifetime to 12 hours or more to make a significant difference, and even then the caches will still expire eventually.
Regarding Boost: I have only tinkered with it, so I can't comment in detail on how it handles cache expiration. The built-in crawler is a good idea. So far, the thing that has kept me off Boost is that it's not actually a cache handler, so it's completely unaware of cache clear events. (Correction: see mikeytown2's comment below.) In a follow-up post, I'll try to compare Boost to cacherouter with the file backend.
Actually Boost is a cache handler; it has some of the most tightly integrated cache expiration logic ever seen in any CMS. The default for Boost is to ignore the cache_clear_all cron call, but this can be changed by setting "Ignore cache flushing" to Disabled.
Boost will clear or expire the cached page under the following hooks: nodeapi, comments, and Voting API.
In addition to clearing that node, it can also clear the views containing the node; CCK node reference fields; taxonomy term pages containing that node; and the menu item above it (parent), next to it (siblings), and below it (children).
You can also set custom expiration times by content container (view, node, taxonomy, panels, etc.), content type (page, story, etc.), or ID (nid, tid, view display, etc.). In short, Boost allows you to spend a lot of time tinkering with the cache to get it working exactly how you want. It is very powerful, and there is a reason it is used by a lot of sites out there.
mikeytown2,
Thanks for the correction! I was thinking that because of Moshe's comment in this issue, and the fact that it does not include a cache.inc handler (a.k.a. $conf['cache_inc'] = boost.inc), it didn't respond to things like node_save().
I will definitely be giving Boost a closer look. I wonder if the project page could be updated to clarify that it is bypassing cache.inc and cache_clear_all() in order to provide more fine-grained control.
This won't solve everything, but it sure helps out a lot.
Super Cron Module
Thanks for this post. I have my cron set to run every 5 minutes and it never occurred to me that there would be any harm in it. I'm going to investigate the options in the comments.
Thanks again!
Michelle
Thanks for the post Dylan, and to all who commented! This will definitely help, as a good portion of the sites we build are in the low-medium traffic range, and performance is always a concern.