Fixing Drupal site locks during menu rebuild

Difficulty: Come Get Some

This is a follow-up to the previous post, Database Transactions in Drupal, where we saw in detail how little attention transaction management has received (and still receives) in Drupal.

In this article we will see:

  • How we used proper transaction management to completely avoid site lockups during menu rebuilds. The site remains 100% usable from processes other than the one regenerating the menu.
  • What we had to do to get proper transaction management working.

At the end of this post you will see how we made it possible for a site to remain 100% usable with this crazy index.php file (note that menu_rebuild_needed is set to TRUE on every request) and - of course - with any sort of high level caching (page, Varnish, CDN, etc.) disabled:

<?php

define('DRUPAL_ROOT', getcwd());

require_once DRUPAL_ROOT . '/includes/bootstrap.inc';

drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);
variable_set('menu_rebuild_needed', TRUE);
menu_execute_active_handler();

The problem with menu rebuild

I revisited Drupal's DBTNG transaction implementation in preparation to fix some site locking issues. You know what I'm talking about: some operations on D7 can completely take your site down when using "serious/consistent/transactional" data backends. Among other things, I'm specifically talking about rebuilding the menu.

Menu rebuilding has had (and still has) a noticeable performance impact on your website because:

  • It is by design extremely slow (and scales badly on sites with many routes/menu items)
  • It runs inside a database transaction, yielding a high probability of query locks (even more so if you take into account that, when using memcache, some invalidations operate on the variable table)
  • It is designed to keep all incoming requests waiting until the menu has been rebuilt (only when triggered with menu_rebuild_needed = TRUE)
  • If the previous is not bad enough, during registry rebuilds or updates the menu rebuild happens at least twice

This is how the overall logic of the rebuild works (a rough sketch of the stock code follows the list):

  • Incoming request hits Drupal
  • We need to route this request to a callback, so call menu_get_item()
  • If the menu needs a rebuild, go to menu_rebuild()
  • If I can rebuild the menu, do so; otherwise wait until it has been rebuilt
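In stock D7 this logic lives in menu_get_item(). A simplified sketch of the relevant part (not the refactored version shown later in this post):

// Roughly the stock D7 routing entry point (simplified).
function menu_get_item($path = NULL, $router_item = NULL) {
  $router_items = &drupal_static(__FUNCTION__);
  if (!isset($path)) {
    $path = $_GET['q'];
  }
  if (!isset($router_items[$path])) {
    // Rebuild if we know it's needed, or if the menu masks are missing.
    if (variable_get('menu_rebuild_needed', FALSE) || !variable_get('menu_masks', array())) {
      if (_menu_check_rebuild()) {
        // Every request that lands here blocks until the menu is rebuilt.
        menu_rebuild();
      }
    }
    // ... look the router item up in {menu_router} and statically cache it ...
  }
  return $router_items[$path];
}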

Rebuilding the menu involves the following steps (a sketch of the stock menu_rebuild() follows the list):

  • Acquire lock
  • Start a transaction
  • Rebuild the menu router (about 10% of total time). Affected (locked) tables include menu_router and variable.
  • Rebuild menu links. Affected tables include menu_links
  • Clear page, block and menu cache
  • End transaction
  • Release lock
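For reference, this is roughly what the stock D7 menu_rebuild() does: one lock, one big transaction covering both the router and the links rebuild, and a blocking lock_wait() for every request that cannot acquire the lock:

// Roughly the stock D7 implementation (before the refactoring shown
// later in this post): one big transaction covers everything.
function menu_rebuild() {
  if (!lock_acquire('menu_rebuild')) {
    // Another request is already rebuilding: block until it finishes.
    lock_wait('menu_rebuild');
    return FALSE;
  }
  $transaction = db_transaction();
  try {
    list($menu, $masks) = menu_router_build();
    _menu_router_save($menu, $masks);
    _menu_navigation_links_rebuild($menu);
    // Clear the menu, page and block caches.
    menu_cache_clear_all();
    _menu_clear_page_cache();
    if (defined('MAINTENANCE_MODE')) {
      variable_set('menu_rebuild_needed', TRUE);
    }
    else {
      variable_del('menu_rebuild_needed');
    }
  }
  catch (Exception $e) {
    $transaction->rollback();
    watchdog_exception('menu', $e);
  }
  lock_release('menu_rebuild');
  return TRUE;
}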

If I now told you that rebuilding the menu can take anywhere between 5 seconds and 1 minute - depending on setup, hosting and the size of the menu tables - and that this process can happen up to 3 times in a row, that means a potential 2-4 minute downtime for your website. And if this happens during peak hours, the problem only gets worse.

How can this happen 3 times in a row?

  • If you are in maintenance mode, menu rebuild is designed to run at least twice.
  • The default lock duration is 30s, and if your menu rebuild takes longer than 30s you are going to start racing (more than one request regenerating the menu at the same time), as sketched below.
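A simplified illustration of that race, using D7's lock API (the timings are made up for the example):

// Request A, at t = 0s: acquires the lock with the default 30 second
// timeout and starts a rebuild that will take 45 seconds.
lock_acquire('menu_rebuild');

// Request B, at t = 35s: request A's lock has already expired, so this
// call also returns TRUE and a second rebuild starts while the first
// one is still running.
lock_acquire('menu_rebuild');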

A few ideas to try and fight this issue:

  • [1] Make menu_rebuild much faster
    • This has been attempted in several issues on Drupal.org, but there is little to be done without completely refactoring how all menu rebuilding works, which is unfeasible at the current D7 stage.
  • [2] Make menu_rebuild work fully in-memory and then drop the results into the database at once
    • This also means some heavy rewrite...
  • [3] Make menu_rebuild work on a snapshot transaction
    • Using this transactional model, the rest of the website will still be able to use menu router and menu links while the new menu is regenerated. It requires very little modification of the current menu generation logic.
  • [4] Split menu generation transaction into two parts
    • Menu rebuilding has two distinct parts (menu router and menu links) affecting different tables. If each one had its own transaction, we would have two shorter database locks instead of one big one.
  • [5] Cache menu_get_item results
    • We can serve cached items while the menu is being rebuilt if we are fine with some stale data for a while.

At first sight, [3], [4] and [5] look like the ones with the least impact (and development effort), until you find out that transaction management in D7's DBTNG is completely underpowered.

As I mentioned in the previous article, transaction management in D7:

  • Is not convenient to use
  • Offers no control over transaction nesting
  • Offers no control over the transaction isolation level

Revisiting transaction management in Drupal

The first step in this adventure is to get proper transaction management.

To have control over global statement transactional behaviour (the transaction behaviour for implicit transactions) there is a flag that you can use in settings.php:

$conf['MSSQL_DEFAULT_ISOLATION_LEVEL'] = 'PDO::SQLSRV_TXN_READ_COMMITTED';

You can use any of the following:

  • PDO::SQLSRV_TXN_READ_UNCOMMITTED
  • PDO::SQLSRV_TXN_READ_COMMITTED
  • PDO::SQLSRV_TXN_REPEATABLE_READ
  • PDO::SQLSRV_TXN_SNAPSHOT
  • PDO::SQLSRV_TXN_SERIALIZABLE

You can read more about the different isolation levels in the SQL Server documentation.

Now we need a convenient and robust way to control transaction behaviour for our transactional code.

That is done through the new db_transaction_sane() method.

db_transaction_sane() needs to be called with an instance of DatabaseTransactionSettings that defines the behaviour for the transaction.

  /**
   * DatabaseTransactionSettings constructor.
   * 
   * @param bool $sane
   * TRUE if commits are explicit and rollbacks implicit. This is the opposite of Drupal's default behaviour.
   * 
   * @param DatabaseTransactionScopeOption $ScopeOption 
   * How transaction nesting will work.
   * 
   * @param DatabaseTransactionIsolationLevel $IsolationLevel 
   * Level of isolation for this transaction.
   */
  public function __construct($sane = FALSE, 
      DatabaseTransactionScopeOption $ScopeOption = NULL, 
      DatabaseTransactionIsolationLevel $IsolationLevel = NULL) {

There are 3 settings you can control here.

  • Sane:
    • Set this to TRUE to get implicit rollbacks and explicit commits. No more double try-catch blocks (see the contrast sketch after this list).
  • Scope Option:
    • RequiresNew: Will start a new transaction or savepoint.
    • Required: Will only start a transaction if there is no active ambient transaction.
    • Suppress: The scope is isolated from any ambient transaction, which will be restored once the scope finishes. This is done by using a different connection to the database for the duration of this "suppress" transaction.
  • Isolation Level:
    • Already explained before.
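For contrast, this is roughly the pattern that stock DBTNG pushes you towards: the commit is implicit (it happens when the transaction object goes out of scope) and a rollback has to be requested explicitly from a catch block:

$transaction = db_transaction();
try {
  // My transactional code.
}
catch (Exception $e) {
  // Without this explicit rollback, the failed work would simply be
  // committed when $transaction goes out of scope.
  $transaction->rollback();
  throw $e;
}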

A very simple example:

$t = db_transaction_sane(new DatabaseTransactionSettings(TRUE,
                                DatabaseTransactionScopeOption::Required(),
                                DatabaseTransactionIsolationLevel::Snapshot()));

// No need to wrap this in a try-catch: if an exception
// is thrown it will automatically be rolled back.

// My transactional code.

$t->commit();

Dealing with the menu_rebuild issue

Now that proper transaction support is working we can try preventing the site from locking during menu rebuilds without having to rewrite the whole menu rebuild logic and implementation.

We will do 3 things:

  • Use snapshot isolation so that the transaction that is rebuilding the menu does not affect the rest of the system while doing so (no locks, waits or inconsistent data).
  • Break menu rebuilding into two separate pieces, each one with its own transaction.
  • Don't make requests to menu_get_item() wait unless we failed to obtain a router item. Because menu rebuilding runs in a snapshot transaction, the rest of the site keeps seeing the latest menu state until the rebuild is completely finished and committed.

This is what menu_rebuild() and menu_get_item() look like after the refactoring:

function menu_get_item($path = NULL, $router_item = NULL) {

  [...]

  if (!isset($router_items[$path])) {
    // TODO: Consider moving this block of code (rebuild check and rebuild)
    // so that it only runs on a cache miss.
    // Rebuild if we know it's needed, or if the menu masks are missing which
    // occurs rarely, likely due to a race condition of multiple rebuilds.
    $needs_rebuild = FALSE;
    if (variable_get('menu_rebuild_needed', FALSE) || !variable_get('menu_masks', array())) {
      if (_menu_check_rebuild()) {
        $needs_rebuild = TRUE;
        menu_rebuild();
      }
    }
    $original_map = arg(NULL, $path);

    $parts = array_slice($original_map, 0, MENU_MAX_PARTS);

    $ancestors = menu_get_ancestors($parts);
    
    // Menu router is a delicate table that can get locked
    // for a long time when rebuilding the menu. Relying on a
    // cache makes sure that while the menu is being rebuilt, other
    // requests won't get locked.
    sort($ancestors);
    $cache_item_key = 'MGI:' . md5(implode('.', $ancestors));
    if ($cache = cache_get($cache_item_key, 'cache_menu')) {
      $router_item = $cache->data;
    }
    else {
      if ($router_item = db_query_range('SELECT * FROM {menu_router} WHERE path IN (:ancestors) ORDER BY fit DESC', 0, 1, array(':ancestors' => $ancestors))->fetchAssoc()) {
        cache_set($cache_item_key, $router_item, 'cache_menu', CACHE_TEMPORARY);
      }
      else {
        if ($needs_rebuild) {
          // Not found and rebuild needed, let's wait...
          menu_rebuild(TRUE);
        }
      }
    }

    [...]
}

function menu_rebuild($wait = FALSE) {
  // Menu rebuild can take a very long time; if the lock expires
  // it will be really bad news and we will end up with a bunch of
  // requests regenerating the menu.
  if (!lock_acquire('menu_rebuild', 180)) {
    if (!$wait) {
      return FALSE;
    }
    // Wait for another request that is already doing this work.
    // We choose to block here since otherwise the router item may not
    // be available in menu_execute_active_handler() resulting in a 404.
    while (!lock_acquire('menu_rebuild')) {
      // All requests in the system are going to wait here,
      // polling in bounded 30 second slices until the lock
      // holder finishes and releases it.
      lock_wait('menu_rebuild', 30);
    }
    lock_release('menu_rebuild');
    return FALSE;
  }

  try {
    // First build menu router.
    $transaction = db_transaction_sane(DatabaseTransactionSettings::GetBetterDefaults());
    list($menu, $masks) = menu_router_build();
    _menu_router_save($menu, $masks);
    $transaction->commit();

    // Then build menu links.
    $transaction = db_transaction_sane(DatabaseTransactionSettings::GetBetterDefaults());
    _menu_navigation_links_rebuild($menu);
    $transaction->commit();

    // Clear the menu, page and block caches.
    menu_cache_clear_all();
    _menu_clear_page_cache();

    if (defined('MAINTENANCE_MODE')) {
      variable_set('menu_rebuild_needed', TRUE);
    }
    else {
      variable_del('menu_rebuild_needed');
    }
  }
  finally {
    // Always release the lock!
    lock_release('menu_rebuild');
  }
  
  return TRUE;
}

What we have done here is:

  • Split the process into two smaller transactions instead of a big one.
  • Use a different transaction isolation level.
  • Don't make incoming requests wait for the menu rebuild unless we were unable to find a matching route.

The result is that all requests are properly served in a timely manner even while the menu is being rebuilt.

We could even go further and move the menu rebuilding process to a shutdown function (except when there is no already-built menu in the database), in which case no one would notice menu rebuilds at all.
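A minimal sketch of that idea; the _menu_router_is_empty() helper is hypothetical and would simply check whether {menu_router} contains any rows:

if (variable_get('menu_rebuild_needed', FALSE)) {
  if (!_menu_router_is_empty()) {
    // A (possibly stale) router exists: serve this request with it and
    // rebuild at the very end of the request.
    drupal_register_shutdown_function('menu_rebuild');
  }
  else {
    // No router at all: we have no choice but to rebuild (and wait) now.
    menu_rebuild(TRUE);
  }
}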

Looking forward

One of the big issues with Drupal is that it wants to be as portable as possible. This means that the database abstraction layer gives up features in order to work on less feature-rich database backends. That is a big mistake. Those advanced features are there for a reason, and some of them are key to scaling big and complex systems.

Just take a look at how DBTNG has barely changed from D7 to D8 (only some minor changes, really). That is quite disturbing considering that data and storage are the backbone of any application. No matter what your application does, it is all about managing data. If Drupal fails to acknowledge this need it will fail to position itself in the right spot between corporate applications and site building. The move to entity based storage instead of field based storage is an example of this. The inability to retrieve data atomically from within entities is another example. Beware that document based storage (which is where Drupal looks to be heading) is not the right tool for all jobs. From the "MongoDB is web scale" dialogue:

MySQL is slow as a dog. MongoDB will run circles around MySQL because MongoDB is web scale.

"MongoDB does have some impressive benchmarks, but they do some interesting things to get those numbers. For example, when you write to MongoDB, you don't actually write anything. You stage your data to be written at a later time. If there's a problem writing your data, you're fucked. Does that sound like a good design to you?"

By writing to memory mapped files, MongoDB can improve its write-throughput factor by 10.

"What the fuck? Lets abandon transactions, consistency, durability and pin our mission crititcal data to a table and give it a night it will never forget. I mean, who cares what we store as long as we do it quickly. Oh. Sorry. That's right. I'm on the farm now suffocating from the stench of a thousand cow farts. But it smells like roses to me, because I'm nowhere near this moronic conversation."

Entity based storage means (sort of) that if you want to list a single field from an entity you need to load the full entity. This goes against the basic principle of loading only what you need.
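To illustrate with a hypothetical field_price field attached to nodes (the table and column names simply follow D7's default SQL field storage naming):

// Loading full entities just to read a single value from each one:
$nodes = node_load_multiple($nids);
foreach ($nodes as $node) {
  $prices[$node->nid] = $node->field_price[LANGUAGE_NONE][0]['value'];
}

// Versus reading only what we actually need, straight from the field
// storage table:
$prices = db_query('SELECT entity_id, field_price_value FROM {field_data_field_price}
  WHERE entity_type = :type AND entity_id IN (:ids)',
  array(':type' => 'node', ':ids' => $nids))->fetchAllKeyed();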

WordPress is already eating the site building side of the spectrum, and that is a battle that is more than likely already lost. So Drupal should be looking into providing more powerful tools for advanced application development, to find a sweet spot between bare frameworks and ORMs (such as Symfony, Laravel or Propel) and CMSs such as WordPress. And the first step is to acknowledge that data management and data tools - DBTNG, the Entity API (wrappers, EFQ, etc.) - need to be first class citizens of the Drupal ecosystem.