This Week's Drupal Fire Drill

This week many in the Drupal community lost a lot of sleep Tuesday night because the security team treated us to a warning about major security updates due out on Wednesday. Fortunately for many it wasn’t a crisis in the end, but it gave us all a chance to practice for the worst. Basically, it was like a fire drill in a elementary school: we got to prepare like there was a disaster, but since they wasn’t one we don’t really know how it would have gone if there was actually a fire. We haven’t had a stop-drop-and-roll type of emergency in a while, so it was a good refresher on how to handle a crisis.

For those who don’t know what I’m talking about here’s a quick review. At Cyberwoven, like many Drupal shops, we follow the Drupal Security twitter feed on one of our Slack channels so we saw this mid-afternoon Tuesday:

Slack posting of tweet from the security team.

I read the PSA with images of Drupalgeddon dancing in my head:

There will be multiple releases of Drupal contributed modules on Wednesday July 13th 2016 16:00 UTC that will fix highly critical remote code execution vulnerabilities (risk scores up to 22/25). These contributed modules are used on between 1,000 and 10,000 sites. The Drupal Security Team urges you to reserve time for module updates at that time because exploits are expected to be developed within hours/days. Release announcements will appear at the standard announcement locations.

Drupal core is not affected. Not all sites will be affected. You should review the published advisories on July 13th 2016 to see if any modules you use are affected.

Oh that bold line up there wasn’t part of the original announcement. On Tuesday we didn’t have a sense of scale, were we talking about modules that everyone uses on almost every site (ctools came up more than once). It’s the one thing I wish the security team had done differently: given us that sense of scale.

I read all security postings, and make sure we take prompt steps to address them for clients as needed, but the potential here was that we’d have to update all 70+ sites in a few hours or less which is very different from your run-of-the-mill security update that often aren’t related to use cases and threat profiles for the majority of sites.

Here’s what we did next:

Tuesday

Took a minute to panic, complain, and joke about pending illnesses. This is actually a useful step because it allowed me to burn off some nervous energy and then to focus on the real work.
Pulled out the list of all active clients with Drupal sites, and doubled checked it for accuracy.
Made sure a developer had a working repo for all sites (70+). Since we had a couple people out of the office, and some projects had been reassigned recently to different developers, this was an important step to make sure no sites fell through the cracks during a rush to update them all.
Made sure we knew which were 6 sites, and which were 7 in case we are able to determine that 6 is also affected. Since I knew the announcement would likely skip D6, we needed to accept that we might have to take those sites offline for a time.
Made sure leadership knew that all developers may be busy start at 16:00 UTC on Wednesday. We didn’t actually cancel anything right away, but I didn’t want anyone surprised if we were all too busy posting and testing updates to worry about things like meetings.
Made sure complex projects were thought about ahead of time: sites with unusual setups or ongoing dev work that make sudden updates complex. For example we have one client that has 16 sites that all have an unusual set up, so we agreed who would handle those and made sure she was prepared.

Wednesday

First thing in the morning I saw the update to the notice that gave us a sense of scale and relaxed a little, but still made sure we were fully prepped.
Noon: the announcement of what modules were affected was released and a couple other developers and I immediately reviewed the releases. We relaxed once we determined none of our clients were using any of the modules listed.
I reviewed code from each of the three modules to see what the change was to look for ways to improve my own code to avoid similar errors.
Looked for ways to improve our response for the next time it’s not a drill.

Things would have been more exciting if we’d had to update our sites. Since we were prepared it was a matter of minutes for us to check that all our sites were secure. Each developer checked all the sites in their sandbox, and since I knew all sites were in someone’s sandbox that gave us 100% coverage without having to do lots of double checks.

I think it is too easy to look past doing the code review of modules we weren’t using but I find this kind of follow up really useful. Looking back on Drupalgeddon it’s amazing how much pain was caused by such a small error (16 characters are all that were needed to fix it). And by seeking to understand what went wrong you can look for places that you make similarly invalid assumptions.

If you read my post on making new mistakes, you also know I believe that looking for improvement is the most important detail (particularly when it turned out to be a drill, not a fire). Here’s my initial list of things to improve:

Have a system to automatically check every site for specific modules (this was already under development but will take a little while longer to complete).
Make sure at least two developers have a working sandbox for all projects at all times in case something comes out during a vacation.
Improve internal messaging about what to expect – template message and process. I tossed together some disjointed thoughts for account managers. But disjointed developer thinking does not make people feel like you’re on top of things.
Have better tracking of outliers: I completely missed that I had a demo site on Pantheon that did have the Coder module running. Since it was Pantheon they alerted me to this problem and had taken steps until I could do the update myself. But it would have been bad if that site had been someplace else and/or in production.
Make sure everyone one knows when the actual release is coming out, and what the outcome was. Several developers were hoping to get lunch before the updates, but hadn’t done so when the announcement came out (which could have been a problem if we’d ended up busy). And I spent the rest of the day answering one-off questions from the account team who wanted to know if the announcement had been bad news.

Ideally we’d come up with a method to automate security updates (maybe all updates), but that’s not totally straightforward. We have to worry about required patches, non-standard setups, automated testing, and other details. There has been discussion on the Pantheon power user’s mailing list, but every shop has a slightly different workflow (like the fact that we don’t use Pantheon much at all) so we’ll need to come up with a system that accommodates our system.