Responding to Drupal Break-ins

If you support any web site long enough you will suffer a break in. If you support lots of web sites you will suffer them more often than you’ll want to admit in public. A few weeks ago my number came up again in the attack lottery when we discovered a client’s web site was being used as a proxy and redirect to a fake shoe site.

It wasn’t the first time I’d suffered a break in, and unfortunately I don’t expect it to be the last. My last experience with a major break in was shortly after Drupalgeddon (I patched all the clients I was supporting before they were breached but had to clean up sites that weren’t patched by other vendors), and the attackers had learned a few new tricks in the meantime.

If you are responding to a break in on a Drupal site there are directions on drupal.org to help guide you through an attack response, but I thought it might be helpful to talk through a version of what response can look like in practice. I think it’s also useful for us all to admit our weaknesses from time to time to help us all make sure we’re making new mistakes.

Overview

At the outset I’m going to admit we never found the initial source of the attack, what we did find were the tools they placed after the break in. The most likely cause was poor server patching practices by the client’s host, but there were also some Drupal security patches that had been slow to be get installed as well. During the attack I worked with members of the Drupal Security team (particularly Greg Knaddison who generously provided feedback on this article as well – of course any remaining mistakes are mine), who were helpful in giving me suggestions and who were clearly interested in helping us make sure we resolved the problem.

The site was being used as part of a scam advertising network. The attacker was leveraging the reputation of the site to create records in search engine indexes that were redirects to a fake shoe sales site. There were also a number of tools placed on the server that gave them full access to the Drupal database and the ability to run arbitrary PHP scripts. And it was clear by the end they had placed additional backdoors we never found – they may have had full control over the OS as well.

How we found out

Google told us.

We got an alert from Google reporting SPAM content on the site. At first we couldn’t find the content they were talking about, which unfortunately slowed our escalating our response, because it was only directly available to search engines. The junior developer who was initially assigned to review the message from Google eventually figured out how to find the listing on Google (a Google site search for Nikes and some hash codes the attacker was using), but couldn’t figure out how out it got there, and escalated the task to me.

Once I saw what she’d figured out my stomach sank. At first I was still hoping there might be some other explanation, or some simple matter of a single user account getting exploited, but that seemed unlikely (since we couldn’t find the content on the server) and I quickly knew it was going to be a mess.

Initial Response

The first thing I did was make sure we had a copy of the exploited site: code, files, and database. I would have rolled the site back to a recent backup, but our five-day rolling database snapshots were not enough to get back to before the attack began. We spun up new virtual machines for myself and another senior developer to start reviewing copies in environments isolated from other work.

Since the URLs we had for testing were a fairly unique pattern we started to Google those – and we got lots of hits. As soon as we knew the problem was larger than our site, we opened an issue with the Drupal security team and started to feed them all the information we had gathered. While their practice is not to get involved in resolving attacks directly (their role is to ensure the security of Drupal core and contributed modules), they were supportive and helpful in suggesting places to look for problems and resolution strategies.

Attacks we found

By the time I was alerted to the problem there were already several malicious tools installed, some of which I’d seen versions of before, and some were new to me – all were designed to be hidden from sight through some simple but effective obfuscation. Over the course of the next couple of days I found several backdoors manually, wrote tools to help me find more, and played entirely too much whack-a-mole (more on that in a bit).

There were two main categories of attack I was chasing: PHP scripts scattered around the public files directory, and records added to Drupal’s database tables.

Database table exploits

If you dealt with sites in the aftermath of Drupalgeddon, or other hacked Drupal sites, you have probably seen what happens when an attacker inserts PHP into carefully targeted parts of a Drupal database. In the ones I’d seen before attackers replaced the callback functions in Drupal’s menu_router table with PHP of their own. In this case the attacker used the Block module’s ability to use PHP to place a block to provide themselves a way to execute arbitrary PHP by sending a post request to the server. They leveraged the fact that the main system block is always available and therefore is a reliable place to insert a backdoor. By posting a form with a specific form element they were able to execute arbitrary PHP and therefore use that to place additional malicious code.

The attacker also leveraged Drupal’s system table to get more complex attack code loaded. They created a record for a file to be loaded as a module and then uploaded that file to the site’s files directory where they were guaranteed Drupal had write access.

filename: sites/default/files/styles/medium/public/57h3d21.jpg
name:overly
type: module
owner:
status:1
bootstrap: 0
schema_version:0
weight: 0
info: a:11:{s:4:"name";s:6:"overly";s:11:"description";s:58:"Displays the Drupal administration interface in an overly.";s:7:"package";s:4:"Core";s:7:"version";s:4:"7.32";s:4:"core";s:3:"7.x";s:7:"project";s:6:"drupal";s:9:"datestamp";s:10:"1413387510";s:12:"dependencies";a:0:{}s:3:"php";s:5:"5.2.4";s:5:"files";a:0:{}s:9:"bootstrap";i:0;}

This was the script doing the redirects and filtering traffic so that pages only appeared to search engines. Usually these records have filenames that are .module, .php, or .inc files, but in this case it was a .jpg file named to be similarly to actual files on the site to make it hard to spot.

The content of that file was a PHP script not an image. The script did several things, and was the main tool the attacker was actively using during the time we were trying to stop them. It served as a simple proxy of content that they would present to the search engines, and redirect those same pages to the scam site for anyone else. It also provided code to make sure the content of user login forms was sent to the attacker, and a backup backdoor incase some of their others were lost.

We actually had to remove this particular attack more than once (always using the misspelled “overly” module) and each time it came back with a new file, and each time using a different but similar disguise to try to make their code blend in with legitimate files.

.htaccess files in public files

The other trick that was new to me (and a more aggressive stance by Drupal core on this approach is being discussed) was to take advantage of the .htaccess patterns in Apache to re-enable PHP execution within the public files directory. Drupal’s default .htaccess file disables PHP at the root of the public files directory and in theory all subdirectories, but that can simply be undone by a malicious .htaccess file (unless you block it in Apache’s main configuration – which in my opinion defeats the purpose of using .htaccess).

The attacker had placed a number of basic PHP-based exploits on the server using this technique to allow them to run the scripts. The tools themselves were not Drupal-specific, and likely the .htaccess file would work just as well on a number of other PHP-based CMS platforms.
Since the files directory gets deep and complicated there is no reasonable way to scan the whole thing by hand: particularly since several of the files were using inaccurate file extensions (like .jpg or no extension at all) and file names meant to blend into the background. So in addition to checking for any .htaccess files below the files directory root, I wrote a simple Python script to scan a directory for anything that includes the string <?php:

How we fixed it

We immediately made sure all code on the site was up-to-date, and I removed every exploit I could find. And for a couple days I played whack-a-mole with the attacker. Every day I would remove a series of exploits, disable their ability to redirect users to their scam, and every night they would break back in through a backdoor I’d failed to find.

The final solution was to replace the server, deploy a version of the code known to be good, and deploying copies of the database and files that have been scanned for any PHP in places it shouldn’t be – which involved a combination of the scanner above and hand checking every place the database stores PHP, not a fast process.

What we will do better next time

Part of any security event of this nature needs to be a full review of your internal processes and controls to make sure you reduce your risk and improve your response the next time something occurs (because unfortunately there will be a next time for all of us).

One of our first areas of improvement is a shared understanding that it’s more important to resolve the attack than determine the cause. This goes against the question that developers are constantly asked during and after an attack: “How did this happen?” While you need to know something about what happened, in the end it’s more important to make it stop. I still don’t know what happened that started the attack, I know I stopped it by blocking every attack vector I could think of and replacing every part of the stack with a version known to be fully up-to-date. It would have been faster and cheaper if I’d just started there: yes there is a risk I would have missed some of the code in the database if I hadn’t taken the time to review what I was finding, but frankly I doubt that risk is has high as the risk that new exploits will appear while I’m working to understand the previous one.

Beyond that basic shift in approach we developed a three part list of improvements:

  • Things we needed to improve right away.
  • Things we needed to improve soon.
  • Things that should be part of ongoing improvement.

The highest priority items were coming up with better internal process for initial response, and making sure we are deploying all security updates in a timely but still careful manner, including monitoring our hosting partners to ensure servers stay up-to-date as well. These are basics that are easy to let slip over time – particularly monitoring that your partners are doing their job correctly.

The second category of fixes is filled with workflow and procedure improvements. We were already were in the process of improving our code handling (migrating from SVN to Git, better production monitoring, more internal code review, etc), and we accelerated our plans to complete that work. This category also includes a complete review of our existing backup procedures to make sure they provide the level of coverage our clients need.

The final category of longer term adjustments includes tasks that include ensuring all developers are given (and expected to take) professional development opportunities around security best practices, doing more internal sharing about emerging ideas and trends, and encouraging more community engagement so we are better able to leverage the community resources in a crisis.

Yes I wear pants and other advice for working from home.

I’ve worked from home part-time or full-time for the past four years. The first question I’m always asked when people learn I work from home is “Do you wear pants?”. The answer is yes, I wear pants to work even when no one can see me. And frankly I don’t quite understand the desire of large numbers of people to work without pants, but it lead me to realize that it might be helpful to share a few tips for people starting to work remotely.

Wear Pants

Okay, it doesn’t have to be pants, but get up and get dressed in clothes you don’t mind being seen in. The clothes we pick communicate, not only to other people but also to ourselves. Pick clothes to help you take your work seriously.

Set a routine

Have a routine about when you start, and what you do to prepare yourself before you start for the day. Your commute might be a walk down the hall instead of a drive across town or a ride on public transit, but most people I know still benefit from a pre-work routine. Walk the dog, go for a run, eat breakfast, or some other basic activity. It doesn’t matter a whole lot what it is, but have a routine that gives you a few minutes to get settled into a frame of mind to be focused on what you are going to do at work.

Make sure the technology works

You never want to have to say “I can’t do that from here.” or “I’ll do that next time I’m in the office.” This is particularly true when you have a part-time remote arrangement (i.e. if you work from home 1 or 2 days a week), and the systems may not be setup to fully support remote workers. Push hard to get the technology fixed so you can do everything from home at least as well as you can in the office.

Have back-channels

Make sure you are set up with ways to informally communicate with colleagues. Whether that’s slack, hipchat, google hangouts, or some other tool, make sure there is a way to discuss totally unimportant stuff to maintain your personal relationships with your colleagues. When times gets stressful it’s critical to have good will built up and important to have a way to work out issues in private.

Have an office

It doesn’t have to be a nice office, but have a space you go to for work. Make sure it’s setup well for your work, and make sure you don’t spend lots of non-work time in that space. Don’t use the room with your TV or other simple distractions. Ideally you want a space with a door you can close so you can block out any other people in your home and focus.

Have boundaries

Like a morning to routine you need to have boundaries about when you are not working. Finish your day at a predictable time most days. Make sure your colleagues know when you are available outside normal hours and limit how much you let them cheat. Most importantly sometimes be truly unavailable.

Get out

Get out of your house/apartment at least once a day. When I first worked from home I realized that I’d gone a week without leaving my apartment complex, and I was starting to go stir crazy. Even if it’s just a trip to the store or a walk in the woods get out and remember there is more to the world than your work.

Be Productive

Lots of work places treat work from home as a privilege they want to take away again. Even if that doesn’t seem possible in your company, make sure you are proving you are able to be productive with the freedom working from home gives you. Ideally you have a work space free from the normal distractions of a conventional office, so use that environment to get more done than you can in the office.

Get together with colleagues

If you work remotely full-time you need to make sure you still spend face-to-face time with your colleagues. Humans are geared to appreciate time spent together, and for all the technology allows us new freedoms about where and when we work, there is still a different quality when everyone is in one place together. Some companies do this at conferences, some hold retreats, some just call the remote staff to the main office a few times a year. Figure out what makes sense for you and your company and push to make sure it happens.

Show personality

puzzleDo things that help give your colleagues insight into who you are outside work. I keep a puzzle table in my office to help me clear my head and avoid boredom during conference calls. At my previous job we did a daily stand up video conference, and some days I would put the camera at the puzzle and worked on it as we talked. It served as a friendly way to help people see me as a real person not just a source of code. Now I will share pictures of my dogs and other things that round out people’s understanding of my life.
Not all of these things are totally within your control, and you will need support from your company to make sure the environment is right for you to be successful. Work with your manager, other remote employees, and other colleagues to make sure the environment is going to allow you to be successful over time.

Documenting your work

Programming books
Your documentation does not need to look like this.

Early in my career I spent a lot of time as the only technical person on project, and therefore believed that I didn’t need to document my work carefully since I was the only person who had to understand it later. It turned out that if a project was back burnered for a few months the details were pushed out of my mind by the details of eight other projects.

Any project that takes more than a couple hours to complete involves too many details for most people to remember for more than a few days. We often think about project documentation as something for other people – and it is – but that other person may be you in six months.

I learned to start keeping notes that I could go back to, those notes would turn into documentation that I could share with other people as the need developed. My solutions were typically ad-hoc: freeform word documents or wiki pages. For a while I had a boss who wanted every piece of documentation created by IT to fit a very predictable format and to be in a very specific system. It took two years for him to settle on the system, process, and format to use. By then I had a mountain of information in wiki pages that documented the organization’s online tools in detail, and no one else is IT had anything substantial. It was two more years before the documentation of other team members got to be as good as my ad-hoc wiki.

That’s not to say a rogue solution is best, but the solution that I used was better than his proposed setup for at least three years. That experience got me to think about what makes documentation useful.

Rules of thumb for good project documentation:

  • Write up the notes you’d want from others when coming into a project: think of this as the Golden Rule of documentation. Think about what you’d want to have if you were coming into the project six months from now. You’d want an outline of the purpose of the project and the solution used, and places they deviated from any standards your team normally uses. You’ve probably read documents that are explaining something technical to an expert that are hard for anyone else to understand – if I’m reading the documentation I want to become an expert, but I’m probably not one already.
  • Keep it easy to create and edit while working: if you have to stop what you’re doing and write your notes in a totally different environment that your day-to-day work you will not do it. Wikis, markdown files, and other similar informal solutions are more likely to actually get written and updated than any formal setup that you can’t update while doing your main work.
  • Document as you go: we all plan to go back and write documentation later and almost none of us do. When we do get back to it, we’ve forgotten half the details we need to make the notes useful to others. So admit you’re not going to get back to it and don’t plan to: write as you go and edit as you need.
  • Make sure you can come in in the middle: People skim project documentation, technical specifications, and any other large block of text. Make sure if someone has skipped the previous three sections they can either pick up where they left off, or give them directions to the parts the need to understand before continuing.
  • Track all contributions: Use a system that automatically tracks changes so you you can see contributions from others and fix mistakes. Tools like MediaWiki, WordPress, and Drupal do this internally. Markdown or text files in a code repository also have this trait. Avoid solutions like MS Word’s track changes that are meant for editing a final document not tracking revisions over time.
  • Be boldDon’t fear editing: follow the Wikipedia community’s encouragement to Be Bold. You should not fear making changes to the team’s documentation. You will be wrong in some of what you write, and you should fix any mistake you find – yours or someone else’s. Don’t get mad if someone makes a change that’s not quite right, revert the change or make a new edit and more forward.
  • There is always an audience: even if you are the only person on the project you have an audience of at least your future-self. Even if it feels like a waste in the moment having documentation will help down the road.

Remember even if you are working alone you’re on a team that includes at least yourself today and yourself in the future. That future version of you probably won’t remember everything you know right now, and will get very annoyed at you if you don’t record what they need to know. And if the rest of your team members aren’t just versions of yourself they may expressed their frustration more directly.

Bad data systems do not justify sexist your behavior

This week we get a letter from Atlantic Broadband, our ISP, addressed to “Aaron & Eliza Crosman Geor”. My wife has never gone by Eliza and her last name is not “Geor”.

Atlantic Broadband to: Aaron & Eliza Crosman GeorIt’s been this way since we signed up with them, when we ask them to fix it they acknowledge that they cannot because their database cannot correctly handle couples with different last names who both want to appear on the account. Apparently it is the position of Atlantic Broadband that in 2016 it is reasonable to tell a woman she cannot be addressed by her legal name because it would be expensive for them to fix their database, and therefore she must be misaddressed or left out entirely.

I consider this unacceptable from old companies, but Atlantic BB was founded in 2004 – there are probably articles about not making assumptions about people’s names that are older than their company.

Folks, it is 2016, when companies insult people and then blame their databases it is because they do not consider all their customers worthy of equal respect.

So let’s get a few basics out of the way:

  • Software reflects the biases of the people who write it and buy it.
  • If your database tells someone their name is invalid your database is not neutral. Just because you don’t get the push-back that Facebook sees when they mess this up does not mean what you’re doing is okay.
  • If your database assumes my household follows 1950s social norms, the company that uses it considers 1950s social norms acceptable in 2016 – and there are probably a few of those they don’t want to defend (I hope).
  • When an email, phone rep, or letter calls me by my wife’s last name or her by mine, in both cases they are assuming she has my last name not that I have hers. This is a sexist assumption that the company has chosen to allow.

Of course Atlantic isn’t the only company that does this: Verizon calls me Elizabeth in email a couple times a week because she must be primary on that account (one person must lead the family plan), and Nationwide Insurance had to hack their data fields for years so my wife could appear on our car insurance card (as required by law) every time we moved because their web interface no longer allowed the needed changes. The same bad design assumptions can be insulting for other reasons such as ethnic discrimination. My grandmother was mis-addressed by just about everyone until she died because in the 1960s the Social Security Administration could not handle having an ‘ in her name, and no one was willing to fix it in the 50 years that followed SSA’s uninvited edit to her (and many other people’s) name.

In all these cases representatives all say something to the effect of “our computers cannot handle it.” And that of course is simply not true. Your systems may not be setup to handle real people, but that’s because you don’t believe they should be.

Let’s check Atlantic Broadband’s beliefs about their customers based on how they address us (I’m sure there are some additional assumptions not reflected here but these are the ones they managed to encode in one line in this letter):

  • They assume they are addressing one primary account holder: I happen to know from my interactions with them that they list my first name as: “Aaron & Eliza”, and my last name as “Crosman Geor”. Plenty of households have more than one, or even two, adults who expect equal treatment in their home. Our bank and mortgage company know we are both responsible adults why is this so hard for an ISP (or insurance company, or cell provider, credit card, etc)?
  • They assume my first name isn’t very long: They allowed 13 characters, but 4 more is too many. I went to high school with a kid who broke their database by exceeding the 26 character limit it had (they didn’t ask the kid to change his name, the school database admin fixed the database), but Atlantic can barely handle half that.
  • They assume my last name isn’t very long: Only 12 characters were used and they stopped in a strange place. I know many people with last name longer than that: frequently people who have hyphenated last names blow past 12. Also the kid with a 26 character first name – his surname was longer.
  • They assume my middle name isn’t an important part of my name: If they had a middle name field, they could squeeze a few more letters in and make this read more sensibly.  But they only consider first and last names important. Plenty of people have three names – or more – they like to have included on letters.
  • They assume it is okay to mis-address me and my wife: The name listed is just plain wrong, but they believe it’s okay to keep using this greeting. They assume this even after they have been told it’s not, and even after we’ve reduced service with them (if another ISP provided service to my house I’d probably cut it entirely although mostly for other reasons). They believe misaddressed advertisements will convince me I need a landline or cable package again.

Now I’ll be fair for just a minute and note something they got right: they allow & and spaces in a name so Little Bobby Tables might be able to be a customer without causing a crisis (partially because his name is too long for them to fit a valid SQL command into the field).

Frequently you’ll hear customers blame themselves because their names are too long or they have done something outside the “norm”. Let’s be clear: this is the fault of the people who write and buy the software. Software development is entirely too dominated by men, as is the leadership of large companies. When a company lacks diversity in key roles you see that reflected in the systems built to support the work. Atlantic’s leadership’s priorities and views are reflected in how their customers are addressed because they did not demand the developers correct their sexist assumptions.

These problems are too common for us to be able to refuse to do business when it comes up. I will say that when we switched our insurance to State Farm they did not have any trouble understanding that we had different last names and their systems accommodated that by default.

If you do business with a company that makes these (or other similar mistakes) I think it’s totally reasonable to remind them every time you reasonable can that it’s offensive. Explain that they company is denying you, your loved ones, and/or your friends a major marker of their identity. Remind them they are not neutral.

If you write data systems for a living: check the assumptions you’re building into your code. Don’t blame the technology because you used the wrong character set or trimmed the field too short: disk is cheap, UTF-8 has been standard for 15+ years, and processors are fast. If the database or report layout doesn’t work because someone’s name is too long the flaw is not the name.

We all make mistakes and bad assumptions sometimes, but that does not make it okay to deny people basic respect. When we make a bad assumption, that’s a bug, and good developers are obligated to fix it. Good companies are obligated to prevent it from happening in the first place.

Try doing it backwards

As part of my effort not to repeat mistakes I have tried to build a habit in my professional – and personal – life to look for ways to be better at what I do. I recently rediscovered how much you can learn when you try doing something you know well backwards: I drove on the left side of the road.

This is the Holden Barina we rented while in New Zealand.
This is the Holden Barina we rented while in New Zealand, a brand of car I’d never heard of before this trip. It was a good car for the mountain driving even if the wipers and lights controls were reversed from cars at home.

By driving on the left I discovered how many basic driving habits I have that are built around driving on the right. The clearest being that the whole time I was in New Zealand I never knew if anyone was behind me, and the whole time I couldn’t figure out why. The mirrors on the car worked just fine, but it turned out I wasn’t looking at them. Driving home from the airport after we returned to the US I realized that every few seconds my eyes jump to the upper right of my vision to check the mirror. In New Zealand I spent the whole time glancing at the post between the windshield and the driver’s side window (which had seemed massive to me while I was there) instead of the mirror. It made me conscious of my driving habits in a way I haven’t been in years, and as a consequence, I think it’s made me a better driver. I’m thinking about little details again; I’ve been more aware of where I am on the road and what I’m doing to keep track of the other cars around me.

My wife drove this section so I got to take some pictures. Amazing scenery but she had to adjust quickly.
My wife drove this section while I got to take some pictures. She got to learn to drive on the left on winding mountain roads – we don’t recommend that approach.

A few years ago I was watching videos from the MIT Algorithms course to refresh some of my basics, and because I wanted to know what had been added in the decade since I’d taken that class at Hamilton. During the review of QuickSort the professor mentioned that it wasn’t originally a divide-and-conqueror process, but a loop based approach meant to work on a fixed length array (so you could use a fixed block of memory). And as I recall he suggests that the students should work out the loop based version. So riding on the train home from work I pieced it together, and found that it’s an elegant process. It’s not something I ever expect to have cause to implement, but it did help me improve my thinking about when to use recursive functions vs when to use a loop, and helped me think about when to use recursion, loops, and other tools for processing everything in a list. There was a session by John Kary at DrupalCon this year on rethinking loops that pushed me again to revise some of how I made those decisions. Again his talk took the reverse view of much of my previous thinking and was therefore very much worth my time.

If you’re feeling like you are in a good groove on something, try doing it backwards and see what you discover.

Picking tools you’ll love: don’t make yourself hate it on day one.

Every few years organizations replace a major system or two: the web site, CMS, CRM, financial databases, grant software, HR system, etc. And too often organizations try to make the new tool behave just like the old tool, and as a result hate the new tool until they realize that they misconfigured it and then spend 5-10 years dealing with problems that could have been avoided. If you’re going to spend a lot of money overhauling a mission critical tool you should love it from day one.

No one can promise you success, but I promise if you take a brand new tool and try to force it to be just like the tool you are replacing you are going to be disappointed (at best).  Salesforce is not CiviCRM, Drupal is not WordPress, Salsa is not Blackbaud. Remember you are replacing the tool for a reason, if everything about your current tool was perfect you wouldn’t be replacing it in the first place. So here are my steps for improving your chances of success:

  1. List the main functions the tool needs to accomplish: This is the most obvious thing to do, but make sure your list only covers the things you need to do, not the ways you currently do it. Try to keep yourself at a relatively high level to avoid describing what you have now as the required system.
  2. List the pros and cons of what you have: Every tool I’ve ever used had pluses and minuses. And most major internal systems have stakeholders who love and hate it – sometimes that’s the same person – make sure you capture both the good and bad to help you with your selection later.
    Develop a list of tools that are well known in the field: Not just tools you know at the start of the project. Make sure you hunt for a few that are new to you. You might think you’ve heard of them all cause you walked around the vendor hall at NTC last year, but I promise you there are more companies that picked a different conference to push their wares, and there are open source tools you might have missed too.
  3. Make sure every tool has a salesperson: Open Source tools can be overlooked because no one sells them to you, and that may mean you miss the perfect tool for your organization. So for open source even the playing field by having a salesperson, or champion, for the tool. This can be an internal person who likes learning new things, or an outside expert (usually paid but sometimes volunteer).
  4. Let the sales teams sell, but don’t trust them: Let sales people run through their presentations, because you will learn something along the way. But at some point you also need to ask them questions that force them off your script. Force a demo of a non-contrived example, or of a feature they don’t show you the first time. Make them improvise and see what happens.
  5. Talk to other users, and make sure you find one who is not happy: Sure your organization is unique but lots of other organizations have similar needs for the basic tools – unless you have a software-based mission you probably do not want an email system that’s totally different from everyone else’s. A good salesperson will have no trouble giving you a list of references of organizations who love the tool, but if you want the complete picture find someone who hates it. They might hate it for totally unfair reasons, but they will shed light on the rough edges you may encounter. Also make sure you ask the people who love it what problems they run into, remember nothing is perfect so everyone should have a complaint of some kind.
  6. Develop a change strategy: In addition to a data migration plan you need to have a plan that covers introducing the new tool to your colleagues, training the users, communicating to leadership the risks and rewards of the new setup, and setting expectations about any disruptions the change over may cause.  I’ve seen an organization spend nearly a half million dollars on customization of a complex toolset only to have the launch fail because they didn’t make sure the staff understood that the new tool would change their day-to-day tasks.
  7. Develop a migration plan: Plan out the migration of all data, features, and functions as soon as you have your new tool selected. This is not the same thing as your change strategy, this is nuts and bolts of how things will work. Do not attempt to do this without an expert. You made yourself an expert in the field, but not of every in-and-out of the new system: hire someone who is.  That could be a setup team from the company that makes it, a 3rd party consultant, or a new internal staff person who has experience with different instances of the tool.
  8. Get staff trained on using the new tool: don’t scrimp on staff training. Make sure they have a chance to learn how to do the things they will actually be doing on a day-to-day basis.  If you can afford to have customized training arranged I highly recommend it, if you cannot have an outside person do it, consider custom building a training for your low-level internal users yourself.
  9. Develop a plan for ongoing improvement: you will not be 100% happy 100% of the time, and over time those problems will get worse as your needs shift. So make sure you are planning to consistently improve your setup. That can take many forms and what makes the most sense will vary from tool to tool and org to org, but it probably will mean a budget so ask for money from the start and build it into your ongoing budget for the project. Plan for constant improvement or you will find a growing list of pain points that push you to redo all this work sooner than expected.You’ll notice I never actually told you to make your choice. Once you’ve completed steps 1-6 you probably will see an obvious choice, of not: guess. You have a list, you listened to 20 boring sales presentations, you’ve read blogs posts, white papers, and ad materials. You now are an expert on the market and the tools. If you can’t make a good pick for your organization, no one else can either so push aside your imposter syndrome and go with your gut. Sure you could be wrong, but do the best you can and move forward. It’s usually better to make a choice than waffle indefinitely.

This Week’s Drupal Fire Drill

This week many in the Drupal community lost a lot of sleep Tuesday night because the security team treated us to a warning about major security updates due out on Wednesday. Fortunately for many it wasn’t a crisis in the end, but it gave us all a chance to practice for the worst. Basically, it was like a fire drill in a elementary school: we got to prepare like there was a disaster, but since they wasn’t one we don’t really know how it would have gone if there was actually a fire. We haven’t had a stop-drop-and-roll type of emergency in a while, so it was a good refresher on how to handle a crisis.

For those who don’t know what I’m talking about here’s a quick review. At Cyberwoven, like many Drupal shops, we follow the Drupal Security twitter feed on one of our Slack channels so we saw this mid-afternoon Tuesday:

Slack posting of tweet from the security team.

I read the PSA with images of Drupalgeddon dancing in my head:

There will be multiple releases of Drupal contributed modules on Wednesday July 13th 2016 16:00 UTC that will fix highly critical remote code execution vulnerabilities (risk scores up to 22/25).These contributed modules are used on between 1,000 and 10,000 sites. The Drupal Security Team urges you to reserve time for module updates at that time because exploits are expected to be developed within hours/days. Release announcements will appear at the standard announcement locations.

Drupal core is not affected. Not all sites will be affected. You should review the published advisories on July 13th 2016 to see if any modules you use are affected.

Oh that bold line up there wasn’t part of the original announcement. On Tuesday we didn’t have a sense of scale, were we talking about modules that everyone uses on almost every site (ctools came up more than once). It’s the one thing I wish the security team had done differently: given us that sense of scale.

I read all security postings, and make sure we take prompt steps to address them for clients as needed, but the potential here was that we’d have to update all 70+ sites in a few hours or less which is very different from your run-of-the-mill security update that often aren’t related to use cases and threat profiles for the majority of sites.

Here’s what we did next:

Tuesday

  1. Took a minute to panic, complain, and joke about pending illnesses. This is actually a useful step because it allowed me to burn off some nervous energy and then to focus on the real work.
  2. Pulled out the list of all active clients with Drupal sites, and doubled checked it for accuracy.
  3. Made sure a developer had a working repo for all sites (70+). Since we had a couple people out of the office, and some projects had been reassigned recently to different developers, this was an important step to make sure no sites fell through the cracks during a rush to update them all.
  4. Made sure we knew which were 6 sites, and which were 7 in case we are able to determine that 6 is also affected. Since I knew the announcement would likely skip D6, we needed to accept that we might have to take those sites offline for a time.
  5. Made sure leadership knew that all developers may be busy start at 16:00 UTC on Wednesday. We didn’t actually cancel anything right away, but I didn’t want anyone surprised if we were all too busy posting and testing updates to worry about things like meetings.
  6. Made sure complex projects were thought about ahead of time: sites with unusual setups or ongoing dev work that make sudden updates complex. For example we have one client that has 16 sites that all have an unusual set up, so we agreed who would handle those and made sure she was prepared.

Wednesday

  1. First thing in the morning I saw the update to the notice that gave us a sense of scale and relaxed a little, but still made sure we were fully prepped.
  2. Noon: the announcement of what modules were affected was released and a couple other developers and I immediately reviewed the releases. We relaxed once we determined none of our clients were using any of the modules listed.
  3. I reviewed code from each of the three modules to see what the change was to look for ways to improve my own code to avoid similar errors.
  4. Looked for ways to improve our response for the next time it’s not a drill.

Things would have been more exciting if we’d had to update our sites. Since we were prepared it was a matter of minutes for us to check that all our sites were secure. Each developer checked all the sites in their sandbox, and since I knew all sites were in someone’s sandbox that gave us 100% coverage without having to do lots of double checks.

I think it is too easy to look past doing the code review of modules we weren’t using but I find this kind of follow up really useful. Looking back on Drupalgeddon it’s amazing how much pain was caused by such a small error (16 characters are all that were needed to fix it). And by seeking to understand what went wrong you can look for places that you make similarly invalid assumptions.

If you read my post on making new mistakes, you also know I believe that looking for improvement is the most important detail (particularly when it turned out to be a drill, not a fire). Here’s my initial list of things to improve:

  • Have a system to automatically check every site for specific modules (this was already under development but will take a little while longer to complete).
  • Make sure at least two developers have a working sandbox for all projects at all times in case something comes out during a vacation.
  • Improve internal messaging about what to expect – template message and process.  I tossed together some disjointed thoughts for account managers. But disjointed developer thinking does not make people feel like you’re on top of things.
  • Have better tracking of outliers:  I completely missed that I had a demo site on Pantheon that did have the Coder module running.  Since it was Pantheon they alerted me to this problem and had taken steps until I could do the update myself. But it would have been bad if that site had been someplace else and/or in production.
  • Make sure everyone one knows when the actual release is coming out, and what the outcome was. Several developers were hoping to get lunch before the updates, but hadn’t done so when the announcement came out (which could have been a problem if we’d ended up busy).  And I spent the rest of the day answering one-off questions from the account team who wanted to know if the announcement had been bad news.

Ideally we’d come up with a method to automate security updates (maybe all updates), but that’s not totally straightforward.  We have to worry about required patches, non-standard setups, automated testing, and other details. There has been discussion on the Pantheon power user’s mailing list, but every shop has a slightly different workflow (like the fact that we don’t use Pantheon much at all) so we’ll need to come up with a system that accommodates our system.

Always Make New Mistakes

The first major online application I wrote was a petition for the American Friends Service Committee (AFSC) in an attempt to build support against the war in Iraq. The Iraq Peace Pledge was a success in that it gave people a place to voice their frustration and helped encourage the anti-war movement. It was a failure in the sense that the guy writing the software (me) had no idea what he was doing, MoveOn completely stole our thunder (gathering 100 times more names than we did), and it didn’t exactly prevent the war in Iraq.

What it did do was teach a small group of us that the online work was important, harder than we thought, and required skills we didn’t yet have.  I could list dozens of mistakes that we made in the course of the project, most of which were totally avoidable if any one of us had known then what we know now, but it was those mistakes that caused me to learn to constantly push to be better at what I do.

The biggest mistake we made was that we allowed ourselves to repeat mistakes. We were sloppy, and allowed the same errors to get posted over and over. My colleague and friend Mark looked at me at one point and said: “Let’s always make new mistakes from here on.”  And we pretty much did — for the next 10 years (although he still makes fun of me for misspelling “signatures” over and over on the peace pledge site).

“Always make new mistakes” became something of a slogan for us and the organization’s Web Team. We knew that we didn’t have the resources to having someone else teach us everything we needed to know, we were leading projects in a medium very few people had mastered, and we were human. So it was going to be impossible to avoid mistakes, but we could make sure that we learned from our mistakes, find ways to avoid them, and then push into fresh territory filled with new mistakes we didn’t know existed.

I still use the slogan for my own work, and encourage it with teams I work on. People laugh the first time they hear it. But they discover I’m very serious when I adjust my work process and products to deal with mistakes I failed to avoid during the previous project. And I expect them to do the same.

We all need to work hard to identify our mistakes and come up with ways to avoid them.  Some mistakes are easy to see and fix: if you have poor spelling, make sure you have someone double checking your writing.  Some mistakes are harder to see: if you use the wrong metrics you may appear to be succeeding while actually failing. And some mistakes are hard to admit: creating a web applications with no experience thinking about security meant I made just about every security blunder possible, and since I knew more about web development than anyone else around me, I resisted attempts by others to point out much I didn’t know.

Think hard about your work, look for mistakes, own them as scars you earned doing something new, and figure out how to make sure you make a different mistake next time.