The Queries Part 3 of 3

This is the third and final post in a series of posts to break down the questions from my Queries on Queries talk. The full talk is available here.

Is your solution reusable?

Migrations feel like one off processes, but teams that migrate once usually migrate again.

Have you ensured that as much of your solution as possible can be reused? Do you have a shared library of migration tools that your whole team can access? When you create new functionality are you thinking about ways to make it usable in your next project?

On any technology project you will generally benefit from designing for re-usability. I mentioned in my comments on the question about repeatability that people get tempted to see migration work as fundamentally one-off, but you need to plan for many runs. That question focused on repeating the same project; this one is about recycling parts of this project in another.

To a consultant, the value of reuse should be obvious: we like to sell projects to new clients based on successful projects for other clients. For that I want libraries of tools that developers designed to be rapidly assembled to meet a new client’s needs.

But even when I was the client, I was moving similar data into the same systems over and over. I created API libraries and rough interfaces to handle some of that work so I didn’t have to do the same tedious tasks again and again.

In both cases those libraries are only useful if whoever needs them knows they exist, has access to them, and can figure out how to leverage them.
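To make that concrete, here is a minimal sketch of what a shared helper module might look like. The module name and functions below are hypothetical, not a specific library I ship, but small utilities like these are exactly the kind of thing worth keeping in a team library:

```python
# migration_helpers.py -- a hypothetical shared module of small, reusable utilities.
# Nothing here is specific to one client; that's what makes it worth keeping.
from itertools import islice
from typing import Iterable, Iterator, List
import re


def chunked(records: Iterable[dict], size: int = 200) -> Iterator[List[dict]]:
    """Yield records in fixed-size batches, e.g. for bulk API loads."""
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch


def normalize_phone(raw: str) -> str:
    """Strip formatting so phone numbers compare cleanly across systems."""
    digits = re.sub(r"\D", "", raw or "")
    return digits[-10:] if len(digits) >= 10 else digits


if __name__ == "__main__":
    rows = [{"Phone": "(555) 123-4567"}, {"Phone": "555.765.4321"}]
    for batch in chunked(rows, size=1):
        print([normalize_phone(r["Phone"]) for r in batch])
```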

Is your migration testable?

All good processes are rigorously tested.

Do you have an automated testing solution that validates your process? Can you tell if the data migrated accurately after each test run? Do your tests cover the positive and negative cases?

Testing migrations is hard. Testing software is hard. The testing tools that developers are most familiar with are unit testing tools, which test one very small thing at a time. Multi-system data comparison is not their forte. The tools that do exist for such work tend to be quite expensive and/or so complex that the task of creating tests is nearly as hard as the task of creating the migration jobs themselves.

But just because testing is hard does not mean you shouldn’t do what you can within the budget and time you have. When you cannot use something like MuleSoft’s MUnit, you can still create queries that sanity check the migrated and generated data. You can select records for spot checking that cover the edge cases you are aware of, plus some that represent primary use cases. You can look for records that create invalid data states that would violate your new validation rules.
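As a rough sketch of what those sanity checks can look like, the queries below compare record counts between source and target copies in a staging database and flag records that would violate a hypothetical validation rule. The table and column names are assumptions for illustration, not from any real project:

```python
import sqlite3

# A minimal sketch of post-run sanity checks, assuming a staging database
# that holds both a source table and a copy of the loaded target records.
# Table and column names are illustrative.
conn = sqlite3.connect("staging.db")

checks = {
    # Did every source contact produce exactly one target contact?
    "count_match": """
        SELECT (SELECT COUNT(*) FROM source_contacts)
             - (SELECT COUNT(*) FROM target_contacts) AS difference
    """,
    # Spot records that would violate a new validation rule in the target.
    "missing_required_email": """
        SELECT id FROM target_contacts
        WHERE email IS NULL OR email = ''
        LIMIT 25
    """,
}

for name, sql in checks.items():
    print(name, conn.execute(sql).fetchall())
```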

Is your work fixable?

Migrated data often needs to be updated after the jobs have all run.

Do you have a plan to fix your data if errors are found post migration? Does your plan include ensuring you have external IDs, or other connections, so you can update all records of every type? Have you validated that this plan will work in practice?

When you do a data migration, because everything is deterministic, you feel like perfection is possible. But when you’re moving millions of records that were entered by humans, extracted by humans, mapped by humans, validated by humans, and represent human behaviors, there is a lot of room for human error.

You can either pretend your process is good enough to squeeze out the error, or build a process that allows you to fix the errors that slip through. Obviously I don’t believe the first is possible, so I encourage the second.

Make sure you can go back and update anything. If you’re migrating into a database that allows for a lot of easy changes – great. If you’re migrating into a financial system – make sure you understand the rules for editing. 
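As one hedged example of what that fixability can look like in practice: if every migrated record carries an external ID pointing back to the source system, corrections can be applied without ever looking up target IDs. The sketch below assumes the simple_salesforce Python library and a hypothetical Legacy_Id__c field; your system, credentials, and field names will differ.

```python
from simple_salesforce import Salesforce  # assumes the simple-salesforce package

# A minimal sketch of a post-migration fix-up, assuming every migrated record
# carries an external ID (Legacy_Id__c here -- a hypothetical field name) that
# points back to the source system. Credentials and values are placeholders.
sf = Salesforce(username="user@example.com", password="...", security_token="...")

corrections = [
    {"Legacy_Id__c": "A-1001", "Phone": "555-123-4567"},
    {"Legacy_Id__c": "A-1002", "Phone": "555-765-4321"},
]

for fix in corrections:
    legacy_id = fix.pop("Legacy_Id__c")
    # Upsert keyed on the external ID, so we never need the Salesforce record ID.
    sf.Contact.upsert(f"Legacy_Id__c/{legacy_id}", fix)
```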

Planning for mistakes you don’t want to have makes it far easier to recover from those mistakes when they appear.

The Queries Part 2 of 3

This is the second in a series of posts to break down the questions from my Queries on Queries talk. The full talk is available here.

Is your work repeatable?

You will need to do this more than once.

Is your process designed so you can run it over and over without error? Can you easily erase test attempts and start over from a clean slate? Do you have the capacity to do all the practice runs you need to complete your project successfully and on schedule?

Because a migration is fundamentally a one-way operation, designed to move data once, it’s tempting to build the whole process as a one-off affair. I’ve seen (even used) migration processes that required hours or days of hand-polishing data to get it to load – this is a terrible way to do the job.

A good migration process should be automated. To automate anything you need to test it. If you test something you should expect it to fail many times before it works. And when it fails you need to run it again and again until it works.

By their very nature data migrations create data – in a target system no less – and so you need a way to roll your changes back to a pre-run state for each subsequent test. I like to use a staging database for the main complex parts of my migrations. I created Salesforce2Sql just to make that so easy no one would be tempted to skip that step. When I create processes in an ETL, I like to have jobs start by deleting their own data from the staging database, so the jobs are as idempotent as possible. Run, test, adjust, repeat. If you know how many times you ran your migration process, you didn’t run the jobs enough.
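A minimal sketch of that delete-then-load pattern, assuming a simple staging table; the job name, table, and columns are placeholders:

```python
import sqlite3

# A sketch of a job that can be re-run as often as needed: it clears what it
# wrote last time before loading again. Names here are illustrative only.
JOB_NAME = "load_contacts"

def run_job(conn: sqlite3.Connection, source_rows: list[dict]) -> None:
    # Step 1: clear anything this job wrote on a previous run.
    conn.execute("DELETE FROM staging_contacts WHERE job_name = ?", (JOB_NAME,))
    # Step 2: reload from the source, tagging rows with the job that owns them.
    conn.executemany(
        "INSERT INTO staging_contacts (job_name, legacy_id, email) VALUES (?, ?, ?)",
        [(JOB_NAME, r["legacy_id"], r["email"]) for r in source_rows],
    )
    conn.commit()

conn = sqlite3.connect("staging.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS staging_contacts (job_name TEXT, legacy_id TEXT, email TEXT)"
)
run_job(conn, [{"legacy_id": "A-1", "email": "a@example.com"}])
```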

Is your work measurable?

To know you moved all the data, you must know how much data is going in and how much should come out.

Can you accurately predict your output data volume based on the input size? Do you have valid estimates of the running time required for each stage based on the data volumes? Are the estimates of expected data set size from a reliable source?

It seems like knowing how to measure your work should be obvious, but in truth most interesting migrations are not a simple record-in, record-out – they involve splitting records, combining tables, filtering data, converting tables to fields, fields to tables, and other similar adjustments. The only way to know it all worked out right is to do the math wherever you can.
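As a quick illustration of working out that math, here is a back-of-the-envelope sketch with made-up numbers for a mapping where each source household splits into an account, several contacts, and phone rows:

```python
# A sketch of predicting output volume from input counts, using invented
# numbers: each source household becomes one Account, one Contact per member,
# and one row per stored phone number.
households = 120_000
avg_members_per_household = 2.3
avg_phones_per_member = 1.4

expected_accounts = households
expected_contacts = round(households * avg_members_per_household)
expected_phones = round(expected_contacts * avg_phones_per_member)

print(expected_accounts, expected_contacts, expected_phones)
# If the post-run counts stray far from these predictions, something in the
# splitting or filtering logic deserves a closer look.
```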

It’s also important to know how long a process will take. Sometimes a few thousand records here or there doesn’t matter much, but sometimes that is a matter of hours. Particularly when running samples it’s important to know the average running time. I’m working on a project right now where we know that the first 3 million records will load in about 6 hours, while the last 45,000 records will take 12 hours.

In that project we’ve worked out those running times, and we have a good understanding of total record counts. In other projects we thought we knew, only to discover the person giving us the source record counts was quoting per-year figures instead of the total expected migration size. But with per-record estimates we can adjust expectations quickly when information changes.
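A small sketch of how those per-record estimates turn into forecasts, using rates in the same ballpark as the project described above; the larger counts at the end are purely illustrative:

```python
# Per-record (really per-hour) rates derived from rough observed timings.
fast_rate = 3_000_000 / 6        # records per hour for the straightforward bulk
slow_rate = 45_000 / 12          # records per hour for the complex tail

def estimate_hours(fast_records: int, slow_records: int) -> float:
    return fast_records / fast_rate + slow_records / slow_rate

# Re-forecast if it turns out the counts were per-year and triple in size
# (illustrative numbers only).
print(round(estimate_hours(fast_records=9_000_000, slow_records=135_000), 1))
```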

Do you scope your data migrations carefully?

Limiting bad data in your system allows for better decisions in the future.

Do you only load data into the new system that you truly need? Can you easily spot the difference between new and old records? Are there data points getting loaded that have no use case or maintenance plan in the target system?

Everyone wants to keep all their data. My entire career I have understood that storage is cheap, and big data is king. AI-driven data analytics have been around for a few years, and now generative AI has everyone’s attention; both benefit from large data sets.

All these tools are great, but they aren’t magic.

Big data processing, whether it be AI driven or not, is all about correlations. If you give a correlation engine bad data, it will give you bad results. Garbage in is still garbage out.

You only want to migrate data that’s good.  
You only want to migrate data that’s useful.
You only want to migrate data that you will maintain.

So before you start a migration make sure you know your data will fall into those categories. Organizations can always archive data they don’t migrate.
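As a hedged sketch of what that scoping can look like in code, the example below filters a source extract down to records that meet basic quality and recency rules before anything gets migrated. The file name, columns, and cutoff date are all assumptions for illustration:

```python
import csv
from datetime import date, datetime

# A sketch of scoping a source extract before migration: keep only records
# that are good, useful, and likely to be maintained. All names are placeholders.
CUTOFF = date(2018, 1, 1)

def in_scope(row: dict) -> bool:
    if not row.get("email"):                      # good: has a usable contact point
        return False
    last_activity = datetime.strptime(row["last_activity"], "%Y-%m-%d").date()
    return last_activity >= CUTOFF                # useful: recent enough to maintain

with open("source_contacts.csv", newline="") as source:
    scoped = [row for row in csv.DictReader(source) if in_scope(row)]

print(f"Keeping {len(scoped)} records; archiving the rest outside the new system.")
```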

There are other reasons more data isn’t always better. 

If your system, or data archive, is ever breached, all that extra data presents a risk to your organization. Privacy laws are steadily tightening, increasing the chances you will have to admit to your audience that you were the cause of their information falling into the hands of bad actors.

Also, old data is often bad data. Colleges often have the email addresses their applicants used squirreled away in their alumni systems. How useful do you think the AOL address I used in 1997 is to Hamilton College today? If they use it, they will fail to reach me. It provides them no value, but does provide them the chance to make mistakes. The same is true of old phone numbers, addresses, and more.

Keep the good stuff, let go of the stuff you don’t need.

The Queries Part 1 of 3

This is the first in a series of posts to break down the questions from my Queries on Queries talk. The full talk is available here.

Are your tools good enough?

Our migrations live and die by our tools.

Are your tools built for the scale of your project? Do they empower you to do your best work or impede rapid progress? Would a new tool serve you better now or in the future?

Having the right tools is critical to any job. In data migration we primarily talk about ETLs (Extract, Transform, and Load): tools like Jitterbit, Informatica, MuleSoft, Talend, etc. We also use additional tools to help support the process: a task tracker like Jira, a data modeler like Lucidchart, a staging database prep tool like Salesforce2Sql, and more.

It’s easy to say that it’s a poor carpenter who blames his tools, but anyone who has spent time with actual carpenters knows they care a great deal about which tools they use. They might be able to make do with poor tools, but they will do their best work with the right tools for the job.

Each tool you use needs to meet your team’s needs. It should play to your strengths, support the kinds of projects you do, and keep an eye to the future. A tool that works great for a team of declarative Salesforce consultants might drive developers crazy. A tool that works great for tens of thousands of records might struggle with millions; a tool scaled for tens of millions of records may be overly complex for a project of 30,000.

Make sure you’re using the tools that let you do your best work, now and in the future.

Do you make the data atomic for processing?

Smaller pieces of data are easier to track, manipulate, and test.

Do you divide the source data into its constituent parts? Can you process individual pieces of data easily and cleanly? Can you stop your process after each stage to validate the results?

It can be tempting to process data as it comes: handling whole rows of data in the form they were provided and treating each field as a single data point. In practice, exports may have extra rows or columns to deal with related records. Organizations may have encoded multiple data points into a single field, like ticket names that pack a show name, date, and time into one field. Fields can also contain semi-structured data, like Joomla’s use of arbitrary JSON blobs.

To process this data it is often easier and clearer to extract it from these structures prior to direct processing. It’s rarely strictly required, but doing this cleanup of structure – like creating interstitial database tables or predictable data objects – can greatly ease the rest of your job.
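For example, here is a minimal sketch of pulling atomic values out of a composite field, assuming ticket names were stored as something like "Spring Gala - 2023-05-12 7:30 PM". The format and field are hypothetical; real data usually needs several patterns and a reject pile for rows that don’t parse:

```python
import re
from datetime import datetime

# A sketch of splitting a composite ticket-name field into atomic parts.
# The format is an assumption for illustration.
PATTERN = re.compile(r"^(?P<show>.+) - (?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{1,2}:\d{2} [AP]M)$")

def split_ticket_name(name: str) -> dict:
    match = PATTERN.match(name.strip())
    if not match:
        return {"show": None, "raw": name}        # park unparsable rows for review
    when = datetime.strptime(f"{match['date']} {match['time']}", "%Y-%m-%d %I:%M %p")
    return {"show": match["show"], "starts_at": when, "raw": name}

print(split_ticket_name("Spring Gala - 2023-05-12 7:30 PM"))
```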

Like many problems in software engineering, it’s easier to do good work when you are operating on atomic pieces. Think about the right ways to pull your data into constituent parts when they aren’t there already.

Can you process samples of your data set?

When you have lots of data you need to test small parts to be sure your process works.

Do you know how to create and run small segments of your total input? Are your segments made up of complete and valid samples? Does your sample include all the errors and edge cases your data set will throw at your process?

If you are working with small data sets your sample can be all the data. But when you have a large data set you need to test your process with samples. When you have a multi-step migration you likely need to test the second phase while the first phase is still under construction – again a sample data set is critical.
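One way to build such a sample, sketched below with placeholder selection rules, is to combine a seeded random slice for the common cases with hand-picked records for every known edge case:

```python
import random

# A sketch of building a test sample: a seeded random slice for coverage of
# the common cases plus guaranteed spots for known edge cases. The selection
# rules are placeholders for whatever your data demands.
def build_sample(records: list[dict], size: int = 500, seed: int = 42) -> list[dict]:
    random.seed(seed)                              # same sample on every test run
    edge_cases = [r for r in records if not r.get("email") or r.get("status") == "lapsed"]
    common = random.sample(records, k=min(size, len(records)))
    # De-duplicate while keeping the edge cases guaranteed a spot.
    seen, sample = set(), []
    for record in edge_cases + common:
        if record["id"] not in seen:
            seen.add(record["id"])
            sample.append(record)
    return sample

records = [{"id": i, "email": f"user{i}@example.com", "status": "active"} for i in range(1000)]
records[10]["email"] = ""                          # inject an edge case for the demo
print(len(build_sample(records)))
```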

Having valid test data that covers all your edge cases is critical to making sure you have a working solution.

A few years ago I worked on a project that involved a two-stage migration for a membership organization with some 600,000 active contacts. Every one of them needed to be migrated into Salesforce and then into Drupal. To test the Drupal migration we needed samples of all the membership statuses we would see, which involved hand-creating several hundred records. At the next Salesforce Commons Sprint I raised the idea of needing a better tool for this kind of work, and that question eventually helped lead to Paul Prescod’s creation of Snowfakery. Snowfakery will build you testing data sets of any size and complexity to make sure your processes will succeed.

Queries on Queries: Improve your data migration

Last week I gave my Queries on Queries talk, intended to help you improve your data migration process, as a webinar for Attain Partners.  It’s a revised and improved version from the last time I gave it.

These questions aren’t like the old Joel Test (which is still useful) where the right answer is “yes”. These questions are designed to point you in a direction but allow you to change your answer over time. I generally answer these questions with a paragraph, not a single word. Use these questions as a challenge to make you and your team better.

Over the next few weeks I’m planning to publish a series that will include each query and why I think it’s useful in helping you think about how to improve your process.