This is the second in a series of posts to break down the questions from my Queries on Queries talk. The full talk is available here.
Is your work repeatable?
You will need to do this more than once.
Is your process designed so you can run it over and over without error? Can you easily erase test attempts and start over from a clean slate? Do you have the capacity to do all the practice runs you need to complete your project successfully and on schedule?
Because a migration is fundamentally a one-way operation, designed to move data once, it’s tempting to build the whole process as a one-off affair. I’ve seen (even used) migration processes that required hours or days of hand-polishing data to get it to load – this is a terrible way to do the job.
A good migration process should be automated. To automate anything you need to test it. If you test something you should expect it to fail many times before it works. And when it fails you need to run it again and again until it works.
By their very nature data migrations create data – in a target system no less – so you need a way to roll the target back to a pre-run state before each subsequent test. I like to use a staging database for the main, complex parts of my migrations. I created Salesforce2Sql just to make that so easy no one would be tempted to skip the step. When I create processes in an ETL, I like each job to start by deleting the staging data related to that job, so the jobs are as idempotent as possible. Run, test, adjust, repeat. If you know how many times you ran your migration process, you didn’t run the jobs enough.
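To make that delete-then-load pattern concrete, here is a minimal sketch in Python using SQLAlchemy. The connection string, the staging_contacts table, and the job name are all hypothetical stand-ins for whatever your ETL actually uses.

```python
# Minimal sketch of an idempotent staging load. The connection string,
# table, and job name are assumptions for illustration only.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://localhost/staging")  # assumed staging database


def run_contact_job(source_rows):
    """Reload the contact staging table from scratch so reruns start clean."""
    with engine.begin() as conn:
        # Delete everything this job owns before loading, so a rerun never
        # layers new data on top of a failed or partial attempt.
        conn.execute(
            text("DELETE FROM staging_contacts WHERE source_job = :job"),
            {"job": "contact_job"},
        )
        for row in source_rows:
            conn.execute(
                text(
                    """INSERT INTO staging_contacts (source_id, email, source_job)
                       VALUES (:source_id, :email, 'contact_job')"""
                ),
                row,
            )
```

Because the job deletes only its own rows before loading, you can rerun it as many times as you need without touching the rest of the staging database.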
Is your work measurable?
To know you moved all the data, you must know how much data is going in and how much should come out.
Can you accurately predict your output data volume based on the input size? Do you have valid estimates of the running time required for each stage based on the data volumes? Are the estimates of expected data set size from a reliable source?
It seems like knowing how to measure your work should be obvious, but in truth most interesting migrations are not a simple record-in, record-out affair – they involve splitting records, combining tables, filtering data, converting tables to fields, fields to tables, and other similar adjustments. The only way to know it all came out right is to work out the math wherever you can.
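For example, if one source table splits into two target objects and another is filtered before loading, you can write the expected counts down as a small reconciliation check. The object names, filter rate, and numbers below are hypothetical.

```python
# Hypothetical reconciliation check: predict target record counts from
# source counts, then compare against what actually loaded.
source_counts = {"people": 120_000, "gifts": 450_000}


def expected_counts(source):
    return {
        # Each person becomes a Contact plus an Account (a record "split").
        "Contact": source["people"],
        "Account": source["people"],
        # Only completed gifts migrate (a filter) -- roughly 90% in this example.
        "Opportunity": int(source["gifts"] * 0.90),
    }


def reconcile(expected, actual):
    for obj, want in expected.items():
        got = actual.get(obj, 0)
        status = "OK" if got == want else f"MISMATCH ({got - want:+d})"
        print(f"{obj}: expected {want}, loaded {got} -> {status}")


reconcile(
    expected_counts(source_counts),
    {"Contact": 120_000, "Account": 119_998, "Opportunity": 405_000},
)
```

Even a rough check like this catches the two missing Accounts long before anyone notices them in the new system.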
It’s also important to know how long a process will take. Sometimes a few thousand records here or there doesn’t matter much, but sometimes it’s a matter of hours. Particularly when running samples it’s important to know the average running time. I’m working on a project right now where we know the first 3 million records will load in about 6 hours, while the last 45,000 records will take 12.
In that project we’ve worked out those running times, and we have a good understanding of total record counts. In other projects we thought we knew, only to discover the person giving us the source record counts was quoting per-year figures instead of the total expected migration size. But with per-record estimates we can adjust expectations quickly when information changes.
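With per-record rates from sample runs in hand, re-estimating the schedule when counts change is simple arithmetic. The rates below are invented for illustration; they happen to reproduce the 6-hour and 12-hour phases mentioned above.

```python
# Hypothetical load-time estimate built from measured per-record rates.
# Replace the rates with timings from your own sample runs.
phases = [
    {"name": "simple records", "count": 3_000_000, "seconds_per_record": 0.0072},
    {"name": "complex records", "count": 45_000, "seconds_per_record": 0.96},
]

total_seconds = sum(p["count"] * p["seconds_per_record"] for p in phases)
for p in phases:
    hours = p["count"] * p["seconds_per_record"] / 3600
    print(f'{p["name"]}: ~{hours:.1f} hours')
print(f"Total: ~{total_seconds / 3600:.1f} hours")
```

When someone revises a record count, you change one number and get a new schedule instead of a new argument.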
Do you scope your data migrations carefully?
Limiting bad data in your system allows for better decisions in the future.
Do you only load data into the new system that you truly need? Can you easily spot the difference between new and old records? Are there data points getting loaded that have no use case or maintenance plan in the target system?
Everyone wants to keep all their data. My entire career I have understood that storage is cheap and big data is king. AI-driven data analytics has been around for a few years, and now generative AI gets all the attention; both benefit from large data sets.
All these tools are great, but they aren’t magic.
Big data processing, whether it be AI driven or not, is all about correlations. If you give a correlation engine bad data, it will give you bad results. Garbage in is still garbage out.
You only want to migrate data that’s good.
You only want to migrate data that’s useful.
You only want to migrate data that you will maintain.
So before you start a migration, make sure the data you’re moving falls into those categories. Organizations can always archive data they don’t migrate.
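As an illustration of that kind of scoping, the filter below keeps only records that are reasonably fresh and still usable. The field names, cutoff date, and sample records are hypothetical; the real rules should come from your requirements and maintenance plan.

```python
# Hypothetical scoping filter: migrate only records the organization
# actually needs and intends to maintain; everything else gets archived.
from datetime import datetime, timezone

CUTOFF = datetime(2015, 1, 1, tzinfo=timezone.utc)  # assumed retention boundary

source_records = [
    {"last_activity": datetime(2023, 6, 1, tzinfo=timezone.utc), "email": "a@example.com"},
    {"last_activity": datetime(1997, 3, 1, tzinfo=timezone.utc), "email": "old@aol.com"},
]


def should_migrate(record):
    """Return True only for records that are recent and complete enough to use."""
    if record["last_activity"] < CUTOFF:
        return False  # stale: archive instead of migrating
    if not record.get("email") and not record.get("phone"):
        return False  # no way to use or maintain this contact
    return True


to_migrate = [r for r in source_records if should_migrate(r)]
to_archive = [r for r in source_records if not should_migrate(r)]
```

The point isn’t the specific rules; it’s that the rules are written down, testable, and applied the same way on every run.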
There are other reasons more data isn’t always better.
If your system, or data archive, is ever breached, every record you kept becomes a risk to the organization. Privacy laws are steadily tightening, increasing the chances you will have to admit to your audience that you were the cause of their information falling into the hands of bad actors.
Also, old data is often bad data. Colleges often have the email address their applicants used squirreled away in their alumni systems. How useful do you think the AOL address I used in 1997 is to Hamilton College today? If they use it, they will fail to reach me. It provides them no value, but it does give them a chance to make mistakes. The same is true of old phone numbers, addresses, and more.
Keep the good stuff, let go of the stuff you don’t need.