Pantheon is an excellent hosting service for both Drupal and WordPress sites. But to make their platform work and scale well, they build a number of limits into it, including process time limits and memory limits. While those limits are large enough for the majority of projects, large projects can run into trouble.
For data loading, their official answer is typically to copy the database to another server, run your job there, and copy the database back onto their server. That’s fine if you can afford to freeze updates to your production site, have the time to set up a process to mirror changes into your temporary copy, and can absorb some additional project overhead. But sometimes those things are not an option, or the data load takes too long, or happens too often, for that to be practical on a regular basis.
I recently needed to do a very large import of records into Drupal on a Pantheon-hosted site. To minimize user impact, I started to play around with solutions that would allow me to ignore those time limits. We were looking at doing about 50 million data writes for the project. When I first estimated the process, the running time was over a week.
The Outline
Since Drupal’s batch system was created to solve this exact problem, it seemed like a good place to start. For this solution you need a file you can load and parse in segments, like a CSV file, which you can read one line at a time. It does not have to represent the final state of your data. While you can load the data raw, you can also load each record into a table or a queue to process later.
One quick note about the code samples: I wrote these based on the service-based approach outlined in my post about batch services and the batch service module I discussed there. They could be adapted to a traditional batch job, but I like the clarity the wrapper provides for this discussion.
The general concept here is that we upload the file and then progressively process it from within a batch job. My code samples below provide two classes to achieve this. The first is a form that provides a managed file field, which creates a file entity that can be reliably passed to the batch processor. From there the batch service uses a bit of basic PHP file handling to copy data into the database. If you need to do more than load the data into the database directly (say, create complex entities or other tasks), you can set up a second phase to run through the values and do that heavier lifting.
Load your file
To get us started the form includes this managed file:
$form['file'] = [
  '#type' => 'managed_file',
  '#name' => 'data_file',
  '#title' => $this->t('Data file'),
  '#description' => $this->t('CSV format for this example.'),
  '#upload_location' => 'private://example_pantheon_loader_data/',
  '#upload_validators' => [
    'file_validate_extensions' => ['csv'],
  ],
];
The managed file form element automagically gives you a file entity, and the value in the form state is the ID of that entity. The file will be temporary and have no references once the process is complete, so depending on your site setup it will eventually be purged. All of which means we can pass the values straight through to our batch processor:
$batch = $this->dataLoaderBatchService->generateBatchJob($form_state->getValues());
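That call typically sits in the form's submit handler, with the result handed to core's batch_set(). A minimal sketch, assuming generateBatchJob() returns a standard batch definition array (the method name mirrors the example above; everything else is illustrative):

public function submitForm(array &$form, FormStateInterface $form_state) {
  // Build the batch definition from the submitted values (including the
  // managed file ID) and hand it to core's batch runner.
  $batch = $this->dataLoaderBatchService->generateBatchJob($form_state->getValues());
  batch_set($batch);
}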
If the data file is small, a few thousand rows at most, you can load it right away. But that runs into both the time and memory concerns this whole approach is meant to avoid. With my approach we can ignore those limits and are only bounded by Pantheon’s upload file size. If the file is too large even for that, you can upload it via SFTP and read it from the file system, so you have options.
As we set up the file for processing in the batch job, we really need the file path, not the ID. The main reason to use a managed file is that Drupal can reliably give us the file path on a Pantheon server without us needing to know where they have things stashed. Since we’re about to use generic PHP functions for file processing, we need to know that path reliably:
$fid = array_pop($data['file']);
$fileEntity = File::load($fid);
$ops = [];
if (empty($fileEntity)) {
  $this->logger->error('Unable to load file data for processing.');
  return [];
}
$filePath = $this->fileSystem->realpath($fileEntity->getFileUri());
$ops = ['processData' => [$filePath]];
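Inside the batch service, generateBatchJob() turns that $ops map into a batch definition. The wrapper module from the earlier post handles this plumbing; as a rough equivalent using core's BatchBuilder (assuming the service uses DependencySerializationTrait so its own methods can safely be used as batch callbacks), it might look something like:

use Drupal\Core\Batch\BatchBuilder;

$builder = (new BatchBuilder())
  ->setTitle($this->t('Loading data file'))
  ->setErrorMessage($this->t('The data load hit an error.'));
foreach ($ops as $method => $args) {
  // Each entry maps a method on this service to its arguments; the batch
  // context is appended as the final argument when the operation runs.
  $builder->addOperation([$this, $method], $args);
}
return $builder->toArray();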
Create your batch
Now we have a file and know where it is. Since it’s a CSV we can load a few rows at a time, process them, and then loop back for more.
Our batch processing function needs to track two things in addition to the file: the header values and the current file position. So on the first pass we initialize the position to zero and load the first row as the header. On every pass after that we need to find the point where we left off. For this we use generic PHP file functions to open the file and seek to the current location:
// Old-school file handling.
$path = array_pop($data);
$file = fopen($path, "r");
...
fseek($file, $filePos);

// Each pass we process 100 lines; if you have to do something complex
// you might want to reduce the run.
for ($i = 0; $i < 100; $i++) {
  $row = fgetcsv($file);
  if (!empty($row)) {
    $data = array_combine($header, $row);
    $rowData = [
      'col_one' => $data['field_name'],
      'data' => serialize($data),
      'timestamp' => time(),
    ];
    $row_id = $this->database->insert('example_pantheon_loader_tracker')
      ->fields($rowData)
      ->execute();
    // If you're setting up for a queue you include something like this.
    // $queue = $this->queueFactory->get('example_pantheon_loader_remap');
    // $queue->createItem($row_id);
  }
  else {
    break;
  }
}
$filePos = (float) ftell($file);
$context['finished'] = $filePos / filesize($path);
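The snippet above assumes $header and $filePos are carried over from one pass to the next. In a standard batch operation that bookkeeping usually lives in $context['sandbox']; a minimal sketch of the first-pass setup, using variable names that just mirror the example, might be:

// At the top of the batch operation, before the loop shown above.
if (empty($context['sandbox'])) {
  // First pass: read the header row and note where the data starts.
  $handle = fopen($path, "r");
  $context['sandbox']['header'] = fgetcsv($handle);
  $context['sandbox']['file_pos'] = ftell($handle);
  fclose($handle);
}
$header = $context['sandbox']['header'];
$filePos = $context['sandbox']['file_pos'];

// After the loop shown above, carry the new position into the next pass.
$context['sandbox']['file_pos'] = $filePos;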
The example code just dumps this all into a database table. That can be useful on its own as a raw data loader if you need to add a large data set to an existing site for reference data or something similar. It can also be used as the base for creating more complex objects. The example code includes comments about generating a queue worker to run on cron or as another batch job, and the Queue UI module provides a simple interface to run those as a batch job.
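If you go the queue route, the worker that drains example_pantheon_loader_remap on cron is a standard QueueWorker plugin. A minimal sketch follows; the plugin ID matches the queue name used in the comments above, the module name is assumed from the example, and the processing body is purely illustrative:

namespace Drupal\example_pantheon_loader\Plugin\QueueWorker;

use Drupal\Core\Queue\QueueWorkerBase;

/**
 * Processes rows staged by the data loader.
 *
 * @QueueWorker(
 *   id = "example_pantheon_loader_remap",
 *   title = @Translation("Example Pantheon loader remap"),
 *   cron = {"time" = 60}
 * )
 */
class ExamplePantheonLoaderRemap extends QueueWorkerBase {

  /**
   * {@inheritdoc}
   */
  public function processItem($data) {
    // $data is the $row_id queued above; load the staged row from
    // example_pantheon_loader_tracker and do the heavier lifting here.
  }

}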
Final Considerations
I’ve run this process for several hours at a stretch. Pantheon does have issues with system errors if a batch job is left running for extreme stretches; I ran into problems on some runs after 6-8 hours of run time. So loading the raw data into the database first, and then working through it from a queue that can be restarted, has been more reliable.