How to Avoid Painful Data Problems by Following These Best Practices
Many companies struggle to get visibility into their data, but it doesn't have to be that hard. By following a few best practices, data integration becomes easy to set up and manage, giving you insights across all your systems.
Traceability
Most people think that you just get data from an API, stuff it into a database and then you are done. Voilà! Instant insights!
The reality is that data integration and aggregation is a little more complicated than that.
When we import data, we do several things in order to ensure that we can trace back the history of the data over time:
- Job History: We store a history of what data was picked up, from where, and when. If we need to re-load the data because of an incomplete file, this makes it easy to delete the bad data and re-load it.
- Source Systems Logging: Data often needs to be aggregated from multiple systems for useful reporting. When we aggregate data, we also record which system each record was sourced from, making it easy to see which systems are reporting which numbers.
- File Archiving: Occasionally data gets corrupted and needs to be re-loaded from source systems. We archive the original raw data from imported systems as files so we can recover it from the original source without issue.
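The job-history idea above can be sketched with a small table. This is a minimal illustration using SQLite; the table and column names are assumptions for the example, not a fixed schema.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical job_history table: one row per file picked up from a source.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_history (
        job_id    INTEGER PRIMARY KEY,
        source    TEXT NOT NULL,  -- which system the data came from
        file_name TEXT NOT NULL,  -- which file was picked up
        loaded_at TEXT NOT NULL,  -- when it was loaded (UTC)
        status    TEXT NOT NULL   -- 'complete' or 'incomplete'
    )
""")

def record_job(source, file_name, status):
    # Record what was picked up, from where, and when.
    conn.execute(
        "INSERT INTO job_history (source, file_name, loaded_at, status) "
        "VALUES (?, ?, ?, ?)",
        (source, file_name, datetime.now(timezone.utc).isoformat(), status),
    )

record_job("billing_api", "invoices_2024-01-01.csv", "complete")
record_job("billing_api", "invoices_2024-01-02.csv", "incomplete")

# An incomplete load is easy to find, so its rows can be deleted and re-loaded.
bad = conn.execute(
    "SELECT file_name FROM job_history WHERE status = 'incomplete'"
).fetchall()
```

With this history in place, the "delete bad data and reload" step becomes a query against `job_history` rather than guesswork.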
Re-Runnability
We build import processes that can be re-run when necessary (either at the database level or at the day level). To do this we follow a few best practices when importing data:
- Stage Data: We stage the data before importing it into the destination table. This helps us verify that the upload was complete before adding the data to existing tables in the database.
- Insert vs. Update: Depending on timing, data often needs to be updated rather than inserted. We follow best practices for identifying which records should be updated vs. inserted.
- Raw Data vs. Aggregated Data: We always store a raw data table per API that serves as the 'source of truth.' From there, data is aggregated and correlated into other tables for performance. This makes it easy to re-load aggregated data if failures happen along the way.
Logging
We build standardized logging at the database and application level so we can trace back the history of every job. Some of the logs we create are:
- Application-Level Logging: We store logs at the application level that show the overall health of all integration processes.
- Module-Level Logging: We create a log file per process that stores key updates as the process runs. This gives insights into errors that happen within each individual job.
- Job-Level Logging: We write database log records showing run times, file processing history, and source and destination files. This makes it easy to verify that each file processed completely, and to compare job run times and file sizes over time.
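The module-level logging described above (one log per process) can be sketched with Python's standard logging library. The process name and messages here are made up for the example.

```python
import logging

def get_process_logger(process_name, log_path=None):
    # One named logger per integration process; in production each would
    # write to its own file (pass log_path), here we fall back to stderr.
    logger = logging.getLogger(process_name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on re-use
        handler = (logging.FileHandler(log_path) if log_path
                   else logging.StreamHandler())
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

log = get_process_logger("billing_import")
log.info("picked up invoices_2024-01-02.csv")
log.error("row 511: missing customer_id, skipping")
```

Each process gets its own named logger (and its own file), so errors within an individual job are easy to isolate.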
Quality Checking
We build checks into our processes to ensure the quality of the data across calculated tables, including:
- Duplicate Checking: We check for duplicate inserts as part of our denormalization process to ensure that data isn't double-counted.
- Total Checking: We verify that totals of key metrics match in all denormalized tables to ensure that calculations are correct.
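Both checks above can be sketched as simple comparisons. This example treats rows as plain dicts for clarity; in practice these checks would typically run as SQL against the warehouse, and the field names are assumptions.

```python
raw = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 25.0},
]
denormalized = [
    {"order_id": 1, "amount": 10.0, "region": "EU"},
    {"order_id": 2, "amount": 25.0, "region": "US"},
]

def check_duplicates(rows, key):
    # No key value may appear twice, or the data gets double-counted.
    ids = [r[key] for r in rows]
    return len(ids) == len(set(ids))

def check_totals(source_rows, derived_rows, metric):
    # Key-metric totals must match between the raw and denormalized tables.
    return (sum(r[metric] for r in source_rows)
            == sum(r[metric] for r in derived_rows))

assert check_duplicates(denormalized, "order_id")
assert check_totals(raw, denormalized, "amount")
```

If either check fails, the job can stop before the bad numbers ever reach a report.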
Alerting
We create alerts via channels like SMS, Slack and email when errors occur, making sure that problems can be quickly identified and resolved.
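As one sketch of the Slack channel, an alert can be posted to a Slack incoming webhook with the standard library alone. The webhook URL below is a placeholder, and the pluggable `opener` parameter is an assumption added for testability; email or SMS would follow the same shape with a different sender.

```python
import json
import urllib.request

def format_alert(job, error):
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return {"text": f":rotating_light: job '{job}' failed: {error}"}

def send_slack_alert(payload, webhook_url, opener=urllib.request.urlopen):
    # POST the JSON payload; `opener` is injectable so this can be tested
    # without a network call.
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return opener(req)
```

Wiring this into the logging layer (fire an alert whenever a job logs an error) is what turns the logs above into problems that get noticed and resolved quickly.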