Many of the projects we end up working on are replacing existing systems with existing data either wholly or in part. In all of the above projects we end up writing data migration or data conversion code to move the data from legacy systems to the new systems. Many stake holders of the project such as business users, project managers, business analysts really care about the data conversion scripts and the quality of the conversion. Since this conversion is business entity related and matters a lot as future business/functionality depends on the data being logically equivalent to the legacy system.
Over the years we have found many techniques that help in testing the quality of the data conversion. Here are the 8 techniques that encompass our learnings when converting data over from legacy databases.
Start data conversion work earlier
We have found that converting data earlier in the project life cycle is very helpful even when our data model is not stable yet. This is useful in many ways
- It helps by forcing the developers to think and account for data in the legacy database that may not be covered by specs provided by the business analysts
- Business analysts can communicate with the business users using real data from the legacy system that the business is used to seeing and understands the data.
- Business users when using the new system before it goes live can understand how their new system behaves with legacy data and business entities such as Customers, Products etc. Giving the business users familiarity with the system and easing their transition to the new system.
- When its time to go-live we have converted the data so many times that its no longer a surprise, as all the bugs, data weirdness have been found and dealt with.
Automated compare of data
We compare the data from the legacy database with the new application database that is being developed. This comparison can be automated using sql that creates logically equivalent objects from both databases.
- I blogged about using Automated Data Compare of comparing data.
- We could also use DiffKit which is an open source framework that lets you compare databases, excel spreadsheets, flat files or custom formats on the legacy database side with databases, excel spreadsheets, flat files or custom formats with the new databases. Or use frameworks that export the database into xml, yml etc formats such as yaml_db
Dealing with duplicates
In legacy databases, as the system gets used over many years some business entities may get duplicated and during data conversion will get merged/collapsed into a single entity. So while comparing data we should remember to do UNIQUE or DISTINCT on the SQL we use to compare the data. In some cases we may end up normalizing some of the data and end up with multiple rows for something that was represented using one row in the legacy database.
Dealing with magic values
Systems use magic values to represent data or state of certain business process, such as NULL, 0, N/A, Nil and other strings. When converting we cannot convert these values as-is, we need to understand what do they mean, are these values being converted to mean something else on the application front end? are we using the same logic to convert the data?
Dealing with required data not existing
Some data elements maybe required in the new application that do not exist in the legacy database, we cannot just put null values or some made up data but have to carefully consider what kind of data is needed based on the properties of the new application and the domain entity being converted.
Dealing with string length, number precision
Some string data maybe longer in the legacy database than that which can be accepted by the new application database, we should raise exceptions for these conditions and fix the new application appropriately. Similar conditions can also apply for numeric columns such as ids, amounts, quantities etc, where we may loose precision of numbers being saved.
Soft deleted rows and their children
In some systems rows are not physically deleted but are marked as deleted using flags or boolean values also known as Soft Delete. When converting this data to the new system, we should remember to not convert these rows if the new system does not use soft deletes. At the same time, how do we deal with children's of rows that have been soft deleted?
Performance of the data conversion
How much time do we have to convert the data? can the data conversion code restart from the start when some unexpected error is reported and the process dies. What if we don't have time to do a clean cut over from the old system? how do we convert data over a period of time while the old (legacy) and the new systems are both in production? These requirements have to taken into consideration when writing the data conversion.