From Legacy Code to Pipeline

data science
reproducibility
public sector
Converting manual legacy processes to reproducible analytical pipelines.

Legacy code is a burden for any developer. Depending on the state of the code, maintenance and improvements can be far from simple. In my experience in the public sector, legacy code for analyses and ETL (Extract, Transform, Load) is typically neither written by developers conversant in modern software development practices (version control, automation, unit testing and so on) nor particularly well documented. As a result, maintenance is time-consuming, manual and unwieldy. In my recent government role, the code bases I inherited were a nightmare to maintain and use until I discovered the delightful trifecta of jupyter, saspy and exchangelib.

I’ve given a couple of talks on this topic: a long version with only slides here and a short version with slides and a 5-minute video here. I’ve also written an introductory blog post here.