Analyses as packages

data science
good practice
R
reproducibility
new zealand
public sector
File structure and clarity for analyses
Author

Shrividya Ravi

Published

February 9, 2024

R emphasises a functional style: decompose a big problem into smaller pieces, then solve each piece with a function or a combination of functions [1]. If a code project is composed mainly of functions, they can be bundled into a package for ease of use across multiple projects or shipped to other users. The package structure enables thorough documentation (functions with Roxygen, usage with vignettes) and includes tests that can be run every time a change is pushed. These virtues make packages the right project structure for the three common data science outputs: analyses, tools and applications. In other words, everything is a package [2].
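As a minimal sketch of what a documented, tested function might look like inside such a package: the function name `mean_by_group` and the toy data are hypothetical and chosen only for illustration. The check at the end uses base R's `stopifnot()` so the snippet runs on its own; in a package it would normally be written as a testthat test under `tests/testthat/`.

```r
#' Mean of a numeric column within each group
#'
#' Hypothetical helper, used here only to illustrate Roxygen comments.
#'
#' @param df A data frame.
#' @param group Name of the grouping column, as a string.
#' @param value Name of the numeric column, as a string.
#' @return A data frame with one row per group and the group mean.
#' @export
mean_by_group <- function(df, group, value) {
  stopifnot(is.data.frame(df), group %in% names(df), value %in% names(df))
  # aggregate() returns one row per distinct value of the grouping column
  out <- aggregate(df[[value]], by = list(df[[group]]), FUN = mean)
  names(out) <- c(group, paste0("mean_", value))
  out
}

# Toy data (made up for this example)
toy <- data.frame(site = c("a", "a", "b"), count = c(1, 3, 10))
res <- mean_by_group(toy, "site", "count")

# A basic check; in a package this would live in tests/testthat/
stopifnot(isTRUE(all.equal(res$mean_count, c(2, 10))))
```

The Roxygen comments (lines starting with `#'`) are plain comments when the file is run directly; inside a package they are what the help page for the function is generated from.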

From a more pragmatic perspective, the R package structure is consistent enough to serve as a cookie-cutter structure for data analyses [3]. Project structures anchor development and offer a consistent interface for other users. Python has several cookie-cutter options because of its wide range of coding styles, whereas the single structure of an R package is flexible enough for common outputs like a web app, a report or just a basic analysis.

Snapshot of files for an EDA project structured as an R package. [3]

Denis Gontcharov provides an excellent rundown of how the package structure can be used for the most common task faced by data professionals: exploratory data analysis (EDA) [3]. Let’s summarise the key folders and how they facilitate an EDA:
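One way to picture such a layout is the sketch below, which builds an empty skeleton in a temporary directory using only base R. The folder roles follow standard R package conventions and are an assumption about how an EDA could be mapped onto them, not necessarily the exact breakdown given in [3]; the package name `myeda` is hypothetical.

```r
# Hypothetical package name; the skeleton is created in a temporary directory.
pkg <- file.path(tempdir(), "myeda")

# Conventional R package folders and the role each could play in an EDA.
folders <- c(
  "R"              = "reusable functions for loading, cleaning and plotting",
  "data-raw"       = "raw inputs and the scripts that process them",
  "data"           = "cleaned datasets saved as .rda files",
  "vignettes"      = "the analysis narrative as R Markdown documents",
  "tests/testthat" = "unit tests for the functions in R/",
  "man"            = "help pages generated from Roxygen comments"
)

for (f in names(folders)) {
  dir.create(file.path(pkg, f), recursive = TRUE, showWarnings = FALSE)
}

# A DESCRIPTION file identifies the folder as a package.
# A real one needs more fields (for example authors and a licence).
writeLines(
  c(
    "Package: myeda",
    "Title: Exploratory Analysis Structured as a Package",
    "Version: 0.0.1"
  ),
  file.path(pkg, "DESCRIPTION")
)

# Show each folder next to its purpose.
print(data.frame(folder = names(folders), purpose = unname(folders)))
```

In practice the skeleton would usually be generated with helper tooling rather than by hand; the point here is only to show which folder holds which part of the analysis.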

More information on how to work with R packages can be found in the online book on R packages [4]. I recommend the chapter, “The Whole Game”, for a step-by-step tutorial for creating a basic package.

The R package structure organises code so that analyses can be built as reproducible analytical pipelines. In all my years of working as a data scientist, simple approaches like this have yielded the greatest dividends in productivity and innovation. Reducing the overhead of thinking about project structure leaves headspace for innovative analyses and for shipping better-quality code.

Credit

Photo by Mediamodifier on Unsplash

References

[1]
Hadley Wickham, “Functional programming · Advanced R.” https://adv-r.hadley.nz/fp.html (accessed Apr. 14, 2022).
[2]
An overview of testing in R, (Sep. 10, 2020). Accessed: Jan. 15, 2024. [Online Video]. Available: https://www.youtube.com/watch?v=SBh-Z1tZtCk
[3]
D. Gontcharov, “Put your Data Analysis in an R Package - Even if You Don’t Publish it,” Feb. 29, 2020. https://towardsdatascience.com/put-your-data-analysis-in-an-r-package-even-if-you-dont-publish-it-64f2bb8fd791 (accessed Aug. 10, 2022).
[4]
H. Wickham, “R Packages (2e),” 2022. https://r-pkgs.org/ (accessed Aug. 10, 2022).