A different type of DAG - data pipelines for epidemiology

workflow

data

A tour of data pipeline techniques and tools for use in academia

Published

June 11, 2025

This talk was part of a symposium on data science tools and opportunities for adoption in epidemiology. The full session description is provided below:

Most applied research and education in epidemiology does not yet benefit from modern data science. Fledgling epidemiologists may receive cutting-edge education on the theory of epidemiologic methods, but remain largely untrained in how to collect data effectively, how to apply modern analytical methods to real data sets, how to reproducibly document code and results, and how to effectively work in teams in a digital workplace. Despite their own nagging concerns, they may rely on Dr. Google for their training in algorithms, document study procedures in e-mail chains, store data in spreadsheets, copy-paste analytical code, hard-code observations per person into separate variables, and manually type out estimates into results tables – only to discover that they are asked to do it all over when three study participants turn out to be ineligible for an analysis.

This symposium will illustrate success stories on how to efficiently practice data science in epidemiology and how to teach it along the way. There will be no exhortations about how Excel is bad or how good people practice code sharing. Instead, the symposium will discuss cutting-edge approaches and real-life use cases of how modern data science has made research and teaching more efficient. The goal is for attendees to bring home a sparkling, vetted toolkit of new ideas and tools for research and teaching.