<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Emily Riederer</title>
<link>https://emilyriederer.com/</link>
<atom:link href="https://emilyriederer.com/index.xml" rel="self" type="application/rss+xml"/>
<description></description>
<generator>quarto-1.8.26</generator>
<lastBuildDate>Mon, 26 Jan 2026 06:00:00 GMT</lastBuildDate>
<item>
  <title>The Test Set Pod - Column selectors, data quality, and learning in public (Season 1, Episode 14)</title>
  <link>https://emilyriederer.com/talk/test-set-pod/</link>
  <description><![CDATA[ 





<section id="quick-links" class="level2">
<h2 class="anchored" data-anchor-id="quick-links">Quick Links</h2>
<ul>
<li><a href="https://posit.co/thetestset/episode/emily-riederer-column-selectors-data-quality-and-learning-in-public/">Test Set Episode Website</a><br>
</li>
<li><a href="https://podcasts.apple.com/us/podcast/emily-riederer-column-selectors-data-quality-and/id1823736938?i=1000746761256">Apple Podcasts</a> | <a href="https://open.spotify.com/episode/0JslJ1bpUKNd6L469ObZBl">Spotify</a> | <a href="https://www.youtube.com/watch?v=Yjmu18r_j64">YouTube</a></li>
</ul>
</section>
<section id="episode-notes" class="level2">
<h2 class="anchored" data-anchor-id="episode-notes">Episode Notes</h2>
<p>Emily’s had a wild ride through modeling, data engineering, machine learning, and back again, and she knows a thing or three about the evolution of SQL tooling (from nightmare multi-page scripts to the dbt renaissance). She reveals how building internal packages became her gateway to making work enjoyable. Plus: the surprising Stata origins of column selectors, the eternal struggle of naming packages across R and Python, and why watching people code teaches you more than any tutorial ever could. The conversation gets real about imposter syndrome and the magic of tacit knowledge.</p>


</section>

 ]]></description>
  <category>python</category>
  <category>rstats</category>
  <category>sql</category>
  <category>data</category>
  <guid>https://emilyriederer.com/talk/test-set-pod/</guid>
  <pubDate>Mon, 26 Jan 2026 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/talk/test-set-pod/featured.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>R + Python: From polyglot to pluralism</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/py-rgo-crosspol/</link>
  <description><![CDATA[ 





<p>In October, the Python Software Foundation <a href="https://pyfound.blogspot.com/2025/10/NSF-funding-statement.html">announced</a> that it had made the difficult decision to forgo a $1.5M National Science Foundation grant. The grant was intended to address structural vulnerabilities in the Python language and PyPI, but came with the unpalatable stipulation that no part of the PSF as a whole could “operate any programs that advance or promote DEI [diversity, equity, and inclusion]”. This followed a similar <a href="https://carpentries.org/blog/2025/06/announcing-withdrawal-of-nsf-pose-proposal/">announcement</a> over the summer from The Carpentries, a non-profit devoted to teaching researchers good practices in reproducible software engineering and data management.</p>
<p>But what even is open source without diversity? What would it even mean to “promise” to strip diversity from an ecosystem predicated on identifying shared needs and collaboratively developing solutions across different timezones, cultures, and contexts?</p>
<p>Looking back on 2025, nothing illustrated the importance of diversity across open source communities better than <a href="https://posit.co/conference/">posit::conf(2025)</a>. My most memorable session<sup>1</sup> was titled “Sparking Developer Joy”.</p>
<p>Taken at face value, the session offered fantastic and tangible ideas for improving the developer/user experience in both R and python. I’d recommend it on those merits alone. But on a subtler and unspoken level, the combination of talks served as a celebration of how much stronger both the R and python ecosystems have become as developers who “grew up” in their different contexts and communities cross-pollinate ideas.</p>
<p>I end the year with a brief reflection on the session and what it represents about the broader state of python and R tooling (with a few digressions on core R history). I won’t attempt to recap these great talks in full; I highly encourage everyone to watch them on YouTube for a clarity of thought and level of detail to which I won’t aspire. Instead, I’ll just briefly pull on the throughline: <strong>intellectual diversity makes communities and tools better</strong>.</p>
<p><em>But don’t let me distract you. As the clock winds down on 2025, if you are considering making charitable donations, consider helping the PSF or The Carpentries fill the gap that their values have cost them.</em></p>
<section id="r-package-empathy-in-python" class="level2">
<h2 class="anchored" data-anchor-id="r-package-empathy-in-python">R package empathy in python</h2>
<p>R has always known its target audience. Built by statisticians, it just gets us data people.</p>
<p>Rich Iannone and Michael Chow gave stunning back-to-back talks on how to build python packages with better user experiences. Rich focused broadly on how to make a package “nice” while Michael detailed a specific, novel approach to address a common gap in documentation. Both pointed to examples from Great Tables (<a href="https://gt.rstudio.com/">R</a>/<a href="https://posit-dev.github.io/great-tables/articles/intro.html">python</a>) and pointblank (<a href="https://rstudio.github.io/pointblank/">R</a>/<a href="https://posit-dev.github.io/pointblank/">python</a>), two tools that they developed first for R and then for python.</p>
<p>Each talk illustrated how the strong community norms from R package development have already and can continue to improve the python analogs.</p>
<p><img src="https://emilyriederer.com/post/py-rgo-crosspol/rich.png" class="img-fluid"></p>
<p>Rich’s talk <a href="https://www.youtube.com/watch?v=J6e2BKjHyPg">Making things nice in python</a> defined five aspects of making python packages “nice”. The first is making them compatible with existing workflows; the remaining four are:</p>
<ul>
<li>Accommodating different needs / backends (e.g.&nbsp;DataFrame interoperability with <code>narwhals</code>)</li>
<li>Making cumbersome things easy by adopting language-standard syntactic sugar (e.g.&nbsp;<code>polars</code> column selectors)</li>
<li>Making learning easy (e.g.&nbsp;with user guides, examples, blogs, in addition to a standard API reference)</li>
<li>Shipping batteries-included for onboarding with built-in example datasets</li>
</ul>
<p>Of these, I claim 2-4 as having distinctly R-like roots. Interoperability, syntactic sugar, and making the cumbersome easy echo the tidyverse design principles, and shipping sample data in packages is a “first class” activity in R package development with the standard <code>data/</code> directory for this specific purpose.</p>
<p>In fact, beyond packages, the core R language has shipped with sample data since… before it existed! In-built datasets trace back to R’s precursor S. The <code>iris</code> help page cites the 1988 book “The New S Language” as including this dataset, although it then appeared as a three-dimensional array now found in the built-in <code>iris3</code> object.</p>
<p><img src="https://emilyriederer.com/post/py-rgo-crosspol/michael.png" class="img-fluid"></p>
<p>Michael’s talk <a href="https://www.youtube.com/watch?v=ML8z8xkqIA0">The curse of documentation</a> doubled down on the importance of “medium weight” documentation to fill the chasm between an atomic API reference and an all-in gallery of fully baked examples. Specifically, he argued for the importance of long-form documentation that helps users understand the mental model and core philosophy behind a package API so they can reason about it. He then went on to provide some brilliant frameworks for writing such resources, but I’ll leave you to watch the video since that’s not the point of this post.</p>
<p>Here, again, I cannot help but feel the R-native influence on these suggestions. Vignettes have deep roots in the tradition of R packages. <a href="https://www.r-project.org/doc/obit/fritz.html">Fritz Leisch</a>, an early pioneer on the R Core Development Team, first unlocked the ability to render R code in LaTeX through Sweave (a precursor to R Markdown and then Quarto), which first enabled vignettes. The concept of vignettes was formalized with the release of <a href="https://stat.ethz.ch/pipermail/r-announce/2011/000544.html">R 2.14.0 (“The Great Pumpkin”) in 2011</a>, which introduced the <code>vignettes/</code> directory distinct from the <code>inst/doc/</code> directory which stores reference-level documentation (the exact contrast Michael draws in his talk!). The convention became so prevalent that each Bioconductor package <a href="https://www.bioconductor.org/help/package-vignettes/">requires at least one vignette</a>. Later, vignettes gained increasing visibility as R Markdown removed the LaTeX dependency, making vignettes easy (and dare I say joyful?) to write, and more recently <code>pkgdown</code> helped create aesthetic documentation websites featuring vignettes framed as <a href="https://pkgdown.r-lib.org/reference/build_articles.html">articles</a>.</p>
</section>
<section id="python-rigor-devtools-for-r" class="level2">
<h2 class="anchored" data-anchor-id="python-rigor-devtools-for-r">Python-rigor devtools for R</h2>
<p>What made this session even more powerful, however, was the reminder that ideas do not flow in just one direction. We also saw <a href="https://www.youtube.com/watch?v=DJVSEOjvwb8">Davis Vaughn and Lionel Henry present on Air</a>, an R language server and code formatter which promises blazing speed and a stunning hex sticker.</p>
<p><img src="https://emilyriederer.com/post/py-rgo-crosspol/air.png" class="img-fluid"></p>
<p>While the Air <a href="https://github.com/posit-dev/air">README</a> acknowledges multiple non-R inspirations, the most obvious parallel is to Astral’s <code>ruff</code>, a performant python code styler also written in Rust.</p>
<p>Air isn’t the only present from Posit to R developers. A similar and exciting project from Posit with a clear python analog is <a href="https://github.com/r-lib/rig">rig</a>. Rig helps developers manage and switch between multiple R installations with a look, feel, and aspirations similar to python’s <code>pyenv</code>. While installation woes and simultaneous installation management tend to be more critical to the python language, bringing this modern developer tooling to the R world is a useful contribution to developing more robust tooling.</p>
<p>Beyond these more concrete examples, python has long been a first-mover in bringing engineering practices into data work due, in turn, to the diversity of its own community mixing DS and traditional software engineering audiences. Countless testing<sup>2</sup>, orchestration, and CI/CD tools in R have followed python corollaries, likely due to some amount of influence.<sup>3</sup></p>
</section>
<section id="it-goes-both-ways" class="level2">
<h2 class="anchored" data-anchor-id="it-goes-both-ways">It goes both ways</h2>
<p>This also is not to say that R is <em>uniquely and exclusively</em> good at design and python is <em>uniquely and exclusively</em> good at engineering. It goes both ways. For example, python’s <code>scikit-learn</code> API has had an undeniable impact on <code>tidymodels</code>, as the R community sought an analogous all-encompassing tool for modeling with common workflows for training pipelines; conversely, while CRAN has its ups and downs, Julie Tibshirani recently published a <a href="https://jtibs.substack.com/p/if-all-the-world-were-a-monorepo">reflection</a> on the value of the paradigm, perhaps with takeaways for the wild west of PyPI.</p>
<p>The examples are endless, and enumeration is not the point. I’ll just end the year saying that I’m thinking about the room I was in in Atlanta back in August where people with wildly different backgrounds came together to learn and incidentally happened on the perfect example of how different contexts colliding makes them better. Should we be so lucky.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>And memory is no small feat given that I’d made the questionable decision to get up at 2AM to fly in that same day and had been awake about 14 hours by the time of the session↩︎</p></li>
<li id="fn2"><p>Testing was also the topic of the final talk in this session by Libby McKenna on <a href="https://www.youtube.com/watch?v=GEP61GjwTjE"><code>testthat</code></a>. I focus on it less here as there’s less of a clear R/python linkage, but it provides a very nice introduction to code testing. Check it out!↩︎</p></li>
<li id="fn3"><p>A related discussion happened on a recent episode of <a href="https://posit.co/thetestset/episode/kelly-bodwin-quarto-hacks-ai-in-the-classroom-and-why-r-should-stay-weird/">The Test Set</a> podcast as the crew debated whether R or python is more diverse with python spanning more distinct domains (e.g.&nbsp;DS vs SWE) but R having much more breadth <em>among data people</em> (e.g.&nbsp;biostats, polisci, etc.)↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>rstats</category>
  <category>python</category>
  <guid>https://emilyriederer.com/post/py-rgo-crosspol/</guid>
  <pubDate>Tue, 30 Dec 2025 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/py-rgo-crosspol/air.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Python Rgonomics: User-defined functions in polars</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/py-rgo-udf/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://emilyriederer.com/post/py-rgo-udf/featured.jpg" class="img-fluid figure-img"></p>
<figcaption>Photo credit to <a href="https://unsplash.com/@hansjurgen007">Hans-Jurgen Mager</a> on Unsplash</figcaption>
</figure>
</div>
<p>The <code>polars</code> API is a delight in part because of its consistency. Transformations are chained sequentially onto the DataFrame in a consistent series of steps without leaving the DataFrame. This helps developers get “in the flow”, produces highly readable and well-structured code, and can make a very natural transition for users coming from R’s <code>tidyverse</code> who tend to think about data transformations in a series of “pipes”.</p>
<p>While the API has excellent coverage over a wide range of standard transformations that data practitioners need<sup>1</sup>, users may often find the need to incorporate user-defined functions (UDFs) into their logic to leverage other python libraries for domain-specific problems. This can arise frequently in data science, where functionality germane to modeling and statistical testing is, by design, not built into <code>polars</code> itself.</p>
<p>This creates multiple points of potential friction:</p>
<ul>
<li><code>polars</code> is highly readable due to the API’s consistent use of method chaining; applying UDFs shouldn’t break the flow</li>
<li>users may wish to return more complex “non-scalar” datatypes (e.g.&nbsp;multidimensional arrays, model objects) into the DataFrame<sup>2</sup></li>
</ul>
<p>This post is a quick reference to demonstrate <code>polars</code>’ numerous capabilities for integrating different types of external (from other packages) or custom (user modularized) logic without breaking the flow of <code>polars</code> transformations. Using examples from data simulation, model evaluation, and inference, we will explore methods for applying UDFs for transformation and aggregation, transforming complex objects within a <code>polars</code> pipe, and easy “escape hatches” to break the abstraction when necessary.</p>
<section id="tldr" class="level2">
<h2 class="anchored" data-anchor-id="tldr">TLDR</h2>
<p>Whenever possible, it is most efficient to express your custom user-defined function (UDF) in the native <code>polars</code> API. When the API affords the logic you need to do this, you can modularize that <code>polars</code> code into a function that takes an expression or a DataFrame as its first argument and add it to your <code>polars</code> code with:</p>
<ul>
<li><code>pipe()</code> – allows piping of expressions and DataFrames into UDFs</li>
<li><code>map_columns()</code> – custom pipe function capable of handling contexts like selectors</li>
</ul>
<p>For arbitrary python logic to transform expressions (i.e.&nbsp;at the column-level), you can use <code>map_{batches|elements}()</code> within <code>with_columns()</code>:</p>
<ul>
<li><code>map_batches()</code> – for applying non-polars vectorized functions (preferred)</li>
<li><code>map_elements()</code> – for applying nonvectorized functions (less efficient)</li>
</ul>
<p>Similarly, for arbitrary expression aggregation, <code>map_groups()</code> can be used inside of <code>agg()</code>:</p>
<ul>
<li><code>map_groups()</code> – to keep everything in the DataFrame</li>
</ul>
<p>However, there are numerous hacks and special cases to make your code either more efficient or more readable:</p>
<ul>
<li><code>polars</code> extensions may provide a more native Rust implementation of the logic</li>
<li>creating a generation function can mimic <code>polars</code>’s <a href="https://docs.pola.rs/user-guide/expressions/expression-expansion/">expression expansion</a>, allowing you to apply the same transformation to many columns at once</li>
<li>the ability to <code>map_*()</code> objects with return type <code>pl.Object</code> means you can fit any number of complex objects (e.g.&nbsp;models) into a <code>polars</code> pipeline that you wish to keep wrangling</li>
<li><code>partition_by()</code> provides an easy off-ramp for breaking out of the DataFrame abstraction for further processing with comfortable python-native patterns like list comprehensions</li>
</ul>
</section>
<section id="set-up" class="level2">
<h2 class="anchored" data-anchor-id="set-up">Set Up</h2>
<p>We’ll load a few packages to begin:</p>
<div id="ff01e89f" class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pl</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars.selectors <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> cs</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars_ds <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pds</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> numpy.random <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> binomial</span>
<span id="cb1-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.metrics <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> roc_auc_score</span>
<span id="cb1-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> statsmodels.api <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sm</span></code></pre></div>
</div>
</section>
<section id="applying-polars-udfs" class="level2">
<h2 class="anchored" data-anchor-id="applying-polars-udfs">Applying <code>polars</code> UDFs</h2>
<p>Now, imagine you simply want to be able to apply and reuse a user-defined function (UDF) written with native <code>polars</code> logic. This is easily done with the <code>pipe()</code> method, which can be chained onto either expressions (logic that computes variables in the DataFrame) or full DataFrames. Writing additional transformation logic with the native <code>polars</code> API is preferable wherever it is possible since it allows <code>polars</code> to use the same data representations and optimizations.</p>
<p>We’ll start with a boring toy dataset.</p>
<div id="69144138" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">data_dict <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb2-2">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group'</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb2-3">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>: np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb2-4">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'y'</span>: np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), </span>
<span id="cb2-5">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'p'</span>: np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb2-6">}</span>
<span id="cb2-7">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.DataFrame(data_dict)</span>
<span id="cb2-8">df.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 8
Columns: 4
$ group &lt;str&gt; 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
$ x     &lt;i64&gt; 1, 2, 3, 4, 5, 6, 7, 8
$ y     &lt;i64&gt; 8, 7, 6, 5, 4, 3, 2, 1
$ p     &lt;f64&gt; 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
</code></pre>
</div>
</div>
<section id="columns-pipe-and-map_columns" class="level3">
<h3 class="anchored" data-anchor-id="columns-pipe-and-map_columns">Columns (<code>pipe</code> and <code>map_columns()</code>)</h3>
<div id="7abec6c1" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> cap(c:pl.Expr, ceil:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> pl.Expr: <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> pl.when( c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> ceil).then(ceil).otherwise( c )</span>
<span id="cb4-2"></span>
<span id="cb4-3">df.with_columns( pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>).pipe(cap))</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="3">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (8, 5)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">x</th>
<th data-quarto-table-cell-role="th">y</th>
<th data-quarto-table-cell-role="th">p</th>
<th data-quarto-table-cell-role="th">literal</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>f64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"a"</td>
<td>1</td>
<td>8</td>
<td>0.1</td>
<td>1</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>2</td>
<td>7</td>
<td>0.2</td>
<td>2</td>
</tr>
<tr class="odd">
<td>"a"</td>
<td>3</td>
<td>6</td>
<td>0.3</td>
<td>3</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>4</td>
<td>5</td>
<td>0.4</td>
<td>4</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>4</td>
<td>0.5</td>
<td>5</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>6</td>
<td>3</td>
<td>0.6</td>
<td>5</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>7</td>
<td>2</td>
<td>0.7</td>
<td>5</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>8</td>
<td>1</td>
<td>0.8</td>
<td>5</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>This also works when applying a transformation to multiple columns with selectors. However, the pipe can result in a conflict in which all variables have the same name (unlike native chaining). This is fixed by appending <code>.name.keep()</code>, which accesses and reapplies the name of the initial column being mapped.</p>
<div id="47ccab20" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">df.with_columns( cs.numeric().pipe(cap).name.keep() )</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="4">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (8, 4)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">x</th>
<th data-quarto-table-cell-role="th">y</th>
<th data-quarto-table-cell-role="th">p</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>f64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"a"</td>
<td>1</td>
<td>5</td>
<td>0.1</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>2</td>
<td>5</td>
<td>0.2</td>
</tr>
<tr class="odd">
<td>"a"</td>
<td>3</td>
<td>5</td>
<td>0.3</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>4</td>
<td>5</td>
<td>0.4</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>4</td>
<td>0.5</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>5</td>
<td>3</td>
<td>0.6</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>2</td>
<td>0.7</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>5</td>
<td>1</td>
<td>0.8</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>If you have no other transformations you wish to do simultaneously, <code>map_columns()</code> is a slightly more concise alternative which accepts a column selector and a single transformation to be applied to all passed columns.</p>
<div id="06634957" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">df.map_columns( cs.numeric(), cap)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="5">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (8, 4)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">x</th>
<th data-quarto-table-cell-role="th">y</th>
<th data-quarto-table-cell-role="th">p</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>f64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"a"</td>
<td>1</td>
<td>5</td>
<td>0.1</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>2</td>
<td>5</td>
<td>0.2</td>
</tr>
<tr class="odd">
<td>"a"</td>
<td>3</td>
<td>5</td>
<td>0.3</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>4</td>
<td>5</td>
<td>0.4</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>4</td>
<td>0.5</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>5</td>
<td>3</td>
<td>0.6</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>2</td>
<td>0.7</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>5</td>
<td>1</td>
<td>0.8</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
<section id="data-frames-pipe" class="level3">
<h3 class="anchored" data-anchor-id="data-frames-pipe">Data Frames (<code>pipe</code>)</h3>
<p>Alternatively, you may wish to encapsulate logic operating at the level of the entire DataFrame versus an individual column.</p>
<div id="d5932d4e" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> calc_diffs(df:pl.DataFrame, threshhold:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> pl.DataFrame:</span>
<span id="cb7-2"></span>
<span id="cb7-3">    df_out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb7-4">        df</span>
<span id="cb7-5">        .with_columns(</span>
<span id="cb7-6">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'y'</span>)).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(),</span>
<span id="cb7-7">            abs_gt_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'y'</span>)).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> threshhold,</span>
<span id="cb7-8">        )</span>
<span id="cb7-9">    )</span>
<span id="cb7-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> df_out </span></code></pre></div>
</div>
<p>This, too, can be chained onto a DataFrame using the <code>pipe()</code> method:</p>
<div id="a20334f3" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">df.pipe(calc_diffs)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="7">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (8, 6)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">x</th>
<th data-quarto-table-cell-role="th">y</th>
<th data-quarto-table-cell-role="th">p</th>
<th data-quarto-table-cell-role="th">abs</th>
<th data-quarto-table-cell-role="th">abs_gt_t</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>f64</th>
<th>i64</th>
<th>bool</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"a"</td>
<td>1</td>
<td>8</td>
<td>0.1</td>
<td>7</td>
<td>true</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>2</td>
<td>7</td>
<td>0.2</td>
<td>5</td>
<td>false</td>
</tr>
<tr class="odd">
<td>"a"</td>
<td>3</td>
<td>6</td>
<td>0.3</td>
<td>3</td>
<td>false</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>4</td>
<td>5</td>
<td>0.4</td>
<td>1</td>
<td>false</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>4</td>
<td>0.5</td>
<td>1</td>
<td>false</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>6</td>
<td>3</td>
<td>0.6</td>
<td>3</td>
<td>false</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>7</td>
<td>2</td>
<td>0.7</td>
<td>5</td>
<td>false</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>8</td>
<td>1</td>
<td>0.8</td>
<td>7</td>
<td>true</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>Values for your function’s other arguments can be passed as keyword arguments<sup>3</sup>:</p>
<div id="69136519" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">df.pipe(calc_diffs, threshhold <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="8">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (8, 6)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">x</th>
<th data-quarto-table-cell-role="th">y</th>
<th data-quarto-table-cell-role="th">p</th>
<th data-quarto-table-cell-role="th">abs</th>
<th data-quarto-table-cell-role="th">abs_gt_t</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>f64</th>
<th>i64</th>
<th>bool</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"a"</td>
<td>1</td>
<td>8</td>
<td>0.1</td>
<td>7</td>
<td>true</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>2</td>
<td>7</td>
<td>0.2</td>
<td>5</td>
<td>true</td>
</tr>
<tr class="odd">
<td>"a"</td>
<td>3</td>
<td>6</td>
<td>0.3</td>
<td>3</td>
<td>false</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>4</td>
<td>5</td>
<td>0.4</td>
<td>1</td>
<td>false</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>4</td>
<td>0.5</td>
<td>1</td>
<td>false</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>6</td>
<td>3</td>
<td>0.6</td>
<td>3</td>
<td>false</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>7</td>
<td>2</td>
<td>0.7</td>
<td>5</td>
<td>true</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>8</td>
<td>1</td>
<td>0.8</td>
<td>7</td>
<td>true</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>This also allows us to write DataFrame-level functions that operate on different columns by passing the column names as parameters.</p>
<div id="83048164" class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> calc_diffs(df:pl.DataFrame, var1:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>, var2:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'y'</span>, threshhold:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> pl.DataFrame:</span>
<span id="cb10-2"></span>
<span id="cb10-3">    df_out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb10-4">        df</span>
<span id="cb10-5">        .with_columns(</span>
<span id="cb10-6">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (pl.col(var1) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pl.col(var2)).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(),</span>
<span id="cb10-7">            abs_gt_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (pl.col(var1) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pl.col(var2)).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> threshhold,</span>
<span id="cb10-8">        )</span>
<span id="cb10-9">    )</span>
<span id="cb10-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> df_out </span>
<span id="cb10-11"></span>
<span id="cb10-12">df.pipe(calc_diffs, var1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'y'</span>, var2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>, threshhold <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="9">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (8, 6)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">x</th>
<th data-quarto-table-cell-role="th">y</th>
<th data-quarto-table-cell-role="th">p</th>
<th data-quarto-table-cell-role="th">abs</th>
<th data-quarto-table-cell-role="th">abs_gt_t</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>f64</th>
<th>i64</th>
<th>bool</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"a"</td>
<td>1</td>
<td>8</td>
<td>0.1</td>
<td>7</td>
<td>true</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>2</td>
<td>7</td>
<td>0.2</td>
<td>5</td>
<td>true</td>
</tr>
<tr class="odd">
<td>"a"</td>
<td>3</td>
<td>6</td>
<td>0.3</td>
<td>3</td>
<td>false</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>4</td>
<td>5</td>
<td>0.4</td>
<td>1</td>
<td>false</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>4</td>
<td>0.5</td>
<td>1</td>
<td>false</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>6</td>
<td>3</td>
<td>0.6</td>
<td>3</td>
<td>false</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>7</td>
<td>2</td>
<td>0.7</td>
<td>5</td>
<td>true</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>8</td>
<td>1</td>
<td>0.8</td>
<td>7</td>
<td>true</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
</section>
<section id="applying-custom-series-transformations" class="level2">
<h2 class="anchored" data-anchor-id="applying-custom-series-transformations">Applying custom series transformations</h2>
<p>Piping is great, but it can break down when you need to apply column transformations requiring multiple columns as inputs or requiring logic outside of the <code>polars</code> API. That’s where <code>map_batches()</code> and <code>map_elements()</code> become useful.</p>
<p>These methods chain onto expressions just like other transformations. However, they accept any arbitrary Python function as an argument, along with specifications for the return value (scalar or vector, and its data type). The two methods differ in that <code>map_batches()</code> expects the function to be vectorized, whereas <code>map_elements()</code> can use any function (but assumes it will have to iterate over the inputs one at a time).</p>
<p>With these methods, we can input one or more expressions from a DataFrame and return either a scalar or a vector output.</p>
<section id="map-batches" class="level3">
<h3 class="anchored" data-anchor-id="map-batches">Map Batches</h3>
<p>Imagine we want to simulate draws from a binomial distribution, based on the sample size <code>x</code> and probability <code>p</code> in the dataset above.</p>
<p>In the simplest case, in which our function receives one input, we can chain <code>map_batches()</code> onto that expression. Here, we simply provide the function of interest and optional fields to confirm that our return value is a scalar (the result of a single coin flip) of type integer:</p>
<div id="ee5cf40b" class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># one column in, one value out</span></span>
<span id="cb11-2">df.with_columns(</span>
<span id="cb11-3">    coin_flip <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'p'</span>).map_batches(function <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> p: binomial(n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> p), returns_scalar <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, return_dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.UInt16)</span>
<span id="cb11-4">)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="10">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (8, 5)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">x</th>
<th data-quarto-table-cell-role="th">y</th>
<th data-quarto-table-cell-role="th">p</th>
<th data-quarto-table-cell-role="th">coin_flip</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>f64</th>
<th>u16</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"a"</td>
<td>1</td>
<td>8</td>
<td>0.1</td>
<td>1</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>2</td>
<td>7</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr class="odd">
<td>"a"</td>
<td>3</td>
<td>6</td>
<td>0.3</td>
<td>0</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>4</td>
<td>5</td>
<td>0.4</td>
<td>0</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>4</td>
<td>0.5</td>
<td>1</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>6</td>
<td>3</td>
<td>0.6</td>
<td>0</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>7</td>
<td>2</td>
<td>0.7</td>
<td>1</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>8</td>
<td>1</td>
<td>0.8</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>However, if our function requires multiple expressions as inputs, we must either combine them into a <code>struct</code> or pass the names of those expressions to the <code>exprs</code> argument of <code>pl.map_batches()</code> (which I find cleaner). The function you are mapping must then access its inputs accordingly: by field name for a struct, or by position (in the same order as <code>exprs</code>) for a list of expressions.</p>
<div id="7078483f" class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># two columns in, one value out - with structs</span></span>
<span id="cb12-2">df.with_columns(</span>
<span id="cb12-3">    coin_flip <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.struct(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'p'</span>).map_batches(</span>
<span id="cb12-4">                               function <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> z: binomial(n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> z.struct[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>], p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> z.struct[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'p'</span>]), </span>
<span id="cb12-5">                               returns_scalar <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, return_dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.UInt16)</span>
<span id="cb12-6">)</span>
<span id="cb12-7"></span>
<span id="cb12-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># two columns in, one value out - with exprs</span></span>
<span id="cb12-9">df.with_columns(</span>
<span id="cb12-10">    coin_flip <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.map_batches(exprs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'p'</span>],</span>
<span id="cb12-11">                               function <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> z: binomial(n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> z[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> z[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]), </span>
<span id="cb12-12">                               returns_scalar <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, return_dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.UInt16)</span>
<span id="cb12-13">)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="11">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (8, 5)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">x</th>
<th data-quarto-table-cell-role="th">y</th>
<th data-quarto-table-cell-role="th">p</th>
<th data-quarto-table-cell-role="th">coin_flip</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>f64</th>
<th>u16</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"a"</td>
<td>1</td>
<td>8</td>
<td>0.1</td>
<td>0</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>2</td>
<td>7</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr class="odd">
<td>"a"</td>
<td>3</td>
<td>6</td>
<td>0.3</td>
<td>2</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>4</td>
<td>5</td>
<td>0.4</td>
<td>2</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>4</td>
<td>0.5</td>
<td>2</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>6</td>
<td>3</td>
<td>0.6</td>
<td>4</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>7</td>
<td>2</td>
<td>0.7</td>
<td>5</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>8</td>
<td>1</td>
<td>0.8</td>
<td>6</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
<section id="multiple-outputs" class="level3">
<h3 class="anchored" data-anchor-id="multiple-outputs">Multiple Outputs</h3>
<p>Finally, you can also return multiple outputs. Suppose we want to simulate 100 draws, not just 1. Our internal function can instead return an array. Afterward, we can compare the average outcome with the expected value to see that this worked as intended.</p>
<div id="8fc270c9" class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># two columns in, many values out</span></span>
<span id="cb13-2">df.with_columns(</span>
<span id="cb13-3">    coin_flip <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.struct(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'p'</span>).map_batches(</span>
<span id="cb13-4">                               function <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> z: binomial(n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> z.struct[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>], </span>
<span id="cb13-5">                                                             p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> z.struct[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'p'</span>],</span>
<span id="cb13-6">                                                             size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,z.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb13-7">                                                             ).transpose(), </span>
<span id="cb13-8">                                return_dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.Array(pl.UInt16, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>) </span>
<span id="cb13-9">                                )</span>
<span id="cb13-10">).with_columns( </span>
<span id="cb13-11">    avg_outcome <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'coin_flip'</span>).arr.mean(),</span>
<span id="cb13-12">    exp_value <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'p'</span>)</span>
<span id="cb13-13">)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="12">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (8, 7)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">x</th>
<th data-quarto-table-cell-role="th">y</th>
<th data-quarto-table-cell-role="th">p</th>
<th data-quarto-table-cell-role="th">coin_flip</th>
<th data-quarto-table-cell-role="th">avg_outcome</th>
<th data-quarto-table-cell-role="th">exp_value</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>f64</th>
<th>array[u16, 100]</th>
<th>f64</th>
<th>f64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"a"</td>
<td>1</td>
<td>8</td>
<td>0.1</td>
<td>[0, 0, … 0]</td>
<td>0.09</td>
<td>0.1</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>2</td>
<td>7</td>
<td>0.2</td>
<td>[0, 0, … 0]</td>
<td>0.37</td>
<td>0.4</td>
</tr>
<tr class="odd">
<td>"a"</td>
<td>3</td>
<td>6</td>
<td>0.3</td>
<td>[0, 1, … 1]</td>
<td>0.94</td>
<td>0.9</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>4</td>
<td>5</td>
<td>0.4</td>
<td>[1, 2, … 4]</td>
<td>1.48</td>
<td>1.6</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>5</td>
<td>4</td>
<td>0.5</td>
<td>[2, 1, … 4]</td>
<td>2.47</td>
<td>2.5</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>6</td>
<td>3</td>
<td>0.6</td>
<td>[4, 1, … 4]</td>
<td>3.53</td>
<td>3.6</td>
</tr>
<tr class="odd">
<td>"b"</td>
<td>7</td>
<td>2</td>
<td>0.7</td>
<td>[5, 5, … 5]</td>
<td>4.7</td>
<td>4.9</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>8</td>
<td>1</td>
<td>0.8</td>
<td>[8, 8, … 7]</td>
<td>6.54</td>
<td>6.4</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
</section>
<section id="applying-custom-aggregations-map-groups" class="level2">
<h2 class="anchored" data-anchor-id="applying-custom-aggregations-map-groups">Applying custom aggregations (Map Groups)</h2>
<p>Similar to column transformations, <code>polars</code> can also handle arbitrary data aggregation logic with <code>map_groups()</code>.</p>
<p>Consider a DataFrame with multiple model scores:</p>
<div id="498a8da9" class="cell" data-execution_count="13">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1">data_dict <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb14-2">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group'</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb14-3">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'truth'</span>: [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb14-4">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mod_bad'</span>: [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, </span>
<span id="cb14-5">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mod_bst'</span>: [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.99</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb14-6">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mod_rnd'</span>: [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>,</span>
<span id="cb14-7">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mod_mix'</span>: [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.99</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>]</span>
<span id="cb14-8">}</span>
<span id="cb14-9">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.DataFrame(data_dict)</span>
<span id="cb14-10">df.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 8
Columns: 6
$ group   &lt;str&gt; 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
$ truth   &lt;i64&gt; 1, 1, 0, 0, 1, 1, 0, 0
$ mod_bad &lt;f64&gt; 0.25, 0.25, 0.75, 0.75, 0.25, 0.25, 0.75, 0.75
$ mod_bst &lt;f64&gt; 0.99, 0.75, 0.25, 0.01, 0.99, 0.75, 0.25, 0.01
$ mod_rnd &lt;f64&gt; 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5
$ mod_mix &lt;f64&gt; 0.99, 0.75, 0.25, 0.01, 0.5, 0.5, 0.5, 0.6
</code></pre>
</div>
</div>
<p><code>map_groups()</code> allows us to conduct arbitrary aggregations, such as calculating AUROC by group with <code>roc_auc_score()</code> from the <code>scikit-learn</code> package. As you can see, the approach is largely the same: specifying the expressions required, the function used, the return type, and the return structure.</p>
<div id="1065cb17" class="cell" data-execution_count="14">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1">df.group_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group'</span>).agg (</span>
<span id="cb16-2">    pl.map_groups(</span>
<span id="cb16-3">    exprs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'truth'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mod_mix'</span>],</span>
<span id="cb16-4">    function <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: roc_auc_score(x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]),</span>
<span id="cb16-5">    return_dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.Float64,</span>
<span id="cb16-6">    returns_scalar <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb16-7">    )</span>
<span id="cb16-8">)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="14">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (2, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">truth</th>
</tr>
<tr class="odd">
<th>str</th>
<th>f64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"b"</td>
<td>0.25</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
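<p>As a sanity check on the 0.25 reported for group <code>b</code>, here is a minimal pure-Python sketch of what AUROC measures: the share of (positive, negative) pairs that the score ranks correctly, with ties counting as half. The helper <code>pairwise_auroc()</code> is a hypothetical name for illustration and is not part of <code>polars</code> or <code>scikit-learn</code>.</p>

```python
from itertools import product

def pairwise_auroc(truth, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(truth, scores) if t == 1]
    neg = [s for t, s in zip(truth, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

truth = [1, 1, 0, 0]
auroc_a = pairwise_auroc(truth, [0.99, 0.75, 0.25, 0.01])  # mod_mix scores for group 'a'
auroc_b = pairwise_auroc(truth, [0.5, 0.5, 0.5, 0.6])      # mod_mix scores for group 'b'
print(auroc_a, auroc_b)  # prints: 1.0 0.25
```

<p>In group <code>b</code>, each positive ties with one negative (half credit) and loses to the other, giving (0.5 + 0 + 0.5 + 0) / 4 = 0.25, which matches the grouped result above.</p>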
</section>
<section id="alternatives-extensions" class="level2">
<h2 class="anchored" data-anchor-id="alternatives-extensions">Alternatives &amp; extensions</h2>
<p>Once you understand the different mapping capabilities available with <code>polars</code>, you can use them effectively either to <em>expand</em> the paradigm or to decide when you want to <em>deviate</em> from it. We’ll conclude by looking at some examples of each.</p>
<section id="extension-libraries" class="level3">
<h3 class="anchored" data-anchor-id="extension-libraries">Extension libraries</h3>
<p>Recall that native Rust and <code>polars</code> implementations will generally be faster than the techniques shown. Thus, another good option is to familiarize yourself with the burgeoning ecosystem of <code>polars</code> extensions to see if one suits your needs. <a href="https://github.com/ddotta/awesome-polars?tab=readme-ov-file#librariespackagesscripts">Awesome Polars</a> maintains a growing list of such packages.</p>
<p>For example, the <code>polars-ds</code> package can natively handle the AUROC use case above as it provides many useful evaluation functions:</p>
<div id="606d8893" class="cell" data-execution_count="15">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1">df.group_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group'</span>).agg (</span>
<span id="cb17-2">    auroc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pds.query_roc_auc(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'truth'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mod_bst'</span>)</span>
<span id="cb17-3">)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="15">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (2, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">auroc</th>
</tr>
<tr class="odd">
<th>str</th>
<th>f64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"b"</td>
<td>1.0</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
<section id="the-generator-trick" class="level3">
<h3 class="anchored" data-anchor-id="the-generator-trick">The Generator Trick</h3>
<p>While <code>pds.query_roc_auc()</code> can calculate AUROC out of the box, it expects string column names as inputs – not expressions. That means we cannot benefit from <code>polars</code>’s selectors and expression expansion to calculate multiple combinations of columns with one line of code (e.g.&nbsp;calculating AUROC for each combination of <code>truth</code> and varying model scores).</p>
<p>To apply an aggregation to multiple column subsets, <a href="https://docs.pola.rs/user-guide/expressions/expression-expansion/#programmatically-generating-expressions">the Polars docs recommend</a> a pattern like this:</p>
<ul>
<li>write a wrapper function that handles the iteration and acts as a generator yielding the expression of interest</li>
<li>obtain relevant selectors to pass into the function (here, you can use the <code>cs.expand_selector()</code> helper or any raw parsing of the column names)</li>
<li>pass the generator into the standard <code>df.group_by(...).agg(...)</code> flow</li>
</ul>
<div id="33f0e06e" class="cell" data-execution_count="16">
<div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> auroc_expressions(models):</span>
<span id="cb18-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> m <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> models:</span>
<span id="cb18-3">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">yield</span> pds.query_roc_auc( <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'truth'</span>, m).alias(m)</span>
<span id="cb18-4"></span>
<span id="cb18-5">mods <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cs.expand_selector(df, cs.starts_with(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mod_'</span>)) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># could also do: [c for c in df.columns if c[:4] == 'mod_']</span></span>
<span id="cb18-6">df.group_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group'</span>).agg( auroc_expressions( mods ))</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="16">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (2, 5)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">mod_bad</th>
<th data-quarto-table-cell-role="th">mod_bst</th>
<th data-quarto-table-cell-role="th">mod_rnd</th>
<th data-quarto-table-cell-role="th">mod_mix</th>
</tr>
<tr class="odd">
<th>str</th>
<th>f64</th>
<th>f64</th>
<th>f64</th>
<th>f64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"a"</td>
<td>-0.0</td>
<td>1.0</td>
<td>0.5</td>
<td>1.0</td>
</tr>
<tr class="even">
<td>"b"</td>
<td>-0.0</td>
<td>1.0</td>
<td>0.5</td>
<td>0.25</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
<section id="complex-object-types" class="level3">
<h3 class="anchored" data-anchor-id="complex-object-types">Complex Object Types</h3>
<p><code>polars</code> DataFrames can hold arbitrary objects (of datatype <code>pl.Object</code>) – not just scalars and vectors. This means, if we so choose, we can do complex multi-step tasks without leaving the DataFrame.<sup>4</sup></p>
<p>Consider one final sample dataset:</p>
<div id="ef19c77d" class="cell" data-execution_count="17">
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1">data_dict <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb19-2">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group'</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb19-3">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>: [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.99</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb19-4">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'y'</span>: [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.99</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>]</span>
<span id="cb19-5">}</span>
<span id="cb19-6">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.DataFrame(data_dict)</span>
<span id="cb19-7">df.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 8
Columns: 3
$ group &lt;str&gt; 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
$ x     &lt;f64&gt; 0.99, 0.75, 0.25, 0.01, 0.99, 0.75, 0.25, 0.01
$ y     &lt;f64&gt; 0.99, 0.75, 0.25, 0.01, 0.5, 0.5, 0.5, 0.6
</code></pre>
</div>
</div>
<p>If we wish, we can even use <code>map_groups()</code> to create a column that represents complex objects like <em>models</em> and then <code>map_elements()</code> to extract information from these models.</p>
<div id="73e24531" class="cell" data-execution_count="18">
<div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1">(</span>
<span id="cb21-2">df.group_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group'</span>).agg (</span>
<span id="cb21-3">    mod <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.map_groups(</span>
<span id="cb21-4">    exprs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'y'</span>],</span>
<span id="cb21-5">    function <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: sm.OLS( x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].to_numpy(), sm.add_constant( x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] )).fit() ,</span>
<span id="cb21-6">    return_dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.Object,</span>
<span id="cb21-7">    returns_scalar <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb21-8">    )</span>
<span id="cb21-9">)</span>
<span id="cb21-10">.with_columns(</span>
<span id="cb21-11">    params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mod'</span>).map_elements(<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: x.params, return_dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.List(pl.Float64)),</span>
<span id="cb21-12">    r_sq  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mod'</span>).map_elements(<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: x.rsquared, return_dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.Float64)</span>
<span id="cb21-13">)</span>
<span id="cb21-14">)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="18">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (2, 4)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">group</th>
<th data-quarto-table-cell-role="th">mod</th>
<th data-quarto-table-cell-role="th">params</th>
<th data-quarto-table-cell-role="th">r_sq</th>
</tr>
<tr class="odd">
<th>str</th>
<th>object</th>
<th>list[f64]</th>
<th>f64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"b"</td>
<td>&lt;statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000020D3D09A490&gt;</td>
<td>[3.93, -6.533333]</td>
<td>0.528971</td>
</tr>
<tr class="even">
<td>"a"</td>
<td>&lt;statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x0000020D5974DC90&gt;</td>
<td>[-1.7347e-16, 1.0]</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>This pattern can be very useful if you are doing something like, for example, bootstrap aggregation. It might not be the most efficient computationally (not parallelized), but for small problems where speed is not a gamechanger, it can make for concise and readable analysis.</p>
</section>
<section id="partitions" class="level3">
<h3 class="anchored" data-anchor-id="partitions">Partitions</h3>
<p>However, just because you <em>can</em> keep everything in a DataFrame does not mean you should. The above pattern is useful if your end goal is to extract a single quantity like a coefficient back into the DataFrame. However, if you ultimately want to go do other things with the objects you are generating, it may make for cleaner code to go ahead and break the DataFrame abstraction.</p>
<p>A final pattern I find particularly pleasant and effective is using the <code>partition_by()</code> method. This splits a DataFrame into separate frames based on grouping columns and organizes them either as a list (by default) or as a dictionary (when <code>as_dict = True</code>) indexed with a tuple containing the values of the grouping variable(s).</p>
<div id="50ca67cb" class="cell" data-execution_count="19">
<div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1">dfs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.partition_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group'</span>, as_dict <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, include_key <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb22-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k,v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> dfs.items():</span>
<span id="cb22-3">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>k<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>v<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>('a',)  : shape: (4, 3)
┌───────┬──────┬──────┐
│ group ┆ x    ┆ y    │
│ ---   ┆ ---  ┆ ---  │
│ str   ┆ f64  ┆ f64  │
╞═══════╪══════╪══════╡
│ a     ┆ 0.99 ┆ 0.99 │
│ a     ┆ 0.75 ┆ 0.75 │
│ a     ┆ 0.25 ┆ 0.25 │
│ a     ┆ 0.01 ┆ 0.01 │
└───────┴──────┴──────┘
('b',)  : shape: (4, 3)
┌───────┬──────┬─────┐
│ group ┆ x    ┆ y   │
│ ---   ┆ ---  ┆ --- │
│ str   ┆ f64  ┆ f64 │
╞═══════╪══════╪═════╡
│ b     ┆ 0.99 ┆ 0.5 │
│ b     ┆ 0.75 ┆ 0.5 │
│ b     ┆ 0.25 ┆ 0.5 │
│ b     ┆ 0.01 ┆ 0.6 │
└───────┴──────┴─────┘</code></pre>
</div>
</div>
<p>This allows us to break up the data in the way we wish to process it, and then do processing in more Python-native syntax such as a list comprehension. I find this makes for highly concise and readable code and is ultimately a better strategy when further data wrangling is not needed.</p>
<div id="8c306144" class="cell" data-execution_count="20">
<div class="sourceCode cell-code" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1">dfs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.partition_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group'</span>, as_dict <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, include_key <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb24-2">grps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ k[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> dfs.keys() ] <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># turn tuple to scalar bcs only one grouping var in key</span></span>
<span id="cb24-3">mods <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ sm.OLS( d[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>].to_numpy(), </span>
<span id="cb24-4">                 sm.add_constant( d[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'y'</span>].to_numpy() )</span>
<span id="cb24-5">                ).fit() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k,d <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> dfs.items()]</span>
<span id="cb24-6">coef <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [m.params[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> m <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> mods]</span>
<span id="cb24-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>( grps, coef))</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="20">
<pre><code>{'a': np.float64(1.0000000000000004), 'b': np.float64(-6.533333333333337)}</code></pre>
</div>
</div>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Beyond <code>dplyr</code>, analogous to much of what an R user might find in <code>stringr</code>, <code>lubridate</code>, <code>tidyr</code>, among others↩︎</p></li>
<li id="fn2"><p>This is a not-uncommon pattern with R <code>tidyverse</code>’s list columns↩︎</p></li>
<li id="fn3"><p>That is, passed as a named argument to <code>pipe</code> which will, in turn, pass it to the internal function being piped.↩︎</p></li>
<li id="fn4"><p>This mirrors patterns from <code>tidymodels</code>, <code>dplyr</code>, and <code>purrr</code> in R.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>rstats</category>
  <category>python</category>
  <category>tutorial</category>
  <guid>https://emilyriederer.com/post/py-rgo-udf/</guid>
  <pubDate>Sun, 16 Nov 2025 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/py-rgo-udf/featured.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>MLOrbs?: MLOps in the database with orbital and dbt</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/orbital-mlops/</link>
  <description><![CDATA[ 





<p>So, you build a great predictive model. <em>Now what</em>?</p>
<p><a href="https://mlops.community/">MLOps</a> is hard. Deploying a model involves different tools, skills, and risks than model development. This dooms some data science projects to die on their creator’s hard drive.</p>
<p>Tools like <code>dbt</code> and <code>SQLMesh</code> entered the scene to solve a similar problem for data analysts. These tools offer an opinionated framework for organizing multiple related SQL scripts into fully tested, orchestrated, and version-controlled projects. Data analysts can deliver end-to-end pipelines by applying their existing business context, SQL experience, and database modeling<sup>1</sup> acumen to existing infrastructure, resulting in the rise of “analytics engineering”.</p>
<p>So what is the <code>dbt</code> for data scientists doing MLOps? It turns out, it <em>might</em> just be… <code>dbt</code>! (Enter caveats galore).</p>
<p>Posit’s <a href="https://posit.co/blog/introducing-orbital-for-scikit-learn-pipelines/">recently announced</a> <code>orbital</code> project<sup>2</sup> translates feature engineering and model scoring code into raw SQL code for supported model types (e.g.&nbsp;linear, tree-based) trained in <code>scikit-learn</code> pipelines (python) and <code>tidymodels</code> workflows (R). Similar to <code>dbt</code>, this has the potential to help data scientists deploy their own batch-scoring models using existing tools (R, python, SQL) and infrastructure (analytical database) by creating a new table or view in their data warehouse (or pointing <code>duckdb</code> against their data lake!). Coupled with <code>dbt</code>, could <code>orbital</code> unlock “good enough”, zero-infra MLOps practices in a resource-constrained environment?</p>
<p>In this post, I explore a workflow for using <code>orbital</code> and <code>dbt</code> for zero-infrastructure deployment of batch models inside of a <code>dbt</code> pipeline. We’ll discuss:</p>
<ul>
<li>what makes MLOps hard</li>
<li>when database/<code>dbt</code>-based deployment might help</li>
<li>a reference implementation and workflow for <code>dbt</code> + <code>orbital</code><sup>3</sup></li>
<li>how preexisting <code>dbt</code> + <code>orbital</code> features address common MLOps pain points</li>
<li>limitations and caveats to the above approach</li>
</ul>
<p>Along the way, we’ll walk through this <a href="https://github.com/emilyriederer/orbital-exploration/tree/main/dbt_orb_demo">demo implementation of a churn prediction model</a> (wow, what a cliche). The demo is fully self-contained with open data and a <code>duckdb</code> backend if you want to pull it down and play along!</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
R and python compatibility
</div>
</div>
<div class="callout-body-container callout-body">
<p>My example in this post uses the <code>orbital</code> python package to translate a <code>scikit-learn</code> pipeline to SQL. However, there is also an <code>orbital</code> R package that can translate <code>tidymodels</code> workflows. This post is more about the <em>workflow</em> of preparing and consuming <code>orbital</code> output in <code>dbt</code>, so it’s mostly useful for either language.</p>
</div>
</div>
<section id="mlops-challenges" class="level2">
<h2 class="anchored" data-anchor-id="mlops-challenges">MLOps Challenges</h2>
<p>Predictive modeling requires nuanced, detailed thinking; MLOps requires systems thinking. Success requires an unlikely combination of skills including a deep understanding of the business problem, the modeling workflow, and engineering principles. Some key challenges include:<sup>4</sup></p>
<ul>
<li>Data management &amp; testing
<ul>
<li>Pre-modeling (not exactly MLOps) – test source data upstream of the initial query for better visibility into quality errors or concept drift</li>
<li>Scoring time – preventing scoring on feature ranges not seen in training when this poses inordinate risk</li>
</ul></li>
<li>Recreating the development / evaluation environment
<ul>
<li>Feature transformations – ensuring feature availability in prod (no leakage!) and same transformation logic as dev</li>
<li>Environment management – controlling package versions and dependencies for scoring code</li>
<li>Versioning – tracking changes to the model over time</li>
</ul></li>
<li>Serving relevance
<ul>
<li>Access – controlling access to intended consumers and delivering to a platform they can use</li>
<li>Reliability – ensuring predictions are retrievable on-demand</li>
</ul></li>
<li>Reproducibility / Observability
<ul>
<li>Snapshotting – ability to store past predictions for auditability and performance monitoring</li>
<li>Testing – inserting tests at relevant points in the pipeline, providing observability and automated error handling</li>
<li>Logging – ensuring general system observability of performance, errors, retries, and latency</li>
</ul></li>
</ul>
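<p>To make one of these concrete, the scoring-time check (not scoring on feature ranges unseen in training) can be sketched in a few lines of plain Python. This is only an illustration of the idea under simple assumptions: the feature names and the helpers <code>training_ranges()</code> and <code>out_of_range()</code> are hypothetical, and a real pipeline would more likely express the same rule as a data test.</p>

```python
def training_ranges(train_rows):
    """Record the min and max seen for each feature in the training data."""
    features = train_rows[0].keys()
    return {f: (min(r[f] for r in train_rows), max(r[f] for r in train_rows)) for f in features}

def out_of_range(row, ranges):
    """Return the features whose values fall outside the range seen in training."""
    return [f for f, (lo, hi) in ranges.items() if not (hi >= row[f] >= lo)]

# Hypothetical training data and one new observation to score.
train = [{"age": 25, "spend": 10.0}, {"age": 60, "spend": 250.0}]
ranges = training_ranges(train)
flagged = out_of_range({"age": 72, "spend": 40.0}, ranges)
print(flagged)  # prints: ['age']
```

<p>Whether a flagged row should be dropped, capped, or merely logged depends on how much risk extrapolation poses for the use case.</p>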
<p>These technical challenges are exacerbated by cultural factors. In small companies, data teams may be small (or even a data team-of-one) and lack bandwidth for model deployment, engineering-focused skillsets, access to enterprise-grade tools, or stakeholders who would know how to consume model predictions published in bespoke environments. In large companies, modelers may not be allowed access to production systems required for deployment, so handoffs often require prioritization and context sharing across multiple teams.</p>
</section>
<section id="deploying-to-the-database" class="level2">
<h2 class="anchored" data-anchor-id="deploying-to-the-database">Deploying to the database</h2>
<p>For the right use cases, publishing predictions back into an analytical warehouse can be an attractive proposition. This approach is best suited for offline batch scoring, such as models that:</p>
<ul>
<li>drive bulk actions in downstream CRMs, e.g.&nbsp;marketing segments to drive targeted emails<sup>5</sup></li>
<li>inform human decision-making, e.g.&nbsp;individual predictions that roll up into a quarterly sales forecast dashboard</li>
<li>are reincorporated in downstream analysis similar to raw data, e.g.&nbsp;model validation and backtesting, or publishing a set of propensity scores back into a clinical database</li>
</ul>
<p>In such cases, there are many advantages of having model predictions in the database:</p>
<ul>
<li><strong>Fast &amp; accurate deployment</strong>: SQL-based deployment means you can deploy your model against exactly the same data it was trained on, reducing the risk of feature drift between dev and prod. Similarly, it reduces the ongoing headaches of dependency management since SQL is generally a stable language<sup>6</sup> and does not depend on external packages.</li>
<li><strong>Data management tools</strong>: Features, scores – at the end of the day <em>it’s all just data flow</em>. Having predictions in the database unlocks the ability to leverage other features integrated in your database like access controls, data quality checks, scheduled updates, incremental loads, and snapshots.</li>
<li><strong>Integration</strong>: Many modern data stacks have their analytical data warehouse connected to many other business-critical systems like dashboards and CRMs (e.g.&nbsp;MailChimp) or are easy to integrate via numerous reverse ETL solutions. Serving predictions to a warehouse is a great first step to syncing them <em>beyond</em> the warehouse in the platforms where they can drive actions like customer contacts.</li>
<li><strong>Development language agnosticism</strong>: For R users tired of the “we don’t put R in production” conversations, SQL provides a generic abstraction layer to “hand off” a model object regardless of how it was developed.</li>
</ul>
<p>Conversely, this solution is poorly suited for real-time models where models are scored for single observations on the fly. I share a few more thoughts on when this solution is not well-suited in the final section of this post, Section&nbsp;4.</p>
</section>
<section id="orbital-dbt-pattern" class="level2">
<h2 class="anchored" data-anchor-id="orbital-dbt-pattern"><code>orbital</code> + <code>dbt</code> Pattern</h2>
<p>To demonstrate how an <code>orbital</code> + <code>dbt</code> pattern could work, I’ll walk through <a href="https://github.com/emilyriederer/orbital-exploration/tree/main/dbt_orb_demo">this example project</a>, using <a href="https://github.com/IBM/telco-customer-churn-on-icp4d/blob/master/data/Telco-Customer-Churn.csv">IBM’s telco churn dataset</a>. The project is mostly structured like a standard <code>dbt</code> project, with most of the model training and <code>orbital</code> code in <a href="https://github.com/emilyriederer/orbital-exploration/blob/main/dbt_orb_demo/model_dev/train_and_convert.ipynb">this additional notebook</a>.</p>
<p>Churn prediction might be a good candidate for batch scoring. Each month, accounts reaching their resubscription date could be scored and published to a database. Scores might then be used for analytical use cases like monitoring or revenue forecasting and for operational use cases like ingesting segments into a CRM to email targeted retention offers.</p>
<p>We’ll work up to this final pipeline:</p>
<p><img src="https://emilyriederer.com/post/orbital-mlops/full.png" class="img-fluid"></p>
<p>The point of this exercise is to think about the <code>orbital</code> and <code>dbt</code> architecture, so the model we deploy will be quite uninventive. Pull down some features, one-hot encode, fit a random forest, and do it all in a (<em>gasp</em>) Jupyter notebook. (Please don’t do this.)</p>
<section id="sec-design" class="level3">
<h3 class="anchored" data-anchor-id="sec-design">Key Features and Design Choices</h3>
<p>If you want the TLDR, I’ll briefly explain the key design choices for this pipeline:</p>
<ul>
<li>Initial Data Preparation
<ul>
<li>Set up <code>dbt test</code>s to test sources before joining your feature table. This can better catch dropout from failed joins, newly emerging encoded categories, etc. Consider what additional filters you want to put in downstream tables (Better to “alert and allow” or “block until fixed”?)</li>
<li>Prepare features separately (for normalization) but join different features in the database to take advantage of database processing</li>
<li>Consider adding a random number in the training table for a reproducible test/train split (this has to be linked to a hash or something about the entities you’re randomizing to ensure reproducibility without regard to the ordering of data samples)</li>
</ul></li>
<li>Feature Engineering
<ul>
<li>Create separate <code>scikit-learn</code> pipelines and/or <code>tidymodels</code> workflows for the feature engineering and training steps so you can render these as separate queries. This can enable better data testing and make queries more efficient so <code>orbital</code> does not repeat the feature transformation logic</li>
<li>Use test-driven development to update <code>dbt</code> data tests as you develop. For example, encoding a categorical? Immediately add an upstream test to check for previously unseen values.</li>
</ul></li>
<li>Preparing <code>orbital</code> SQL (supported by <code>sqlglot</code>)
<ul>
<li>Add back your identifier column to the query so predictions are joinable</li>
<li>Add a model version field into the query for better context to users</li>
<li>Change placeholder table to a <code>dbt</code> <code>ref()</code></li>
<li>Rename columns to remove <code>.</code>s so you do not have to always quote in queries</li>
<li>Output nicely formatted version for readability</li>
</ul></li>
<li>Deploying as a model
<ul>
<li>Consider carefully whether to make a table, view, or macro depending on your specific database, query latency, and desire to score bespoke populations</li>
</ul></li>
<li>Observability, logging, and error handling
<ul>
<li>Use <code>dbt snapshots</code> to save timestamped past predictions and feature values if these can change over time. This improves auditability and future analysis</li>
<li>Execute tests with <code>--store-failures</code> to detect changes in your data that might require model retraining or additional error handling</li>
<li>Check out <code>dbt</code> packages like <a href="https://hub.getdbt.com/elementary-data/elementary/latest/">elementary</a> to log more aspects of the model run process</li>
</ul></li>
</ul>
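<p>As a rough illustration of the reproducible split idea above, the sketch below derives a stable pseudo-random number from a hash of each entity’s identifier, so the assignment depends only on the ID and never on row order. It is written in Python for convenience; the same idea could be expressed in SQL with your database’s hash functions. The function names, the salt, and the 20% test share are assumptions for illustration and are not taken from the demo project.</p>

```python
import hashlib


def split_bucket(entity_id: str, salt: str = "churn-v1") -> float:
    """Map an entity id to a stable pseudo-random number in [0, 1).

    The salt is a hypothetical label; changing it yields a different,
    but still reproducible, assignment.
    """
    digest = hashlib.sha256(f"{salt}:{entity_id}".encode("utf-8")).hexdigest()
    # Treat the first 8 hex characters (32 bits) as an integer and scale it
    return int(digest[:8], 16) / 16**8


def assign_split(entity_id: str, test_share: float = 0.2) -> str:
    """Assign 'test' or 'train' using only the id, not the row position."""
    return "train" if split_bucket(entity_id) >= test_share else "test"


# Example ids copied from the data preview in this post
ids = ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"]
splits = {i: assign_split(i) for i in ids}
```

<p>Because the bucket is a pure function of the ID, re-running the split on reshuffled or newly appended data leaves every existing customer in the same partition.</p>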
</section>
<section id="set-up" class="level3">
<h3 class="anchored" data-anchor-id="set-up">Set-Up</h3>
<p>The sample IBM data is provided as “one big table”, so I <a href="https://github.com/emilyriederer/orbital-exploration/blob/main/dbt_orb_demo/setup/prep-seeds.py">break things up</a> to look a bit more like normalized database data representing subscription information, billing information, demographics, and churn targets. I also add a few columns to simulate different months, censor data I want to pretend is in the future, and add a few data errors for fun.</p>
<p>Here’s a preview of the resulting tables, connected by a <code>customer_id</code> primary key:</p>
<div id="fd49b6b8" class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> duckdb</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pl</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> duckdb.<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">connect</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dev.duckdb'</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> con:</span>
<span id="cb1-5"></span>
<span id="cb1-6">    df_serv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> con.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'serv'</span>).pl()</span>
<span id="cb1-7">    df_bill <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> con.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bill'</span>).pl()</span>
<span id="cb1-8">    df_demo <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> con.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'demo'</span>).pl()</span>
<span id="cb1-9">    df_chrn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> con.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'chrn'</span>).pl()</span></code></pre></div>
</div>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-1-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-1" aria-controls="tabset-1-1" aria-selected="true">Services Enrolled</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-2" aria-controls="tabset-1-2" aria-selected="false">Billing Information</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-3-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-3" aria-controls="tabset-1-3" aria-selected="false">Demographics</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-4-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-4" aria-controls="tabset-1-4" aria-selected="false">Churn</a></li></ul>
<div class="tab-content">
<div id="tabset-1-1" class="tab-pane active" aria-labelledby="tabset-1-1-tab">
<div id="3392b80b" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">df_serv.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 7043
Columns: 12
$ customer_id        &lt;str&gt; '7590-VHVEG', '5575-GNVDE', '3668-QPYBK', '7795-CFOCW', '9237-HQITU', '9305-CDSKC', '1452-KIOVK', '6713-OKOMC', '7892-POOKP', '6388-TABGU'
$ tenure             &lt;i32&gt; 1, 34, 2, 45, 2, 8, 22, 10, 28, 62
$ phone_service      &lt;str&gt; 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes'
$ multiple_lines     &lt;str&gt; 'No phone service', 'No', 'No', 'No phone service', 'No', 'Yes', 'Yes', 'No phone service', 'Yes', 'No'
$ internet_service   &lt;str&gt; 'DSL', 'DSL', 'DSL', 'DSL', 'Fiber optic', 'Fiber optic', 'Fiber optic', 'DSL', 'Fiber optic', 'DSL'
$ online_security    &lt;str&gt; 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'Yes'
$ online_backup      &lt;str&gt; 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes'
$ device_protection  &lt;str&gt; 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'No'
$ tech_support       &lt;str&gt; 'No', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No'
$ streaming_tv       &lt;str&gt; 'No', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No'
$ streaming_movies   &lt;str&gt; 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'
$ dt_renewal        &lt;date&gt; 2025-07-01, 2025-07-01, 2025-07-01, 2025-07-01, 2025-07-01, 2025-07-01, 2025-08-01, 2025-07-01, 2025-07-01, 2025-07-01
</code></pre>
</div>
</div>
</div>
<div id="tabset-1-2" class="tab-pane" aria-labelledby="tabset-1-2-tab">
<div id="bef8abfd" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">df_bill.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 7043
Columns: 6
$ customer_id       &lt;str&gt; '7590-VHVEG', '5575-GNVDE', '3668-QPYBK', '7795-CFOCW', '9237-HQITU', '9305-CDSKC', '1452-KIOVK', '6713-OKOMC', '7892-POOKP', '6388-TABGU'
$ contract          &lt;str&gt; 'Month-to-month', 'One year', 'Month-to-month', 'One year', 'Month-to-month', 'Month-to-month', 'Month-to-month', 'Month-to-month', 'Month-to-month', 'One year'
$ paperless_billing &lt;str&gt; 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No'
$ payment_method    &lt;str&gt; 'Electronic check', 'Mailed check', 'Mailed check', 'Bank transfer (automatic)', 'Electronic check', 'Electronic check', 'Credit card (automatic)', 'Mailed check', 'Electronic check', 'Bank transfer (automatic)'
$ monthly_charges   &lt;f64&gt; 29.85, 56.95, 53.85, 42.3, 70.7, 99.65, 89.1, 29.75, 104.8, 56.15
$ total_charges     &lt;f64&gt; 29.85, 1889.5, 108.15, 1840.75, 151.65, 820.5, 1949.4, 301.9, 3046.05, 3487.95
</code></pre>
</div>
</div>
</div>
<div id="tabset-1-3" class="tab-pane" aria-labelledby="tabset-1-3-tab">
<div id="76140c3e" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">df_demo.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 7043
Columns: 5
$ customer_id    &lt;str&gt; '7590-VHVEG', '5575-GNVDE', '3668-QPYBK', '7795-CFOCW', '9237-HQITU', '9305-CDSKC', '1452-KIOVK', '6713-OKOMC', '7892-POOKP', '6388-TABGU'
$ gender         &lt;str&gt; 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male'
$ senior_citizen &lt;i32&gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ partner        &lt;str&gt; 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No'
$ dependents     &lt;str&gt; 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes'
</code></pre>
</div>
</div>
</div>
<div id="tabset-1-4" class="tab-pane" aria-labelledby="tabset-1-4-tab">
<div id="70ba30c9" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">df_chrn.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 7043
Columns: 2
$ customer_id &lt;str&gt; '7590-VHVEG', '5575-GNVDE', '3668-QPYBK', '7795-CFOCW', '9237-HQITU', '9305-CDSKC', '1452-KIOVK', '6713-OKOMC', '7892-POOKP', '6388-TABGU'
$ churn       &lt;str&gt; 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No'
</code></pre>
</div>
</div>
</div>
</div>
</div>
<p>Ultimately, these are saved as <code>seeds</code> in the dbt project as a lightweight way to ingest small CSVs; in reality, they would be my <code>sources</code> flowing into my data warehouse from other production sources.</p>
</section>
<section id="features-training" class="level3">
<h3 class="anchored" data-anchor-id="features-training">Features &amp; Training</h3>
<p>Feature preparation and training are the heart of where <code>orbital</code> fits into our pipelines. I recommend doing these steps one at a time, and I explain them that way below. However, since the code is closely coupled, I’ll provide it all at once for reference. The combination of feature engineering and model training steps looks like this:</p>
<div id="321ca8cb" class="cell" data-execution_count="6">
<details class="code-fold">
<summary>Pipeline to orbital</summary>
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># build pipeline(s)</span></span>
<span id="cb10-2"></span>
<span id="cb10-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## feature pipeline does OneHotEncoding on all string columns (all are low/known cardinality)</span></span>
<span id="cb10-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## orbital can create some very verbose variable names (for uniqueness) so we clean those up some</span></span>
<span id="cb10-5">cols_str <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X.select( cs.string() ).columns</span>
<span id="cb10-6">onho_enc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'oh'</span>, OneHotEncoder(sparse_output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>), cols_str)</span>
<span id="cb10-7">ppl_feat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Pipeline([</span>
<span id="cb10-8">  (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"encoder"</span>, ColumnTransformer([onho_enc], remainder<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'passthrough'</span>))</span>
<span id="cb10-9">]).set_output(transform<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"polars"</span>)</span>
<span id="cb10-10">X_tran <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ppl_feat.fit_transform(X, y)</span>
<span id="cb10-11">X_tran.columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [c.replace(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">' '</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'_'</span>).replace(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'-'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'_'</span>).replace(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'('</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>).replace(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">')'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> c <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> X_tran.columns]</span>
<span id="cb10-12"></span>
<span id="cb10-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## training pipeline fits actual random forest model</span></span>
<span id="cb10-14">ppl_pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Pipeline([</span>
<span id="cb10-15">  (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"prep"</span>, ColumnTransformer([], remainder<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'passthrough'</span>)),</span>
<span id="cb10-16">  (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pred"</span>, RandomForestClassifier(max_depth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>))</span>
<span id="cb10-17">])</span>
<span id="cb10-18">ppl_pred.fit(X_tran, y)</span>
<span id="cb10-19"></span>
<span id="cb10-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># convert to orbital</span></span>
<span id="cb10-21"></span>
<span id="cb10-22">tbl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TBL_REF"</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># placeholder replaced in cleaning</span></span>
<span id="cb10-23"></span>
<span id="cb10-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## creating mapping of source data types to orbital types </span></span>
<span id="cb10-25">type_map <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb10-26">    pl.String:orbital.types.StringColumnType(),</span>
<span id="cb10-27">    pl.Int32:orbital.types.Int32ColumnType(),</span>
<span id="cb10-28">    pl.Float64:orbital.types.DoubleColumnType()</span>
<span id="cb10-29">}</span>
<span id="cb10-30">dict_feat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {e: type_map.get(t) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> e, t <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(X.columns, X.dtypes)}</span>
<span id="cb10-31">dict_pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {e: type_map.get(t) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> e, t <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(X_tran.columns, X_tran.dtypes)}</span>
<span id="cb10-32"></span>
<span id="cb10-33"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## features</span></span>
<span id="cb10-34">orb_ppl_feat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> orbital.parse_pipeline(ppl_feat, features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>dict_feat)</span>
<span id="cb10-35">sql_raw_feat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> orbital.export_sql(tbl, orb_ppl_feat, dialect<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"duckdb"</span>)</span>
<span id="cb10-36"></span>
<span id="cb10-37"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## scoring</span></span>
<span id="cb10-38">orb_ppl_pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> orbital.parse_pipeline(ppl_pred, features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>dict_pred)</span>
<span id="cb10-39">sql_raw_pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> orbital.export_sql(tbl, orb_ppl_pred, dialect<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"duckdb"</span>)</span></code></pre></div>
</details>
</div>
<section id="features" class="level4">
<h4 class="anchored" data-anchor-id="features">Features</h4>
<p>Feature prep is the first use case for integrating <code>orbital</code> code in our <code>dbt</code> pipeline. Ultimately, we want to be sure our production features are identical to our development features. To do this, we make three design choices:</p>
<ul>
<li>Prepare raw features in the database (pre-joining) to take advantage of database-grade computational power and have preprocessing “published” to fuel different model experimentation
<ul>
<li>Add a <a href="https://github.com/emilyriederer/orbital-exploration/blob/main/dbt_orb_demo/models/churn_model/raw_feat.sql">model <code>raw_feat</code></a> to my dbt project that simply pre-joins relevant sources</li>
</ul></li>
<li>Make separate <code>scikit-learn</code> pipelines and <code>orbital</code> SQL output for feature and training steps for separate testing and faster scoring (Otherwise, <code>orbital</code>-generated SQL sometimes reproduces feature transformation logic at <em>every use</em> of the feature versus doing it once upfront. Depending on your database’s optimizer, it may or may not be smart enough to reorder this at runtime.)
<ul>
<li>In Python, fit the <code>ppl_feat</code> pipeline (<a href="https://github.com/emilyriederer/orbital-exploration/blob/main/dbt_orb_demo/model_dev/train_and_convert.ipynb">cell 4</a>) which only fits the feature transformation steps</li>
<li>Retrieve the resulting SQL code from <code>orbital</code> and clean it up (discussed below)</li>
<li>Deploy it by writing the SQL back to the <code>models/</code> folder as a <a href="https://github.com/emilyriederer/orbital-exploration/blob/main/dbt_orb_demo/models/churn_model/prep_feat.sql">model <code>prep_feat</code></a></li>
</ul></li>
<li>Note the assumptions we are making about our data while engineering features and push those tests upstream to the source in the database
<ul>
<li>For example, one-hot encoding assumes the categories won’t change. So, since we are one-hot encoding the <code>internet_service</code> field from source, we can update our <code>schema.yml</code> file to <a href="https://github.com/emilyriederer/orbital-exploration/blob/b63cf4732cc8a440821ed96339bdb1655e9b9bb5/dbt_orb_demo/seeds/schema.yml#L33">add an <code>accepted_values</code> data test for that field</a> to warn us if our model is beginning to see data it was not prepared to handle<sup>7</sup>. Subsequent data models could, in theory, route these cases away from our scoring table and into a separate logging table for separate handling.</li>
</ul></li>
</ul>
<p>This way, we can deploy our exact features to the database separately from our final model for additional data validation. We can also run our dbt tests before consuming the results to ensure the assumptions that went into feature creation still hold.</p>
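<p>To make the idea behind the <code>accepted_values</code> test concrete, here is a minimal Python sketch of the same check outside of <code>dbt</code>. The function name <code>unseen_values</code>, the accepted list, and the example batch (including the made-up <code>'Satellite'</code> value) are illustrative assumptions, not part of the demo project.</p>

```python
def unseen_values(observed, accepted):
    """Return the sorted observed values that are missing from the accepted list."""
    return sorted(set(observed) - set(accepted))


# Categories the one-hot encoder is assumed to have seen during training
accepted_internet_service = ["DSL", "Fiber optic", "No"]

# A hypothetical new scoring batch containing a category the model never saw
new_batch = ["DSL", "Fiber optic", "Satellite", "DSL"]

problems = unseen_values(new_batch, accepted_internet_service)
if problems:
    # In a real pipeline this is where you would warn, block, or reroute rows
    print(f"Unseen internet_service values: {problems}")
```

<p>In the pipeline itself, <code>dbt</code> runs this comparison in the database; a sketch like this is mainly handy for a quick check inside a notebook while developing features.</p>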
<p>Again, because we are using <code>dbt</code>, we can take advantage of related tools. Using the VS Code extension, we can examine our database’s DAG so far and see that our data test was correctly placed on the source:</p>
<p><img src="https://emilyriederer.com/post/orbital-mlops/source-test.png" class="img-fluid"></p>
</section>
<section id="training" class="level4">
<h4 class="anchored" data-anchor-id="training">Training</h4>
<p>Model training follows similarly. We create another <code>scikit-learn</code> pipeline <code>ppl_pred</code> and train it (<a href="https://github.com/emilyriederer/orbital-exploration/blob/main/dbt_orb_demo/model_dev/train_and_convert.ipynb">cell 4</a>). This time, we just use the preprocessed data that was <code>fit_transform</code>ed in the prior step. Alternatively, we could re-retrieve our newly prepared features from the database.</p>
<p>In theory, this is where we’d also do a lot of model evaluation and iteration where being <em>outside</em> of the database is a joy. I don’t do this since getting a good model is not my goal.</p>
</section>
</section>
<section id="sql-cleanup" class="level3">
<h3 class="anchored" data-anchor-id="sql-cleanup">SQL Cleanup</h3>
<p>While <code>orbital</code> does a lot of heavy lifting, the SQL it produces is not perfect:</p>
<ul>
<li>It does not <code>SELECT</code> any metadata or identifier columns, rendering your predictions impossible to join to other data sources. Inserting this column requires care because sometimes the upstream data is being queried within the main query and other times it is queried in a CTE</li>
<li>It’s hard to get <code>orbital</code> to query from a <code>ref()</code> that plays nice with <code>dbt</code>’s Jinja because <code>orbital</code> is rigorous about quoting table and column names. So, it’s easier to put a placeholder table name and edit it in post-processing.</li>
<li>It uses somewhat long and bulky variable names that reflect <code>scikit-learn</code> internals, including <code>.</code>s in column names which can reduce readability and require quoting since <code>.</code> usually means something different in SQL</li>
<li>It includes positive predictions, negative predictions, and labels which may be excessive. I’ve never wanted anything more than the positive predictions</li>
<li>It’s not formatted, which shouldn’t matter but will rankle anyone who has ever worked with SQL</li>
</ul>
<p>To mitigate these issues, <code>sqlglot</code> makes it easy to further parse the query. <code>sqlglot</code> is a package that allows you to turn any SQL script into an AST for ease of programmatic modification. I defined a <a href="https://github.com/emilyriederer/orbital-exploration/blob/main/dbt_orb_demo/model_dev/clean_sql.py">helper function</a> with <code>sqlglot</code> to fix all of the above.</p>
<div id="f3f49349" class="cell" data-execution_count="7">
<details class="code-fold">
<summary>Cleaning function definition</summary>
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sqlglot</span>
<span id="cb11-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sqlglot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> parse_one, exp </span>
<span id="cb11-3"></span>
<span id="cb11-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> clean_sql(sql_raw: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, </span>
<span id="cb11-5">              tbl_ref: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, </span>
<span id="cb11-6">              model_version: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,</span>
<span id="cb11-7">              col_id: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>, </span>
<span id="cb11-8">              cols_renm: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'output_probability.1'</span>:<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pred'</span>, </span>
<span id="cb11-9">                                           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'output_probability.0'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, </span>
<span id="cb11-10">                                           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'output_label'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>},</span>
<span id="cb11-11">              ) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>:</span>
<span id="cb11-12"></span>
<span id="cb11-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Opinionated clean-up of SQL returned by orbital</span></span>
<span id="cb11-14"></span>
<span id="cb11-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    This function executes the following transformations:</span></span>
<span id="cb11-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    - Rename desired columns such as the prediction column (per result of cols_renm)</span></span>
<span id="cb11-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    - Remove unwanted variables (those being "renamed" to "0")</span></span>
<span id="cb11-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    - Add back ID variable for joining predictions to other datasets </span></span>
<span id="cb11-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    - Fix table reference from default TBL_REF to a specific dbt model reference</span></span>
<span id="cb11-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    - Reformats SQL for improved readability</span></span>
<span id="cb11-21"></span>
<span id="cb11-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Parameters</span></span>
<span id="cb11-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ----------</span></span>
<span id="cb11-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    sql_raw: SQL string provided by `orbital`</span></span>
<span id="cb11-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    tbl_ref: Name of dbt model to be referenced in query's FROM clause</span></span>
<span id="cb11-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    model_version: Version number of model to be added as own column. Defaults to None to add no column</span></span>
<span id="cb11-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    col_id: Name of the column representing the unique identifier of entities to be predicted</span></span>
<span id="cb11-28"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    cols_renm: Dictionary of {default_name: desired_name} to rename fields</span></span>
<span id="cb11-29"></span>
<span id="cb11-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Returns</span></span>
<span id="cb11-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    -------</span></span>
<span id="cb11-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    str</span></span>
<span id="cb11-33"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        A formatted and updated SQL query</span></span>
<span id="cb11-34"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb11-35"></span>
<span id="cb11-36"></span>
<span id="cb11-37">    ast <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> parse_one(sql_raw)</span>
<span id="cb11-38">    </span>
<span id="cb11-39">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> e <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> ast.expressions:</span>
<span id="cb11-40">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># drop columns flagged with '0', then rename the rest</span></span>
<span id="cb11-41">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> cols_renm.get(e.alias) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>:</span>
<span id="cb11-42">            e.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(arg_key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'this'</span>,value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)</span>
<span id="cb11-43">            e.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(arg_key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'alias'</span>,value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)</span>
<span id="cb11-44">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> e.alias <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> cols_renm.keys():</span>
<span id="cb11-45">            e.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(arg_key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'alias'</span>,value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>cols_renm.get(e.alias))</span>
<span id="cb11-46">    </span>
<span id="cb11-47">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add back a variable for reference (typically like an ID for joining to other tables)</span></span>
<span id="cb11-48">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># this is tricky because sometimes orbital uses CTEs and other times it doesn't;</span></span>
<span id="cb11-49">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># generally, we need to get the identifier inside the CTE if it exists</span></span>
<span id="cb11-50">    col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> exp.Column(this<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>exp.to_identifier(col_id))</span>
<span id="cb11-51">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> ast.find(exp.CTE) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb11-52">        cte_select <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ast.find(exp.CTE).this</span>
<span id="cb11-53">        cte_select.expressions.append(col)</span>
<span id="cb11-54">    ast <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ast.select(col_id)</span>
<span id="cb11-55"></span>
<span id="cb11-56">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add model version to outer query if desired</span></span>
<span id="cb11-57">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> model_version <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb11-58"></span>
<span id="cb11-59">        col_version <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> exp.Alias(</span>
<span id="cb11-60">            this<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>exp.Literal.string(model_version), </span>
<span id="cb11-61">            alias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model_version"</span>)</span>
<span id="cb11-62">        ast.find(exp.Select).expressions.append(col_version)</span>
<span id="cb11-63">    </span>
<span id="cb11-64">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># pretty print</span></span>
<span id="cb11-65">    sql_fmt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sqlglot.transpile(ast.sql(), </span>
<span id="cb11-66">                                write<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"duckdb"</span>, </span>
<span id="cb11-67">                                identify<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, </span>
<span id="cb11-68">                                pretty<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb11-69">    </span>
<span id="cb11-70">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># change out table to dbt reference</span></span>
<span id="cb11-71">    ref_str <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">{{{{</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> ref('</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>tbl_ref<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">')</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">}}}}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb11-72">    sql_fnl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sql_fmt.replace(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'"TBL_REF"'</span>, ref_str) </span>
<span id="cb11-73">  </span>
<span id="cb11-74">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> sql_fnl</span></code></pre></div>
</details>
</div>
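<p>The renaming convention in <code>clean_sql</code> (map a column to <code>'0'</code> to drop it) can be illustrated in plain Python, without <code>sqlglot</code>. This is only a minimal sketch: <code>apply_renames</code> and the <code>aliases</code> list are hypothetical names made up for illustration, and only the <code>cols_renm</code> mapping comes from the function above.</p>

```python
# Hypothetical stand-in for the column aliases in orbital's outer SELECT
aliases = ["output_label", "output_probability.0", "output_probability.1"]

# Same default mapping as clean_sql: the value "0" marks a column to drop
cols_renm = {
    "output_probability.1": "pred",
    "output_probability.0": "0",
    "output_label": "0",
}


def apply_renames(aliases: list[str], cols_renm: dict[str, str]) -> list[str]:
    """Return the final column names after dropping and renaming."""
    kept = []
    for alias in aliases:
        new_name = cols_renm.get(alias, alias)  # unmapped aliases pass through
        if new_name == "0":
            continue  # flagged for removal
        kept.append(new_name)
    return kept


print(apply_renames(aliases, cols_renm))  # ['pred']
```

<p>Only the renamed prediction column survives, which matches the <code>pred</code> column that later appears in the predictions table.</p>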
<p>I run the SQL generated from both <code>ppl_feat</code> and <code>ppl_rafo</code> through this function before writing them to <code>models/churn_model/prep_feat.sql</code> and <code>models/churn_model/pred_churn.sql</code> in my <a href="https://github.com/emilyriederer/orbital-exploration/tree/main/dbt_orb_demo/models/churn_model"><code>dbt</code> <code>models/</code> directory</a>.</p>
<p>This establishes our core model deployment pipeline:</p>
<p><img src="https://emilyriederer.com/post/orbital-mlops/core-pipe.png" class="img-fluid"></p>
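<p>The step of writing cleaned SQL into the <code>dbt</code> project could look something like the sketch below. The <code>write_model</code> helper is hypothetical, the two SQL strings are placeholders standing in for the output of <code>clean_sql()</code> (including the assumed <code>ref()</code> targets), and a temporary directory stands in for the real project root.</p>

```python
import tempfile
from pathlib import Path


def write_model(sql: str, model_dir: Path, model_name: str) -> Path:
    """Write a SQL string to model_dir/model_name.sql, creating folders as needed."""
    path = Path(model_dir) / f"{model_name}.sql"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(sql + "\n", encoding="utf-8")
    return path


# Placeholder strings; in practice these would come from clean_sql()
sql_feat = "SELECT 1 AS placeholder FROM {{ ref('raw_feat') }}"
sql_pred = "SELECT 1 AS placeholder FROM {{ ref('prep_feat') }}"

# A temporary directory stands in for the dbt project root
with tempfile.TemporaryDirectory() as project_root:
    model_dir = Path(project_root) / "models" / "churn_model"
    written = [
        write_model(sql_feat, model_dir, "prep_feat"),
        write_model(sql_pred, model_dir, "pred_churn"),
    ]
    names = sorted(p.name for p in written)

print(names)  # ['pred_churn.sql', 'prep_feat.sql']
```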
</section>
<section id="scoring-preserving-and-activating-predictions" class="level3">
<h3 class="anchored" data-anchor-id="scoring-preserving-and-activating-predictions">Scoring, Preserving, and Activating Predictions</h3>
<p>We now have a table in our database that has our churn model predictions! Here is where we can begin to take full advantage of the data management tools built into <code>dbt</code>.</p>
<p>Before scoring, we can run <code>dbt test</code> to ensure that our features are stable and valid.</p>
<p>For scoring, depending on our use case we can set the table materialization to be a table (rebuilt on a schedule) or a view (generated on the fly for a specific population).</p>
<p>For archiving past scores, we can update our <a href="https://github.com/emilyriederer/orbital-exploration/blob/b63cf4732cc8a440821ed96339bdb1655e9b9bb5/dbt_orb_demo/dbt_project.yml#L35">dbt_project.yml to include snapshotting</a> of our predictions table. This means that even if we publish our predictions as a view, we can schedule a call to <code>dbt snapshot</code> on a regular basis to capture a timestamped record of what our scores were at any given point in time. This could be useful for model monitoring or auditability. For example, if we are using our churn model to segment a marketing campaign, we might need these scores later to determine who got what treatment in the campaign.</p>
<p>For staging analysis, we can use <code>dbt</code> <code>analyses</code> to <a href="https://github.com/emilyriederer/orbital-exploration/blob/main/dbt_orb_demo/analyses/churn_model_perf.sql">render the scripts</a> that might be needed to conduct model monitoring (e.g.&nbsp;merging past scores with observed targets).</p>
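<p>To make the "who had what score when" lookup concrete, here is a minimal sketch of a point-in-time query against a snapshot-style table. It assumes the snapshot carries <code>dbt_valid_from</code> and <code>dbt_valid_to</code> columns with a far-future <code>dbt_valid_to</code> for current rows, as in the snapshot preview later in this post. The rows are made up, the <code>scores_as_of</code> helper is hypothetical, and an in-memory SQLite database stands in for the warehouse.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE pred_churn_snapshot (
        customer_id TEXT, pred REAL, dbt_valid_from TEXT, dbt_valid_to TEXT
    )
    """
)
# Made-up rows: customer A-001 was rescored on 2025-09-01
con.executemany(
    "INSERT INTO pred_churn_snapshot VALUES (?, ?, ?, ?)",
    [
        ("A-001", 0.44, "2025-08-15 00:00:00", "2025-09-01 00:00:00"),
        ("A-001", 0.61, "2025-09-01 00:00:00", "9999-12-31 00:00:00"),
        ("B-002", 0.14, "2025-08-15 00:00:00", "9999-12-31 00:00:00"),
    ],
)


def scores_as_of(con, as_of):
    """Return (customer_id, pred) rows that were current at `as_of`."""
    query = """
        SELECT customer_id, pred
        FROM pred_churn_snapshot
        WHERE ? >= dbt_valid_from AND dbt_valid_to > ?
        ORDER BY customer_id
    """
    # ISO-formatted timestamps compare correctly as plain strings
    return con.execute(query, (as_of, as_of)).fetchall()


print(scores_as_of(con, "2025-08-20 00:00:00"))
print(scores_as_of(con, "2025-09-10 00:00:00"))
```

<p>The first call returns the original score for <code>A-001</code> and the second returns the rescored value, which is the kind of lookup a campaign audit would need.</p>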
<p>We can see examples of these different artifacts branching off of our DAG:</p>
<p><img src="https://emilyriederer.com/post/orbital-mlops/artifacts.png" class="img-fluid"></p>
</section>
<section id="datamart-preview" class="level3">
<h3 class="anchored" data-anchor-id="datamart-preview">Datamart Preview</h3>
<p>Below, we can tour the resulting datasets:</p>
<div id="b2ea61bd" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> duckdb</span>
<span id="cb12-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pl</span>
<span id="cb12-3"></span>
<span id="cb12-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> duckdb.<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">connect</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dev.duckdb'</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> con:</span>
<span id="cb12-5"></span>
<span id="cb12-6">    df_feat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> con.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'raw_feat'</span>).pl()</span>
<span id="cb12-7">    df_prep <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> con.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'prep_feat'</span>).pl()</span>
<span id="cb12-8">    df_pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> con.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pred_churn'</span>).pl()</span>
<span id="cb12-9">    df_snap <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> con.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'main_snapshots.pred_churn_snapshot'</span>).pl()</span>
<span id="cb12-10">    df_fail <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> con.table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'main_audit.accepted_values_serv_internet_service__DSL__Fiber_optic__No'</span>).pl()</span></code></pre></div>
</div>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-2-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-1" aria-controls="tabset-2-1" aria-selected="true">Raw Training Data</a></li><li class="nav-item"><a class="nav-link" id="tabset-2-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-2" aria-controls="tabset-2-2" aria-selected="false">Prepared Features</a></li><li class="nav-item"><a class="nav-link" id="tabset-2-3-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-3" aria-controls="tabset-2-3" aria-selected="false">Predictions</a></li><li class="nav-item"><a class="nav-link" id="tabset-2-4-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-4" aria-controls="tabset-2-4" aria-selected="false">Snapshots</a></li><li class="nav-item"><a class="nav-link" id="tabset-2-5-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-5" aria-controls="tabset-2-5" aria-selected="false">Failures</a></li></ul>
<div class="tab-content">
<div id="tabset-2-1" class="tab-pane active" aria-labelledby="tabset-2-1-tab">
<div id="ccd80876" class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">df_feat.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 6944
Columns: 21
$ customer_id       &lt;str&gt; '7590-VHVEG', '5575-GNVDE', '3668-QPYBK', '7795-CFOCW', '9237-HQITU', '9305-CDSKC', '1452-KIOVK', '6713-OKOMC', '7892-POOKP', '6388-TABGU'
$ cat_train_test    &lt;str&gt; 'Train', 'Train', 'Train', 'Train', 'Train', 'Train', 'Train', 'Train', 'Train', 'Train'
$ tenure            &lt;i32&gt; 1, 34, 2, 45, 2, 8, 22, 10, 28, 62
$ phone_service     &lt;str&gt; 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes'
$ multiple_lines    &lt;str&gt; 'No phone service', 'No', 'No', 'No phone service', 'No', 'Yes', 'Yes', 'No phone service', 'Yes', 'No'
$ internet_service  &lt;str&gt; 'DSL', 'DSL', 'DSL', 'DSL', 'Fiber optic', 'Fiber optic', 'Fiber optic', 'DSL', 'Fiber optic', 'DSL'
$ online_security   &lt;str&gt; 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'Yes'
$ online_backup     &lt;str&gt; 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes'
$ device_protection &lt;str&gt; 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'No'
$ tech_support      &lt;str&gt; 'No', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No'
$ streaming_tv      &lt;str&gt; 'No', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No'
$ streaming_movies  &lt;str&gt; 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'
$ gender            &lt;str&gt; 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male'
$ senior_citizen    &lt;i32&gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ partner           &lt;str&gt; 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No'
$ dependents        &lt;str&gt; 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes'
$ contract          &lt;str&gt; 'Month-to-month', 'One year', 'Month-to-month', 'One year', 'Month-to-month', 'Month-to-month', 'Month-to-month', 'Month-to-month', 'Month-to-month', 'One year'
$ paperless_billing &lt;str&gt; 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No'
$ payment_method    &lt;str&gt; 'Electronic check', 'Mailed check', 'Mailed check', 'Bank transfer (automatic)', 'Electronic check', 'Electronic check', 'Credit card (automatic)', 'Mailed check', 'Electronic check', 'Bank transfer (automatic)'
$ monthly_charges   &lt;f64&gt; 29.85, 56.95, 53.85, 42.3, 70.7, 99.65, 89.1, 29.75, 104.8, 56.15
$ total_charges     &lt;f64&gt; 29.85, 1889.5, 108.15, 1840.75, 151.65, 820.5, 1949.4, 301.9, 3046.05, 3487.95
</code></pre>
</div>
</div>
</div>
<div id="tabset-2-2" class="tab-pane" aria-labelledby="tabset-2-2-tab">
<div id="ca44dedc" class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1">df_prep.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 6944
Columns: 47
$ oh__phone_service_No                       &lt;f64&gt; 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0
$ oh__phone_service_Yes                      &lt;f64&gt; 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0
$ oh__multiple_lines_No                      &lt;f64&gt; 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0
$ oh__multiple_lines_No_phone_service        &lt;f64&gt; 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0
$ oh__multiple_lines_Yes                     &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0
$ oh__internet_service_DSL                   &lt;f64&gt; 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0
$ oh__internet_service_Fiber_optic           &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0
$ oh__internet_service_No                    &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
$ oh__online_security_No                     &lt;f64&gt; 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0
$ oh__online_security_No_internet_service    &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
$ oh__online_security_Yes                    &lt;f64&gt; 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0
$ oh__online_backup_No                       &lt;f64&gt; 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0
$ oh__online_backup_No_internet_service      &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
$ oh__online_backup_Yes                      &lt;f64&gt; 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0
$ oh__device_protection_No                   &lt;f64&gt; 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0
$ oh__device_protection_No_internet_service  &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
$ oh__device_protection_Yes                  &lt;f64&gt; 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0
$ oh__tech_support_No                        &lt;f64&gt; 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0
$ oh__tech_support_No_internet_service       &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
$ oh__tech_support_Yes                       &lt;f64&gt; 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0
$ oh__streaming_tv_No                        &lt;f64&gt; 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0
$ oh__streaming_tv_No_internet_service       &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
$ oh__streaming_tv_Yes                       &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0
$ oh__streaming_movies_No                    &lt;f64&gt; 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0
$ oh__streaming_movies_No_internet_service   &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
$ oh__streaming_movies_Yes                   &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0
$ oh__gender_Female                          &lt;f64&gt; 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0
$ oh__gender_Male                            &lt;f64&gt; 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0
$ oh__partner_No                             &lt;f64&gt; 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0
$ oh__partner_Yes                            &lt;f64&gt; 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0
$ oh__dependents_No                          &lt;f64&gt; 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0
$ oh__dependents_Yes                         &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0
$ oh__contract_Month_to_month                &lt;f64&gt; 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0
$ oh__contract_One_year                      &lt;f64&gt; 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0
$ oh__contract_Two_year                      &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
$ oh__paperless_billing_No                   &lt;f64&gt; 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0
$ oh__paperless_billing_Yes                  &lt;f64&gt; 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0
$ oh__payment_method_Bank_transfer_automatic &lt;f64&gt; 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0
$ oh__payment_method_Credit_card_automatic   &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0
$ oh__payment_method_Electronic_check        &lt;f64&gt; 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0
$ oh__payment_method_Mailed_check            &lt;f64&gt; 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0
$ remainder__tenure                          &lt;f64&gt; 1.0, 34.0, 2.0, 45.0, 2.0, 8.0, 22.0, 10.0, 28.0, 62.0
$ remainder__senior_citizen                  &lt;f64&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
$ remainder__monthly_charges                 &lt;f64&gt; 29.85, 56.95, 53.85, 42.3, 70.7, 99.65, 89.1, 29.75, 104.8, 56.15
$ remainder__total_charges                   &lt;f64&gt; 29.85, 1889.5, 108.15, 1840.75, 151.65, 820.5, 1949.4, 301.9, 3046.05, 3487.95
$ customer_id                                &lt;str&gt; '7590-VHVEG', '5575-GNVDE', '3668-QPYBK', '7795-CFOCW', '9237-HQITU', '9305-CDSKC', '1452-KIOVK', '6713-OKOMC', '7892-POOKP', '6388-TABGU'
$ model_version                              &lt;str&gt; '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0'
</code></pre>
</div>
</div>
</div>
<div id="tabset-2-3" class="tab-pane" aria-labelledby="tabset-2-3-tab">
<div id="20932f04" class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1">df_pred.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 6944
Columns: 3
$ pred          &lt;f64&gt; 0.4350639304611832, 0.14068829294410534, 0.34994459204608575, 0.10898763570003211, 0.5811184463091195, 0.5483232741244137, 0.4043196897255257, 0.311830934981117, 0.3962726652389392, 0.1372128768125549
$ customer_id   &lt;str&gt; '7590-VHVEG', '5575-GNVDE', '3668-QPYBK', '7795-CFOCW', '9237-HQITU', '9305-CDSKC', '1452-KIOVK', '6713-OKOMC', '7892-POOKP', '6388-TABGU'
$ model_version &lt;str&gt; '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0'
</code></pre>
</div>
</div>
</div>
<div id="tabset-2-4" class="tab-pane" aria-labelledby="tabset-2-4-tab">
<p>Snapshots store versioned and timestamped predictions for auditability.</p>
<div id="69ab0cd9" class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1">df_snap.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 6944
Columns: 7
$ pred                    &lt;f64&gt; 0.4350639304611832, 0.14068829294410534, 0.34994459204608575, 0.10898763570003211, 0.5811184463091195, 0.5483232741244137, 0.4043196897255257, 0.311830934981117, 0.3962726652389392, 0.1372128768125549
$ customer_id             &lt;str&gt; '7590-VHVEG', '5575-GNVDE', '3668-QPYBK', '7795-CFOCW', '9237-HQITU', '9305-CDSKC', '1452-KIOVK', '6713-OKOMC', '7892-POOKP', '6388-TABGU'
$ model_version           &lt;str&gt; '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0'
$ dbt_scd_id              &lt;str&gt; 'c4671964ba707c90a41d74f6f2ef75b7', '7dc40efa71bcee4795c7f54b3b5bc783', 'b05d4425f5d07106f1f2f2e782461f44', '3b919e27eb23ba54e200462af172e7da', 'eb6117ba3156a771b0e02e5e7bc644ab', 'ddae31e6abdabdd771ea4bbd1072fe55', 'aa7fe49fcbb5a937b44f7ac589b3ff34', 'da7eb2655934862105e8782e40ca5eb5', '882f945d0e265290e5976d4c8d04679e', '72f44f68e12a53baaf1d9ddd2469a616'
$ dbt_updated_at &lt;datetime[μs]&gt; 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000
$ dbt_valid_from &lt;datetime[μs]&gt; 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000, 2025-08-15 19:22:45.830000
$ dbt_valid_to   &lt;datetime[μs]&gt; 9999-12-31 00:00:00, 9999-12-31 00:00:00, 9999-12-31 00:00:00, 9999-12-31 00:00:00, 9999-12-31 00:00:00, 9999-12-31 00:00:00, 9999-12-31 00:00:00, 9999-12-31 00:00:00, 9999-12-31 00:00:00, 9999-12-31 00:00:00
</code></pre>
</div>
</div>
</div>
<div id="tabset-2-5" class="tab-pane" aria-labelledby="tabset-2-5-tab">
<p>What happens when the <code>internet_service</code> field is recoded in production data from “Fiber optic” to “Fiber” after training? If we are checking for <code>accepted_values</code>, we capture that change in our failures table before scoring on bad data!</p>
<div id="a775d800" class="cell" data-execution_count="13">
<div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1">df_fail.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 1
Columns: 2
$ value_field &lt;str&gt; 'Fiber'
$ n_records   &lt;i64&gt; 48
</code></pre>
</div>
</div>
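<p>To make the logic of that check concrete, here is a minimal sketch in plain Python. It is not how <code>dbt</code> implements <code>accepted_values</code> (that runs as SQL in the warehouse); the function name <code>accepted_values_failures</code> and the five sample records are hypothetical, and only the output columns mirror the failures table above.</p>
<pre class="sourceCode python"><code class="sourceCode python">from collections import Counter


def accepted_values_failures(values, accepted):
    """Return one row per unexpected value along with its record count."""
    counts = Counter(v for v in values if v not in accepted)
    return [{"value_field": v, "n_records": n} for v, n in counts.items()]


# Hypothetical sample: "Fiber optic" was recoded to "Fiber" after training
accepted = {"DSL", "Fiber optic", "No"}
production = ["DSL", "Fiber", "No", "Fiber", "DSL"]

failures = accepted_values_failures(production, accepted)
print(failures)
</code></pre>
<p>An empty result means every record matched the training-time categories; any rows returned are a signal to pause scoring and investigate.</p>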
</div>
</div>
</div>
</section>
<section id="dreaming-bigger" class="level3">
<h3 class="anchored" data-anchor-id="dreaming-bigger">Dreaming bigger</h3>
<p>This demo shows just <code>orbital</code> + <code>dbt</code>, but that’s just the beginning. Treating the whole MLOps process just like data processing means you can benefit from a wide range of integrated tools and capabilities, e.g.:</p>
<ul>
<li>data ingestion
<ul>
<li>retrieve training data from APIs with <code>dlt</code></li>
<li>ingest features from flatfiles on blob sources via the <code>dbt</code> <a href="https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/">external-tables</a> package</li>
</ul></li>
<li>better testing with dbt packages such as <a href="https://hub.getdbt.com/metaplane/dbt_expectations/latest/"><code>dbt-expectations</code></a> (inspired by Great Expectations)</li>
<li>logging and observability
<ul>
<li>snapshot features table as well as predictions table</li>
<li>use <code>dbt</code> packages like <a href="https://hub.getdbt.com/elementary-data/elementary/latest/">elementary</a> to write more run metadata to your warehouse</li>
</ul></li>
<li>orchestration with <code>Dagster</code>
<ul>
<li>unfurl your local <code>dbt</code> DAG into a broader pipeline</li>
<li>trigger more model-adjacent tasks such as refitting, monitoring, etc.</li>
</ul></li>
<li>documentation with <code>dbt docs</code> (which can be <a href="https://github.com/emilyriederer/dbt_duckdb_quarto">enhanced with Quarto</a>)</li>
<li>reverse ETL with tools like HighTouch or Census to easily sync analytical data models into production systems like CRMs</li>
</ul>
</section>
</section>
<section id="sec-limitations" class="level2">
<h2 class="anchored" data-anchor-id="sec-limitations">Limitations</h2>
<p>While I see a lot of promise in model deployment to the database, it’s currently not without its limitations. Tobias Macey of the excellent <a href="https://www.dataengineeringpodcast.com/">Data Engineering Podcast</a> always ends his show by asking his guests (mostly tool developers): “When is &lt;thing&gt; not the right solution?” I’ll conclude by answering the same.</p>
<p>There are many things I would consider if using <code>orbital</code> today for business use cases versus hobby projects:</p>
<ul>
<li><strong>Use Case</strong>: ML in Database only makes sense for batch predictions. <code>orbital</code> is not the right solution if there is a chance you’ll want realtime predictions</li>
<li><strong>Algorithms</strong>: Right now <code>orbital</code> is mostly limited to <code>scikit-learn</code> models and select feature engineering steps (or <code>tidymodels</code> in R). This can be a challenge if you want to use other common algorithms. I’ve figured out some workarounds for <a href="../..\post/orbital-xgb"><code>xgboost</code></a> but at some point, the amount of hacking around the periphery reduces the “same code in dev and prod” benefits</li>
<li><strong>Scale/Complexity</strong>: SQL is generally good at optimizing large-scale data processing jobs. However, extremely large ensemble models may experience slower runtimes. If such a model were to be run at extreme scale, one would need to consider the relative latency<sup>8</sup> and cost<sup>9</sup> of this versus other solutions. Depending on your engine, there are also some <a href="https://github.com/tidymodels/orbital/issues/97#issuecomment-3194016548">query optimizations to consider</a></li>
<li><strong>Platform</strong>: Surprisingly, as I explored this mashup, I came to learn both <a href="https://cloud.google.com/bigquery/quotas">BigQuery</a> and Azure impose maximum query length limits which could pose challenges for large models (e.g.&nbsp;deep trees in random forests or GBMs). One could work around this with a lot of views, but it’s generally better to not pick a fight with your infrastructure.</li>
<li><strong>Precision</strong>: <code>orbital</code> uses <code>sklearn-onnx</code> which can create some issues with <a href="https://onnx.ai/sklearn-onnx/auto_tutorial/plot_ebegin_float_double.html">floating point precision</a>. It is easy to test how critical this is for your use case, but you may find corner cases where it is difficult to precisely recreate your local predictions – particularly for tree-based models where tiny perturbations send an observation down a different path.</li>
<li><strong>Bugs</strong>: <code>orbital</code> still has some bugs it’s working out and seems to still be building out its testing infrastructure. For example, at the time of writing this demo, I started out trying to use the <code>TargetEncoder()</code> which <a href="https://github.com/posit-dev/orbital/issues/62">failed unexpectedly</a> so I switched to the <code>OneHotEncoder()</code>. That’s fine for a demo, but I wouldn’t be so cavalier about letting tool limitations shape my modeling choices in real life.</li>
<li><strong>Governance</strong>: Similar to the downsides of <code>dbt</code>, the risk of lowering the barriers to entry to deploying a new data model or machine learning model is that it will be done carelessly or prolifically. As the demo above shows, a rigorous approach can add many data artifacts to your datamart and could risk causing bloat if done casually. Having the right controls to determine who should be allowed to deploy models of what materiality is key.</li>
</ul>
<p>The good news is, most of these downsides are fully testable. You can quickly and pretty robustly dual-validate <code>orbital</code>’s logic and cross-check prediction speed and accuracy from python and SQL environments. So, if the idea sounds intriguing, take it for a spin! There aren’t too many “unknown unknowns”. These packages are under active development and improving by the day. I am excited to continue following the progress and experimenting with this project.</p>
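<p>As a rough illustration of what such a cross-check might look like, the sketch below compares two lists of predictions within an absolute tolerance. Everything in it is an assumption for illustration: the helper name <code>compare_predictions</code>, the tolerance of <code>1e-6</code>, and the made-up prediction values. In practice the two lists would come from your local model and from the SQL that <code>orbital</code> generates.</p>
<pre class="sourceCode python"><code class="sourceCode python">import math


def compare_predictions(local, database, abs_tol=1e-6):
    """Return the row indices where local and in-database predictions disagree."""
    if len(local) != len(database):
        raise ValueError("prediction vectors must have the same length")
    return [
        i
        for i, (a, b) in enumerate(zip(local, database))
        if not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
    ]


# Hypothetical values: the second pair differs only by float noise,
# the third pair differs by enough to suggest a different tree path
local_preds = [0.25, 0.50, 0.75]
db_preds = [0.25, 0.5000001, 0.80]

mismatches = compare_predictions(local_preds, db_preds)
print(mismatches)
</code></pre>
<p>How tight the tolerance should be depends on your use case; a looser threshold may be fine for ranking customers, while a hard decision cutoff may call for a stricter one.</p>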


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>This post is cursed because “data modeling” and “predictive modeling” are completely different things, one involving data pipelines and the other involving machine learning. Both happen to be relevant here.↩︎</p></li>
<li id="fn2"><p>I say project versus package because <code>orbital</code> is really a “concept” with parallel but programmatically unrelated R and python implementations; the R project has been around for a bit, but the python version was recently released.↩︎</p></li>
<li id="fn3"><p>Just want a few concrete ideas for stitching these tools together without the wind-up? Jump to Section&nbsp;3.1.↩︎</p></li>
<li id="fn4"><p>This list is, of course, non-comprehensive and coincidentally cherry-picked towards the problems which I’ll claim <code>orbital</code> might address. For a thoughtful and comprehensive take on MLOps, check out <a href="https://arxiv.org/abs/2209.09125">this excellent survey</a> by <a href="https://www.sh-reya.com/">Shreya Shankar</a> who, coincidentally enough, made MLOps the focus of her Stanford PhD in… Databases!↩︎</p></li>
<li id="fn5"><p>In my dual life volunteering on downballot campaigns, I also think this pattern would be very effective to publish partisanship and turnout scores back to BigQuery, the beating heart of campaign data infrastructure.↩︎</p></li>
<li id="fn6"><p>Within a given database. SQL is a loosely enforced spec leading to an absurd amount of arbitrary uniqueness on top of ANSI. But, happily, so long as you aren’t switching databases, this does not matter.↩︎</p></li>
<li id="fn7"><p>If you run <code>dbt test</code> or <code>dbt test --store-failures</code>, you can find two such failure cases.↩︎</p></li>
<li id="fn8"><p>Or mitigate it through off-hours scheduling and materializing as a table versus a view↩︎</p></li>
<li id="fn9"><p>Comparing cost of database compute versus egress/ingress of pulling data from database to execute somewhere else↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>rstats</category>
  <category>python</category>
  <category>dbt</category>
  <category>sql</category>
  <category>data</category>
  <category>ml</category>
  <guid>https://emilyriederer.com/post/orbital-mlops/</guid>
  <pubDate>Sat, 16 Aug 2025 05:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/orbital-mlops/featured.png" medium="image" type="image/png" height="135" width="144"/>
</item>
<item>
  <title>How Quarto embed fixes data science storytelling</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/quarto-comms/</link>
  <description><![CDATA[ 






<p>Data science stakeholder communication is hard. The typical explanation of this is to parody data scientists as “too technical” to communicate with their audiences. But I’ve always found it unsatisfying to believe that “being technical” makes it too challenging to connect with the 0.1% of the population so similar to ourselves that we all happen to work in the same organization.</p>
<p>Instead, I believe communication is rarely taught intentionally and, worse, is modeled poorly by educational communication which has different goals. This leads to an “explain all the things” mindset that is enabled by literate programming tools like notebooks. It’s said that “writing is thinking”, and literate programming excels at capturing our stream of consciousness. However, our stream of consciousness does not excel at succinct retrospective explanations of our work’s impact.</p>
<p>Data scientists do not have a communication problem. They have a problem in ordering their story for impact and engagement, driven by their background, context, and tools.</p>
<p>Fortunately, Quarto’s new <code>embed</code> feature bridges the gap between reproducible research and resonant storytelling. This simple feature allows us to cross-reference chunk output (tables, plots, text, or anything else) between documents. The ability to import reproducible results can completely change our writing workflow. It separates the tasks of analysis from summarization and changes our mindset from one of explaining “all the things” to one of curating the most persuasive evidence for our plaintext arguments.</p>
<p>In this post, I discuss some of the reasons why I think data science communication goes wrong, why changing story order helps, and how Quarto <code>embed</code>s can help us have reproducible results and a compelling story at the same time.</p>
<p><em>This topic has been on my mind for a while, and I was recently motivated to get this post over the finish line while talking to Dr.&nbsp;Lucy D’Agostino McGowan and Dr.&nbsp;Ellie Murray on their <a href="https://casualinfer.libsyn.com/optimizing-data-workflows-with-emily-riederer-season-6-episode-8">Casual Inference</a> podcast. Thanks to them for the energy and inspiration!</em></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>This post assumes you know about the basics of <a href="https://quarto.org/">Quarto</a>.</p>
<p>TLDR: Quarto is a tool for rendering documents, PDFs, blogs, websites, and more from markdown and embedded code chunks. Thus, it supports literate programming, reproducible research, and much more (including this blog).</p>
</div>
</div>
<section id="why-data-science-communication-is-hard" class="level2">
<h2 class="anchored" data-anchor-id="why-data-science-communication-is-hard">Why data science communication is hard</h2>
<p>The deck is stacked against good communication of data science outcomes. Most of our experience with communication comes from education where it serves fundamentally different purposes. Educational communication tends to be linear and step-by-step, but professional communication should often lead with the key takeaway.</p>
<section id="communication-in-education" class="level3">
<h3 class="anchored" data-anchor-id="communication-in-education">Communication in education</h3>
<p>The majority of technical communication consumed and produced by early career professionals happened during their education. However, academic<sup>1</sup> communication has a fundamentally different goal, so it does not provide an effective model.</p>
<p>Academic communication leans towards exhaustive knowledge sharing of all the details – either because the target audience needs to know them or the audience needs to know that the communicator knows them.</p>
<p>When students are communicating (completing problem sets or assignments), they have the goal of proving their mastery. Their audience (professors, TAs) can be assumed to have deeper knowledge of the topic than the presenter, and communication is intended to demonstrate comprehensiveness of knowledge – or at least to “show their work” for partial credit.</p>
<p>When students are consuming communication (from an instructor or textbook), they experience communication with the goal of exhaustive knowledge transfer. Instructors or textbooks aim to make the audience know what they know and to be able to execute that information independently.</p>
</section>
<section id="communication-in-industry" class="level3">
<h3 class="anchored" data-anchor-id="communication-in-industry">Communication in industry</h3>
<p>These are decidedly not the objectives of professional communication. We are given a job <em>because</em> we are judged to have the mastery of a topic <em>specifically that no one else has the time, energy, or desire to think about in enough detail</em>. The goal is not to show what you know (or, how much work you did along the way) or to get the audience to your intimacy of understanding.<sup>2</sup></p>
<p>Instead, the goal is to be an effective abstraction layer between the minute details and what is actually needed to <em>act</em>. Communication is an act of curating the minimal spanning set of relevant facts, context, and supporting evidence or caveats.<sup>3</sup></p>
</section>
<section id="story-structuring" class="level3">
<h3 class="anchored" data-anchor-id="story-structuring">Story structuring</h3>
<p>Roughly speaking, this means we are used to talking about data science work in the <em>procedural</em> order:</p>
<pre><code>1. I wondered...
2. I prepared my data...
3. I ran this analysis...
4. This gave me another question...
5. That {did/not} work...
6. So finally I ended up with this...
7. I learned...</code></pre>
<p>However, for effective communication, it may be more useful to structure our work with <em>progressive disclosure</em>:</p>
<pre><code>1. I wondered...
7. I ultimately found...
6. This is based on trying this...
(3-5). We also considered other options...
2. And this is all based on this data, details, etc.</code></pre>
<p>In short, we want to tell the story of why others should <em>care</em> about our results – not the story of how we got the result.<sup>4</sup> This helps turn a presentation or written document into a “conversation” where the audience can selectively partake of the details instead of waiting for the main point to be revealed as in a murder mystery.</p>
</section>
</section>
<section id="communicating-and-the-data-science-workflow" class="level2">
<h2 class="anchored" data-anchor-id="communicating-and-the-data-science-workflow">Communicating and the data science workflow</h2>
<p>Moving between story structures isn’t just a matter of changing our mindset. Organizational pressures and tooling also bias us towards poor communication practices. I’ve always loved the phrase “writing is thinking”, but the corollary is that your writing cannot be any clearer than your thinking, and that depends on how much time you have taken to synthesize what actually mattered from your own work.</p>
<p>Timeline pressures push us towards more procedural stories. The story <em>you yourself know</em> about your work is the linear one that you just experienced – what you tried, why, and what happened next. If you need to communicate before you can synthesize and restructure, you will be caught flat-footed sharing anything but a procedural story. It’s likely better to begin drafting your final communication from a clean slate but tempting to reuse what exists.</p>
<p>What’s more, even the best tools don’t set us up for success. I’ve long been a fan of literate programming tools like R Markdown and Quarto. I used to believe that these allowed me to effectively document while developing. This is true for documenting my raw stream of consciousness for my own future reference or other colleagues. However, notebook narratives are by definition in the procedural order.</p>
<p>This mindset is further embedded as we think about working reproducibly and structuring our work into DAGs that can be rerun end-to-end. If I want to create a final manuscript that is fully reproducible with plots and tables generated dynamically (no copy pasting!), literate programming may feel like it is constraining me towards running things in order. (This isn’t entirely true, as I’ve written about before with <a href="post/rmarkdown-driven-development">R Markdown Driven Development</a>.)</p>
</section>
<section id="using-quarto-embeds-to-improve-your-workflow" class="level2">
<h2 class="anchored" data-anchor-id="using-quarto-embeds-to-improve-your-workflow">Using Quarto <code>embed</code>s to improve your workflow</h2>
<p>So, we need to structure our stories differently for effective communication, but neither our timelines nor our tooling is conducive to it? That’s where the <a href="https://quarto.org/docs/authoring/notebook-embed.html">Quarto <code>embed</code> feature</a> comes into the picture.</p>
<section id="quarto-embed-overview" class="level3">
<h3 class="anchored" data-anchor-id="quarto-embed-overview">Quarto <code>embed</code> overview</h3>
<p>The <code>embed</code> shortcode lets us reference the output of another <code>.qmd</code> or <code>.ipynb</code> Quarto document in a different Quarto file. This requires two steps:</p>
<p>First, in the original notebook we <code>label</code> the top of the chunk whose output we wish to target, e.g.&nbsp;in our notebook <code>analysis.ipynb</code>:<sup>5</sup></p>
<pre><code>#| label: my-calc

1+1</code></pre>
<p>Then in our main document we can pull in the output (and optionally the code) of that calculation, e.g.&nbsp;in a final Quarto document <code>final-writeup.qmd</code> we could add:</p>
<pre><code>{{&lt; embed analysis.ipynb#my-calc &gt;}}</code></pre>
<p>This works with any sort of cell output including raw <code>print()</code> statement output, plots, tables, etc.</p>
</section>
<section id="usage-patterns" class="level3">
<h3 class="anchored" data-anchor-id="usage-patterns">Usage patterns</h3>
<p>Why are embeds a game-changer for data science communication? <strong>Because writing is thinking and storytelling is curation.</strong> Embeds can help us switch our mindset away from “showing our work” and towards providing persuasive evidence that supports our narrative.</p>
<p>The workflow I recommend is:</p>
<ul>
<li>Still use good practices to modularize steps like data pulls, separate modules for reusable functions, etc.</li>
<li>Do the analysis and last-mile transformation you would do in a Jupyter notebook, leaving the commentary that you would along the way</li>
<li>After you’re done, think about what is important. What does your audience need to see and in what order?</li>
<li>Then take a step back and write your actual story in a Quarto notebook</li>
<li>Selectively embed compelling evidence at the right points in your narrative</li>
</ul>
<p>This is illustrated in the figure below:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://emilyriederer.com/post/quarto-comms/featured.png" class="img-fluid figure-img"></p>
<figcaption>Content embedding workflow illustration</figcaption>
</figure>
</div>
<p>This simple shortcode unblocks us from critical storytelling and workflow challenges:</p>
<ul>
<li>We can generate content in rerunnable linear notebooks</li>
<li>We can start writing from a blank slate to ensure that we are focused on <em>substance</em> and not just sheer volume of content</li>
<li>We can then <em>selectively curate</em> output worthy of inclusion in a final document</li>
<li>We can insert these in the order that makes sense <em>for the story</em> versus for generation</li>
<li>This can exist without deleting or modifying our notebooks that capture the full thought process</li>
<li>As a bonus, our final document need not be a gnarly <code>.ipynb</code> but a plaintext <code>.qmd</code> to make version control, editing, and collaborating with noncoding contributors easier</li>
</ul>
<p>It’s not just code output that can be imported either. Perhaps you already wrote up an introduction framed as an experimental design or a proposal? Other full markdown files can similarly be included with the <a href="https://quarto.org/docs/authoring/includes.html"><code>include</code> shortcode</a>. (<code>include</code> adds the <em>unrendered text</em> of another <code>.qmd</code> file, including any code chunks to be executed, to your main <code>.qmd</code>; whereas, for <code>embed</code>s, we are referencing the <em>output</em> of another file without rerendering.)</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>This does not mean you should just cram all your analysis into your notebook and not worry about code quality, organization, or commentary!</p>
<p>The goal here is to have two good results for two different audiences without the overhead or reproducibility risks of maintaining them separately.</p>
</div>
</div>
</section>
<section id="demo" class="level3">
<h3 class="anchored" data-anchor-id="demo">Demo</h3>
<p>To give a quick demo, I made <a href="post/quarto-comms/raw-analysis.html">a separate notebook</a> that just pulls some data from an API, cleans it up, and makes a few aggregations and plots. But suppose I doubt you’re interested in any of that. If you’ve read this long, you seem to trust me to do some amount of data stuff correctly.</p>
<p>So instead, I just put the following line in the <code>.qmd</code> file that is creating this post:</p>
<pre><code>{{&lt; embed raw-analysis.ipynb#tbl-pit-eo &gt;}}</code></pre>
<p>That produces this:</p>
<div class="quarto-embed-nb-cell">
<div class="cell" data-execution_count="10">
<div id="tbl-pit-eo" class="cell quarto-float quarto-figure quarto-figure-center anchored" data-execution_count="10">
<figure class="quarto-float quarto-float-tbl figure">
<figcaption class="quarto-float-caption-top quarto-float-caption quarto-float-tbl quarto-uncaptioned" id="tbl-pit-eo-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Table&nbsp;1
</figcaption>
<div aria-describedby="tbl-pit-eo-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="cell-output cell-output-display">



<meta charset="utf-8">


<div id="bjovyrnhbw" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<style>
#bjovyrnhbw table {
          font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Helvetica Neue', 'Fira Sans', 'Droid Sans', Arial, sans-serif;
          -webkit-font-smoothing: antialiased;
          -moz-osx-font-smoothing: grayscale;
        }

#bjovyrnhbw thead, tbody, tfoot, tr, td, th { border-style: none !important; }
 tr { background-color: transparent !important; }
#bjovyrnhbw p { margin: 0 !important; padding: 0 !important; }
 #bjovyrnhbw .gt_table { display: table !important; border-collapse: collapse !important; line-height: normal !important; margin-left: auto !important; margin-right: auto !important; color: #333333 !important; font-size: 16px !important; font-weight: normal !important; font-style: normal !important; background-color: #FFFFFF !important; width: auto !important; border-top-style: solid !important; border-top-width: 2px !important; border-top-color: #A8A8A8 !important; border-right-style: none !important; border-right-width: 2px !important; border-right-color: #D3D3D3 !important; border-bottom-style: solid !important; border-bottom-width: 2px !important; border-bottom-color: #A8A8A8 !important; border-left-style: none !important; border-left-width: 2px !important; border-left-color: #D3D3D3 !important; }
 #bjovyrnhbw .gt_caption { padding-top: 4px !important; padding-bottom: 4px !important; }
 #bjovyrnhbw .gt_title { color: #333333 !important; font-size: 125% !important; font-weight: initial !important; padding-top: 4px !important; padding-bottom: 4px !important; padding-left: 5px !important; padding-right: 5px !important; border-bottom-color: #FFFFFF !important; border-bottom-width: 0 !important; }
 #bjovyrnhbw .gt_subtitle { color: #333333 !important; font-size: 85% !important; font-weight: initial !important; padding-top: 3px !important; padding-bottom: 5px !important; padding-left: 5px !important; padding-right: 5px !important; border-top-color: #FFFFFF !important; border-top-width: 0 !important; }
 #bjovyrnhbw .gt_heading { background-color: #FFFFFF !important; text-align: center !important; border-bottom-color: #FFFFFF !important; border-left-style: none !important; border-left-width: 1px !important; border-left-color: #D3D3D3 !important; border-right-style: none !important; border-right-width: 1px !important; border-right-color: #D3D3D3 !important; }
 #bjovyrnhbw .gt_bottom_border { border-bottom-style: solid !important; border-bottom-width: 2px !important; border-bottom-color: #D3D3D3 !important; }
 #bjovyrnhbw .gt_col_headings { border-top-style: solid !important; border-top-width: 2px !important; border-top-color: #D3D3D3 !important; border-bottom-style: solid !important; border-bottom-width: 2px !important; border-bottom-color: #D3D3D3 !important; border-left-style: none !important; border-left-width: 1px !important; border-left-color: #D3D3D3 !important; border-right-style: none !important; border-right-width: 1px !important; border-right-color: #D3D3D3 !important; }
 #bjovyrnhbw .gt_col_heading { color: #333333 !important; background-color: #FFFFFF !important; font-size: 100% !important; font-weight: normal !important; text-transform: inherit !important; border-left-style: none !important; border-left-width: 1px !important; border-left-color: #D3D3D3 !important; border-right-style: none !important; border-right-width: 1px !important; border-right-color: #D3D3D3 !important; vertical-align: bottom !important; padding-top: 5px !important; padding-bottom: 5px !important; padding-left: 5px !important; padding-right: 5px !important; overflow-x: hidden !important; }
 #bjovyrnhbw .gt_column_spanner_outer { color: #333333 !important; background-color: #FFFFFF !important; font-size: 100% !important; font-weight: normal !important; text-transform: inherit !important; padding-top: 0 !important; padding-bottom: 0 !important; padding-left: 4px !important; padding-right: 4px !important; }
 #bjovyrnhbw .gt_column_spanner_outer:first-child { padding-left: 0 !important; }
 #bjovyrnhbw .gt_column_spanner_outer:last-child { padding-right: 0 !important; }
 #bjovyrnhbw .gt_column_spanner { border-bottom-style: solid !important; border-bottom-width: 2px !important; border-bottom-color: #D3D3D3 !important; vertical-align: bottom !important; padding-top: 5px !important; padding-bottom: 5px !important; overflow-x: hidden !important; display: inline-block !important; width: 100% !important; }
 #bjovyrnhbw .gt_spanner_row { border-bottom-style: hidden !important; }
 #bjovyrnhbw .gt_group_heading { padding-top: 8px !important; padding-bottom: 8px !important; padding-left: 5px !important; padding-right: 5px !important; color: #333333 !important; background-color: #FFFFFF !important; font-size: 100% !important; font-weight: initial !important; text-transform: inherit !important; border-top-style: solid !important; border-top-width: 2px !important; border-top-color: #D3D3D3 !important; border-bottom-style: solid !important; border-bottom-width: 2px !important; border-bottom-color: #D3D3D3 !important; border-left-style: none !important; border-left-width: 1px !important; border-left-color: #D3D3D3 !important; border-right-style: none !important; border-right-width: 1px !important; border-right-color: #D3D3D3 !important; vertical-align: middle !important; text-align: left !important; }
 #bjovyrnhbw .gt_empty_group_heading { padding: 0.5px !important; color: #333333 !important; background-color: #FFFFFF !important; font-size: 100% !important; font-weight: initial !important; border-top-style: solid !important; border-top-width: 2px !important; border-top-color: #D3D3D3 !important; border-bottom-style: solid !important; border-bottom-width: 2px !important; border-bottom-color: #D3D3D3 !important; vertical-align: middle !important; }
 #bjovyrnhbw .gt_from_md> :first-child { margin-top: 0 !important; }
 #bjovyrnhbw .gt_from_md> :last-child { margin-bottom: 0 !important; }
 #bjovyrnhbw .gt_row { padding-top: 8px !important; padding-bottom: 8px !important; padding-left: 5px !important; padding-right: 5px !important; margin: 10px !important; border-top-style: solid !important; border-top-width: 1px !important; border-top-color: #D3D3D3 !important; border-left-style: none !important; border-left-width: 1px !important; border-left-color: #D3D3D3 !important; border-right-style: none !important; border-right-width: 1px !important; border-right-color: #D3D3D3 !important; vertical-align: middle !important; overflow-x: hidden !important; }
 #bjovyrnhbw .gt_stub { color: #333333 !important; background-color: #FFFFFF !important; font-size: 100% !important; font-weight: initial !important; text-transform: inherit !important; border-right-style: solid !important; border-right-width: 2px !important; border-right-color: #D3D3D3 !important; padding-left: 5px !important; padding-right: 5px !important; }
 #bjovyrnhbw .gt_stub_row_group { color: #333333 !important; background-color: #FFFFFF !important; font-size: 100% !important; font-weight: initial !important; text-transform: inherit !important; border-right-style: solid !important; border-right-width: 2px !important; border-right-color: #D3D3D3 !important; padding-left: 5px !important; padding-right: 5px !important; vertical-align: top !important; }
 #bjovyrnhbw .gt_row_group_first td { border-top-width: 2px !important; }
 #bjovyrnhbw .gt_row_group_first th { border-top-width: 2px !important; }
 #bjovyrnhbw .gt_striped { background-color: rgba(128,128,128,0.05) !important; }
 #bjovyrnhbw .gt_table_body { border-top-style: solid !important; border-top-width: 2px !important; border-top-color: #D3D3D3 !important; border-bottom-style: solid !important; border-bottom-width: 2px !important; border-bottom-color: #D3D3D3 !important; }
 #bjovyrnhbw .gt_sourcenotes { color: #333333 !important; background-color: #FFFFFF !important; border-bottom-style: none !important; border-bottom-width: 2px !important; border-bottom-color: #D3D3D3 !important; border-left-style: none !important; border-left-width: 2px !important; border-left-color: #D3D3D3 !important; border-right-style: none !important; border-right-width: 2px !important; border-right-color: #D3D3D3 !important; }
 #bjovyrnhbw .gt_sourcenote { font-size: 90% !important; padding-top: 4px !important; padding-bottom: 4px !important; padding-left: 5px !important; padding-right: 5px !important; text-align: left !important; }
 #bjovyrnhbw .gt_left { text-align: left !important; }
 #bjovyrnhbw .gt_center { text-align: center !important; }
 #bjovyrnhbw .gt_right { text-align: right !important; font-variant-numeric: tabular-nums !important; }
 #bjovyrnhbw .gt_font_normal { font-weight: normal !important; }
 #bjovyrnhbw .gt_font_bold { font-weight: bold !important; }
 #bjovyrnhbw .gt_font_italic { font-style: italic !important; }
 #bjovyrnhbw .gt_super { font-size: 65% !important; }
 #bjovyrnhbw .gt_footnote_marks { font-size: 75% !important; vertical-align: 0.4em !important; position: initial !important; }
 #bjovyrnhbw .gt_asterisk { font-size: 100% !important; vertical-align: 0 !important; }
 
</style>

<table class="gt_table do-not-create-environment caption-top table table-sm table-striped small" data-quarto-bootstrap="false">
<thead>
<tr class="gt_heading header">
<td colspan="3" class="gt_heading gt_title gt_font_normal">Executive Orders Issued by Term</td>
</tr>
<tr class="gt_heading even">
<td colspan="3" class="gt_heading gt_subtitle gt_font_normal gt_bottom_border">Normalized for first 184 days in office (12.6% of term)</td>
</tr>
<tr class="gt_col_headings gt_spanner_row header">
<th rowspan="2" id="term_label" class="gt_col_heading gt_columns_bottom_border gt_center" data-quarto-table-cell-role="th" scope="col">Term</th>
<th colspan="2" id="Count" class="gt_center gt_columns_top_border gt_column_spanner_outer" data-quarto-table-cell-role="th" scope="colgroup"><span class="gt_column_spanner">Count</span></th>
</tr>
<tr class="gt_col_headings even">
<th id="n_ttl_pit" class="gt_col_heading gt_columns_bottom_border gt_center" data-quarto-table-cell-role="th" scope="col">Point in time</th>
<th id="n_ttl" class="gt_col_heading gt_columns_bottom_border gt_center" data-quarto-table-cell-role="th" scope="col">Full term</th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr class="odd">
<td class="gt_row gt_center">(2005) Bush</td>
<td class="gt_row gt_center">13</td>
<td class="gt_row gt_center">118</td>
</tr>
<tr class="even">
<td class="gt_row gt_center gt_striped">(2009) Obama</td>
<td class="gt_row gt_center gt_striped">22</td>
<td class="gt_row gt_center gt_striped">148</td>
</tr>
<tr class="odd">
<td class="gt_row gt_center">(2013) Obama</td>
<td class="gt_row gt_center">14</td>
<td class="gt_row gt_center">130</td>
</tr>
<tr class="even">
<td class="gt_row gt_center gt_striped">(2017) Trump</td>
<td class="gt_row gt_center gt_striped">42</td>
<td class="gt_row gt_center gt_striped">220</td>
</tr>
<tr class="odd">
<td class="gt_row gt_center">(2021) Biden</td>
<td class="gt_row gt_center">52</td>
<td class="gt_row gt_center">162</td>
</tr>
<tr class="even">
<td class="gt_row gt_center gt_striped">(2025) Trump</td>
<td class="gt_row gt_center gt_striped">174</td>
<td class="gt_row gt_center gt_striped">-</td>
</tr>
</tbody>
</table>


</div>
        


</div>
</div>
</figure>
</div>
</div>
<a class="quarto-notebook-link" id="nblink-1" href="raw-analysis-preview.html#cell-tbl-pit-eo">Source: Demo Notebook</a></div>
<p>Perhaps I thought it was better to show you a simple table first. But, then I want to show you a more complex plot. You don’t have to see that in my original notebook I actually made the plot first. (Actually, I made <em>two</em> plots, but only one seemed important to show.) So, I write:</p>
<pre><code>{{&lt; embed raw-analysis.ipynb#fig-cum-eo &gt;}}</code></pre>
<p>That produces this:</p>
<div class="quarto-embed-nb-cell">
<div id="cell-fig-cum-eo" class="cell" data-execution_count="9">
<div class="cell-output cell-output-display">
<div id="fig-cum-eo" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cum-eo-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://emilyriederer.com/post/quarto-comms/index_files/figure-html/raw-analysis-fig-cum-eo-output-1.png" id="fig-cum-eo" class="img-fluid figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-cum-eo-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
</div>
</div>
</div>
<p>If I had done a particularly good job of summarizing my thoughts immediately after seeing this plot, I might have already written them in a markdown cell over there. Embeds technically also work to embed markdown cells, so the following line you see is embedded from my other notebook also:</p>
<div class="quarto-embed-nb-cell">
<p>I have some thoughts…</p>
</div>
<p>However, I don’t advocate for embedding text. I think using a final <code>qmd</code> file as a single, self-contained spot to document your analysis has a lot of benefits.</p>
<p>And then I could go on to add my relevant thoughts and analysis specific to that plot. But, in this case, another part of professional communication is staying on topic.</p>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Caveat, here I mostly am reflecting on US undergraduate education in STEM-related disciplines. And, yet, even narrowly scoped this is certainly a very sweeping generalization.↩︎</p></li>
<li id="fn2"><p>The aphorism “You don’t pay the plumber for banging on the pipes. You pay them for knowing where to bang.” really sums up what it means to be a professional. Similarly, you don’t hire them to tell you about why they are banging where.↩︎</p></li>
<li id="fn3"><p>Personally, my professional communication became a lot better only after I grew senior enough to be on the <em>receiving</em> end of a lot of communication. At the beginning of my career, I wondered: “Didn’t my more senior audiences get to those roles because they were smart? Didn’t they want all the details?” But we must consider the audience’s context – not just their background knowledge but also their environment. Considering your executive audience as people who have thought about 7 different topics at 20 minute intervals before talking to you today frames a whole different set of constraints. My overall philosophy for communication over time has shifted more towards “How to be kind to burned out brains” than “How to get other people excited by all the cool stuff I did”.↩︎</p></li>
<li id="fn4"><p>I once coached an analyst who kept writing stories in this order. Responding to my feedback, they asked “But aren’t we supposed to tell a story?” This made me realize how overloaded and perhaps misleading the phrase “data storytelling” has become. Yes, we are telling the story <em>from the data</em> and not <em>about analyzing the data</em>. The analyst is not the main character!↩︎</p></li>
<li id="fn5"><p>FWIW, I find Quarto can get confused if we don’t put the blank line after the label line.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>workflow</category>
  <category>rstats</category>
  <category>python</category>
  <category>quarto</category>
  <category>rmarkdown</category>
  <guid>https://emilyriederer.com/post/quarto-comms/</guid>
  <pubDate>Sun, 27 Jul 2025 05:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/quarto-comms/featured.png" medium="image" type="image/png" height="97" width="144"/>
</item>
<item>
  <title>In my orbit: hacking orbital’s ML-to-SQL for xgboost</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/orbital-xgb/</link>
  <description><![CDATA[ 





<p>Posit’s recently-announced project <a href="https://github.com/posit-dev/orbital"><code>orbital</code></a> translates fitted scikit-learn pipelines to SQL for easy prediction scoring at scale. This project has many exciting applications: teams can deploy models for batch prediction with near-zero dependencies or custom infrastructure and have scores available to operationalize directly in their data warehouse.</p>
<p>As soon as I heard about the project, I was eager to test it out. However, much of my recent work is in pure <code>xgboost</code>, and neither <code>xgboost</code>’s learning API nor the scikit-learn compatible <code>XGBClassifier()</code> is inherently supported by <code>orbital</code>. This post describes a number of workarounds to get <code>orbital</code> working with <code>xgboost</code>. This <em>mostly</em> works, so we’ll also cover the known limitations.</p>
<p>Just want the code? The source notebook for this post is linked throughout and available to run end-to-end. I’m also stashing this and other ongoing explorations of wins, snags, and workflows with <code>orbital</code> in <a href="https://github.com/emilyriederer/orbital-exploration">this repo</a>.</p>
<p>(Separately, I’m planning to write about my current test-drive of <code>orbital</code>, possible applications/workflows, and current pitfalls. It would have been eminently logical to write that post first. However, I saw others requesting <code>xgboost</code> support for <code>orbital</code> on LinkedIn and began a conversation, so I wanted to pull forward this post.)</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Ready for Production?
</div>
</div>
<div class="callout-body-container callout-body">
<p>By <code>orbital</code>’s own admission in its <code>README</code>, it is still under development. The vision is exciting enough that I think it’s more than worth digging in, but be aware that it is likely not production-ready for enterprise-grade applications without rigorous independent validation. I’ve found some corner cases (logged on <a href="https://github.com/posit-dev/orbital/issues?q=is%3Aissue%20state%3Aopen%20author%3Aemilyriederer">GitHub issues</a>) and will share more thoughts in other posts.</p>
</div>
</div>
<section id="step-by-step-guide" class="level2">
<h2 class="anchored" data-anchor-id="step-by-step-guide">Step-by-Step Guide</h2>
<p>Preparing an <code>xgboost</code> model for use in <code>orbital</code> requires a number of transformations. Specifically, this quick “tech note” will cover:</p>
<ul>
<li>Converting a trained <code>xgboost</code> model into an <code>XGBClassifier</code></li>
<li>Adding a pre-trained classifier to a scikit-learn pipeline</li>
<li>Enabling <code>XGBClassifier</code> translation from <code>onnxmltools</code> for <code>orbital</code></li>
<li>Getting final SQL</li>
<li>Validating our results after this hop-scotch game of transformations</li>
</ul>
<p>Executing this successfully requires dealing with a handful of rough edges, largely driven by <code>onnxmltools</code>:</p>
<ul>
<li><code>onnxmltools</code> requires variable names of format <code>f{number}</code></li>
<li><code>xgboost</code> and <code>XGBClassifier</code> must use <code>base_score</code> of 0.5 (no longer the default!)</li>
<li><code>orbital</code> seems to complain if the pipeline does not include at least one column transformation</li>
<li><code>XGBClassifier</code> converter must be registered from <code>onnxmltools</code></li>
<li><code>orbital</code>’s parse function must be overwritten to hard-code the ONNX version for compatibility</li>
<li>in rare cases, final predictions vary due to different floating point logic in python and SQL (&lt;0.1% of our test cases)</li>
</ul>
<p>As we go, we’ll see how to address each of these challenges.</p>
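<p>To make the last bullet concrete, here is a minimal sketch of one way a floating point mismatch <em>could</em> flip a tree split. It assumes (and this is only an assumption, not something confirmed against <code>orbital</code>’s SQL) that one engine compares values in 32-bit precision while the other compares in 64-bit. The feature value and threshold are hypothetical.</p>

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical feature value, held as a 64-bit float
feature_value = 0.1
# Hypothetical split threshold that has been stored in 32-bit precision
threshold = to_float32(0.1)

# Engine A compares in 64-bit: 0.1 sits just below the rounded threshold
goes_left_64 = threshold > feature_value

# Engine B casts the feature to 32-bit first: both sides are now identical
goes_left_32 = threshold > to_float32(feature_value)

print(f"64-bit comparison sends row left: {goes_left_64}")
print(f"32-bit comparison sends row left: {goes_left_32}")
```

<p>A row that lands exactly on a split boundary can take different branches depending on where the rounding happens, which would be consistent with mismatches showing up in only a tiny share of rows.</p>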
<p>First, we’ll grab some sample data to work with:</p>
<div class="quarto-embed-nb-cell">
<div id="data-prep" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># make dataset</span></span>
<span id="cb1-2">X_train, y_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_classification(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, random_state <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">102</span>)</span>
<span id="cb1-3">X_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get column names for use in pipeline</span></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## onnxmltools forces these to be formatted as "f&lt;number&gt;"</span></span>
<span id="cb1-7">n_cols <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(X_train[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb1-8">nm_cols <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"f</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n_cols)]</span>
<span id="cb1-9">feat_dict <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {c:orbital.types.DoubleColumnType() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> c <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> nm_cols}</span></code></pre></div>
</div>
<a class="quarto-notebook-link" id="nblink-1" href="orbital-xgb-preview.html#cell-data-prep">Source: End-to-end notebook</a></div>
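<p>Real tables rarely have columns named <code>f0</code>, <code>f1</code>, and so on. A minimal sketch of one way to bridge that gap is to keep a positional mapping between real column names and the <code>f{number}</code> aliases. The column names below are hypothetical and not part of the original notebook.</p>

```python
# Hypothetical warehouse column names, in the order the model was trained on
real_cols = ["age", "income", "tenure_months"]

# Positional aliases in the "f{number}" format that onnxmltools expects
to_alias = {name: f"f{i}" for i, name in enumerate(real_cols)}

# Inverse mapping, handy for translating aliases back when debugging
to_real = {alias: name for name, alias in to_alias.items()}

# A SELECT list that exposes the real columns under their aliases
select_list = ", ".join(f"{name} AS {alias}" for name, alias in to_alias.items())
print(select_list)
# age AS f0, income AS f1, tenure_months AS f2
```

<p>The aliased view (or CTE) can then serve as the input table for the generated scoring SQL, so the renaming stays in one place.</p>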
<section id="converting-xgboost-model-to-an-xgbclassifier-pipeline" class="level3">
<h3 class="anchored" data-anchor-id="converting-xgboost-model-to-an-xgbclassifier-pipeline">Converting <code>xgboost</code> model to an <code>XGBClassifier</code> pipeline</h3>
<p><code>xgboost</code> provides two interfaces: a native learning API and a scikit-learn compatible API. The learning API is sometimes favored for performance advantages. However, since <code>orbital</code> can only work with scikit-learn pipelines, it’s necessary to move to a compatible API.</p>
<p>The strategy here is to fit an <code>xgboost</code> model (assuming that’s what you wanted to do in the first place), initialize an <code>XGBClassifier</code>, and set its attributes. Then, we can directly put our trained <code>XGBClassifier</code> into a pipeline.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Base Score/Margin
</div>
</div>
<div class="callout-body-container callout-body">
<p>For now, we must use a <code>base_score</code> of 0.5 for training <code>xgboost</code> and set the same value for the <code>XGBClassifier</code>. Recent versions of <code>xgboost</code> pick smarter values by default, but <code>orbital</code> (or perhaps <code>onnxmltools</code>) does not yet know how to correctly incorporate other base margins into SQL, resulting in incorrect predictions.</p>
<p>This is probably currently the biggest weakness of this overall approach because it’s the only blocker where the fix requires fundamentally changing a modeling decision.</p>
</div>
</div>
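<p>To see why the base score matters so much, here is a small sketch of the arithmetic as I understand it for the <code>binary:logistic</code> objective: the predicted probability is the sigmoid of the summed leaf values plus the logit of <code>base_score</code>. The leaf sum below is a hypothetical number, not output from the model in this post.</p>

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(leaf_sum, base_score):
    """Probability from summed leaf values plus the base margin."""
    return sigmoid(logit(base_score) + leaf_sum)


# Hypothetical sum of leaf values for one row
leaf_sum = 0.3

# With base_score = 0.5 the base margin is logit(0.5) = 0, so it drops out
p_half = predict_proba(leaf_sum, 0.5)

# Any other base_score shifts every prediction by a constant margin
p_other = predict_proba(leaf_sum, 0.2)

print(f"base_score=0.5: {p_half:.4f}")
print(f"base_score=0.2: {p_other:.4f}")
```

<p>If the translated SQL implicitly assumes a zero base margin, it will only agree with <code>xgboost</code> when <code>base_score</code> is 0.5. Any other value leaves every score off by a constant shift in log-odds.</p>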
<div class="quarto-embed-nb-cell">
<div id="xgb-to-pipe" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># train with xgb learning api</span></span>
<span id="cb2-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## keeping parameters super simple so it trains fast and easy to compare</span></span>
<span id="cb2-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## important: this only works for now with base_score=0.5 </span></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## this is the default assumed by orbital's logic, and I haven't figured out how to convince it otherwise</span></span>
<span id="cb2-5">dtrain <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> xgb.DMatrix(X_train, y_train, feature_names <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nm_cols)</span>
<span id="cb2-6">params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'max_depth'</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, </span>
<span id="cb2-7">          <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'objective'</span>:<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'binary:logistic'</span>, </span>
<span id="cb2-8">          <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'base_score'</span>:<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, </span>
<span id="cb2-9">          <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'seed'</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">504</span>}</span>
<span id="cb2-10">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> xgb.train(params, num_boost_round <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, dtrain <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dtrain)</span>
<span id="cb2-11">preds_xgb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.predict(xgb.DMatrix(X_train, feature_names <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nm_cols))</span>
<span id="cb2-12"></span>
<span id="cb2-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># convert back to skl interface &amp; rebuild needed metadata</span></span>
<span id="cb2-14">clf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> xgb.XGBClassifier()</span>
<span id="cb2-15">clf._Booster <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model</span>
<span id="cb2-16">clf.n_classes_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb2-17">clf.base_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span></span>
<span id="cb2-18">preds_skl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> clf.predict_proba(X_train)[:,<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb2-19"></span>
<span id="cb2-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># validate that the results are the same</span></span>
<span id="cb2-21"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"xgb and skl match: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(np.isclose(preds_xgb, preds_skl))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb2-22"></span>
<span id="cb2-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add to skl pipeline</span></span>
<span id="cb2-24">ppl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Pipeline([(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gbm"</span>, clf)])</span>
<span id="cb2-25">preds_ppl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ppl.predict_proba(X_train)[:,<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb2-26"></span>
<span id="cb2-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># validate that the results are the same</span></span>
<span id="cb2-28"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"xgb and ppl match: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(np.isclose(preds_xgb, preds_ppl))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>xgb and skl match: True
xgb and ppl match: True</code></pre>
</div>
</div>
<a class="quarto-notebook-link" id="nblink-2" href="orbital-xgb-preview.html#cell-xgb-to-pipe">Source: End-to-end notebook</a></div>
<p>We see all three approaches produce the same predictions.</p>
<p>Unfortunately, things aren’t quite that simple.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Add multiple pipeline steps
</div>
</div>
<div class="callout-body-container callout-body">
<p><code>orbital</code> seems to complain if it does not have at least one column-transformation pipeline step. I’ve yet to figure out exactly why, but in the meantime it’s no-cost to make a “fake” step that changes no columns.</p>
</div>
</div>
<p>Below, I remake the pipeline with a column transformer, ask it to apply to an empty list of variables, and request the rest (i.e.&nbsp;all of them) be passed through untouched.</p>
<div class="quarto-embed-nb-cell">
<div id="xgb-to-pipe-v2" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># now we actually make a slightly more complicated pipeline</span></span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># orbital seems unhappy if there isn't at least one preprocessing step,</span></span>
<span id="cb4-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># so we make one that processes no variables and passes through the rest</span></span>
<span id="cb4-4">pipeline <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Pipeline([</span>
<span id="cb4-5">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pre"</span>, ColumnTransformer([], remainder<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"passthrough"</span>)),</span>
<span id="cb4-6">])</span>
<span id="cb4-7">pipeline.fit(X_train)</span>
<span id="cb4-8">pipeline.steps.append((<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gbm"</span>, clf))</span>
<span id="cb4-9">preds_ppl2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pipeline.predict_proba(X_train)[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb4-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"xgb and ppl2 matches: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(np.isclose(preds_xgb, preds_ppl2))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>xgb and ppl2 matches: True</code></pre>
</div>
</div>
<a class="quarto-notebook-link" id="nblink-3" href="orbital-xgb-preview.html#cell-xgb-to-pipe-v2">Source: End-to-end notebook</a></div>
<p>Again, we see this “null” step does not change our predictions.</p>
</section>
<section id="enabling-onnxmltools-for-xgbclassifier-conversion" class="level3">
<h3 class="anchored" data-anchor-id="enabling-onnxmltools-for-xgbclassifier-conversion">Enabling <code>onnxmltools</code> for <code>XGBClassifier</code> conversion</h3>
<p><code>orbital</code> depends on <code>skl2onnx</code> which implements a smaller set of model types. <code>onnxmltools</code> offers many additional model converters. However, for <code>skl2onnx</code> to correctly find and apply these converters, they must be registered.</p>
<div class="quarto-embed-nb-cell">
<div id="register" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># `options` copied straight from `onnxmltools` docs</span></span>
<span id="cb6-2">update_registered_converter(</span>
<span id="cb6-3">    XGBClassifier,</span>
<span id="cb6-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"XGBoostXGBClassifier"</span>,</span>
<span id="cb6-5">    calculate_linear_classifier_output_shapes,</span>
<span id="cb6-6">    convert_xgboost,</span>
<span id="cb6-7">    options<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"nocl"</span>: [<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>], </span>
<span id="cb6-8">             <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"zipmap"</span>: [<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"columns"</span>], </span>
<span id="cb6-9">            },</span>
<span id="cb6-10">)</span></code></pre></div>
</div>
<a class="quarto-notebook-link" id="nblink-4" href="orbital-xgb-preview.html#cell-register">Source: End-to-end notebook</a></div>
<p>However, there’s another nuance here. We all know the challenges of python package versioning, but <code>skl2onnx</code> and <code>onnxmltools</code> must also agree on a version of the ONNX spec, which serves as the universal way to represent model objects. The <code>skl2onnx</code> function that allows us to request a version is wrapped in <code>orbital</code> without the ability to pass in parameters. So, we must override that function.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Override <code>orbital</code>’s <code>parse_pipeline()</code>
</div>
</div>
<div class="callout-body-container callout-body">
<p>This is required to set an ONNX version compatible between <code>skl2onnx</code> and <code>onnxmltools</code>. This is a lightweight function and not a class method, so we can just steal the code from the <code>orbital</code> package, modify it, and call it for ourselves. There is no need to monkeypatch.</p>
</div>
</div>
<div class="quarto-embed-nb-cell">
<div id="override-parse" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> parse_pipeline_local(</span>
<span id="cb7-2">    pipeline: Pipeline, features: orbital.types.FeaturesTypes</span>
<span id="cb7-3">) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> orbital.ast.ParsedPipeline:</span>
<span id="cb7-4"></span>
<span id="cb7-5">    onnx_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> skl2onnx.to_onnx(</span>
<span id="cb7-6">        pipeline,</span>
<span id="cb7-7">        initial_types<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb7-8">            (fname, ftype._to_onnxtype())</span>
<span id="cb7-9">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> fname, ftype <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> features.items()</span>
<span id="cb7-10">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> ftype.is_passthrough</span>
<span id="cb7-11">        ],</span>
<span id="cb7-12">        target_opset<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'ai.onnx.ml'</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>}</span>
<span id="cb7-13">    )</span>
<span id="cb7-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> orbital.ast.ParsedPipeline._from_onnx_model(onnx_model, features)</span></code></pre></div>
</div>
<a class="quarto-notebook-link" id="nblink-5" href="orbital-xgb-preview.html#cell-override-parse">Source: End-to-end notebook</a></div>
</section>
<section id="run-orbital" class="level3">
<h3 class="anchored" data-anchor-id="run-orbital">Run <code>orbital</code>!</h3>
<p>If you’ve made it this far, you’ll be happy to know the next step is straightforward. We can now run <code>orbital</code> to generate the SQL representation of our model prediction logic.</p>
<div class="quarto-embed-nb-cell">
<div id="translate" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># translate into an Orbital Pipeline</span></span>
<span id="cb8-2">orbital_pipeline <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> parse_pipeline_local(pipeline, features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>feat_dict)</span>
<span id="cb8-3">sql_raw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> orbital.export_sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DATA_TABLE"</span>, orbital_pipeline, dialect<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"duckdb"</span>)</span></code></pre></div>
</div>
<a class="quarto-notebook-link" id="nblink-6" href="orbital-xgb-preview.html#cell-translate">Source: End-to-end notebook</a></div>
</section>
<section id="validate-results" class="level3">
<h3 class="anchored" data-anchor-id="validate-results">Validate results</h3>
<p>So, after all that, did we get the right result? One way we can confirm (especially because we kept the initial <code>xgboost</code> model very simple) is to compare the visual of our tree with the resulting SQL.</p>
<p>Here’s the tree grown by <code>xgboost</code>:</p>
<div class="quarto-embed-nb-cell">
<div id="cell-xgb-viz" class="cell" data-execution_count="9">
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://emilyriederer.com/post/orbital-xgb/index_files/figure-html/orbital-xgb-xgb-viz-output-1.png" id="xgb-viz" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<a class="quarto-notebook-link" id="nblink-7" href="orbital-xgb-preview.html#cell-xgb-viz">Source: End-to-end notebook</a></div>
<p>Here’s the SQL developed by <code>orbital</code>:</p>
<div class="quarto-embed-nb-cell">
<div id="orb-txt" class="cell" data-execution_count="10">
<div class="cell-output cell-output-stdout">
<pre><code>SELECT
  1 / (
    EXP(
      -CASE
        WHEN "t0"."f4" &lt; -0.04800000041723251
        THEN CASE
          WHEN "t0"."f4" &lt; -0.8119999766349792
          THEN -0.5087512135505676
          ELSE -0.21405750513076782
        END
        ELSE CASE
          WHEN "t0"."f18" &lt; -0.4269999861717224
          THEN -0.3149999976158142
          ELSE 0.5008015036582947
        END
      END
    ) + 1
  ) AS "pred",
  "f1"
FROM "DATA_TABLE" AS "t0"</code></pre>
</div>
</div>
<a class="quarto-notebook-link" id="nblink-8" href="orbital-xgb-preview.html#cell-orb-txt">Source: End-to-end notebook</a></div>
<p>These appear to match!</p>
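<p>To make the comparison concrete, the SQL above can be transcribed by hand into a few lines of Python. This is only a sketch for reading the logic: the function name <code>orbital_pred</code> and the sample inputs are made up for illustration, and the thresholds and leaf values are copied from the SQL output.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import math


def orbital_pred(f4: float, f18: float) -> float:
    """Hand transcription of the CASE expression in the SQL above."""
    # Each "WHEN x is below t" test is written as "t > x" here; same comparison.
    if -0.04800000041723251 > f4:
        leaf = -0.5087512135505676 if -0.8119999766349792 > f4 else -0.21405750513076782
    else:
        leaf = -0.3149999976158142 if -0.4269999861717224 > f18 else 0.5008015036582947
    # 1 / (EXP(-leaf) + 1) is the logistic (sigmoid) transform of the leaf value
    return 1 / (math.exp(-leaf) + 1)


# One made-up input per leaf of the tree
for f4, f18 in [(-1.0, 0.0), (-0.5, 0.0), (0.5, -1.0), (0.5, 0.0)]:
    print(f4, f18, round(orbital_pred(f4, f18), 4))</code></pre></div>
<p>Each of the four branches corresponds to one leaf in the tree diagram, with the leaf value passed through the logistic function to give a probability.</p>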
<p>However, if we go to use the results, we find that there are some non-equal predictions.</p>
<div class="quarto-embed-nb-cell">
<div id="final-validation" class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1">DATA_TABLE <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(X_train, columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nm_cols)</span>
<span id="cb10-2">db_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> duckdb.sql(sql_mod).df()</span>
<span id="cb10-3">preds_orb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> db_preds[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pred'</span>]</span>
<span id="cb10-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"xgb and orb match: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(np.isclose(preds_xgb, preds_orb))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>xgb and orb match: False</code></pre>
</div>
</div>
<a class="quarto-notebook-link" id="nblink-9" href="orbital-xgb-preview.html#cell-final-validation">Source: End-to-end notebook</a></div>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Floating point math
</div>
</div>
<div class="callout-body-container callout-body">
<p>Predictions may differ slightly across platforms due to floating point precision. Below, we see 5 of 10K predictions were non-equal. We can pull out the values of <code>f4</code> and <code>f18</code> for those 5 records (the only variables used in the model) and compare them to either the SQL or the flowchart. All 5 misses lie right at the cutpoint for one of the nodes.</p>
</div>
</div>
<div class="quarto-embed-nb-cell">
<div id="cell-misses" class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># isolate and size misses</span></span>
<span id="cb12-2">misses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.where(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>np.isclose(preds_xgb, preds_orb))</span>
<span id="cb12-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Different predictions (N): </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(misses[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb12-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Different predictions (P): </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(misses[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(X_train)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb12-5"></span>
<span id="cb12-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># pull out f4 and f18; notice that all discrepancies lie exactly at the splitting points</span></span>
<span id="cb12-7">X_train[misses][:,[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>]]</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Different predictions (N): 5
Different predictions (P): 0.0005</code></pre>
</div>
<div id="misses" class="cell-output cell-output-display">
<pre><code>array([[-0.812, -0.515],
       [-0.812, -0.739],
       [ 1.715, -0.427],
       [ 0.025, -0.427],
       [ 2.119, -0.427]])</code></pre>
</div>
</div>
<a class="quarto-notebook-link" id="nblink-10" href="orbital-xgb-preview.html#cell-misses">Source: End-to-end notebook</a></div>
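<p>One plausible mechanism for these edge cases can be sketched with the standard library alone. The sketch assumes (this is not confirmed above) that <code>xgboost</code> compares features and split values as 32-bit floats while the database compares 64-bit doubles; the helper <code>to_float32</code> is a made-up name for illustration.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


threshold = to_float32(-0.812)  # the split value stored as a 32-bit float
x = -0.812                      # a feature value sitting exactly on the split

print(repr(threshold))
# 64-bit comparison, as a SQL engine working in doubles would do it
print("64-bit: x below threshold?", threshold > x)
# 32-bit comparison, after rounding x the same way as the threshold
print("32-bit: x below threshold?", threshold > to_float32(x))</code></pre></div>
<p>Under that assumption, a record exactly at the cutpoint falls on different sides of the split depending on the precision used, which is consistent with all 5 misses sitting at <code>-0.812</code> or <code>-0.427</code>.</p>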


</section>
</section>

 ]]></description>
  <category>python</category>
  <category>ml</category>
  <guid>https://emilyriederer.com/post/orbital-xgb/</guid>
  <pubDate>Sat, 19 Jul 2025 05:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/orbital-xgb/featured.PNG" medium="image"/>
</item>
<item>
  <title>Casual Inference Pod - Optimizing Data Workflows with Emily Riederer (Season 6, Episode 8)</title>
  <link>https://emilyriederer.com/talk/casual-pod/</link>
  <description><![CDATA[ 




<section id="quick-links" class="level2">
<h2 class="anchored" data-anchor-id="quick-links">Quick Links</h2>
<p><span><i class="bi bi-mic"></i> <a href="https://casualinfer.libsyn.com/site.pdf">Podcast Episode</a> </span></p>
<p>Casual Inference is a podcast on all things epidemiology, statistics, data science, causal inference, and public health. Sponsored by the American Journal of Epidemiology. As a guest on this episode, I discuss data science communication, the different challenges of causal analysis in industry versus academia, and much more.</p>


</section>

 ]]></description>
  <category>causal</category>
  <category>data</category>
  <guid>https://emilyriederer.com/talk/casual-pod/</guid>
  <pubDate>Wed, 25 Jun 2025 05:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/talk/casual-pod/featured.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>A different type of DAG - data pipelines for epidemiology</title>
  <link>https://emilyriederer.com/talk/epi-pipes/</link>
  <description><![CDATA[ 




<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-1-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-1" aria-controls="tabset-1-1" aria-selected="true">Quick Links</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-2" aria-controls="tabset-1-2" aria-selected="false">Abstract</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-3-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-3" aria-controls="tabset-1-3" aria-selected="false">Slides</a></li></ul>
<div class="tab-content">
<div id="tabset-1-1" class="tab-pane active" aria-labelledby="tabset-1-1-tab">
<p><span><i class="bi bi-file-bar-graph"></i> <a href="slides.pdf">Slides</a> </span></p>
</div>
<div id="tabset-1-2" class="tab-pane" aria-labelledby="tabset-1-2-tab">
<p>This talk was part of a symposium on data science tools and opportunities for adoption in epidemiology. The full session description is provided below:</p>
<p>Most applied research and education in epidemiology does not yet benefit from modern data science. Fledgling epidemiologists may receive cutting-edge education on the theory of epidemiologic methods, but remain largely untrained in how to collect data effectively, how to apply modern analytical methods to real data sets, how to reproducibly document code and results, and how to effectively work in teams in a digital workplace. Despite their own nagging concerns, they may rely on Dr.&nbsp;Google as their training on algorithms, document study procedures in e-mail chains, store data in spreadsheets, copy-paste analytical code, hard-code observations per person into separate variables, and manually type out estimates into results tables – only to discover that they are requested to do it all over when three study participants turn out to be ineligible for an analysis.</p>
<p>This symposium will illustrate success stories on how to efficiently practice data science in epidemiology and how to teach it along the way. There will be no exhortations how Excel is bad and that good people practice code sharing. Instead, the symposium will discuss cutting-edge approaches and real-life use cases of how modern data science has made research and teaching more efficient. The goal is for attendees to bring home a sparkling, vetted toolkit of new ideas and tools for research and teaching.</p>
</div>
<div id="tabset-1-3" class="tab-pane" aria-labelledby="tabset-1-3-tab">
<div id="slides" style="width:100%; aspect-ratio:16/11;">
<embed src="slides.pdf#zoom=Fit" width="100%" height="100%">
</div>
</div>
</div>
</div>



 ]]></description>
  <category>workflow</category>
  <category>data</category>
  <guid>https://emilyriederer.com/talk/epi-pipes/</guid>
  <pubDate>Wed, 11 Jun 2025 05:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/talk/epi-pipes/featured.png" medium="image" type="image/png" height="82" width="144"/>
</item>
<item>
  <title>Python Rgonomics - 2025 Update</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/py-rgo-2025/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://emilyriederer.com/post/py-rgo-2025/featured.jpg" class="img-fluid figure-img"></p>
<figcaption>Photo credit to the inimitable <a href="https://allisonhorst.com/">Allison Horst</a></figcaption>
</figure>
</div>
<p>About a year ago, I wrote the <a href="post/py-rgo">original version of Python Rgonomics</a> to help fellow former R users who were entering the world of python. The general point of the article was that new python tooling (e.g.&nbsp;<code>polars</code> versus <code>pandas</code>) has evolved to the point where some tools are truly performant and pythonic while still offering a familiar user experience for those coming from the R world. I also discussed this at <a href="talk/python-rgnomics">posit::conf(2024)</a>.</p>
<p>Ironically, the thesis held so true that it condemned my first 2024 post on the topic. 2024 saw the release of a few game-changing tools that further streamline and simplify the python workflow. This post provides an updated set of recommendations. Specifically, it highlights:</p>
<ul>
<li><strong>Consolidating installation and environment management tooling</strong>: Previously, I recommended <code>pyenv</code> for installing python versions and <code>pdm</code> for project and environment management. Then, last year saw the release of <a href="https://docs.astral.sh/uv/">Astral’s excellent <code>uv</code></a> which nicely consolidates this functionality into a single highly performant tool.</li>
<li><strong>Considering multiple IDE options</strong>: In addition to <code>VS Code</code>, I submit Posit PBC’s <a href="https://positron.posit.co/"><code>Positron</code></a> for consideration depending on comfort, needs, and use cases. Both are backed by the open-source Code OSS with different layers of flexibility or customization. Positron is mostly interoperable with VS Code extensions, but provides a bit more of a “batteries included” opinionated design for the data analyst persona that may not want to navigate through the customization afforded by VS Code.</li>
</ul>
<p>It is important to have a stable stack and not always jump to the next bright, shiny object; however, as I’ve watched these projects evolve throughout 2024, I feel confident to say they are not just a flash in the pan.</p>
<p><code>uv</code> is supported by Charlie Marsh’s Astral, which previously made <code>ruff</code> to consolidate a number of code quality tools. Astral’s commitment to open source, the careful design, and the incredible performance benchmarks of <code>uv</code> speak for themselves. Similarly, Positron is backed by the reliable Posit PBC (formerly RStudio) as an open source extension of Code OSS (which is also the open-source skeleton for Microsoft’s VS Code).</p>
<p>The rest of this post is reproduced in full with relevant updates so it reads end-to-end instead of referencing the changes from old to new recommendations.</p>
<section id="now-lets-get-started" class="level2">
<h2 class="anchored" data-anchor-id="now-lets-get-started">Now let’s get started</h2>
<p>The “expert-novice” duality is an uncomfortable part of switching between languages like R and python. Learning a new language is easily enough done; programming 101 concepts like truth tables and control flow translate seamlessly. But the ergonomics of a language do not. The tips and tricks we learn to be hyper productive in a primary language are comfortable, familiar, elegant, and effective. They just <em>feel</em> good. Working in a new language, developers often face a choice between forcing their favored workflows into a new tool where they may not “fit”, writing technically correct yet plodding code to get the job done, or approaching a new language as a true beginner to learn its “feel” from the ground up.</p>
<p>Fortunately, some of these higher-level paradigms have begun to bleed across languages, enriching previously isolated tribes and enabling developers to take their advanced skillsets with them across languages. For any R users who aim to upskill in python in 2024, recent tools and versions of old favorites have made strides in converging the R and python data science stacks. In this post, I will overview some recommended tools that are truly pythonic while capturing the comfort and familiarity of some favorite R packages of the <code>tidyverse</code> variety.<sup>1</sup></p>
</section>
<section id="what-this-post-is-not" class="level2">
<h2 class="anchored" data-anchor-id="what-this-post-is-not">What this post is not</h2>
<p>Just to be clear:</p>
<ul>
<li>This is not a post about why python is better than R so R users should switch all their work to python</li>
<li>This is not a post about why R is better than python so R semantics and conventions should be forced into python</li>
<li>This is not a post about why python <em>users</em> are better than R users so R users need coddling</li>
<li>This is not a post about why R <em>users</em> are better than python users and have superior tastes for their toolkit</li>
<li>This is not a post about why these python tools are the only good tools and others are bad tools</li>
</ul>
<p>If you told me you liked New York’s Metropolitan Museum of Art, I might say that you might also like Chicago’s Art Institute. That doesn’t mean you should only go to the museum in Chicago or that you should never go to the Louvre in Paris. That’s not how recommendations (by human or recsys) work. This is an “opinionated” post in the sense that “I like this” and not opinionated in the sense that “you must do this”.</p>
</section>
<section id="on-picking-tools" class="level2">
<h2 class="anchored" data-anchor-id="on-picking-tools">On picking tools</h2>
<p>The tools I highlight below tend to have two competing features:</p>
<ul>
<li>They have aspects of their workflow and ergonomics that should feel very comfortable to users of favored R tools</li>
<li>They should be independently accepted, successful, and well-maintained python projects with the true pythonic spirit</li>
</ul>
<p>The former is important because otherwise there’s nothing tailored about these recommendations; the latter is important so users actually engage with the python language and community instead of dabbling around in its more peripheral edges. In short, these two principles <em>exclude</em> tools that are direct ports between languages with that as their sole or main benefit.<sup>2</sup></p>
<p>For example, <code>siuba</code> and <code>plotnine</code> were written with the direct intent of mirroring R syntax. They have seen some success and adoption, but more niche tools come with liabilities. With smaller user-bases, they tend to lag in the pace of development, community support, prior art, StackOverflow questions, blog posts, conference talks, discussions, others to collaborate with, cachet in a portfolio, etc. Instead of enjoying the ergonomics of an old language or embracing the challenge of learning a new one, ports can sometimes force developers to invest energy into a “secret third thing” of learning tools that isolate them from both communities and facing inevitable snags by themselves.</p>
<p>When in Rome, do as the Romans do – but if you’re coming from the U.S. that doesn’t mean you can’t bring a universal adapter that can help charge your devices in European outlets.</p>
</section>
<section id="the-stack" class="level2">
<h2 class="anchored" data-anchor-id="the-stack">The stack</h2>
<p>With that preamble out of the way, below are a few recommendations for the most ergonomic tools for getting set up, conducting core data analysis, and communicating results.</p>
<p>To preview these recommendations:</p>
<p><strong>Set Up</strong></p>
<ul>
<li>Installation: <a href="https://docs.astral.sh/uv/"><code>uv</code></a></li>
<li>IDE:
<ul>
<li><a href="https://code.visualstudio.com/docs/languages/python">VS Code</a>, or</li>
<li><a href="https://positron.posit.co/">Positron</a></li>
</ul></li>
</ul>
<p><strong>Analysis</strong></p>
<ul>
<li>Wrangling: <a href="https://pola.rs/"><code>polars</code></a></li>
<li>Visualization: <a href="https://seaborn.pydata.org/"><code>seaborn</code></a></li>
</ul>
<p><strong>Communication</strong></p>
<ul>
<li>Tables: <a href="https://posit-dev.github.io/great-tables/articles/intro.html">Great Tables</a></li>
<li>Notebooks: <a href="https://quarto.org/">Quarto</a></li>
</ul>
<p><strong>Miscellaneous</strong></p>
<ul>
<li>Environment Management: <a href="https://docs.astral.sh/uv/"><code>uv</code></a></li>
<li>Code Quality: <a href="https://docs.astral.sh/ruff/"><code>ruff</code></a></li>
</ul>
<section id="for-setting-up" class="level3">
<h3 class="anchored" data-anchor-id="for-setting-up">For setting up</h3>
<p>The first hurdle is often getting started – both in terms of installing the tools you’ll need and getting into a comfortable IDE to run them.</p>
<ul>
<li><strong>Installation</strong>: R keeps installation simple; there’s one way to do it so you do and it’s done<sup>3</sup>. But before python converts can <code>print("hello world")</code>, they face a range of options (system Python, Python installer UI, Anaconda, Miniconda, etc.) each with its own kinks. These decisions are made harder in Python since projects tend to depend more strongly on a specific version of the language, requiring one to switch between versions. Fortunately, <code>uv</code> now makes this task easy with <a href="https://docs.astral.sh/uv/concepts/python-versions/#installing-a-python-version">many different commands for</a>:
<ul>
<li>Installing one or more specific versions: <code>uv python install &lt;version, constraints, etc.&gt;</code></li>
<li>Listing all available installations: <code>uv python list</code></li>
<li>Returning path of python executables: <code>uv python find</code></li>
<li>Spinning up a quick REPL with a <a href="https://valatka.dev/2025/01/12/on-killer-uv-feature.html">temporary python version</a> and packages: e.g.&nbsp;<code>uv run --python 3.12 --with pandas python</code></li>
</ul></li>
<li><strong>Integrated Development Environment</strong>: Once R is installed, R users are typically off to the races with the intuitive RStudio IDE which helps them get immediately hands-on with the REPL. With the UI divided into quadrants, users can write an R script, run it to see results in the console, conceptualize what the program “knows” with the variable explorer, and navigate files through a file explorer. Once again, python is not lacking in IDE options, but users are confronted with yet another decision point before they even get started. Pycharm, Sublime, Spyder, Eclipse, Atom, Neovim, oh my! For python, I’d recommend either VS Code or Positron, which are both extensions of Code OSS.
<ul>
<li><a href="https://code.visualstudio.com/docs/languages/python">VS Code</a> is an industry standard tool for software development. This means it has a rich set of features for coding, debugging, navigating large projects, etc. Its rich extension ecosystem also means that most major tools (e.g.&nbsp;Quarto, git, linters and stylers, etc.) have nice add-ons so, like RStudio, you can customize your platform to perform many side-tasks in plaintext or with the support of extra UI components.<sup>4</sup></li>
<li><a href="https://positron.posit.co/">Positron</a> is a newer entrant from Posit PBC (formerly RStudio). It streamlines the offerings of VS Code to center the features most useful for data analysis. Positron may feel easier to go from zero-to-one. It does a great job finding and consistently using the right versions of R, python, Quarto, etc. and prioritizes many of the IDE elements that make RStudio wonderful for working with data (e.g.&nbsp;object preview pane). Additionally, <em>most</em> VS Code extensions will work in Positron; however, Positron cannot use extensions <a href="https://mastodon.social/@emilyriederer/112853049023389552">that rely on Microsoft’s PyLance</a> meaning some realtime linting and error detection tools like ErrorLens do not work out-of-the-box. Ultimately, your comfort navigating VS Code and your mix of dev versus data work may determine which is best for you.</li>
</ul></li>
</ul>
</section>
<section id="for-data-analysis" class="level3">
<h3 class="anchored" data-anchor-id="for-data-analysis">For data analysis</h3>
<p>As data practitioners know, we’ll spend most of our time on cleaning and wrangling. As such, R users may struggle particularly to abandon their favorite tools for exploratory data analysis like <code>dplyr</code> and <code>ggplot2</code>. Fans of those packages often appreciate how their functional paradigm helps achieve a “flow state”. Precise syntax may differ, but new developments in the python wrangling stack provide increasingly close analogs to some of these beloved Rgonomics.</p>
<ul>
<li><strong>Data Wrangling</strong>: (<a href="post/py-rgo-polars">See my separate post on <code>polars</code></a>) Although <code>pandas</code> is undoubtedly the best-known wrangling tool in the python space, I believe the growing <a href="https://pola.rs/"><code>polars</code></a> project offers the best experience for a transitioning developer (along with other nice-to-have benefits like being dependency free and blazingly fast). <code>polars</code> may feel more natural and less error-prone to R users for many reasons:
<ul>
<li>it has more internally consistent (and <code>dplyr</code>-like) syntax such as <code>select</code>, <code>filter</code>, etc. and has demonstrated that the project values a clean API (e.g.&nbsp;recently renaming <code>groupby</code> to <code>group_by</code>)</li>
<li>it does not rely on the distinction between columns and indexes which can feel unintuitive and introduces a new set of concepts to learn</li>
<li>it consistently returns copies of dataframes (while <code>pandas</code> sometimes alters in-place) so code is more idempotent and avoids a whole class of failure modes for new users</li>
<li>it enables many of the same “advanced” wrangling workflows in <code>dplyr</code> with high-level, semantic code like making the transformation of multiple variables at once fast with <a href="https://docs.pola.rs/py-polars/html/reference/selectors.html">column selectors</a>, concisely expressing <a href="https://docs.pola.rs/user-guide/expressions/window/">window functions</a>, and working with nested data (or what <code>dplyr</code> calls “list columns”) with <a href="https://docs.pola.rs/user-guide/expressions/lists/">lists</a> and <a href="https://docs.pola.rs/user-guide/expressions/structs/">structs</a></li>
<li>it supports users working with increasingly large data. Similar to <code>dplyr</code>’s many backends (e.g.&nbsp;<code>dbplyr</code>), <code>polars</code> can be used to write lazily-evaluated, optimized transformations and its syntax is reminiscent of <code>pyspark</code> should users ever need to switch between them</li>
</ul></li>
<li><strong>Visualization</strong>: Even some of R’s critics will acknowledge the strength of <code>ggplot2</code> for visualization, both in terms of its intuitive and incremental API and the stunning graphics it can produce. <a href="https://seaborn.pydata.org/tutorial/objects_interface"><code>seaborn</code>’s object interface</a> (which <a href="https://seaborn.pydata.org/whatsnew/v0.12.0.html">cites <code>ggplot2</code> as an inspiration</a>) seems to strike a great balance between offering a similar workflow and bringing all the benefits of using an industry-standard tool</li>
</ul>
</section>
<section id="for-communication" class="level3">
<h3 class="anchored" data-anchor-id="for-communication">For communication</h3>
<p>Historically, one possible dividing line between R and python has been framed as “python is good at working with computers, R is good at working with people”. While that is partially inspired by reductive takes that R is not production-grade, it is not without truth that R’s academic roots spurred it to overinvest in a rich “communication stack” and in translating analytical outputs into human-readable, publishable outputs. Here, too, the gaps have begun to close.</p>
<ul>
<li><strong>Tables</strong>: R has no shortage of packages for creating nicely formatted tables, an area that has historically lacked a bit in python both in workflow and outcomes. Barring strong competition from the native python space, the one “port” I am bullish about is the recently announced <a href="https://posit-dev.github.io/great-tables/articles/intro.html">Great Tables</a> package. This is a pythonic clone of R’s <code>gt</code> package. I’m more comfortable recommending this since it’s maintained by the same developer as the R version (to support long-term feature parity), backed by an institution not just an individual (to ensure it’s not a short-lived hobby project), and the design feels like it does a good job balancing R inspiration with pythonic practices</li>
<li><strong>Computational notebooks</strong>: Jupyter Notebooks are widely used, widely critiqued parts of many python workflows. While the ability to mix markdown and code chunks is appealing, notebooks can introduce new types of bugs for the uninitiated; for example, they are hard to version control and easy to execute in the wrong environment. For those coming from the world of R Markdown, plaintext computational notebooks like <a href="https://quarto.org/">Quarto</a> may provide a more transparent development experience. While Quarto allows users to write in <code>.qmd</code> files which are more like their <code>.rmd</code> predecessors, its renderer can also handle Jupyter notebooks to enable collaboration across team members with different preferences</li>
</ul>
</section>
<section id="miscellaneous" class="level3">
<h3 class="anchored" data-anchor-id="miscellaneous">Miscellaneous</h3>
<p>A few more tools may be helpful and familiar to <em>some</em> R users who tend towards the more “developer” versus “analyst” side of the spectrum. These, in my mind, have even more varied pros and cons, but I’ll leave them for consideration:</p>
<ul>
<li><strong>Environment Management</strong>: There’s a truly overwhelming number of ways<sup>5</sup> to manage project-level dependencies in python. As a consequence, there’s also a lot of outdated advice weighing pros and cons of feature sets that have since evolved. Here again, <code>uv</code> takes the cake as a Swiss Army knife tool. It features fast installation, auto-updating of the <code>pyproject.toml</code> and <code>uv.lock</code> files (so you don’t need to remember to <code>pip freeze</code>), separate tracking of primary dependencies from the fully resolved environment (so you can cleanly and completely remove dependencies-of-dependencies you no longer need), and so much more. <code>uv</code> can operate as a drop-in replacement for <code>pip</code> and generate a <code>requirements.txt</code> if needed for compatibility; however, given its explosive popularity and ergonomic design, I doubt you’ll have trouble convincing collaborators to adopt the same.</li>
<li><strong>Developer Tools</strong>: <a href="https://docs.astral.sh/ruff/"><code>ruff</code></a> (another Astral project) provides a range of linting and styling options (think R’s <code>lintr</code> and <code>styler</code>) and provides a one-stop-shop over what can be an overwhelming number of atomic tools in this space (<code>isort</code>, <code>black</code>, <code>flake8</code>, etc.). <code>ruff</code> is super fast, has a nice VS Code extension, and, while this class of tools is generally considered more advanced, I think linters can be a fantastic “coach” for new users about best practices</li>
</ul>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Of course, languages have their own subcultures too. The <code>tidyverse</code> and <code>data.table</code> parts of the R world tend to favor different semantics and ergonomics. This post caters more to the former.↩︎</p></li>
<li id="fn2"><p>There is no doubt a place for language ports, especially for earlier-stage projects where no native language-specific standard exists. For example, I like Karandeep Singh’s lab work on <a href="https://github.com/TidierOrg/Tidier.jl">a tidyverse for Julia</a> and maintain my own <a href="https://github.com/emilyriederer/dbtplyr"><code>dbtplyr</code></a> package to port <code>dplyr</code>’s select helpers to <code>dbt</code>↩︎</p></li>
<li id="fn3"><p>However, to highlight some advances here, Posit’s newer <a href="https://github.com/r-lib/rig"><code>rig</code></a> project seems to be inspired by python install management tools and offers a convenient CLI for managing multiple versions of R↩︎</p></li>
<li id="fn4"><p> If anything, the one challenge of VS Code is the sheer number of set up options, but to start out, you can see these excellent tutorials from Rami Krispin on recommended <a href="https://github.com/RamiKrispin/vscode-python">python</a> and <a href="https://github.com/RamiKrispin/vscode-r">R</a> configurations ↩︎</p></li>
<li id="fn5"><p><code>pdm</code>, <code>virtualenv</code>, <code>conda</code>, <code>piptools</code>, <code>pipenv</code>, <code>poetry</code>, and that doesn’t even scratch the surface↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>rstats</category>
  <category>python</category>
  <guid>https://emilyriederer.com/post/py-rgo-2025/</guid>
  <pubDate>Sun, 26 Jan 2025 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/py-rgo-2025/featured.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Role-Based Access Control for Quarto sites with Netlify Identity</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/quarto-auth-netlify/</link>
  <description><![CDATA[ 





<p>Literate programming tools like R Markdown and Quarto make it easy to convert analyses into aesthetic documents, dashboards, and websites for public sharing. But what if you don’t want your results <em>too</em> public?</p>
<p>I recently was working on a project that required me to set up a large number of dashboards with similar content but different data for about 10 small, separate organizations. As I considered my tech stack, I found that many Quarto users were <a href="https://github.com/quarto-dev/quarto-cli/discussions/8393">asking similar questions</a>, but understandably the Quarto team had no one slam-dunk answer because authentication management (a serving / hosting problem) would be a substantial scope creep beyond the goals and core functionality of Quarto (an open-source publishing system).</p>
<p>After evaluating my options, I found the best solution for my use case was role-based access controls with <a href="https://docs.netlify.com/security/secure-access-to-sites/identity/">Netlify Identity</a>. In this post, I’ll briefly describe how this solution works, how to set it up, and some of the pros and cons.</p>
<section id="demo" class="level2">
<h2 class="anchored" data-anchor-id="demo">Demo</h2>
<p>Using a minimal Netlify Identity set-up, you can be up and running with the following UX in about 10 minutes. For this post, I show the true “minimum viable deployment”, although the styling and aesthetics could be made much fancier.</p>
<p>When users first visit your site’s homepage, they will be prompted to sign up or log in to continue.</p>
<p><img src="https://emilyriederer.com/post/quarto-auth-netlify/logged-out.png" class="img-fluid"></p>
<p>If users navigate to any other part of the site before logging in, they’ll receive an error message prompting them to return to the home screen. (This could be customized as you would a <code>404 Not Found</code> error.)</p>
<p><img src="https://emilyriederer.com/post/quarto-auth-netlify/not-found.png" class="img-fluid"></p>
<p>After clicking either button, an in-browser popup modal allows them to sign up, log in, or request forgotten credentials.</p>
<div>

</div>
<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://emilyriederer.com/post/quarto-auth-netlify/signup.png" class="img-fluid"></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://emilyriederer.com/post/quarto-auth-netlify/login.png" class="img-fluid"></p>
</div>
</div>
</div>
<p>The example above shows the option to create a custom login or use Google to authenticate. Netlify also allows for the options to use other free (e.g.&nbsp;GitHub, GitLab) or paid (e.g.&nbsp;Okta) third-party login services.</p>
<p>For new signups, Netlify can automatically trigger confirmation emails with <a href="https://docs.netlify.com/security/secure-access-to-sites/identity/identity-generated-emails/">customized content</a> based on a templated text or HTML file in your repository.</p>
<p>Once logged in, the homepage then offers the option to log back out.</p>
<p><img src="https://emilyriederer.com/post/quarto-auth-netlify/logged-in.png" class="img-fluid"></p>
<p>Otherwise, users can then proceed to the rest of the site as if it were publicly available.</p>
</section>
<section id="set-up" class="level2">
<h2 class="anchored" data-anchor-id="set-up">Set Up</h2>
<p>The basics of how Netlify Identity works are described at length in <a href="https://docs.netlify.com/security/secure-access-to-sites/role-based-access-control/#create-users-and-set-roles">this blog post</a>. If you decide to implement this solution, I recommend reading those official documents for a more robust mental model. In short, Netlify Identity works by attaching a token to each user after they log in. This user-specific token can be assigned different roles on the backend, and depending on which roles a user has, they can be redirected to (or gated from) seeing different content.</p>
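<p>To make the token idea concrete, here is a minimal Python sketch that reads the roles out of a JWT payload. It only decodes the payload for inspection and does not verify the signature, so it is not suitable for making access decisions. It assumes that roles live under an <code>app_metadata.roles</code> claim, which is my understanding of how Identity tokens are shaped; the function names are hypothetical.</p>

```python
import base64
import json


def jwt_claims(token):
    """Decode the payload of a JWT without verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def roles_from(token):
    """Return the roles list, assuming it sits under app_metadata.roles."""
    return jwt_claims(token).get("app_metadata", {}).get("roles", [])
```

<p>In practice Netlify reads the token for you when it applies redirect rules; a helper like this is only useful for debugging what a logged-in user’s token contains.</p>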
<p>Setting up Netlify Identity requires a few small tweaks throughout your site:</p>
<ol type="1">
<li>Add Javascript to each page to handle the JSON Web Tokens (JWTs) set by Identity. This is done most easily through the <code>_quarto.yml</code></li>
<li>Configure site redirects to respond to the JWTs. This is contained in its own <code>_redirects</code> file</li>
<li>Ensure you have a user interface that allows users to sign up and login, thus changing their JWTs and access. I put this in my <code>index.qmd</code></li>
</ol>
<p>Then, finally, within the Netlify admin panel, you must:</p>
<ol start="4" type="1">
<li>Configure the user signup workflow (e.g.&nbsp;by invitation, open sign-up)</li>
<li>Assign users to roles that determine what content they can see</li>
<li>Optionally, enable third-party forms of authentication (e.g.&nbsp;Google, GitHub)</li>
</ol>
<p>Let’s take these one at a time.</p>
<section id="configure-role-authentiation" class="level3">
<h3 class="anchored" data-anchor-id="configure-role-authentiation">Configure Role Authentication</h3>
<p>Netlify maintains <a href="https://github.com/netlify/netlify-identity-widget">an Identity widget</a> that handles recognizing authenticated users and their roles from their JWTs. To inject this Javascript snippet into every page, open the <code>_quarto.yml</code> file and add the Javascript snippet to the <code>include-in-header:</code> key under the HTML format, e.g.:</p>
<pre><code>format:
  html: 
    include-in-header: 
      text: |
        &lt;script type="text/javascript" src="https://identity.netlify.com/v1/netlify-identity-widget.js"&gt;&lt;/script&gt;
        &lt;script&gt;
        window.netlifyIdentity.on('login', (user) =&gt; {
        window.netlifyIdentity.refresh(true).then(() =&gt; {
          console.log(user);
        });
        });
        window.netlifyIdentity.on('logout', (user) =&gt; {
        window.location.href = '/login';
        });
        window.netlifyIdentity.init({ container: '#netlify' });
        &lt;/script&gt;</code></pre>
<p>Note, the official widget is injected using the <code>src</code> field of the first <code>script</code> tag.</p>
</section>
<section id="configure-site-redirects" class="level3">
<h3 class="anchored" data-anchor-id="configure-site-redirects">Configure Site Redirects</h3>
<p>Next, create a <a href="https://docs.netlify.com/security/secure-access-to-sites/role-based-access-control/#redirect-visitors-based-on-roles"><code>_redirects</code> file</a> at the top level of your project (or open the existing file) and add the following lines:</p>
<pre><code>/login /
/*  /:splat  200!  Role=admin
/site_libs/* /site_libs/:splat 200!
/   /        200!
/*  /  401!</code></pre>
<p>Syntax for the <code>_redirects</code> file is described <a href="https://docs.netlify.com/routing/redirects/#syntax-for-the-redirects-file">here</a>, but basically each line defines a rule with the structure:</p>
<pre><code>&lt;what was requested&gt; &lt;where to go&gt; &lt;the result to give&gt; &lt;conditional on role&gt;</code></pre>
<p>And, like a <code>case when</code> statement, the first “matching” rule dominates.</p>
<p>So, the example above can roughly be read in English as:</p>
<pre><code>If users go to the /login page, take them back to home
If users try to go anywhere else on my site and they have role admin, let them do that 
If users request the shared assets under /site_libs (regardless of their role), let them do that
If users try to go to the homepage of my site (regardless of their role), let them do that
If users otherwise try to go to other parts of the site (and they don't have admin), give an error</code></pre>
<p>Of course, this could be further customized to set different rules for different subdirectories.</p>
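<p>The “first matching rule wins” behavior can be sketched in a few lines of Python. This is a simplified model for building intuition, not Netlify’s actual matcher: the <code>RULES</code> list mirrors the example file above, the <code>resolve</code> function is hypothetical, and I am assuming a rule without an explicit status acts as a 301 redirect.</p>

```python
from fnmatch import fnmatch

# Each rule: (path pattern, destination, status, required role or None).
# The order matters because the first matching rule is the one applied.
RULES = [
    ("/login", "/", 301, None),  # assumed default status for a bare redirect
    ("/*", "/:splat", 200, "admin"),
    ("/site_libs/*", "/site_libs/:splat", 200, None),
    ("/", "/", 200, None),
    ("/*", "/", 401, None),
]


def resolve(path, roles):
    """Return (destination, status) from the first rule that applies."""
    for pattern, destination, status, role in RULES:
        if not fnmatch(path, pattern):
            continue  # the requested path does not match this rule
        if role is not None and role not in roles:
            continue  # the rule is gated on a role the user lacks
        # Whatever the trailing * matched is substituted for :splat.
        splat = path[len(pattern) - 1:] if pattern.endswith("*") else ""
        return destination.replace(":splat", splat), status
    return path, 404
```

<p>For example, <code>resolve("/dashboard.html", ["admin"])</code> falls through to the second rule and serves the page, while the same request with no roles skips it and lands on the final catch-all rule.</p>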
</section>
<section id="create-user-interface" class="level3">
<h3 class="anchored" data-anchor-id="create-user-interface">Create User Interface</h3>
<p>To create the user interface for the login screen, I added code to inject a Netlify-maintained login widget to my site’s <code>index.qmd</code>, e.g.:</p>
<pre><code>---
date: last-modified
---

# Home {.unnumbered}

&lt;div data-netlify-identity-menu&gt;&lt;/div&gt;

Welcome! Please sign in to view the dashboard. 

If you are a first time user, please create a login and email [emilyriederer@gmail.com](mailto:emilyriederer@gmail.com?subject=Dashboard%20Access%20Request) to elevate your access.</code></pre>
</section>
<section id="user-onboarding" class="level3">
<h3 class="anchored" data-anchor-id="user-onboarding">User Onboarding</h3>
<p>After the changes above to your actual Quarto site, the rest of the work lies in the Netlify admin panel. For a small number of users, you can manually change their role in the user interface.</p>
<p><img src="https://emilyriederer.com/post/quarto-auth-netlify/user-mgmt.png" class="img-fluid"></p>
<p>However, to work at any scale, you may need a more automated solution. For that, Netlify’s docs explain how to configure initial role assignment via <a href="https://www.netlify.com/blog/2019/02/21/the-role-of-roles-and-how-to-set-them-in-netlify-identity/">lambda functions</a>. The out-of-the-box functionality that I found lacking was the ability to assign default roles to new users or to configure basic logic, such as assigning the same role to any new users onboarding from a certain email domain.</p>
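<p>As a rough illustration of the kind of business rule I have in mind, here is a Python sketch that maps an email domain to initial roles. Netlify functions are typically written in JavaScript, so treat this as pseudocode for the logic rather than something to deploy. The domains, the <code>DOMAIN_ROLES</code> mapping, and the function names are hypothetical, and the response shape (roles nested under <code>app_metadata</code>) is an assumption based on the linked post.</p>

```python
# Hypothetical mapping from email domain to the roles new users should get.
DOMAIN_ROLES = {
    "example.org": ["admin"],
    "partner.example.com": ["viewer"],
}


def initial_roles(email, default=()):
    """Return the roles to assign at signup based on the email's domain."""
    domain = email.rsplit("@", 1)[-1].lower()
    return list(DOMAIN_ROLES.get(domain, default))


def signup_response(email):
    """Build a response body that nests the roles under app_metadata."""
    return {"app_metadata": {"roles": initial_roles(email)}}
```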
</section>
</section>
<section id="is-it-for-you" class="level2">
<h2 class="anchored" data-anchor-id="is-it-for-you">Is it for you?</h2>
<p>Netlify Identity isn’t the perfect solution for all use cases, but for many small websites and blogs it’s possibly one of the lowest friction solutions available.</p>
<p>This solution is easy to set up initially, allows some degree of self-service for users (account set-up and password resets), user communication (email management), and third-party integration (e.g.&nbsp;authenticate with GitHub or Google). It also has a robust free tier, allowing 1K users to self register (and 5 registrations-by-invitation), and is a substantial step up over locking down HTML content with a single common password.</p>
<p>However, Netlify Identity is not a bullet-proof end-to-end security solution and could become painful or expensive at large scale. This solution, for example, doesn’t contemplate securing your website’s full “supply chain” (e.g.&nbsp;if the source code is in a public GitHub repo) and certainly is less secure than hosting your site completely within a sandboxed environment or intranet. For a large number of users, I also feel there’s a large opportunity to allow simple business rules to configure initial roles.</p>
<p>In summary, I would generally recommend Netlify Identity if you’re already using Netlify, expect a small number of users, and are comfortable adding <em>friction</em> to your sign-in process versus absolute security. For larger projects with higher usage and more bullet-proof security needs, it may be worth considering alternatives.</p>


</section>

 ]]></description>
  <category>quarto</category>
  <category>rmarkdown</category>
  <category>workflow</category>
  <guid>https://emilyriederer.com/post/quarto-auth-netlify/</guid>
  <pubDate>Sun, 10 Nov 2024 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/quarto-auth-netlify/featured.PNG" medium="image"/>
</item>
<item>
  <title>Python Rgonomics</title>
  <link>https://emilyriederer.com/talk/python-rgonomics/</link>
  <description><![CDATA[ 




<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-1-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-1" aria-controls="tabset-1-1" aria-selected="true">Quick Links</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-2" aria-controls="tabset-1-2" aria-selected="false">Abstract</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-3-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-3" aria-controls="tabset-1-3" aria-selected="false">Slides</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-4-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-4" aria-controls="tabset-1-4" aria-selected="false">Video</a></li></ul>
<div class="tab-content">
<div id="tabset-1-1" class="tab-pane active" aria-labelledby="tabset-1-1-tab">
<p><span><i class="bi bi-file-bar-graph"></i> <a href="slides.pdf">Slides</a> </span><br>
<span><i class="bi bi-play"></i> <a href="https://www.youtube.com/watch?v=ILxK92HDtvU&amp;list=PL9HYL-VRX0oSFkdF4fJeY63eGDvgofcbn">Video</a> </span><br>
<span><i class="bi bi-pencil"></i> <a href="../..\post/py-rgo/">Post - Python Rgonomics</a> </span><br>
<span><i class="bi bi-pencil"></i> <a href="../..\post/py-rgo-polars/">Post - Advanced <code>polars</code> versus <code>dplyr</code></a> </span></p>
</div>
<div id="tabset-1-2" class="tab-pane" aria-labelledby="tabset-1-2-tab">
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>Tooling changes quickly. Since this talk occurred, Astral’s <code>uv</code> project has come out as a very strong contender to replace <code>pyenv</code>, <code>pdm</code>, and more of the devtools part of a python stack.</p>
</div>
</div>
<p>Data science languages are increasingly interoperable with advances like Arrow, Quarto, and Posit Connect. But data scientists are not. Learning the basic syntax of a new language is easy, but relearning the ergonomics that help us be hyperproductive is hard. In this talk, I will explore the influential ergonomics of R’s tidyverse. Next, I will recommend a curated stack that mirrors these ergonomics while also being genuinely pythonic. In particular, we will explore packages (polars, seaborn objects, greattables), frameworks (Shiny, Quarto), dev tools (pyenv, ruff, and pdm), and IDEs (VS Code extensions). The audience should leave feeling inspired to try python while benefiting from their current knowledge and expertise.</p>
</div>
<div id="tabset-1-3" class="tab-pane" aria-labelledby="tabset-1-3-tab">
<div id="slides" style="width:100%; aspect-ratio:16/11;">
<embed src="slides.pdf#zoom=Fit" width="100%" height="100%">
</div>
</div>
<div id="tabset-1-4" class="tab-pane" aria-labelledby="tabset-1-4-tab">
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/ILxK92HDtvU" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</div>
</div>
</div>



 ]]></description>
  <category>workflow</category>
  <category>python</category>
  <category>rstats</category>
  <guid>https://emilyriederer.com/talk/python-rgonomics/</guid>
  <pubDate>Thu, 15 Aug 2024 05:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/talk/python-rgonomics/featured.png" medium="image" type="image/png" height="78" width="144"/>
</item>
<item>
  <title>Crosspost: Data discovery doesn’t belong in ad hoc queries</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/data-discovery-ad-hoc/</link>
  <description><![CDATA[ 





<p>Credible documentation is the best tool for working with data. Short of that, labor (and computational) intensive validation may be required. Recently, I had the opportunity to expand on these ideas in a <a href="https://www.selectstar.com/resources/data-discovery-doesnt-belong-in-ad-hoc-queries">cross-post with Select Star</a>. I explore how a “good” data analyst can interrogate a dataset with expensive queries and, more importantly, how best-in-class data products eliminate the need for this.</p>
<p>My post is reproduced below.</p>
<hr>
<p>In the current environment of decreasing headcount and rising cloud costs, the benefits of data management are more objective and tangible than ever. Done well, data management can reduce the cognitive and computational costs of working with enterprise-scale data.</p>
<p>Analysts often jump into new-to-them tables to answer business questions. Without a robust data platform, this constant novelty leads analysts down one of two paths. Either they boldly gamble that they have found intuitive and relevant data or they painstakingly hypothesize and validate assumptions for each new table. The latter approach leads to more trustworthy outcomes, but it comes at the cost of human capital and computational power.</p>
<p>Consider an analyst at an e-commerce company asking the question “How many invoices did we generate for fulfilled orders to Ohio in June?” while navigating unfamiliar tables. In this post, we explore prototypical queries analysts might have to run to validate a new-to-them table. Many of these are “expensive” queries requiring full table scans. Next, we’ll examine how a data discovery platform can obviate this effort.</p>
<p>The impact of this inefficiency may range from a minor papercut to a major cost sink depending on the sizes of your analyst community, historical enterprise data, and warehouse.</p>
<section id="preventable-data-discovery-queries" class="level2">
<h2 class="anchored" data-anchor-id="preventable-data-discovery-queries">6 Preventable Data Discovery Queries</h2>
<section id="what-columns-are-in-the-table" class="level3">
<h3 class="anchored" data-anchor-id="what-columns-are-in-the-table">1. What columns are in the table?</h3>
<p>Without a good data catalog, analysts will first need to check what fields exist in a table. While there may be lower cost ways to do this like looking at a pre-rendered preview (à la BigQuery), using a DESCRIBE statement (à la Spark), or limiting their query to the first few rows, some analysts may default to requesting all the data.</p>
<pre><code>select *
from invoices;</code></pre>
</section>
<section id="is-the-table-still-live-and-updating" class="level3">
<h3 class="anchored" data-anchor-id="is-the-table-still-live-and-updating">2. Is the table still live and updating?</h3>
<p>After establishing that a table has potentially useful information, analysts should next wonder if the data is still live and updating. First they might check a date field to see if the table seems “fresh”.</p>
<pre><code>select max(order_date) 
from invoices;</code></pre>
<p>But, of course, tables often have multiple date fields. For example, an e-commerce invoice table might have fields for both the date an order was placed and the date the record was last modified. So, analysts may guess-and-check a few of these fields to determine table freshness.</p>
<pre><code>select max(updated_date) 
from invoices;</code></pre>
<p>After identifying the correct field, there’s still a question of refresh cadence. Are records added hourly? Daily? Monthly? Lacking system-level metrics and metadata on the upstream table freshness, analysts are still left in the dark. So, once again, they can check empirically by looking at the frequency of the date field.</p>
<pre><code>select updated_date, count(1) as n
from invoices
group by 1
order by 1 desc;</code></pre>
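<p>To make the cadence check concrete, the sketch below runs a per-date record count against a toy in-memory SQLite table using Python’s standard-library <code>sqlite3</code> module. The rows are invented for illustration, and a real warehouse would use its own client and SQL dialect.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table invoices (order_id integer, updated_date text)")
conn.executemany(
    "insert into invoices values (?, ?)",
    [(1, "2024-06-01"), (2, "2024-06-01"), (3, "2024-06-02"), (4, "2024-06-04")],
)

# Count records per distinct date, newest first, to eyeball the refresh cadence.
cadence = conn.execute(
    """
    select updated_date, count(1) as n
    from invoices
    group by updated_date
    order by updated_date desc
    """
).fetchall()
```

<p>Gaps between consecutive dates in <code>cadence</code> (here, nothing on June 3) hint at whether the table refreshes daily or on some other schedule.</p>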
</section>
<section id="what-is-the-grain-of-the-table" class="level3">
<h3 class="anchored" data-anchor-id="what-is-the-grain-of-the-table">3. What is the grain of the table?</h3>
<p>Now that the table is confirmed to be usable, the question becomes how to use it. Specifically, to credibly query and join the table, analysts next must determine its grain. Often, they start with a guess informed by the business context and data modeling conventions, such as assuming an invoice table is unique by order_id.</p>
<pre><code>select count(1) as n, count(distinct order_id)
from invoices;</code></pre>
<p>However, if they learn that order_id has a different cardinality than the number of records, they must ask why. So, once again, they scan the full table to find examples of records with shared order_id values.</p>
<pre><code>select *
from invoices
qualify count(1) over (partition by order_id) &gt; 1
order by order_id
limit 10;</code></pre>
<p>Eyeballing the results of this query, the analyst might notice that the same order_id value can coincide with different ship_id values, as a separate invoice is generated for each part of an order when a subset of items is shipped. With this new hypothesis, the analyst iterates on the validation of the grain.</p>
<pre><code>select count(1) as n, count(distinct order_id, ship_id)
from invoices;</code></pre>
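<p>The same grain hypothesis can be tested with a small reusable helper. The sketch below uses Python’s standard-library <code>sqlite3</code> module with a toy table; because SQLite does not accept a multi-column <code>count(distinct ...)</code>, it looks for repeated key combinations with a <code>group by</code> instead. The <code>is_unique_by</code> helper and the sample rows are hypothetical.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table invoices (order_id integer, ship_id integer)")
conn.executemany("insert into invoices values (?, ?)", [(1, 1), (1, 2), (2, 1)])


def is_unique_by(conn, table, columns):
    """Return True when no combination of `columns` repeats in `table`."""
    cols = ", ".join(columns)
    # Names are interpolated directly into the SQL, so only pass trusted names.
    dupes = conn.execute(
        f"select count(1) from (select 1 from {table} group by {cols} having count(1) != 1)"
    ).fetchone()[0]
    return dupes == 0
```

<p>With these rows, <code>order_id</code> alone is not a valid grain because order 1 appears twice, while the combination of <code>order_id</code> and <code>ship_id</code> is.</p>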
</section>
<section id="what-values-can-categorical-variables-take" class="level3">
<h3 class="anchored" data-anchor-id="what-values-can-categorical-variables-take">4. What values can categorical variables take?</h3>
<p>The prior questions all involved table structure. Only now can an analyst finally begin to investigate the table’s content. A first step might be to understand the valid values for categorical variables. For example, if our analyst wanted to ensure only completed orders were queried, they might inspect the potential values of the order_status_id field to determine which values to include in a filter.</p>
<pre><code>select distinct order_status_id
from invoices;</code></pre>
<p>They’ll likely repeat this process for many categorical variables of interest. Since our analyst is interested in shipments specifically to Ohio, they might also inspect the cardinality of the ship_state field to ensure they correctly format the identifier.</p>
<pre><code>select distinct ship_state
from invoices;</code></pre>
</section>
<section id="do-numeric-columns-have-nulls-or-sentinel-values-to-encode-nulls" class="level3">
<h3 class="anchored" data-anchor-id="do-numeric-columns-have-nulls-or-sentinel-values-to-encode-nulls">5. Do numeric columns have nulls or ‘sentinel’ values to encode nulls?</h3>
<p>Similarly, analysts may wish to audit other variables for null handling or sentinel values by inspecting column-level statistics.</p>
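<pre><code>select count(1) as n, count(invoice_amount) as n_non_null, min(invoice_amount), max(invoice_amount)
from invoices; -- invoice_amount is a stand-in for any numeric column of interest</code></pre>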
</section>
<section id="is-the-data-stored-with-partitioning-or-clustering-keys" class="level3">
<h3 class="anchored" data-anchor-id="is-the-data-stored-with-partitioning-or-clustering-keys">6. Is the data stored with partitioning or clustering keys?</h3>
<p>Inefficient queries aren’t only a symptom of ad hoc data validation. More complex and reused logic may also be written wastefully when table metadata like partitioning and clustering keys is not available to analysts. For example, an analyst might be able to construct a reasonable query filtering either on a shipment date or an order date, but if only one of these is a partitioning or clustering key, different queries could have substantial performance differences.</p>
</section>
</section>
<section id="understanding-your-data-without-relying-on-queries" class="level2">
<h2 class="anchored" data-anchor-id="understanding-your-data-without-relying-on-queries">Understanding Your Data Without Relying on Queries</h2>
<p>Analysts absolutely should ask themselves these types of questions when working with new data. However, it should not be analysts’ job to individually answer these questions by running SQL queries. Instead, best-in-class data documentation can provide critical information through a data catalog like Select Star.</p>
<section id="what-columns-are-in-the-table-and-do-we-need-a-table" class="level3">
<h3 class="anchored" data-anchor-id="what-columns-are-in-the-table-and-do-we-need-a-table">1. What columns are in the table? And do we need a table?</h3>
<p>Comprehensive search across all of an organization’s assets can help users quickly identify the right resources based on table names, field names, or data descriptions. Even better, search can incorporate observed tribal knowledge of table popularity and common querying patterns to prioritize the most relevant results. Moreover, when search also includes downstream data products like pre-built reports and dashboards, analysts might sometimes find an answer to their question exists off the shelf.</p>
</section>
<section id="is-the-table-still-live-and-updating-and-are-its-own-sources-current" class="level3">
<h3 class="anchored" data-anchor-id="is-the-table-still-live-and-updating-and-are-its-own-sources-current">2. Is the table still live and updating? And are its own sources current?</h3>
<p>Data is not a static artifact so metadata should not be either. After analysts identify a candidate table, they should have access to real-time operational information like table usage, table size, refresh date, and upstream dependencies to help confirm whether the table is a reliable resource.</p>
<p>Ideally, analysts can interrogate not just the freshness of a final table but also its dependencies by exploring the table’s data lineage.</p>
</section>
<section id="what-is-the-grain-of-the-table-and-how-does-it-relate-to-others" class="level3">
<h3 class="anchored" data-anchor-id="what-is-the-grain-of-the-table-and-how-does-it-relate-to-others">3. What is the grain of the table? And how does it relate to others?</h3>
<p>Table grain should be clearly documented at the table level and emphasized in the data dictionary via references to primary and foreign keys. Beyond basic documentation, entity-relationship (ER) diagrams will help analysts gain a richer mental model of table grains and of how they can use these primary-foreign key relationships to link tables to craft information with the desired grain and fields. Alternatively, they can glean this information from the wisdom of the crowds if they have access to how others have queried and joined the data previously.</p>
</section>
<section id="what-values-can-categorical-variables-take-do-numeric-columns-have-nulls-or-sentinel-values-to-encode-nulls" class="level3">
<h3 class="anchored" data-anchor-id="what-values-can-categorical-variables-take-do-numeric-columns-have-nulls-or-sentinel-values-to-encode-nulls">4. What values can categorical variables take? Do numeric columns have nulls or ‘sentinel’ values to encode nulls?</h3>
<p>Information about proper expectations and handling of categorical and null values may be published as field definitions, pointed to lookup tables, implied in data tests, or illustrated in past queries. To drive consistency and offload redundant work from data producers, such field definitions can be propagated from upstream tables.</p>
</section>
<section id="is-the-data-stored-with-partitioning-or-clustering-keys-1" class="level3">
<h3 class="anchored" data-anchor-id="is-the-data-stored-with-partitioning-or-clustering-keys-1">5. Is the data stored with partitioning or clustering keys?</h3>
<p>Analysts cannot write efficient code if they don’t know where the efficiency gains lie. Table-level documentation should clearly highlight the use of clustering or partitioning keys so analysts can use the most impactful variables in filters and joins. Here, consistency of documentation is paramount; analysts may not always be incented to care about query efficiency, so if this information is hard to find or rarely available, they can be easily dissuaded from looking.</p>
<p>Beyond a poor user experience, poor data discoverability creates inefficiency and added cost. Even if you don’t have large scale historical data or broad data user communities today, slow queries and tedious work still detract from data team productivity while introducing context-switching and chaos. By focusing on improving data discoverability, you can streamline workflows and enhance the overall efficiency of your data operations.</p>


</section>
</section>

 ]]></description>
  <category>data</category>
  <category>workflow</category>
  <category>elt</category>
  <category>crosspost</category>
  <guid>https://emilyriederer.com/post/data-discovery-ad-hoc/</guid>
  <pubDate>Thu, 18 Jul 2024 05:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/data-discovery-ad-hoc/featured.png" medium="image" type="image/png" height="76" width="144"/>
</item>
<item>
  <title>Base Python Rgonomic Patterns</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/py-rgo-base/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://emilyriederer.com/post/py-rgo-base/featured.jpg" class="img-fluid figure-img"></p>
<figcaption>Photo credit to <a href="https://unsplash.com/@davidclode">David Clode</a> on Unsplash</figcaption>
</figure>
</div>
<p>In the past few weeks, I’ve been writing about a <a href="../..\post/py-rgo">stack of tools</a> and <a href="../..\post/py-rgo-polars/">specific packages like <code>polars</code></a> that may help R users feel “at home” when working in python due to similar ergonomics. However, one common snag in switching languages is ramping up on common “recipes” for higher-level workflows (e.g.&nbsp;how to build a <code>sklearn</code> modeling pipeline) but missing a language’s fundamentals that make writing glue code feel smooth (and dare I say pleasant?). It’s a maddening feeling to get code for a <em>complex</em> task to finish only to have the result wrapped in an object that you can’t suss out how to save or manipulate.</p>
<p>This post goes back to the basics. We’ll briefly reflect on a few aspects of usability that have led to the success of many workflow packages in R. Then, I’ll demonstrate a grab bag of coding patterns in python that make it feel more elegant to connect bits of code into a coherent workflow.</p>
<p>We’ll look at the kind of functionality that you didn’t know to miss until it was gone, you may not be quite sure what to search to figure out how to get it back, <em>and</em> you wonder if it’s even reasonable to hope there’s an analog<sup>1</sup>. This won’t be anything groundbreaking – just some nuts and bolts. Specifically: helper functions for date and time manipulation, advanced string interpolation, list comprehensions for more functional programming, and object serialization.</p>
<section id="what-other-r-ergonomics-do-we-enjoy" class="level2">
<h2 class="anchored" data-anchor-id="what-other-r-ergonomics-do-we-enjoy">What other R ergonomics do we enjoy?</h2>
<p>R’s passionate user and developer community has invested a lot in building tools that smooth over rough edges and provide slick, concise APIs to rote tasks. Specifically, a number of packages are devoted to:</p>
<ul>
<li><strong>Utility functions</strong>: Things that make it easier to “automate the boring stuff” like <code>fs</code> for navigating file systems or <code>lubridate</code> for more semantic date wrangling</li>
<li><strong>Formatting functions</strong>: Things that help us make things look nice for users like <code>cli</code> and <code>glue</code> to improve human readability of terminal output and string interpolation</li>
<li><strong>Efficiency functions</strong>: Things that help us write efficient workflows like <code>purrr</code> which provides a concise, typesafe interface for iteration</li>
</ul>
<p>All of these capabilities are things we <em>could</em> somewhat trivially write ourselves, but we don’t <em>want</em> to and we don’t <em>need</em> to. Fortunately, we don’t need to in python either.</p>
</section>
<section id="wrangling-things-date-manipulation" class="level2">
<h2 class="anchored" data-anchor-id="wrangling-things-date-manipulation">Wrangling Things (Date Manipulation)</h2>
<p>I don’t know a data person who loves dates. In the R world, many enjoy <code>lubridate</code>’s wide range of helper functions for cleaning, formatting, and computing on dates.</p>
<p>Python’s <code>datetime</code> module is similarly effective. We can easily create and manage dates in <code>date</code> or <code>datetime</code> classes which make them easy to work with.</p>
<div id="ea34065a" class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> datetime</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> datetime <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> date</span>
<span id="cb1-3">today <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> date.today()</span>
<span id="cb1-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(today)</span>
<span id="cb1-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(today)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>2024-01-20</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="1">
<pre><code>datetime.date</code></pre>
</div>
</div>
<p>Two of the most important functions are <code>strftime()</code> and <code>strptime()</code>.</p>
<p><code>strftime()</code> <em>formats</em> dates into strings. It accepts both a date and the desired string format. Below, we demonstrate by committing the cardinal sin of writing a date in non-ISO8601.</p>
<div id="ca22a3a1" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">today_str <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> datetime.datetime.strftime(today, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'%m/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%d</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">/%Y'</span>)</span>
<span id="cb4-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(today_str)</span>
<span id="cb4-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(today_str)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>01/20/2024</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="2">
<pre><code>str</code></pre>
</div>
</div>
<p><code>strptime()</code> does the opposite and turns a string encoding a date into an actual date. It does not guess the format, so we tell it how the string is laid out using the same format codes as <code>strftime()</code>.</p>
<div id="2fe33487" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">someday_dtm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> datetime.datetime.strptime(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'2023-01-01'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'%Y-%m-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%d</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb7-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(someday_dtm)</span>
<span id="cb7-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(someday_dtm)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>2023-01-01 00:00:00</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="3">
<pre><code>datetime.datetime</code></pre>
</div>
</div>
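<p>When a string is already in ISO 8601 form, the standard library can parse it without a format string via <code>fromisoformat()</code>. The sketch below is only illustrative (the variable names are made up for this example) and shows that other layouts still need explicit format codes.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import datetime

# ISO 8601 strings can be parsed without spelling out a format
iso_date = datetime.date.fromisoformat('2023-01-01')
iso_dtm = datetime.datetime.fromisoformat('2023-01-01T08:30:00')

# Any other layout still needs explicit format codes
us_date = datetime.datetime.strptime('01/20/2024', '%m/%d/%Y').date()

print(iso_date)  # 2023-01-01
print(iso_dtm)   # 2023-01-01 08:30:00
print(us_date)   # 2024-01-20</code></pre></div>
</div>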
<p>Date math is also relatively easy with <code>datetime</code>. For example, you can see we calculate the date difference simply by… taking the difference! From the resulting delta object, we can access the <code>days</code> attribute.</p>
<div id="324e084a" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1">n_days_diff <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ( today <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> someday_dtm.date() )</span>
<span id="cb10-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(n_days_diff)</span>
<span id="cb10-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(n_days_diff)</span>
<span id="cb10-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(n_days_diff.days)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>384 days, 0:00:00</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="4">
<pre><code>int</code></pre>
</div>
</div>
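<p>The same delta type works in the other direction: a <code>timedelta</code> can be added to or subtracted from a date, loosely similar to adding <code>days()</code> or <code>weeks()</code> in <code>lubridate</code>. A minimal sketch, using a fixed start date (chosen for illustration) so the output is stable:</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import datetime

# Fixed date chosen only so the printed results are reproducible
start = datetime.date(2024, 1, 20)

# timedelta represents a duration that can be added to or subtracted from a date
one_week_later = start + datetime.timedelta(weeks=1)
thirty_days_ago = start - datetime.timedelta(days=30)

print(one_week_later)   # 2024-01-27
print(thirty_days_ago)  # 2023-12-21</code></pre></div>
</div>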
</section>
<section id="formatting-things-f-strings" class="level2">
<h2 class="anchored" data-anchor-id="formatting-things-f-strings">Formatting Things (f-strings)</h2>
<p>R’s <code>glue</code> is beloved for its ability to easily combine variables and texts into complex strings without a lot of ugly, nested <code>paste()</code> functions.</p>
<p>python has a number of ways of doing this, but the most readable is the newest: f-strings. Simply put an <code>f</code> before the string and put any variable names to be interpolated in <code>{</code>curly braces<code>}</code>.</p>
<div id="bd04d21e" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Emily"</span></span>
<span id="cb13-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"This blog post is written by </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>name<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>This blog post is written by Emily</code></pre>
</div>
</div>
<p>f-strings also support formatting with formats specified after a colon. Below, we format a long float to round to 2 digits.</p>
<div id="95ac64aa" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1">proportion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.123456789</span></span>
<span id="cb15-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The proportion is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>proportion<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>The proportion is 0.12</code></pre>
</div>
</div>
<p>Any python expression – not just a single variable – can go in curly braces. So, we can instead format that proportion as a percent.</p>
<div id="7aa11b90" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1">proportion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.123456789</span></span>
<span id="cb17-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The proportion is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>proportion<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">%"</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>The proportion is 12.3%</code></pre>
</div>
</div>
<p>Despite the slickness of f-strings, sometimes other string interpolation approaches can be useful. For example, if all the variables I want to interpolate are in a dictionary (as often will happen, for example, with REST API responses), the string <code>format()</code> method is a nice alternative. It allows us to pass in the dictionary, “unpacking” the argument with <code>**</code>.<sup>2</sup></p>
<div id="a11f7294" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb19-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dog_name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Squeak'</span>,</span>
<span id="cb19-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dog_type'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Chihuahua'</span></span>
<span id="cb19-4">}</span>
<span id="cb19-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{dog_name}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> is a </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{dog_type}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>result))</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Squeak is a Chihuahua</code></pre>
</div>
</div>
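<p>One practical difference from f-strings is that a <code>format()</code> template is an ordinary string, so it can be defined once and filled in later for each record. The sketch below assumes a hypothetical list of dictionaries shaped like a parsed JSON response; the second record is invented purely for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"># Hypothetical records, shaped like a parsed JSON response
results = [
    {'dog_name': 'Squeak', 'dog_type': 'Chihuahua'},
    {'dog_name': 'Rex', 'dog_type': 'Beagle'},
]

# The template is a plain string, so it can be reused for every record
template = "{dog_name} is a {dog_type}"
sentences = [template.format(**record) for record in results]

for sentence in sentences:
    print(sentence)</code></pre></div>
</div>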
<section id="application-generating-file-names" class="level3">
<h3 class="anchored" data-anchor-id="application-generating-file-names">Application: Generating File Names</h3>
<p>Combining what we’ve discussed about <code>datetime</code> and f-strings, here’s a pattern I use frequently. If I am logging results from a run of some script, I might save the results in a file suffixed with the run timestamp. We can generate this easily.</p>
<div id="5c40ec95" class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1">dt_stub <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> datetime.datetime.now().strftime(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'%Y%m</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%d</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">_%H%M%S'</span>)</span>
<span id="cb21-2">file_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"output-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>dt_stub<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.csv"</span></span>
<span id="cb21-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(file_name)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>output-20240120_071517.csv</code></pre>
</div>
</div>
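<p>The two steps can also be collapsed, since date format codes are accepted after the colon inside an f-string. This is just a sketch of that variation; it uses a fixed, made-up timestamp instead of <code>now()</code> so the result is predictable.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import datetime

# Fixed timestamp for reproducibility; use datetime.datetime.now() in a real run
run_time = datetime.datetime(2024, 1, 20, 7, 15, 17)

# Format codes after the colon are applied to the datetime object
log_file_name = f"output-{run_time:%Y%m%d_%H%M%S}.csv"
print(log_file_name)  # output-20240120_071517.csv</code></pre></div>
</div>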
</section>
</section>
<section id="repeating-things-iteration-functional-programming" class="level2">
<h2 class="anchored" data-anchor-id="repeating-things-iteration-functional-programming">Repeating Things (Iteration / Functional Programming)</h2>
<p>Thanks in part to a modern-day fiction that <code>for</code> loops in R are inefficient, R users have gravitated towards concise mapping functions for iteration. These can include the <code>*apply()</code> family<sup>3</sup>, <code>purrr</code>’s <code>map_*()</code> functions, or the parallelized version of either.</p>
<p>Python too has a nice pattern for arbitrary iteration in list comprehensions. For any iterable, we can use a list comprehension to make a list of outputs by processing a list of inputs, with optional conditional and default expressions.</p>
<p>Here are some trivial examples:</p>
<div id="2d5c5705" class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1">l <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span>
<span id="cb23-2">[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> l]</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="10">
<pre><code>[2, 3, 4]</code></pre>
</div>
</div>
<div id="c606ec83" class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1">[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> l <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="11">
<pre><code>[2, 4]</code></pre>
</div>
</div>
<div id="d9ee13a4" class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1">[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> l]</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="12">
<pre><code>[2, 2, 4]</code></pre>
</div>
</div>
<p>There are also closer analogs to <code>purrr</code> like python’s <code>map()</code> function. <code>map()</code> takes a function and an iterable object and applies the function to each element. Like with <code>purrr</code>, functions can be anonymous (as defined in python with lambda functions) or named. List comprehensions are popular for their concise syntax, but there are many different thoughts on the matter as expressed in <a href="https://stackoverflow.com/questions/1247486/list-comprehension-vs-map">this StackOverflow post</a>.</p>
<div id="4647a4b9" class="cell" data-execution_count="13">
<div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add_one(i): </span>
<span id="cb29-2">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb29-3"></span>
<span id="cb29-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># these are the same</span></span>
<span id="cb29-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">map</span>(<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> i: i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, l))</span>
<span id="cb29-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">map</span>(add_one, l))</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="13">
<pre><code>[2, 3, 4]</code></pre>
</div>
</div>
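<p>The <code>list()</code> wrapper in those calls matters because <code>map()</code> returns a lazy iterator instead of a list. A small sketch of what that means in practice (variable names are illustrative):</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">l = [1, 2, 3]

# map() returns a lazy iterator; nothing is computed until it is consumed
mapped = map(lambda i: i + 1, l)
print(type(mapped).__name__)  # map

# list() consumes the iterator and collects the results
first_pass = list(mapped)
second_pass = list(mapped)  # the iterator is already exhausted

print(first_pass)   # [2, 3, 4]
print(second_pass)  # []</code></pre></div>
</div>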
<section id="application-simulation" class="level3">
<h3 class="anchored" data-anchor-id="application-simulation">Application: Simulation</h3>
<p>As a (slightly) more realistic(ish) example, let’s consider how list comprehensions might help us conduct a numerical simulation or sensitivity analysis.</p>
<p>Suppose we want to simulate 100 draws from a Bernoulli distribution with different success probabilities and see how close our empirically calculated rate is to the true rate.</p>
<p>We can define the probabilities we want to simulate in a list and use a list comprehension to run the simulations.</p>
<div id="fc8f8b46" class="cell" data-execution_count="14">
<div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb31-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb31-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy.random <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> rnd</span>
<span id="cb31-3"></span>
<span id="cb31-4">probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>]</span>
<span id="cb31-5">coin_flips <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ np.mean(np.random.binomial(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, p, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> probs ]</span>
<span id="cb31-6">coin_flips</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="14">
<pre><code>[0.05, 0.3, 0.48, 0.77, 0.87]</code></pre>
</div>
</div>
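<p>Because the draws are random, the exact rates will differ from run to run. If repeatable results are wanted, one option is numpy’s seeded generator, sketched below; the seed value and the name <code>sim_rates</code> are arbitrary choices for this example. <code>zip()</code> then pairs each true rate with its simulated counterpart.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np

# A seeded generator makes the simulated draws repeatable (the seed is arbitrary)
rng = np.random.default_rng(42)

probs = [0.1, 0.25, 0.5, 0.75, 0.9]
sim_rates = [float(np.mean(rng.binomial(1, p, 100))) for p in probs]

# zip() pairs each true rate with its simulated rate
for p, rate in zip(probs, sim_rates):
    print(f"true rate {p:.2f} | simulated rate {rate:.2f}")</code></pre></div>
</div>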
<p>Alternatively, instead of returning a list of the same length, our resulting list could include whatever we want – like a list of lists! If we wanted to keep the raw simulation results, we could. The following code returns a list of 5 lists, one per probability, each holding the raw simulation results.</p>
<div id="12787135" class="cell" data-execution_count="15">
<div class="sourceCode cell-code" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb33-1">coin_flips <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(np.random.binomial(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, p, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> probs ]</span>
<span id="cb33-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span></span>
<span id="cb33-3"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  coin_flips has </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(coin_flips)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> elements</span></span>
<span id="cb33-4"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Each element is itself a </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(coin_flips[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb33-5"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Each element is of length </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(coin_flips[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb33-6"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  """</span>)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
  coin_flips has 5 elements
  Each element is itself a &lt;class 'list'&gt;
  Each element is of length 100
  </code></pre>
</div>
</div>
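<p>As a hedged sketch of the same idea, the block below seeds the random number generator so the raw flips are reproducible, then uses a second list comprehension to collapse each inner list back to a single summary. The values in <code>probs</code> and the seed are assumptions for illustration; the post defines its own <code>probs</code> earlier.</p>

```python
import numpy as np

# Hypothetical probabilities; the post defines its own `probs` earlier
probs = [0.1, 0.25, 0.5, 0.75, 0.9]

# A seeded generator makes the simulated flips reproducible across runs
rng = np.random.default_rng(seed=2024)

# One inner list of 100 raw flips (0 or 1) per probability
coin_flips = [rng.binomial(1, p, 100).tolist() for p in probs]

# Collapse each inner list back to a single number: the share of heads
share_heads = [sum(flips) / len(flips) for flips in coin_flips]
```

<p>Keeping the raw flips around means the summary can be changed later (say, to a running mean) without re-running the simulation.</p>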
<p>If one wished, they could then put these into a <code>polars</code> dataframe and explode the list-of-lists column (going from a 5-row dataset to a 500-row dataset) to conduct whatever sort of analysis they want with all the replicates.</p>
<div id="3d34880e" class="cell" data-execution_count="16">
<div class="sourceCode cell-code" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb35-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pl</span>
<span id="cb35-2"></span>
<span id="cb35-3">df_flips <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.DataFrame({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'prob'</span>: probs, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'flip'</span>: coin_flips})</span>
<span id="cb35-4">df_flips.explode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'flip'</span>).glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 500
Columns: 2
$ prob &lt;f64&gt; 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1
$ flip &lt;i32&gt; 0, 0, 0, 0, 1, 0, 1, 1, 0, 0
</code></pre>
</div>
</div>
<p>We’ll return to list comprehensions in the next section.</p>
</section>
</section>
<section id="faking-things-data-generation" class="level2">
<h2 class="anchored" data-anchor-id="faking-things-data-generation">Faking Things (Data Generation)</h2>
<p>Creating simple miniature datasets is often useful in analysis. When working with a new package, it’s an important part of learning, developing, debugging, and eventually unit testing. We can easily run our code on a simplified data object where the desired outcome is easy to determine to sanity-check our work, or we can use fake data to confirm our understanding of how a program will handle edge cases (like the diversity of ways different programs <a href="../..\post/nulls-polyglot/">handle null values</a>). Simple datasets can also be used as spines and scaffolds for more complex data wrangling tasks (e.g.&nbsp;joining event data onto a date spine).</p>
<p>In R, <code>data.frame()</code> and <code>expand.grid()</code> are go-to functions, coupled with vector generators like <code>rep()</code> and <code>seq()</code>. Python has many similar options.</p>
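<p>To make the “spine” idea concrete, here is a minimal sketch using only the standard library: a list comprehension builds one entry per day, and sparse event counts are attached to it. The dates, the <code>events</code> dictionary, and the choice to fill missing days with zero are all assumptions made for the example.</p>

```python
from datetime import date, timedelta

# Date spine: one entry per day, whether or not anything happened
start = date(2024, 1, 1)
spine = [start + timedelta(days=i) for i in range(7)]

# Hypothetical sparse event data keyed by date
events = {date(2024, 1, 2): 3, date(2024, 1, 5): 1}

# "Join" events onto the spine, filling days without events with 0
daily = [(day, events.get(day, 0)) for day in spine]
```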
<section id="fake-datasets" class="level3">
<h3 class="anchored" data-anchor-id="fake-datasets">Fake Datasets</h3>
<p>For the simplest of datasets, we can manually write a few entries as with <code>data.frame()</code> in R. Here, we define series in a named dictionary where each dictionary key turns into a column name.</p>
<div id="dfa6b79a" class="cell" data-execution_count="17">
<div class="sourceCode cell-code" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb37-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pl</span>
<span id="cb37-2"></span>
<span id="cb37-3">pl.DataFrame({</span>
<span id="cb37-4">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>: [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>],</span>
<span id="cb37-5">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'y'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'z'</span>]</span>
<span id="cb37-6">})</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="17">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (3, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
</tr>
<tr class="odd">
<th>i64</th>
<th>str</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>"x"</td>
</tr>
<tr class="even">
<td>2</td>
<td>"y"</td>
</tr>
<tr class="odd">
<td>3</td>
<td>"z"</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>If we need longer datasets, we can use helper functions in packages like <code>numpy</code> to generate the series. Methods like <code>arange</code> and <code>linspace</code> work similarly to R’s <code>seq()</code>.</p>
<div id="e27be38e" class="cell" data-execution_count="18">
<div class="sourceCode cell-code" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb38-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pl</span>
<span id="cb38-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb38-3"></span>
<span id="cb38-4">pl.DataFrame({</span>
<span id="cb38-5">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>: np.arange(stop <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb38-6">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>: np.linspace(start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>, stop <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span>, num <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb38-7">})</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="18">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (3, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
</tr>
<tr class="odd">
<th>i32</th>
<th>f64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>0</td>
<td>9.0</td>
</tr>
<tr class="even">
<td>1</td>
<td>16.5</td>
</tr>
<tr class="odd">
<td>2</td>
<td>24.0</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
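<p>One difference from <code>seq()</code> is worth a quick sketch: <code>np.arange()</code> takes a step and excludes the stop value, while <code>np.linspace()</code> takes a count and includes both endpoints by default. The numbers below are arbitrary.</p>

```python
import numpy as np

# arange takes a step and stops *before* the stop value,
# unlike seq(0, 10, by = 2) in R, which includes 10
by_step = np.arange(0, 10, 2).tolist()

# linspace takes a count and includes both endpoints by default,
# like seq(0, 10, length.out = 6) in R
by_count = np.linspace(0, 10, num=6).tolist()
```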
<p>If we need groups in our sample data, we can use <code>np.repeat()</code>, which works like R’s <code>rep()</code> with the <code>each</code> argument (e.g.&nbsp;<code>rep(x, each = 2)</code>).</p>
<div id="a86edf0b" class="cell" data-execution_count="19">
<div class="sourceCode cell-code" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb39-1">pl.DataFrame({</span>
<span id="cb39-2">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>: np.repeat(np.arange(stop <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb39-3">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>: np.linspace(start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, stop <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">27</span>, num <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>)</span>
<span id="cb39-4">})</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="19">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (6, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
</tr>
<tr class="odd">
<th>i32</th>
<th>f64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>0</td>
<td>3.0</td>
</tr>
<tr class="even">
<td>0</td>
<td>7.8</td>
</tr>
<tr class="odd">
<td>1</td>
<td>12.6</td>
</tr>
<tr class="even">
<td>1</td>
<td>17.4</td>
</tr>
<tr class="odd">
<td>2</td>
<td>22.2</td>
</tr>
<tr class="even">
<td>2</td>
<td>27.0</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
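<p>For R users, the other flavors of <code>rep()</code> have close cousins too. The sketch below contrasts repeating each element with repeating the whole sequence; the variable names are only illustrative.</p>

```python
import numpy as np

groups = np.arange(3)

# Repeat each element in place, like rep(x, each = 2) in R
each = np.repeat(groups, 2).tolist()

# Repeat the whole sequence, like rep(x, times = 2) in R
times = np.tile(groups, 2).tolist()

# Per-element counts, like rep(x, times = c(1, 2, 3)) in R
uneven = np.repeat(groups, [1, 2, 3]).tolist()
```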
<p>Alternatively, for more control and succinct typing, we can create a nested dataset in <code>polars</code> and explode it out.</p>
<div id="6ecf467f" class="cell" data-execution_count="20">
<div class="sourceCode cell-code" id="cb40" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1">(</span>
<span id="cb40-2">  pl.DataFrame({</span>
<span id="cb40-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>: [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>],</span>
<span id="cb40-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"a b c"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"d e f"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"g h i"</span>]</span>
<span id="cb40-5">  })</span>
<span id="cb40-6">  .with_columns(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>.split(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>))</span>
<span id="cb40-7">  .explode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>)</span>
<span id="cb40-8">)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="20">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (9, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
</tr>
<tr class="odd">
<th>i64</th>
<th>str</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>"a"</td>
</tr>
<tr class="even">
<td>1</td>
<td>"b"</td>
</tr>
<tr class="odd">
<td>1</td>
<td>"c"</td>
</tr>
<tr class="even">
<td>2</td>
<td>"d"</td>
</tr>
<tr class="odd">
<td>2</td>
<td>"e"</td>
</tr>
<tr class="even">
<td>2</td>
<td>"f"</td>
</tr>
<tr class="odd">
<td>3</td>
<td>"g"</td>
</tr>
<tr class="even">
<td>3</td>
<td>"h"</td>
</tr>
<tr class="odd">
<td>3</td>
<td>"i"</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>Similarly, we could use what we’ve learned about <code>polars</code> list columns <em>and</em> list comprehensions.</p>
<div id="10f32c48" class="cell" data-execution_count="21">
<div class="sourceCode cell-code" id="cb41" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb41-1">a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span>
<span id="cb41-2">b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ [q<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> q <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> a]</span>
<span id="cb41-3">pl.DataFrame({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>:a,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>:b}).explode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="21">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (9, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
</tr>
<tr class="odd">
<th>i64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>1</td>
</tr>
<tr class="even">
<td>1</td>
<td>2</td>
</tr>
<tr class="odd">
<td>1</td>
<td>3</td>
</tr>
<tr class="even">
<td>2</td>
<td>2</td>
</tr>
<tr class="odd">
<td>2</td>
<td>4</td>
</tr>
<tr class="even">
<td>2</td>
<td>6</td>
</tr>
<tr class="odd">
<td>3</td>
<td>3</td>
</tr>
<tr class="even">
<td>3</td>
<td>6</td>
</tr>
<tr class="odd">
<td>3</td>
<td>9</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>In fact, multidimensional list comprehensions can be used to mimic R’s <code>expand.grid()</code> function.</p>
<div id="f6c2c31c" class="cell" data-execution_count="22">
<div class="sourceCode cell-code" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb42-1">pl.DataFrame(</span>
<span id="cb42-2">  [(x, y) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> y <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)],</span>
<span id="cb42-3">  schema <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'y'</span>]</span>
<span id="cb42-4">  )</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="22">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (9, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">x</th>
<th data-quarto-table-cell-role="th">y</th>
</tr>
<tr class="odd">
<th>i64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>0</td>
<td>0</td>
</tr>
<tr class="even">
<td>0</td>
<td>1</td>
</tr>
<tr class="odd">
<td>0</td>
<td>2</td>
</tr>
<tr class="even">
<td>1</td>
<td>0</td>
</tr>
<tr class="odd">
<td>1</td>
<td>1</td>
</tr>
<tr class="even">
<td>1</td>
<td>2</td>
</tr>
<tr class="odd">
<td>2</td>
<td>0</td>
</tr>
<tr class="even">
<td>2</td>
<td>1</td>
</tr>
<tr class="odd">
<td>2</td>
<td>2</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
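<p>The standard library offers another route to the same grid. This sketch uses <code>itertools.product()</code> with made-up factor levels; note that it varies the <em>last</em> input fastest, whereas <code>expand.grid()</code> varies the first, so the row order differs from R’s.</p>

```python
from itertools import product

# Hypothetical factor levels for the grid
xs = range(3)
ys = ['a', 'b']

# Every (x, y) combination; the last input varies fastest
grid = list(product(xs, ys))

# Reshape the row tuples into a dict of columns, ready for a dataframe constructor
columns = {
    'x': [x for x, _ in grid],
    'y': [y for _, y in grid],
}
```

<p>The resulting dictionary has the same shape as the ones passed to <code>pl.DataFrame()</code> in the earlier examples.</p>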
</section>
<section id="built-in-data" class="level3">
<h3 class="anchored" data-anchor-id="built-in-data">Built-In Data</h3>
<p>R has a number of canonical datasets like <code>iris</code> built in to the core language. These are easy to grab quickly for experimentation<sup>4</sup>. While base python doesn’t include such capabilities, many of the exact same or similar datasets can be found in <code>seaborn</code>.</p>
<p><code>seaborn.get_dataset_names()</code> provides the list of available options. Below, we load the Palmer Penguins data and (optionally) convert it from <code>pandas</code> to <code>polars</code>.</p>
<div id="62759986" class="cell" data-execution_count="23">
<div class="sourceCode cell-code" id="cb43" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb43-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> seaborn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sns</span>
<span id="cb43-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pl</span>
<span id="cb43-3"></span>
<span id="cb43-4">df_pd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sns.load_dataset(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'penguins'</span>)</span>
<span id="cb43-5">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.from_pandas(df_pd)</span>
<span id="cb43-6">df.glimpse()</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 344
Columns: 7
$ species           &lt;str&gt; 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie'
$ island            &lt;str&gt; 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen'
$ bill_length_mm    &lt;f64&gt; 39.1, 39.5, 40.3, None, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0
$ bill_depth_mm     &lt;f64&gt; 18.7, 17.4, 18.0, None, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2
$ flipper_length_mm &lt;f64&gt; 181.0, 186.0, 195.0, None, 193.0, 190.0, 181.0, 195.0, 193.0, 190.0
$ body_mass_g       &lt;f64&gt; 3750.0, 3800.0, 3250.0, None, 3450.0, 3650.0, 3625.0, 4675.0, 3475.0, 4250.0
$ sex               &lt;str&gt; 'Male', 'Female', 'Female', None, 'Female', 'Male', 'Female', 'Male', None, None
</code></pre>
</div>
</div>
</section>
</section>
<section id="saving-things-object-serialization" class="level2">
<h2 class="anchored" data-anchor-id="saving-things-object-serialization">Saving Things (Object Serialization)</h2>
<p>Sometimes, it can be useful to save <em>objects</em> as they existed in RAM in an active programming environment. R users may have experienced this if they’ve used <code>.rds</code>, <code>.rda</code>, or <code>.Rdata</code> files to save individual variables or their entire environment. These objects can often be faster to reload than plaintext and can better preserve information that may be lost in other formats (e.g.&nbsp;storing a dataframe in a way that preserves its datatypes versus writing to a CSV file<sup>5</sup>, or storing a complex object that can’t be easily reduced to plaintext, like a model with training data, hyperparameters, learned tree splits or weights or whatnot for future predictions). This is called object serialization<sup>6</sup>.</p>
<p>Python has comparable capabilities in the <a href="https://docs.python.org/3/library/pickle.html"><code>pickle</code> module</a>. There aren’t really style points here, so I’ve not much to add beyond “this exists” and “read the documentation”. But, at a high level, it looks something like this:</p>
<div id="c0799a02" class="cell" data-execution_count="24">
<div class="sourceCode cell-code" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb45-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pickle</span>
<span id="cb45-2"></span>
<span id="cb45-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># to write a pickle</span></span>
<span id="cb45-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'my-obj.pickle'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'wb'</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> handle:</span>
<span id="cb45-5">    pickle.dump(my_object, handle, protocol <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pickle.HIGHEST_PROTOCOL)</span>
<span id="cb45-6"></span>
<span id="cb45-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># to read a pickle (the context manager closes the file afterwards)</span></span>
<span id="cb45-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'my-obj.pickle'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'rb'</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> handle:</span>
<span id="cb45-9">    my_object <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pickle.load(handle)</span></code></pre></div>
</div>
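<p>For a version that can be run end to end, here is a small round-trip sketch. The object, the temporary directory, and the JSON comparison are all assumptions added for illustration; the point is only that pickle hands back the same Python types it was given, while a plaintext format may not.</p>

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical object mixing types that plaintext formats tend to flatten
my_object = {'params': (0.1, 0.5), 'labels': {'a', 'b'}, 'n': 100}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'my-obj.pickle'

    # Write the object to disk in binary mode
    with open(path, 'wb') as handle:
        pickle.dump(my_object, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back; the context manager closes the file for us
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

# The pickle round trip preserves the tuple and the set exactly
pickle_ok = restored == my_object

# JSON has no tuple type, so the same tuple comes back as a list
json_params = json.loads(json.dumps({'params': (0.1, 0.5)}))['params']
```

<p>One caution from the <code>pickle</code> documentation: only unpickle files you trust, since loading a pickle can execute arbitrary code.</p>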


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I defined this odd scope to help limit the infinite number of workflow topics that could be included like “how to write a function” or “how to source code from another script”↩︎</p></li>
<li id="fn2"><p>This is called “**kwargs” and works a bit like <code>do.call()</code> in base R. You can read more about it <a href="https://www.digitalocean.com/community/tutorials/how-to-use-args-and-kwargs-in-python-3">here</a>.↩︎</p></li>
<li id="fn3"><p>Speaking of non-ergonomic things in R, the <code>*apply()</code> family is notoriously diverse in its number and order of arguments↩︎</p></li>
<li id="fn4"><p>Particularly if you want to set wildly unrealistic expectations for the efficacy of k-means clustering, but I digress↩︎</p></li>
<li id="fn5"><p>And yes, you can and should use Parquet and then my example falls apart – but that’s not the point!↩︎</p></li>
<li id="fn6"><p>And, if you want to go incredibly deep here, check out <a href="https://blog.djnavarro.net/posts/2021-11-15_serialisation-with-rds/">this awesome post</a> by Danielle Navarro.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>rstats</category>
  <category>python</category>
  <category>tutorial</category>
  <guid>https://emilyriederer.com/post/py-rgo-base/</guid>
  <pubDate>Sat, 20 Jan 2024 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/py-rgo-base/featured.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Crosspost: Why You Need Data Documentation in 2024</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/docs-personas/</link>
  <description><![CDATA[ 





<p>We’ve all worked with poorly documented datasets, and we all know it isn’t pretty. However, it’s surprisingly easy for teams to continue to fall into “documentation debt” and deprioritize this foundational work in favor of flashy new projects. These tradeoff discussions may become even more painful in 2024 as teams are continually asked to do more with less.</p>
<p>Recently, I had the opportunity to articulate some of the underappreciated benefits of data documentation in a <a href="https://www.selectstar.com/blog/why-you-need-data-documentation-in-2024">cross-post with Select Star</a>. This builds on my prior post showing that <a href="../..\post/docs-closer-than-you-think/">documentation can be strategically created throughout the data development process</a>. To make the case for taking those “raw” documentation resources to a polished final form, I return to the jobs-to-be-done framework that I’ve previously employed to talk about <a href="../..\post/team-of-packages/">the value of innersource packages</a>. In this perspective, documentation is like hiring an extra resource (or more!) to your team.</p>
<p>Some of the jobs discussed are:</p>
<ul>
<li>Developer Advocacy and Product Evangelism for users
<ul>
<li>Users think data doesn’t exist if they can’t find it, and they think data is broken if they misinterpret it</li>
<li>Documentation is both a “user interface” to make data usage easy and a bulwark against confusion and frustration</li>
</ul></li>
<li>Product and Project Management for developers
<ul>
<li>Data intent can “drift” over time</li>
<li>As teams evolve and collaborate, this risks initial intent getting lost and polluted (after all, what really is a “customer”?)</li>
<li>Documentation serves as a contract and coach for one or more teams to force clarity and consistency of intent</li>
</ul></li>
<li>Chief of Staff oversight for data leaders
<ul>
<li>Leaders face increasing demands in data governance: navigating changing privacy regulations, fighting decaying data quality, and discerning their next strategic investments</li>
<li>Documentation is their command center to understand what data assets exist and where, to better spot risks and opportunities</li>
</ul></li>
</ul>
<p>If you or your team works on data documentation, I’d love to hear what other “jobs” you have found that data documentation performs in your organization.</p>



 ]]></description>
  <category>data</category>
  <category>workflow</category>
  <category>elt</category>
  <category>crosspost</category>
  <guid>https://emilyriederer.com/post/docs-personas/</guid>
  <pubDate>Mon, 15 Jan 2024 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/docs-personas/featured.PNG" medium="image"/>
</item>
<item>
  <title>polars’ Rgonomic Patterns</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/py-rgo-polars/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://emilyriederer.com/post/py-rgo-polars/featured.jpg" class="img-fluid figure-img"></p>
<figcaption>Photo credit to <a href="https://unsplash.com/@hansjurgen007">Hans-Jurgen Mager</a> on Unsplash</figcaption>
</figure>
</div>
<p>A few weeks ago, I shared some <a href="../..\post/py-rgo/">recommended modern python tools and libraries</a> that I believe have the most similar ergonomics for R (specifically <code>tidyverse</code>) converts. This post expands on that one with a focus on the <code>polars</code> library.</p>
<p>At the surface level, all data wrangling libraries have roughly the same functionality. Operations like selecting existing columns and making new ones, subsetting and ordering rows, and summarizing results are table stakes.</p>
<p>However, no one falls in love with a specific library because it has the best <code>select()</code> or <code>filter()</code> function the world has ever seen. It’s the ability to easily do more complex transformations that differentiates a package expert from a novice, and the learning curve for everything that happens <em>after</em> the “Getting Started” guide ends is what can leave experts at one tool feeling so disempowered when working with another.</p>
<p>This deeper sense of intuition and fluency – when your technical brain knows intuitively how to translate into code what your analytical brain wants to see in the data – is what I aim to capture in the term “ergonomics”. In this post, I briefly discuss the surface-level comparison but spend most of the time exploring the deeper similarities in the functionality and workflows enabled by <code>polars</code> and <code>dplyr</code>.</p>
<section id="what-are-dplyrs-ergonomics" class="level2">
<h2 class="anchored" data-anchor-id="what-are-dplyrs-ergonomics">What are <code>dplyr</code>’s ergonomics?</h2>
<p>To claim <code>polars</code> has a similar aesthetic and user experience as <code>dplyr</code>, we first have to consider what the heart of <code>dplyr</code>’s ergonomics actually is. The explicit design philosophy is described in the developers’ writings on <a href="https://design.tidyverse.org/unifying.html">tidy design principles</a>, but I’ll blend those official intended principles with my personal definitions based on the lived user experience.</p>
<ul>
<li>Consistent:
<ul>
<li>Function names are highly consistent (e.g.&nbsp;snake case verbs) with dependable inputs and outputs (mostly dataframe-in dataframe-out) to increase intuition, reduce mistakes, and eliminate surprises</li>
<li>Metaphors extend throughout the codebase. For example <code>group_by()</code> + <code>summarize()</code> or <code>group_by()</code> + <code>mutate()</code> do what one might expect (aggregation versus a window function) instead of requiring users to remember arbitrary command-specific syntax</li>
<li>Always returns a new dataframe versus modifying in-place so code is more idempotent<sup>1</sup> and less error prone</li>
</ul></li>
<li>Composable:
<ul>
<li>Functions exist at a “sweet spot” level of abstraction. We have the right primitive building blocks so that users have full control to do anything they want with a dataframe but almost never have to write brute-force glue code. These building blocks can be layered however one chooses to conduct an analysis</li>
<li>Consistency of return types leads to composability since dataframe-in dataframe-out allows for chaining</li>
</ul></li>
<li>Human-Centered:
<ul>
<li>Packages hit a comfortable level of abstraction somewhere between fully procedural (e.g.&nbsp;manually looping over array indexes without a dataframe abstraction) and fully declarative (e.g.&nbsp;SQL-style languages where you “request” the output but aspects like the order of operations may become unclear). Writing code is essentially articulating the steps of an analysis</li>
<li>This focus on code as recipe writing leads to the creation of useful optional functions and helpers (like my favorite – column selectors)</li>
<li>Users rarely need to break the fourth wall of this abstraction-layer (versus thinking about things like indexes in <code>pandas</code>)</li>
</ul></li>
</ul>
<p>TLDR? We’ll say <code>dplyr</code>’s ergonomics allow users to express complex transformations precisely, concisely, and expressively.</p>
<p>So, with that, we will import <code>polars</code> and get started!</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pl</span></code></pre></div>
</div>
<p>This document was made with <code>polars</code> version <code>0.20.4</code>.</p>
</section>
<section id="basic-functionality" class="level2">
<h2 class="anchored" data-anchor-id="basic-functionality">Basic Functionality</h2>
<p>The similarities between <code>polars</code> and <code>dplyr</code>’s top-level API are already well-explored in many posts, including those by <a href="https://blog.tidy-intelligence.com/posts/dplyr-vs-polars/">Tidy Intelligence</a> and <a href="https://robertmitchellv.com/blog/2022-07-r-python-side-by-side/r-python-side-by-side.html">Robert Mitchell</a>.</p>
<p>We will only do the briefest of recaps of the core data wrangling functions of each and how they can be composed in order to make the latter half of the piece make sense. We will meet these functions again in-context when discussing <code>dplyr</code> and <code>polars</code>’ more advanced workflows.</p>
<section id="main-verbs" class="level3">
<h3 class="anchored" data-anchor-id="main-verbs">Main Verbs</h3>
<p><code>dplyr</code> and <code>polars</code> offer the same foundational functionality for manipulating dataframes. Their APIs for these operations are substantially similar.</p>
<p>For a single dataset:</p>
<ul>
<li>Column selection: <code>select()</code> -&gt; <code>select()</code> + <code>drop()</code></li>
<li>Creating or altering columns: <code>mutate()</code> -&gt; <code>with_columns()</code></li>
<li>Subsetting rows: <code>filter()</code> -&gt; <code>filter()</code></li>
<li>Ordering rows: <code>arrange()</code> -&gt; <code>sort()</code></li>
<li>Computing group-level summary metrics: <code>group_by()</code> + <code>summarize()</code> -&gt; <code>group_by()</code> + <code>agg()</code></li>
</ul>
<p>For multiple datasets:</p>
<ul>
<li>Merging on a shared key: <code>*_join()</code> -&gt; <code>join(how = '*')</code></li>
<li>Stacking datasets of the same structure: <code>bind_rows()</code> -&gt; <code>concat()</code></li>
<li>Transforming rows and columns: <code>pivot_{longer/wider}()</code><sup>2</sup> -&gt; <code>pivot()</code></li>
</ul>
</section>
<section id="main-verb-design" class="level3">
<h3 class="anchored" data-anchor-id="main-verb-design">Main Verb Design</h3>
<p>Beyond the similarity in naming, <code>dplyr</code> and <code>polars</code>’ top-level functions are substantially similar in their deeper design choices which impact the ergonomics of use:</p>
<ul>
<li>Referencing columns: Both make it easy to concisely reference columns in a dataset without the repeated and redundant references to said dataset (as sometimes occurs in base R or python’s <code>pandas</code>). <code>dplyr</code> does this through nonstandard evaluation wherein a dataframe’s columns can be referenced directly within a data transformation function as if they were top-level variables; in <code>polars</code>, column names are wrapped in <code>pl.col()</code></li>
<li>Optional arguments: Both tend to have a wide array of nice-to-have optional arguments. For example the joining capabilities in both libraries offer optional join validation<sup>3</sup> and column renaming by appended suffix</li>
<li>Consistent dataframe-in -&gt; dataframe-out design: <code>dplyr</code> functions take a dataframe as their first argument and return a dataframe. Similarly, <code>polars</code> methods are called on a dataframe and return a dataframe which enables the chaining workflow discussed next</li>
</ul>
</section>
<section id="chaining-piping" class="level3">
<h3 class="anchored" data-anchor-id="chaining-piping">Chaining (Piping)</h3>
<p>These methods are applied to <code>polars</code> dataframes by <em>chaining</em> which should feel very familiar to R <code>dplyr</code> fans.</p>
<p>In <code>dplyr</code> and the broad <code>tidyverse</code>, most functions take a dataframe as their first argument and return a dataframe, enabling the piping of functions. This makes it easy to write more human-readable scripts where functions are written in the order of execution and whitespace can easily be added between lines. The following lines would all be equivalent.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">transformation2</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">transformation1</span>(df))</span>
<span id="cb2-2"></span>
<span id="cb2-3">df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">transformation1</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">transformation2</span>()</span>
<span id="cb2-4"></span>
<span id="cb2-5">df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">transformation1</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">transformation2</span>()</span></code></pre></div>
</div>
<p>Similarly, <code>polars</code>’s main transformation methods offer a consistent dataframe-in dataframe-out design which allows <em>method chaining</em>. Here, we similarly can write commands in order where the <code>.</code> beginning the next method call serves the same purpose as R’s pipe. And for python broadly, to achieve the same affordance for whitespace, we can wrap the entire command in parentheses.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">(</span>
<span id="cb3-2">  df</span>
<span id="cb3-3">  .transformation1()</span>
<span id="cb3-4">  .transformation2()</span>
<span id="cb3-5">)</span></code></pre></div>
</div>
<p>One could even say that <code>polars</code>’ dedication to chaining goes deeper than <code>dplyr</code>’s. In <code>dplyr</code>, while core dataframe-level functions are piped, functions on specific columns are still often written in a nested fashion.<sup>4</sup></p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">z =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">g</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">f</span>(a)))</span></code></pre></div>
</div>
<p>In contrast, most of <code>polars</code>’ column-level transformation methods also make it ergonomic to keep the same literate left-to-right chaining within column-level definitions with the same benefits to readability as for dataframe-level operations.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">df.with_columns(z <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>).f().g())</span></code></pre></div>
</div>
</section>
</section>
<section id="advanced-wrangling" class="level2">
<h2 class="anchored" data-anchor-id="advanced-wrangling">Advanced Wrangling</h2>
<p>Beyond the surface-level similarity, <code>polars</code> supports some of the more complex ergonomics that <code>dplyr</code> users may enjoy. This includes functionality like:</p>
<ul>
<li>expressive and explicit syntax for transformations across multiple rows</li>
<li>concise helpers to identify subsets of columns and apply transformations</li>
<li>consistent syntax for window functions within data transformation operations</li>
<li>the ability to work with nested data structures</li>
</ul>
<p>Below, we will examine some of this functionality with a trusty fake dataframe.<sup>5</sup> As with <code>pandas</code>, you can make a quick dataframe in <code>polars</code> by passing a dictionary to <code>pl.DataFrame()</code>.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pl </span>
<span id="cb6-2"></span>
<span id="cb6-3">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.DataFrame({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>:[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>], </span>
<span id="cb6-4">                   <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>:[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>], </span>
<span id="cb6-5">                   <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'c'</span>:[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]})</span>
<span id="cb6-6">df.head()</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 3)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">c</th>
</tr>
<tr class="odd">
<th>i64</th>
<th>i64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>3</td>
<td>7</td>
</tr>
<tr class="even">
<td>1</td>
<td>4</td>
<td>8</td>
</tr>
<tr class="odd">
<td>2</td>
<td>5</td>
<td>9</td>
</tr>
<tr class="even">
<td>2</td>
<td>6</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<section id="explicit-api-for-row-wise-operations" class="level3">
<h3 class="anchored" data-anchor-id="explicit-api-for-row-wise-operations">Explicit API for row-wise operations</h3>
<p>While row-wise operations are relatively easy to write ad-hoc, it can still be nice semantically to have readable and stylistically consistent code for such transformations.</p>
<p><code>dplyr</code>’s <a href="https://dplyr.tidyverse.org/articles/rowwise.html"><code>rowwise()</code></a> eliminates ambiguity in whether subsequent functions should be applied element-wise or collectively. Similarly, <code>polars</code> has explicit <code>*_horizontal()</code> functions.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">df.with_columns(</span>
<span id="cb7-2">  b_plus_c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.sum_horizontal(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>), pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'c'</span>)) </span>
<span id="cb7-3">)</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 4)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">c</th>
<th data-quarto-table-cell-role="th">b_plus_c</th>
</tr>
<tr class="odd">
<th>i64</th>
<th>i64</th>
<th>i64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>3</td>
<td>7</td>
<td>10</td>
</tr>
<tr class="even">
<td>1</td>
<td>4</td>
<td>8</td>
<td>12</td>
</tr>
<tr class="odd">
<td>2</td>
<td>5</td>
<td>9</td>
<td>14</td>
</tr>
<tr class="even">
<td>2</td>
<td>6</td>
<td>0</td>
<td>6</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
<section id="column-selectors" class="level3">
<h3 class="anchored" data-anchor-id="column-selectors">Column Selectors</h3>
<p><code>dplyr</code>’s <a href="https://dplyr.tidyverse.org/reference/select.html">column selectors</a> dynamically determine a set of columns based on pattern-matching their names (e.g.&nbsp;<code>starts_with()</code>, <code>ends_with()</code>), data types, or other features. I’ve previously <a href="../..\post/column-name-contracts/">written</a> and <a href="../..\talk/col-names-contract/">spoken</a> at length about how transformative this functionality can be when paired with consistent column naming.</p>
<p><code>polars</code> has a similar set of <a href="https://docs.pola.rs/py-polars/html/reference/selectors.html">column selectors</a>. We’ll import them and see a few examples.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars.selectors <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> cs</span></code></pre></div>
</div>
<p>To make things more interesting, we’ll also turn one of our columns into a different data type.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.with_columns(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>).cast(pl.Utf8))</span></code></pre></div>
</div>
<section id="in-select" class="level4">
<h4 class="anchored" data-anchor-id="in-select">In <code>select</code></h4>
<p>We can select columns based on name or data type and use one or more conditions.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1">df.select(cs.starts_with(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> cs.string())</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">a</th>
</tr>
<tr class="odd">
<th>i64</th>
<th>str</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>3</td>
<td>"1"</td>
</tr>
<tr class="even">
<td>4</td>
<td>"1"</td>
</tr>
<tr class="odd">
<td>5</td>
<td>"2"</td>
</tr>
<tr class="even">
<td>6</td>
<td>"2"</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>Negative conditions also work.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">df.select(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>cs.string())</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">c</th>
</tr>
<tr class="odd">
<th>i64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>3</td>
<td>7</td>
</tr>
<tr class="even">
<td>4</td>
<td>8</td>
</tr>
<tr class="odd">
<td>5</td>
<td>9</td>
</tr>
<tr class="even">
<td>6</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
<section id="in-with_columns" class="level4">
<h4 class="anchored" data-anchor-id="in-with_columns">In <code>with_columns</code></h4>
<p>Column selectors can play multiple roles in the transformation context.</p>
<p>The same transformation can be applied to multiple columns. Below, we find all integer variables, call a method to add 1 to each, and use the <code>name.suffix()</code> method to dynamically generate descriptive column names.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1">df.with_columns(</span>
<span id="cb12-2">  cs.integer().add(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).name.suffix(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_plus1"</span>)</span>
<span id="cb12-3">)</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 5)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">c</th>
<th data-quarto-table-cell-role="th">b_plus1</th>
<th data-quarto-table-cell-role="th">c_plus1</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>i64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"1"</td>
<td>3</td>
<td>7</td>
<td>4</td>
<td>8</td>
</tr>
<tr class="even">
<td>"1"</td>
<td>4</td>
<td>8</td>
<td>5</td>
<td>9</td>
</tr>
<tr class="odd">
<td>"2"</td>
<td>5</td>
<td>9</td>
<td>6</td>
<td>10</td>
</tr>
<tr class="even">
<td>"2"</td>
<td>6</td>
<td>0</td>
<td>7</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>We can also use selected variables within transformations, like the rowwise sums that we just saw earlier.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">df.with_columns(</span>
<span id="cb13-2">  row_total <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.sum_horizontal(cs.integer())</span>
<span id="cb13-3">)</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 4)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">c</th>
<th data-quarto-table-cell-role="th">row_total</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"1"</td>
<td>3</td>
<td>7</td>
<td>10</td>
</tr>
<tr class="even">
<td>"1"</td>
<td>4</td>
<td>8</td>
<td>12</td>
</tr>
<tr class="odd">
<td>"2"</td>
<td>5</td>
<td>9</td>
<td>14</td>
</tr>
<tr class="even">
<td>"2"</td>
<td>6</td>
<td>0</td>
<td>6</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
<section id="in-group_by-and-agg" class="level4">
<h4 class="anchored" data-anchor-id="in-group_by-and-agg">In <code>group_by</code> and <code>agg</code></h4>
<p>Column selectors can also be passed as inputs anywhere else that one or more columns are accepted, as with data aggregation.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1">df.group_by(cs.string()).agg(cs.integer().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>())</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (2, 3)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">c</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"1"</td>
<td>7</td>
<td>15</td>
</tr>
<tr class="even">
<td>"2"</td>
<td>11</td>
<td>9</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
</section>
<section id="consistent-api-for-window-functions" class="level3">
<h3 class="anchored" data-anchor-id="consistent-api-for-window-functions">Consistent API for Window Functions</h3>
<p>Window functions are another incredibly important tool in any data wrangling language but seem criminally undertaught in introductory analysis classes. Window functions allow you to apply aggregation <em>logic</em> over subgroups of data while preserving the original <em>grain</em> of the data (e.g.&nbsp;adding a column for the max purchase amount by customer to a table of all customers and orders).</p>
<p><code>dplyr</code> makes window functions trivially easy with the <code>group_by()</code> + <code>mutate()</code> pattern, invoking users’ pre-existing understanding of how to write aggregation logic and how to invoke transformations that preserve a table’s grain.</p>
<p><code>polars</code> takes a slightly different but elegant approach. Similarly, it reuses the core <code>with_columns()</code> method for window functions. However, it uses a more SQL-reminiscent specification of the “window” in the column definition versus a separate grouping statement. This has the added advantage of allowing one to use multiple window functions with different windows in the same <code>with_columns()</code> call if you should so choose.</p>
<p>A simple window function transformation can be done by calling <code>with_columns()</code>, chaining an aggregation method onto a column, and following with the <code>over()</code> method to define the window of interest.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1">df.with_columns(</span>
<span id="cb15-2">  min_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>().over(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>)</span>
<span id="cb15-3">)</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 4)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">c</th>
<th data-quarto-table-cell-role="th">min_b</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"1"</td>
<td>3</td>
<td>7</td>
<td>3</td>
</tr>
<tr class="even">
<td>"1"</td>
<td>4</td>
<td>8</td>
<td>3</td>
</tr>
<tr class="odd">
<td>"2"</td>
<td>5</td>
<td>9</td>
<td>5</td>
</tr>
<tr class="even">
<td>"2"</td>
<td>6</td>
<td>0</td>
<td>5</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>The chaining of an aggregation and <code>over()</code> can follow any other arbitrarily complex logic. Here, it follows a basic “case when”-type statement that creates a 0/1 indicator from the parity of column <code>b</code>, which is then summed within each group of <code>a</code>.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1">df.with_columns(</span>
<span id="cb16-2">  n_b_odd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.when( (pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb16-3">              .then(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb16-4">              .otherwise(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb16-5">              .<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>().over(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>)</span>
<span id="cb16-6">)</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 4)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">c</th>
<th data-quarto-table-cell-role="th">n_b_odd</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>i32</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"1"</td>
<td>3</td>
<td>7</td>
<td>1</td>
</tr>
<tr class="even">
<td>"1"</td>
<td>4</td>
<td>8</td>
<td>1</td>
</tr>
<tr class="odd">
<td>"2"</td>
<td>5</td>
<td>9</td>
<td>1</td>
</tr>
<tr class="even">
<td>"2"</td>
<td>6</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
<section id="list-columns-and-nested-frames" class="level3">
<h3 class="anchored" data-anchor-id="list-columns-and-nested-frames">List Columns and Nested Frames</h3>
<p>While the R <code>tidyverse</code>’s raison d’être originally centered on the design of heavily normalized <a href="https://vita.had.co.nz/papers/tidy-data.pdf">tidy data</a>, modern data and analysis sometimes benefit from more complex and hierarchical data structures. Sometimes data comes to us in nested forms, like from an API<sup>6</sup>, and other times nesting data can help us perform analysis more effectively.<sup>7</sup> Recognizing these use cases, <code>tidyr</code> provides many capabilities for the creation and manipulation of <a href="https://tidyr.tidyverse.org/articles/nest.html">nested data</a> in which a single cell contains values from multiple columns or sometimes even a whole miniature dataframe.</p>
<p><code>polars</code> makes these operations similarly easy with its own version of structs (list columns) and lists of structs (nested dataframes).</p>
<section id="list-columns-nested-frames" class="level4">
<h4 class="anchored" data-anchor-id="list-columns-nested-frames">List Columns &amp; Nested Frames</h4>
<p>List columns that contain multiple key-value pairs (e.g.&nbsp;column-value) in a single column can be created with <code>pl.struct()</code> similar to R’s <code>list()</code>.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1">df.with_columns(list_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.struct( cs.integer() ))</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 4)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">c</th>
<th data-quarto-table-cell-role="th">list_col</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
<th>struct[2]</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"1"</td>
<td>3</td>
<td>7</td>
<td>{3,7}</td>
</tr>
<tr class="even">
<td>"1"</td>
<td>4</td>
<td>8</td>
<td>{4,8}</td>
</tr>
<tr class="odd">
<td>"2"</td>
<td>5</td>
<td>9</td>
<td>{5,9}</td>
</tr>
<tr class="even">
<td>"2"</td>
<td>6</td>
<td>0</td>
<td>{6,0}</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>These structs can further be aggregated across rows into miniature datasets.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1">df.group_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>).agg(list_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.struct( cs.integer() ) )</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (2, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">list_col</th>
</tr>
<tr class="odd">
<th>str</th>
<th>list[struct[2]]</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"2"</td>
<td>[{5,9}, {6,0}]</td>
</tr>
<tr class="even">
<td>"1"</td>
<td>[{3,7}, {4,8}]</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>In fact, this could be a good use case for our column selectors! If we have many columns we want to keep unnested and many we want to nest, it could be efficient to list out only the grouping variables and create our nested dataset by examining matches.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1">cols <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>]</span>
<span id="cb19-2">(df</span>
<span id="cb19-3">  .group_by(cs.by_name(cols))</span>
<span id="cb19-4">  .agg(list_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.struct(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>cs.by_name(cols)))</span>
<span id="cb19-5">)</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (2, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">list_col</th>
</tr>
<tr class="odd">
<th>str</th>
<th>list[struct[2]]</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"2"</td>
<td>[{5,9}, {6,0}]</td>
</tr>
<tr class="even">
<td>"1"</td>
<td>[{3,7}, {4,8}]</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</section>
<section id="undoing" class="level4">
<h4 class="anchored" data-anchor-id="undoing">Undoing</h4>
<p>Just as we constructed our nested data, we can denormalize it and return it to the original state in two steps. To see this, we can assign the nested structure above as <code>df_nested</code>.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1">df_nested <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.group_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>).agg(list_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.struct( cs.integer() ) )</span></code></pre></div>
</div>
<p>First, <code>explode()</code> returns the table to the original grain, leaving us with a single struct in each row.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1">df_nested.explode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'list_col'</span>)</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 2)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">list_col</th>
</tr>
<tr class="odd">
<th>str</th>
<th>struct[2]</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"1"</td>
<td>{3,7}</td>
</tr>
<tr class="even">
<td>"1"</td>
<td>{4,8}</td>
</tr>
<tr class="odd">
<td>"2"</td>
<td>{5,9}</td>
</tr>
<tr class="even">
<td>"2"</td>
<td>{6,0}</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<p>Then, <code>unnest()</code> unpacks each struct and turns each element back into a column.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1">df_nested.explode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'list_col'</span>).unnest(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'list_col'</span>)</span></code></pre></div>
<div class="cell-output-display">
<div>
<div><style>
.dataframe > thead > tr,
.dataframe > tbody > tr {
  text-align: right;
  white-space: pre-wrap;
}
</style>
<small>shape: (4, 3)</small>
<table class="dataframe table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">a</th>
<th data-quarto-table-cell-role="th">b</th>
<th data-quarto-table-cell-role="th">c</th>
</tr>
<tr class="odd">
<th>str</th>
<th>i64</th>
<th>i64</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>"1"</td>
<td>3</td>
<td>7</td>
</tr>
<tr class="even">
<td>"1"</td>
<td>4</td>
<td>8</td>
</tr>
<tr class="odd">
<td>"2"</td>
<td>5</td>
<td>9</td>
</tr>
<tr class="even">
<td>"2"</td>
<td>6</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>


</section>
</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Meaning you can’t get the same result twice because if you rerun the same code the input has already been modified↩︎</p></li>
<li id="fn2"><p>Of the <code>tidyverse</code> functions mentioned so far, this is the only one found in <code>tidyr</code> not <code>dplyr</code>↩︎</p></li>
<li id="fn3"><p>That is, validating an assumption that joins should have been one-to-one, one-to-many, etc.↩︎</p></li>
<li id="fn4"><p>However, this is more by convention. There’s not a strong reason why they would strictly need to be.↩︎</p></li>
<li id="fn5"><p>I recently ran a <a href="https://twitter.com/EmilyRiederer/status/1744707632886095998">Twitter poll</a> on whether people prefer real, canonical, or fake datasets for learning and teaching. Fake data wasn’t the winner, but it is a strategy I personally find fun and useful as the unit-test analog for learning.↩︎</p></li>
<li id="fn6"><p>For example, an API payload for a LinkedIn user might have nested data structures representing professional experience and educational experience↩︎</p></li>
<li id="fn7"><p>For example, training a model on different data subsets.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>rstats</category>
  <category>python</category>
  <category>tutorial</category>
  <guid>https://emilyriederer.com/post/py-rgo-polars/</guid>
  <pubDate>Sat, 13 Jan 2024 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/py-rgo-polars/featured.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Crosspost: Why you’re closer to data documentation than you think</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/docs-closer-than-you-think/</link>
  <description><![CDATA[ 





<p>Documentation can make or break the success of a data initiative, but it’s too often considered an optional nice-to-have. I’m a big believer that writing is thinking. Similarly, documenting is planning, executing, and validating.</p>
<p>Previously, I’ve explored how <a href="https://emilyriederer.netlify.app/post/latent-lasting-documentation/">we can create latent and lasting documentation</a> of data products and how <a href="https://emilyriederer.netlify.app/post/column-name-contracts/">column names can be self documenting</a>.</p>
<p>Recently, I had the opportunity to expand on these ideas in a <a href="https://www.selectstar.com/blog/why-youre-closer-to-data-documentation-than-you-think">cross-post with Select Star</a>. I argue that teams can produce high-quality and maintainable documentation with low overhead with a form of “documentation-driven development”. That is, smartly structuring and re-using artifacts from the development process into long-term documentation. For example:</p>
<ul>
<li>At the planning stage:
<ul>
<li>Structuring requirements docs in the form of data dictionaries</li>
<li>Creating early alignment on higher-order concepts like entity definitions (and <em>writing them down</em>)</li>
<li>Mentally beta testing data usability with an entity-relationship diagram</li>
</ul></li>
<li>At the development stage:
<ul>
<li>Ensuring relevant parts of internal “development documentation” (e.g.&nbsp;dbt column definitions, docstrings) are published to a format and location accessible to users</li>
<li>With different information but similar motivation to ER diagrams, sharing the full orchestration DAG to help users trace column-level lineage and internalize how each field maps to a real-world data generating process</li>
<li>Sharing data tests being executed (the “user contract”) and their results</li>
</ul></li>
<li>Throughout the lifecycle:
<ul>
<li>Answering questions “in public” (e.g.&nbsp;Slack versus email) to create a searchable collection of insights</li>
<li>Producing table usage statistics to help large, decentralized orgs capture the “wisdom of the crowds”</li>
</ul></li>
</ul>
<p>If you or your team works on data documentation, I’d love to hear what other patterns you’ve found to collect useful documentation assets during a data development process.</p>



 ]]></description>
  <category>data</category>
  <category>workflow</category>
  <category>elt</category>
  <category>crosspost</category>
  <guid>https://emilyriederer.com/post/docs-closer-than-you-think/</guid>
  <pubDate>Fri, 05 Jan 2024 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/docs-closer-than-you-think/featured.PNG" medium="image"/>
</item>
<item>
  <title>Python Rgonomics</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/py-rgo/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://emilyriederer.com/post/py-rgo/featured.jpg" class="img-fluid figure-img"></p>
<figcaption>Photo credit to the inimitable <a href="https://allisonhorst.com/">Allison Horst</a></figcaption>
</figure>
</div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>Some advice in this post has gone stale regarding IDEs, installers, and environment management tools. Please see my <a href="post/py-rgo-2025">2025 update</a> for more recent thoughts following the release of <code>uv</code> and <code>Positron</code>.</p>
</div>
</div>
<p>Interoperability was a key theme in open-source data languages in 2023. Ongoing innovations in <a href="https://arrow.apache.org/">Arrow</a> (a language-agnostic in-memory standard for data storage), growing adoption of <a href="https://quarto.org/">Quarto</a> (the language-agnostic heir apparent to R Markdown), and even pandas creator Wes McKinney <a href="https://posit.co/blog/welcome-wes/">joining Posit</a> (the language-agnostic rebranding of RStudio) all illustrate the ongoing investment in breaking down barriers between different programming languages and paradigms.</p>
<p>Despite these advances in <em>technical</em> interoperability, individual developers will always face more friction than state-of-the-art tools when moving between languages. Learning a new language is easily enough done; programming 101 concepts like truth tables and control flow translate seamlessly. But ergonomics of a language do not. The tips and tricks we learn to be hyper productive in a primary language are comfortable, familiar, elegant, and effective. They just <em>feel</em> good. Working in a new language, developers often face a choice between forcing their favored workflows into a new tool where they may not “fit”, writing technically correct yet plodding code to get the job done, or approaching a new language as a true beginner to learn its “feel” from the ground up.</p>
<p>Fortunately, some of these higher-level paradigms have begun to bleed across languages, enriching previously isolated tribes and enabling developers to take their advanced skillsets with them across languages. For any R users who aim to upskill in python in 2024, recent tools and versions of old favorites have made strides in converging the R and python data science stacks. In this post, I will overview some recommended tools that are both truly pythonic while capturing the comfort and familiarity of some favorite R packages of the <code>tidyverse</code> variety.<sup>1</sup></p>
<section id="what-this-post-is-not" class="level2">
<h2 class="anchored" data-anchor-id="what-this-post-is-not">What this post is not</h2>
<p>Just to be clear:</p>
<ul>
<li>This is not a post about why python is better than R so R users should switch all their work to python</li>
<li>This is not a post about why R is better than python so R semantics and conventions should be forced into python</li>
<li>This is not a post about why python <em>users</em> are better than R users so R users need coddling</li>
<li>This is not a post about why R <em>users</em> are better than python users and have superior tastes for their toolkit</li>
<li>This is not a post about why these python tools are the only good tools and others are bad tools</li>
</ul>
<p>If you told me you liked New York’s Metropolitan Museum of Art, I might say that you might also like Chicago’s Art Institute. That doesn’t mean you should only go to the museum in Chicago or that you should never go to the Louvre in Paris. That’s not how recommendations (by human or recsys) work. This is an “opinionated” post in the sense that “I like this” and not opinionated in the sense that “you must do this”.</p>
</section>
<section id="on-picking-tools" class="level2">
<h2 class="anchored" data-anchor-id="on-picking-tools">On picking tools</h2>
<p>The tools I highlight below tend to have two competing features:</p>
<ul>
<li>They have aspects of their workflow and ergonomics that should feel very comfortable to users of favored R tools</li>
<li>They should be independently accepted, successful, and well-maintained python projects with the true pythonic spirit</li>
</ul>
<p>The former is important because otherwise there’s nothing tailored about these recommendations; the latter is important so users actually engage with the python language and community instead of dabbling around in its more peripheral edges. In short, these two principles <em>exclude</em> tools that are direct ports between languages with that as their sole or main benefit.<sup>2</sup></p>
<p>For example, <code>siuba</code> and <code>plotnine</code> were written with the direct intent of mirroring R syntax. They have seen some success and adoption, but more niche tools come with liabilities. With smaller user-bases, they tend to lack in the pace of development, community support, prior art, StackOverflow questions, blog posts, conference talks, discussions, others to collaborate with, cachet in a portfolio, etc. Instead of enjoying the ergonomics of an old language or embracing the challenge of learning a new one, ports can sometimes force developers to invest energy into a “secret third thing” of learning tools that isolate them from both communities and facing inevitable snags by themselves.</p>
<p>When in Rome, do as the Romans do – but if you’re coming from the U.S. that doesn’t mean you can’t bring a universal adapter that can help charge your devices in European outlets.</p>
</section>
<section id="the-stack" class="level2">
<h2 class="anchored" data-anchor-id="the-stack">The stack</h2>
<p>With that preamble out of the way, below are a few recommendations for the most ergonomic tools for getting set up, conducting core data analysis, and communicating results.</p>
<p>To preview these recommendations:</p>
<p><strong>Set Up</strong></p>
<ul>
<li>Installation: <a href="https://github.com/pyenv/pyenv"><code>pyenv</code></a></li>
<li>IDE: <a href="https://code.visualstudio.com/docs/languages/python">VS Code</a></li>
</ul>
<p><strong>Analysis</strong></p>
<ul>
<li>Wrangling: <a href="https://pola.rs/"><code>polars</code></a></li>
<li>Visualization: <a href="https://seaborn.pydata.org/"><code>seaborn</code></a></li>
</ul>
<p><strong>Communication</strong></p>
<ul>
<li>Tables: <a href="https://posit-dev.github.io/great-tables/articles/intro.html">Great Tables</a></li>
<li>Notebooks: <a href="https://quarto.org/">Quarto</a></li>
</ul>
<p><strong>Miscellaneous</strong></p>
<ul>
<li>Environment Management: <a href="https://pdm-project.org/latest/"><code>pdm</code></a></li>
<li>Code Quality: <a href="https://docs.astral.sh/ruff/"><code>ruff</code></a></li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>I don’t want this advice to set up users for a potential snag. If you are on Windows and install python with <code>pyenv-win</code>, Quarto (as of writing on v1.3) may struggle to find the correct executable. Better support for this is on the backlog, but if you run into this issue, check out this <a href="https://github.com/quarto-dev/quarto-cli/issues/3500#issuecomment-1375334561">brilliant fix</a>.</p>
</div>
</div>
<section id="for-setting-up" class="level3">
<h3 class="anchored" data-anchor-id="for-setting-up">For setting up</h3>
<p>The first hurdle is often getting started – both in terms of installing the tools you’ll need and getting into a comfortable IDE to run them.</p>
<ul>
<li><strong>Installation</strong>: R keeps installation simple; there’s one way to do it* so you do and it’s done. But before python converts can <code>print("hello world")</code>, they face a range of options (system Python, Python installer UI, Anaconda, Miniconda, etc.) each with its own kinks. These decisions are made harder in Python since projects tend to have stronger dependencies on the language version, requiring one to switch between versions. For both of these reasons, I favor <a href="https://github.com/pyenv/pyenv"><code>pyenv</code></a> (or <code>pyenv-win</code> for those on Windows) for easily managing python installation(s) from the command line. While the installation process of <code>pyenv</code> may be <em>technically</em> different, it’s similar in that it “just works” with just a few commands. In fact, the workflow is <em>so slick</em> that things seem to have gone 180 degrees with <code>pyenv</code> inspiring <a href="https://github.com/r-lib/rig">a similar project called <code>rig</code> to manage R installations</a>. This may sound intimidating, but the learning curve is actually quite shallow:
<ul>
<li><code>pyenv install --list</code>: To see what python versions are available to install</li>
<li><code>pyenv install &lt;version number&gt;</code>: To install a specific version</li>
<li><code>pyenv versions</code>: To see what python versions are installed on your system</li>
<li><code>pyenv global &lt;version number&gt;</code>: To set one python version as a global default</li>
<li><code>pyenv local &lt;version number&gt;</code>: To set a python version to be used within a specific directory/project</li>
</ul></li>
<li><strong>Integrated Development Environment</strong>: Once R is installed, R users are typically off to the races with the intuitive RStudio IDE which helps them get immediately hands-on with the REPL. With the UI divided into quadrants, users can write an R script, run it to see results in the console, conceptualize what the program “knows” with the variable explorer, and navigate files through a file explorer. Once again, python is not lacking in IDE options, but users are confronted with yet another decision point before they even get started. Pycharm, Sublime, Spyder, Eclipse, Atom, Neovim, oh my! I find that <a href="https://code.visualstudio.com/docs/languages/python">VS Code</a> offers the best functionality. Its rich extension ecosystem also means that most major tools (e.g.&nbsp;Quarto, git, linters and stylers, etc.) have nice add-ons so, like RStudio, you can customize your platform to perform many side-tasks in plaintext or with the support of extra UI components.<sup>3</sup></li>
</ul>
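<p>After switching versions with <code>pyenv global</code> or <code>pyenv local</code>, it can be reassuring to confirm which interpreter a script or IDE session is actually using. The snippet below is a minimal sketch using only the standard library; the helper name <code>describe_interpreter</code> is hypothetical.</p>

```python
import platform
import sys


def describe_interpreter() -> dict:
    """Report the path and version of the interpreter running this code."""
    return {
        "executable": sys.executable,            # path to the active python binary
        "version": platform.python_version(),    # e.g. major.minor.patch as a string
    }


info = describe_interpreter()
print(f"Running Python {info['version']} from {info['executable']}")
```

<p>If the reported version does not match what <code>pyenv versions</code> marks as active, the terminal or IDE is likely picking up a different installation.</p>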
</section>
<section id="for-data-analysis" class="level3">
<h3 class="anchored" data-anchor-id="for-data-analysis">For data analysis</h3>
<p>As data practitioners know, we’ll spend most of our time on cleaning and wrangling. As such, R users may struggle particularly to abandon their favorite tools for exploratory data analysis like <code>dplyr</code> and <code>ggplot2</code>. Fans of those packages often appreciate how their functional paradigm helps achieve a “flow state”. Precise syntax may differ, but new developments in the python wrangling stack provide increasingly close analogs to some of these beloved Rgonomics.</p>
<ul>
<li><strong>Data Wrangling</strong>: Although <code>pandas</code> is undoubtedly the best-known wrangling tool in the python space, I believe the growing <a href="https://pola.rs/"><code>polars</code></a> project offers the best experience for a transitioning developer (along with other nice-to-have benefits like being dependency free and blazingly fast). <code>polars</code> may feel more natural and less error-prone to R users for many reasons:
<ul>
<li>it has more internally consistent (and similar to <code>dplyr</code>) syntax such as <code>select</code>, <code>filter</code>, etc. and has demonstrated that the project values a clean API (e.g.&nbsp;recently renaming <code>groupby</code> to <code>group_by</code>)</li>
<li>it does not rely on the distinction between columns and indexes which can feel unintuitive and introduces a new set of concepts to learn</li>
<li>it consistently returns copies of dataframes (while <code>pandas</code> sometimes alters in-place) so code is more idempotent and avoids a whole class of failure modes for new users</li>
<li>it enables many of the same “advanced” wrangling workflows in <code>dplyr</code> with high-level, semantic code like making the transformation of multiple variables at once fast with <a href="https://docs.pola.rs/py-polars/html/reference/selectors.html">column selectors</a>, concisely expressing <a href="https://docs.pola.rs/user-guide/expressions/window/">window functions</a>, and working with nested data (or what <code>dplyr</code> calls “list columns”) with <a href="https://docs.pola.rs/user-guide/expressions/lists/">lists</a> and <a href="https://docs.pola.rs/user-guide/expressions/structs/">structs</a></li>
<li>it supports users working with increasingly large data. Similar to <code>dplyr</code>’s many backends (e.g.&nbsp;<code>dbplyr</code>), <code>polars</code> can be used to write lazily-evaluated, optimized transformations and its syntax is reminiscent of <code>pyspark</code> should users ever need to switch between them</li>
</ul></li>
<li><strong>Visualization</strong>: Even some of R’s critics will acknowledge the strength of <code>ggplot2</code> for visualization, both in terms of its intuitive and incremental API and the stunning graphics it can produce. <a href="https://seaborn.pydata.org/tutorial/objects_interface"><code>seaborn</code>’s object interface</a> seems to strike a great balance between offering a similar workflow (which <a href="https://seaborn.pydata.org/whatsnew/v0.12.0.html">cites <code>ggplot2</code> as an inspiration</a>) while bringing all the benefits of using an industry-standard tool</li>
</ul>
</section>
<section id="for-communication" class="level3">
<h3 class="anchored" data-anchor-id="for-communication">For communication</h3>
<p>Historically, one possible dividing line between R and python has been framed as “python is good at working with computers, R is good at working with people”. While that is partially inspired by reductive takes that R is not production-grade, it is not without truth that R’s academic roots spurred it to overinvest in a rich “communication stack” for translating analytical outputs into human-readable, publishable outputs. Here, too, the gaps have begun to close.</p>
<ul>
<li><strong>Tables</strong>: R has no shortage of packages for creating nicely formatted tables, an area that has historically lacked a bit in python both in workflow and outcomes. Barring strong competition from the native python space, the one “port” I am bullish about is the recently announced <a href="https://posit-dev.github.io/great-tables/articles/intro.html">Great Tables</a> package. This is a pythonic clone of R’s <code>gt</code> package. I’m more comfortable recommending this since it’s maintained by the same developer as the R version (to support long-term feature parity), backed by an institution not just an individual (to ensure it’s not a short-lived hobby project), and the design feels like it does a good job balancing R inspiration with pythonic practices</li>
<li><strong>Computational notebooks</strong>: Jupyter Notebooks are widely used, widely critiqued parts of many python workflows. While the ability to mix markdown and code chunks is appealing, notebooks can introduce new types of bugs for the uninitiated; for example, they are hard to version control and easy to execute in the wrong environment. For those coming from the world of R Markdown, plaintext computational notebooks like <a href="https://quarto.org/">Quarto</a> may provide a more transparent development experience. While Quarto allows users to write in <code>.qmd</code> files which are more like their <code>.rmd</code> predecessors, its renderer can also handle Jupyter notebooks to enable collaboration across team members with different preferences</li>
</ul>
</section>
<section id="miscellaneous" class="level3">
<h3 class="anchored" data-anchor-id="miscellaneous">Miscellaneous</h3>
<p>A few more tools may be helpful and familiar to <em>some</em> R users who tend towards the more “developer” versus “analyst” side of the spectrum. These, in my mind, have even more varied pros and cons, but I’ll leave them for consideration:</p>
<ul>
<li><strong>Environment Management</strong>: Joining the python world means never having to settle on an environment management tool for installing packages. There’s a truly overwhelming number of ways to manage project-level dependencies (<code>virtualenv</code>, <code>conda</code>, <code>pip-tools</code>, <code>pipenv</code>, <code>poetry</code>, and that doesn’t even scratch the surface) with different pros and cons, and a phenomenal amount of ink/pixels has been spilled over litigating these trade-offs. Putting all that aside, lately, I’ve been favoring <a href="https://pdm-project.org/latest/"><code>pdm</code></a> because it prioritizes features I care most about (auto-updating <code>pyproject.toml</code>, isolating dependencies from dependencies-of-dependencies, active development and error handling, mostly just works pretty undramatically)</li>
<li><strong>Developer Tools</strong>: <a href="https://docs.astral.sh/ruff/"><code>ruff</code></a> provides a range of linting and styling options (think R’s <code>lintr</code> and <code>styler</code>) and provides a one-stop-shop over what can be an overwhelming number of atomic tools in this space (<code>isort</code>, <code>black</code>, <code>flake8</code>, etc.). <code>ruff</code> is super fast, has a nice VS Code extension, and, while this class of tools is generally considered more advanced, I think linters can be a fantastic “coach” for new users about best practices</li>
</ul>
</section>
</section>
<section id="more-to-come" class="level2">
<h2 class="anchored" data-anchor-id="more-to-come">More to come!</h2>
<p>Each recommendation here itself could be its own tutorial or post. In particular, I hope to showcase the Rgonomics of <code>polars</code>, <code>seaborn</code>, and <code>great_tables</code> in future posts.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Of course, languages have their own subcultures too. The <code>tidyverse</code> and <code>data.table</code> parts of the R world tend to favor different semantics and ergonomics. This post caters more to the former.↩︎</p></li>
<li id="fn2"><p>There is no doubt a place for language ports, especially for earlier-stage projects where no native language-specific standard exists. For example, I like Karandeep Singh’s lab work on <a href="https://github.com/TidierOrg/Tidier.jl">a tidyverse for Julia</a> and maintain my own <a href="https://github.com/emilyriederer/dbtplyr"><code>dbtplyr</code></a> package to port <code>dplyr</code>’s select helpers to <code>dbt</code>.↩︎</p></li>
<li id="fn3"><p>If anything, the one challenge of VS Code is the sheer number of setup options, but to start out, you can see these excellent tutorials from Rami Krispin on recommended <a href="https://github.com/RamiKrispin/vscode-python">python</a> and <a href="https://github.com/RamiKrispin/vscode-r">R</a> configurations ↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>rstats</category>
  <category>python</category>
  <guid>https://emilyriederer.com/post/py-rgo/</guid>
  <pubDate>Sat, 30 Dec 2023 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/py-rgo/featured.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Big ideas from the 2023 Causal Data Science Meeting</title>
  <dc:creator>Emily Riederer</dc:creator>
  <link>https://emilyriederer.com/post/recap-causal-2023/</link>
  <description><![CDATA[ 





<p>Last week, I enjoyed attending parts of the annual virtual <a href="https://www.causalscience.org/">Causal Data Science Meeting</a> organized by researchers from Maastricht University, Netherlands, and Copenhagen Business School, Denmark. This has been one of my favorite virtual events since the first iteration in 2020, and I find it consistently highlights the best of the causal research community: bringing together industry and academia with concise talks that are at once thought-provoking, theoretically well-grounded, yet thoroughly pragmatic.</p>
<p>While I could not join the entire event (running in CET, some sessions fit snugly between my first cup of coffee and first work meeting of the day in CST), this year’s conference did not disappoint! Below, I share a sampling with five “big ideas” from the sessions.</p>
<ol type="1">
<li><p><strong>What’s the current “gold standard” of causal ML methods in industry?</strong> <a href="https://www.linkedin.com/in/dimgold/">Dima Goldenberg</a> presented a great case study on heterogeneous uplift modeling at Booking.com. (While I couldn’t find the exact slides or paper, you can get a flavor of Booking’s work in experimentation and causal inference from their excellent <a href="https://blog.booking.com/#datascience">tech blog</a>.)</p></li>
<li><p><strong>How does causal evidence add value?</strong> <a href="https://www.linkedin.com/in/robert-kubinec-9191a9a/">Robert Kubinec</a> conceptualized a measurable spectrum of descriptive to causal studies based on entropy. This framework broadens the aperture to think about how both quantitative and qualitative evidence can come together to form causal conclusions. (<a href="https://osf.io/preprints/socarxiv/a492b/">Preprint</a>)</p></li>
<li><p><strong>But how do we know the methods work?</strong> Causal methods are notoriously hard to validate since, by definition, we lack a ground truth against which to compare our estimate. To validate new methods, Lingjie Shen and coauthors presented one approach with their new <a href="https://github.com/duolajiang/RCTrep"><code>RCTrep</code> R package</a>, which can be used to compare outcomes between real-world data (RWD) and randomized controlled trial (RCT) data.</p></li>
<li><p><strong>And what do we do when they can’t get all the way there?</strong> <a href="https://www.linkedin.com/in/ferlocar/">Carlos Fernández-Loría</a> and <a href="https://www.linkedin.com/in/jorge-lor%C3%ADa/">Jorge Loría</a>’s talk on “Causal Scoring” explores how we can accept and make use of “causal ranking” or “causal classification” even when we do not believe we can generate fully credible, calibrated causal estimates. By defining which type of estimand is really necessary for a specific use case, they show how one can tailor their modeling approach and broaden the range of applications. (<a href="https://arxiv.org/abs/2206.12532">Preprint</a>)</p></li>
<li><p><strong>Finally, do the best methods that correctly accrue causal evidence and validate <em>matter</em>?</strong> <a href="https://www.linkedin.com/in/ronberman/">Ron Berman</a> and Anya Shchetkina tackled this question in their paper about when correctly modeling uplift heterogeneity does and doesn’t matter. They decomposed potential causes using real-world marketing and public health examples and presented a methodology for identifying when uplift-based personalization makes a business impact. (I couldn’t find a preprint, but they also presented at MIT’s CODE this week, so hopefully there will be a video soon!)</p></li>
</ol>
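<p>To make the “causal ranking” idea from the fourth talk a bit more concrete, here is a toy simulation. It is my own illustration, not the authors’ code or data: the effects are randomly generated, and the “score” is a hypothetical model output that is badly miscalibrated but still increases with the true effect.</p>

```python
import random

random.seed(42)

# Hypothetical true individual treatment effects (never observed in practice)
true_effects = [random.uniform(0.0, 1.0) for _ in range(1000)]

# Hypothetical causal score: biased and miscalibrated, but monotone in the true effect
scores = [3.0 * effect + 5.0 for effect in true_effects]


def top_k(values, k):
    """Return the set of indices of the k largest values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


# As an effect estimate, the score is poor: it overstates every effect by about 5 or more
mean_abs_error = sum(abs(s - t) for s, t in zip(scores, true_effects)) / len(scores)

# As a ranking, the score is fine: the top 10% by score are the top 10% by true effect
same_targets = top_k(scores, 100) == top_k(true_effects, 100)
```

<p>If the use case only requires choosing whom to target, the ranking is the estimand that matters, and the large estimation error in this toy example costs nothing.</p>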
<p>One of the joys of the causal DS community’s mindset is the inherent focus on impact and pragmatism, and this year’s conference continued to deliver in that vein. I’m marking my calendar (and setting my 4AM alarm!) for next year already.</p>



 ]]></description>
  <category>causal</category>
  <guid>https://emilyriederer.com/post/recap-causal-2023/</guid>
  <pubDate>Sat, 18 Nov 2023 06:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/post/recap-causal-2023/featured.png" medium="image" type="image/png" height="83" width="144"/>
</item>
<item>
  <title>Data Downtime Horror Stories Panel</title>
  <link>https://emilyriederer.com/talk/data-downtime/</link>
  <description><![CDATA[ 




<section id="abstract" class="level2">
<h2 class="anchored" data-anchor-id="abstract">Abstract</h2>
<p>In October, I joined a Halloween-themed panel along with Chad Sanderson and Joe Reis to discuss our horror stories of data quality gone wrong and how to build successful data quality strategies in large organizations. Key takeaways are summarized on <a href="https://www.montecarlodata.com/blog-scary-data-quality-stories-7-tips-for-preventing-your-own-data-downtime-nightmare/">Monte Carlo’s blog</a>.</p>


</section>

 ]]></description>
  <category>data</category>
  <guid>https://emilyriederer.com/talk/data-downtime/</guid>
  <pubDate>Mon, 23 Oct 2023 05:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/talk/data-downtime/featured.PNG" medium="image"/>
</item>
<item>
  <title>Operationalizing Column-Name Contracts with dbtplyr</title>
  <link>https://emilyriederer.com/talk/dbtplyr/</link>
  <description><![CDATA[ 




<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-1-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-1" aria-controls="tabset-1-1" aria-selected="true">Quick Links</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-2" aria-controls="tabset-1-2" aria-selected="false">Abstract</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-3-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-3" aria-controls="tabset-1-3" aria-selected="false">Slides</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-4-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-4" aria-controls="tabset-1-4" aria-selected="false">Video</a></li></ul>
<div class="tab-content">
<div id="tabset-1-1" class="tab-pane active" aria-labelledby="tabset-1-1-tab">
<p>At Coalesce, for a dbt user audience:</p>
<p><span><i class="bi bi-file-bar-graph"></i> <a href="slides.pdf">Slides</a> </span><br>
<span><i class="bi bi-play"></i> <a href="https://www.getdbt.com/coalesce-2021/operationalizing-columnname-contracts-with-dbtplyr/">Video</a> </span></p>
<p>At posit::conf, for an R user audience:</p>
<p><span><i class="bi bi-file-bar-graph"></i> <a href="slides-posit.pdf">Slides</a> </span><br>
<span><i class="bi bi-play"></i> Video - posit::conf for R User Audience <em>coming soon!</em> </span></p>
<p><span><i class="bi bi-pencil"></i> <a href="../../post/column-name-contracts/">Post - Column Name Contracts</a> </span><br>
<span><i class="bi bi-pencil"></i> <a href="../../post/convo-dbt/">Post - Column Name Contracts in dbt</a> </span><br>
<span><i class="bi bi-pencil"></i> <a href="../../post/convo-dbt-update/">Post - Column Name Contracts with dbtplyr</a> </span></p>
</div>
<div id="tabset-1-2" class="tab-pane" aria-labelledby="tabset-1-2-tab">
<p>Complex software systems make performance guarantees through documentation and unit tests, and they communicate these to users with conscientious interface design.</p>
<p>However, published data tables exist in a gray area; they are static enough not to be considered a “service” or “software”, yet too raw to earn attentive user interface design. This ambiguity creates a disconnect between data producers and consumers and poses a risk for analytical correctness and reproducibility.</p>
<p>In this talk, I will explain how controlled vocabularies can be used to form contracts between data producers and data consumers. Explicitly embedding meaning in each component of variable names is a low-tech and low-friction approach which builds a shared understanding of how each field in the dataset is intended to work.</p>
<p>Doing so can lighten the burden on data producers by facilitating automated data validation and metadata management. At the same time, data consumers benefit from a reduced cognitive load to remember names, a deeper understanding of variable encoding, and opportunities to more efficiently analyze the resulting dataset. After discussing the theory of controlled vocabulary column-naming and related workflows, I will illustrate these ideas with a demonstration of the {dbtplyr} dbt package, which helps analytics engineers get the most value from controlled vocabularies by making it easier to effectively exploit column naming structures while coding.</p>
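<p>As a rough sketch of the idea in plain Python (this is not the {dbtplyr} API, and the prefixes <code>id</code>, <code>ind</code>, and <code>n</code> along with both helper functions are hypothetical choices for illustration), a controlled vocabulary can drive both column selection and automated validation:</p>

```python
# Hypothetical controlled vocabulary: each column-name prefix promises a property
# that can be checked automatically. These prefixes are assumptions for illustration.
CONTRACTS = {
    "id": lambda v: isinstance(v, int),             # identifier stored as an integer
    "ind": lambda v: v in (0, 1),                   # binary indicator
    "n": lambda v: isinstance(v, int) and v >= 0,   # non-negative count
}


def columns_with_prefix(columns, prefix):
    """Return the column names whose first name component equals `prefix`."""
    return [c for c in columns if c.split("_")[0] == prefix]


def validate(rows):
    """Return (column, value) pairs that violate the contract implied by the prefix."""
    violations = []
    for row in rows:
        for column, value in row.items():
            check = CONTRACTS.get(column.split("_")[0])
            if check is not None and not check(value):
                violations.append((column, value))
    return violations


rows = [
    {"id_user": 1, "ind_login": 1, "n_visits": 3},
    {"id_user": 2, "ind_login": 2, "n_visits": -1},
]

indicator_columns = columns_with_prefix(rows[0].keys(), "ind")
violations = validate(rows)
```

<p>Here the second row breaks two contracts (an indicator that is not 0 or 1, and a negative count), and both are caught without writing any column-specific test.</p>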
</div>
<div id="tabset-1-3" class="tab-pane" aria-labelledby="tabset-1-3-tab">
<div id="slides" style="width:100%; aspect-ratio:16/11;">
<embed src="slides.pdf#zoom=Fit" width="100%" height="100%">
</div>
</div>
<div id="tabset-1-4" class="tab-pane" aria-labelledby="tabset-1-4-tab">
<p>Coming Soon!</p>
</div>
</div>
</div>



 ]]></description>
  <category>workflow</category>
  <category>rmarkdown</category>
  <category>rstats</category>
  <guid>https://emilyriederer.com/talk/dbtplyr/</guid>
  <pubDate>Thu, 21 Sep 2023 05:00:00 GMT</pubDate>
  <media:content url="https://emilyriederer.com/talk/dbtplyr/featured.png" medium="image" type="image/png" height="82" width="144"/>
</item>
</channel>
</rss>
