Emily Riederer

Python Rgonomics - 2025 Update

Emily Riederer — Sun, 26 Jan 2025 06:00:00 GMT

Photo credit to the inimitable Allison Horst

About a year ago, I wrote the original version of Python Rgonomics to help fellow former R users who were entering into the world of python. The general point of the article was that new python tooling (e.g. polars versus pandas) has evolved to a point where there are tools that remain truly performant and pythonic while still having a more similar user experience for those coming from the R world. I also discussed this at posit::conf(2024).

Ironically, the thesis held so true that it condemned my first 2024 post on the topic. 2024 saw the release of a few game-changing tools that further streamline and simplify the python workflow. This post provides an updated set of recommendations. Specifically, it highlights:

Consolidating installation and environment management tooling: Previously, I recommended pyenv for installing python versions and pdm for project and environment management. Then, last year saw the release of Astral’s excellent uv which nicely consolidates this functionality into a single highly performant tool.
Considering multiple IDE options: In addition to VS Code, I submit Posit PBC’s Positron for consideration depending on comfort, needs, and use cases. Both are backed by the open-source Code OSS with different layers of flexibility or customization. Positron is mostly interoperable with VS Code extensions, but provides a bit more of a “batteries included” opinionated design for the data analyst persona that may not want to navigate through the customization afforded by VS Code.

It is important to have a stable stack and not always jump to the next bright, shiny object; however, as I’ve watched these projects evolve throughout 2024, I feel confident to say they are not just a flash in the pan.

uv is supported by Charlie Marsh’s Astral, which previously made ruff to consolidate a number of code quality tools. Astral’s commitment to open source, the careful design, and the incredible performance benchmarks of uv speak for themselves. Similarly, Positron is backed by the reliable Posit PBC (formerly RStudio) as an open source extension of Code OSS (which is also the open-source skeleton for Microsoft’s VS Code).

The rest of this post is reproduced in full with relevant updates so it reads end-to-end instead of referencing the changes from old to new recommendations.

Now let’s get started

The “expert-novice” duality is an uncomfortable part of switching between languages like R and python. Learning a new language is easily enough done; programming 101 concepts like truth tables and control flow translate seamlessly. But ergonomics of a language do not. The tips and tricks we learn to be hyper productive in a primary language are comfortable, familiar, elegant, and effective. They just feel good. Working in a new language, developers often face a choice between forcing their favored workflows into a new tool where they may not “fit”, writing technically correct yet plodding code to get the job done, or approaching a new language as a true beginner to learn its “feel” from the ground up.

Fortunately, some of these higher-level paradigms have begun to bleed across languages, enriching previously isolated tribes and enabling developers to take their advanced skillsets with them across languages. For any R users who aim to upskill in python in 2024, recent tools and versions of old favorites have made strides in converging the R and python data science stacks. In this post, I will overview some recommended tools that are both truly pythonic while capturing the comfort and familiarity of some favorite R packages of the tidyverse variety.¹

What this post is not

Just to be clear:

This is not a post about why python is better than R so R users should switch all their work to python
This is not a post about why R is better than python so R semantics and conventions should be forced into python
This is not a post about why python users are better than R users so R users need coddling
This is not a post about why R users are better than python users and have superior tastes for their toolkit
This is not a post about why these python tools are the only good tools and others are bad tools

If you told me you liked New York’s Metropolitan Museum of Art, I might say that you might also like Chicago’s Art Institute. That doesn’t mean you should only go to the museum in Chicago or that you should never go to the Louvre in Paris. That’s not how recommendations (by human or recsys) work. This is an “opinionated” post in the sense that “I like this” and not opinionated in the sense that “you must do this”.

On picking tools

The tools I highlight below tend to have two competing features:

They have aspects of their workflow and ergonomics that should feel very comfortable to users of favored R tools
They should be independently accepted, successful, and well-maintained python projects with the true pythonic spirit

The former is important because otherwise there’s nothing tailored about these recommendations; the latter is important so users actually engage with the python language and community instead of dabbling around in its more peripheral edges. In short, these two principles exclude tools that are direct ports between languages with that as their sole or main benefit.²

For example, siuba and plotnine were written with the direct intent of mirroring R syntax. They have seen some success and adoption, but more niche tools come with liabilities. With smaller user-bases, they tend to lack in the pace of development, community support, prior art, StackOverflow questions, blog posts, conference talks, discussions, others to collaborate with, cachet in a portfolio, etc. Instead of enjoying the ergonomics of an old language or embracing the challenge of learning a new one, ports can sometimes force developers to invest energy into a “secret third thing” of learning tools that isolate them from both communities and facing inevitable snags by themselves.

When in Rome, do as the Romans do – but if you’re coming from the U.S. that doesn’t mean you can’t bring a universal adapter that can help charge your devices in European outlets.

The stack

With that preamble out of the way, below are a few recommendations for the most ergonomic tools for getting set up, conducting core data analysis, and communicating results.

To preview these recommendations:

Set Up

Installation: uv
IDE:
- VS Code, or
- Positron

Analysis

Wrangling: polars
Visualization: seaborn

Communication

Tables: Great Tables
Notebooks: Quarto

Miscellaneous

Environment Management: uv
Code Quality: ruff

For setting up

The first hurdle is often getting started – both in terms of installing the tools you’ll need and getting into a comfortable IDE to run them.

Installation: R keeps installation simple; there’s one way to do it so you do and it’s done³. But before python converts can print("hello world"), they face a range of options (system Python, Python installer UI, Anaconda, Miniconda, etc.) each with its own kinks. These decisions are made harder in Python since projects tend to depend more strongly on specific versions of the language, requiring one to switch between them. Fortunately, uv now makes this task easy with many different commands for:
- Installing one or more specific versions: uv python install
- Listing all available installations: uv python list
- Returning path of python executables: uv python find
- Spinning up a quick REPL with a temporary python version and packages: e.g. uv run --python 3.12 --with pandas python
Integrated Development Environment: Once R is installed, R users are typically off to the races with the intuitive RStudio IDE which helps them get immediately hands-on with the REPL. With the UI divided into quadrants, users can write an R script, run it to see results in the console, conceptualize what the program “knows” with the variable explorer, and navigate files through a file explorer. Once again, python is not lacking in IDE options, but users are confronted with yet another decision point before they even get started. Pycharm, Sublime, Spyder, Eclipse, Atom, Neovim, oh my! For python, I’d recommend either VS Code or Positron, which are both extensions of Code OSS.
- VS Code is an industry standard tool for software development. This means it has a rich set of features for coding, debugging, navigating large projects, etc. Its rich extension ecosystem also means that most major tools (e.g. Quarto, git, linters and stylers, etc.) have nice add-ons so, like RStudio, you can customize your platform to perform many side-tasks in plaintext or with the support of extra UI components.⁴
- Positron is a newer entrant from Posit PBC (formerly RStudio). It streamlines the offerings of VS Code to center the features most useful for data analysis. Positron may feel easier to go from zero-to-one. It does a great job finding and consistently using the right versions of R, python, Quarto, etc. and prioritizes many of the IDE elements that make RStudio wonderful for working with data (e.g. object preview pane). Additionally, most VS Code extensions will work in Positron; however, Positron cannot use extensions that rely on Microsoft’s Pylance, meaning some realtime linting and error detection tools like ErrorLens do not work out-of-the-box. Ultimately, your comfort navigating VS Code and your mix of dev versus data work may determine which is best for you.
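As a rough illustration of how the uv run one-liner listed above is put together, the sketch below assembles the same command from Python. The helper name uv_repl_command is hypothetical (it is not part of uv); the --python and --with flags are the ones shown in the list above. The sketch only prints the command instead of executing it, so it runs even on a machine without uv.

```python
import shlex


def uv_repl_command(python_version: str, packages: list[str]) -> list[str]:
    """Build the argument list for a throwaway REPL started with `uv run`.

    Hypothetical helper for illustration only; it just composes the flags
    mentioned in the post.
    """
    cmd = ["uv", "run", "--python", python_version]
    for package in packages:
        # --with is passed once per extra package
        cmd += ["--with", package]
    cmd.append("python")  # the program to run inside the temporary environment
    return cmd


command = uv_repl_command("3.12", ["pandas"])
print(shlex.join(command))
# prints: uv run --python 3.12 --with pandas python

# To actually start the REPL (requires uv on your PATH):
# import subprocess; subprocess.run(command, check=True)
```

Keeping the command as a list of arguments rather than one string avoids quoting problems if you later pass it to subprocess.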

For data analysis

As data practitioners know, we’ll spend most of our time on cleaning and wrangling. As such, R users may struggle particularly to abandon their favorite tools for exploratory data analysis like dplyr and ggplot2. Fans of those packages often appreciate how their functional paradigm helps achieve a “flow state”. Precise syntax may differ, but new developments in the python wrangling stack provide increasingly close analogs to some of these beloved Rgonomics.

Data Wrangling: (See my separate post on polars) Although pandas is undoubtedly the best-known wrangling tool in the python space, I believe the growing polars project offers the best experience for a transitioning developer (along with other nice-to-have benefits like being dependency free and blazingly fast). polars may feel more natural and less error-prone to R users for many reasons:
- it has more internally consistent (and similar to dplyr) syntax such as select, filter, etc. and has demonstrated that the project values a clean API (e.g. recently renaming groupby to group_by)
- it does not rely on the distinction between columns and indexes which can feel unintuitive and introduces a new set of concepts to learn
- it consistently returns copies of dataframes (while pandas sometimes alters in-place) so code is more idempotent and avoids a whole class of failure modes for new users
- it enables many of the same “advanced” wrangling workflows in dplyr with high-level, semantic code like making the transformation of multiple variables at once fast with column selectors, concisely expressing window functions, and working with nested data (or what dplyr calls “list columns”) with lists and structs
- it supports users working with increasingly large data. Similar to dplyr’s many backends (e.g. dbplyr), polars can be used to write lazily-evaluated, optimized transformations and its syntax is reminiscent of pyspark should users ever need to switch between them
Visualization: Even some of R’s critics will acknowledge the strength of ggplot2 for visualization, both in terms of its intuitive and incremental API and the stunning graphics it can produce. seaborn’s object interface (which cites ggplot2 as an inspiration) seems to strike a great balance between offering a similar workflow and bringing all the benefits of using an industry-standard tool.

For communication

Historically, one possible dividing line between R and python has been framed as “python is good at working with computers, R is good at working with people”. While that is partially inspired by reductive takes that R is not production-grade, it is not without truth that R’s academic roots spurred it to overinvest in a rich “communication stack” for translating analytical outputs into human-readable, publishable outputs. Here, too, the gaps have begun to close.

Tables: R has no shortage of packages for creating nicely formatted tables, an area that has historically lacked a bit in python both in workflow and outcomes. Barring strong competition from the native python space, the one “port” I am bullish about is the recently announced Great Tables package. This is a pythonic clone of R’s gt package. I’m more comfortable recommending this since it’s maintained by the same developer as the R version (to support long-term feature parity), backed by an institution not just an individual (to ensure it’s not a short-lived hobby project), and the design feels like it does a good job balancing R inspiration with pythonic practices
Computational notebooks: Jupyter Notebooks are widely used, widely critiqued parts of many python workflows. While the ability to mix markdown and code chunks is appealing, notebooks can introduce new types of bugs for the uninitiated; for example, they are hard to version control and easy to execute in the wrong environment. For those coming from the world of R Markdown, plaintext computational notebooks like Quarto may provide a more transparent development experience. While Quarto allows users to write in .qmd files which are more like their .rmd predecessors, its renderer can also handle Jupyter notebooks to enable collaboration across team members with different preferences.

Miscellaneous

A few more tools may be helpful and familiar to some R users who tend towards the more “developer” versus “analyst” side of the spectrum. These, in my mind, have even more varied pros and cons, but I’ll leave them for consideration:

Environment Management: There’s a truly overwhelming number of ways⁵ to manage project-level dependencies in python. As a consequence, there’s also a lot of outdated advice weighing pros and cons of feature sets that have since evolved. Here again, uv takes the cake as a swiss army knife tool. It features fast installation, auto-updating of the pyproject.toml and uv.lock files (so you don’t need to remember to pip freeze), separate tracking of primary dependencies from the fully resolved environment (so you can cleanly and completely remove dependencies-of-dependencies you no longer need), and so much more. uv can operate as a drop-in replacement for pip and generate a requirements.txt if needed for compatibility; however, given its explosive popularity and ergonomic design, I doubt you’ll have trouble convincing collaborators to adopt the same.
Developer Tools: ruff (another Astral project) provides a range of linting and styling options (think R’s lintr and styler) and provides a one-stop-shop over what can be an overwhelming number of atomic tools in this space (isort, black, flake8, etc.). ruff is super fast, has a nice VS Code extension, and, while this class of tools is generally considered more advanced, I think linters can be a fantastic “coach” for new users about best practices

Footnotes

¹ Of course, languages have their own subcultures too. The tidyverse and data.table parts of the R world tend to favor different semantics and ergonomics. This post caters more to the former.
² There is no doubt a place for language ports, especially for earlier stage projects where no native language-specific standard exists. For example, I like Karandeep Singh’s lab work on a tidyverse for Julia and maintain my own dbtplyr package to port dplyr’s select helpers to dbt.
³ However, to highlight some advances here, Posit’s newer rig project seems to be inspired by python install management tools and offers a convenient CLI for managing multiple versions of R.
⁴ If anything, the one challenge of VS Code is the sheer number of set up options, but to start out, you can see these excellent tutorials from Rami Krispin on recommended python and R configurations.
⁵ pdm, virtualenv, conda, piptools, pipenv, poetry, and that doesn’t even scratch the surface.

Role-Based Access Control for Quarto sites with Netlify Identity

Emily Riederer — Sun, 10 Nov 2024 06:00:00 GMT

Literate programming tools like R Markdown and Quarto make it easy to convert analyses into aesthetic documents, dashboards, and websites for public sharing. But what if you don’t want your results too public?

I recently was working on a project that required me to set up a large number of dashboards with similar content but different data for about 10 small, separate organizations. As I considered my tech stack, I found that many Quarto users were asking similar questions, but understandably the Quarto team had no one slam-dunk answer because authentication management (a serving / hosting problem) would be a substantial scope creep beyond the goals and core functionality of Quarto (an open-source publishing system).

After evaluating my options, I found the best solution for my use case was role-based access controls with Netlify Identity. In this post, I’ll briefly describe how this solution works, how to set it up, and some of the pros and cons.

Demo

Using a minimal Netlify Identity set-up, you can be up and running with the following UX in about 10 minutes. For this post, I show the true “minimum viable deployment”, although the styling and aesthetics could be made much fancier.

When users first visit your site’s homepage, they will be prompted that they need to sign-up or login to continue.

If users navigate to any other part of the site before logging in, they’ll receive an error message prompting them to return to the home screen. (This could be customized as you would a 404 Not Found error.)

After clicking either button, an in-browser popup modal allows them to sign up, log in, or request forgotten credentials.

The example above shows the option to create a custom login or use Google to authenticate. Netlify also allows for the options to use other free (e.g. GitHub, GitLab) or paid (e.g. Okta) third-party login services.

For new signups, Netlify can automatically trigger confirmation emails with customized content based on a templated text or HTML file in your repository.

Once logged in, the homepage then offers the option to log back out.

Otherwise, users can then proceed to the rest of the site as if it were publicly available.

Set Up

The basics of how Netlify Identity works are described at length in this blog post. If you decide to implement this solution, I recommend reading those official documents for a more robust mental model. In short, Netlify Identity works by attaching a token to each user after they log in. This user-specific token can be assigned different roles on the backend, and depending on which roles a user has, they can be redirected to (or gated from) seeing different content.

Setting up Netlify Identity requires a few small tweaks throughout your site:

Add Javascript to each page to handle the JSON Web Tokens (JWTs) set by Identity. This is done most easily through the _quarto.yml
Configure site redirects to respond to the JWTs. This is contained in its own _redirects file
Ensure you have a user interface that allows users to sign up and login, thus changing their JWTs and access. I put this in my index.qmd

Then, finally, within the Netlify admin panel, you must:

Configure the user signup workflow (e.g. by invitation, open sign-up)
Assign users to roles that determine what content they can see
Optionally, enable third-party forms of authentication (e.g. Google, GitHub)

Let’s take these one at a time.

Configure Role Authentication

Netlify maintains an Identity widget that handles recognizing authenticated users and their roles from their JWTs. To inject this Javascript snippet into every page, open the _quarto.yml file and add the Javascript snippet to the include-in-header: key under the HTML format, e.g.:

format:
  html: 
    include-in-header: 
      text: |

Note, the official widget is injected using the src field of the first script tag.

Configure Site Redirects

Next, create a _redirects file at the top level of your project (or open the existing file) and add the following lines:

/login /
/*  /:splat  200!  Role=admin
/site_libs/* /site_libs/:splat 200!
/   /        200!
/*  /  401!

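Syntax for the _redirects file is described here, but basically each line defines a rule with the structure: requested path, destination path, HTTP status code, and an optional condition such as a required role.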

And, like a case when statement, the first “matching” rule dominates.

So, the example above can roughly be read in English as:

If users go to the /login page, take them back to home
If users try to go anywhere else on my site and they have role admin, let them do that
If anyone requests files under /site_libs/ (the shared assets Quarto generates), let them do that
If users try to go to the homepage of my site (regardless of their role), let them do that
If users otherwise try to go to other parts of the site (and they don't have admin), give an error

Of course, this could be further customized to set different rules for different subdirectories.
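The first-match-wins reading above can be sketched in a few lines of Python. This is a simplified model for building intuition, not Netlify’s actual matching logic: the Rule class and resolve function are hypothetical names, only exact paths and trailing /* wildcards are handled, the forcing ! is ignored, and the 301 status on the /login rule is an assumption about the default when no status is written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    source: str                 # requested path pattern, e.g. "/*"
    destination: str            # where to send the request, e.g. "/:splat"
    status: int                 # HTTP status code to return
    role: Optional[str] = None  # role required for the rule to apply


# The five rules from the _redirects example, in the same order
RULES = [
    Rule("/login", "/", 301),  # assumed default status when none is given
    Rule("/*", "/:splat", 200, role="admin"),
    Rule("/site_libs/*", "/site_libs/:splat", 200),
    Rule("/", "/", 200),
    Rule("/*", "/", 401),
]


def match(rule: Rule, path: str, roles: frozenset) -> Optional[str]:
    """Return the text captured by the wildcard ("" if none), or None if no match."""
    if rule.role is not None and rule.role not in roles:
        return None
    if rule.source.endswith("/*"):
        prefix = rule.source[:-1]  # drop "*" but keep the trailing slash
        return path[len(prefix):] if path.startswith(prefix) else None
    return "" if path == rule.source else None


def resolve(path: str, roles: frozenset = frozenset()) -> tuple:
    """Walk the rules top to bottom and stop at the first one that matches."""
    for rule in RULES:
        splat = match(rule, path, roles)
        if splat is not None:
            return rule.status, rule.destination.replace(":splat", splat)
    return 404, path


print(resolve("/dashboard.html", frozenset({"admin"})))  # admin rule matches first
print(resolve("/dashboard.html"))                        # falls through to the 401 rule
print(resolve("/"))                                      # homepage is open to everyone
```

Because evaluation stops at the first match, moving the final catch-all rule higher in the list would lock out the homepage and the shared assets as well, which is why rule order matters.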

Create User Interface

To create the user interface for the login screen, I added code to inject a Netlify-maintained login widget to my site’s index.qmd, e.g.:

---
date: last-modified
---

# Home {.unnumbered}



Welcome! Please sign in to view the dashboard. 

If you are a first time user, please create a login and email [emilyriederer@gmail.com](mailto:emilyriederer@gmail.com?subject=Dashboard%20Access%20Request) to elevate your access.

User Onboarding

After the changes above to your actual Quarto site, the rest of the work lies in the Netlify admin panel. For a small number of users, you can manually change their role in the user interface.

However, to work at any scale, you may need a more automated solution. For that, Netlify’s docs explain how to configure initial role assignment via lambda functions. However, the out-of-the-box functionality that I found lacking was the ability to assign default roles to new users or to configure basic logic, such as assigning the same role to any new users onboarding from a certain email domain.
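To illustrate the kind of basic logic meant here, the sketch below maps an email domain to a set of default roles. Everything in it is hypothetical: the domains, the viewer role, and the function name are invented, and it shows only the decision rule in Python, not a deployable Netlify function or any Netlify API.

```python
# Hypothetical mapping from email domain to default roles (invented values)
DOMAIN_ROLES = {
    "example.org": ["admin"],
    "partner.example.com": ["viewer"],
}


def default_roles(email: str) -> list:
    """Return the roles a new user would start with, based on their email domain."""
    # Take everything after the last "@" and normalize case and whitespace
    domain = email.rsplit("@", 1)[-1].strip().lower()
    # Unknown domains get no roles, so those users stay gated until reviewed
    return list(DOMAIN_ROLES.get(domain, []))


print(default_roles("ana@example.org"))
print(default_roles("someone@gmail.com"))
```

Defaulting to an empty list keeps the failure mode safe: an unrecognized signup sees nothing until someone elevates their access manually.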

Is it for you?

Netlify Identity isn’t the perfect solution for all use cases, but for many small websites and blogs it’s possibly one of the lowest friction solutions available.

This solution is easy to set up initially, allows some degree of self-service for users (account set-up and password resets), user communication (email management), and third-party integration (e.g. authenticate with GitHub or Google). It also has a robust free tier, allowing 1K users to self register (and 5 registrations-by-invitation), and is a substantial step up over locking down HTML content with a single common password.

However, Netlify Identity is not a bullet-proof end-to-end security solution and could become painful or expensive at large scale. This solution, for example, doesn’t contemplate securing your website’s full “supply chain” (e.g. if the source code is in a public GitHub repo) and certainly is less secure than hosting your site completely within a sandboxed environment or intranet. For a large number of users, I also feel there’s a large opportunity to allow simple business rules to configure initial roles.

In summary, I would generally recommend Netlify Identity if you’re already using Netlify, expect a small number of users, and are comfortable adding friction to your sign-in process versus absolute security. For larger projects with higher usage and more bullet-proof security needs, it may be worth considering alternatives.

Python Rgonomics

Thu, 15 Aug 2024 05:00:00 GMT

Slides
Video
Post - Python Rgonomics
Post - Advanced polars versus dplyr

Warning

Tooling changes quickly. Since this talk occurred, Astral’s uv project has come out as a very strong contender to replace pyenv, pdm, and more of the devtools part of a python stack.

Data science languages are increasingly interoperable with advances like Arrow, Quarto, and Posit Connect. But data scientists are not. Learning the basic syntax of a new language is easy, but relearning the ergonomics that help us be hyperproductive is hard. In this talk, I will explore the influential ergonomics of R’s tidyverse. Next, I will recommend a curated stack that mirrors these ergonomics while also being genuinely pythonic. In particular, we will explore packages (polars, seaborn objects, greattables), frameworks (Shiny, Quarto), dev tools (pyenv, ruff, and pdm), and IDEs (VS Code extensions). The audience should leave feeling inspired to try python while benefiting from their current knowledge and expertise.

Crosspost: Data discovery doesn’t belong in ad hoc queries

Emily Riederer — Thu, 18 Jul 2024 05:00:00 GMT

Credible documentation is the best tool for working with data. Short of that, labor (and computational) intensive validation may be required. Recently, I had the opportunity to expand on these ideas in a cross-post with Select Star. I explore how a “good” data analyst can interrogate a dataset with expensive queries and, more importantly, how best-in-class data products eliminate the need for this.

My post is reproduced below.

In the current environment of decreasing headcount and rising cloud costs, the benefits of data management are more objective and tangible than ever. Done well, data management can reduce the cognitive and computational costs of working with enterprise-scale data.

Analysts often jump into new-to-them tables to answer business questions. Without a robust data platform, this constant novelty leads analysts down one of two paths. Either they boldly gamble that they have found intuitive and relevant data or they painstakingly hypothesize and validate assumptions for each new table. The latter approach leads to more trustworthy outcomes, but it comes at the cost of human capital and computational power.

Consider an analyst at an e-commerce company asking the question “How many invoices did we generate for fulfilled orders to Ohio in June?” while navigating unfamiliar tables. In this post, we explore prototypical queries analysts might have to run to validate a new-to-them table. Many of these are “expensive” queries requiring full table scans. Next, we’ll examine how a data discovery platform can obviate this effort.

The impact of this inefficiency may range from a minor papercut to a major cost sink depending on the sizes of your analyst community, historical enterprise data, and warehouse.

6 Preventable Data Discovery Queries

1. What columns are in the table?

Without a good data catalog, analysts will first need to check what fields exist in a table. While there may be lower cost ways to do this like looking at a pre-rendered preview (ala BigQuery), using a DESCRIBE statement (ala Spark), or limiting their query to the first few rows, some analysts may default to requesting all the data.

select *
from invoices;

2. Is the table still live and updating?

After establishing that a table has potentially useful information, analysts should next wonder if the data is still live and updating. First they might check a date field to see if the table seems “fresh”.

select max(order_date) 
from invoices;

But, of course, tables often have multiple date fields. For example, an e-commerce invoice table might have fields for both the date an order was placed and the date the record was last modified. So, analysts may guess-and-check a few of these fields to determine table freshness.

select max(updated_date) 
from invoices;

After identifying the correct field, there’s still a question of refresh cadence. Are records added hourly? Daily? Monthly? Lacking system-level metrics and metadata on the upstream table freshness, analysts are still left in the dark. So, once again, they can check empirically by looking at the frequency of the date field.

select updated_date, count(1) as n
from invoices
group by 1;
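Once the distinct dates are in hand, inferring the cadence is a matter of looking at the gaps between them. A minimal sketch in Python, assuming the dates have already been pulled out of the updated_date field (the sample values and the function name refresh_gaps are invented):

```python
from collections import Counter
from datetime import date


def refresh_gaps(update_dates):
    """Count how often each gap (in days) occurs between consecutive distinct dates."""
    distinct = sorted(set(update_dates))
    gaps = [(later - earlier).days for earlier, later in zip(distinct, distinct[1:])]
    return Counter(gaps)


# Invented sample: several records per day, with one skipped day
sample = [
    date(2024, 6, 1), date(2024, 6, 1),
    date(2024, 6, 2),
    date(2024, 6, 3), date(2024, 6, 3),
    date(2024, 6, 5),
]

print(refresh_gaps(sample))
# prints: Counter({1: 2, 2: 1})
```

A table that refreshes daily will show mostly one-day gaps; larger gaps hint at missed loads or a slower cadence than expected.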

3. What is the grain of the table?

Now that the table is confirmed to be usable, the question becomes how to use it. Specifically, to credibly query and join the table, analysts next must determine its grain. Often, they start with a guess informed by the business context and data modeling conventions, such as assuming an invoice table is unique by order_id.

select count(1) as n, count(distinct order_id)
from invoices;

However, if they learn that order_id has a different cardinality than the number of records, they must ask why. So, once again, they scan the full table to find examples of records with shared order_id values.

select *
from invoices
qualify count(1) over (partition by order_id) > 1
order by order_id
limit 10;

Eyeballing the results of this query, the analysts might notice that the same order_id value can coincide with different ship_id values, as a separate invoice is generated for each part of an order when a subset of items is shipped. With this new hypothesis, the analyst iterates on the validation of the grain.

select count(1) as n, count(distinct order_id, ship_id)
from invoices;
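The same guess-and-check loop can be reproduced end to end on a toy table. The sketch below uses Python’s built-in sqlite3 module with invented rows; because SQLite does not accept a multi-column count(distinct ...), it counts the rows of a select distinct subquery instead. The helper name is_unique_by is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table invoices (order_id integer, ship_id integer, amount real)")
con.executemany(
    "insert into invoices values (?, ?, ?)",
    # Invented rows: order 1 shipped in two parts, so it has two invoices
    [(1, 1, 20.0), (1, 2, 5.0), (2, 1, 12.5), (3, 1, 8.0)],
)


def is_unique_by(con, table, columns):
    """Check whether the given columns uniquely identify each row of the table."""
    # Identifiers are interpolated directly, so only pass trusted names
    cols = ", ".join(columns)
    n_rows, n_keys = con.execute(
        f"select (select count(*) from {table}), "
        f"(select count(*) from (select distinct {cols} from {table}))"
    ).fetchone()
    return n_rows == n_keys


print(is_unique_by(con, "invoices", ["order_id"]))             # False
print(is_unique_by(con, "invoices", ["order_id", "ship_id"]))  # True
```

On a real warehouse each call is a full table scan, which is exactly the cost that documenting the grain up front avoids.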

4. What values can categorical variables take?

The prior questions all involved table structure. Only now can an analyst finally begin to investigate the table’s content. A first step might be to understand the valid values for categorical variables. For example, if our analyst wanted to ensure only completed orders were queried, they might inspect the potential values of the order_status_id field to determine which values to include in a filter.

select distinct order_status_id
from invoices;

They’ll likely repeat this process for many categorical variables of interest. Since our analyst is interested in shipments specifically to Ohio, they might also inspect the cardinality of the ship_state field to ensure they correctly format the identifier.

select distinct ship_state
from invoices;

5. Do numeric columns have nulls or ‘sentinel’ values to encode nulls?

Similarly, analysts may wish to audit other variables for null handling or sentinel values by inspecting column-level statistics.

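-- amount is a hypothetical numeric column; substitute one from your own table
select count(1) as n, count(amount) as n_non_null, min(amount) as min_value, max(amount) as max_value
from invoices;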
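As a sketch of what such an audit might look like when scripted, the example below profiles a numeric column in a toy SQLite table for nulls and for a few candidate sentinel values. The amount column, the sample rows, the list of candidate sentinels, and the helper name profile_numeric are all assumptions made for illustration.

```python
import sqlite3

# Assumed placeholder values that sometimes stand in for "missing"
SENTINEL_CANDIDATES = (-1, 999, 9999)


def profile_numeric(con, table, column):
    """Summarize nulls, range, and candidate sentinel values for one column."""
    # Identifiers are interpolated directly, so only pass trusted names
    n_rows, n_non_null, lowest, highest = con.execute(
        f"select count(*), count({column}), min({column}), max({column}) from {table}"
    ).fetchone()
    sentinels = {}
    for value in SENTINEL_CANDIDATES:
        (hits,) = con.execute(
            f"select count(*) from {table} where {column} = ?", (value,)
        ).fetchone()
        if hits:
            sentinels[value] = hits
    return {
        "rows": n_rows,
        "nulls": n_rows - n_non_null,  # count(column) skips nulls
        "min": lowest,
        "max": highest,
        "sentinels": sentinels,
    }


con = sqlite3.connect(":memory:")
con.execute("create table invoices (order_id integer, amount real)")
con.executemany(
    "insert into invoices values (?, ?)",
    [(1, 20.0), (2, None), (3, -1.0), (4, 12.5)],  # invented rows
)

profile = profile_numeric(con, "invoices", "amount")
print(profile)
```

A minimum of -1 on a field that should never be negative is the kind of clue that a sentinel is in use, which a documented field definition would state outright.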

6. Is the data stored with partitioning or clustering keys?

Inefficient queries aren’t only a symptom of ad hoc data validation. More complex and reused logic may also be written wastefully when table metadata like partitioning and clustering keys is not available to analysts. For example, an analyst might be able to construct a reasonable query filtering either on a shipment date or an order date, but if only one of these is a partitioning or clustering key, different queries could have substantial performance differences.

Understanding Your Data Without Relying on Queries

Analysts absolutely should ask themselves these types of questions when working with new data. However, it should not be analysts’ job to individually answer these questions by running SQL queries. Instead, best-in-class data documentation can provide critical information through a data catalog like Select Star.

1. What columns are in the table? And do we need a table?

Comprehensive search across all of an organization’s assets can help users quickly identify the right resources based on table names, field names, or data descriptions. Even better, search can incorporate observed tribal knowledge of table popularity and common querying patterns to prioritize the most relevant results. Moreover, when search also includes downstream data products like pre-built reports and dashboards, analysts might sometimes find an answer to their question exists off the shelf.

2. Is the table still live and updating? And are its own sources current?

Data is not a static artifact so metadata should not be either. After analysts identify a candidate table, they should have access to real-time operational information like table usage, table size, refresh date, and upstream dependencies to help confirm whether the table is a reliable resource.

Ideally, analysts can interrogate not just the freshness of a final table but also its dependencies by exploring the table’s data lineage.

3. What is the grain of the table? And how does it relate to others?

Table grain should be clearly documented at the table level and emphasized in the data dictionary via references to primary and foreign keys. Beyond basic documentation, entity-relationship (ER) diagrams help analysts build a richer mental model of table grains and of how primary-foreign key relationships can link tables to produce the desired grain and fields. Alternatively, they can glean this information from the wisdom of the crowds if they have access to how others have queried and joined the data previously.

4. What values can categorical variables take? Do numeric columns have nulls or ‘sentinel’ values to encode nulls?

Information about proper expectations and handling of categorical and null values may be published as field definitions, pointed to lookup tables, implied in data tests, or illustrated in past queries. To drive consistency and offload redundant work from data producers, such field definitions can be propagated from upstream tables.

5. Is the data stored with partitioning or clustering keys?

Analysts cannot write efficient code if they don’t know where the efficiency gains lie. Table-level documentation should clearly highlight the use of clustering or partitioning keys so analysts can use the most impactful variables in filters and joins. Here, consistency of documentation is paramount; analysts may not always be incented to care about query efficiency, so if this information is hard to find or rarely available, they can be easily dissuaded from looking.

Beyond a poor user experience, poor data discoverability creates inefficiency and added cost. Even if you don’t have large scale historical data or broad data user communities today, slow queries and tedious work still detract from data team productivity while introducing context-switching and chaos. By focusing on improving data discoverability, you can streamline workflows and enhance the overall efficiency of your data operations.

Base Python Rgonomic Patterns

Emily Riederer — Sat, 20 Jan 2024 06:00:00 GMT

Photo credit to David Clode on Unsplash

In the past few weeks, I’ve been writing about a stack of tools and specific packages like polars that may help R users feel “at home” when working in python due to similar ergonomics. However, one common snag in switching languages is ramping up on common “recipes” for higher-level workflows (e.g. how to build a sklearn modeling pipeline) but missing a language’s fundamentals that make writing glue code feel smooth (and dare I say pleasant?). It’s a maddening feeling to get code for a complex task to finish only to have the result wrapped in an object that you can’t suss out how to save or manipulate.

This post goes back to the basics. We’ll briefly reflect on a few aspects of usability that have led to the success of many workflow packages in R. Then, I’ll demonstrate a grab bag of coding patterns in python that make it feel more elegant to connect bits of code into a coherent workflow.

We’ll look at the kind of functionality that you didn’t know to miss until it was gone, you may not be quite sure what to search to figure out how to get it back, and you wonder if it’s even reasonable to hope there’s an analog¹. This won’t be anything groundbreaking – just some nuts and bolts. Specifically: helper functions for date and time manipulation, advanced string interpolation, list comprehensions for more functional programming, and object serialization.

What other R ergonomics do we enjoy?

R’s passionate user and developer community has invested a lot in building tools that smooth over rough edges and provide slick, concise APIs to rote tasks. Specifically, a number of packages are devoted to:

Utility functions: Things that make it easier to “automate the boring stuff” like fs for navigating file systems or lubridate for more semantic date wrangling
Formatting functions: Things that help us make things look nice for users like cli and glue to improve human readability of terminal output and string interpolation
Efficiency functions: Things that help us write efficient workflows like purrr which provides a concise, typesafe interface for iteration

All of these capabilities are things we could somewhat trivially write ourselves, but we don’t want to and we don’t need to. Fortunately, we don’t need to in python either.

Wrangling Things (Date Manipulation)

I don’t know a data person who loves dates. In the R world, many enjoy lubridate’s wide range of helper functions for cleaning, formatting, and computing on dates.

Python’s datetime module is similarly effective. We can easily create and manage dates in date or datetime classes which make them easy to work with.

import datetime
from datetime import date
today = date.today()
print(today)
type(today)

2024-01-20

datetime.date

Two of the most important functions are strftime() and strptime().

strftime() formats dates into strings. It accepts both a date and the desired string format. Below, we demonstrate by committing the cardinal sin of writing a date in a non-ISO 8601 format.

today_str = datetime.datetime.strftime(today, '%m/%d/%Y')
print(today_str)
type(today_str)

01/20/2024

str

strptime() does the opposite and turns a string encoding a date into an actual date. It does not guess the format, so we need to provide one that matches the string.

someday_dtm = datetime.datetime.strptime('2023-01-01', '%Y-%m-%d')
print(someday_dtm)
type(someday_dtm)

2023-01-01 00:00:00

datetime.datetime
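As a small sketch of strptime() in practice (the date strings here are made up for illustration), the snippet below parses a US-style date by spelling out its format, uses date.fromisoformat() as a shortcut for ISO 8601 strings, and shows that a format that does not match the string raises a ValueError.

```python
from datetime import date, datetime

# parse a US-style string by spelling out its format, then drop the time part
us_date = datetime.strptime('01/20/2024', '%m/%d/%Y').date()

# ISO 8601 strings (YYYY-MM-DD) have a dedicated shortcut
iso_date = date.fromisoformat('2024-01-20')

# a format that does not match the string raises ValueError
try:
    datetime.strptime('01/20/2024', '%Y-%m-%d')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(us_date, iso_date, mismatch_raised)
```

Calling .date() on the parsed result is optional; it just avoids carrying around a midnight timestamp when only the day matters.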

Date math is also relatively easy with datetime. For example, you can see we calculate the date difference simply by… taking the difference! From the resulting delta object, we can access the days attribute.

n_days_diff = ( today - someday_dtm.date() )
print(n_days_diff)
type(n_days_diff)
type(n_days_diff.days)

384 days, 0:00:00

int
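Date math also runs in the other direction. As a minimal sketch with hard-coded example dates (so the results do not depend on when the code is run), we can add a timedelta to a date to get a new date and compare dates with the usual operators.

```python
from datetime import date, timedelta

start = date(2023, 1, 1)
end = date(2024, 1, 20)

# subtracting two dates gives a timedelta
gap = end - start
print(gap.days)

# adding a timedelta to a date gives a new date
follow_up = start + timedelta(days = 30)
print(follow_up)

# dates compare with the usual operators
print(follow_up < end)
```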

Formatting Things (f-strings)

R’s glue is beloved for its ability to easily combine variables and text into complex strings without a lot of ugly, nested paste() functions.

python has a number of ways of doing this, but the most readable is the newest: f-strings. Simply put an f before the string and put any variable names to be interpolated in {curly braces}.

name = "Emily"
print(f"This blog post is written by {name}")

This blog post is written by Emily

f-strings also support formatting with formats specified after a colon. Below, we format a long float to round to 2 digits.

proportion = 0.123456789
print(f"The proportion is {proportion:.2f}")

The proportion is 0.12

Any python expression – not just a single variable – can go in curly braces. So, we can instead format that proportion as a percent.

proportion = 0.123456789
print(f"The proportion is {proportion*100:.1f}%")

The proportion is 12.3%

Despite the slickness of f-strings, sometimes other string interpolation approaches can be useful. For example, if all the variables I want to interpolate are in a dictionary (as often will happen, for example, with REST API responses), the string format() method is a nice alternative. It allows us to pass in the dictionary, “unpacking” the argument with **².

result = {
    'dog_name': 'Squeak',
    'dog_type': 'Chihuahua'
}
print("{dog_name} is a {dog_type}".format(**result))

Squeak is a Chihuahua
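A few more format specifications come up often enough in reporting code to be worth a quick sketch. The variable names and values below are made up for illustration; the specifiers themselves come from python's standard format mini-language.

```python
n_rows = 1234567
share = 0.123456789
label = "id"

# thousands separator
rows_txt = f"{n_rows:,} rows"

# percent formatting multiplies by 100 and appends the % sign
share_txt = f"{share:.1%}"

# pad to a fixed width of 6 characters, right-aligned
padded = f"[{label:>6}]"

# the = specifier (python 3.8+) echoes the expression alongside its value
debug_txt = f"{n_rows=}"

print(rows_txt, share_txt, padded, debug_txt, sep = "\n")
```

The percent specifier saves the manual multiplication by 100, at the cost of being a little less explicit about what is happening.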

Application: Generating File Names

Combining what we’ve discussed about datetime and f-strings, here’s a pattern I use frequently. If I am logging results from a run of some script, I might save the results in a file suffixed with the run timestamp. We can generate this easily.

dt_stub = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
file_name = f"output-{dt_stub}.csv"
print(file_name)

output-20240120_071517.csv
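If the file needs to land in a particular folder, the standard library's pathlib can take over the path handling, loosely in the spirit of R's fs. This is only a sketch: the results folder is a hypothetical name, and the timestamp is hard-coded so the example is reproducible (a real script would use datetime.now()).

```python
from datetime import datetime
from pathlib import Path

# a fixed timestamp keeps the example reproducible
run_time = datetime(2024, 1, 20, 7, 15, 17)
dt_stub = run_time.strftime('%Y%m%d_%H%M%S')

out_dir = Path('results')  # hypothetical output folder
out_path = out_dir / f"output-{dt_stub}.csv"

# Path objects expose the pieces of the path as attributes
print(out_path.name)
print(out_path.stem)
print(out_path.suffix)
```

The / operator joins path components with the right separator for the operating system, so the same code works on Windows and Unix-like systems.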

Repeating Things (Iteration / Functional Programming)

Thanks in part to a modern-day fiction that for loops in R are inefficient, R users have gravitated towards concise mapping functions for iteration. These can include the *apply() family³, purrr’s map_*() functions, or the parallelized version of either.

Python too has a nice pattern for arbitrary iteration in list comprehensions. For any iterable, we can use a list comprehension to make a list of outputs by processing a list of inputs, with optional conditional and default expressions.

Here are some trivial examples:

l = [1,2,3]
[i+1 for i in l]

[2, 3, 4]

[i+1 for i in l if i % 2 == 1]

[2, 4]

[i+1 if i % 2 == 1 else i for i in l]

[2, 2, 4]

There are also closer analogs to purrr like python’s map() function. map() takes a function and an iterable object and applies the function to each element. Like with purrr, functions can be anonymous (as defined in python with lambda functions) or named. List comprehensions are popular for their concise syntax, but there are many different thoughts on the matter as expressed in this StackOverflow post.

def add_one(i): 
  return i+1

# these are the same
list(map(lambda i: i+1, l))
list(map(add_one, l))

[2, 3, 4]
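Comprehensions also cover some of the other purrr-style patterns. The sketch below, using made-up toy lists, pairs a comprehension with zip() to walk two lists in parallel (roughly like map2()), with enumerate() to carry along each element's position (roughly like imap(), though python counts from 0), and shows the dictionary flavor of the same syntax.

```python
x = [1, 2, 3]
y = [10, 20, 30]

# iterate over two lists in parallel
sums = [a + b for a, b in zip(x, y)]

# carry the position along with each element
indexed = [f"{i}: {a}" for i, a in enumerate(x)]

# build a dictionary instead of a list
squares = {a: a ** 2 for a in x}

print(sums, indexed, squares, sep = "\n")
```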

Application: Simulation

As a (slightly) more realistic(ish) example, let’s consider how list comprehensions might help us conduct a numerical simulation or sensitivity analysis.

Suppose we want to simulate 100 draws from a Bernoulli distribution with different success probabilities and see how close our empirically calculated rate is to the true rate.

We can define the probabilities we want to simulate in a list and use a list comprehension to run the simulations.

import numpy as np
import numpy.random as rnd

probs = [0.1, 0.25, 0.5, 0.75, 0.9]
coin_flips = [ np.mean(np.random.binomial(1, p, 100)) for p in probs ]
coin_flips

[0.05, 0.3, 0.48, 0.77, 0.87]
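Results like these change on every run. One way to make such a simulation repeatable is to seed the random number generator. The sketch below swaps numpy for python's built-in random module (an assumption made only to keep the example dependency-free); simulate_rates and the seed value are hypothetical names chosen for illustration.

```python
import random

def simulate_rates(probs, n_draws = 100, seed = 123):
    """Simulate n_draws Bernoulli trials per probability; return the empirical rates."""
    rng = random.Random(seed)  # a private, seeded generator
    return [
        sum(rng.random() < p for _ in range(n_draws)) / n_draws
        for p in probs
    ]

probs = [0.1, 0.25, 0.5, 0.75, 0.9]
rates_1 = simulate_rates(probs)
rates_2 = simulate_rates(probs)

# same seed, same results
print(rates_1 == rates_2)
```

Wrapping the comprehension in a function keeps the seed and the number of draws in one place, which makes it easy to rerun the experiment with different settings.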

Alternatively, instead of returning a list of the same length, our resulting list could include whatever we want – like a list of lists! If we wanted to keep the raw simulation results, we could. The following code returns a list of 5 lists, one per probability, each holding the raw simulation results.

coin_flips = [ list(np.random.binomial(1, p, 100)) for p in probs ]
print(f"""
  coin_flips has {len(coin_flips)} elements
  Each element is itself a {type(coin_flips[0])}
  Each element is of length {len(coin_flips[0])}
  """)


  coin_flips has 5 elements
  Each element is itself a <class 'list'>
  Each element is of length 100

If one wished, they could then put these into a polars dataframe and explode the list-of-lists (going from a 5-row dataset to a 500-row dataset) to conduct whatever sort of analysis they want with all the replicates.

import polars as pl

df_flips = pl.DataFrame({'prob': probs, 'flip': coin_flips})
df_flips.explode('flip').glimpse()

Rows: 500
Columns: 2
$ prob  0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1
$ flip  0, 0, 0, 0, 1, 0, 1, 1, 0, 0

We’ll return to list comprehensions in the next section.

Faking Things (Data Generation)

Creating simple miniature datasets is often useful in analysis. When working with a new package, it’s an important part of learning, developing, debugging, and eventually unit testing. We can easily run our code on a simplified data object where the desired outcome is easy to determine to sanity-check our work, or we can use fake data to confirm our understanding of how a program will handle edge cases (like the diversity of ways different programs handle null values). Simple datasets can also be used as spines and scaffolds for more complex data wrangling tasks (e.g. joining event data onto a date spine).

In R, data.frame() and expand.grid() are go-to functions, coupled with vector generators like rep() and seq(). Python has many similar options.

Fake Datasets

For the simplest of datasets, we can manually write a few entries as with data.frame() in R. Here, we define series in a named dictionary where each dictionary key turns into a column name.

import polars as pl

pl.DataFrame({
  'a': [1,2,3],
  'b': ['x','y','z']
})

shape: (3, 2)

a	b
i64	str
1	"x"
2	"y"
3	"z"

If we need longer datasets, we can use helper functions in packages like numpy to generate the series. Methods like arange and linspace work similarly to R’s seq().

import polars as pl
import numpy as np

pl.DataFrame({
  'a': np.arange(stop = 3),
  'b': np.linspace(start = 9, stop = 24, num = 3)
})

shape: (3, 2)

a	b
i32	f64
0	9.0
1	16.5
2	24.0

If we need groups in our sample data, we can use np.repeat() which works like R’s rep() with its each argument.

pl.DataFrame({
  'a': np.repeat(np.arange(stop = 3), 2),
  'b': np.linspace(start = 3, stop = 27, num = 6)
})

shape: (6, 2)

a	b
i32	f64
0	3.0
0	7.8
1	12.6
1	17.4
2	22.2
2	27.0

Alternatively, for more control and succinct typing, we can create a nested dataset in polars and explode it out.

(
  pl.DataFrame({
    'a': [1, 2, 3],
    'b': ["a b c", "d e f", "g h i"]
  })
  .with_columns(pl.col('b').str.split(" "))
  .explode('b')
)

shape: (9, 2)

a	b
i64	str
1	"a"
1	"b"
1	"c"
2	"d"
2	"e"
2	"f"
3	"g"
3	"h"
3	"i"

Similarly, we could use what we’ve learned about polars list columns and list comprehensions.

a = [1, 2, 3]
b = [ [q*i for q in [1, 2, 3]] for i in a]
pl.DataFrame({'a':a,'b':b}).explode('b')

shape: (9, 2)

a	b
i64	i64
1	1
1	2
1	3
2	2
2	4
2	6
3	3
3	6
3	9

In fact, multidimensional list comprehensions can be used to mimic R’s expand.grid() function.

pl.DataFrame(
  [(x, y) for x in range(3) for y in range(3)],
  schema = ['x','y']
  )

shape: (9, 2)

x	y
i64	i64
0	0
0	1
0	2
1	0
1	1
1	2
2	0
2	1
2	2
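For grids with more than two inputs, nesting further for clauses gets unwieldy. As a sketch of one alternative, itertools.product from the standard library produces the same combinations; the resulting list of tuples could then be handed to pl.DataFrame() as in the example above. The variable names are made up for illustration.

```python
from itertools import product

# every combination of x and y, built with a nested comprehension
grid_comp = [(x, y) for x in range(3) for y in range(3)]

# the same grid from itertools.product
grid_prod = list(product(range(3), range(3)))
print(grid_comp == grid_prod)

# product() scales to any number of inputs without more nesting
grid_3d = list(product(['a', 'b'], [1, 2], [True, False]))
print(len(grid_3d))
```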

Built-In Data

R has a number of canonical datasets like iris built in to the core language. These are easy to grab quickly for experimentation⁴. While base python doesn’t include such capabilities, many of the exact same or similar datasets can be found in seaborn.

seaborn.get_dataset_names() provides the list of available options. Below, we load the Palmer Penguins data and convert it from pandas to polars (an optional step).

import seaborn as sns
import polars as pl

df_pd = sns.load_dataset('penguins')
df = pl.from_pandas(df_pd)
df.glimpse()

Rows: 344
Columns: 7
$ species            'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie'
$ island             'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen'
$ bill_length_mm     39.1, 39.5, 40.3, None, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0
$ bill_depth_mm      18.7, 17.4, 18.0, None, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2
$ flipper_length_mm  181.0, 186.0, 195.0, None, 193.0, 190.0, 181.0, 195.0, 193.0, 190.0
$ body_mass_g        3750.0, 3800.0, 3250.0, None, 3450.0, 3650.0, 3625.0, 4675.0, 3475.0, 4250.0
$ sex                'Male', 'Female', 'Female', None, 'Female', 'Male', 'Female', 'Male', None, None

Saving Things (Object Serialization)

Sometimes, it can be useful to save objects as they existed in RAM in an active programming environment. R users may have experienced this if they’ve used .rds, .rda, or .Rdata files to save individual variables or their entire environment. These objects can often be faster to reload than plaintext and can better preserve information that may be lost in other formats (e.g. storing a dataframe in a way that preserves its datatypes versus writing to a CSV file⁵, or storing a complex object that can’t be easily reduced to plaintext, like a model with training data, hyperparameters, learned tree splits or weights or whatnot for future predictions). This is called object serialization⁶.

Python has comparable capabilities in the pickle module. There aren’t really style points here, so I’ve not much to add beyond “this exists” and “read the documentation”. But, at a high level, it looks something like this:

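import pickle

# to write a pickle (my_object is any picklable python object)
with open('my-obj.pickle', 'wb') as handle:
    pickle.dump(my_object, handle, protocol = pickle.HIGHEST_PROTOCOL)

# to read a pickle (only unpickle files from sources you trust)
with open('my-obj.pickle', 'rb') as handle:
    my_object = pickle.load(handle)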
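To see the round trip end to end, here is a self-contained sketch. The dictionary is a made-up stand-in for a fitted model or other complex object, and a temporary directory is used so nothing is left behind on disk.

```python
import pickle
import tempfile
from pathlib import Path

# a stand-in for a fitted model or other complex object (hypothetical contents)
my_object = {'params': {'depth': 3}, 'weights': [0.1, 0.2], 'label': ('a', 1)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'my-obj.pickle'

    with open(path, 'wb') as handle:
        pickle.dump(my_object, handle, protocol = pickle.HIGHEST_PROTOCOL)

    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

# the restored copy is equal to, but not the same object as, the original
print(restored == my_object, restored is my_object)
```

Note that the tuple under 'label' comes back as a tuple; a plaintext format like JSON would have turned it into a list.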

Footnotes

I defined this odd scope to help limit the infinite number of workflow topics that could be included like “how to write a function” or “how to source code from another script”↩︎
This is called “**kwargs” and works a bit like do.call() in base R. You can read more about it here.↩︎
Speaking of non-ergonomic things in R, the *apply() family is notoriously diverse in its number and order of arguments↩︎
Particularly if you want to set wildly unrealistic expectations for the efficacy of k-means clustering, but I digress↩︎
And yes, you can and should use Parquet and then my example falls apart – but that’s not the point!↩︎
And, if you want to go incredibly deep here, check out this awesome post by Danielle Navarro.↩︎

Crosspost: Why You Need Data Documentation in 2024

Emily Riederer — Mon, 15 Jan 2024 06:00:00 GMT

We’ve all worked with a poorly documented dataset, and we all know it isn’t pretty. However, it’s surprisingly easy for teams to continue to fall into “documentation debt” and deprioritize this foundational work in favor of flashy new projects. These tradeoff discussions may become even more painful in 2024 as teams are continually asked to do more with less.

Recently, I had the opportunity to articulate some of the underappreciated benefits of data documentation in a cross-post with Select Star. This builds on my prior post showing that documentation can be strategically created throughout the data development process. To make the case for taking those “raw” documentation resources to a polished final form, I return to the jobs-to-be-done framework that I’ve previously employed to talk about the value of innersource packages. In this perspective, documentation is like hiring an extra resource (or more!) to your team.

Some of the jobs discussed are:

Developer Advocacy and Product Evangelism for users
- Users think data doesn’t exist if they can’t find it, they think data is broken if they misinterpret it
- Documentation is both a “user interface” to make data usage easy and a bulwark against confusion and frustration
Product and Project Management for developers
- Data intent can “drift” over time
- As teams evolve and collaborate, this risks initial intent getting lost and polluted (after all, what really is a “customer”?)
- Documentation serves as a contract and coach for one or more teams to force clarity and consistency of intent
Chief of Staff oversight for data leaders
- Leaders face increasing demands in data governance: navigating changing privacy regulations, fighting decaying data quality, and discerning their next strategic investments
- Documentation is their command center to understand what data assets exist and where to better spot risks and opportunities

If you or your team works on data documentation, I’d love to hear what other “jobs” you have found that data documentation performs in your organization.

polars’ Rgonomic Patterns

Emily Riederer — Sat, 13 Jan 2024 06:00:00 GMT

Photo credit to Hans-Jurgen Mager on Unsplash

A few weeks ago, I shared some recommended modern python tools and libraries that I believe have the most similar ergonomics for R (specifically tidyverse) converts. This post expands on that one with a focus on the polars library.

At the surface level, all data wrangling libraries have roughly the same functionality. Operations like selecting existing columns and making new ones, subsetting and ordering rows, and summarizing results are table stakes.

However, no one falls in love with a specific library because it has the best select() or filter() function the world has ever seen. It’s the ability to easily do more complex transformations that differentiates a package expert from a novice, and the learning curve for everything that happens after the “Getting Started” guide ends is what can leave experts at one tool feeling so disempowered when working with another.

This deeper sense of intuition and fluency – when your technical brain knows intuitively how to translate in code what your analytical brain wants to see in the data – is what I aim to capture in the term “ergonomics”. In this post, I briefly discuss the surface-level comparison but spend most of the time exploring the deeper similarities in the functionality and workflows enabled by polars and dplyr.

What are `dplyr`’s ergonomics?

To claim polars has a similar aesthetic and user experience as dplyr, we first have to consider what the heart of dplyr‘s ergonomics actually is. The explicit design philosophy is described in the developers’ writings on tidy design principles, but I’ll blend those official intended principles with my personal definitions based on the lived user experience.

Consistent:
- Function names are highly consistent (e.g. snake case verbs) with dependable inputs and outputs (mostly dataframe-in dataframe-out) to increase intuition, reduce mistakes, and eliminate surprises
- Metaphors extend throughout the codebase. For example group_by() + summarize() or group_by() + mutate() do what one might expect (aggregation versus a window function) instead of requiring users to remember arbitrary command-specific syntax
- Always returns a new dataframe versus modifying in-place so code is more idempotent¹ and less error prone
Composable:
- Functions exist at a “sweet spot” level of abstraction. We have the right primitive building blocks that users have full control to do anything they want to do with a dataframe but almost never have to write brute-force glue code. These building blocks can be layered however one chooses
- Consistency of return types leads to composability since dataframe-in dataframe-out allows for chaining
Human-Centered:
- Packages hit a comfortable level of abstraction somewhere between fully procedural (e.g. manually looping over array indexes without a dataframe abstraction) and fully declarative (e.g. SQL-style languages where you “request” the output but aspects like the order of operations may become unclear). Writing code is essentially articulating the steps of an analysis
- This focus on code as recipe writing leads to the creation of useful optional functions and helpers (like my favorite – column selectors)
- Users rarely need to break the fourth wall of this abstraction-layer (versus thinking about things like indexes in pandas)

TLDR? We’ll say dplyr’s ergonomics allow users to express complex transformations precisely, concisely, and expressively.

So, with that, we will import polars and get started!

import polars as pl

This document was made with polars version 0.20.4.

Basic Functionality

The similarities between polars and dplyr’s top-level API are already well-explored in many posts, including those by Tidy Intelligence and Robert Mitchell.

We will only do the briefest of recaps of the core data wrangling functions of each and how they can be composed in order to make the latter half of the piece make sense. We will meet these functions again in-context when discussing dplyr and polars’ more advanced workflows.

Main Verbs

dplyr and polars offer the same foundational functionality for manipulating dataframes. Their APIs for these operations are substantially similar.

For a single dataset:

Column selection: select() -> select() + drop()
Creating or altering columns: mutate() -> with_columns()
Subsetting rows: filter() -> filter()
Ordering rows: arrange() -> sort()
Computing group-level summary metrics: group_by() + summarize() -> group_by() + agg()

For multiple datasets:

Merging on a shared key: *_join() -> join(how = '*')
Stacking datasets of the same structure: bind_rows() -> pl.concat()
Transforming rows and columns: pivot_{longer/wider}()² -> melt() + pivot() (melt() is named unpivot() in polars 1.0 and later)

Main Verb Design

Beyond the similarity in naming, dplyr and polars top-level functions are substantially similar in their deeper design choices which impact the ergonomics of use:

Referencing columns: Both make it easy to concisely reference columns in a dataset without the repeated and redundant references to said dataset (as sometimes occurs in base R or python’s pandas). dplyr does this through nonstandard evaluation wherein a dataframe’s columns can be referenced directly within a data transformation function as if they were top-level variables; in polars, column names are wrapped in pl.col().
Optional arguments: Both tend to have a wide array of nice-to-have optional arguments. For example, the joining capabilities in both libraries offer optional join validation³ and column renaming by appended suffix.
Consistent dataframe-in -> dataframe-out design: dplyr functions take a dataframe as their first argument and return a dataframe. Similarly, polars methods are called on a dataframe and return a dataframe which enables the chaining workflow discussed next

Chaining (Piping)

These methods are applied to polars dataframes by chaining which should feel very familiar to R dplyr fans.

In dplyr and the broad tidyverse, most functions take a dataframe as their first argument and return a dataframe, enabling the piping of functions. This makes it easy to write more human-readable scripts where functions are written in the order of execution and whitespace can easily be added between lines. The following lines would all be equivalent.

transformation2(transformation1(df))

df |> transformation1() |> transformation2()

df |>
  transformation1() |>
  transformation2()

Similarly, polars’ main transformation methods offer a consistent dataframe-in dataframe-out design which allows method chaining. Here, we similarly can write commands in order where the . beginning the next method call serves the same purpose as R’s pipe. And for python broadly, to achieve the same affordance for whitespace, we can wrap the entire command in parentheses.

(
  df
  .transformation1()
  .transformation2()
)

One could even say that polars’ dedication to chaining goes even deeper than dplyr’s. In dplyr, while core dataframe-level functions are piped, functions on specific columns are still often written in a nested fashion⁴.

df %>% mutate(z = g(f(a)))

In contrast, most of polars’ column-level transformation methods also make it ergonomic to keep the same literate left-to-right chaining within column-level definitions with the same benefits to readability as for dataframe-level operations.

df.with_columns(z = pl.col('a').f().g())

Advanced Wrangling

Beyond the surface-level similarity, polars supports some of the more complex ergonomics that dplyr users may enjoy. This includes functionality like:

expressive and explicit syntax for transformations across multiple rows
concise helpers to identify subsets of columns and apply transformations
consistent syntax for window functions within data transformation operations
the ability to work with nested data structures

Below, we will examine some of this functionality with a trusty fake dataframe.⁵ As with pandas, you can make a quick dataframe in polars by passing a dictionary to pl.DataFrame().

import polars as pl 

df = pl.DataFrame({'a':[1,1,2,2], 
                   'b':[3,4,5,6], 
                   'c':[7,8,9,0]})
df.head()

shape: (4, 3)

a	b	c
i64	i64	i64
1	3	7
1	4	8
2	5	9
2	6	0

Explicit API for row-wise operations

While row-wise operations are relatively easy to write ad-hoc, it can still be nice semantically to have readable and stylistically consistent code for such transformations.

dplyr’s rowwise() eliminates ambiguity in whether subsequent functions should be applied element-wise or collectively. Similarly, polars has explicit *_horizontal() functions.

df.with_columns(
  b_plus_c = pl.sum_horizontal(pl.col('b'), pl.col('c')) 
)

shape: (4, 4)

a	b	c	b_plus_c
i64	i64	i64	i64
1	3	7	10
1	4	8	12
2	5	9	14
2	6	0	6

Column Selectors

dplyr’s column selectors dynamically determine a set of columns based on pattern-matching their names (e.g. starts_with(), ends_with()), data types, or other features. I’ve previously written and spoken at length about how transformative this functionality can be when paired with

polars has a similar set of column selectors. We’ll import them and see a few examples.

import polars.selectors as cs

To make things more interesting, we’ll also turn one of our columns into a different data type.

df = df.with_columns(pl.col('a').cast(pl.Utf8))

In `select`

We can select columns based on name or data type and use one or more conditions.

df.select(cs.starts_with('b') | cs.string())

shape: (4, 2)

b	a
i64	str
3	"1"
4	"1"
5	"2"
6	"2"

Negative conditions also work.

df.select(~cs.string())

shape: (4, 2)

b	c
i64	i64
3	7
4	8
5	9
6	0

In `with_columns`

Column selectors can play multiple roles in the transformation context.

The same transformation can be applied to multiple columns. Below, we find all integer variables, call a method to add 1 to each, and use the name.suffix() method to dynamically generate descriptive column names.

df.with_columns(
  cs.integer().add(1).name.suffix("_plus1")
)

shape: (4, 5)

a	b	c	b_plus1	c_plus1
str	i64	i64	i64	i64
"1"	3	7	4	8
"1"	4	8	5	9
"2"	5	9	6	10
"2"	6	0	7	1

We can also use selected variables within transformations, like the rowwise sums that we just saw earlier.

df.with_columns(
  row_total = pl.sum_horizontal(cs.integer())
)

shape: (4, 4)

a	b	c	row_total
str	i64	i64	i64
"1"	3	7	10
"1"	4	8	12
"2"	5	9	14
"2"	6	0	6

In `group_by` and `agg`

Column selectors can also be passed as inputs anywhere else that one or more columns is accepted, as with data aggregation.

df.group_by(cs.string()).agg(cs.integer().sum())

shape: (2, 3)

a	b	c
str	i64	i64
"1"	7	15
"2"	11	9

Consistent API for Window Functions

Window functions are another incredibly important tool in any data wrangling language but seem criminally undertaught in introductory analysis classes. Window functions allow you to apply aggregation logic over subgroups of data while preserving the original grain of the data (e.g. in a table of all customers and orders, adding a column for the max purchase amount by customer).

dplyr makes window functions trivially easy with the group_by() + mutate() pattern, invoking users’ pre-existing understanding of how to write aggregation logic and how to invoke transformations that preserve a table’s grain.

polars takes a slightly different but elegant approach. Similarly, it reuses the core with_columns() method for window functions. However, it uses a more SQL-reminiscent specification of the “window” in the column definition versus a separate grouping statement. This has the added advantage of allowing one to use multiple window functions with different windows in the same with_columns() call if you should so choose.

A simple window function transformation can be done by calling with_columns(), chaining an aggregation method onto a column, and following with the over() method to define the window of interest.

df.with_columns(
  min_b = pl.col('b').min().over('a')
)

shape: (4, 4)

a	b	c	min_b
str	i64	i64	i64
"1"	3	7	3
"1"	4	8	3
"2"	5	9	5
"2"	6	0	5

The chained aggregation and over() can follow any other arbitrarily complex logic. Here, they follow a basic “case when”-type statement that creates an indicator for whether column b is odd.

df.with_columns(
  n_b_odd = pl.when( (pl.col('b') % 2) == 1)
              .then(1)
              .otherwise(0)
              .sum().over('a')
)

shape: (4, 4)

a	b	c	n_b_odd
str	i64	i64	i32
"1"	3	7	1
"1"	4	8	1
"2"	5	9	1
"2"	6	0	1

List Columns and Nested Frames

While the R tidyverse’s raison d’etre was originally around the design of heavily normalized tidy data, modern data and analysis sometimes benefit from more complex and hierarchical data structures. Sometimes data comes to us in nested forms, like from an API⁶, and other times nesting data can help us perform analysis more effectively⁷. Recognizing these use cases, tidyr provides many capabilities for the creation and manipulation of nested data in which a single cell contains values from multiple columns or sometimes even a whole miniature dataframe.

polars makes these operations similarly easy with its own version of structs (list columns) and arrays (nested dataframes).

List Columns & Nested Frames

List columns that contain multiple key-value pairs (e.g. column-value) in a single column can be created with pl.struct() similar to R’s list().

df.with_columns(list_col = pl.struct( cs.integer() ))

shape: (4, 4)

a	b	c	list_col
str	i64	i64	struct[2]
"1"	3	7	{3,7}
"1"	4	8	{4,8}
"2"	5	9	{5,9}
"2"	6	0	{6,0}

These structs can further be aggregated across rows into miniature datasets.

df.group_by('a').agg(list_col = pl.struct( cs.integer() ) )

shape: (2, 2)

a	list_col
str	list[struct[2]]
"2"	[{5,9}, {6,0}]
"1"	[{3,7}, {4,8}]

In fact, this could be a good use case for our column selectors! If we have many columns we want to keep unnested and many we want to nest, it could be efficient to list out only the grouping variables and create our nested dataset by examining matches.

cols = ['a']
(df
  .group_by(cs.by_name(cols))
  .agg(list_col = pl.struct(~cs.by_name(cols)))
)

shape: (2, 2)

a	list_col
str	list[struct[2]]
"2"	[{5,9}, {6,0}]
"1"	[{3,7}, {4,8}]

Undoing

Just as we constructed our nested data, we can denormalize it and return it to the original state in two steps. To see this, we can assign the nested structure above as df_nested.

df_nested = df.group_by('a').agg(list_col = pl.struct( cs.integer() ) )

First, explode() returns the table to the original grain, leaving us with a single struct in each row.

df_nested.explode('list_col')

shape: (4, 2)

a	list_col
str	struct[2]
"1"	{3,7}
"1"	{4,8}
"2"	{5,9}
"2"	{6,0}

Then, unnest() unpacks each struct and turns each element back into a column.

df_nested.explode('list_col').unnest('list_col')

shape: (4, 3)

a	b	c
str	i64	i64
"1"	3	7
"1"	4	8
"2"	5	9
"2"	6	0

Footnotes

Meaning you can’t get the same result twice because if you rerun the same code the input has already been modified↩︎
Of the tidyverse functions mentioned so far, this is the only one found in tidyr not dplyr↩︎
That is, validating an assumption that joins should have been one-to-one, one-to-many, etc.↩︎
However, this is more by convention. There’s not a strong reason why they would strictly need to be.↩︎
I recently ran a Twitter poll on whether people prefer real, canonical, or fake datasets for learning and teaching. Fake data wasn’t the winner, but it is a strategy I find personally fun and useful as the unit-test analog for learning.↩︎
For example, an API payload for a LinkedIn user might have nested data structures representing professional experience and educational experience↩︎
For example, training a model on different data subsets.↩︎

Crosspost: Why you’re closer to data documentation than you think

Emily Riederer — Fri, 05 Jan 2024 06:00:00 GMT

Documentation can be a make-or-break for the success of a data initiative, but it’s too often considered an optional nice-to-have. I’m a big believer that writing is thinking. Similarly, documenting is planning, executing, and validating.

Previously, I’ve explored how we can create latent and lasting documentation of data products and how column names can be self documenting.

Recently, I had the opportunity to expand on these ideas in a cross-post with Select Star. I argue that teams can produce high-quality and maintainable documentation with low overhead with a form of “documentation-driven development”. That is, smartly structuring and re-using artifacts from the development process into long-term documentation. For example:

At the planning stage:
- Structuring requirements docs in the form of data dictionaries
- Creating early alignment on higher-order concepts like entity definitions (and writing them down)
- Mentally beta testing data usability with an entity-relationship diagram
At the development stage:
- Ensuring relevant parts of internal “development documentation” (e.g. dbt column definitions, docstrings) are published to a format and location accessible to users
- With different information but similar motivation to ER diagrams, sharing the full orchestration DAG to help users trace column-level lineage and internalize how each field maps to a real-world data generating process
- Sharing data tests being executed (the “user contract”) and their results
Throughout the lifecycle:
- Answering questions “in public” (e.g. Slack versus email) to create a searchable collection of insights
- Producing table usage statistics to help large, decentralized orgs capture the “wisdom of the crowds”

If you or your team works on data documentation, I’d love to hear what other patterns you’ve found to collect useful documentation assets during a data development process.

Python Rgonomics

Emily Riederer — Sat, 30 Dec 2023 06:00:00 GMT

Photo credit to the inimitable Allison Horst

Warning

Some advice in this post has gone stale regarding IDEs, installers, and environment management tools. Please see my 2025 update for more recent thoughts following the release of uv and Positron.

Interoperability was a key theme in open-source data languages in 2023. Ongoing innovations in Arrow (a language-agnostic in-memory standard for data storage), growing adoption of Quarto (the language-agnostic heir apparent to R Markdown), and even pandas creator Wes McKinney joining Posit (the language-agnostic rebranding of RStudio) all illustrate the ongoing investment in breaking down barriers between different programming languages and paradigms.

Despite these advances in technical interoperability, individual developers will always face more friction than state-of-the-art tools when moving between languages. Learning a new language is easily enough done; programming 101 concepts like truth tables and control flow translate seamlessly. But the ergonomics of a language do not. The tips and tricks we learn to be hyper productive in a primary language are comfortable, familiar, elegant, and effective. They just feel good. Working in a new language, developers often face a choice between forcing their favored workflows into a new tool where they may not “fit”, writing technically correct yet plodding code to get the job done, or approaching a new language as a true beginner to learn its “feel” from the ground up.

What this post is not

Just to be clear:

This is not a post about why python is better than R so R users should switch all their work to python
This is not a post about why R is better than python so R semantics and conventions should be forced into python
This is not a post about why python users are better than R users so R users need coddling
This is not a post about why R users are better than python users and have superior tastes for their toolkit
This is not a post about why these python tools are the only good tools and others are bad tools

On picking tools

The tools I highlight below tend to have two competing features:

They have aspects of their workflow and ergonomics that should feel very comfortable to users of favored R tools
They should be independently accepted, successful, and well-maintained python projects with the true pythonic spirit

When in Rome, do as the Romans do – but if you’re coming from the U.S. that doesn’t mean you can’t bring a universal adapter that can help charge your devices in European outlets.

The stack

With that preamble out of the way, below are a few recommendations for the most ergonomic tools for getting set up, conducting core data analysis, and communicating results.

To preview these recommendations:

Set Up

Installation: pyenv
IDE: VS Code

Analysis

Wrangling: polars
Visualization: seaborn

Communication

Tables: Great Tables
Notebooks: Quarto

Miscellaneous

Environment Management: pdm
Code Quality: ruff

Note

I don’t want this advice to set up users for a potential snag. If you are on Windows and install python with pyenv-win, Quarto (as of writing on v1.3) may struggle to find the correct executable. Better support for this is on the backlog, but if you run into this issue, check out this brilliant fix.

For setting up

The first hurdle is often getting started – both in terms of installing the tools you’ll need and getting into a comfortable IDE to run them.

Installation: R keeps installation simple; there’s one way to do it* so you do and it’s done. But before python converts can print("hello world"), they face a range of options (system Python, Python installer UI, Anaconda, Miniconda, etc.) each with its own kinks. These decisions are made harder in Python since projects tend to have stronger dependencies on the language version, requiring one to switch between versions. For both of these reasons, I favor pyenv (or pyenv-win for those on Windows) for easily managing python installation(s) from the command line. While the installation process of pyenv may be technically different, it’s similar in that it “just works” with just a few commands. In fact, the workflow is so slick that things seem to have gone 180 degrees, with pyenv inspiring a similar project called rig to manage R installations. This may sound intimidating, but the learning curve is actually quite shallow:
- pyenv install --list: To see what python versions are available to install
- pyenv install <version>: To install a specific version
- pyenv versions: To see what python versions are installed on your system
- pyenv global <version>: To set one python version as a global default
- pyenv local <version>: To set a python version to be used within a specific directory/project
Integrated Development Environment: Once R is installed, R users are typically off to the races with the intuitive RStudio IDE which helps them get immediately hands-on with the REPL. With the UI divided into quadrants, users can write an R script, run it to see results in the console, conceptualize what the program “knows” with the variable explorer, and navigate files through a file explorer. Once again, python is not lacking in IDE options, but users are confronted with yet another decision point before they even get started. Pycharm, Sublime, Spyder, Eclipse, Atom, Neovim, oh my! I find that VS Code offers the best functionality. Its rich extension ecosystem also means that most major tools (e.g. Quarto, git, linters and stylers, etc.) have nice add-ons so, like RStudio, you can customize your platform to perform many side-tasks in plaintext or with the support of extra UI components.³

For data analysis

Data Wrangling: Although pandas is undoubtedly the best-known wrangling tool in the python space, I believe the growing polars project offers the best experience for a transitioning developer (along with other nice-to-have benefits like being dependency free and blazingly fast). polars may feel more natural and less error-prone to R users for many reasons:
- it has more internally consistent (and similar to dplyr) syntax such as select, filter, etc. and has demonstrated that the project values a clean API (e.g. recently renaming groupby to group_by)
- it does not rely on the distinction between columns and indexes which can feel unintuitive and introduces a new set of concepts to learn
- it consistently returns copies of dataframes (while pandas sometimes alters in-place) so code is more idempotent and avoids a whole class of failure modes for new users
- it enables many of the same “advanced” wrangling workflows in dplyr with high-level, semantic code like making the transformation of multiple variables at once fast with column selectors, concisely expressing window functions, and working with nested data (or what dplyr calls “list columns”) with lists and structs
- it supports users working with increasingly large data. Similar to dplyr’s many backends (e.g. dbplyr), polars can be used to write lazily-evaluated, optimized transformations and its syntax is reminiscent of pyspark should users ever need to switch between them
Visualization: Even some of R’s critics will acknowledge the strength of ggplot2 for visualization, both in terms of its intuitive and incremental API and the stunning graphics it can produce. seaborn’s object interface (which cites ggplot2 as an inspiration) seems to strike a great balance between offering a similar workflow and bringing all the benefits of using an industry-standard tool

For communication

Tables: R has no shortage of packages for creating nicely formatted tables, an area that has historically lacked a bit in python both in workflow and outcomes. Barring strong competition from the native python space, the one “port” I am bullish about is the recently announced Great Tables package. This is a pythonic clone of R’s gt package. I’m more comfortable recommending this since it’s maintained by the same developer as the R version (to support long-term feature parity), backed by an institution not just an individual (to ensure it’s not a short-lived hobby project), and the design feels like it does a good job balancing R inspiration with pythonic practices
Computational notebooks: Jupyter Notebooks are widely used, widely critiqued parts of many python workflows. The ability to mix markdown and code chunks is appealing. However, notebooks can introduce new types of bugs for the uninitiated; for example, they are hard to version control and easy to execute in the wrong environment. For those coming from the world of R Markdown, plaintext computational notebooks like Quarto may provide a more transparent development experience. While Quarto allows users to write in .qmd files which are more like their .rmd predecessors, its renderer can also handle Jupyter notebooks to enable collaboration across team members with different preferences

Miscellaneous

Environment Management: Joining the python world means never having to settle on an environment management tool for installing packages. There’s a truly overwhelming number of ways to manage project-level dependencies (virtualenv, conda, pip-tools, pipenv, poetry, and that doesn’t even scratch the surface) with different pros and cons, and a phenomenal amount of ink/pixels has been spilled litigating these trade-offs. Putting all that aside, lately, I’ve been favoring pdm because it prioritizes features I care most about (auto-updating pyproject.toml, isolating dependencies from dependencies-of-dependencies, active development and error handling, mostly just works pretty undramatically)
Developer Tools: ruff provides a range of linting and styling options (think R’s lintr and styler) and provides a one-stop-shop over what can be an overwhelming number of atomic tools in this space (isort, black, flake8, etc.). ruff is super fast, has a nice VS Code extension, and, while this class of tools is generally considered more advanced, I think linters can be a fantastic “coach” for new users about best practices

More to come!

Each recommendation here itself could be its own tutorial or post. In particular, I hope to showcase the Rgonomics of polars, seaborn, and great_tables in future posts.

Footnotes

Of course, languages have their own subcultures too. The tidyverse and data.table parts of the R world tend to favor different semantics and ergonomics. This post caters more to the former.↩︎
There is no doubt a place for language ports, especially for earlier-stage projects where no native language-specific standard exists. For example, I like Karandeep Singh’s lab work on a tidyverse for Julia and maintain my own dbtplyr package to port dplyr’s select helpers to dbt↩︎
If anything, the one challenge of VS Code is the sheer number of set up options, but to start out, you can see these excellent tutorials from Rami Krispin on recommended python and R configurations ↩︎

Big ideas from the 2023 Causal Data Science Meeting

Emily Riederer — Sat, 18 Nov 2023 06:00:00 GMT

Last week, I enjoyed attending parts of the annual virtual Causal Data Science Meeting organized by researchers from Maastricht University, Netherlands, and Copenhagen Business School, Denmark. This has been one of my favorite virtual events since the first iteration in 2020, and I find it consistently highlights the best of the causal research community: bringing together industry and academia with concise talks that are at once thought-provoking, theoretically well-grounded, yet thoroughly pragmatic.

While I could not join the entire event (running in CET, some sessions fit snugly between my first cup of coffee and first work meeting of the day in CST), this year’s conference did not disappoint! Below, I share a sampling with five “big ideas” from the sessions.

What’s the current “gold standard” of causal ML methods in industry? Dima Goldenberg presented a great case study on heterogeneous uplift modeling at Booking.com. (While I couldn’t find the exact slides or paper, you can get a flavor of Booking’s work in experimentation and causal inference from their excellent tech blog.)
How does causal evidence add value? Robert Kubinec conceptualized a measurable spectrum of descriptive to causal studies based on entropy. This framework broadens the aperture to think about how both quantitative and qualitative evidence can come together to form causal conclusions. (Preprint)
But how do we know the methods work? Causal methods are notoriously hard to validate since, by definition, we lack a ground truth against which to compare our estimate. To validate new methods, Lingjie Shen and coauthors presented one approach with their new RCTrep R package (https://github.com/duolajiang/RCTrep), which can be used to compare outcomes between real-world data (RWD) and randomized control trial data (RCT).
And what do we do when they can’t get all the way there? Carlos Fernández-Loría and Jorge Loría’s talk on “Causal Scoring” explores how we can accept and make use of “causal ranking” or “causal classification” even when we do not believe we can generate fully credible, calibrated causal estimates. By defining which type of estimand is really necessary for a specific use case, they show how one can tailor their modeling approach and broaden the range of applications. (Preprint)
Finally, do the best methods that correctly accrue causal evidence and validate matter? Ron Berman and Anya Shchetkina tackled this question in their paper about when correctly modeling uplift heterogeneity does and doesn’t matter. They decomposed potential causes using real-world marketing and public health examples and presented a methodology for identifying when uplift-based personalization makes a business impact. (I couldn’t find a preprint, but they also presented at MIT’s CODE this week, so hopefully there will be a video soon!)

One of the joys of the causal DS community’s mindset is the inherent focus on impact and pragmatism, and this year’s conference continued to deliver in that vein. I’m marking my calendar (and setting my 4AM alarm!) for next year already.

Data Downtime Horror Stories Panel

Mon, 23 Oct 2023 05:00:00 GMT

Abstract

In October, I joined a Halloween-themed panel along with Chad Sanderson and Joe Reis to discuss our horror stories of data quality gone wrong and how to build successful data quality strategies in large organizations. Key takeaways are summarized on Monte Carlo’s blog.

Operationalizing Column-Name Contracts with dbtplyr

Thu, 21 Sep 2023 05:00:00 GMT


At Coalesce for dbt user audience:

Slides
Video

At posit::conf for R user audience:

Slides
Video - posit::conf for R User Audience coming soon!

Post - Column Name Contracts
Post - Column Name Contracts in dbt
Post - Column Name Contracts with dbtplyr

Complex software systems make performance guarantees through documentation and unit tests, and they communicate these to users with conscientious interface design.

However, published data tables exist in a gray area; they are static enough not to be considered a “service” or “software”, yet too raw to earn attentive user interface design. This ambiguity creates a disconnect between data producers and consumers and poses a risk for analytical correctness and reproducibility.

In this talk, I will explain how controlled vocabularies can be used to form contracts between data producers and data consumers. Explicitly embedding meaning in each component of variable names is a low-tech and low-friction approach which builds a shared understanding of how each field in the dataset is intended to work.

Doing so can offload the burden of data producers by facilitating automated data validation and metadata management. At the same time, data consumers benefit by a reduction in the cognitive load to remember names, a deeper understanding of variable encoding, and opportunities to more efficiently analyze the resulting dataset. After discussing the theory of controlled vocabulary column-naming and related workflows, I will illustrate these ideas with a demonstration of the {dbtplyr} dbt package which helps analytics engineers get the most value from controlled vocabularies by making it easier to effectively exploit column naming structures while coding.

Coming Soon!

Scaling Personalized Volunteer Emails

Wed, 21 Jun 2023 05:00:00 GMT

Slides
Video

In this four-minute lightning talk, I explain how Two Million Texans used components of our existing data stack to provide personalized success metrics and action recommendations to over 5,000 volunteers in the lead up to the 2022 midterm elections. I briefly describe our pipeline and how we frontloaded key computational steps in BigQuery to circumvent limitations of downstream tools.

Causal Design Patterns

Wed, 07 Jun 2023 05:00:00 GMT

Slides
Video
Video - Discussion
Post - Causal Design Patterns
Post - Causal Data Management

Experimentation is a pillar of product data science and machine learning. But what can you do when experimentation is impractical, costly, risky to customer experience, or too slow to read the desired long-term results?

While industry is often spoiled by its ability to A/B test, the question of how to draw valid causal measurements from non-randomized data has long been a focus of many fields from epidemiology to public policy. This talk will review four common “design patterns” for observational causal inference and how they can apply to industry. Exploring the assumptions, limitations, and applications of these methods will help practicing data scientists recognize opportunities to use these methods to tackle seemingly unanswerable questions they face.

Moving beyond the basics, we will see how these building-block patterns are fueling an explosion in modern causal machine learning and discuss how to seed your organization for success with enterprise knowledge and data management.

Industry information management for causal inference

Emily Riederer — Tue, 30 May 2023 05:00:00 GMT

Data strategy motivated by causal methods

This post summarizes the final third of my talk at Data Science Salon NYC in June 2023. Please see the talk details for more content.

Techniques of observational causal inference are becoming increasingly popular in industry as a complement to experimentation. Causal methods offer the promise of accelerating measurement agendas and facilitating the estimation of previously un-measurable targets by allowing analysts to extract causal insights from “found” data (e.g. observational data collected without specific intent). However, if executed without careful attention to their assumptions and limitations, they can lead to spurious conclusions.

Both experimental and observational methods attempt to address the fundamental problem of causal inference: that is, the fact that for a given treatment of interest, we can never “see” the individual-level outcome both for the case when an individual received a treatment and for the counterfactual scenario in which that treatment was withheld from the same individual in the exact same context. Some literature casts this as a “missing data” problem.¹ Counterfactual data is uncollectable; however, this fundamental missingness can be partially mitigated by diligent collection of other types of quantitative and qualitative information to control for confounding² and interrogate assumptions.

In this post, I argue that industry has unique advantages when using causal techniques over the social science disciplines that originated many foundational methods due to industry’s (theoretically) superior ability to observe and capture relevant supplemental data and context. Examining the implicit assumptions in common causal design patterns motivates the types of proactive enterprise information management – including data, metadata, and knowledge management – that will help preserve the raw inputs that future data scientists will need to effectively deploy causal techniques on historical data and answer questions that our organizations cannot even anticipate today. By casting an intentionally wide net on what information we observationally collect, we increase the likelihood that the future “found” data will have what those analysts need to succeed.

Why industry needs causal inference

Industry data science tends to highly value the role of A/B testing and experimentation. However, there are many situations where experimentation is not an optimal approach to learning. Experiments can be infeasible if we worry about the ethics or reputational risk of offering disparate customer treatments; they may be impractical in situations that are hard to randomize or avoid spillover effects; they can be costly to run and configure either in direct or opportunity costs; and, finally, they can just be slow if we wish to measure complex and long-term impacts on customer behaviors (e.g. retention, lifetime value).

What causal methods require

These limitations are one of the reasons why observational causal inference is gaining increasing popularity in industry. Methods of observational causal inference allow us to estimate treatment effects without randomized controlled experimentation by using existing historical data. At the highest level, these methods work by replacing randomization with strategies to exploit other forms of semi-random variation in historical exposures of a population to a treatment. Since this semi-random variation could be susceptible to confounding, observational methods supplement variation with additional data to control for other observable sources of bias in our estimates and contextual assumptions about the data generating process.

My previous post on causal design patterns outlines a number of foundational causal methods, but I’ll briefly recap to emphasize the different ways that sources of variation, data, and context are used:

Stratification and Inverse Propensity Score Weighting:
- Exploits “similar” populations of treated and untreated individuals
- Assumes we can observe and control for common causes of the treatment and the outcome
Regression Discontinuity:
- Exploits a sharp, semi-arbitrary cut-off between treated and untreated individuals
- Assumes that the outcome is continuous with respect to the assignment variable and the assignment mechanism is unknown to individuals (to avoid self-selection)
Difference in Differences:
- Exploits variation between behavior over time of treated and untreated groups
- Assumes that the treatment assignment is unrelated to expected future outcomes and that the treatment is well-isolated to the treatment group
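
To make the difference-in-differences logic concrete, here is a minimal arithmetic sketch. The group averages are made-up numbers used only for illustration (not results from any real analysis), and the diff_in_diff helper is a hypothetical name. The estimate is only meaningful under the assumptions listed above.

```python
# Hypothetical average outcomes per customer before and after a launch; not real data
treated_pre, treated_post = 10.0, 14.0
control_pre, control_post = 9.0, 11.0


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = treated_post - treated_pre   # treatment effect plus time trend
    control_change = control_post - control_pre   # time trend alone (by assumption)
    return treated_change - control_change


effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(effect)  # 2.0
```

The control group's change stands in for what would have happened to the treated group without treatment, which is exactly where the untestable assumptions enter.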

Notably, the assumptions mentioned above are largely untestable statistically (e.g. not like testing for normality or multicollinearity) but rely on knowledge of past strategies and policies that guided differential treatment in historical data.³

Industry’s unique advantages deploying causal inference

Many causal methods originated in fields like epidemiology, economics, political science, and other social sciences. In such fields, direct experimentation is often impossible and even first-hand data collection is less common. Often, researchers may have to rely on pre-existing data sources like censuses, surveys, and administrative data (e.g. electronic health records).

Despite the lineage of these methods, industry has many advantages over traditional research fields in using them because each company controls the entire “universe” in which its customers exist. This should in theory provide a distinct advantage when collecting each of the three “ingredients” that causal methods use to replace randomization:

Variation: We control customer engagement strategies through methods like customer segmentation or models. Subsequent customer treatments are completely known to us but inherently have some arbitrary, judgmental component to exploit
Data: We tend to be able to collect more measurements of our customers both as a snapshot (more variety in fields) and longitudinally (more observations over time) that can be brought into our analyses to control for confounders⁴, reduce other sources of variation in our estimate, and have additional ‘out of time’ data left over to conduct forms of validation like placebo tests
Context: We tend to know how past strategies were set-up, how they looked to individuals involved, and why those decisions were made. This can be critical in reasoning whether our assumptions hold

However, to convert this theoretical benefit to a practical one requires information management.

Data management for causal inference

While all causal methods will be enhanced with better enterprise information management, it’s easiest to see the motivation by thinking back to specific examples. Causal inference can benefit from better data, metadata, and knowledge management. These are illustrated by propensity score weighting, regression discontinuity, and diff-in-diff respectively.

Integrated Data Management

Earlier, we posited that one advantage that industry has over academia for causal inference is access to richer historical data sources at a higher level of resolution (more measures per individual at more time points). A rich set of customer measures is critical for stratification and propensity score weighting where we attempt to control for selection on observables by balancing populations along dimensions that might be common causes of treatment assignment and outcome. (And, we may also wish to control for other unrelated sources of variation that affect only the outcome to develop more precise estimates.)

However, this is only true if customer data is proactively collected, cleaned, and harmonized across sources in the true spirit of a customer 360 view. Enterprises may collect data about customers from many different operational systems – for example, demographic information provided at registration, digital data on their logins and web activity, campaign data on attempted customer touchpoints and engagement, behavioral or fulfillment data on purchases / subscription renewals / etc. Any of these sources could be useful “observables” that help close confounding pathways in our analyses.

To make this data useful and accessible for analysis, it must be proactively integrated into a common source like a data warehouse, well-documented to help future users understand the nuances of each system, harmonized so fields have standard definitions (e.g. common definitions of an “account” and a “customer”), and unified by using techniques like entity resolution to ensure all sources share common identifiers so that they can be merged for analysis.

Metadata Management

Beyond those “typical” sources of customer data, our past customer strategies create data beyond the data directly generated by our customers. Metadata about past campaigns such as precise business logic on the different treatments offered (e.g. if sending customers a discount, what algorithmically determined the amount?), the campaign targeting and segmentation (e.g. What historical behaviors were used to segment customers? Was treatment determined by a predictive model?), and launch timing can all be critical to clearly identifying those sources of variation that we wish to exploit. For example, we might know that we once ran a re-engagement campaign to attempt to nudge interaction from customers who didn’t log in to a website for some amount of time, but knowing whether that campaign was targeting customers >30 days inactive or >45 days inactive impacts our ability to analyze it with a regression discontinuity.

This means that we need to treat metadata as first-class data and ensure that it is extracted from operational source systems (or intent docs, config files, etc.), structured in a machine-readable format, and preserved in analytical data stores along with our customer data.
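To illustrate, here is one hedged sketch of campaign metadata stored in a machine-readable form and then reused to set up a regression discontinuity sample. The `CampaignMetadata` class, its fields, and every value in it (campaign ID, launch date, cutoff) are hypothetical; the point is only that the exact targeting rule survives as data rather than as someone's memory.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CampaignMetadata:
    """Hypothetical record of how a past campaign was targeted."""
    campaign_id: str
    launch_date: str        # ISO 8601 date string
    running_variable: str   # field that determined eligibility
    cutoff: int             # customers strictly above this were treated
    treatment: str


reengagement = CampaignMetadata(
    campaign_id="reengage-001",
    launch_date="2023-01-15",
    running_variable="days_inactive",
    cutoff=30,
    treatment="reminder_email",
)

# Persist alongside customer data, then restore for a later analysis
stored = json.dumps(asdict(reengagement))
restored = CampaignMetadata(**json.loads(stored))


def near_cutoff(values, meta, bandwidth):
    """Keep observations within `bandwidth` of the cutoff and flag the treated."""
    return [
        (value, value > meta.cutoff)
        for value in values
        if abs(value - meta.cutoff) <= bandwidth
    ]


sample = near_cutoff([10, 27, 30, 31, 34, 60], restored, bandwidth=5)
```

If the stored cutoff were 45 instead of 30, the same call would select an entirely different comparison window, which is why the precise threshold needs to be preserved.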

The importance of “metadata as data” extends beyond business-as-usual organization strategies. We can also fuel future causal inference with better metadata management of past formal experiments and execution errors.

As discussed above, formal experiments may represent a substantial investment in company resources so the data collected from them should be regarded as an asset. Beyond their utility for one-time reads and decisions, experiment designs and results should be carefully catalogued along with the assigned treatment group and the randomization criteria (such as fields characterizing sampling weights as provided in US Census data). This can support future observational analysis of past experiments, including generalizing and transporting results to different populations.

Furthermore, even mistakes in executing past strategies may become “natural experiments” to help businesses understand scenarios that they might never have prioritized for testing. So, machine-readable incident logs and impacted populations can be useful as well.

Knowledge Management

Of course, not all information can be condensed into a nice, machine-readable spreadsheet. Methods like difference-in-differences illustrate how conceptual context can also help us battle-test assumptions like whether the decision-to-treat could have spilled over into the control population or been influenced by an anticipated change in the future outcome. This is the one area where industry may sometimes lag social sciences in information since some population-level treatments like a state law or local ordinance often have documented histories through the legislative process, news coverage, and historical knowledge about their implementation.

Industry can catch up on knowledge management by documenting and preserving in a centralized knowledge repository key information about strategic decisions undertaken, the motivating factors, and the anticipated customer experience. Such documents are inevitably created when working on new projects through memos and decks intended to communicate the business case, intent, and expected customer experience. However, proactively figuring out how to organize and index this information through a classification system and democratize access through centralized knowledge repositories is critical to giving future users entree to this tribal knowledge. Projects like Airbnb’s Knowledge Repository suggest what such a system might look like in practice.

Footnotes

For example, see https://arxiv.org/abs/1710.10251
If you’ve heard of ‘selection on observables’ in causal literature, richer data means more observables!
There are some exceptions to this like placebo tests, bunching checks, etc.
Notably, the availability of more data absolutely does not mean that we should simply “dump in” all the data we have. Controlling for certain variables like colliders is counterproductive.

DataFold Data Quality Meet Up

Fri, 12 May 2023 05:00:00 GMT

Slides
Video

This is the full recording from Datafold’s 9th Data Quality Meetup on Thursday, May 11th, 2023, which was focused on ‘Running dbt at scale’.

Following our usual structure, each of our speakers presents a lightning talk, and then we transition into a panel discussion moderated by Gleb Mezhanskiy, who pulls in the audience’s questions.

We had 6 guest speakers & panelists:

1. Emily Riederer @ Capital One - “Operationalizing Column Name Contracts”
2. Felix Kreitschmann and Jorrit Posor @ FINN Auto - “Supercharging Analytics Engineers: How to save time and prevent technical debt by automating CI checks”
3. Alexandra Gronemeyer @ Airbyte - “adopting and running dbt within a small data team at Airbyte”
4. Jason Jones @ Virgin Media O2 - “Zero to 200: scaling analytics engineering within an enterprise”
5. Sung Won Chung @ dbt Labs - “Experiences implementing dbt at scale”

Crosspost: The Art of Abstraction in ETL

Emily Riederer — Wed, 03 May 2023 05:00:00 GMT

I previously shared the first in my three-part series of guest posts on Airbyte’s developer blog about ETL. The first focused on errors in data extraction. The next two focused on the countless, small decisions one makes when loading data, and finally the DataOps burden to keep things up-and-running.

This post serves only as a quick reference to those posts:

Posit Data Science Hangout

Thu, 13 Apr 2023 05:00:00 GMT

Video

We were recently joined by Emily Riederer, Senior Manager - Customer Management Data Science & Analytics at Capital One. We discussed how a strong foundation in high-quality data infrastructure and reproducible tools sets the stage for innovation in modeling, causal inference, and analytics, and so much more.

Not applicable - live conversation

The Art of Abstraction in ETL: Dodging Data Extraction Errors

Emily Riederer — Wed, 22 Mar 2023 05:00:00 GMT

Whenever I think about data developer tooling, I always like to take the perspectives of:

Understanding what higher-level abstractions it provides that help eliminate rote work or reduce mental overhead for data teams. In the spirit of my post on the jobs-to-be-done of innersource analysis tools, this can be framed as what ‘jobs’ the tool can be hired to do (and with what level of responsibility and autonomy)
Interrogating the likely failure modes in the data stack based on the mechanics of the system, in the spirit of my call for hypothesis-driven data quality testing

These two themes motivated my recent guest post for Airbyte’s developer blog on The Art of Abstraction in ETL: Dodging Data Extraction Errors. In this post, I argue:

Cooking a meal versus grocery shopping. Interior decorating versus loading the moving van. Transformation versus Extract-Load. It’s human nature to get excited by flashy outcomes and, consequently, the most proximate processes that evidently created them.

This pattern repeats in the data world. Conferences, blog posts, corporate roadmaps, and even budgets focus on data transformation and the allure of “business insights” that might follow. The steps to extract and load data are sometimes discounted as a trivial exercise of scripting and scheduling a few API calls.

However, the elegance of Extract-Load is not just the outcome but the execution – the art of things not going wrong. Just as interior decorating cannot salvage a painting damaged in transit or a carefully planned menu cannot be prepared if half of the ingredients are out-of-stock, the Extract-Load steps of data processing have countless pitfalls which can sideline data teams from their ambitious agendas and aspirations.

I then go on to explore common challenges in successfully extracting data from an API and the abstractions that can aid in this process.

Please check out the full post on Airbyte’s site! I hope it resonates.

Evaluation without Experimentation

Wed, 22 Mar 2023 05:00:00 GMT

Slides
Video
Post - Causal Design Patterns

In this four-minute lightning talk, I explain the motivation for using observational versus experimental methods when measuring the impact of GOTV (voter turnout) efforts. Next, I provide a high-level explanation of a specific method (inverse propensity of treatment weighting) and demonstrate the method’s ability to balance baseline traits and historical performance of a synthetic control population for credible effect estimation.

This talk is based on my related blog post and summarizes my work as a Bluebonnet Data Fellow with the Two Million Texans relational organizing campaign during the 2022 midterms.
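To make the weighting idea mentioned above concrete, here is a toy sketch rather than the analysis from the talk. It assumes the propensity scores are already known, uses a normalized (Hajek-style) weighted difference in means, and runs on made-up numbers in which high-baseline people are more likely to be contacted and the true effect is 0.1 in both groups.

```python
def naive_difference(records):
    """Unweighted difference in mean outcome, treated minus control."""
    treated = [outcome for is_treated, _, outcome in records if is_treated]
    control = [outcome for is_treated, _, outcome in records if not is_treated]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_effect(records):
    """Normalized inverse-propensity-weighted estimate of the average effect.

    Each record is (is_treated, propensity, outcome), where propensity is the
    probability of treatment given baseline traits.
    """
    treated_num = treated_den = control_num = control_den = 0.0
    for is_treated, propensity, outcome in records:
        if is_treated:
            weight = 1.0 / propensity
            treated_num += weight * outcome
            treated_den += weight
        else:
            weight = 1.0 / (1.0 - propensity)
            control_num += weight * outcome
            control_den += weight
    return treated_num / treated_den - control_num / control_den


# Made-up data: (is_treated, propensity, outcome)
records = [
    # High-baseline group: 75% chance of contact, outcomes 0.9 vs 0.8
    (True, 0.75, 0.9), (True, 0.75, 0.9), (True, 0.75, 0.9), (False, 0.75, 0.8),
    # Low-baseline group: 25% chance of contact, outcomes 0.3 vs 0.2
    (True, 0.25, 0.3), (False, 0.25, 0.2), (False, 0.25, 0.2), (False, 0.25, 0.2),
]

naive = naive_difference(records)   # roughly 0.4, inflated by who got contacted
weighted = ipw_effect(records)      # roughly 0.1, the effect built into the data
```

The naive comparison mostly picks up the fact that contacted people were already more likely to have high outcomes; reweighting each group to resemble the full population removes that imbalance in this toy setup.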

