In my orbit: hacking orbital’s ML-to-SQL for xgboost

Posit’s recently-announced project orbital translates fitted SciKitLearn pipelines to SQL for easy prediction scoring at scale. This project has many exciting applications to deploy models for batch prediction with near-zero dependencies or custom infrastructure and have scores accessible to operatilize from their data warehouse.

As soon as I heard about the project, I was eager to test it out. However, much of my recent work is in pure xgboost and neither xgboost’s learning API nor the scikit-learn compatible XGBClassifier() and inherently supported by orbital. This post describes a number of workarounds to get orbital working with xgboost. This mostly works, so we’ll also cover the known limitations.

Just want the code? The source notebook for this post is linked throughout and available to run end-to-end. I’m also stashing this and other ongoing explorations of wins, snags, and workflows with orbital in this repo.

(Separately, I’m planning to write about my current test-drive of orbital, possible applicatons/workflows, and current pitfalls. It would have been imminently logical to write that post first. However, I saw others requesting xgboost support for orbital on LinkedIn and began a conversation, so I wanted to pull forward this post.)

Ready for Production?

By orbital’s own admission in its README, it is still under development. The vision is exciting enough, I think it’s more than worth digging it, but be aware that it is likely not production-ready for enterprise-grade application without rigorous independent validation. I’ve found some corner cases (logged on GitHub issues) and will share more thoughts in other posts.

Step-by-Step Guide

Preparing an xgboost model for use in orbital requires a number of transformations. Specifically, this quick “tech note” will cover:

Converting a trained xgboost model into an XGBClassifier
Adding a pre-trained classifier to a scikit-learn pipeline
Enabling XGBClassifier translation from onnxmltools for orbital
Getting final SQL
Validating our results after this hop-scotch game of transformations

Executing this sucessfully requires dealing with a handful of rough edges, largely driven by onnxmltools:

onnxmltools requires variables names of format f{number}
xgboost and XGBClassifier must use base_score of 0.5 (no longer the default!)
orbital seems to complain if the pipeline does not include at least one column transformation
XGBClassifier converter must be registered from onnxmltools
orbital’s parse function must be overwritten to hard-code the ONNX version for compatibility
in rare cases, final predictions vary due to different floating point logic in python and SQL (<0.1% of our test cases)

As we go, we’ll see how to address each of these challenges.

First, we’ll grab some sample data to work with:

# make data dataset
X_train, y_train = make_classification(10000, random_state = 102)
X_train = X_train.round(3)

# get column names for use in pipeline
## onnxmltools forces these to be formatted as "f<number>"
n_cols = len(X_train[0])
nm_cols = [f"f{i}" for i in range(n_cols)]
feat_dict = {c:orbital.types.DoubleColumnType() for c in nm_cols}

Source: End-to-end notebook

Converting `xgboost` model to an `XGBClassifier` pipeline

xgboost provides two interfaces: a native learning API and a scikit-learn compatible API. The learning API is sometimes favored for performance advantages. However, since orbital can only work with scikit-learn pipelines, it’s necessary to move to a compatible API.

The strategy here is to fit an xgboost model (assuming that’s what you wanted to do in the first place), initialize a XGBClassifier, and set its attributes. Then, we can directly put our trained XGBClassifier into the a pipeline.

Base Score/Magin

Currently, we must use a base_score of 0.5 for training xgboost and set the same value for the XGBClassifier. Current versions of xgboost pick smarter values by default, but currently orbital (or perhaps onnxmltools) does not know how to correctly incorporate other base margins into SQL, resulting in incorrect predictions.

This is probably currently the biggest weakness of this overall approach because it’s the only blocker where the fix requires fundamentally changing a modeling decision.

# train with xgb learning api
## keeping parameters super simple so it trains fast and easy to compare
## important: this only works for now with base_score=0.5 
## this is the default assumed by orbital's logic, and I haven't figured out how to convince it otherwise
dtrain = xgb.DMatrix(X_train, y_train, feature_names = nm_cols)
params = {'max_depth':2, 
          'objective':'binary:logistic', 
          'base_score':0.5, 
          'seed':504}
model = xgb.train(params, num_boost_round = 1, dtrain = dtrain)
preds_xgb = model.predict(xgb.DMatrix(X_train, feature_names = nm_cols))

# convert back to skl interface & rebuild needed metadata
clf = xgb.XGBClassifier()
clf._Booster = model
clf.n_classes_ = 2
clf.base_score = 0.5
preds_skl = clf.predict_proba(X_train)[:,-1]

# validate that the results are the same
print(f"xgb and skl match: {np.all(np.isclose(preds_xgb, preds_skl))}")

# add to skl pipeline
ppl = Pipeline([("gbm", clf)])
preds_ppl = ppl.predict_proba(X_train)[:,-1]

# validate that the results are the same
print(f"xgb and ppl match: {np.all(np.isclose(preds_xgb, preds_ppl))}")

xgb and skl match: True
xgb and ppl match: True
xgb and skl match: True
xgb and ppl match: True

Source: End-to-end notebook

We see all three approaches produce the same predictions.

Unfortunately, things aren’t quite that simple.

Add multiple pipeline steps

orbital seems to complain if it does not have at least one column-transformation pipeline step. I’ve yet to figure out exactly why, but in the meantime it’s no-cost to make a “fake” step that changes no columns.

Below, I remake the pipeline with a column transformer, ask it to apply to an empty list of variables, and request the rest (i.e. all of them) be passed through untouched.

# now we actually make a slightly more complicated pipeline
# orbital seems unhappy if there isn't at least one preprocessing step,
# so we make one that processes no variables and passes through the rest
pipeline = Pipeline(
    [
        ("preprocess", ColumnTransformer([("scaler", StandardScaler(), [])], remainder="passthrough")),
    ]
)
pipeline.fit(X_train)
pipeline.steps.append(("gbm", clf))
preds_ppl2 = pipeline.predict_proba(X_train)[:,1]
print(f"xgb and ppl2 matches: {np.all(np.isclose(preds_xgb, preds_ppl2))}")

xgb and ppl2 matches: True
xgb and ppl2 matches: True

Source: End-to-end notebook

Again, we see this “null” step does not change our predictions.

Enabling `onnxmltools` for `XGBClassifier` conversion

orbital depends on skl2onnx which implements a smaller set of model types. onnxmltools offers many additional model converters. However, for skl2onnx to correctly find and apply these converters, they must be registered.

# `options` copied straight from `onnxmltools` docs
update_registered_converter(
    XGBClassifier,
    "XGBoostXGBClassifier",
    calculate_linear_classifier_output_shapes,
    convert_xgboost,
    options={"nocl": [True, False], 
             "zipmap": [True, False, "columns"], 
            },
)

Source: End-to-end notebook

However, there’s another nuance here. We all know the challenges of python package versioning, but both skl2onnx and onnxmltools also require coordinating on a version of the ONNX spec’s version as a universal way to represent model objects. The skl2onnx function that allows us to request a version is wrapped in orbital without the ability to pass in parameters. So, we must override that function.

Override orbital’s parse_pipeline()

This is required to set an ONNX version compatible between skl2onnx and onnxmltools. This is a lightweight function and not a class method, so we can just steal the code from the orbital package, modify it, and call it for ourselves. There is no need to monkeypatch.

def parse_pipeline_local(
    pipeline: Pipeline, features: orbital.types.FeaturesTypes
) -> orbital.ast.ParsedPipeline:

    onnx_model = skl2onnx.to_onnx(
        pipeline,
        initial_types=[
            (fname, ftype._to_onnxtype())
            for fname, ftype in features.items()
            if not ftype.is_passthrough
        ],
        target_opset={"": 15,'ai.onnx.ml':3}
    )
    return orbital.ast.ParsedPipeline._from_onnx_model(onnx_model, features)

Source: End-to-end notebook

Run `orbital`!

If you’ve made it this far, you’ll be happy to know the next step is straightforward. We can now run orbital to generate the SQL representation of our model prediction logic.

# translate into an Orbital Pipeline
orbital_pipeline = parse_pipeline_local(pipeline, features=feat_dict)
sql_raw = orbital.export_sql("DATA_TABLE", orbital_pipeline, dialect="duckdb")

Source: End-to-end notebook

Validate results

So, after all that, did we get the right result? One way we can confirm (especially because we kept the initial xgboost model very simple) is to compare the visual of our tree with the resulting SQL.

Here’s the tree grown by xgboost:

Source: End-to-end notebook

Here’s the SQL developed by orbital:

SELECT
  1 / (
    EXP(
      -CASE
        WHEN "t0"."f4" < -0.04800000041723251
        THEN CASE
          WHEN "t0"."f4" < -0.8119999766349792
          THEN -0.5087512135505676
          ELSE -0.21405750513076782
        END
        ELSE CASE
          WHEN "t0"."f18" < -0.4269999861717224
          THEN -0.3149999976158142
          ELSE 0.5008015036582947
        END
      END
    ) + 1
  ) AS "pred",
  "f1"
FROM "DATA_TABLE" AS "t0"
SELECT
  1 / (
    EXP(
      -CASE
        WHEN "t0"."f4" < -0.04800000041723251
        THEN CASE
          WHEN "t0"."f4" < -0.8119999766349792
          THEN -0.5087512135505676
          ELSE -0.21405750513076782
        END
        ELSE CASE
          WHEN "t0"."f18" < -0.4269999861717224
          THEN -0.3149999976158142
          ELSE 0.5008015036582947
        END
      END
    ) + 1
  ) AS "pred",
  "f1"
FROM "DATA_TABLE" AS "t0"

Source: End-to-end notebook

These appear to match!

However, if we go to use the results, we find that there are some non-equal predictions.

DATA_TABLE = pd.DataFrame(X_train, columns = nm_cols)
db_preds = duckdb.sql(sql_mod).df()
preds_orb = db_preds['pred']
print(f"xgb and orb match: {np.all(np.isclose(preds_xgb, preds_orb))}")

xgb and orb match: False
xgb and orb match: False

Source: End-to-end notebook

Floating point math

Predictions may differ slightly across platforms due to floating point precision. Below, we see 5 of 10K predictions were non-equal. We can pull out the values of f4 and f18 for those 5 records (the only variables used in the model) and compare them to either the SQL or the flowchart. All 5 misses lie right at the cutpoint for one of the nodes.

# isolate and size misses
misses = np.where(~np.isclose(preds_xgb, preds_orb))
print(f'Different predictions (N): {len(misses[0])}')
print(f'Different predictions (P): {len(misses[0]) / len(X_train):.4f}')

# pull out f4 and f18; notice that all discrepancies lie exactly at the splitting points
X_train[misses][:,[4,18]]

Different predictions (N): 5
Different predictions (P): 0.0005
Different predictions (N): 5
Different predictions (P): 0.0005

array([[-0.812, -0.515],
       [-0.812, -0.739],
       [ 1.715, -0.427],
       [ 0.025, -0.427],
       [ 2.119, -0.427]])

Source: End-to-end notebook

Step-by-Step Guide

Converting xgboost model to an XGBClassifier pipeline

Enabling onnxmltools for XGBClassifier conversion

Run orbital!

Validate results

Converting `xgboost` model to an `XGBClassifier` pipeline

Enabling `onnxmltools` for `XGBClassifier` conversion

Run `orbital`!