Feature Store
Serve features for real-time decisions and model development.
Chalk is a centralized place to store, serve, and discover features for machine learning. Accelerate new model and feature development by reusing engineering work from previous models. Fetch dataframes directly in Jupyter notebooks, so your production and training data are guaranteed to be identical.
In []:
from datetime import datetime

from chalk.client import ChalkClient

client = ChalkClient()

# `labels` is a dataframe of labeled examples keyed by User.uid
df = client.offline_query(
    input=labels[[User.uid]],
    input_times=[datetime.now()] * len(labels),
    output=[
        User.name,
        User.credit_report,
        User.plaid_account.mean_balance,
    ],
)
Out[]:
# dataframe of point-in-time correct features, one row per label
In []:
# xgboost train / predict
from xgboost import XGBClassifier

xgb = XGBClassifier(
    eval_metric="logloss",
    use_label_encoder=False,
)
# "is_fraud" is an illustrative label column; to_pandas() assumes
# offline_query returns a Chalk Dataset
xgb.fit(df.to_pandas(), labels[["is_fraud"]])
Traditionally, teams write one pipeline to fetch data for training and another to fetch the same data for production. To make matters worse, teams often rewrite these pipelines for each model that relies on the data. You wind up with dozens of implementations fetching the same data, none of them quite identical, which leads to significant bugs. Chalk solves this with a unified feature repository: a single pipeline to fetch data, accessible to all training and production models.
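A minimal sketch of that single pipeline, assuming Chalk's documented @features and @online decorators; the resolver body and the `db` data source are illustrative:

from chalk import online
from chalk.client import ChalkClient
from chalk.features import features


@features
class User:
    uid: int
    name: str


# One resolver definition: Chalk runs it for live queries and reuses the
# same logic when materializing offline training sets.
@online
def get_name(uid: User.uid) -> User.name:
    return db.fetch_name(uid)  # `db` is a hypothetical upstream data source


client = ChalkClient()

# Production: fetch features for a single user at decision time.
client.query(input={User.uid: 1}, output=[User.name])

# Training: fetch the same features in bulk (`labels` as above:
# a dataframe of labeled examples).
client.offline_query(input=labels[[User.uid]], output=[User.name])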
Typically, developing new features or iterating on existing ones is slow and painful. You edit your pipeline or create a new one, then wait days, weeks, or months for results to roll in from production before you can evaluate whether the feature is correct and valuable for your predictions. Alternatively, you might attempt to synthesize historical results with a complex one-off script.
With Chalk, you can preview feature updates in real time, so you can iterate quickly and converge on better pipelines. Once your changes are finalized, Chalk automatically backfills the new features from historical data, so you can immediately start using them to train models.
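As a sketch of that iteration loop, assuming offline_query accepts a recompute_features flag that recomputes outputs from raw data rather than sampling previously materialized values:

# After editing a resolver, preview the updated feature against
# historical inputs instead of waiting for production data to accumulate.
preview = client.offline_query(
    input=labels[[User.uid]],
    input_times=[datetime.now()] * len(labels),
    output=[User.plaid_account.mean_balance],
    recompute_features=True,  # assumption: force recomputation for preview
)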
When computing historical training data, it’s common for teams to accidentally include data “from the future.” Models trained on these faulty data sets fail to perform in production because you can’t actually see into the future at prediction time. Chalk's time-travel functionality makes it easy to compute historical feature sets that accurately show how your features would have appeared in the past.
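Concretely, passing each label's own timestamp as the input time, rather than datetime.now(), yields features exactly as they stood at that moment; `label_time` is an illustrative column name:

# Each row is computed "as of" its label's timestamp, so no data from
# after the label time can leak into the training set.
df = client.offline_query(
    input=labels[[User.uid]],
    input_times=list(labels["label_time"]),
    output=[User.credit_report],
)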
Spark and Snowflake can't serve production traffic at interactive latencies. Chalk's feature-serving platform scales horizontally out of the box and handles extremely high-volume production workloads at low latency. Don't spend your time engineering a complex tiered storage system; let us handle the complexity and operational overhead of scaling feature serving.
Stop reinventing the wheel for each new model. Chalk’s feature catalog makes it easy to discover and organize the features that your team develops. Chalk’s tags and usage tracking make it easy for model authors to discover relevant features that others have already engineered. Version and track feature definitions and metadata in code stored in git for easy review and change control.
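A sketch of catalog metadata declared alongside the feature itself; the owner and tags parameters shown are assumptions modeled on Chalk's feature() metadata options:

from chalk.features import feature, features


@features
class User:
    uid: int
    # Metadata lives in code, gets reviewed in git, and surfaces in the
    # feature catalog where teammates can discover and reuse it.
    credit_score: int = feature(
        owner="risk-team@example.com",  # hypothetical owning team
        tags=["credit", "underwriting"],
    )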
Ever need to debug an issue, complete an audit, or otherwise figure out why your model made the prediction it did? Chalk automatically tracks detailed metadata about the provenance of each feature computed and served to your models: the exact code version, data sources, and upstream features used to compute a particular value. Chalk's audit capability makes it easy to debug issues, justify decisions to regulators, and perform exploratory data analysis.
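For example, production queries can be annotated so that served values trace back to the decision that requested them; the query_name and meta parameters shown are assumptions modeled on Chalk's query annotations:

# Annotate a production query so its served feature values can later be
# tied to this specific decision during debugging or an audit.
client.query(
    input={User.uid: 1},
    output=[User.credit_report],
    query_name="loan_decision",      # assumed query label
    meta={"request_id": "req_123"},  # assumed free-form audit metadata
)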