Chalk for Data Engineers

by Linda Zhou, Marketing Manager, and Elvis Kahoro, Developer Advocate
August 20, 2025

Your daily Spark jobs run successfully. Your feature store is synced incrementally. Your operational dashboards are green across the board: zero incidents.

Another team messages: "It seems like we're not integrating refunds into our recommendation engine. We keep suggesting products that have high return rates."

Your stomach sinks because adding a refund filter requires recomputing six months of features and redeploying all of your ETL pipelines — in the correct order!

Why ML pipelines feel like a chore

Every new data source piles on complexity.

Changing core business logic, or even just integrating a new API, requires updating multiple DAGs and rewriting the data contracts you have with the teams that depend on your feature logic.

Real-time compute is riddled with bottlenecks.

Warehouses take seconds for simple queries. Pre-computed features go stale quickly. Even worse, ML features need complex joins and aggregations — try computing "the likelihood a customer buys a product based on the similarity of their current browsing session with that of other customers" in 10ms.

Any optimization made to squeeze out another millisecond risks breaking something else. The complexity is compounded by tool sprawl: Airflow, Spark, Kafka, Flink, Redis, each necessary for some piece of the real-time puzzle. But stitching them together with Python glue code kills any chance of actually being real-time in the first place.

And when something inevitably breaks, good luck debugging what went wrong!

How Chalk simplifies everything

Chalk abstracts away the complexity of doing real-time inference by enabling machine learning teams to express features and data relationships without worrying about data pipelines, caching, or serving infrastructure.

Streamline feature engineering with declarative feature definitions and programmatic feature management:

from datetime import datetime

from chalk.features import DataFrame, feature, features, _
from chalk.streams import Windowed, windowed

@features
class Transaction:
    id: int
    created_at: datetime
    amount: float

    user_id: "User.id"
    user: "User"

@features
class User:
    id: int
    domain: str

    # composite keys that can be used as join keys
    workspace_id: str = _.domain + "-" + _.id
    expensive_api_call: str = feature(max_staleness="30d") # cache values

    # maintain different resolvers to A/B test function calls e.g. gemini vs openai
    llm_response: str = feature(version=3)

    # has-many join: all of this user's transactions
    txns: DataFrame[Transaction]

    count_txns: Windowed[int] = windowed(
        "1d", "365d",
        expression=_.txns[_.created_at > _.chalk_window].count(),
        # https://docs.chalk.ai/docs/materialized_aggregations
        materialization=True,
    )
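
Once deployed, these definitions are immediately queryable online. Here's a minimal sketch of fetching them with Chalk's Python client, assuming a deployed Chalk environment with the feature classes above; the user id is a placeholder for illustration:

from chalk.client import ChalkClient

# Credentials are read from the environment.
client = ChalkClient()

result = client.query(
    input={User.id: 1},
    output=[
        User.workspace_id,         # computed on the fly from _.domain + "-" + _.id
        User.count_txns["1d"],     # windowed aggregation, 1-day window
        User.count_txns["365d"],   # same definition, 365-day window
    ],
)
print(result.get_feature_value(User.workspace_id))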

One platform in your infrastructure

Chalk runs in your VPC, inheriting your existing security groups and ACLs. Choose your own memory stores (Redis, DynamoDB) based on your performance needs. Your data never leaves your environment, giving you full isolation and simplifying compliance. One less external service to manage, monitor, and secure.

Any data source in minutes

With Chalk, connecting a new data source is as easy as writing a new Python function or adding a .sql file to your Chalk repo. Remove cross-team dependencies, unblock data scientists, and iterate on your features with self-serve data access and A/B testing.

-- resolves: Transaction
-- type: online
-- source: pg_txns
-- incremental:
--   mode: row
--   lookback_period: 60m
--   incremental_column: created_at
select
    id,
    created_at,
    amount,
    user_id
from txns
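
The Python path is just as short. Below is a minimal sketch of an online resolver that populates the cached expensive_api_call feature defined earlier; the third-party endpoint is a made-up placeholder:

import requests

from chalk import online

@online
def get_expensive_api_call(uid: User.id) -> User.expensive_api_call:
    # Hypothetical enrichment endpoint. Because the feature declares
    # max_staleness="30d", Chalk serves the cached value instead of
    # re-running this resolver on every request.
    resp = requests.get(f"https://api.example.com/enrich/{uid}")
    resp.raise_for_status()
    return resp.json()["summary"]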

Smart caching and computation

Define caching strategies down to the individual feature: cache aggressively where staleness is acceptable, and guarantee fresh data where it matters, without over-engineering a one-size-fits-all pipeline.

  • Automatic optimization: Set a staleness tolerance per feature — Chalk automatically re-uses features that have already been computed
  • Time-windowed aggregations: Compute feature aggregations at multiple intervals without building separate pipelines
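
Staleness can also be tuned per request. As a sketch, assuming the client's per-query staleness override, a latency-sensitive caller can demand a fresher value than the 30-day default without touching the feature definition:

from chalk.client import ChalkClient

client = ChalkClient()

# Per-query override: tolerate at most 1h of staleness for this call,
# even though the feature's default max_staleness is 30d.
result = client.query(
    input={User.id: 1},
    output=[User.expensive_api_call],
    staleness={User.expensive_api_call: "1h"},
)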

Build features, not pipelines

Modern machine learning teams have been able to cut model development from months to days with Chalk. Stop spending a week setting up data pipelines just to test a new model idea.

Connect new data sources, write features, and deploy a new model all in the same day. Chalk helps data teams focus on what matters most: creating value. Implement new features as fast as you can dream them up!

Build ML features faster with Chalk

See what Chalk can do for your team