Chalk for AI Engineers

by Linda Zhou (Marketing Manager) and Elvis Kahoro (Developer Advocate)
June 27, 2025

You built a customer support bot that analyzes tickets and suggests responses. In your notebook, it understands context — pulling recent interactions, checking account status, and drafting helpful replies.

Three weeks into production, it's suggesting refunds for customers who’ve already received them, and you can't tell which prompt version is running or what context the model has been given.

So what led to that faulty suggestion in the first place?

Why productionizing AI is so hard

The context that your model needs lives in various systems: account status, refund history, previous support tickets. You sync it all to a feature store via overnight ETL jobs. When a customer writes in, your bot fetches these pre-computed features and calls an LLM, but the context is already stale.

[Diagram: Snowflake to LLM data flow]

The freshness problem hits immediately. The refund processed this morning isn't in last night's batch. The policy that changed an hour ago isn't reflected in your bot’s responses.

The tooling complexity makes it worse. To build this system, you're juggling:

  • ETL pipelines that sync your data to a vector database
  • An embedding pipeline with retry logic and rate limiting
  • A real-time feature-fetching layer (though the context it serves is still from last night's batch)
  • A prompt versioning system for testing and compliance
  • Cost tracking and monitoring

Every component is a separate system with its own failure modes. When the AI makes decisions on stale data or breaks randomly, you're debugging across at least five different tools.

Furthermore, since different model providers have different syntax for structured output (if they even support it in the first place), you end up writing adapters for each one.

AI as infrastructure: Unified ML + LLM pipeline

What if you could treat AI like any other infrastructure? Ultimately, that's what it is — most agents are just for loops with API calls. The engineering principles that make any system reliable apply here too.
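To make that concrete, here's a deliberately simplified sketch of what many "agents" boil down to. It's plain Python; fetch_context, render_prompt, call_llm, and run_tool are hypothetical placeholders for your data lookups, prompt template, provider SDK, and tool calls.

def run_agent(ticket_id: str, max_steps: int = 5) -> str:
    # Gather context once, then loop: call the model, act on its reply.
    context = fetch_context(ticket_id)  # database / API lookups
    for _ in range(max_steps):
        reply = call_llm(render_prompt(context))  # LLM API call
        if reply.get("done"):
            return reply["message"]
        # Run the tool the model asked for and fold the result back into context
        context.update(run_tool(reply["tool"], reply["args"]))
    return "escalate_to_human"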

[Diagram: Multi-source feature computation]

That’s exactly what Chalk does — it brings engineering discipline to AI systems. In Chalk, there's no distinction between 'ML' and 'AI' features: they're all just features in a unified platform. Whether it’s PostgreSQL data, API calls or LLM outputs, everything maps to feature classes that are versioned, typed, and testable like any other code.

Instead of ETL jobs populating a feature store and making separate LLM calls, Chalk pulls fresh data and computes everything on demand when requests come in. These features are cached and immediately available and reusable across all your models.
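As a rough sketch of what that looks like from an application, here's an online query with Chalk's Python client. The feature names mirror the Ticket example later in this post and are illustrative.

from chalk.client import ChalkClient

client = ChalkClient()  # credentials come from your environment / chalk login

# Chalk runs whatever resolvers are needed (SQL queries, API calls, the LLM
# completion) to produce these outputs with fresh data, then caches the results.
result = client.query(
    input={"ticket.id": "ticket_123"},
    output=[
        "ticket.escalate_to_human",
        "ticket.personalized_message",
    ],
)
print(result.get_feature_value("ticket.escalate_to_human"))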

What you can do with Chalk

Start where you are

Chalk features work directly in Jupyter and Colab notebooks. When you're ready to move beyond experimentation, Chalk integrates with your existing vector database and infrastructure. The code you write for exploration is the same code that runs in production — no translation layer or rewriting for deployment!

Define features once, use everywhere

In Chalk, LLM outputs are features: computed once, cached, and reusable across teams and use cases.

Your sentiment analysis becomes a feature. Your entity extraction becomes a feature. Any downstream model or service can use them.

import chalk.functions as F
import chalk.prompts as P
from chalk.features import features, has_one, _


@features
class Ticket:
    id: str
    # User and AccountStatus are feature classes defined elsewhere
    user_id: User.id  # implicit join to the User feature class
    user: User

    status: AccountStatus = has_one(lambda: Ticket.user_id == AccountStatus.user_id)

    # USER_PROMPT (a Jinja template) and SupportResponse (the structured
    # output model) are shown below
    llm: P.PromptResponse = P.completion(
        model="gpt-4o-mini",
        messages=[
            # Chalk injects features into the prompt:
            # reference the features you want with Jinja templates
            P.message("user", USER_PROMPT)
        ],
        output_structure=SupportResponse,
    )
    escalate_to_human: bool = F.json_value(
        _.llm.response,
        "$.escalate_to_human",  # from structured output
    )
    personalized_message: str = F.json_value(
        _.llm.response,
        "$.personalized_message",  # from structured output
    )
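SupportResponse isn't defined in the snippet above. A minimal sketch of that structured-output model, assumed from the two fields the example extracts, might look like this:

from pydantic import BaseModel, Field

class SupportResponse(BaseModel):
    # Fields assumed from the json_value extractions above; add whatever
    # else your prompt asks the model to generate.
    personalized_message: str = Field(description="Reply drafted for the customer")
    escalate_to_human: bool = Field(description="Whether a human should take over")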

Real-time context injection

Chalk pulls fresh data at inference time and injects it into your prompts with Jinja templates. Your prompts reference features, and Chalk ensures those features are current when the LLM runs.

USER_PROMPT = """
Analyze the user's support request and account data to provide a personalized response.

Most recent message: {{Ticket.user.last_message}}

Account context:
- Tier: {{Ticket.status.tier}} (VIP: {{Ticket.status.is_vip}})
- Total spend: ${{Ticket.status.total_spend}}
- Recent tickets: {{Ticket.user.recent_ticket_count}} (unresolved: {{Ticket.user.open_ticket_count}})
- Average sentiment: {{Ticket.user.avg_sentiment}}

Generate a SupportResponse with:
1. personalized_message: Acknowledge their status and address their concern
2. escalate_to_human: True if VIP, high-value (>$1000), or negative sentiment pattern
"""

Every decision is traceable. You see exactly what data the model received, when that data was fetched, and how it influenced the output.

Native vector search

Whether you're finding similar support tickets or searching product catalogs, define embeddings alongside your data and pre-filter however you want: by category, user segment, or price range. Search only the vectors that matter for your query.
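Here's a rough sketch of what that can look like, assuming Chalk's embedding() helper and an is_near nearest-neighbor join; the class names, the embedding model, and the exact signatures are illustrative, so check Chalk's vector search docs for specifics.

from chalk.features import DataFrame, Vector, embedding, features, has_many

@features
class ProductDoc:
    id: str
    category: str
    description: str
    # The embedding is computed and stored alongside the rest of the feature class
    description_embedding: Vector = embedding(
        input=lambda: ProductDoc.description,
        provider="openai",
        model="text-embedding-3-small",
    )

@features
class SearchQuery:
    id: str
    text: str
    text_embedding: Vector = embedding(
        input=lambda: SearchQuery.text,
        provider="openai",
        model="text-embedding-3-small",
    )
    # Nearest-neighbor join by vector similarity; category, segment, or price
    # filters can be layered on top before ranking
    results: DataFrame[ProductDoc] = has_many(
        lambda: SearchQuery.text_embedding.is_near(
            ProductDoc.description_embedding
        )
    )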

Prompt versioning & evaluation

Prompts become versioned artifacts you can test, compare, and deploy. Run evaluations on historical data. Compare model performance on accuracy, latency, and cost. Your prompts are now engineered assets, with full audit trails and rollback capabilities.
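One way to back-test a prompt change is to recompute the LLM-backed features over historical tickets with Chalk's offline query API. This is a sketch under assumed parameters (offline_query with recompute_features); the feature names mirror the Ticket example above.

from chalk.client import ChalkClient

client = ChalkClient()

# Recompute the LLM-backed features over historical tickets so a new prompt
# version can be compared against what previously shipped.
dataset = client.offline_query(
    input={"ticket.id": ["ticket_101", "ticket_102", "ticket_103"]},
    output=[
        "ticket.escalate_to_human",
        "ticket.personalized_message",
    ],
    recompute_features=True,
)
df = dataset.get_data_as_pandas()
print(df.head())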

The simpler way ahead

The right abstractions dissolve complexity. We built Chalk to unify the AI stack: everything maps to typed feature classes, LLM outputs become reusable features, and data stays fresh by default. Ship AI like you ship code!
