The Chalk team travelled up to Tacoma for SciPy 2025! Here’s what caught our attention.
The composable Python stack has arrived
Deepyaman Datta (a Kedro maintainer) demonstrated just how far the Python community has come! It’s now feasible to build a pure Python, composable data stack.
Kedro + Ibis + dlt = Production pipelines in Python
- dlt > Scaffolding for ETL pipelines with tens of thousands of sources and destinations
- kedro > Data pipeline and orchestration framework
- ibis > DataFrame library that can execute on over 20 backends, e.g. SQL engines
With these three, you can extract and load massive datasets (dlt), orchestrate transformations and models (Kedro), and push computation down to the database or warehouse without leaving Python (Ibis).
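To make that concrete, here is a minimal dlt pipeline that loads a handful of records into a local DuckDB destination (the pipeline, dataset, and table names are made up for illustration):

import dlt

# Load a few illustrative records into a local DuckDB database.
pipeline = dlt.pipeline(
    pipeline_name="orders_demo",   # hypothetical name
    destination="duckdb",
    dataset_name="raw",
)
pipeline.run(
    [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 17.5}],
    table_name="orders",
)

Ibis then covers the transformation layer. An expression like this: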
# ibis
t.filter(t.a > 1).select("a", "b", "c")
Turns into this SQL expression:
SELECT a, b, c FROM t WHERE a > 1
Tack on DuckDB and you can build a lightweight data warehouse that runs locally.
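A rough sketch of that local-warehouse workflow (file, table, and column names are illustrative):

import ibis

con = ibis.duckdb.connect("warehouse.ddb")        # a local DuckDB file
orders = con.read_parquet("orders/*.parquet")     # hypothetical source files
expr = (
    orders.filter(orders.amount > 100)
    .group_by("customer_id")
    .agg(total=orders.amount.sum())
)
print(ibis.to_sql(expr))   # inspect the SQL DuckDB will execute
df = expr.execute()        # runs inside DuckDB, comes back as a pandas DataFrame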
Pandera + Ibis = Data validation at warehouse scale
- No more pulling data into memory for checks
- Schema validation runs where your data lives
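The idea is that Pandera's familiar class-based schemas can now validate Ibis tables in place. A rough sketch, assuming the Ibis integration mirrors the existing Polars backend (the import path below is an assumption, so check Pandera's docs):

import ibis
import pandera.ibis as pa   # assumed import path, by analogy with pandera.polars

class Orders(pa.DataFrameModel):
    order_id: int
    amount: float = pa.Field(ge=0)   # amounts must be non-negative

con = ibis.duckdb.connect("warehouse.ddb")
validated = Orders.validate(con.table("orders"))   # checks run on the backend, not in local memory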
GPUs go native
Christopher Lamb from NVIDIA showed the audience how CUDA now has native Python support. No more C++ gatekeeping for GPU programming.
The ecosystem now offers multiple paths to GPU acceleration. Through RAPIDS, cuDF provides true drop-in pandas acceleration, while Polars users get GPU power through the new cudf-polars backend.
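In practice the switch is small. Assuming RAPIDS and cudf-polars are installed, something like this turns on GPU execution in both libraries:

# pandas: enable the cuDF accelerator before importing pandas-based code.
import cudf.pandas
cudf.pandas.install()
import pandas as pd   # pandas calls now dispatch to the GPU where supported

# Polars: ask the lazy engine to collect on the GPU.
import polars as pl
lf = pl.LazyFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})
out = lf.filter(pl.col("a") > 1).collect(engine="gpu")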
On that note: RAPIDS cuDF is making substantial progress integrating with Velox, an execution engine from Meta (that we optimize for online inference). Exciting times for those of us building high-performance data systems!
The <10ms Challenge
Our co-founder Elliot went down a different path, one where fast inference comes from intelligent system design rather than raw compute power. His talk, "Real-time ML: Accelerating Python for inference at scale," tackled a specific challenge: how do you achieve sub-10ms latency while letting teams keep the development velocity of Python?
His key insights:
- Parse Python's AST to build a computation DAG (see the sketch after this list)
- Push filters and projections all the way down to the data source
- Execute in Velox (Meta's unified execution engine) for massive parallelization
- Functions that can't be transpiled (e.g. LLM calls) run in isolated processes with Ray
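To make the first point concrete, here is a toy sketch (not Chalk's implementation) of using Python's ast module to discover which other features a feature function reads, which is enough to wire functions into a dependency DAG:

import ast
import inspect
import textwrap

def referenced_names(fn):
    # Every name read anywhere inside fn, found by walking its AST.
    tree = ast.parse(textwrap.dedent(inspect.getsource(fn)))
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

def build_dag(feature_fns):
    # Map each feature to the other registered features it depends on.
    known = set(feature_fns)
    return {
        name: referenced_names(fn) & (known - {name})
        for name, fn in feature_fns.items()
    }

# Hypothetical feature resolvers:
def spend_30d():
    return transactions.amount.sum()   # reads an upstream "transactions" source

def risk_score():
    return 0.3 * spend_30d() + 0.7 * account_age

print(build_dag({"spend_30d": spend_30d, "risk_score": risk_score}))
# {'spend_30d': set(), 'risk_score': {'spend_30d'}}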
Working with petabytes
The Zarr team shared major updates for Icechunk, which is something like Iceberg's scientific-computing cousin, further cementing the trend of separating storage from compute!
Virtual references let you point to byte ranges in existing files without copying data - imagine building a unified view over petabytes without moving a single byte!
Tom Nicholas from Earthmover demonstrated building massive datacubes from archival files. Instead of migrating petabytes of archival data to cloud-native formats, VirtualiZarr creates virtual layers that make existing files accessible as if they were cloud-optimized.
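A hedged sketch of the pattern (paths are placeholders, and exact arguments vary across VirtualiZarr versions):

import xarray as xr
from virtualizarr import open_virtual_dataset

# Read only metadata; record byte ranges that point back into the original files.
vds_2023 = open_virtual_dataset("s3://archive/temps_2023.nc")   # placeholder paths
vds_2024 = open_virtual_dataset("s3://archive/temps_2024.nc")

# Stitch the virtual datasets into one logical datacube; still no data is copied.
cube = xr.concat([vds_2023, vds_2024], dim="time", coords="minimal", compat="override")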
The key insight: most scientific data is stuck in pre-cloud formats. Instead of a costly migration, create virtual layers that make old data look cloud-native.
Notebooks and IDEs reimagined
Unlike traditional notebooks, Marimo eliminates hidden state by treating notebooks as directed acyclic graphs of cells — making execution predictable, reproducible, and collaborative. It’s a batteries-included environment that can replace Jupyter, Streamlit, ipywidgets, Papermill, and more, with features such as:
- Reactive and interactive workflows – run a cell and all dependent cells update automatically; bind sliders, tables, and plots to Python without callbacks.
- Reproducible, shareable, and Git-friendly – deterministic execution, .py-based storage, built-in package management, and deployment as interactive apps or slides.
- Easily export and remix your notebooks into a presentation, web app, or script!
Together, these capabilities make Marimo a modern, end-to-end notebook environment built for both experimentation and production.
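Here is a small sketch of what "reactive" means in practice. Marimo stores notebooks as plain Python files, and the two cells below are implicitly wired together because the second one reads n.value:

import marimo as mo

# Cell 1: an interactive slider; displaying it renders the widget.
n = mo.ui.slider(1, 100, value=10, label="sample size")
n

# Cell 2: reads n.value, so marimo re-runs it automatically whenever the slider moves.
import random
sample = [random.random() for _ in range(n.value)]
mo.md(f"Mean of {n.value} draws: {sum(sample) / len(sample):.3f}")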
Meanwhile, Posit (makers of RStudio) announced a new IDE built on VS Code's foundation but optimized for data workflows:
- Integrated data viewers (not just print statements)
- Variable explorers that understand dataframes
- Native plot panes
- Message: Data scientists deserve specialized tools
Positron bridges the gap between lightweight notebooks and full IDEs, giving data scientists an environment that’s both powerful and purpose-built. It’s a reminder that general-purpose editors aren’t enough—data science thrives with tools designed for exploration, visualization, and analysis.
Elvis has been using the editor since the announcement and has really enjoyed an AI-native notebook experience with the same powerful UX as VS Code.
The art of going fast
Brodie Vidrine from NOAA shared hard-won lessons from migrating their Global Summary of the Month product from Java to Polars, scoring an impressive 80% performance improvement. When Polars alone wasn't enough for complex business logic, they dropped down to Numba for custom compiled functions.
import polars as pl

# tmin_vals, tmax_vals, min_qc_flags, max_qc_flags, and daily_averages are
# parallel lists of column names built elsewhere in the pipeline.
daily_average_expressions = []
for v, v2, qc, qc2, dv in zip(tmin_vals, tmax_vals, min_qc_flags, max_qc_flags, daily_averages):
    daily_average_expressions.append(
        pl.when(
            pl.col("Element").is_in(["TAVG"])
            & ~pl.col(v).is_in([-9999.0])
            & pl.col(qc).is_in([""])
            & ~pl.col(v2).is_in([-9999.0])
            & pl.col(qc2).is_in([""])
        )
        .then(((pl.col(v) + pl.col(v2)) / 2) * 0.1)
        .otherwise(-9999.0)
        .alias(dv)
    )
Imagine trying to dynamically build queries that compare over a hundred columns in raw SQL!
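And when expression-based APIs run out of road, the Numba escape hatch they mentioned looks roughly like this (a generic illustration, not NOAA's actual code):

import numba
import numpy as np

@numba.njit
def monthly_mean_with_qc(values, qc_ok, missing=-9999.0):
    # Average only the values whose QC flag is clean and that aren't the missing sentinel.
    total, count = 0.0, 0
    for i in range(values.shape[0]):
        if qc_ok[i] and values[i] != missing:
            total += values[i]
            count += 1
    return total / count if count else missing

vals = np.array([12.0, -9999.0, 14.5, 13.2])
ok = np.array([True, True, True, False])
print(monthly_mean_with_qc(vals, ok))   # 13.25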
What this means for production ML
SciPy 2025 marked a turning point. Python can now hold its own for enterprise workloads and has the potential to become the primary user interface for interacting with production systems.
Key trends include:
- Unified execution: Write once, run anywhere (Chalk & Ibis)
- Virtual data: Reference, don't copy (Icechunk)
- Reactive development: Reproducible from the start (Marimo & Positron)
The tools and ideas we saw in Tacoma weren’t incremental improvements — they’re signposts for where data and scientific computing are headed next.
You can watch Elliot's talk here.