The Chalk team travelled up to Tacoma for SciPy 2025! Here’s what caught our attention.
The composable Python stack has arrived
Deepyaman Datta (a Kedro maintainer) demonstrated just how far the Python community has come! It’s now feasible to build a pure Python, composable data stack.
Kedro + Ibis + dlt = Production pipelines in Python
- dlt > Scaffolding for ETL pipelines with tens of thousands of sources and destinations
- kedro > Data pipeline and orchestration framework
- ibis > DataFrame library that can execute on over 20 backends, e.g. SQL engines
With these three, you can extract and load massive datasets (dlt), orchestrate transformations and models (Kedro), and push computation down to the database or warehouse without leaving Python (Ibis).
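To make that concrete, here is a minimal dlt pipeline that loads a handful of records into a local DuckDB destination (the pipeline, dataset, and table names are made up for illustration):

import dlt

# Load a few illustrative records into a local DuckDB database.
pipeline = dlt.pipeline(
    pipeline_name="orders_demo",   # hypothetical name
    destination="duckdb",
    dataset_name="raw",
)
pipeline.run(
    [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 17.5}],
    table_name="orders",
)

Ibis then covers the transformation layer. An expression like this: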
# ibis
t.filter(t.a > 1).select("a", "b", "c")
Turns into this SQL expression:
SELECT a, b, c FROM t WHERE a > 1
Tack on DuckDB and you can build a lightweight data warehouse that runs locally.
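A rough sketch of that local-warehouse workflow (file, table, and column names are illustrative):

import ibis

con = ibis.duckdb.connect("warehouse.ddb")        # a local DuckDB file
orders = con.read_parquet("orders/*.parquet")     # hypothetical source files
expr = (
    orders.filter(orders.amount > 100)
    .group_by("customer_id")
    .agg(total=orders.amount.sum())
)
print(ibis.to_sql(expr))   # inspect the SQL DuckDB will execute
df = expr.execute()        # runs inside DuckDB, comes back as a pandas DataFrame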
Pandera + Ibis = Data validation at warehouse scale
- No more pulling data into memory for checks
- Schema validation runs where your data lives
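The idea is that Pandera's familiar class-based schemas can now validate Ibis tables in place. A rough sketch, assuming the Ibis integration mirrors the existing Polars backend (the import path below is an assumption, so check Pandera's docs):

import ibis
import pandera.ibis as pa   # assumed import path, by analogy with pandera.polars

class Orders(pa.DataFrameModel):
    order_id: int
    amount: float = pa.Field(ge=0)   # amounts must be non-negative

con = ibis.duckdb.connect("warehouse.ddb")
validated = Orders.validate(con.table("orders"))   # checks run on the backend, not in local memory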
GPUs go native
Christopher Lamb from NVIDIA showed the audience how CUDA now has native Python support. No more C++ gatekeeping for GPU programming.
The ecosystem now offers multiple paths to GPU acceleration. Through RAPIDS, cuDF provides true drop-in pandas acceleration, while Polars users get GPU power through the new cudf-polars backend.
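In practice the switch is small. Assuming RAPIDS and cudf-polars are installed, something like this turns on GPU execution in both libraries:

# pandas: enable the cuDF accelerator before importing pandas-based code.
import cudf.pandas
cudf.pandas.install()
import pandas as pd   # pandas calls now dispatch to the GPU where supported

# Polars: ask the lazy engine to collect on the GPU.
import polars as pl
lf = pl.LazyFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})
out = lf.filter(pl.col("a") > 1).collect(engine="gpu")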
On that note: RAPIDS cuDF is making substantial progress integrating with Velox, an execution engine from Meta (that we optimize for online inference). Exciting times for those of us building high-performance data systems!
The <10ms Challenge
Our co-founder Elliot went down a different path, one where fast inference comes from intelligent system design rather than raw compute power. His talk, "Real-time ML: Accelerating Python for inference at scale," tackled a specific challenge: how do you achieve sub-10ms latency while letting teams keep the development velocity of Python?
His key insights:
- Parse Python's AST to build a computation DAG (see the sketch after this list)
- Push filters and projections all the way down to the data source
- Execute in Velox (Meta's unified execution engine) for massive parallelization
- Functions that can't be transpiled (e.g. LLM calls) run in isolated processes with Ray
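To make the first point concrete, here is a toy sketch (not Chalk's implementation) of using Python's ast module to discover which other features a feature function reads, which is enough to wire functions into a dependency DAG:

import ast
import inspect
import textwrap

def referenced_names(fn):
    # Every name read anywhere inside fn, found by walking its AST.
    tree = ast.parse(textwrap.dedent(inspect.getsource(fn)))
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

def build_dag(feature_fns):
    # Map each feature to the other registered features it depends on.
    known = set(feature_fns)
    return {
        name: referenced_names(fn) & (known - {name})
        for name, fn in feature_fns.items()
    }

# Hypothetical feature resolvers:
def spend_30d():
    return transactions.amount.sum()   # reads an upstream "transactions" source

def risk_score():
    return 0.3 * spend_30d() + 0.7 * account_age

print(build_dag({"spend_30d": spend_30d, "risk_score": risk_score}))
# {'spend_30d': set(), 'risk_score': {'spend_30d'}}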
Working with petabytes
The Zarr team shared major updates for Icechunk, which is something like Iceberg's scientific-computing cousin, further cementing the trend of separating storage from compute!
Virtual references let you point to byte ranges in existing files without copying data - imagine building a unified view over petabytes without moving a single byte!
Tom Nicholas from Earthmover demonstrated building massive datacubes from archival files. Instead of migrating petabytes of archival data to cloud-native formats, VirtualiZarr creates virtual layers that make existing files accessible as if they were cloud-optimized.
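A hedged sketch of the pattern (paths are placeholders, and exact arguments vary across VirtualiZarr versions):

import xarray as xr
from virtualizarr import open_virtual_dataset

# Read only metadata; record byte ranges that point back into the original files.
vds_2023 = open_virtual_dataset("s3://archive/temps_2023.nc")   # placeholder paths
vds_2024 = open_virtual_dataset("s3://archive/temps_2024.nc")

# Stitch the virtual datasets into one logical datacube; still no data is copied.
cube = xr.concat([vds_2023, vds_2024], dim="time", coords="minimal", compat="override")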
The key insight: most scientific data is stuck in pre-cloud formats. Instead of a costly migration, create virtual layers that make old data look cloud-native.
Notebooks and IDEs reimagined
Unlike traditional notebooks, Marimo eliminates hidden state by treating notebooks as directed acyclic graphs of cells — making execution predictable, reproducible, and collaborative. It’s a batteries-included environment that can replace Jupyter, Streamlit, ipywidgets, Papermill, and more, with features such as:
- Reactive and interactive workflows – run a cell and all dependent cells update automatically; bind sliders, tables, and plots to Python without callbacks.
- Reproducible, shareable, and Git-friendly – deterministic execution, .py-based storage, built-in package management, and deployment as interactive apps or slides.
- Easily export and remix your notebooks into a presentation, web app, or script!
Together, these capabilities make Marimo a modern, end-to-end notebook environment built for both experimentation and production.
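Here is a small sketch of what "reactive" means in practice. Marimo stores notebooks as plain Python files, and the two cells below are implicitly wired together because the second one reads n.value:

import marimo as mo

# Cell 1: an interactive slider; displaying it renders the widget.
n = mo.ui.slider(1, 100, value=10, label="sample size")
n

# Cell 2: reads n.value, so marimo re-runs it automatically whenever the slider moves.
import random
sample = [random.random() for _ in range(n.value)]
mo.md(f"Mean of {n.value} draws: {sum(sample) / len(sample):.3f}")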
Meanwhile, Posit (makers of RStudio) announced a new IDE built on VS Code's foundation but optimized for data workflows:
- Integrated data viewers (not just print statements)
- Variable explorers that understand dataframes
- Native plot panes
- Message: Data scientists deserve specialized tools
Positron bridges the gap between lightweight notebooks and full IDEs, giving data scientists an environment that’s both powerful and purpose-built. It’s a reminder that general-purpose editors aren’t enough—data science thrives with tools designed for exploration, visualization, and analysis.
Elvis has been using the editor since the announcement and has really enjoyed an AI-native notebook experience with the same powerful UX as VS Code.
The art of going fast
Brodie Vidrine from NOAA shared hard-won lessons from migrating their Global Summary of the Month product from Java to Polars, scoring an impressive 80% performance improvement. When Polars alone wasn't enough for complex business logic, they dropped down to Numba for custom compiled functions.
import polars as pl

# tmin_vals, tmax_vals, min_qc_flags, max_qc_flags, and daily_averages are
# parallel lists of column names built elsewhere in the pipeline.
daily_average_expressions = []
for v, v2, qc, qc2, dv in zip(tmin_vals, tmax_vals, min_qc_flags, max_qc_flags, daily_averages):
    daily_average_expressions.append(
        pl.when(
            pl.col("Element").is_in(["TAVG"])
            & ~pl.col(v).is_in([-9999.0])
            & pl.col(qc).is_in([""])
            & ~pl.col(v2).is_in([-9999.0])
            & pl.col(qc2).is_in([""])
        )
        .then(((pl.col(v) + pl.col(v2)) / 2) * 0.1)
        .otherwise(-9999.0)
        .alias(dv)
    )
Imagine trying to dynamically build queries that compare over a hundred columns in raw SQL!
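And when expression-based APIs run out of road, the Numba escape hatch they mentioned looks roughly like this (a generic illustration, not NOAA's actual code):

import numba
import numpy as np

@numba.njit
def monthly_mean_with_qc(values, qc_ok, missing=-9999.0):
    # Average only the values whose QC flag is clean and that aren't the missing sentinel.
    total, count = 0.0, 0
    for i in range(values.shape[0]):
        if qc_ok[i] and values[i] != missing:
            total += values[i]
            count += 1
    return total / count if count else missing

vals = np.array([12.0, -9999.0, 14.5, 13.2])
ok = np.array([True, True, True, False])
print(monthly_mean_with_qc(vals, ok))   # 13.25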
What this means for production ML
SciPy 2025 marked a turning point. Python can now hold its own for enterprise workloads and has the potential to become the primary user interface for interacting with production systems.
Key trends include:
- Unified execution: Write once, run anywhere (Chalk & Ibis)
- Virtual data: Reference, don't copy (Icechunk)
- Reactive development: Reproducible from the start (Marimo & Positron)
The tools and ideas we saw in Tacoma weren’t incremental improvements — they’re signposts for where data and scientific computing are headed next.
You can watch Elliot's talk here.