SciPy 2025 Recap

Linda Zhou - Marketing Manager
Elvis Kahoro - Developer Advocate
August 13, 2025

The Chalk team travelled up to Tacoma for SciPy 2025! Here’s what caught our attention.

The composable Python stack has arrived

Deepyaman Datta (a Kedro maintainer) demonstrated just how far the Python community has come! It’s now feasible to build a pure Python, composable data stack.

Kedro + Ibis + dlt = Production pipelines in Python

  • dlt > Scaffolding for ETL pipelines with tens of thousands of sources and destinations
  • kedro > Data pipeline and orchestration framework
  • ibis > DataFrame library that can execute on over 20 backends, e.g. SQL engines

With these three, you can extract and load massive datasets (dlt), orchestrate transformations and models (Kedro), and push computation down to the database or warehouse without leaving Python (Ibis).

# ibis
t.filter(t.a > 1).select("a", "b", "c")

Turns into this SQL expression:

SELECT a, b, c FROM t WHERE a > 1

Tack on DuckDB and you can build a lightweight data warehouse that runs locally.
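To make the pushdown idea concrete, here is a minimal sketch that uses only the standard library. The `Query` class is a hypothetical stand-in, not Ibis's API, and SQLite stands in for DuckDB or a warehouse. The point is that the chained calls only describe the query; SQL is generated and run by the database when results are requested.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical deferred query: records operations, runs nothing."""
    table: str
    columns: tuple = ("*",)
    predicates: tuple = ()

    def filter(self, predicate: str) -> "Query":
        # Predicates are raw SQL strings for brevity; a real library
        # builds typed expressions instead.
        return Query(self.table, self.columns, self.predicates + (predicate,))

    def select(self, *columns: str) -> "Query":
        return Query(self.table, columns, self.predicates)

    def to_sql(self) -> str:
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql

    def execute(self, con):
        # Only here does any work happen, and it happens in the database.
        return con.execute(self.to_sql()).fetchall()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT, c REAL)")
con.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "x", 0.5), (2, "y", 1.5), (3, "z", 2.5)],
)

query = Query("t").filter("a > 1").select("a", "b", "c")
sql = query.to_sql()
rows = query.execute(con)
print(sql)
print(rows)
```

Because the filter and projection travel to the engine as SQL, only the matching rows and columns come back to Python.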

Pandera + Ibis = Data validation at warehouse scale

  • No more pulling data to memory for checks
  • Schema validation runs where your data lives
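One way to picture validation running where the data lives: express each check as an aggregate query, so only a count of violations leaves the database. The sketch below uses SQLite from the standard library as a stand-in backend. The `readings` table, the `checks` mapping, and the `count_violations` helper are illustrative names, not Pandera's or Ibis's API.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (station TEXT, temp_c REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("A", 21.5), ("B", -3.0), (None, 19.0), ("C", 999.0)],
)

# Each check is a SQL predicate that every row is expected to satisfy.
checks = {
    "station_not_null": "station IS NOT NULL",
    "temp_in_range": "temp_c BETWEEN -90 AND 60",
}


def count_violations(con, table, predicate):
    """Count rows failing the predicate; the rows never leave the database."""
    # COALESCE treats a NULL predicate result as a failure.
    sql = f"SELECT COUNT(*) FROM {table} WHERE NOT COALESCE(({predicate}), 0)"
    return con.execute(sql).fetchone()[0]


report = {
    name: count_violations(con, "readings", predicate)
    for name, predicate in checks.items()
}
print(report)
```

The Python side receives one integer per check, regardless of how many rows the table holds.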

GPUs go native

Christopher Lamb from NVIDIA explained how CUDA now has native Python support. No more C++ gatekeeping for GPU programming.

The ecosystem now offers multiple paths to GPU acceleration. Through RAPIDS, cuDF provides true drop-in pandas acceleration, while Polars users get GPU power through the new cudf-polars backend.

On that note: RAPIDS cuDF is making substantial progress integrating with Velox, the execution engine from Meta that we optimize for online inference. Exciting times for those of us building high-performance data systems!

The <10ms Challenge

Our co-founder Elliot went down a different path, where fast inference is achieved through intelligent system design rather than raw compute power. His talk, "Real-time ML: Accelerating Python for inference at scale," tackled a specific challenge: How do you achieve sub-10ms latency while allowing teams to maintain the development velocity of Python?

His key insights:

  • Parse Python's AST to build a computation DAG
  • Push filters and projections all the way down to the data source
  • Execute in Velox (Meta's unified execution engine) for massive parallelization
  • Functions that can't be transpiled (e.g. LLM calls) run in isolated processes with Ray
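As a rough illustration of the first bullet, Python's standard `ast` module is enough to turn a function's assignments into a dependency graph. This is a simplified sketch, not Chalk's implementation: the `feature` function and the `build_dag` helper are made up for the example, and only simple single-target assignments are handled.

```python
import ast
import textwrap

source = textwrap.dedent("""
    def feature(amount, balance):
        ratio = amount / balance
        capped = min(ratio, 1.0)
        score = capped * 100
        return score
""")


def build_dag(src):
    """Map each assigned name to the earlier names its expression reads."""
    func = ast.parse(src).body[0]
    known = {arg.arg for arg in func.args.args}  # function inputs are the roots
    dag = {}
    for stmt in func.body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            names = {
                node.id
                for node in ast.walk(stmt.value)
                if isinstance(node, ast.Name)
            }
            target = stmt.targets[0].id
            dag[target] = names & known  # drops builtins such as min
            known.add(target)
    return dag


dag = build_dag(source)
for name, deps in dag.items():
    print(name, "<-", sorted(deps))
```

With the graph in hand, a planner can see that `score` depends only on `capped`, which is the kind of information needed to prune unused work or push operations toward the data source.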

Working with petabytes

The Zarr team shared major updates for Icechunk, which is like Iceberg's scientific computing cousin, further cementing the trend of separating storage from compute!

Virtual references let you point to byte ranges in existing files without copying data - imagine building a unified view over petabytes without moving a single byte!
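The idea can be sketched in a few lines of standard-library Python. The manifest layout below is hypothetical and is not Icechunk's or VirtualiZarr's actual format, and a local file stands in for object storage, where the same lookup would become a ranged read.

```python
import os
import tempfile

# Write a stand-in "archival" file holding a header and two chunks.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "archive.bin")
with open(path, "wb") as f:
    f.write(b"HEADER--")           # 8 bytes of format-specific header
    f.write(bytes(range(0, 4)))    # chunk 0: offset 8, length 4
    f.write(bytes(range(10, 14)))  # chunk 1: offset 12, length 4

# The manifest stores references only: (file, offset, length) per chunk key.
manifest = {
    "temperature/0": (path, 8, 4),
    "temperature/1": (path, 12, 4),
}


def read_chunk(key):
    """Resolve a virtual reference by reading just the referenced byte range."""
    file_path, offset, length = manifest[key]
    with open(file_path, "rb") as f:
        f.seek(offset)
        return f.read(length)


chunk = read_chunk("temperature/1")
print(list(chunk))
```

The archive itself is never rewritten or copied; the manifest is the only new artifact, and it is tiny compared with the data it describes.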

Tom Nicholas from Earthmover demonstrated building massive datacubes from archival files. Instead of migrating petabytes of archival data to cloud-native formats, VirtualiZarr creates virtual layers that make existing files accessible as if they were cloud-optimized.

The key insight: most scientific data is stuck in pre-cloud formats. Instead of a costly migration, create virtual layers that make old data look cloud-native.

Notebooks and IDEs reimagined

Unlike traditional notebooks, Marimo eliminates hidden state by treating notebooks as directed acyclic graphs of cells — making execution predictable, reproducible, and collaborative. It’s a batteries-included environment that can replace Jupyter, Streamlit, ipywidgets, Papermill, and more, with features such as:

  • Reactive and interactive workflows – run a cell and all dependent cells update automatically; bind sliders, tables, and plots to Python without callbacks.
  • Reproducible, shareable, and Git-friendly – deterministic execution, .py-based storage, built-in package management, and deployment as interactive apps or slides.
  • Easily export and remix your notebooks into a presentation, web app, or script!
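To show how treating a notebook as a graph of cells enables reactive execution, here is a simplified sketch that is not Marimo's implementation. The `cells` dictionary and both helpers are hypothetical. It tracks top-level names only and assumes each variable is assigned in exactly one cell.

```python
import ast
from graphlib import TopologicalSorter

# Hypothetical notebook: cell name -> source code.
cells = {
    "load": "data = [1, 2, 3, 4]",
    "scale": "factor = 10",
    "total": "total = sum(data) * factor",
    "report": "message = f'total={total}'",
}


def defs_and_refs(src):
    """Return (names assigned, names read) by a cell, via static analysis."""
    defs, refs = set(), set()
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.Name):
            (defs if isinstance(node.ctx, ast.Store) else refs).add(node.id)
    return defs, refs


def cells_to_rerun(cells, changed):
    """List the cells downstream of `changed`, in a valid execution order."""
    info = {name: defs_and_refs(src) for name, src in cells.items()}
    owner = {var: name for name, (defs, _) in info.items() for var in defs}
    # graph maps each cell to the set of cells it depends on.
    graph = {
        name: {owner[r] for r in refs if r in owner and owner[r] != name}
        for name, (_, refs) in info.items()
    }
    order = list(TopologicalSorter(graph).static_order())
    stale = {changed}
    for name in order:  # dependencies always come before dependents
        if graph[name] & stale:
            stale.add(name)
    return [name for name in order if name in stale and name != changed]


print(cells_to_rerun(cells, "scale"))
```

Editing the `scale` cell marks `total` and then `report` as stale, while `load` is left alone. No cell can silently hold an outdated value, which is the sense in which hidden state is eliminated.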

Together, these capabilities make Marimo a modern, end-to-end notebook environment built for both experimentation and production.

Meanwhile, Posit (makers of RStudio) announced Positron, a new IDE built on VS Code's foundation but optimized for data workflows:

  • Integrated data viewers (not just print statements)
  • Variable explorers that understand dataframes
  • Native plot panes
  • Message: Data scientists deserve specialized tools

Positron bridges the gap between lightweight notebooks and full IDEs, giving data scientists an environment that’s both powerful and purpose-built. It’s a reminder that general-purpose editors aren’t enough—data science thrives with tools designed for exploration, visualization, and analysis.

Elvis has been using the editor since the announcement and has really enjoyed its AI-native notebook experience, with the same powerful UX as VS Code.

The art of going fast

Brodie Vidrine from NOAA shared hard-won lessons from migrating their Global Summary of the Month from Java to Polars, which yielded an impressive 80% performance improvement. When Polars wasn't enough for complex business logic, they dropped down to Numba for custom compiled functions.

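import polars as pl

# tmin_vals, tmax_vals, min_qc_flags, max_qc_flags, and daily_averages are
# parallel lists of column names, defined elsewhere in the pipeline.
for v, v2, qc, qc2, dv in zip(tmin_vals, tmax_vals, min_qc_flags, max_qc_flags, daily_averages):
    daily_average_expressions.append(
        pl.when(
            pl.col("Element").is_in(["TAVG"])
            # -9999.0 marks a missing value; an empty QC flag means the value passed.
            & ~pl.col(v).is_in([-9999.0])
            & pl.col(qc).is_in([""])
            & ~pl.col(v2).is_in([-9999.0])
            & pl.col(qc2).is_in([""])
        ).then(
            # Average the two values, then scale by 0.1.
            ((pl.col(v) + pl.col(v2)) / 2) * 0.1
        ).otherwise(
            -9999.0
        ).alias(dv)
    )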

Imagine dynamically building queries that compare over a hundred columns with just SQL!

What this means for production ML

SciPy 2025 marked a turning point. Python can now hold its own for enterprise workloads and has the potential to become the primary interface for interacting with production systems.

Key trends include:

  • Unified execution: Write once, run anywhere (Chalk & Ibis)
  • Virtual data: Reference, don't copy (Icechunk)
  • Reactive development: Reproducible from the start (Marimo & Positron)

The tools and ideas we saw in Tacoma weren’t incremental improvements — they’re signposts for where data and scientific computing are headed next.

Elliot in his element presenting at SciPy 2025

You can watch Elliot's talk here.
