_decode_cf_datetime_dtype triggers 2 store reads per time variable — expensive on remote stores #11303

@aladinor

Description

Problem

`_decode_cf_datetime_dtype` (`xarray/coding/times.py:339-371`) reads the first and last element of every time-encoded variable during `open_dataset` / `open_datatree` to infer the decoded dtype. On local backends these reads are memory-mapped and effectively free; on remote stores (zarr + S3, zarr + icechunk) each read is a round-trip through `zarr.sync()` → asyncio, even when chunks are already cached in the backend.

Cost scales as 2 × N_time_variables reads per open.
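The read pattern can be sketched with a stand-in for a lazy backend array (not the actual xarray code; `CountingArray` and `infer_dtype_by_sampling` are hypothetical names for illustration):

```python
import numpy as np

class CountingArray:
    """Stand-in for a lazy backend array: every indexing op is one store read."""
    def __init__(self, data):
        self._data = np.asarray(data)
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1  # on zarr + S3 this would be a network round-trip
        return self._data[key]

def infer_dtype_by_sampling(arr):
    # Mirrors the _decode_cf_datetime_dtype pattern: read (and try to decode)
    # the first and last raw values to pick between datetime64[ns] and object.
    first, last = arr[0], arr[-1]  # two store reads per time variable
    return np.dtype("datetime64[ns]")

arr = CountingArray(np.arange(1_000, dtype="int64"))
infer_dtype_by_sampling(arr)
print(arr.reads)  # → 2
```

On a local memory-mapped file those two reads cost nothing; on a remote store each one is a full request.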

Downstream impact: this hits any workflow that opens many time-encoded groups from remote zarr — radar (CfRadial2: 1 time per sweep, 100+ sweeps per volume), CMIP model archives, satellite time-series, Pangeo-style DataTrees. Interactive notebooks, dashboards, and tile servers pay the cost on every open.

Impact — cProfile on main, 107-group DataTree (nexrad-arco/KLOT icechunk, 106 time variables)

Total `open_datatree` time: 77.56 s

| Function | cumtime | share |
| --- | --- | --- |
| `_decode_cf_datetime_dtype` | 35.17 s | 45% |
| └─ `first_n_items` | 25.16 s | 32% |
| └─ `last_item` | 9.51 s | 12% |
| └─ `decode_cf_datetime` (actual decode) | 0.09 s | 0.1% |

The CPU cost of CF decoding is negligible (0.09 s out of 35.17 s); essentially all of the 35 s is I/O round-trip overhead.
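A back-of-envelope check on these numbers (assuming the reads are serialized, one round-trip each) gives a plausible per-request S3 latency:

```python
# Sanity check: does the profiled I/O time match 2 reads per time variable?
n_time_vars = 106
reads = 2 * n_time_vars                # first + last element per variable
io_seconds = 25.16 + 9.51              # first_n_items + last_item cumtime
per_read_ms = io_seconds / reads * 1000
print(f"{reads} reads, ~{per_read_ms:.0f} ms each")  # → 212 reads, ~164 ms each
```

~160 ms per read is consistent with serialized S3 GET latency, supporting the claim that the cost is round-trips, not decode work.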

Reproducer (public icechunk store, anonymous S3)

```python
import icechunk
import xarray as xr
from time import time

storage = icechunk.s3_storage(
    bucket="nexrad-arco", prefix="KLOT", region="us-east-1", anonymous=True,
)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session("main")

start = time()
dtree = xr.open_datatree(
    session.store, engine="zarr", zarr_format=3, consolidated=False, chunks={},
)
print(f"open_datatree: {time() - start:.1f}s")
```

Why the reads exist

The comment at `times.py:346` (added in 2018) explains: "Verify that at least the first and last date can be decoded successfully. Otherwise, tracebacks end up swallowed by Dataset.__repr__."

Empirically, that failure mode is no longer reproducible on current xarray: the modern repr shows "..." for lazy arrays without triggering a decode. But the reads serve a second, undocumented purpose: with `use_cftime=None` and chunked data, `decode_cf_datetime` may return either `datetime64[ns]` or `object` (cftime), depending on whether the values overflow the pandas nanosecond range. Dask's `map_blocks(func, array, dtype=dtype)` casts its output to the declared dtype, so a wrong declaration silently corrupts out-of-range values.
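The corruption mode is easy to demonstrate without dask or cftime: numpy silently wraps out-of-range `datetime64` unit conversions, which is exactly what a wrongly declared `dtype=datetime64[ns]` would do to a block of post-2262 dates (minimal illustration, not xarray code):

```python
import numpy as np

# A date beyond the datetime64[ns] range (~1677 to ~2262). Casting it to
# nanosecond precision overflows int64 and wraps silently instead of
# raising -- the value is corrupted with no error.
far_future = np.datetime64("2500-01-01", "D")
wrapped = far_future.astype("datetime64[ns]")
print(far_future, "->", wrapped)  # the ns value is a wrapped, wrong date
```

This is why the dtype inference cannot simply be dropped: the declared dtype must match what the decode will actually produce, or overflowing values are mangled silently.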
