Problem
_decode_cf_datetime_dtype (xarray/coding/times.py:339-371) reads the first and last element of every time-encoded variable during open_dataset / open_datatree to infer the decoded dtype. On local backends these reads are memory-mapped and free; on remote stores (zarr + S3, zarr + icechunk) each read is a round-trip through zarr.sync() → asyncio, even when chunks are already cached in the backend.
Cost scales as 2 × N_time_variables reads per open.
Downstream impact: this hits any workflow that opens many time-encoded groups from remote zarr — radar (CfRadial2: 1 time per sweep, 100+ sweeps per volume), CMIP model archives, satellite time-series, Pangeo-style DataTrees. Interactive notebooks, dashboards, and tile servers pay the cost on every open.
Impact — cProfile on main, 107-group DataTree (nexrad-arco/KLOT icechunk, 106 time variables)
total open_datatree time: 77.56s
| Function | cumtime | share |
| --- | --- | --- |
| _decode_cf_datetime_dtype | 35.17s | 45% |
| └─ first_n_items | 25.16s | 32% |
| └─ last_item | 9.51s | 12% |
| └─ decode_cf_datetime (actual decode) | 0.09s | 0.1% |
The CPU cost of CF decoding is negligible. The 35 seconds are 100% I/O round-trip overhead.
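A back-of-envelope check on the profiled numbers (106 time variables, first + last element each) shows the per-round-trip latency this implies:

```python
# Back-of-envelope from the profile above.
n_time_vars = 106
reads = 2 * n_time_vars          # first + last element of each time variable
cumtime_s = 35.17                # _decode_cf_datetime_dtype cumtime
per_read_ms = cumtime_s / reads * 1000
print(f"{reads} reads, ~{per_read_ms:.0f} ms per round-trip")  # 212 reads, ~166 ms
```

Roughly 166 ms per element read is consistent with one S3 round-trip each, which is why the cost dominates remote opens but is invisible on local backends.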
Reproducer (public icechunk store, anonymous S3)
```python
import icechunk
import xarray as xr
from time import time

storage = icechunk.s3_storage(
    bucket="nexrad-arco", prefix="KLOT", region="us-east-1", anonymous=True,
)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session("main")

start = time()
dtree = xr.open_datatree(
    session.store, engine="zarr", zarr_format=3,
    consolidated=False, chunks={},
)
print(f"open_datatree: {time() - start:.1f}s")
```
Why the reads exist
The comment at times.py:346 (2018) explains: "Verify that at least the first and last date can be decoded successfully. Otherwise, tracebacks end up swallowed by Dataset.__repr__."
Empirically, this is no longer reproducible on current xarray: the modern repr returns "..." for lazy arrays without triggering a decode. But there is a second, undocumented purpose. With use_cftime=None and chunked data, decode_cf_datetime may return either datetime64[ns] or object (cftime), depending on whether any values overflow the pandas nanosecond range; dask's map_blocks(func, array, dtype=dtype) casts its output to the declared dtype, so a wrong declaration silently corrupts overflow values.
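The overflow hazard is easy to demonstrate with plain numpy (a toy illustration, not xarray's actual code path): datetime64[ns] spans only ~1677-09-21 through 2262-04-11, and casting a date outside that window wraps the underlying int64 silently rather than raising.

```python
import numpy as np

# datetime64[ns] covers roughly 1677-09-21 through 2262-04-11.
good = np.datetime64("2100-01-01").astype("datetime64[ns]")  # representable
bad = np.datetime64("2300-01-01").astype("datetime64[ns]")   # overflows int64

print(good)  # a valid 2100 timestamp
print(bad)   # silently wrapped to a bogus pre-1970 date, no error raised
```

This is exactly the failure mode the eager first/last-element reads guard against: if the declared dtype is datetime64[ns] but the data actually needs cftime objects, the cast corrupts values instead of erroring.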