
[Feature] Switch from numpy void() to frombuffer() - requires major release #984

Draft
jan-janssen wants to merge 3 commits into main from npfrombuffer

Conversation


@jan-janssen (Member) commented May 8, 2026

np.void is a fixed-size raw byte scalar. Its size is stored internally using NumPy's npy_intp (a Py_ssize_t-like type), but in practice some NumPy scalar/array paths still hit a ~2 GiB signed 32-bit limit (2**31 - 1) for a single element or scalar buffer. So yes: the limit you are hitting is plausibly on the order of 2 GB, and it is not a cloudpickle limit.
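
For reference, the pattern that runs into this limit stores the entire pickle as a single opaque np.void scalar. A minimal sketch of that old approach (obj stands for any picklable object, as in the example below):

import cloudpickle
import numpy as np
import h5py

obj = ...
blob = cloudpickle.dumps(obj)

with h5py.File("x.h5", "w") as f:
    # The whole pickle becomes one np.void scalar; this single-element
    # buffer is what hits the ~2 GiB limit for large objects.
    f.create_dataset("pickle", data=np.void(blob))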

For large pickles, don’t store the whole pickle as one np.void. Store it as a byte array dataset instead:

import cloudpickle
import numpy as np
import h5py

obj = ...  # any picklable Python object
blob = cloudpickle.dumps(obj)

with h5py.File("x.h5", "w") as f:
    f.create_dataset(
        "pickle",
        data=np.frombuffer(blob, dtype=np.uint8),
        compression="gzip",  # optional
    )

Read it back:

with h5py.File("x.h5", "r") as f:
    blob = f["pickle"][()].tobytes()

obj = cloudpickle.loads(blob)
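
Note that h5py applies the gzip filter transparently on read, so this loading code is identical whether or not compression was enabled when writing.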

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of missing optional data groups, ensuring operations complete successfully with appropriate default values when data is absent.
  • Chores

    • Enhanced data serialization with improved storage efficiency and robustness in data persistence operations across all data retrieval pathways.


coderabbitai Bot commented May 8, 2026


📝 Walkthrough

This PR updates HDF5 serialization in executorlib from np.void(...) encoding to uint8 byte arrays with gzip compression. The dump() function now writes pickled objects as compressed byte arrays; load() and all accessor functions (get_output, get_runtime, get_queue_id, _get_content_of_file) deserialize using the matching [()].tobytes() + cloudpickle.loads() pattern.

Changes

HDF5 Serialization Format Upgrade

  • HDF5 Dump Serialization (src/executorlib/standalone/hdf.py): dump() now encodes all mapped groups as compressed uint8 byte arrays via np.frombuffer(cloudpickle.dumps(...), dtype=np.uint8) instead of np.void(...).
  • HDF5 Load Deserialization (src/executorlib/standalone/hdf.py): load() deserializes datasets using hdf["/key"][()].tobytes() + cloudpickle.loads(), adds explicit defaults for optional groups (args, kwargs, resource_dict, error_log_file) when missing, and preserves the TypeError for an absent function.
  • Accessor Function Deserialization (src/executorlib/standalone/hdf.py): get_output(), get_runtime(), get_queue_id(), and _get_content_of_file() apply the same [()].tobytes() + cloudpickle.loads() pattern to deserialize individual stored components.
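
For illustration, a minimal sketch of the shared read pattern these accessors apply; the helper name read_pickled_entry and the default-on-missing behavior are assumptions for the sketch, not taken from the diff:

import cloudpickle
import h5py

def read_pickled_entry(file_name, key, default=None):
    # Hypothetical helper showing the [()].tobytes() + cloudpickle.loads()
    # pattern; h5py decompresses the gzip-filtered dataset transparently.
    with h5py.File(file_name, "r") as hdf:
        if key in hdf:
            return cloudpickle.loads(hdf["/" + key][()].tobytes())
    return default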

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • pyiron/executorlib#775: Both PRs modify the same HDF5 serialization paths and functions (dump, load, get_output, get_runtime, get_queue_id, _get_content_of_file) in executorlib/standalone/hdf.py.

Poem

🐰 A rabbit hops through data streams,
Byte arrays dance in HDF dreams,
Cloudpickle's compression, gzip's embrace,
Serialization finds a faster place! ✨

🚥 Pre-merge checks | ✅ 5 passed

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage — ✅ Passed: docstring coverage is 85.71%, above the required threshold of 80.00%.
  • Linked Issues Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Title Check — ✅ Passed: the title clearly identifies the main technical change, switching from numpy void() to frombuffer() for HDF5 storage, and correctly signals that it requires a major release.


@jan-janssen jan-janssen marked this pull request as draft May 8, 2026 09:47
@jan-janssen changed the title from "Switch from numpy void() to frombuffer()" to "[Feature] Switch from numpy void() to frombuffer() - requires major release" May 8, 2026

codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.16%. Comparing base (4a63cb1) to head (c65a65b).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #984   +/-   ##
=======================================
  Coverage   94.16%   94.16%           
=======================================
  Files          39       39           
  Lines        2092     2092           
=======================================
  Hits         1970     1970           
  Misses        122      122           

☔ View full report in Codecov by Sentry.


@coderabbitai (bot) left a comment


🧹 Nitpick comments (1)
src/executorlib/standalone/hdf.py (1)

39-43: ⚡ Quick win

Extract shared pickle/HDF encode-decode helpers to prevent format drift.

The serialization/deserialization expression is duplicated across many sites; centralizing it will make future format changes safer.

Proposed refactor
@@
 import cloudpickle
 import h5py
 import numpy as np
@@
+def _serialize_to_uint8(value: Any) -> np.ndarray:
+    return np.frombuffer(cloudpickle.dumps(value), dtype=np.uint8)
+
+
+def _deserialize_from_key(hdf: h5py.File, key: str) -> Any:
+    return cloudpickle.loads(hdf[f"/{key}"][()].tobytes())
+
+
 def dump(file_name: Optional[str], data_dict: dict) -> None:
@@
                     fname.create_dataset(
                         name="/" + group_dict[data_key],
-                        data=np.frombuffer(
-                            cloudpickle.dumps(data_value), dtype=np.uint8
-                        ),
+                        data=_serialize_to_uint8(data_value),
                         compression="gzip",
                     )
@@
-            data_dict["fn"] = cloudpickle.loads(hdf["/function"][()].tobytes())
+            data_dict["fn"] = _deserialize_from_key(hdf, "function")
@@
-            data_dict["args"] = cloudpickle.loads(hdf["/input_args"][()].tobytes())
+            data_dict["args"] = _deserialize_from_key(hdf, "input_args")

Also applies to: 59-79, 97-99, 126-126, 144-144, 230-230
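
With such helpers in place, every call site shares a single encode/decode contract, so a later change to the on-disk format (for example, a different HDF5 compression filter) would touch one function instead of each call site.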

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `src/executorlib/standalone/hdf.py` around lines 39-43, replace the repeated
serialization/deserialization expression (cloudpickle.dumps(...) wrapped with
np.frombuffer(..., dtype=np.uint8) and compression="gzip") with two shared
helpers—e.g., encode_pickle_for_hdf(obj) that returns the uint8 ndarray ready to
write to HDF and returns any needed metadata, and
decode_pickle_from_hdf(uint8_array) that calls cloudpickle.loads on the buffer
when reading; update every site currently doing cloudpickle.dumps +
np.frombuffer + compression="gzip" (the duplicated expression) to call
encode_pickle_for_hdf when writing and decode_pickle_from_hdf when reading so
all places use a single implementation and a single compression/format contract.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4d5ec173-654e-4ee6-90a8-c48f7fcbfa04

📥 Commits

Reviewing files that changed from the base of the PR and between 4a63cb1 and c65a65b.

📒 Files selected for processing (1)
  • src/executorlib/standalone/hdf.py

