
[Feature] Switch from numpy void() to frombuffer() - requires major release #984

Draft
jan-janssen wants to merge 3 commits into main from npfrombuffer

Conversation


@jan-janssen (Member) commented May 8, 2026

np.void is a fixed-size raw byte scalar. Its size is stored internally using NumPy's npy_intp (a Py_ssize_t-like type), but in practice some NumPy scalar/array paths still hit a ~2 GiB signed 32-bit limit (2**31 - 1) for a single element or scalar buffer. So yes: the limit you are hitting is plausibly on the order of 2 GB, and it is not a cloudpickle limit.
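
For reference, the pattern that runs into this limit stores the entire pickle as a single opaque np.void scalar. A minimal sketch of that old approach (obj stands for any picklable object, as in the example below):

import cloudpickle
import numpy as np
import h5py

obj = ...
blob = cloudpickle.dumps(obj)

with h5py.File("x.h5", "w") as f:
    # The whole pickle becomes one np.void scalar; this single-element
    # buffer is what hits the ~2 GiB limit for large objects.
    f.create_dataset("pickle", data=np.void(blob))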

For large pickles, don’t store the whole pickle as one np.void. Store it as a byte array dataset instead:

import cloudpickle
import numpy as np
import h5py

obj = ...  # any picklable Python object
blob = cloudpickle.dumps(obj)

with h5py.File("x.h5", "w") as f:
    f.create_dataset(
        "pickle",
        data=np.frombuffer(blob, dtype=np.uint8),
        compression="gzip",  # optional
    )

Read it back:

with h5py.File("x.h5", "r") as f:
    blob = f["pickle"][()].tobytes()

obj = cloudpickle.loads(blob)
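
Note that h5py applies the gzip filter transparently on read, so this loading code is identical whether or not compression was enabled when writing.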

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of missing optional data groups, ensuring operations complete successfully with appropriate default values when data is absent.
  • Chores

    • Enhanced data serialization with improved storage efficiency and robustness in data persistence operations across all data retrieval pathways.


coderabbitai Bot commented May 8, 2026


📝 Walkthrough

This PR updates HDF5 serialization in executorlib from np.void(...) encoding to uint8 byte arrays with gzip compression. The dump() function now writes pickled objects as compressed byte arrays; load() and all accessor functions (get_output, get_runtime, get_queue_id, _get_content_of_file) deserialize using the matching [()].tobytes() + cloudpickle.loads() pattern.

Changes

HDF5 Serialization Format Upgrade

  • HDF5 Dump Serialization (src/executorlib/standalone/hdf.py): dump() now encodes all mapped groups as compressed uint8 byte arrays via np.frombuffer(cloudpickle.dumps(...), dtype=np.uint8) instead of np.void(...).
  • HDF5 Load Deserialization (src/executorlib/standalone/hdf.py): load() deserializes datasets using hdf["/key"][()].tobytes() + cloudpickle.loads(), adds explicit defaults for optional groups (args, kwargs, resource_dict, error_log_file) when missing, and preserves the TypeError for an absent function.
  • Accessor Function Deserialization (src/executorlib/standalone/hdf.py): get_output(), get_runtime(), get_queue_id(), and _get_content_of_file() apply the same [()].tobytes() + cloudpickle.loads() pattern to deserialize individual stored components.
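
For illustration, a minimal sketch of the shared read pattern these accessors apply; the helper name read_pickled_entry and the default-on-missing behavior are assumptions for the sketch, not taken from the diff:

import cloudpickle
import h5py

def read_pickled_entry(file_name, key, default=None):
    # Hypothetical helper showing the [()].tobytes() + cloudpickle.loads()
    # pattern; h5py decompresses the gzip-filtered dataset transparently.
    with h5py.File(file_name, "r") as hdf:
        if key in hdf:
            return cloudpickle.loads(hdf["/" + key][()].tobytes())
    return default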

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • pyiron/executorlib#775: Both PRs modify the same HDF5 serialization paths and functions (dump, load, get_output, get_runtime, get_queue_id, _get_content_of_file) in executorlib/standalone/hdf.py.

Poem

🐰 A rabbit hops through data streams,
Byte arrays dance in HDF dreams,
Cloudpickle's compression, gzip's embrace,
Serialization finds a faster place! ✨

🚥 Pre-merge checks | ✅ 5 passed

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage — ✅ Passed: docstring coverage is 85.71%, above the required threshold of 80.00%.
  • Linked Issues Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Title Check — ✅ Passed: the title clearly identifies the main technical change, switching from numpy void() to frombuffer() for HDF5 storage, and correctly signals that it requires a major release.


@jan-janssen jan-janssen marked this pull request as draft May 8, 2026 09:47
@jan-janssen changed the title from "Switch from numpy void() to frombuffer()" to "[Feature] Switch from numpy void() to frombuffer() - requires major release" May 8, 2026

codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.16%. Comparing base (4a63cb1) to head (c65a65b).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #984   +/-   ##
=======================================
  Coverage   94.16%   94.16%           
=======================================
  Files          39       39           
  Lines        2092     2092           
=======================================
  Hits         1970     1970           
  Misses        122      122           

☔ View full report in Codecov by Sentry.


@coderabbitai (bot) left a comment


🧹 Nitpick comments (1)
src/executorlib/standalone/hdf.py (1)

39-43: ⚡ Quick win

Extract shared pickle/HDF encode-decode helpers to prevent format drift.

The serialization/deserialization expression is duplicated across many sites; centralizing it will make future format changes safer.

Proposed refactor
@@
 import cloudpickle
 import h5py
 import numpy as np
@@
+def _serialize_to_uint8(value: Any) -> np.ndarray:
+    return np.frombuffer(cloudpickle.dumps(value), dtype=np.uint8)
+
+
+def _deserialize_from_key(hdf: h5py.File, key: str) -> Any:
+    return cloudpickle.loads(hdf[f"/{key}"][()].tobytes())
+
+
 def dump(file_name: Optional[str], data_dict: dict) -> None:
@@
                     fname.create_dataset(
                         name="/" + group_dict[data_key],
-                        data=np.frombuffer(
-                            cloudpickle.dumps(data_value), dtype=np.uint8
-                        ),
+                        data=_serialize_to_uint8(data_value),
                         compression="gzip",
                     )
@@
-            data_dict["fn"] = cloudpickle.loads(hdf["/function"][()].tobytes())
+            data_dict["fn"] = _deserialize_from_key(hdf, "function")
@@
-            data_dict["args"] = cloudpickle.loads(hdf["/input_args"][()].tobytes())
+            data_dict["args"] = _deserialize_from_key(hdf, "input_args")

Also applies to: 59-79, 97-99, 126-126, 144-144, 230-230
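
With such helpers in place, every call site shares a single encode/decode contract, so a later change to the on-disk format (for example, a different HDF5 compression filter) would touch one function instead of each call site.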

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `src/executorlib/standalone/hdf.py` around lines 39-43, replace the repeated
serialization/deserialization expression (cloudpickle.dumps(...) wrapped with
np.frombuffer(..., dtype=np.uint8) and compression="gzip") with two shared
helpers—e.g., encode_pickle_for_hdf(obj) that returns the uint8 ndarray ready to
write to HDF and returns any needed metadata, and
decode_pickle_from_hdf(uint8_array) that calls cloudpickle.loads on the buffer
when reading; update every site currently doing cloudpickle.dumps +
np.frombuffer + compression="gzip" (the duplicated expression) to call
encode_pickle_for_hdf when writing and decode_pickle_from_hdf when reading so
all places use a single implementation and a single compression/format contract.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4d5ec173-654e-4ee6-90a8-c48f7fcbfa04

📥 Commits

Reviewing files that changed from the base of the PR and between 4a63cb1 and c65a65b.

📒 Files selected for processing (1)
  • src/executorlib/standalone/hdf.py

