[ENH] V1 → V2 API Migration - datasets#1608
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1608 +/- ##
===========================================
+ Coverage 54.64% 80.85% +26.21%
===========================================
Files 63 63
Lines 5124 5495 +371
===========================================
+ Hits 2800 4443 +1643
+ Misses 2324 1052 -1272
|
|
FYI @geetu040 Currently the
Issues:
Example: current
def _get_dataset_features_file(did_cache_dir: str | Path | None, dataset_id: int) -> dict[int, OpenMLDataFeature]:
    return _features
Or by updating the Dataset class to use the underlying interface method from api_context directly.
def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)
Another option is to add |
| bool | ||
| True if the deletion was successful. False otherwise. | ||
| """ | ||
| return openml.utils._delete_entity("data", dataset_id) |
if you implement the delete logic yourself instead of openml.utils._delete_entity, how would that look? I think it would be better.
Makes sense, it would look like a delete request from the client along with exception handling.
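A minimal sketch of what such a delete could look like, assuming a hypothetical HTTP client with a `delete(path)` method. `ServerError`, `HTTPClient`, `FakeClient`, the `data/{id}` path, and the status codes are illustrative assumptions, not the SDK's confirmed API:

```python
from dataclasses import dataclass
from typing import Protocol


class ServerError(Exception):
    """Hypothetical error raised by the client for non-success responses."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


class HTTPClient(Protocol):
    """Hypothetical client interface; only `delete` is needed here."""

    def delete(self, path: str) -> int: ...


@dataclass
class DatasetResource:
    http: HTTPClient

    def delete(self, dataset_id: int) -> bool:
        """Return True if the deletion was successful, False otherwise."""
        try:
            status = self.http.delete(f"data/{dataset_id}")
        except ServerError as err:
            if err.code == 404:  # nothing to delete
                return False
            raise  # other failures (e.g. permissions) stay visible to the caller
        return status == 200


class FakeClient:
    """In-memory stand-in so the sketch runs without a server."""

    def __init__(self, existing: set[int]) -> None:
        self.existing = existing

    def delete(self, path: str) -> int:
        dataset_id = int(path.rsplit("/", 1)[1])
        if dataset_id not in self.existing:
            raise ServerError(404, "dataset not found")
        self.existing.remove(dataset_id)
        return 200


resource = DatasetResource(http=FakeClient(existing={61}))
first_attempt = resource.delete(61)   # dataset exists, so it is removed
second_attempt = resource.delete(61)  # already gone
```

Which errors map to `False` and which propagate is a design choice; the sketch only swallows the "not found" case.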
| def list( | ||
| self, | ||
| limit: int, | ||
| offset: int, | ||
| **kwargs: Any, | ||
| ) -> pd.DataFrame: |
same as above, it can use private helper methods
| # Minimalistic check if the XML is useful | ||
| if "oml:data_qualities_list" not in qualities: | ||
| raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"') | ||
| from openml._api import api_context |
can't we have this import at the very top? does it create a circular import error? if not, it should be moved to the top in all functions.
It does raise a circular import error.
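For context, a function-level import avoids the cycle because it is resolved at call time, after both modules have finished loading. A small self-contained sketch with two throwaway modules (`demo_ctx` and `demo_funcs` are made-up names standing in for the context module and the functions module):

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path


def build_demo_modules(root: Path) -> None:
    # demo_ctx imports demo_funcs at module level ...
    (root / "demo_ctx.py").write_text(textwrap.dedent('''
        import demo_funcs

        BACKEND_NAME = "v1"
    '''))
    # ... so demo_funcs must not import demo_ctx at module level.
    (root / "demo_funcs.py").write_text(textwrap.dedent('''
        def backend_name() -> str:
            # Deferred import: runs only when the function is called.
            from demo_ctx import BACKEND_NAME
            return BACKEND_NAME
    '''))


with tempfile.TemporaryDirectory() as tmp:
    build_demo_modules(Path(tmp))
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        demo_funcs = importlib.import_module("demo_funcs")
        importlib.import_module("demo_ctx")
        result = demo_funcs.backend_name()
    finally:
        sys.path.remove(tmp)
```

The cost is that the import statement is repeated in every function that needs it, which is what the review comment is pointing at.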
Thanks for the detailed explanation, I now have a good understanding of the download mechanism.
minio can be handled easily, we will use a separate client along with
these are actually different objects in both apis, v1 uses xml and v2 keeps them in json
yes you are right, they are the same files, which are not required to be downloaded again for both versions, but isn't this true for almost all the http objects? they may have different format
I don't understand this point
agreed, should be handled by
agreed, adding in conclusion, I may ask, if we ignore the fact that it downloads the |
|
@geetu040 making a new client for
FYI, the new commit adds better handling of features and qualities in the OpenMLDataset class, moving the v1-specific parsing logic to the interface. So the only part left is to handle
|
|
From the standup discussion and earlier conversations, I think we can agree on a few points:
Consider this a green light to experiment with the client design. Try an approach, use whatever caching strategy you think fits best, and aim for a clean, sensible design. Feel free to ask for suggestions or reviews along the way. I'll review it in code. Just make sure this doesn't significantly impact the base classes or other stacked PRs. |
|
The points do make sense to me, I will propose the design along with how it would be used in the resource. |
|
@geetu040 I have a design implemented which needs reviews
Question:
|
I have taken a quick look, the design looks really good, though I have some suggestions/questions in the code, which I will leave in a detailed review. But this in general fixes all our blockers without damaging the original design.
Is it provided by the user? I don't think so. In that case, how does it affect the users? From looking at the code, this cache directory is generated programmatically inside the functions, we can completely remove these lines and always rely on the
|
This makes sense now. Having an independent download method, as I have set up, is better than updating requests/caching to return the path, right?
Yes, that would work, but the function definitions would change, and so would the tests etc. corresponding to them |
I am not sure about that, would require a detailed review
yes, that is expected |
| parsed_url = urllib.parse.urlparse(source) | ||
|
|
||
| # expect path format: /BUCKET/path/to/file.ext | ||
| _, bucket, *prefixes, _file = parsed_url.path.split("/") |
_file should be renamed to _ given it is never used; surprised ruff does not call it out.
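A minimal sketch of the unpacking with the unused name dropped. The function name `split_bucket_url` and the example URL are illustrative assumptions; only the path format `/BUCKET/path/to/file.ext` comes from the diff:

```python
from urllib.parse import urlparse


def split_bucket_url(source: str) -> tuple[str, str]:
    """Return (bucket, prefix) for a URL whose path looks like /BUCKET/path/to/file.ext."""
    parsed_url = urlparse(source)
    # The leading "/" yields an empty first element and the file name is not
    # needed, so both are bound to the throwaway name "_".
    _, bucket, *prefixes, _ = parsed_url.path.split("/")
    return bucket, "/".join(prefixes)


bucket, prefix = split_bucket_url("https://example.org/datasets/0000/0061/dataset_61.pq")
# bucket == "datasets", prefix == "0000/0061"
```

Binding `_` twice in one unpacking is valid Python; the second assignment simply overwrites the first.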
| Parameters | ||
| ---------- | ||
| data_id : list, optional | ||
| dataset_id : list, optional |
| def _get_dataset_parquet( | ||
| description: dict | OpenMLDataset, | ||
| cache_directory: Path | None = None, | ||
| cache_directory: Path | None = None, # noqa: ARG001 |
Why the ARG001 addition, I assume ruff checks were passing before this was added too?
Edit1: Just saw that this is not being used, why not remove it then?
Edit2: This is happening in more than 1 place, I won't mention it in all of them so we can just discuss it here.
Still in progress. This method is now updated to not use the param; I am working on removing it here and will update the corresponding test when I start writing tests for this PR, as mentioned in the meeting.
satvshr
left a comment
Left a few comments, will look at this again once it is actually ready for review with all implementations complete.
|
Is there an API_KEY that you are using to test the endpoints? I was going to run a test script for your code but I could not given I have no API_KEY for it. |
If you are talking about the invalid api_key regex match in v2 you can set this in the config |
Um, why are we discussing API keys here? Even if it's just for testing. |
geetu040
left a comment
SDK code looks good so far, please take a look at #1575 (comment) and make changes accordingly where needed.
all tests (existing and new) should pass to make sure we are retaining the original functionality of the sdk
| ) -> pd.DataFrame: ... | ||
|
|
||
| @abstractmethod | ||
| def delete(self, dataset_id: int) -> bool: ... |
you can remove it from here as well
see point 5 in #1575 (comment)
|
|
||
| did_cache_dir = _create_cache_directory_for_id( | ||
| DATASETS_CACHE_DIR_NAME, | ||
| return api_context.backend.datasets.get( |
to keep the functionality of force_refresh_cache, you can use reset_cache for http.get
see point 1 in #1575 (comment)
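A rough sketch of how a reset flag on a cached `get` could preserve the `force_refresh_cache` behaviour. `CachingClient`, `fake_fetch`, and the in-memory dictionary are stand-ins invented for illustration; the real client's signature and cache storage may differ:

```python
from typing import Callable


class CachingClient:
    """Hypothetical stand-in for the HTTP client; `fetch` plays the network call."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._cache: dict[str, str] = {}

    def get(self, path: str, *, enable_cache: bool = False, reset_cache: bool = False) -> str:
        # Serve from cache only when caching is on and no reset was requested.
        if enable_cache and not reset_cache and path in self._cache:
            return self._cache[path]
        body = self._fetch(path)
        if enable_cache:
            self._cache[path] = body  # overwrite any stale entry
        return body


calls: list[str] = []


def fake_fetch(path: str) -> str:
    calls.append(path)
    return f"response #{len(calls)} for {path}"


client = CachingClient(fake_fetch)


def get_dataset(dataset_id: int, force_refresh_cache: bool = False) -> str:
    return client.get(f"data/{dataset_id}", enable_cache=True, reset_cache=force_refresh_cache)


first = get_dataset(61)
cached = get_dataset(61)                               # served from cache
refreshed = get_dataset(61, force_refresh_cache=True)  # hits the "network" again
```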
|
@JATAYU000 The recent changes in the base branch fix the test failure in the Windows test job. Please update the base branch. |
| original_data_url: str | None = None, | ||
| paper_url: str | None = None, | ||
| ) -> int: | ||
| raise NotImplementedError(self._not_supported(method="edit")) |
You can just use self._not_supported(method="edit")
No need for raise NotImplementedError()
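The suggestion presumably relies on `_not_supported` raising on its own. A minimal sketch under that assumption; the class names, the message text, and the `api_version` attribute are hypothetical:

```python
from typing import NoReturn


class ResourceV2:
    """Hypothetical base class; only the helper pattern matters here."""

    api_version = "v2"

    def _not_supported(self, *, method: str) -> NoReturn:
        # The helper raises by itself, so callers do not wrap it in `raise ...`.
        raise NotImplementedError(
            f"{type(self).__name__}.{method}() is not supported in API {self.api_version}."
        )


class DatasetV2(ResourceV2):
    def edit(self, dataset_id: int) -> int:
        self._not_supported(method="edit")


try:
    DatasetV2().edit(61)
except NotImplementedError as err:
    message = str(err)
```

Annotating the helper as `NoReturn` lets type checkers accept `edit` even though it never returns an `int`.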
| from filelock import FileLock | ||
|
|
|
|
||
| features = _parse_features_xml(features_xml_string) | ||
|
|
||
| except FileNotFoundError: |
|
|
||
| def get_minio_download_path(self, url: str) -> str: | ||
| parsed_url = urlparse(url) | ||
| return os.path.join(self.get_cache_directory(), "minio", parsed_url.path.lstrip("/")) # noqa: PTH118 |
| if response.content.startswith(b"PK\x03\x04"): | ||
| return "body.zip" | ||
|
|
||
| try: | ||
| arff.loads(response.text) | ||
| return "body.arff" | ||
| except arff.ArffException: | ||
| pass |
Pull request overview
Copilot reviewed 15 out of 24 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
tests/test_datasets/test_dataset.py:16
pytest is imported twice in this module; the second import is redundant and should be removed.
from openml.datasets import OpenMLDataFeature, OpenMLDataset
from openml.testing import TestBase
import pytest
| import pandas as pd | ||
| import scipy.sparse | ||
| import xmltodict | ||
| from filelock import FileLock |
| if need_to_create_pickle or need_to_create_feather: | ||
| if self.data_file is None: | ||
| self._download_data() | ||
| cache_file = self.data_pickle_file if need_to_create_pickle else self.data_feather_file | ||
| lock_path = str(cache_file) + ".lock" | ||
| with FileLock(lock_path): | ||
| if self.data_file is None: |
| try: | ||
| with features_pickle_file.open("rb") as fh_binary: | ||
| return pickle.load(fh_binary) # type: ignore # noqa: S301 | ||
|
|
||
| except: # noqa: E722 | ||
| with Path(features_file).open("r", encoding="utf8") as fh: | ||
| features_xml_string = fh.read() | ||
|
|
||
| features = _parse_features_xml(features_xml_string) | ||
|
|
||
| except FileNotFoundError: | ||
| features = openml._backend.dataset.parse_features_file(features_file, features_pickle_file) | ||
| with features_pickle_file.open("wb") as fh_binary: | ||
| pickle.dump(features, fh_binary) | ||
|
|
||
| return features |
| try: | ||
| arff.loads(response.text) | ||
| return "body.arff" | ||
| except arff.ArffException: | ||
| pass |
this file has been removed but not replaced by another file, why is that?
This file is not used by any tests at the moment. There are some minio tests that need to be updated that might require this file to be moved; will keep this in mind.
Pull request overview
Copilot reviewed 15 out of 24 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
openml/_api/resources/dataset.py:924
- This has the same no-qualities regression as the V1 implementation: download_qualities_file() raises when the server reports no qualities, even though get_qualities() handles that case by returning None. Handle the no-qualities response here or inside the download helper so get(..., download_qualities=True) does not fail for datasets without qualities.
qualities_file = self.download_qualities_file(dataset_id)
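One way the no-qualities case could be absorbed, sketched with stand-ins: `NoQualitiesError`, the `available` mapping, and `get_qualities_file` are hypothetical names, and the real helper may signal the condition differently:

```python
from pathlib import Path
from typing import Optional


class NoQualitiesError(Exception):
    """Hypothetical error for 'server reports no qualities for this dataset'."""


def download_qualities_file(dataset_id: int, available: dict[int, Path]) -> Path:
    """Stand-in for the download helper: raises when the dataset has no qualities."""
    try:
        return available[dataset_id]
    except KeyError:
        raise NoQualitiesError(f"No qualities found for dataset {dataset_id}") from None


def get_qualities_file(dataset_id: int, available: dict[int, Path]) -> Optional[Path]:
    """Wrapper for get(): a dataset without qualities yields None instead of failing."""
    try:
        return download_qualities_file(dataset_id, available)
    except NoQualitiesError:
        return None


available = {61: Path("cache/data/61/qualities.xml")}
with_qualities = get_qualities_file(61, available)
without_qualities = get_qualities_file(62, available)
```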
openml/_api/resources/dataset.py:916
force_refresh_cache is only applied to the dataset JSON request. Feature, quality, and ARFF downloads later in this method still read from existing cached responses, so forcing a refresh can still return stale metadata or ARFF data. Propagate the refresh flag to those helpers or invalidate the related cached files before downloading them.
response = self._http.get(path, enable_cache=True, refresh_cache=force_refresh_cache)
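A sketch of the propagation option, using a recording stub instead of a real client. `RecordingClient`, the endpoint paths, and the `refresh_cache` keyword on the download helpers are assumptions made for illustration:

```python
class RecordingClient:
    """Hypothetical HTTP client that only records how it was called."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, bool]] = []

    def get(self, path: str, *, enable_cache: bool = True, refresh_cache: bool = False) -> str:
        self.calls.append((path, refresh_cache))
        return f"<body of {path}>"


class DatasetResource:
    def __init__(self, http: RecordingClient) -> None:
        self._http = http

    def download_features_file(self, dataset_id: int, *, refresh_cache: bool = False) -> str:
        return self._http.get(f"data/features/{dataset_id}", refresh_cache=refresh_cache)

    def download_qualities_file(self, dataset_id: int, *, refresh_cache: bool = False) -> str:
        return self._http.get(f"data/qualities/{dataset_id}", refresh_cache=refresh_cache)

    def get(self, dataset_id: int, *, force_refresh_cache: bool = False) -> str:
        description = self._http.get(f"data/{dataset_id}", refresh_cache=force_refresh_cache)
        # Pass the same flag on, so no helper falls back to a stale cache entry.
        self.download_features_file(dataset_id, refresh_cache=force_refresh_cache)
        self.download_qualities_file(dataset_id, refresh_cache=force_refresh_cache)
        return description


http = RecordingClient()
DatasetResource(http).get(61, force_refresh_cache=True)
all_refreshed = all(refresh for _, refresh in http.calls)
```

The alternative named in the comment, invalidating the related cache entries up front, would keep the helper signatures unchanged.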
| if download_features_meta_data: | ||
| features_file = self.download_features_file(dataset_id) | ||
| if download_qualities: | ||
| qualities_file = self.download_qualities_file(dataset_id) |
| elif isinstance(description, OpenMLDataset): | ||
| assert description.url is not None | ||
| assert description.dataset_id is not None | ||
|
|
||
| url = description.url | ||
| did = int(description.dataset_id) | ||
| else: | ||
| raise TypeError("`description` should be either OpenMLDataset or Dict.") | ||
|
|
||
| try: | ||
| # save the file in cache and get it's path | ||
| self._http.get(url, enable_cache=True) |
Pull request overview
Copilot reviewed 15 out of 24 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
openml/_api/resources/dataset.py:930
- The V2 implementation has the same cache-refresh gap: only the dataset JSON is refreshed, while feature, quality, parquet, and ARFF downloads below can still be served from stale cache entries. Please pass the refresh flag through to those download paths or invalidate the related cache entries when force_refresh_cache=True.
response = self._http.get(path, enable_cache=True, refresh_cache=force_refresh_cache)
| elif isinstance(description, OpenMLDataset): | ||
| assert description.url is not None | ||
| assert description.dataset_id is not None | ||
|
|
||
| url = description.url | ||
| did = int(description.dataset_id) | ||
| else: | ||
| raise TypeError("`description` should be either OpenMLDataset or Dict.") | ||
|
|
||
| try: | ||
| # save the file in cache and get it's path | ||
| self._http.get(url, enable_cache=True) |
| self.download_minio_file( | ||
| source=source.rsplit("/", 1)[0] | ||
| + "/" | ||
| + file_object.object_name.rsplit("/", 1)[1], |
Metadata
Fixes [ENH] V1 → V2 API Migration - datasets #1592
Depends on: [ENH] V1 → V2 API Migration - core structure #1576
Change Log Entry: This PR implements the Datasets resource and refactors its existing functions.