[ENH] V1 → V2 API Migration - datasets by JATAYU000 · Pull Request #1608 · openml/openml-python

JATAYU000 · 2026-01-08T10:30:37Z

Metadata

Fixes [ENH] V1 → V2 API Migration - datasets #1592
Depends on: [ENH] V1 → V2 API Migration - core structure #1576
Change Log Entry: This PR implements the Datasets resource and refactors its existing functions.

codecov-commenter · 2026-01-08T10:36:04Z

Codecov Report

❌ Patch coverage is 78.12018% with 142 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.85%. Comparing base (c62bf51) to head (a1c75f2).

Files with missing lines	Patch %	Lines
openml/_api/resources/dataset.py	78.89%	103 Missing ⚠️
openml/_api/clients/minio.py	53.33%	28 Missing ⚠️
openml/datasets/dataset.py	86.00%	7 Missing ⚠️
openml/_api/clients/http.py	87.50%	3 Missing ⚠️
openml/datasets/functions.py	94.44%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1608       +/-   ##
===========================================
+ Coverage   54.64%   80.85%   +26.21%     
===========================================
  Files          63       63               
  Lines        5124     5495      +371     
===========================================
+ Hits         2800     4443     +1643     
+ Misses       2324     1052     -1272

JATAYU000 · 2026-01-09T05:08:45Z

FYI @geetu040 Currently the get_dataset() function has 3 download requirements:

download_data : uses api_calls._download_minio_bucket() to download all the files in the bucket if the download_all_files param is True, and api_calls._download_minio_file() to download the dataset.pq file if it is not found in the cache. When the parquet download fails, it falls back to downloading the dataset.arff file with a GET request.
download_features : if feature_file is passed via init, the features are extracted from it during initialization; otherwise a GET request is made and the XML is cached.
download_qualities : if qualities_file is passed via init, the qualities are extracted from it during initialization; otherwise a GET request is made and the XML is cached.
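
The parquet-then-ARFF fallback described for download_data can be sketched roughly as below. This is a minimal sketch, not the code in this PR: `fetch_data_file`, `download_parquet`, and `download_arff` are hypothetical names standing in for the MinIO and HTTP helpers, and catching `OSError` is an assumption about which failure triggers the fallback.

```python
from pathlib import Path
from typing import Callable


def fetch_data_file(
    cache_dir: Path,
    download_parquet: Callable[[Path], Path],
    download_arff: Callable[[Path], Path],
) -> Path:
    """Return the data file path, preferring parquet and falling back to ARFF."""
    parquet_path = cache_dir / "dataset.pq"
    # Reuse the cached parquet file if it is already present.
    if parquet_path.exists():
        return parquet_path
    try:
        return download_parquet(parquet_path)
    except OSError:
        # Assumption: a failed parquet download surfaces as OSError.
        return download_arff(cache_dir / "dataset.arff")
```

The two downloaders are passed in as callables so the fallback order can be exercised without touching the network.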

Issues:

The data files .pq and .arff are common to both API versions, so it doesn't make sense to download them multiple times.
Path handling: the download needs to return the path, especially for the data files. As mentioned in the meeting, I can try a download-specific class which uses the cache mixin and is inherited only by the dataset resource.
The current implementation in OpenMLDataset has v1-specific parsing, which in my opinion should use the current interface (api_context).

Example:

current load_features() ref link
This calls a function which downloads a file and returns its path, and then the features are parsed from that file path.
This can be changed by changing that function's definition (ref link) to get -> parse -> return features instead of a file path.

def _get_dataset_features_file(did_cache_dir: str | Path | None, dataset_id: int) -> dict[int, OpenMLDataFeature]:
    ...
    return _features

Or by updating the Dataset class to use the underlying interface method from api_context directly.

def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)

Another option is to add a return_path parameter to the client requests, which in my opinion would be wasteful: it adds a param to every client method just for the dataset resource, and the problem can be handled without it as described above.

geetu040

Left an intermediate review. This is solid work and well done overall. Nice job. I'll look into the download part now.

geetu040 · 2026-01-13T17:43:53Z

+        bool
+            True if the deletion was successful. False otherwise.
+        """
+        return openml.utils._delete_entity("data", dataset_id)


If you implemented the delete logic yourself instead of using openml.utils._delete_entity, how would that look? I think it would be better.

Makes sense. It would look like a delete request from the client along with exception handling.
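
A delete implemented directly on the resource might look roughly like the sketch below. It is only an illustration under stated assumptions: `FakeHTTPClient` and `FakeResponse` are hypothetical stand-ins for the PR's HTTPClient, the `data/{id}` path is illustrative, and mapping a failed request to `False` is one possible reading of the docstring ("True if the deletion was successful. False otherwise.").

```python
class FakeResponse:
    """Hypothetical minimal response object."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


class FakeHTTPClient:
    """Hypothetical stand-in for the HTTP client; keeps dataset ids in memory."""

    def __init__(self, existing_ids) -> None:
        self._existing = set(existing_ids)

    def delete(self, path: str) -> FakeResponse:
        dataset_id = int(path.rsplit("/", 1)[1])
        if dataset_id not in self._existing:
            raise LookupError(f"dataset {dataset_id} not found")
        self._existing.remove(dataset_id)
        return FakeResponse(200)


class DatasetResource:
    def __init__(self, http) -> None:
        self._http = http

    def delete(self, dataset_id: int) -> bool:
        """Send a delete request and translate failures into a boolean."""
        try:
            response = self._http.delete(f"data/{dataset_id}")
        except LookupError:
            # Assumption: an unknown id is reported as a failed deletion.
            return False
        return response.status_code == 200
```

A real implementation would more likely re-raise server errors as the library's own exception types; the boolean mapping here just keeps the sketch short.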

geetu040 · 2026-01-13T17:43:57Z

+    def list(
+        self,
+        limit: int,
+        offset: int,
+        **kwargs: Any,
+    ) -> pd.DataFrame:


same as above, it can use private helper methods

geetu040 · 2026-01-13T17:44:00Z

-    # Minimalistic check if the XML is useful
-    if "oml:data_qualities_list" not in qualities:
-        raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"')
+    from openml._api import api_context


Can't we have this import at the very top? Does it create a circular import error? If not, it should be moved to the top in all functions.

It does raise a circular import error.

geetu040 · 2026-01-14T09:16:26Z

FYI @geetu040 Currently the get_dataset() function has 3 download requirements

Thanks for the detailed explanation; I now have a good understanding of the download mechanism.

download_data : uses api_calls._download_minio_bucket() to download all the files in the bucket if the download_all_files param is True, and api_calls._download_minio_file() to download the dataset.pq file if it is not found in the cache. When the parquet download fails, it falls back to downloading the dataset.arff file with a GET request.

MinIO can be handled easily: we will use a separate client alongside HTTPClient, or implement its methods in HTTPClient, which work independently of the API version.

download_features : if feature_file is passed via init, the features are extracted from it during initialization; otherwise a GET request is made and the XML is cached.

download_qualities : if qualities_file is passed via init, the qualities are extracted from it during initialization; otherwise a GET request is made and the XML is cached.

These are actually different objects in the two APIs: v1 uses XML and v2 keeps them in JSON.

The data files .pq and .arff are common to both API versions, so it doesn't make sense to download them multiple times.

Yes, you are right, they are the same files and do not need to be downloaded again for both versions. But isn't this true for almost all the HTTP objects? They may have a different format (XML or JSON) and a slightly different structure, but once parsed most are identical, so shouldn't this rule be applied to all the responses?

Path handling: the download needs to return the path, especially for the data files. As mentioned in the meeting, I can try a download-specific class which uses the cache mixin and is inherited only by the dataset resource.

I don't understand this point.

The current implementation in OpenMLDataset has v1-specific parsing, which in my opinion should use the current interface (api_context).

Agreed, this should be handled by api_context.

Another option is to add a return_path parameter to the client requests, which in my opinion would be wasteful: it adds a param to every client method just for the dataset resource, and the problem can be handled without it as described above.

Agreed, adding return_path for just one specific method of one resource is not preferred.

In conclusion, let me ask: if we ignore the fact that it downloads the .arff files for both versions separately, does everything else work out smoothly without any blockers? I think ignoring this part is not really bad, because conceptually this rule could be applied to almost every other response object.

JATAYU000 · 2026-01-15T05:20:18Z

@geetu040 Making a new client for MinIO handles just the parquet file; we would still need to migrate download_text_file() for the ARFF file (this is also used by tasks and runs).
So maybe we can have a DownloadClient which contains all of these, along with a save method which saves content to a specified path and hence also fixes our issue with the features/qualities paths?

FYI the new commit adds better handling of features and qualities in the OpenMLDataset class, moving the v1-specific parsing logic to the interface. So the only parts left to handle are:

return path of the saved file (features, qualities, arff, pq)
downloader for arff, or an implementation of download_text_file(), which is used for the arff download
minio file and bucket download for the pq file

geetu040 · 2026-01-15T16:24:02Z

From the standup discussion and earlier conversations, I think we can agree on a few points:

Have a separate client for MinIO interactions alongside HTTPClient. If we plan to add more providers in the future, such as Dropbox or Google Drive, we won't end up with too many changes; each provider's client is implemented as a separate class and just passed down to the relevant resource.
DownloadClient doesn't feel like the right abstraction; instead, implement download-specific methods directly in HTTPClient.

Consider this a green light to experiment with the client design. Try an approach, use whatever caching strategy you think fits best, and aim for a clean, sensible design. Feel free to ask for suggestions or reviews along the way. I'll review it in code. Just make sure this doesn't significantly impact the base classes or other stacked PRs.
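
The "separate client passed down to the relevant resource" idea can be sketched as below. This is a hypothetical illustration, not the PR's design: `StorageClient` and `InMemoryMinIOClient` are invented names, and `InMemoryMinIOClient` writes fixed bytes instead of talking to MinIO so the example runs offline.

```python
from pathlib import Path
from typing import Protocol


class StorageClient(Protocol):
    """Interface any storage provider client would satisfy (hypothetical)."""

    def download_file(self, source: str, destination: Path) -> Path: ...


class InMemoryMinIOClient:
    """Hypothetical stand-in: writes fixed bytes instead of contacting MinIO."""

    def download_file(self, source: str, destination: Path) -> Path:
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_bytes(b"parquet-bytes")
        return destination


class DatasetAPI:
    """Resource that receives its clients instead of constructing them."""

    def __init__(self, http: object, minio: StorageClient) -> None:
        self._http = http
        self._minio = minio

    def download_parquet(self, url: str, cache_dir: Path) -> Path:
        return self._minio.download_file(url, cache_dir / "dataset.pq")
```

Because the resource depends only on the small `StorageClient` interface, another provider could be swapped in without changing `DatasetAPI`.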

JATAYU000 · 2026-01-16T04:35:27Z

The points do make sense to me, I will propose the design along with how it would be used in the resource.

JATAYU000 · 2026-01-19T06:08:02Z

@geetu040 I have a design implemented which needs review.

MinIOClient, similar to HTTPClient, is used by DatasetAPI through self._minio. It implements 2 methods, download file and download bucket, and uses _get_cache_dir() for the destination.
A new download method implemented under HTTPClient can be used for features, qualities, and ARFF files, with the v1/v2-specific interface supplied through a handler callback.

Question:

Most method signatures include cache_directory. How should that be handled? If the directory is passed, use that, and if not, use our cache dir? I am not sure how this would affect existing users.
Also, the caching currently implemented suggests that the Response() is cached, but I remember you mentioning in a meeting that the respective files (.xml, .json) are cached. I am not sure about it; I have gone through the design as if the caching is done on the response.

geetu040 · 2026-01-19T07:40:19Z

@geetu040 I have a design implemented which needs reviews

I have taken a quick look, the design looks really good, though I have some suggestions/questions in the code, which I will leave in a detailed review. But this in general fixes all our blockers without damaging the original design.

Most method signatures include cache_directory. How should that be handled? If the directory is passed, use that, and if not, use our cache dir? I am not sure how this would affect existing users.

Is it provided by the user? I don't think so. In that case, how does it affect the users? From looking at the code, this cache directory is generated programmatically inside the functions, we can completely remove these lines and always rely on the CacheMixin class. How does that sound?

Also, the caching currently implemented suggests that the Response() is cached, but I remember you mentioning in a meeting that the respective files (.xml, .json) are cached. I am not sure about it; I have gone through the design as if the caching is done on the response.

CacheMixin._set_cache_response will look at the response object and extract json or xml content from it and save it respectively in .json and .xml files.
CacheMixin._get_cache_response will read these files (.json or .xml) from the given path and create a dummy Response object then fill it with status_code and content. Therefore a Response object will be returned.
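
The round trip described for CacheMixin can be sketched as follows. This is a simplified, hypothetical version and not the PR's code: `FakeResponse` stands in for `requests.Response`, the free functions stand in for the mixin methods, and the `body.json` / `body.xml` file names are an assumption.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in for requests.Response."""

    status_code: int
    content: bytes
    content_type: str


def set_cache_response(cache_dir: Path, response: FakeResponse) -> Path:
    """Store the body as body.json or body.xml depending on the content type."""
    suffix = "json" if "json" in response.content_type else "xml"
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"body.{suffix}"
    path.write_bytes(response.content)
    return path


def get_cache_response(cache_dir: Path) -> Optional[FakeResponse]:
    """Rebuild a response object from whichever cached body file exists."""
    for suffix, content_type in (("json", "application/json"), ("xml", "text/xml")):
        path = cache_dir / f"body.{suffix}"
        if path.exists():
            # The rebuilt object only carries a status code and the content.
            return FakeResponse(200, path.read_bytes(), content_type)
    return None
```

The caller always gets a response-like object back, so code that parses responses does not need to know whether the data came from the network or from disk.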

JATAYU000 · 2026-01-19T08:01:14Z

CacheMixin._set_cache_response will look at the response object and extract json or xml content from it and save it respectively in .json and .xml files.
CacheMixin._get_cache_response will read these files (.json or .xml) from the given path and create a dummy Response object then fill it with status_code and content. Therefore a Response object will be returned.

This makes sense now. Having an independent download method, as I have set up, is better than updating requests/caching to return a path, right?

we can completely remove these lines and always rely on the CacheMixin class. How does that sound?

Yes, that would work, but the function definitions would change, and so would the tests etc. corresponding to them.

geetu040 · 2026-01-19T08:12:51Z

This makes sense now. Having an independent download method, as I have set up, is better than updating requests/caching to return a path, right?

I am not sure about that; it would require a detailed review.

Yes, that would work, but the function definitions would change, and so would the tests etc. corresponding to them.

yes, that is expected

satvshr · 2026-01-20T14:45:48Z

+        parsed_url = urllib.parse.urlparse(source)
+
+        # expect path format: /BUCKET/path/to/file.ext
+        _, bucket, *prefixes, _file = parsed_url.path.split("/")


_file should be renamed to _ given it is never used; surprised ruff does not call it out.
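
For reference, the star-unpacking in the diff above binds the path segments as shown in this small sketch. `split_bucket_url` and the example URL are hypothetical; unlike the diff, the sketch returns the file name so that every binding is visible.

```python
from urllib.parse import urlparse


def split_bucket_url(source: str):
    """Split a URL whose path has the form /BUCKET/path/to/file.ext."""
    parsed_url = urlparse(source)
    # The path starts with "/", so the first element of the split is "".
    _, bucket, *prefixes, filename = parsed_url.path.split("/")
    return bucket, "/".join(prefixes), filename
```

For a URL such as `https://example.org/datasets/0000/0061/dataset_61.pq`, this yields the bucket `datasets`, the prefix `0000/0061`, and the file name `dataset_61.pq`. Ruff's default dummy-variable pattern accepts underscore-prefixed names, which is likely why `_file` is not reported.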

satvshr · 2026-01-20T14:49:10Z

    Parameters
    ----------
-    data_id : list, optional
+    dataset_id : list, optional


Should be data_id

satvshr · 2026-01-20T14:54:04Z

 def _get_dataset_parquet(
    description: dict | OpenMLDataset,
-    cache_directory: Path | None = None,
+    cache_directory: Path | None = None,  # noqa: ARG001


Why the ARG001 addition, I assume ruff checks were passing before this was added too?

Edit1: Just saw that this is not being used, why not remove it then?

Edit2: This is happening in more than 1 place, I won't mention it in all of them so we can just discuss it here.

Still a work in progress. This method has been updated to not use the param; I am working on removing it here and will update the corresponding test when I start writing tests for this PR, as mentioned in the meeting.

satvshr

Left a few comments, will look at this again once it is actually ready for review with all implementations complete.

satvshr · 2026-01-20T16:04:35Z

Is there an API_KEY that you are using to test the endpoints? I was going to run a test script for your code but I could not given I have no API_KEY for it.

JATAYU000 · 2026-01-20T16:09:35Z

Is there an API_KEY that you are using to test the endpoints? I was going to run a test script for your code but I could not given I have no API_KEY for it.

If you are talking about the invalid api_key regex match in v2 you can set this in the config

key="AD000000000000000000000000000000"

SubhadityaMukherjee · 2026-01-27T15:40:07Z

Is there an API_KEY that you are using to test the endpoints? I was going to run a test script for your code but I could not given I have no API_KEY for it.

If you are talking about the invalid api_key regex match in v2 you can set this in the config
key="AD000000000000000000000000000000"

Um, why are we discussing API keys here? Even if it's just for testing.

geetu040

SDK code looks good so far; please take a look at #1575 (comment) and make changes accordingly where needed.
All tests (existing and new) should pass to make sure we are retaining the original functionality of the SDK.

geetu040 · 2026-01-30T07:25:28Z

+    ) -> pd.DataFrame: ...
+
+    @abstractmethod
+    def delete(self, dataset_id: int) -> bool: ...


you can remove it from here as well
see point 5 in #1575 (comment)

geetu040 · 2026-01-30T07:25:31Z


-    did_cache_dir = _create_cache_directory_for_id(
-        DATASETS_CACHE_DIR_NAME,
+    return api_context.backend.datasets.get(


to keep the functionality of force_refresh_cache, you can use reset_cache for http.get
see point 1 in #1575 (comment)

SimonBlanke · 2026-02-01T19:50:02Z

@JATAYU000 The recent changes in the base branch fix the test failure in the Windows test job. Please update the base branch.

EmanAbdelhaleem · 2026-02-04T17:50:38Z

+        original_data_url: str | None = None,
+        paper_url: str | None = None,
+    ) -> int:
+        raise NotImplementedError(self._not_supported(method="edit"))


You can just use self._not_supported(method="edit")
No need for raise NotImplementedError()
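
The suggestion works if the helper raises on its own, as in the sketch below. This is an assumption about how `_not_supported` behaves, not a copy of the base class in #1576: `ResourceBase`, `DatasetV2`, and the message format are hypothetical.

```python
from typing import NoReturn


class ResourceBase:
    api_version = "v2"

    def _not_supported(self, *, method: str) -> NoReturn:
        """Raise NotImplementedError with a uniform message (assumed behavior)."""
        raise NotImplementedError(
            f"{type(self).__name__}.{method} is not supported in API {self.api_version}"
        )


class DatasetV2(ResourceBase):
    def edit(self, dataset_id: int) -> int:
        # No explicit `raise` needed: the helper never returns.
        self._not_supported(method="edit")
```

Annotating the helper with `NoReturn` lets type checkers accept `edit` even though it has no return statement.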

Copilot

Pull request overview

Copilot reviewed 15 out of 24 changed files in this pull request and generated 4 comments.

+from filelock import FileLock



-
-        features = _parse_features_xml(features_xml_string)
-
+    except FileNotFoundError:



+    def get_minio_download_path(self, url: str) -> str:
+        parsed_url = urlparse(url)
+        return os.path.join(self.get_cache_directory(), "minio", parsed_url.path.lstrip("/"))  #  noqa: PTH118


+        if response.content.startswith(b"PK\x03\x04"):
+            return "body.zip"
+
+        try:
+            arff.loads(response.text)
+            return "body.arff"
+        except arff.ArffException:
+            pass


Copilot

Pull request overview

Copilot reviewed 15 out of 24 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

tests/test_datasets/test_dataset.py:16

pytest is imported twice in this module; the second import is redundant and should be removed.

from openml.datasets import OpenMLDataFeature, OpenMLDataset
from openml.testing import TestBase

import pytest

 import pandas as pd
 import scipy.sparse
-import xmltodict
+from filelock import FileLock


        if need_to_create_pickle or need_to_create_feather:
-            if self.data_file is None:
-                self._download_data()
+            cache_file = self.data_pickle_file if need_to_create_pickle else self.data_feather_file
+            lock_path = str(cache_file) + ".lock"
+            with FileLock(lock_path):
+                if self.data_file is None:


    try:
        with features_pickle_file.open("rb") as fh_binary:
            return pickle.load(fh_binary)  # type: ignore  # noqa: S301

-    except:  # noqa: E722
-        with Path(features_file).open("r", encoding="utf8") as fh:
-            features_xml_string = fh.read()
-
-        features = _parse_features_xml(features_xml_string)
-
+    except FileNotFoundError:
+        features = openml._backend.dataset.parse_features_file(features_file, features_pickle_file)
        with features_pickle_file.open("wb") as fh_binary:
            pickle.dump(features, fh_binary)
-
        return features


+        try:
+            arff.loads(response.text)
+            return "body.arff"
+        except arff.ArffException:
+            pass


geetu040 · 2026-05-09T06:23:53Z

this file has been removed but not replaced by another file, why is that?

This file is not used by any tests at the moment. There are some MinIO tests that need to be updated, which might require this file to be moved. Will keep this in mind.

Copilot

Pull request overview

Copilot reviewed 15 out of 24 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

openml/_api/resources/dataset.py:924

This has the same no-qualities regression as the V1 implementation: download_qualities_file() raises when the server reports no qualities, even though get_qualities() handles that case by returning None. Handle the no-qualities response here or inside the download helper so get(..., download_qualities=True) does not fail for datasets without qualities.

                qualities_file = self.download_qualities_file(dataset_id)
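
One way to handle the no-qualities case is to wrap the download and translate the error into `None`, as sketched here. The names are hypothetical: `NoQualitiesError` stands in for whatever exception the client raises when the server reports no qualities, and `download_qualities_or_none` is not a function from this PR.

```python
from pathlib import Path
from typing import Callable, Optional


class NoQualitiesError(Exception):
    """Hypothetical error for a server reply saying the dataset has no qualities."""


def download_qualities_or_none(
    download: Callable[[int], Path], dataset_id: int
) -> Optional[Path]:
    """Return the qualities file path, or None when the dataset has none."""
    try:
        return download(dataset_id)
    except NoQualitiesError:
        # Mirror get_qualities(): missing qualities are not a failure.
        return None
```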

openml/_api/resources/dataset.py:916

force_refresh_cache is only applied to the dataset JSON request. Feature, quality, and ARFF downloads later in this method still read from existing cached responses, so forcing a refresh can still return stale metadata or ARFF data. Propagate the refresh flag to those helpers or invalidate the related cached files before downloading them.

            response = self._http.get(path, enable_cache=True, refresh_cache=force_refresh_cache)
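
Propagating the flag could look roughly like the sketch below, where every request made on behalf of one `get` call receives the same refresh setting. `get_dataset_files`, the `fetch` callable, and the request paths are illustrative assumptions rather than the PR's actual helpers.

```python
from typing import Callable, Dict


def get_dataset_files(
    fetch: Callable[[str, bool], bytes],
    dataset_id: int,
    force_refresh_cache: bool = False,
) -> Dict[str, bytes]:
    """Fetch description, features, and qualities with one shared refresh flag."""
    paths = {
        "description": f"data/{dataset_id}",
        "features": f"data/features/{dataset_id}",
        "qualities": f"data/qualities/{dataset_id}",
    }
    # Every request gets the same flag, so no part is served from stale cache.
    return {name: fetch(path, force_refresh_cache) for name, path in paths.items()}
```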

+            if download_features_meta_data:
+                features_file = self.download_features_file(dataset_id)
+            if download_qualities:
+                qualities_file = self.download_qualities_file(dataset_id)


+        elif isinstance(description, OpenMLDataset):
+            assert description.url is not None
+            assert description.dataset_id is not None
+
+            url = description.url
+            did = int(description.dataset_id)
+        else:
+            raise TypeError("`description` should be either OpenMLDataset or Dict.")
+
+        try:
+            # save the file in cache and get it's path
+            self._http.get(url, enable_cache=True)


Copilot

Pull request overview

Copilot reviewed 15 out of 24 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

openml/_api/resources/dataset.py:930

The V2 implementation has the same cache-refresh gap: only the dataset JSON is refreshed, while feature, quality, parquet, and ARFF downloads below can still be served from stale cache entries. Please pass the refresh flag through to those download paths or invalidate the related cache entries when force_refresh_cache=True.

            response = self._http.get(path, enable_cache=True, refresh_cache=force_refresh_cache)

+        elif isinstance(description, OpenMLDataset):
+            assert description.url is not None
+            assert description.dataset_id is not None
+
+            url = description.url
+            did = int(description.dataset_id)
+        else:
+            raise TypeError("`description` should be either OpenMLDataset or Dict.")
+
+        try:
+            # save the file in cache and get it's path
+            self._http.get(url, enable_cache=True)


+                self.download_minio_file(
+                    source=source.rsplit("/", 1)[0]
+                    + "/"
+                    + file_object.object_name.rsplit("/", 1)[1],


Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

geetu040 mentioned this pull request Jan 9, 2026

[ENH] V1 → V2 API Migration #1575

Open

18 tasks

geetu040 suggested changes Jan 13, 2026

geetu040 mentioned this pull request Jan 16, 2026

[ENH] V1 → V2 API Migration - Tasks #1611

Open

geetu040 assigned JATAYU000 Jan 19, 2026