Uploading datasets with string columns to openml via api fails #1123

Open
Louquinze opened this issue Nov 17, 2021 · 5 comments
Labels
bug Data OpenML concept

Comments

@Louquinze

Louquinze commented Nov 17, 2021

More informative Code to Reproduce

import requests
import pandas as pd
import openml
from openml.datasets.functions import create_dataset

# upload to test server
openml.config.start_using_configuration_for_example()

url = 'https://zenodo.org/record/3665663/files/dataset.csv?download=1'
r = requests.get(url)
with open('cybertroll.csv', 'wb') as f:
    f.write(r.content)
df = pd.read_csv("cybertroll.csv")
# Uncommenting the next line makes the upload succeed: it removes all non-word
# characters, including the backslashes at the end of lines.
# df['content'] = df['content'].str.replace(r'\W', '', regex=True)

cybertroll_dataset = create_dataset(
    name="Cybertroll",
    description="Tweets classified as aggressive or not to help fight trolls.",
    creator="Saima Sadiq",
    contributor=None,
    collection_date="11-17-2021",
    language="English",
    licence="Creative Commons Attribution 1.0 Generic",
    default_target_attribute="annotation",
    row_id_attribute=None,
    ignore_attribute=None,
    citation="dummy citation",
    attributes="auto",
    data=df,
    version_label="test",
    original_data_url="https://zenodo.org/record/3665663",
    paper_url="https://zenodo.org/record/3665663",
)

cybertroll_dataset.publish()
print(f"URL for dataset: {cybertroll_dataset.openml_url}")
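The commented-out workaround in the script above strips every non-word character, which also removes spaces and punctuation from the tweets. A narrower variant, sketched here under the assumption that only trailing backslashes trigger the server error, removes just those. The helper name `strip_trailing_backslashes` and the sample strings are hypothetical and not part of openml-python.

```python
def strip_trailing_backslashes(values):
    """Return a list with trailing backslashes removed from each string.

    Non-string entries (for example None for missing text) pass through unchanged.
    """
    return [v.rstrip("\\") if isinstance(v, str) else v for v in values]


# Hypothetical sample values standing in for the 'content' column.
tweets = ["fine tweet", "ends with slashes\\\\", "keeps \\ inner", None]
cleaned = strip_trailing_backslashes(tweets)
print(cleaned)
```

On the DataFrame itself, `df["content"] = df["content"].str.rstrip("\\")` should have the same effect. Either way this modifies the data, so it is a workaround rather than a fix.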

Steps/Code to Reproduce

import pandas as pd

import openml
from openml.datasets.functions import create_dataset


openml.config.start_using_configuration_for_example()

# The error occurs only if the double backslash is at the end of the string.
# If you uncomment line 12 and delete line 13, the upload succeeds.

# df = pd.DataFrame({"X1": [1], "X2": [r"\\test"], "y": [1]}).astype({"X2": "string"})
df = pd.DataFrame({"X1": [1], "X2": [r"test\\"], "y": [1]}).astype({"X2": "string"})
dummy_dataset = create_dataset(
    name="DummyDataset",
    description="dummy dataset",
    creator="Lukas Strack",
    contributor=None,
    collection_date="11-17-2021",
    language="English",
    licence=None,
    default_target_attribute="y",
    row_id_attribute=None,
    ignore_attribute=None,
    citation="dummy citation",
    attributes="auto",
    data=df,
    version_label="test",
    original_data_url=None,
    paper_url=None,
)

dummy_dataset.publish()
print(f"URL for dataset: {dummy_dataset.openml_url}")

Expected Results

print(f"URL for dataset: {dummy_dataset.openml_url}")
should output something like
URL for dataset: https://test.openml.org/d/4005

Actual Results

/home/lukas/anaconda3/envs/Hackathon/lib/python3.9/site-packages/openml/config.py:177: UserWarning: Switching to the test server https://test.openml.org/api/v1/xml to not upload results to the live server. Using the test server may result in reduced performance of the API!
  warnings.warn(
Traceback (most recent call last):
  File "*working_directory*/upload_arff_error.py", line 33, in <module>
    dummy_dataset.publish()
  File "*conda_env_path*/lib/python3.9/site-packages/openml/base.py", line 130, in publish
    response_text = openml._api_calls._perform_api_call(
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 65, in _perform_api_call
    response = _read_url_files(url, data=data, file_elements=file_elements)
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 197, in _read_url_files
    response = _send_request(request_method="post", url=url, data=data, files=file_elements,)
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 248, in _send_request
    __check_response(response=response, url=url, file_elements=files)
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 295, in __check_response
    raise __parse_server_exception(response, url, file_elements=file_elements)
openml.exceptions.OpenMLServerException: https://test.openml.org/api/v1/xml/data/ returned code 145: Error parsing dataset ARFF file - Arff error in dataset file: missing trailing quote in string (l.9)
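
The last line of the traceback carries the server's error code and a line reference. As a rough sketch, both can be pulled out of the message text with regular expressions. The helper `parse_server_error` is hypothetical, and reading `(l.9)` as line 9 of the uploaded ARFF file is an assumption, not something the thread confirms.

```python
import re


def parse_server_error(message):
    """Extract (error_code, line_number) from a server error message.

    Either element is None if the corresponding pattern is not found.
    """
    code_match = re.search(r"returned code (\d+):", message)
    line_match = re.search(r"\(l\.(\d+)\)", message)
    code = int(code_match.group(1)) if code_match else None
    line = int(line_match.group(1)) if line_match else None
    return code, line


# Message text copied from the traceback above.
message = (
    "https://test.openml.org/api/v1/xml/data/ returned code 145: "
    "Error parsing dataset ARFF file - Arff error in dataset file: "
    "missing trailing quote in string (l.9)"
)
code, arff_line = parse_server_error(message)
print(code, arff_line)
```
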
@joaquinvanschoren
Contributor

Thanks for reporting! I'll transfer this to the issue tracker of the python API.

@joaquinvanschoren joaquinvanschoren transferred this issue from openml/OpenML Nov 17, 2021
@PGijsbers PGijsbers added the bug label Nov 17, 2021
@mfeurer
Collaborator

mfeurer commented Nov 18, 2021

Hi @joaquinvanschoren, this is 99.9% not a Python issue, as the error message is emitted by the server. The ARFF file that is produced and uploaded can be opened in WEKA without any issues, so we assume that the cause is the PHP upload checker (not the one in the example, as this is a minimal working example). I'll discuss with @Louquinze how to produce a more elaborate example that yields a full-blown ARFF file that can also be loaded in WEKA.

@Louquinze
Author

I edited the issue as Matthias suggested above.

@joaquinvanschoren
Contributor

This might be a fault in the PHP ARFF checker. Perhaps the double backslash is altered, which makes the check fail?
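
One way this hypothesis could play out is sketched below. It is a toy model, not the server's PHP code: `quote` is a hypothetical ARFF-style writer that escapes backslashes and single quotes, and `find_closing_quote` is a hypothetical scanner that can either treat `\\` as an escaped backslash or recognise only `\'` as an escape. In the second mode, the final backslash of a trailing `\\` pair appears to escape the closing quote, so the scanner never finds the end of the string.

```python
def quote(value):
    """Hypothetical writer: escape backslashes and single quotes, then wrap in single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def find_closing_quote(text, pair_backslashes):
    """Return the index of the closing quote of the quoted string at text[0], or -1.

    pair_backslashes=True: a backslash escapes whatever character follows it.
    pair_backslashes=False: only the sequence \\' is treated as an escape (naive).
    """
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if pair_backslashes or text[i + 1:i + 2] == "'":
                i += 2  # skip the escaped character
                continue
        elif ch == "'":
            return i
        i += 1
    return -1


trailing = quote("test\\\\")  # value from the failing example: test followed by two backslashes
leading = quote("\\\\test")   # value from the working example: two backslashes followed by test

print(find_closing_quote(trailing, pair_backslashes=True))
print(find_closing_quote(trailing, pair_backslashes=False))
print(find_closing_quote(leading, pair_backslashes=False))
```

In this model only the trailing-backslash value fails in the naive mode, which matches the behaviour reported in the issue. Whether the server actually behaves this way is unverified.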

@mfeurer
Collaborator

mfeurer commented Jun 7, 2022

We will re-evaluate this once Parquet-upload is available.

@mfeurer mfeurer added the Data OpenML concept label Feb 20, 2023