Uploading datasets with string columns to openml via api fails #1123

Louquinze · 2021-11-17T14:45:58Z

More informative Code to Reproduce

import requests
import pandas as pd
import openml
from openml.datasets.functions import create_dataset

# upload to test server
openml.config.start_using_configuration_for_example()

url = 'https://zenodo.org/record/3665663/files/dataset.csv?download=1'
r = requests.get(url)
with open('cybertroll.csv', 'wb') as f:
    f.write(r.content)
df = pd.read_csv("cybertroll.csv")
# Uncommenting the next line avoids the error: it removes every non-word
# character, so backslashes at the end of lines are deleted as well.
# df['content'] = df['content'].str.replace(r'\W', '', regex=True)

cybertroll_dataset = create_dataset(
    name="Cybertroll",
    description="Tweets classified as aggressive or not to help fight trolls.",
    creator="Saima Sadiq",
    contributor=None,
    collection_date="11-17-2021",
    language="English",
    licence="Creative Commons Attribution 1.0 Generic",
    default_target_attribute="annotation",
    row_id_attribute=None,
    ignore_attribute=None,
    citation="dummy citation",
    attributes="auto",
    data=df,
    version_label="test",
    original_data_url="https://zenodo.org/record/3665663",
    paper_url="https://zenodo.org/record/3665663",
)

cybertroll_dataset.publish()
print(f"URL for dataset: {cybertroll_dataset.openml_url}")

Steps/Code to Reproduce

import pandas as pd

import openml
from openml.datasets.functions import create_dataset


openml.config.start_using_configuration_for_example()

# The error occurs only if the double backslash is at the end of the string.
# If the commented-out DataFrame below is used instead of the active one, the upload is successful.

# df = pd.DataFrame({"X1": [1], "X2": [r"\\test"], "y": [1]}).astype({"X2": "string"})
df = pd.DataFrame({"X1": [1], "X2": [r"test\\"], "y": [1]}).astype({"X2": "string"})
dummy_dataset = create_dataset(
    name="DummyDataset",
    description="dummy dataset",
    creator="Lukas Strack",
    contributor=None,
    collection_date="11-17-2021",
    language="English",
    licence=None,
    default_target_attribute="y",
    row_id_attribute=None,
    ignore_attribute=None,
    citation="dummy citation",
    attributes="auto",
    data=df,
    version_label="test",
    original_data_url=None,
    paper_url=None,
)

dummy_dataset.publish()
print(f"URL for dataset: {dummy_dataset.openml_url}")
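The commented-out workaround in the first script removes every non-word character, which also strips spaces and punctuation from the tweets. A narrower alternative is sketched below, assuming that backslashes at the very end of a value are the only trigger. The minimal example suggests this but does not prove it. The helper name strip_trailing_backslashes is hypothetical and not part of openml-python.

```python
import re

# Hypothetical helper, not part of openml-python: remove only the backslashes
# at the very end of a value and leave the rest of the text untouched.
_TRAILING_BACKSLASHES = re.compile(r"\\+\Z")


def strip_trailing_backslashes(value):
    if not isinstance(value, str):
        return value  # leave missing values and numbers as they are
    return _TRAILING_BACKSLASHES.sub("", value)


samples = [r"test\\", r"\\test", "plain text!", None]
cleaned = [strip_trailing_backslashes(s) for s in samples]
print(cleaned)
```

With pandas, such a helper could be applied per value, for example via df['content'].map(strip_trailing_backslashes), before calling create_dataset.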

Expected Results

print(f"URL for dataset: {dummy_dataset.openml_url}")
should output something like
URL for dataset: https://test.openml.org/d/4005

Actual Results

/home/lukas/anaconda3/envs/Hackathon/lib/python3.9/site-packages/openml/config.py:177: UserWarning: Switching to the test server https://test.openml.org/api/v1/xml to not upload results to the live server. Using the test server may result in reduced performance of the API!
  warnings.warn(
Traceback (most recent call last):
  File "*working_directory*/upload_arff_error.py", line 33, in <module>
    dummy_dataset.publish()
  File "*conda_env_path*/lib/python3.9/site-packages/openml/base.py", line 130, in publish
    response_text = openml._api_calls._perform_api_call(
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 65, in _perform_api_call
    response = _read_url_files(url, data=data, file_elements=file_elements)
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 197, in _read_url_files
    response = _send_request(request_method="post", url=url, data=data, files=file_elements,)
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 248, in _send_request
    __check_response(response=response, url=url, file_elements=files)
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 295, in __check_response
    raise __parse_server_exception(response, url, file_elements=file_elements)
openml.exceptions.OpenMLServerException: https://test.openml.org/api/v1/xml/data/ returned code 145: Error parsing dataset ARFF file - Arff error in dataset file: missing trailing quote in string (l.9)


joaquinvanschoren · 2021-11-17T19:40:51Z

Thanks for reporting! I'll transfer this to the issue tracker of the python API.

mfeurer · 2021-11-18T09:30:16Z

Hi @joaquinvanschoren, this is 99.9% not a Python issue, since the error message is emitted by the server. The ARFF file that is produced and uploaded can be opened in WEKA without any issues (not the one from the example, as that is a minimal working example), so we assume the problem is in the PHP upload checker. I'll discuss with @Louquinze how to produce a more elaborate example that yields a full-blown ARFF file which can also be loaded in WEKA.

Louquinze · 2021-11-19T08:51:11Z

I edited the issue as Matthias suggested above.

joaquinvanschoren · 2022-03-01T16:41:28Z

Might be a fault in the PHP ARFF checker. Perhaps the double backslash gets altered, which makes the check fail?
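For illustration only, here is one way a quote scanner could produce "missing trailing quote" for a value ending in backslashes while accepting backslashes at the start. This is a hypothetical sketch in Python, not the server's PHP code. It assumes the serialized field is single-quoted and ends in an escaped backslash right before the closing quote. Both function names are made up for this example.

```python
def find_closing_quote_naive(text, start):
    """Hypothetical buggy scan: a quote is treated as escaped whenever
    the character directly before it is a backslash."""
    i = start
    while i < len(text):
        if text[i] == "'" and text[i - 1] != "\\":
            return i
        i += 1
    raise ValueError("missing trailing quote in string")


def find_closing_quote_escape_aware(text, start):
    """Consume escapes pairwise, so an escaped backslash does not
    escape the quote that follows it."""
    i = start
    while i < len(text):
        if text[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if text[i] == "'":
            return i
        i += 1
    raise ValueError("missing trailing quote in string")


# Backslashes at the start of the value versus at the end of the value.
for field in (r"'\\test'", r"'test\\'"):
    aware = find_closing_quote_escape_aware(field, 1)
    try:
        naive = find_closing_quote_naive(field, 1)
    except ValueError as exc:
        naive = f"error: {exc}"
    print(field, "| escape-aware:", aware, "| naive:", naive)
```

In this sketch the naive scan only fails when the backslashes sit directly before the closing quote, which matches the pattern in the minimal example. That does not confirm this is what the checker does.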

mfeurer · 2022-06-07T15:16:47Z

We will re-evaluate this once Parquet-upload is available.

joaquinvanschoren transferred this issue from openml/OpenML Nov 17, 2021

PGijsbers added the bug label Nov 17, 2021

mfeurer mentioned this issue Nov 24, 2021

Add support for the OpenML test server openml/automlbenchmark#423

Merged

mfeurer added the Data OpenML concept label Feb 20, 2023
