Dataset.Tabular is NOT loading the specified file on storage #21419

Closed
afogarty85 opened this issue Oct 26, 2021 · 11 comments
Labels
customer-reported Issues that are reported by GitHub users external to the Azure organization. Machine Learning ML-CoreUI AreaPath needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention Workflow: This issue is responsible by Azure service team.

Comments

@afogarty85

  • Package Name: azureml.core.
  • Package Version: 1.34.0
  • Operating System: W10
  • Python Version: 3.6.9

Describe the bug
When loading a tabular dataset from a delimited file, the SDK does not return the contents of the file that is currently on storage.

To Reproduce
Steps to reproduce the behavior:

Following this guide on how to use these files:
https://github.com/Azure/MachineLearningNotebooks/blob/122df6e84622136690801685b183af5a04d77dec/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-showcasing-dataset-and-pipelineparameter.ipynb

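# build data set configurations (`ws` is assumed to be an existing Workspace)
from azureml.core import Dataset
from azureml.pipeline.core import PipelineParameter
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
stack_rank = Dataset.Tabular.from_delimited_files([(ws.datastores['fs'], '/RAW/Daily/stack_rank_daily.csv')])
stack_rank_param = PipelineParameter(name="stack_rank_param", default_value=stack_rank)
stack_rank_ds_consumption = DatasetConsumptionConfig("stack_rank_dataset", stack_rank_param)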

# register it to see its location
stack_rank = stack_rank.register(workspace = ws,
                                 name = 'stack_rank',
                                 description = 'stack_rank data',
                                 create_new_version = True) 

# here is its registration
"registration": {
    "id": "721b9763-b2e5-4524-a620-de3df1ed4403",
    "name": "stack_rank",
   etc

# examine the registration
dataset = Dataset.get_by_id(ws, '721b9763-b2e5-4524-a620-de3df1ed4403')
dataset.to_pandas_dataframe()
# its shape is (1406, 142)
# it SHOULD be (320, 142)

Expected behavior
I expected the file on storage to load. If I delete the file, AML rightly says that the file has disappeared. If I upload the right file, shape (320, 142), AML will continue to load the one shaped (1406, 142).

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Oct 26, 2021
@rakshith91
Contributor

Thank you for reporting the issue. Someone from the team will take a look as soon as possible.

@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Oct 27, 2021
@rakshith91 rakshith91 added the ML-CoreUI AreaPath label Oct 27, 2021
@ghost ghost added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Oct 27, 2021
@SaurabhSharma-MSFT SaurabhSharma-MSFT self-assigned this Oct 28, 2021
@SaurabhSharma-MSFT SaurabhSharma-MSFT added Service Attention Workflow: This issue is responsible by Azure service team. and removed CXP Attention labels Nov 4, 2021
@ghost

ghost commented Nov 4, 2021

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.

@SaurabhSharma-MSFT SaurabhSharma-MSFT removed their assignment Nov 4, 2021
@ynpandey

ynpandey commented Nov 4, 2021

@afogarty85 Does the CSV file that you are using contain multiline values? From our documentation:
By default (support_multi_line=False), all line breaks, including those in quoted field values, will be interpreted as a record break. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may result in silently producing more records with misaligned field values. This should be set to True when the delimited files are known to contain quoted line breaks.
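
The effect described in the documentation excerpt can be reproduced without Azure ML. The sketch below is only an illustration using Python's standard `csv` module on a hypothetical in-memory sample (it is not the reporter's data and not the SDK's parser): counting every line break as a record break inflates the row count, while a quote-aware reader recovers the intended records.

```python
import csv
import io

# Hypothetical sample: a header plus 2 records. The "note" field of each
# record contains quoted line breaks.
SAMPLE = (
    'id,name,note\n'
    '1,alpha,"line one\nline two\nline three"\n'
    '2,beta,"first\nsecond\nthird"\n'
)


def count_rows_naive(text: str) -> int:
    """Treat every line break as a record break (header excluded)."""
    return len(text.splitlines()) - 1


def count_rows_quoted(text: str) -> int:
    """Honor quoted line breaks, as a quote-aware CSV reader does (header excluded)."""
    reader = csv.reader(io.StringIO(text, newline=""))
    return sum(1 for _ in reader) - 1


NAIVE_ROWS = count_rows_naive(SAMPLE)
QUOTED_ROWS = count_rows_quoted(SAMPLE)
print(NAIVE_ROWS, QUOTED_ROWS)  # 6 2
```

If the two counts differ for a local copy of a file, that suggests (under the assumption that the file uses standard double-quote quoting) that it contains quoted line breaks and needs `support_multi_line=True`.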

@afogarty85
Author

Thanks for the help!

By doing something like:

stack_rank = Dataset.Tabular.from_delimited_files([(ws.datastores['fs'], '/RAW/Daily/stack_rank_daily.csv')], support_multi_line=True)

I get another error: "Cannot load any data from the specified path. Make sure the path is accessible and contains data.\nThe Dataflow produced no records."

If I set:

stack_rank = Dataset.Tabular.from_delimited_files([(ws.datastores['fs'], '/RAW/Daily/stack_rank_daily.csv')], support_multi_line=False)

The data loads, but the row count is again inflated.

@ynpandey
Copy link

ynpandey commented Nov 4, 2021

@afogarty85 Is there any way for you to share the data file with us if it does not contain any sensitive information? This may help us in reproducing the problem at our end.

@afogarty85
Author

Unfortunately not. Could you perhaps generate dummy data in a pandas dataframe that would demonstrate this behavior?

all line breaks, including those in quoted field values, will be interpreted as a record break

From there, investigations could be done.
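
As a rough sketch of such dummy data, the snippet below writes a small CSV whose last column contains embedded line breaks. It uses the standard `csv` module rather than pandas to stay dependency-free; the file name, column names, and values are all hypothetical. The default `QUOTE_MINIMAL` setting quotes any field that contains a line break, which is what produces the quoted multi-line values.

```python
import csv
import os
import tempfile


def write_dummy_multiline_csv(path: str, n_records: int = 2) -> None:
    """Write a 3-column CSV whose last column contains an embedded line break."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # default QUOTE_MINIMAL quotes fields with line breaks
        writer.writerow(["id", "name", "comment"])
        for i in range(n_records):
            writer.writerow([i, f"name_{i}", f"first line {i}\nsecond line {i}"])


def physical_line_count(path: str) -> int:
    """Number of physical lines in the file, header included."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in f)


def record_count(path: str) -> int:
    """Number of CSV records in the file, header included."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.reader(f))


DUMMY_PATH = os.path.join(tempfile.mkdtemp(), "dummy_multiline.csv")
write_dummy_multiline_csv(DUMMY_PATH, n_records=2)
PHYSICAL_LINES = physical_line_count(DUMMY_PATH)
RECORDS = record_count(DUMMY_PATH)
print(PHYSICAL_LINES, RECORDS)  # 5 3
```

Uploading a file like this to a datastore and loading it with both `support_multi_line` settings would be one way to check the behavior; that step is not run here.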

@ynpandey

ynpandey commented Nov 4, 2021

@afogarty85 Take this multiline.csv file as an example.

from azureml.core import Workspace, Dataset, Datastore
ws = Workspace.from_config()
dstore = Datastore.get_default(ws)
path = [(dstore, 'data/multiline.csv')]
dset1 = Dataset.Tabular.from_delimited_files(path) # By default support_multi_line=False
df1 = dset1.to_pandas_dataframe()
df1.shape

Output:

(6, 3)

Now if we execute the following code:

dset2 = Dataset.Tabular.from_delimited_files(path, support_multi_line=True)
df2 = dset2.to_pandas_dataframe()
df2.shape

Output:

(2, 3)

As you can see, the shape of the dataframe is (6, 3) without multiline support and (2, 3) with it. Setting support_multi_line=True parses the file correctly.

@afogarty85
Author

afogarty85 commented Nov 8, 2021

Thanks for the input -- this definitely appears to be the problem.

Are you aware of anything I can do to speed up training and loading of files?

I need to specify support_multi_line=True, otherwise the shapes are wrong. The consequence is that it takes AML approximately 5 minutes to load a dataframe shaped (31733, 58).

support_multi_line=False returns my dataframe in seconds, but with 70k observations instead of the correct 31,733.

@ynpandey

ynpandey commented Nov 8, 2021

@afogarty85 I am happy that you were able to read the data in the correct format.

Regarding the time difference that you are seeing, processing tabular files with multi-line data is slower because data has to be read line-by-line and multiple CPU cores cannot be used to ingest the data in parallel. This is the reason behind slower processing when we set support_multi_line=True.
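
The reasoning above can be illustrated with a simplified model (an assumption for illustration, not a description of how the Azure ML reader is implemented). Whether a line break ends a record depends on whether the parser is inside a quoted field, and that state is only known by scanning from the start of the file. A worker that starts mid-file and assumes it is outside quotes can misclassify every subsequent line break.

```python
from typing import List


def newline_is_record_break(text: str) -> List[bool]:
    """For each '\\n' in text, report whether it ends a record (i.e. is outside quotes).

    Assumes scanning starts outside any quoted field.
    """
    in_quotes = False
    flags = []
    for ch in text:
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state is preserved.
            in_quotes = not in_quotes
        elif ch == "\n":
            flags.append(not in_quotes)
    return flags


# Hypothetical sample: record 1 has a quoted line break in its "note" field.
SAMPLE = 'id,note\n1,"a\nb"\n2,"c"\n'

# Sequential scan from the start: correct classification.
FULL_SCAN = newline_is_record_break(SAMPLE)

# A hypothetical parallel worker that starts just after the opening quote
# and wrongly assumes it is outside a quoted field.
start = SAMPLE.index('"') + 1
MID_SCAN = newline_is_record_break(SAMPLE[start:])

print(FULL_SCAN)  # [True, False, True, True]
print(MID_SCAN)   # [True, False, False]
```

The last three line breaks are classified as `[False, True, True]` by the full scan but as `[True, False, False]` by the mid-file scan, which is why chunking on line breaks is only safe when no field contains one.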

@afogarty85
Author

afogarty85 commented Nov 8, 2021

Thanks @ynpandey !

That makes sense, but this still looks like an issue. I cannot imagine a scenario where a Xeon processor, using a single core, takes 5 minutes to load 30k rows of 50 columns. It should take seconds; it's a ~30 MB file.
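
One possible workaround, which nobody in this thread proposed and which has not been tested against Azure ML, is to preprocess the file once so that no field contains a line break; the faster default `support_multi_line=False` path should then see one record per line. The sketch below uses only the standard `csv` module. The function name is hypothetical, and note that it changes the data by replacing embedded line breaks with a space, which may not be acceptable for every dataset.

```python
import csv
import io


def flatten_embedded_newlines(src, dst, replacement=" "):
    """Copy CSV records from src to dst, replacing line breaks inside fields.

    src and dst are text file objects opened with newline="".
    Returns the number of records written (header included).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for row in reader:
        cleaned = [
            field.replace("\r\n", replacement)
            .replace("\n", replacement)
            .replace("\r", replacement)
            for field in row
        ]
        writer.writerow(cleaned)
        count += 1
    return count


# Hypothetical sample: record 1 has a quoted line break in its "note" field.
SOURCE = 'id,note\n1,"a\nb"\n2,c\n'
src = io.StringIO(SOURCE, newline="")
dst = io.StringIO(newline="")
RECORDS_WRITTEN = flatten_embedded_newlines(src, dst)
FLATTENED = dst.getvalue()
print(RECORDS_WRITTEN, len(FLATTENED.splitlines()))  # 3 3
```

For real files, the same function can be called with `open(path, newline="", encoding=...)` handles for the source and destination before uploading the flattened copy.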

@luigiw
Contributor

luigiw commented Oct 21, 2022

Closing legacy issue.

@ynpandey could you share a link to the new v2 mltable package?

@luigiw luigiw closed this as completed Oct 21, 2022
azure-sdk pushed a commit to azure-sdk/azure-sdk-for-python that referenced this issue Feb 1, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023
6 participants