Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added decorator for retry_request and handle err 429 #1491

Merged
merged 6 commits into from
Jan 27, 2024

Conversation

a24lorie
Copy link
Contributor

@a24lorie a24lorie commented Jan 7, 2024

Added test for pyarrow integration using the DBFS implementation

Added test for pyarrow integration using the DBFS implementation
@a24lorie
Copy link
Contributor Author

a24lorie commented Jan 7, 2024

This PR is a quick fix for handling HTTP error 429 with DBFS filesystem implementation by adding a decorator to the _send_to_api function. It does not change the base code used to interact with the DBFS. I have also added some integration tests that use the pyarrow implementation to read and write parquet files from DBFS.
The current patch solves the issue #1488 reported but the current code still lacks the capability to read from a partitioned directory structure using the hive partition method.

Commented some failed unit test
Added cassettes for unit testing dfbs methods
fsspec/implementations/dbfs.py Outdated Show resolved Hide resolved
fsspec/implementations/dbfs.py Outdated Show resolved Hide resolved
fsspec/implementations/dbfs.py Outdated Show resolved Hide resolved
fsspec/implementations/dbfs.py Outdated Show resolved Hide resolved
fsspec/implementations/dbfs.py Outdated Show resolved Hide resolved
fsspec/implementations/dbfs.py Outdated Show resolved Hide resolved
fsspec/implementations/dbfs.py Outdated Show resolved Hide resolved
@@ -0,0 +1,769 @@
Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why include this big file? A small sample in code would be fine for the sake of testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to replicate the error I was having in a real case scenario to stress the BDFS API writing concurrently many records, doing this triggered the 429 error, I could remove the file and only put a couple of records but it will not replicate the initial issue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

import concurrent.futures
fs = ...
ex = concurrent.futures.ThreadPoolExecutor()
ex.map(lambda i: fs.pipe(f"/path/{i}", b"data"), range(100))

You could use a mock to test that the retry is actually getting called; OR allow logging during retry (not a bad idea anyway) and check for logging statements coming out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we're using VCR, so you'll never get a response except the one you go the original time. I have no idea, then, how to get back a 429 without mocking

fsspec/implementations/tests/test_dbfs.py Show resolved Hide resolved
def test_dbfs_write_pyarrow_non_partitioned(dbfsFS):
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail if these are missing in the environment. Can we do without them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have commented them, is it better to remove them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please

fsspec/implementations/dbfs.py Outdated Show resolved Hide resolved
fsspec/implementations/dbfs.py Outdated Show resolved Hide resolved
# Request Timeout
408,
# Too Many Requests
429,
]
errs += [str(e) for e in errs]
if type(exception) is requests.exceptions.HTTPError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isinstance is better

fsspec/implementations/tests/test_dbfs.py Outdated Show resolved Hide resolved
Fixed broken vcr tests due to comments
Updated vcr cassettes
@AlfredoLorie
Copy link

AlfredoLorie commented Jan 26, 2024

@martindurant Is there anything else required to merge this feature?

self.session = requests.Session()
self.retries = Retry(total=10,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future we'll make this configurable, but I won't hold the PR any more

@martindurant martindurant merged commit a408121 into fsspec:master Jan 27, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants