Added decorator for retry_request and handle err 429 #1491
Conversation
Added test for pyarrow integration using the DBFS implementation
This PR is a quick fix for handling HTTP error 429 with the DBFS filesystem implementation by adding a decorator to the _send_to_api function. It does not change the base code used to interact with DBFS. I have also added some integration tests that use the pyarrow implementation to read and write parquet files from DBFS.
Commented out some failing unit tests
Added cassettes for unit testing dbfs methods
@@ -0,0 +1,769 @@
Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
Why include this big file? A small sample in code would be fine for the sake of testing.
I was trying to replicate the error I was seeing in a real-world scenario by stressing the DBFS API with many concurrent writes, which triggered the 429 error. I could remove the file and keep only a couple of records, but that would not reproduce the original issue.
How about:
import concurrent.futures
fs = ...
ex = concurrent.futures.ThreadPoolExecutor()
ex.map(lambda i: fs.pipe(f"/path/{i}", b"data"), range(100))
You could use a mock to test that the retry is actually getting called; OR allow logging during retry (not a bad idea anyway) and check for logging statements coming out.
Actually, we're using VCR, so you'll never get a response except the one you got the original time. I have no idea, then, how to get back a 429 without mocking.
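The mock-based approach suggested above could look roughly like this: a mock whose first call raises a throttling error and whose second call returns data, with an assertion that the retry path was exercised. All names here are illustrative stand-ins, not fsspec's actual API.

```python
from unittest import mock


class FakeHTTP429(Exception):
    """Stand-in for an HTTPError with status code 429."""


# First call raises a throttling error, second call succeeds,
# mimicking a 429 followed by a successful retry.
send = mock.Mock(side_effect=[FakeHTTP429(), {"ok": True}])


def call_with_retry(func, retries=3):
    """Minimal retry loop standing in for the decorated request function."""
    for attempt in range(retries):
        try:
            return func()
        except FakeHTTP429:
            if attempt == retries - 1:
                raise


result = call_with_retry(send)
assert result == {"ok": True}
assert send.call_count == 2  # the retry actually happened
```

This avoids VCR entirely: the mock plays back the 429 deterministically, so no real (or recorded) HTTP traffic is needed.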
def test_dbfs_write_pyarrow_non_partitioned(dbfsFS):
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
This will fail if these are missing in the environment. Can we do without them?
I have commented them out; is it better to remove them?
yes please
fsspec/implementations/dbfs.py
Outdated
    # Request Timeout
    408,
    # Too Many Requests
    429,
]
errs += [str(e) for e in errs]
if type(exception) is requests.exceptions.HTTPError:
isinstance is better
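For context, `isinstance` matches a class and all of its subclasses, while an exact `type(...) is ...` check only matches the concrete class; requests raises a hierarchy of exception subclasses, which is why `isinstance` is the safer check here. A minimal stand-alone illustration (these classes mirror the shape of requests' hierarchy but are stand-ins):

```python
class RequestException(Exception):
    """Stand-in for requests.exceptions.RequestException."""


class HTTPError(RequestException):
    """Stand-in for requests.exceptions.HTTPError, a subclass."""


err = HTTPError("HTTP 429")

# isinstance matches the class and all of its ancestors...
assert isinstance(err, HTTPError)
assert isinstance(err, RequestException)

# ...while an exact type check only matches the concrete class.
assert type(err) is HTTPError
assert type(err) is not RequestException
```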
Removed unused tests
Other minor format fixes
Fixed broken vcr tests due to comments
Updated vcr cassettes
@martindurant Is there anything else required to merge this feature?
fsspec/implementations/dbfs.py
Outdated
self.session = requests.Session()
self.retries = Retry(total=10,
In the future we'll make this configurable, but I won't hold the PR any more
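The diff above mounts a urllib3 `Retry` on the session with a hard-coded `total=10`. Making that configurable, as suggested, might look roughly like this; the function and parameter names are assumptions for illustration, not fsspec's actual API.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total=10, backoff_factor=1.0, statuses=(408, 429)):
    """Build a requests Session with a user-configurable retry policy.

    Hypothetical sketch: in fsspec these values could be exposed as
    filesystem constructor arguments instead of being hard-coded.
    """
    session = requests.Session()
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        # Retry on throttling/timeout statuses such as 408 and 429
        status_forcelist=list(statuses),
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


session = make_session(total=3)
```

With this shape, a caller hitting aggressive DBFS rate limits could simply raise `total` or `backoff_factor` without patching the library.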