-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fetch
command
#77
Comments
I would like to take a crack at this. Please let me know if you have specifics in mind already. |
Hi @mhuang74 , the thing I like about Docopt, is that the helptext becomes an API contract definition of sorts. Let's start by drafting the helptext first, and once we lock that down, you can then work on the code implementing it. Can you take a first pass at the helptext? Also, we need to think about the underlying crates to use - should we use reqwest, actix or awc? As their underlying capabilities will enable the options we use with |
Thanks @jqnatividad. Here's my first pass.
|
Great @mhuang74 ! Now we start thinking about the implementation, which will ultimately result in additional options we need to expose on the CLI. The first one is throttling, as qsv cannot just hit an endpoint without it, as most APIs have rate limits, and your IP can even be blacklisted if your query can easily mistaken for a DOS attack. There should be a default, conservative throttle delay (say 5000 ms like OpenRefine) that the user can override. Also, what user agent should qsv present? It should definitely have a default one that says qsv, but the user should be presented with an option to use an alternate user agent string. Another one is response caching. You may want to check out cached. I used it with Also, there's some work being done to allow Finally, As for the underlying HTTP client library to use - https://blog.logrocket.com/the-state-of-rust-http-clients/ |
Thanks @jqnatividad for detailed feedback. Here's my 2nd pass.
And more importantly, since we are really looking at fairly low request/sec throughput due to server rate limits, choice of library would be more about practicality rather than performance. Probably would just go with reqwest, since:
|
@mhuang74 I like your 2nd pass. Some thoughts:
I agree. We should bake in multi-threading. Performance and multi-threading are signature advantages of Rust, and we should leverage it while remaining cognizant of rate limits.
Agree on all the above. With all the files we need to keep around, we should consider creating a
And yes, As for the docopt usage help text:
Finally, you may want to look into RateLimit http header fields to bake in support for auto-adjusting rate limiters to rate-limited services. There are some discussions about it on |
@jqnatividad As most API return JSON, this seems important to support. But I am struggling with:
Example JSON Response
Questions:
|
@mhuang74, here are some ideas for consideration:
|
Thanks for feedback and suggestion. Directly supporting JSON parsing seems the way to go. I was able to integrate with
|
Hi @mhuang74 ! Even in its current form, it's already very useful, and once we implement the other options, it'll be one of qsv's tentpole commands. |
@jqnatividad Okay, glad to hear it. Thanks for doing the merge and refactoring. |
@mhuang74 tweaked it a bit more as I'm planning to already use it in a pipeline I'm working on :)
And some additional thoughts on the other options:
|
Thanks @jqnatividad. Glad to see the feature quickly filling out. Please let me know if you still need help with fetch rate-limiting. I can also move to something less of a critical path for your projects since I only get to work on this over weekends. |
Thanks @mhuang74! If you can help with rate-limiting, that'd be awesome! I think getting headers & cookies working will also be great for authenticated sessions while you're at it. And feel free to jump into anything in the backlog that piques your interest. #46 and #60 in particular would really bulk up qsv's schema validation capabilities. |
Sure. Here's my plan for fetch rate-limiting:
|
Fetch: support rate limiting as per #77
Hi @mhuang74 , I tweaked Once the |
Thanks @jqnatividad. Regarding For integration test, I am having difficulties finding an api that requires key yet does not require my credit card (eg RapidAPI). The only one I've found is api.data.gov but it also recommend not to share the key. Thoughts?
|
Thinking about it, Perhaps, we should rename it As for integration testing, I googled this up - https://mixedanalytics.com/blog/list-actually-free-open-no-auth-needed-apis/. After a quick scan of the list , JSONPlaceholder and HTTPBin looks promising. Alternatively, perhaps you can parameterize the integration test that requires an API key and populate it from an environment variable, and if the envar is not set, the test is skipped. We just mention in the fetch documentation that they need to get their own API key. |
@mhuang74 with the 0.29 release, I'm closing this issue, but you're welcome to keep improving |
fetch
will allow qsv to fetch HTML or data from web pages or services, to enrich a CSV (e.g. geocoding, wikidata api, etc.)It will support authentication, concurrent requests, thresholds, etc.
Reminiscent of OpenRefine's fetch url... (https://docs.openrefine.org/manual/columnediting#add-column-by-fetching-urls) , but optimized for the command line.
The text was updated successfully, but these errors were encountered: