Add download_job_result method to JobAPI #119

Merged Sep 6, 2024 (3 commits)

@chezou (Member) commented Aug 23, 2024

This method downloads the result of a job to a
local file in msgpack.gz format. The number of
threads used for parallel download can be specified.
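
To illustrate the general idea of a threaded download, here is a minimal sketch. It is not the code added in this PR: `split_ranges`, `parallel_download`, and the `fetch_range` callable are hypothetical names, and it assumes the result can be fetched as independent byte ranges.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, num_chunks):
    """Split [0, total_size) into contiguous (start, end) ranges, end exclusive."""
    if total_size <= 0 or num_chunks <= 0:
        return []
    chunk = -(-total_size // num_chunks)  # ceiling division
    return [(start, min(start + chunk, total_size))
            for start in range(0, total_size, chunk)]


def parallel_download(fetch_range, total_size, path, num_threads=4):
    """Fetch byte ranges concurrently and write them to path in order.

    fetch_range(start, end) is a hypothetical callable that returns the
    bytes from start up to (not including) end.
    """
    ranges = split_ranges(total_size, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool, open(path, "wb") as f:
        # pool.map yields results in submission order, so the file is
        # written sequentially even though the fetches overlap.
        for data in pool.map(lambda r: fetch_range(*r), ranges):
            f.write(data)
    return True
```

This sketch buffers each fetched range in memory before writing it, so a real implementation for multi-gigabyte results would likely stream smaller pieces to disk instead.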

File validation

Confirmed that the checksum matches that of the same job result downloaded with TD Toolbelt.

$ shasum 2172017217*
5e227ef6582fd216ce1f0d8b81d43a4a6274aff6  2172017217.msgpack.gz
5e227ef6582fd216ce1f0d8b81d43a4a6274aff6  2172017217_toolbelt.msgpack.gz
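
The same comparison can be done from Python. This is a small sketch using the standard library `hashlib`; it assumes SHA-1, the default algorithm of `shasum`, and the helper names `sha1_of_file` and `same_checksum` are made up for illustration.

```python
import hashlib


def sha1_of_file(path, block_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_checksum(path_a, path_b):
    """Return True when both files have the same SHA-1 digest."""
    return sha1_of_file(path_a) == sha1_of_file(path_b)
```

For the files above, this would be called as `same_checksum("2172017217.msgpack.gz", "2172017217_toolbelt.msgpack.gz")`.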

Download speed comparison

  • TD Toolbelt: 37 mins (including one retry)
  • td-client-python with 6 threads: 4 mins

$ time /Users/ariga/.gem/ruby/2.6.0/bin/td job:show 2172017217 -f msgpack.gz -o 2172017217_toolbelt.msgpack.gz
JobID       : 2172017217
Status      : success
Type        : presto
Database    : aki_audience_large
Start At    : 2024-08-17 08:52:04 UTC
End At      : 2024-08-17 09:03:29 UTC
Priority    : NORMAL
Retry limit : 0
Output      :
Query       : select * from behavior_1 limit 100_000_000
Result size : 2.84 GB
Result      :
EOFError: httpclient IncompleteError. Retrying after 5 seconds...ack.gz in msgpack.gz format 966.9 MB /  2.8 GB : ======= 33 ========
NOTE: the job result is being written to 2172017217_toolbelt.msgpack.gz in msgpack.gz format  2.8 GB /  2.8 GB : =========================== 100 ===========================
written to 2172017217_toolbelt.msgpack.gz in msgpack.gz format
Use '-v' option to show detailed messages.
/Users/ariga/.gem/ruby/2.6.0/bin/td job:show 2172017217 -f msgpack.gz -o   71.05s user 62.49s system 5% cpu 37:38.02 total
>>> import tdclient; import os; td = tdclient.Client(apikey=os.environ["TD_API_KEY"])
>>> import time; s = time.time(); td.download_job_result("2172017217", "2172017217.msgpack.gz"); print(f"elapsed time: {time.time() - s} sec")
True
elapsed time: 240.56217789649963 sec
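
As a rough check on the two timings above, the speedup can be computed directly from the reported numbers. This is only arithmetic on the figures in the transcripts; the TD Toolbelt run included one retry, so the ratio likely overstates the gain from threading alone.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes:seconds wall-clock reading to seconds."""
    return minutes * 60 + seconds


toolbelt_sec = to_seconds(37, 38.02)   # "37:38.02 total" reported by `time`
python_sec = 240.56217789649963        # elapsed time printed by the Python session
speedup = toolbelt_sec / python_sec

print(f"TD Toolbelt: {toolbelt_sec:.2f} sec")
print(f"td-client-python: {python_sec:.2f} sec")
print(f"speedup: {speedup:.1f}x")
```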

@tung-vu-td (Contributor) left a comment

LGTM

@tung-vu-td merged commit f8f23bb into master on Sep 6, 2024
18 checks passed
@tung-vu-td deleted the download-job-result branch on September 6, 2024 at 03:33