Create Spark job to get user file access times using WMArchive LFNArray belonging to CRAB jobs #113
base: master
Conversation
Force-pushed from a6de041 to e684264
Changed access time from
Force-pushed from a8e9b52 to fb01a2e
bin/cron4wma_crab_ds_access.sh
Outdated
# ------------------------------------------------------------------------------------------------------- RUN SPARK JOB
# Required for Spark job in K8s
util4logi "spark job starts"
export PYTHONPATH=$script_dir/../src/python:$PYTHONPATH
This is a very dangerous construct which hides the fact that this script depends on a specific directory layout containing the Python sources. I would rather put this at the top of the script and perform a check that loads one of the Python modules this script requires. If the module import fails, you can throw an error asking the user to set up a proper PYTHONPATH environment.
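For example, a minimal sketch of such a check; the module name CMSSpark is only a placeholder for whatever module the script actually imports:

# Fail early if the required Python module cannot be imported.
# "CMSSpark" is a placeholder for the module the script actually needs.
if ! python3 -c "import CMSSpark" >/dev/null 2>&1; then
    echo "ERROR: cannot import CMSSpark; please set PYTHONPATH to include the python sources" >&2
    exit 1
fi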
# Define logs path for Spark imports which produce lots of info logs
LOG_DIR="$WDIR"/logs/$(date +%Y%m%d)
mkdir -p "$LOG_DIR"
The script should print all environment variables it uses, e.g.
echo "LOG_DIR=$LOG_DIR"
...
This will help you later in the debugging process.
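For instance, a short dump of the variables visible in the diff above (anything beyond these would be a guess):

# Print the settings the script relies on before doing any work.
echo "WDIR=$WDIR"
echo "LOG_DIR=$LOG_DIR"
echo "PYTHONPATH=$PYTHONPATH"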
@@ -0,0 +1,122 @@
#!/bin/bash
set -e
Please add an author section to the script.
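For example, a minimal header block; the exact fields are only a suggestion:

#!/bin/bash
# Author: <name> <email>
# Description: cron wrapper that runs the WMArchive CRAB dataset-access Spark job
set -e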
HDFS_DBS_PHYSICS_GROUPS = f'/tmp/cmsmonit/rucio_daily_stats-{TODAY}/PHYSICS_GROUPS/part*.avro'
HDFS_DBS_ACQUISITION_ERAS = f'/tmp/cmsmonit/rucio_daily_stats-{TODAY}/ACQUISITION_ERAS/part*.avro'
HDFS_DBS_DATASET_ACCESS_TYPES = f'/tmp/cmsmonit/rucio_daily_stats-{TODAY}/DATASET_ACCESS_TYPES/part*.avro'
I suggest adding a dump of all global variables to stdout, to help with the debugging process.
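A minimal sketch of such a dump, assuming the globals of interest are the upper-case module-level constants like the HDFS paths above:

def dump_globals():
    """Print all upper-case module-level globals to stdout for debugging."""
    for name, value in sorted(globals().items()):
        if name.isupper():
            print(f'{name}={value}')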
# Send data with STOMP AMQ
# =====================================================================================================================
def credentials(f_name):
    if os.path.exists(f_name):
Please add a docstring describing the format of the input file.
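A sketch of what such a docstring could look like, assuming the credentials file is a small JSON document; the key names shown are illustrative, not the job's actual format:

import json
import os

def credentials(f_name):
    """Read AMQ credentials from f_name.

    Assumed input format (illustrative): a JSON file with the fields needed
    by the STOMP AMQ sender, e.g.

        {"username": "...", "password": "...", "producer": "...",
         "topic": "/topic/...", "hostname": "...", "port": 61313}

    Returns the parsed dictionary, or an empty dict if the file is missing.
    """
    if not os.path.exists(f_name):
        return {}
    with open(f_name) as istream:
        return json.load(istream)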
Force-pushed from d834c44 to 3324217
Thanks @vkuznet, I applied all the changes. @drkovalskyi the Spark job is ready to test. Since it takes so much time, I could not fully test it; I'll do that tomorrow. There were also some problems in SWAN which slowed me down a bit. In general:
Force-pushed from 3324217 to 79997a4
Force-pushed from 79997a4 to fc648ae
This Spark job will require proper documentation once it is completely done. Let me explain the latest changes here:

As requested by Dima, we need to get the last access times and access counts of user jobs to datasets. WMArchive provides user job information in

When we come to access times,

Last but not least, there are 2 critical filters:

fyi @drkovalskyi
Force-pushed from fc648ae to 4544fee
This PR includes the calculation of last/first access times of datasets accessed by user jobs, using WMArchive data and filtering only the CRAB* jobtype. There are many lines for extracting additional DBS information, but the main logic that extracts LFN files and their information from the WMArchive data can be found in the udf_lfn_extract function.

@drkovalskyi WMArchive HDFS data goes back 18 months, i.e. to 2021-03. It means that if a file was accessed before 2021-03-01, we will not have this information. I'm running the Spark job on the full data and will send the results to ES. I'll inform you when they are visible in Kibana.

@vkuznet you're the WMArchive expert and the creator of its producer. If you have time, a review would be great.

The Bash script is long, but it's our general script that we use for other cron jobs, so it's trivial on our side.
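For orientation, a minimal sketch of the kind of extraction described above. This is not the PR's actual udf_lfn_extract implementation, and the WMArchive HDFS path and field names (LFNArray, meta_data.jobtype, meta_data.ts) are assumptions:

# Hypothetical sketch, not the PR's actual implementation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName('wma_crab_ds_access_sketch').getOrCreate()

@F.udf(returnType=ArrayType(StringType()))
def udf_lfn_extract(lfn_array):
    """Keep only real /store/... LFNs from a WMArchive LFNArray entry."""
    if not lfn_array:
        return None
    return [lfn for lfn in lfn_array if lfn and lfn.startswith('/store/')]

# Read WMArchive avro records (path is an assumption) and keep only CRAB jobs.
df = (
    spark.read.format('avro').load('/project/monitoring/archive/wmarchive/raw/metric/2022/*/*')
    .where(F.col('meta_data.jobtype').startswith('CRAB'))
    .withColumn('lfns', udf_lfn_extract(F.col('LFNArray')))
    .withColumn('lfn', F.explode('lfns'))
    .select('lfn', F.col('meta_data.ts').alias('access_ts'))
)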