diff --git a/.gitignore b/.gitignore
index 9ebcc6a..2605da1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -150,6 +150,7 @@ dead_letters/
 !requirements_dev.txt
 !simeon/scripts/data
 !simeon/scripts/data/*.csv
+!simeon/scripts/data/*.txt
 
 # Simeon configs
 *.cfg
diff --git a/simeon/scripts/data/simeon_epilog.txt b/simeon/scripts/data/simeon_epilog.txt
new file mode 100644
index 0000000..8c43480
--- /dev/null
+++ b/simeon/scripts/data/simeon_epilog.txt
@@ -0,0 +1,183 @@
+RETURN CODES:
+    simeon returns either 0 or 1, depending on whether an error was encountered.
+    If any error is encountered with any of the subcommands, 1 is returned.
+    For simeon list and simeon download, if nothing is listed or downloaded, then 1 is returned.
+    For simeon split and simeon push, if nothing ends up being split or pushed, then 1 is returned.
+    For simeon report, if any error is encountered while running the queries, then 1 is returned.
+
+SETUP and CONFIGURATIONS:
+    simeon is a glorified set of downloader and uploader scripts. Much of the downloading and uploading that it does assumes that you have
+    your AWS credentials configured properly and that you've got a service account file for GCP services available on your machine. If the latter is
+    missing, you may have to authenticate to GCP services through the SDK. However, both we and Google recommend the use of service accounts.
+
+    Every downloaded file is decrypted either during the download process or while it gets split by the simeon split command. So, this tool assumes that
+    you have installed and configured gpg to be able to decrypt files from edX.
+
+    The following steps may be useful to someone just getting started with the edX data package:
+
+    1. Credentials from edX
+
+        o Reach out to edX to get your data czar credentials
+
+        o Configure both AWS and gpg, so your credentials can access the S3 buckets and your gpg key can decrypt the files there
+
+    2. Set up a GCP project
+
+        o Create a GCP project
+
+        o Set up a BigQuery workspace
+
+        o Create a GCS bucket
+
+        o Create a service account and download the associated file
+
+        o Give the service account Admin Role access to both the BigQuery project and the GCS bucket
+
+    If the above steps are carried out successfully, then you should be able to use simeon without any issues.
+
+    However, if you have taken care of the above steps but are still unable to get simeon to work, please open an issue.
+
+    Further, simeon can parse INI-formatted configuration files. By default, it looks for them in the user's home directory or in the current working
+    directory of the running process. The base names it looks for are simeon.cfg, .simeon.cfg, simeon.ini and .simeon.ini.
+    You can also provide simeon with a config file by using the global option --config-file or -C and giving it the path to the file.
+
+    The following is a sample config file:
+
+        # Default section for things like the organization whose data package is processed
+        # You can also set a default site as one of the following: edx, edge, patches
+        [DEFAULT]
+        site = edx
+        org = yourorganizationx
+        clistings_file = /path/to/file/with/course_ids
+
+        # Section related to Google Cloud (project, bucket, service account)
+        [GCP]
+        project = your-gcp-project-id
+        bucket = your-gcs-bucket
+        service_account_file = /path/to/a/service_account_file.json
+        wait_for_loads = True
+        geo_table = your-gcp-project.geocode_latest.geoip
+        youtube_table = your-gcp-project.videos.youtube
+        youtube_token = your-YouTube-API-token
+
+        # Section related to the AWS credentials needed to download data from S3
+        [AWS]
+        aws_cred_file = ~/.aws/credentials
+        profile_name = default
+
+    The options in the config file(s) should match the optional arguments of the CLI tool. For instance, the --service-account-file, --project and
+    --bucket options can be provided under the GCP section of the config file as service_account_file, project and bucket, respectively. Similarly, the
+    --site and --org options can be provided under the DEFAULT section as site and org, respectively.
+
+
+EXAMPLES:
+List files
+    simeon can list files on S3 for your organization based on criteria like file type (sql or log or email), time intervals (begin and end dates),
+    and site (edx or edge or patches).
+        # List the latest SQL data dump
+        simeon list -s edx -o mitx -f sql -L
+        # List the latest email data dump
+        simeon list -s edx -o mitx -f email -L
+        # List the latest tracking log file
+        simeon list -s edx -o mitx -f log -L
+
+Download and split files
+    simeon can download, decrypt and split up files into folders belonging to specific courses.
+
+    o Example 1: Download, split and push SQL bundles to both GCS and BigQuery
+
+        # Download the latest SQL data dump
+        simeon download -s edx -o mitx -f sql -L -d data/
+
+        # Download SQL bundles dumped any time since 2021-01-01 and
+        # extract the contents for course ID MITx/12.3x/1T2021.
+        # Place the downloaded files in data/ and the output of the split in data/SQL
+        simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f sql -b 2021-01-01 -d data -S -D data/SQL/
+
+        # Push to GCS the split up SQL files inside data/SQL/MITx__12_3x__1T2021
+        simeon push gcs -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} -S ${SAFILE} data/SQL/MITx__12_3x__1T2021
+
+        # Push the files to BigQuery and wait for the jobs to finish
+        # Using -s or --use-storage tells BigQuery to extract the files
+        # to be loaded from Google Cloud Storage.
+        # So, use the option when you've already called simeon push gcs
+        simeon push bq -w -s -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} -S ${SAFILE} data/SQL/MITx__12_3x__1T2021
+
+    o Example 2: Download, split and push tracking logs to both GCS and BigQuery
+
+        # Download the latest tracking log file
+        simeon download -s edx -o mitx -f log -L -d data/
+
+        # Download tracking logs dumped any time since 2021-01-01
+        # and extract the contents for course ID MITx/12.3x/1T2021
+        # Place the downloaded files in data/ and the output of the split in data/TRACKING_LOGS
+        simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f log -b 2021-01-01 -d data -S -D data/TRACKING_LOGS/
+
+        # Push to GCS the split up tracking log files inside
+        # data/TRACKING_LOGS/MITx__12_3x__1T2021
+        simeon push gcs -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021
+
+        # Push the files to BigQuery and wait for the jobs to finish
+        # Using -s or --use-storage tells BigQuery to extract the files
+        # to be loaded from Google Cloud Storage.
+        # So, use the option when you've already called simeon push gcs
+        simeon push bq -w -s -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021
+
+    o If you have already downloaded SQL bundles or tracking log files, you can use simeon split to split them up.
+
+Make secondary/aggregated tables
+    simeon can generate secondary tables based on already loaded data. Call simeon report --help for the expected positional and optional arguments.
+
+    o Example: Make person_course for course ID MITx/12.3x/1T2021
+
+        # Make a person course table for course ID MITx/12.3x/1T2021
+        # Provide the -g option to give a geolocation BigQuery table
+        # to fill the ip-to-location details in the generated person course table
+        COURSE=MITx/12.3x/1T2021
+        simeon report -w -g "${GCP_PROJECT_ID}.geocode.geoip" -t "person_course" -p ${GCP_PROJECT_ID} -S ${SAFILE} ${COURSE}
+
+
+NOTES:
+1. Please note that SQL bundles are quite large when split up, so consider using the -c or --courses option when invoking simeon download -S or
+    simeon split to make sure that you limit the splitting to a set of course IDs. The `--clistings-file` option is an alternative to `--courses`.
+    It expects a text file with one course ID per line.
+    If those options are not used, simeon may end up failing to complete the split operation
+    due to exhausted system resources (storage, to be specific).
+
+2. simeon download with file types log and email will both download and decrypt the files matching the given criteria. If these operations are
+    successful, then the encrypted files are deleted by default. This is to make sure that you don't exhaust storage resources. If you wish to keep
+    those files, you can always use the --keep-encrypted option that comes with simeon download and simeon split. SQL bundles are only downloaded (not decrypted).
+    Their decryption is done during a split operation.
+
+3. Unless there is an unhandled exception (which should be reported as a bug), simeon should, by default, print to the standard output both information
+    and errors encountered while processing your files. You can capture those logs in a file by using the global option --log-file and providing
+    a destination file for the logs.
+
+4. When using multi-argument options like --tables or --courses, you should try not to place them right before the expected positional arguments.
+    This will help the CLI parser not confuse your positional arguments with table names (in the case of --tables) or course IDs (when --courses is used).
+
+5. Splitting tracking logs is a resource-intensive process. The routine that splits the logs generates a file for each course ID encountered. If you
+    happen to have more course IDs in your logs than the number of file descriptors the running process can open, then simeon will put away records
+    it cannot save to disk for a second pass. Putting away the records involves using more memory than normally required. The second pass will only
+    require one file descriptor at a time, so it should be safe in terms of file descriptor limits. To help simeon not have to do a second pass, you
+    may increase the file descriptor limits of processes from your shell by running something like ulimit -n 2000 before calling simeon split on Unix
+    machines. This should tell your OS kernel to allow processes started from that shell to open up to 2000 file handles. For Windows users, you may
+    have to dig into the Windows Registry for a corresponding setting.
+
+6. Care must be taken when using simeon split and simeon push to make sure that the number of positional arguments passed does not lead to the
+    invoked command exceeding the maximum command-line length allowed for arguments in a command. To avoid errors along those lines, please consider
+    passing the positional arguments as UNIX glob patterns. For instance, simeon split --file-type log 'data/TRACKING-LOGS/*/*.log.gz' tells simeon to
+    expand the given glob pattern, instead of relying on the shell to do it.
+
+7. The report subcommand relies on SQL query files, which it parses and sends to BigQuery for execution. Any errors arising from executing the parsed
+    queries will be shown to the end user through the given log stream. While the simeon tool ships with query files for most secondary/reporting tables
+    that are based on the edx2bigquery tool, an end user should be able to point simeon to a different location with SQL query files by using
+    the --query-dir option that comes with simeon report. Additionally, these query files can contain jinja2 templated SQL code.
+    Any variables mentioned in these templated queries can be passed to simeon report by using the --extra-args option and passing key-value pair items
+    in the format var1=value1,var2=value2,var3=value3,...,var_n=value_n. Further, these key-value pair items can also be typed by using the format
+    var1:i=value1,var2:s=value2,var3:f=value3,...,var_n:s=value_n. In this format, the type is appended to the key, separated by a colon.
+    The only supported scalar types, so far, are s for str, i for int, and f for float. If any conversion errors occur during value parsing,
+    then those are shown to the end user, and the query won't get executed. Finally, if you wish to pass an array or list to the template,
+    you will need to repeat a key multiple times. For instance, if you want to pass a list named mylist containing the integers 1, 2 and 3,
+    you could write something like --extra-args mylist:i=1,mylist:i=2,mylist:i=3. This means that you'll have a Python list named
+    mylist within your template, and it should contain [1, 2, 3]. You can also pass a JSON file whose top-level objects are parsed as variables. Use a leading @ when passing a JSON file.
\ No newline at end of file
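Reviewer note on NOTE 7: the typed key-value format for --extra-args is easier to follow with a small worked example. The Python sketch below is an illustration only and is not simeon's actual parser. The function name parse_extra_args is hypothetical, and two behaviors are assumptions rather than facts from the epilog: an item without a type suffix is kept as a string, and a key given only once yields a scalar rather than a one-element list. JSON file input (the leading @ form) is not covered.

```python
from collections import defaultdict

# Type suffixes described in NOTE 7: s for str, i for int, f for float
CASTS = {"s": str, "i": int, "f": float}


def parse_extra_args(text):
    """Hypothetical helper: turn 'k1:i=1,k2=abc' into a dict of typed values."""
    collected = defaultdict(list)
    for item in text.split(","):
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        name, _, type_code = key.partition(":")
        # Assumption: a missing type suffix means the value stays a string
        cast = CASTS.get(type_code or "s")
        if cast is None:
            raise ValueError(f"unsupported type {type_code!r} in {item!r}")
        # int() and float() raise ValueError when the value cannot be converted
        collected[name].append(cast(raw))
    # Assumption: a key given once yields a scalar, a repeated key yields a list
    return {k: v[0] if len(v) == 1 else v for k, v in collected.items()}


if __name__ == "__main__":
    print(parse_extra_args("mylist:i=1,mylist:i=2,mylist:i=3,cutoff:f=0.5"))
```

Repeating mylist:i three times yields a single list under the name mylist, which matches the [1, 2, 3] outcome the note describes. A value such as n:i=abc raises ValueError, mirroring the note's statement that conversion errors stop the query from being executed.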