Data Dumps not auto-generating #5402

Closed
2 tasks done
Tracked by #5757
cdrini opened this issue Jul 8, 2021 · 14 comments · Fixed by #5892, #6163 or #6910
Assignees
Labels
Affects: Data Issues that affect book/author metadata or user/account data. [managed] Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] Module: Data dumps Priority: 2 Important, as time permits. [managed] Type: Bug Something isn't working. [managed]

Comments

@cdrini
Collaborator

cdrini commented Jul 8, 2021

@cdrini cdrini added Type: Bug Something isn't working. [managed] Priority: 2 Important, as time permits. [managed] Affects: Data Issues that affect book/author metadata or user/account data. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] labels Jul 8, 2021
@cdrini cdrini added this to the Next (proposed) milestone Jul 8, 2021
@cdrini
Collaborator Author

cdrini commented Jul 8, 2021

It looks like they were generated, but the file names look off: there's no date, and the --archive flag, which is used to upload the data dumps, somehow made it into the filename. So I imagine the upload never happened.

On ol-home0:

cd /opt/openlibrary

COMPOSE_FILE="docker-compose.yml:docker-compose.production.yml" \
HOSTNAME="$HOSTNAME" \
docker-compose run \
    -u openlibrary \
    -e "SCRIPTS=/openlibrary/scripts" \
    cron-jobs \
    bash
cd /1/var/tmp/dumps
openlibrary@ol-home0:/1/var/tmp/dumps$ du -sh *
4.0K    ol_cdump_--archive.txt.gz
4.0K    ol_dump_--archive.txt.gz

^ Although the file sizes definitely don't look correct here.
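
For illustration only: a flag can end up in a file name like this when the name is built from a positional argument that is never validated. Below is a minimal sketch of a guard, assuming hypothetical helpers `is_dump_date` and `dump_filename`. Neither is taken from oldump.sh, whose argument handling is not shown in this thread.

```shell
# Hypothetical helpers; not taken from oldump.sh.

# Succeed only if the argument has the shape YYYY-MM-DD.
is_dump_date() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Build "<prefix>_<date>.txt.gz", refusing anything that is not a date
# (for example a flag such as --archive that landed in the date slot).
dump_filename() {
    prefix="$1"   # e.g. ol_cdump
    day="$2"      # expected: YYYY-MM-DD
    if ! is_dump_date "$day"; then
        echo "refusing to build a dump name from '$day'" >&2
        return 1
    fi
    echo "${prefix}_${day}.txt.gz"
}
```

With a check like this, `dump_filename ol_cdump --archive` fails loudly instead of producing `ol_cdump_--archive.txt.gz`.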

@cdrini
Collaborator Author

cdrini commented Jul 8, 2021

Going to follow the steps here to generate a new dump: #4621 (comment)

Had to remove the old files manually, since it was complaining about permissions: sudo rm -rf /1/var/tmp/dumps/ol_*

Then (from the above comment):

tmux
# ---
cd /opt/openlibrary

COMPOSE_FILE="docker-compose.yml:docker-compose.production.yml" \
HOSTNAME="$HOSTNAME" \
docker-compose run \
    -u openlibrary \
    -e "SCRIPTS=/openlibrary/scripts" \
    cron-jobs \
    bash
# ---
cd /1/var/tmp
/olsystem/bin/cron/oldump.sh ol-db1 openlibrary "$(date -d yesterday '+%Y-%m-%d')" --archive

Waiting...

@cdrini
Collaborator Author

cdrini commented Jul 9, 2021

New dump created 👍

@cdrini cdrini self-assigned this Aug 2, 2021
@cdrini cdrini modified the milestones: Next (proposed), Active Sprint Aug 2, 2021
@cdrini cdrini removed their assignment Aug 2, 2021
@cdrini cdrini modified the milestones: Active Sprint, Next (proposed) Aug 2, 2021
@cdrini
Collaborator Author

cdrini commented Aug 10, 2021

Kicked off manual dump; see #3989 (comment)

@cdrini
Collaborator Author

cdrini commented Aug 11, 2021

Errored; restarted. See #3989 (comment)

@jimman2003
Contributor

The dump is still not up :)

@cdrini
Collaborator Author

cdrini commented Sep 14, 2021

Ooops! Auto-closed by a PR

@cclauss cclauss reopened this Feb 24, 2022
@mekarpeles mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] Priority: 2 Important, as time permits. [managed] Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] and removed Priority: 2 Important, as time permits. [managed] Priority: 1 Do this week, receiving emails, time sensitive, . [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] labels Feb 28, 2022
@mekarpeles
Member

This has large refactors of bash code on ol-home0 which need to be checked over and turned into a PR by @cclauss

See: #6253

@mekarpeles
Member

womp womp

@mekarpeles mekarpeles reopened this May 2, 2022
@cclauss
Collaborator

cclauss commented Jun 1, 2022

Problems on 01 June 2022: Generating cdump took 7h28 to process 200M records.

  1. Skipping: data.txt.gz -- Step 3 was skipped because --overwrite cleared the parent directory, not the working directory.
  2. Step 4 processed 200M records and wrote its results into ol_cdump_2022-05-31, but the compgen logic told Step 5 to process a file that starts with ol_cdump_2022-06. No such file existed, so steps 5 and 6 finished almost immediately with nothing to process. :-(
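
A sketch of one way to avoid the 2022-06 vs. 2022-05 mismatch in item 2, assuming the year-month prefix is derived from the dump date argument rather than from the current date. `month_prefix` and `find_cdump` are hypothetical names; the actual compgen logic in oldump.sh is not shown in this thread.

```shell
# Hypothetical sketch; not the real lookup logic from oldump.sh.

# Derive YYYY-MM from the dump date that was passed in, not from the
# wall clock, so a run for 2022-05-31 that starts on June 1 still
# looks for ol_cdump_2022-05*.
month_prefix() {
    day="$1"            # YYYY-MM-DD
    echo "${day%-*}"    # strip the trailing -DD
}

# Print the first path matching <dir>/ol_cdump_<YYYY-MM>*, or fail loudly
# so that later steps do not silently run on empty input.
find_cdump() {
    dir="$1"
    day="$2"
    for candidate in "$dir"/ol_cdump_"$(month_prefix "$day")"*; do
        if [ -e "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "no cdump found for $day in $dir" >&2
    return 1
}
```

Failing with a non-zero status when nothing matches would also let the calling script stop before Steps 5 and 6 instead of processing nothing.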

ol-home0% docker logs -f openlibrary_cron-jobs_1 2>&1 | grep "dump\|Jun"

[openlibrary.dump] * [Wed Jun  1 00:00:01 UTC 2022] /openlibrary/scripts/oldump.sh 2022-05-31 --archive --overwrite
[openlibrary.dump] * <host:ol-home0.us.archive.org> <user:openlibrary> <dir:/1/var/tmp>
[openlibrary.dump] * Cleaning Up: Found --cleanup, removing old files
> === Step 1 ===
[openlibrary.dump] * generating reading log table: ol_dump_reading-log_2022-05-31.txt.gz
> === Step 2 ===
[openlibrary.dump] * generating ratings table: ol_dump_ratings_2022-05-31.txt.gz
> === Step 3 ===
[openlibrary.dump] * Skipping: data.txt.gz   <-- NOT GOOD!!  --overwrite cleared the parent, not working directory.
> === Step 4 ===
[openlibrary.dump] * generating ol_cdump_2022-05-31.txt.gz -- takes approx. 500 minutes for 192,000,000+ records...
/openlibrary/scripts/oldump.py: Python 3.9.4
generate_cdump(data.txt.gz, 2022-05-31) reading
Wed Jun  1 00:00:22 2022 read_data_file(data.txt.gz, max_lines=all)
Wed Jun  1 00:00:22 2022 0
Wed Jun  1 00:02:58 2022 1,000,000
Wed Jun  1 00:04:58 2022 2,000,000
Wed Jun  1 00:07:01 2022 3,000,000
Wed Jun  1 00:09:26 2022 4,000,000
Wed Jun  1 00:12:05 2022 5,000,000
    [ ... ]
Wed Jun  1 07:20:41 2022 196,000,000
Wed Jun  1 07:22:24 2022 197,000,000
Wed Jun  1 07:24:38 2022 198,000,000
Wed Jun  1 07:26:35 2022 199,000,000
Wed Jun  1 07:28:45 2022 200,000,000
[openlibrary.dump] * generated  <-- PROBLEMS START HERE!!  2022-06 vs. 2022-05 issue.
> === Step 5 ===
generating the dump -- takes approx. 485 minutes for 173,000,000+ records...
/openlibrary/scripts/oldump.py: Python 3.9.4
> === Step 6 ===
/openlibrary/scripts/oldump.py: Python 3.9.4
Wed Jun  1 07:29:19 2022 read_tsv() reading <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
Wed Jun  1 07:29:19 2022 splitting stdin
Wed Jun  1 07:29:19 2022 sorting /1/var/tmp/oldumpsort/00.txt.gz
Wed Jun  1 07:29:19 2022 sorting /1/var/tmp/oldumpsort/01.txt.gz
Wed Jun  1 07:29:19 2022 sorting /1/var/tmp/oldumpsort/02.txt.gz
Wed Jun  1 07:29:19 2022 sorting /1/var/tmp/oldumpsort/03.txt.gz
Wed Jun  1 07:29:19 2022 sorting /1/var/tmp/oldumpsort/04.txt.gz
    [ ... ]
Wed Jun  1 07:29:21 2022 sorting /1/var/tmp/oldumpsort/fb.txt.gz
Wed Jun  1 07:29:21 2022 sorting /1/var/tmp/oldumpsort/fc.txt.gz
Wed Jun  1 07:29:21 2022 sorting /1/var/tmp/oldumpsort/fd.txt.gz
Wed Jun  1 07:29:21 2022 sorting /1/var/tmp/oldumpsort/fe.txt.gz
Wed Jun  1 07:29:21 2022 sorting /1/var/tmp/oldumpsort/ff.txt.gz
splitting the dump: ol_dump_%s_2022-05-31.txt.gz -- takes approx. 85 minutes for 68,000,000+ records...
/openlibrary/scripts/oldump.py: Python 3.9.4
[openlibrary.dump] * dumps are generated at /1/var/tmp/dumps
drwxr-xr-x 2 openlibrary openlibrary 4.0K Jun  1 07:29 ol_cdump_2022-05-31
drwxr-xr-x 2 openlibrary openlibrary 4.0K Jun  1 07:29 ol_dump_2022-05-31
./ol_cdump_2022-05-31:
-rw-r--r-- 1 openlibrary openlibrary 26G Jun  1 07:29 ol_cdump_2022-05-31.txt.gz
./ol_dump_2022-05-31:
-rw-r--r-- 1 openlibrary openlibrary   20 Jun  1 07:29 ol_dump_2022-05-31.txt.gz
-rw-r--r-- 1 openlibrary openlibrary   56 Jun  1 07:29 ol_dump_authors_2022-05-31.txt.gz
-rw-r--r-- 1 openlibrary openlibrary   57 Jun  1 07:29 ol_dump_editions_2022-05-31.txt.gz
-rw-r--r-- 1 openlibrary openlibrary 2.5M Jun  1 00:00 ol_dump_ratings_2022-05-31.txt.gz
-rw-r--r-- 1 openlibrary openlibrary  33M Jun  1 00:00 ol_dump_reading-log_2022-05-31.txt.gz
-rw-r--r-- 1 openlibrary openlibrary   58 Jun  1 07:29 ol_dump_redirects_2022-05-31.txt.gz
-rw-r--r-- 1 openlibrary openlibrary   54 Jun  1 07:29 ol_dump_works_2022-05-31.txt.gz
[openlibrary.dump] * ia version is v2.3.0
ol_dump_2022-05-31:
 uploading /ol_dump_authors_2022-05-31.txt.gz: 100%|██████████| 1/1 [00:00<00:00, 32.94MiB/s]
 uploading /ol_dump_2022-05-31.txt.gz: 100%|██████████| 1/1 [00:00<00:00, 39.22MiB/s]
 uploading /ol_dump_editions_2022-05-31.txt.gz: 100%|██████████| 1/1 [00:00<00:00, 39.23MiB/s]
 uploading /ol_dump_works_2022-05-31.txt.gz: 100%|██████████| 1/1 [00:00<00:00, 34.32MiB/s]
 uploading /ol_dump_ratings_2022-05-31.txt.gz: 100%|██████████| 3/3 [00:00<00:00, 30.41MiB/s]
 uploading /ol_dump_redirects_2022-05-31.txt.gz: 100%|██████████| 1/1 [00:00<00:00, 43.77MiB/s]
 uploading /ol_dump_reading-log_2022-05-31.txt.gz: 100%|██████████| 33/33 [00:04<00:00,  6.80MiB/s]
ol_cdump_2022-05-31:
 uploading /ol_cdump_2022-05-31.txt.gz: 100%|██████████| 25686/25686 [06:53<00:00, 62.12MiB/s] ]
[openlibrary.dump] * Skipping sitemap
Wed Jun 1 07:37:17 UTC 2022: openlibrary has completed /openlibrary/scripts/oldump.sh 2022-05-31 --archive --overwrite in /1/var/tmp on ol-home0.us.archive.org
deleting the data table dump
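
The listing above shows files of 20 to 58 bytes being uploaded. A minimal sketch of a pre-upload sanity check follows, assuming a hypothetical `dump_looks_complete` helper and an arbitrary 1024-byte threshold. Neither comes from oldump.sh.

```shell
# Hypothetical pre-upload check; the name and threshold are assumptions.
# A gzip file of only a few dozen bytes holds essentially no data.
MIN_DUMP_BYTES=1024

# Succeed only if the file exists and is at least MIN_DUMP_BYTES long.
dump_looks_complete() {
    file="$1"
    [ -f "$file" ] || return 1
    # Arithmetic expansion strips any padding some wc versions emit.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -ge "$MIN_DUMP_BYTES" ]
}
```

With this threshold, the 20-byte ol_dump_2022-05-31.txt.gz would be rejected before upload, while the 2.5M ratings file would pass.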

@mekarpeles
Member

☑️

4 participants