
Decrease size of HistoricalLink table #3008

Open
rebeccacremona opened this issue Jan 5, 2022 · 1 comment
Labels
database Subcategory housekeeping Issue type priority-low Priority hint

Comments

@rebeccacremona
Contributor

We use Django Simple History to track changes on a number of our models. The historical link table is huge:

+---------------------------------------+--------+-------+-------+------------+---------+
| CONCAT(table_schema, '.', table_name) | rows   | DATA  | idx   | total_size | idxfrac |
+---------------------------------------+--------+-------+-------+------------+---------+
| perma.perma_historicallink            | 18.76M | 5.66G | 2.30G | 7.96G      |    0.41 |
| perma.perma_link                      | 2.16M  | 0.81G | 0.92G | 1.73G      |    1.14 |
| perma.django_session                  | 2.83M  | 1.02G | 0.20G | 1.21G      |    0.20 |
| perma.perma_uncaughterror             | 0.81M  | 1.01G | 0.02G | 1.03G      |    0.02 |
| perma.perma_historicallinkuser        | 2.64M  | 0.54G | 0.45G | 1.00G      |    0.83 |
| perma.perma_capture                   | 5.34M  | 0.73G | 0.21G | 0.94G      |    0.28 |
| perma.perma_historicalregistrar       | 1.82M  | 0.44G | 0.08G | 0.53G      |    0.19 |
| perma.perma_capturejob                | 1.77M  | 0.31G | 0.21G | 0.52G      |    0.69 |
| perma.perma_historicalorganization    | 1.73M  | 0.15G | 0.15G | 0.30G      |    1.00 |
| perma.perma_link_folders              | 2.33M  | 0.14G | 0.14G | 0.28G      |    1.06 |
...

The routine creation of a link produces 3 rows, one of which is a duplicate that I think clean_duplicate_history would remove.

Maybe let's try running that, and see how it goes? The table is so big, I'm not sure it will run painlessly, but we can hope. We probably want to run it regularly, maybe as an hourly/daily celerybeat task.

Also, Jack says, "seems totally viable to clean_old_history with some long time window, too."
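If we do end up scheduling these cleanups, a beat entry along these lines might work. This is only a sketch: the task paths under perma.celery_tasks are assumed placeholders, and the intervals are illustrative, not decided.

```python
# Hypothetical celerybeat schedule for periodic history cleanup.
# Task names are assumed placeholders, not real Perma task paths;
# intervals are illustrative only.
CELERY_BEAT_SCHEDULE = {
    "clean-duplicate-link-history": {
        "task": "perma.celery_tasks.clean_link_history",  # assumed task path
        "schedule": 60 * 60 * 24,  # daily, in seconds
    },
    "clean-old-link-history": {
        "task": "perma.celery_tasks.clean_old_link_history",  # assumed task path
        "schedule": 60 * 60 * 24 * 7,  # weekly, with a long retention window
    },
}
```

Whether the dedupe runs hourly or daily probably depends on how long a pass over a day's worth of links takes in practice.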

@rebeccacremona
Contributor Author

rebeccacremona commented Jan 12, 2022

We're trying a customized version of the duplicate-cleanup command on a clone of the prod db, to see how much it reduces the final size and how long it takes on a db where nothing else is happening. For this experiment, we're running it as a fab task in 4 tmux sessions on one minion; in real life we would probably run it on the IA minions, maybe mediated by celery, maybe not? TBD.

@task
def clean_duplicate_link_history(start_date=None, end_date=None, batch_size=None, dry_run=False):
    """
    One-time task, to clean duplicate history in our 9.3 GB, never-before-de-duped historical links table.
    Based on the Django Simple History management command, but takes a start and end date, and should use
    less memory. Adapted from https://github.com/jazzband/django-simple-history/blob/master/simple_history/management/commands/clean_duplicate_history.py#L87

    start_date and end_date are in the format YYYY-MM-DD

    Arguments should be strings, e.g.
    fab dev.clean_duplicate_link_history:batch_size="10",dry_run="True"

    """
    from datetime import datetime
    from django.db import transaction
    from tqdm import tqdm
    from perma.models import Link
    import pytz
    import re
    from simple_history.utils import get_history_manager_for_model

    #
    # Format args
    #
    if not start_date:
        # use first archive date
        start_datetime = Link.objects.order_by('creation_timestamp')[0].creation_timestamp
    elif re.match(r'^\d\d\d\d-\d\d-\d\d$', start_date):
        start_datetime = pytz.utc.localize(datetime.strptime(start_date, "%Y-%m-%d"))
    else:
        print("start_date must be in the format YYYY-MM-DD")
        return

    if not end_date:
        end_datetime = pytz.utc.localize(datetime.utcnow())  # utcnow, so a non-UTC server clock isn't mislabeled as UTC
    elif re.match(r'^\d\d\d\d-\d\d-\d\d$', end_date):
        end_datetime = pytz.utc.localize(datetime.strptime(end_date, "%Y-%m-%d"))
    else:
        print("end_date must be in the format YYYY-MM-DD")
        return

    if batch_size:
        batch_size = int(batch_size)

    # fab passes arguments as strings, and bool("False") is True, so parse explicitly
    dry_run = str(dry_run).lower() in ("true", "1")


    #
    # retrieve link ids and batch size
    #
    links = Link.objects.all_with_deleted().filter(
        creation_timestamp__gte=start_datetime,
        creation_timestamp__lt=end_datetime
    )
    if batch_size:
        links = links[:batch_size]

    to_dedupe = links.count()
    if not to_dedupe:
        print("No links in batch.")
        return

    #
    # de-dupe history
    #
    print(f"De-duping the history of {to_dedupe} links.")
    historical_link_manager = get_history_manager_for_model(Link)
    for guid in tqdm(links.values_list('guid', flat=True)):
        all_history = historical_link_manager.filter(guid=guid)
        first_record = all_history.first()
        if not first_record:
            print(f"No history for {guid}")
            continue

        with transaction.atomic():
            to_delete = []
            f1 = first_record
            for f2 in all_history[1:]:
                delta = f1.diff_against(f2)
                if not delta.changed_fields:
                    to_delete.append(f1.pk)
                f1 = f2
            if not dry_run:
                historical_link_manager.filter(pk__in=to_delete).delete()
            print(f"{'Found' if dry_run else 'Deleted'} {len(to_delete)} duplicate entries for {guid}.")
