
Decrease size of HistoricalLink table #3008

Open
rebeccacremona opened this issue Jan 5, 2022 · 1 comment
Labels
database Subcategory housekeeping Issue type priority-low Priority hint

Comments

@rebeccacremona
Contributor

We use Django Simple History to track changes on a number of our models. The historical link table is huge:

+---------------------------------------+--------+-------+-------+------------+---------+
| CONCAT(table_schema, '.', table_name) | rows   | DATA  | idx   | total_size | idxfrac |
+---------------------------------------+--------+-------+-------+------------+---------+
| perma.perma_historicallink            | 18.76M | 5.66G | 2.30G | 7.96G      |    0.41 |
| perma.perma_link                      | 2.16M  | 0.81G | 0.92G | 1.73G      |    1.14 |
| perma.django_session                  | 2.83M  | 1.02G | 0.20G | 1.21G      |    0.20 |
| perma.perma_uncaughterror             | 0.81M  | 1.01G | 0.02G | 1.03G      |    0.02 |
| perma.perma_historicallinkuser        | 2.64M  | 0.54G | 0.45G | 1.00G      |    0.83 |
| perma.perma_capture                   | 5.34M  | 0.73G | 0.21G | 0.94G      |    0.28 |
| perma.perma_historicalregistrar       | 1.82M  | 0.44G | 0.08G | 0.53G      |    0.19 |
| perma.perma_capturejob                | 1.77M  | 0.31G | 0.21G | 0.52G      |    0.69 |
| perma.perma_historicalorganization    | 1.73M  | 0.15G | 0.15G | 0.30G      |    1.00 |
| perma.perma_link_folders              | 2.33M  | 0.14G | 0.14G | 0.28G      |    1.06 |
...

The routine creation of a link produces 3 rows, one of which is a duplicate that I think clean_duplicate_history would remove.

Maybe let's try running that, and see how it goes? The table is so big, I'm not sure it will run painlessly, but we can hope. We probably want to run it regularly, maybe as an hourly/daily celerybeat task.

Also, Jack says, "seems totally viable to clean_old_history with some long time window, too."
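If we do end up scheduling these cleanups, a beat entry along these lines might work. This is only a sketch: the task paths under perma.celery_tasks are assumed placeholders, and the intervals are illustrative, not decided.

```python
# Hypothetical celerybeat schedule for periodic history cleanup.
# Task names are assumed placeholders, not real Perma task paths;
# intervals are illustrative only.
CELERY_BEAT_SCHEDULE = {
    "clean-duplicate-link-history": {
        "task": "perma.celery_tasks.clean_link_history",  # assumed task path
        "schedule": 60 * 60 * 24,  # daily, in seconds
    },
    "clean-old-link-history": {
        "task": "perma.celery_tasks.clean_old_link_history",  # assumed task path
        "schedule": 60 * 60 * 24 * 7,  # weekly, with a long retention window
    },
}
```

Whether the dedupe runs hourly or daily probably depends on how long a pass over a day's worth of links takes in practice.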

@rebeccacremona
Contributor Author

rebeccacremona commented Jan 12, 2022

We're trying a customized version of the duplicate-cleanup command on a clone of the prod db, to see how much it reduces the final size and how long it takes on a db where nothing else is happening. For this experiment, we're running it as a fab task in 4 tmux sessions on one minion; in real life we would probably run it on the IA minions, maybe mediated by celery, maybe not? TBD.

@task
def clean_duplicate_link_history(start_date=None, end_date=None, batch_size=None, dry_run=False):
    """
    One-time task, to clean duplicate history in our 9.3 GB, never-before-de-duped historical links table.
    Based on the Django Simple History management command, but takes a start and end date, and should use
    less memory. Adapted from https://github.com/jazzband/django-simple-history/blob/master/simple_history/management/commands/clean_duplicate_history.py#L87

    start_date and end_date are in the format YYYY-MM-DD

    Arguments should be strings, e.g.
    fab dev.clean_duplicate_link_history:batch_size="10",dry_run="True"

    """
    from datetime import datetime
    from django.db import transaction
    from tqdm import tqdm
    from perma.models import Link
    import pytz
    import re
    from simple_history.utils import get_history_manager_for_model

    #
    # Format args
    #
    if not start_date:
        # use first archive date
        start_datetime = Link.objects.order_by('creation_timestamp')[0].creation_timestamp
    elif re.match(r'^\d\d\d\d-\d\d-\d\d$', start_date):
        start_datetime = pytz.utc.localize(datetime.strptime(start_date, "%Y-%m-%d"))
    else:
        print("start_date must be in the format YYYY-MM-DD")
        return

    if not end_date:
        end_datetime = pytz.utc.localize(datetime.utcnow())  # utcnow, so a non-UTC server clock isn't mislabeled as UTC
    elif re.match(r'^\d\d\d\d-\d\d-\d\d$', end_date):
        end_datetime = pytz.utc.localize(datetime.strptime(end_date, "%Y-%m-%d"))
    else:
        print("end_date must be in the format YYYY-MM-DD")
        return

    if batch_size:
        batch_size = int(batch_size)

    # fab passes arguments as strings, and bool("False") is True, so parse explicitly
    dry_run = str(dry_run).lower() in ("true", "1")


    #
    # retrieve link ids and batch size
    #
    links = Link.objects.all_with_deleted().filter(
        creation_timestamp__gte=start_datetime,
        creation_timestamp__lt=end_datetime
    )
    if batch_size:
        links = links[:batch_size]

    to_dedupe = links.count()
    if not to_dedupe:
        print("No links in batch.")
        return

    #
    # de-dupe history
    #
    print(f"De-duping the history of {to_dedupe} links.")
    historical_link_manager = get_history_manager_for_model(Link)
    for guid in tqdm(links.values_list('guid', flat=True)):
        all_history = historical_link_manager.filter(guid=guid)
        first_record = all_history.first()
        if not first_record:
            print(f"No history for {guid}")
            continue

        with transaction.atomic():
            to_delete = []
            f1 = first_record
            for f2 in all_history[1:]:
                delta = f1.diff_against(f2)
                if not delta.changed_fields:
                    to_delete.append(f1.pk)
                f1 = f2
            if not dry_run:
                historical_link_manager.filter(pk__in=to_delete).delete()
            print(f"{'Found' if dry_run else 'Deleted'} {len(to_delete)} duplicate entries for {guid}.")
