
Split MULTI transaction in batches (fix #149) #148

Open · wants to merge 7 commits into master

Commits on Oct 11, 2023

  1. c72ca9d
  2. f5625c4
  3. Split big multi transactions - part 2: ScanSource()

    Remove the variable count: the loop now just iterates, without counting.
    Hence, in the final log, we replace 'count' with 'len(sourceFiles)'.
    
    With this commit, we break up two MULTI transactions, so let's look at
    those in detail:
    
    The first MULTI transaction is:
    - DEL FILES_TMP
    - loop on files:
      - SADD FILES_TMP <path>
    
    I think there's no problem in breaking this into chunks, as we just
    iterate over a temporary key. It doesn't matter if the program is
    interrupted and we leave a partially updated key behind.
    
    Then comes a lone SDIFF FILES FILES_TMP command, which gives us the list
    of files to remove.
    
    The second MULTI transaction is:
    - loop on files:
      - HMSET FILE_<path>
      - publish FILE_UPDATE <path>
    - loop on removed files:
      - DEL FILE_<path>
      - publish FILE_UPDATE <path>
    - RENAME FILES_TMP FILES
    
    I don't think all of that really needs to be in a single MULTI
    transaction; I *think* it's OK to break the two loops into chunks. What
    really matters is that we rename the key FILES_TMP to FILES in the last
    step. (See the sketch below.)
    elboulangero committed Oct 11, 2023 · d67d0b4
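
    A minimal sketch of the chunked approach described above, assuming a
    redigo-style redis.Conn and a hypothetical batch size of 1000 (neither
    detail is taken verbatim from this PR):

    package scan

    import "github.com/gomodule/redigo/redis"

    // rebuildTmpSet rebuilds FILES_TMP in several small MULTI/EXEC rounds
    // instead of one huge transaction. Being interrupted mid-way only leaves
    // a partially filled temporary key behind, which is harmless.
    func rebuildTmpSet(conn redis.Conn, sourceFiles []string) error {
        const batchSize = 1000 // hypothetical value
        if _, err := conn.Do("DEL", "FILES_TMP"); err != nil {
            return err
        }
        for start := 0; start < len(sourceFiles); start += batchSize {
            end := start + batchSize
            if end > len(sourceFiles) {
                end = len(sourceFiles)
            }
            conn.Send("MULTI")
            for _, path := range sourceFiles[start:end] {
                conn.Send("SADD", "FILES_TMP", path)
            }
            if _, err := conn.Do("EXEC"); err != nil {
                return err
            }
        }
        return nil
    }

    The lone SDIFF FILES FILES_TMP that follows, and the final
    RENAME FILES_TMP FILES, stay outside the batches; keeping the rename as
    the last step is what preserves correctness.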
  4. Split big multi transactions - part 3a: Scan() (rework internals)

    Rework how the files are committed to the db.
    
    Before: we'd create a MULTI, then scan. The scan function iterates over
    the scan results and calls ScannerAddFile(), which would send commands to
    Redis. In case of failure, we'd discard the MULTI transaction, remove
    the temporary key, and bail out. In case of success, we'd finally call
    ScannerCommit(), which was just about calling EXEC to execute the MULTI
    transaction.
    
    With this commit, we now keep an internal slice of filedata. Calling
    ScannerAddFile() just adds a filedata to the slice. In case of failure,
    it's easier: we can just return. In case of success, it's now the
    ScannerCommit() function that does the bulk of the job: send a MULTI
    command, then iterate over the files to enqueue all the commands, and
    finally EXEC.
    
    This change of behaviour is needed for what comes next: breaking the
    MULTI transaction into chunks. (A rough sketch of the new pattern follows
    below.)
    elboulangero committed Oct 11, 2023 · cf806da
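
    A rough sketch of the accumulate-then-commit pattern described above. The
    type and field names (filedata, scanner, path, size) and the redigo-style
    connection are assumptions for illustration, not the exact code from this
    PR:

    package scan

    import "github.com/gomodule/redigo/redis"

    // filedata is a hypothetical stand-in for the per-file record kept by
    // the scanner.
    type filedata struct {
        path string
        size int64
    }

    type scanner struct {
        conn  redis.Conn
        files []filedata // scan results accumulate here, committed later
    }

    // ScannerAddFile no longer talks to Redis: it only records the file in
    // the internal slice. On failure, the caller can simply return; there is
    // no pending MULTI transaction to discard.
    func (s *scanner) ScannerAddFile(f filedata) {
        s.files = append(s.files, f)
    }

    // ScannerCommit now does the bulk of the work: open a MULTI transaction,
    // enqueue the commands for every file, then EXEC them all at once.
    func (s *scanner) ScannerCommit() error {
        s.conn.Send("MULTI")
        for _, f := range s.files {
            s.conn.Send("HMSET", "FILE_"+f.path, "size", f.size)
            s.conn.Send("PUBLISH", "FILE_UPDATE", f.path)
        }
        _, err := s.conn.Do("EXEC")
        return err
    }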
  5. Split big multi transactions - part 3b: Scan() (drop s.count)

    We don't need to maintain a counter, as we keep a slice with the files
    returned by the scan, so len() does the job.
    elboulangero committed Oct 11, 2023 · 7d0fc28
  6. Split big multi transactions - part 3c: Scan() (split multi)

    With this commit we split two big MULTI transactions into chunks. Let's
    have a look at those in detail.
    
    One is about deleting the files that need to be removed. I don't think it
    really matters whether it's all done at once or in several transactions.
    
    The other, in ScannerCommit(), is about committing all the files that were
    returned by the scan. Once again, I have the impression that it doesn't
    really matter whether it's done all at once or the transaction is split
    into chunks. (A sketch of the chunked commit follows below.)
    elboulangero committed Oct 11, 2023 · e286632
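
    Building on the scanner sketch after part 3a, the chunked variant of
    ScannerCommit() could look roughly like this (batch size and field names
    are again assumptions):

    // ScannerCommit, chunked: the same commands are enqueued, but EXEC is
    // issued every batchSize files instead of once at the very end.
    func (s *scanner) ScannerCommit() error {
        const batchSize = 1000 // hypothetical value
        for start := 0; start < len(s.files); start += batchSize {
            end := start + batchSize
            if end > len(s.files) {
                end = len(s.files)
            }
            s.conn.Send("MULTI")
            for _, f := range s.files[start:end] {
                s.conn.Send("HMSET", "FILE_"+f.path, "size", f.size)
                s.conn.Send("PUBLISH", "FILE_UPDATE", f.path)
            }
            if _, err := s.conn.Do("EXEC"); err != nil {
                return err
            }
        }
        return nil
    }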

Commits on Oct 19, 2023

  1. Minor improvements in logs

    Looking at the logs when the source is scanned:
    
    scan.go:494  2023/10/10 06:18:58.633 UTC [source] Scanning the filesystem...
    scan.go:512  2023/10/10 06:19:22.079 UTC [source] Indexing the files...
    scan.go:624  2023/10/10 06:19:27.745 UTC [source] Scanned 544001 files
    
    And the logs when a mirror is scanned:
    
    rsync.go:89  2023/10/10 06:13:23.634 UTC [ftp.jaist.ac.jp] Requesting file list via rsync...
    trace.go:129 2023/10/10 06:13:23.979 UTC [ftp.jaist.ac.jp] trace last sync: 2023-10-10 00:00:01 +0000 UTC
    scan.go:221  2023/10/10 06:18:49.781 UTC [ftp.jaist.ac.jp] Indexed 544001 files (544000 known), 0 removed
    
    This commit brings two minor improvements:
    * log the number of files that were removed for the source, similar to how it's
      done with mirrors.
    * add a log "Indexing the files" after the mirror scan returns, similar to how
      it's done for the source.
    elboulangero committed Oct 19, 2023 · 7c1e44e