Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3437 improve handling of slow archives #4090

Conversation

ThomasBrady
Copy link
Contributor

@ThomasBrady ThomasBrady commented Dec 13, 2023

Description

Resolves #3437

  • Exposes config to control number of retries when fetching archives
  • Adds metrics for retries and throughput of files downloaded
  • Logs what commands are run as part of GetRemoteFileWork

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document
Screenshot 2024-01-02 at 6 45 24 PM

@ThomasBrady ThomasBrady marked this pull request as draft December 13, 2023 19:40
@ThomasBrady ThomasBrady marked this pull request as ready for review December 13, 2023 21:50
@ThomasBrady ThomasBrady force-pushed the 3437-improve-handling-of-slow-archives branch from edb4496 to 0987437 Compare December 14, 2023 01:28
Copy link
Contributor

@SirTyson SirTyson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off to a good start! I think it may be useful to step up a level and do the metrics and timings in GetHistoryArchiveStateWork. What you could do is still set the number of retries for GetHistoryArchiveStateWork. In the else if (state == State::WORK_FAILURE && archive) branch after you see a failure, you can then update the metrics with the time and count. This would also avoid noisy metrics GetAndUnzipRemoteFileWork, as I think we're only interested in GetHistoryArchiveStateWork (I'm not sure on this point though and need to double check).

src/work/BasicWork.h Outdated Show resolved Hide resolved
src/main/Config.h Outdated Show resolved Hide resolved
src/test/test.cpp Outdated Show resolved Hide resolved
docs/stellar-core_example.cfg Outdated Show resolved Hide resolved
docs/metrics.md Outdated Show resolved Hide resolved
src/catchup/CatchupWork.cpp Outdated Show resolved Hide resolved
src/history/test/HistoryTests.cpp Outdated Show resolved Hide resolved
src/historywork/GetRemoteFileWork.cpp Outdated Show resolved Hide resolved
src/ledger/readme.md Show resolved Hide resolved
src/historywork/GetRemoteFileWork.cpp Outdated Show resolved Hide resolved
src/historywork/GetRemoteFileWork.h Outdated Show resolved Hide resolved
docs/stellar-core_example.cfg Outdated Show resolved Hide resolved
src/catchup/CatchupWork.cpp Outdated Show resolved Hide resolved
src/historywork/GetRemoteFileWork.cpp Outdated Show resolved Hide resolved
src/historywork/GetRemoteFileWork.cpp Outdated Show resolved Hide resolved
src/historywork/GetRemoteFileWork.cpp Outdated Show resolved Hide resolved
src/historywork/GetRemoteFileWork.h Outdated Show resolved Hide resolved
src/historywork/GetRemoteFileWork.cpp Outdated Show resolved Hide resolved
src/historywork/GetRemoteFileWork.cpp Outdated Show resolved Hide resolved
@ThomasBrady ThomasBrady force-pushed the 3437-improve-handling-of-slow-archives branch from 0987437 to 39cc61c Compare December 21, 2023 19:07
@SirTyson
Copy link
Contributor

I’m not sure if the semantics of HISTORY_ARCHIVE_RETRY_LIMIT are quite right. Currently, if we try to download from an archive and it fails, we try another archive in a round robin fashion. I don’t know if the HISTORY_ARCHIVE_RETRY_LIMIT would be better as a config that determines how many times you try another archive before failing vs. trying to download from the same archive multiple times. Looking at the original issue, I don’t know if this config setting addresses the archive pattern variables that Nico originally mentioned. Sorry about that, I think I lost track of the original issue when going back and forth on the design.

I think what makes the most sense is to keep the metrics in this PR (which we definitely want), but move the config issues to a follow up issue where we can discuss a little more what we want. I agree with Marta’s DownloadWorkWithMetrics suggestion and we can keep the metrics moving. Again, sorry about getting confused on the issue.

@ThomasBrady ThomasBrady force-pushed the 3437-improve-handling-of-slow-archives branch from 39cc61c to bc1c430 Compare December 23, 2023 01:14
Copy link
Contributor

@SirTyson SirTyson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty close! A few minor nits, and I think another class needs to inherit from DownloadWorkWithMetrics.

@@ -0,0 +1,37 @@
// Copyright 2015 Stellar Development Foundation and contributors. Licensed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Copyright should be 2024

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy new year 🎉

@@ -0,0 +1,28 @@
// Copyright 2015 Stellar Development Foundation and contributors. Licensed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: update copyright

@@ -4,7 +4,7 @@

#pragma once

#include "historywork/RunCommandWork.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This include should not have changed.

@@ -13,7 +14,7 @@ namespace stellar
class HistoryArchive;
class GetRemoteFileWork;

class GetAndUnzipRemoteFileWork : public Work
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think GetHistoryArchiveStateWork should also inherit from DownloadWorkWithMetrics.

@ThomasBrady ThomasBrady force-pushed the 3437-improve-handling-of-slow-archives branch 2 times, most recently from 0e39ed1 to 8bd904e Compare January 3, 2024 16:37

namespace stellar
{

class HistoryArchive;
class GetRemoteFileWork;

class GetHistoryArchiveStateWork : public Work
class GetHistoryArchiveStateWork : public DownloadWorkWithMetrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't catch this initially, but it looks like there's an inconsistency here with the mBytesPerSecond metric. The issue is the meter assumes that the download finishes whenever the work that inherits DownloadWorkWithMetrics finishes. For GetHistoryArchiveStateWork, this is fine because the entire class is effectively a wrapper around GetRemoteFileWork and does no additional work. However, GetAndUnzipRemoteFileWork first downloads the file, then unzips it, then finishes. This means that even if the speed of the download is identical, files downloaded via GetAndUnzipRemoteFileWork will always appear to be slower because you count unzip time.

I think the easiest way to fix this is probably just to do your metering in GetRemoteFileWork. There's no non-history related work that uses this, so we can probably just do this metering directly in the class. A wrapper class like DownloadWithMetrics would be useful if GetRemoteFileWork was used for use cases outside of history downloads, but I don't see any. You can mark the bytesPerSecond value in the onSuccess function still. Since we never retry the work directly, you can mark the retry/failure meter in onFailureRaise.

Copy link
Contributor Author

@ThomasBrady ThomasBrady Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, good catch on the unzip time being included in the denominator. Regarding moving the metrics to GetRemoteFileWork, it doesn't make sense to call the metric history.get.retry (as its not exactly a retry when incremented from a failing GetRemoteFileWork), perhaps history.get.failure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

@ThomasBrady ThomasBrady force-pushed the 3437-improve-handling-of-slow-archives branch from 8bd904e to c81c740 Compare January 3, 2024 19:15
Copy link
Contributor

@SirTyson SirTyson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! A nice clean change after all the back and forth.

@marta-lokhova
Copy link
Contributor

r+ c81c740

@latobarita latobarita merged commit 58697e8 into stellar:master Jan 3, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve handling of slow archives
4 participants