Audit ingestion lock critical section #3474

Closed · tamirms opened this issue Mar 18, 2021 · 10 comments
Labels: horizon, ingest (New ingestion system)

@tamirms
Contributor

tamirms commented Mar 18, 2021

Horizon supports a mode of distributed ingestion where a SELECT FOR UPDATE lock is used to facilitate leader election. Once a Horizon instance holds the lock, it attempts to ingest the next available ledger sequence from stellar core. Afterwards, the horizon instance releases the lock and the next round of leader election begins.
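
To make the discussion concrete, here is a minimal sketch (in Go) of that acquire-ingest-release cycle; the `ingest_lock` table name and the function signature are illustrative, not Horizon's actual schema or code:

```go
// Minimal sketch of the SELECT FOR UPDATE leader-election cycle described
// above. The ingest_lock table is illustrative, not Horizon's actual schema.
package ingestion

import (
	"context"
	"database/sql"
)

func ingestNextLedger(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	// Rolling back after a successful commit is a no-op; this guarantees the
	// lock is released if anything below fails.
	defer tx.Rollback()

	// Leader election: only one Horizon instance at a time can hold this row
	// lock; the others block here until it is released.
	if _, err := tx.ExecContext(ctx, "SELECT 1 FROM ingest_lock FOR UPDATE"); err != nil {
		return err
	}

	// ... ingest the next available ledger and write the results inside tx ...

	// Committing (or rolling back) releases the lock and starts the next
	// round of leader election.
	return tx.Commit()
}
```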

We should audit the code path that is executed when Horizon has the SELECT FOR UPDATE lock. In particular, we should make sure that all asynchronous or I/O-bound operations have appropriate timeouts. Otherwise, such an operation could block indefinitely, causing Horizon ingestion to stall.

We may exempt operations on the Horizon db from having timeouts because there might be cases where we are ingesting very large ledgers and we expect those operations to be time-consuming.

See #2570

tamirms added the ingest (New ingestion system) label Mar 18, 2021
tamirms self-assigned this Mar 24, 2021
@tamirms
Contributor Author

tamirms commented Mar 24, 2021

The horizon ingestion system interacts with 3 components to perform I/O operations:

  1. history archives
  2. ledger backend
  3. horizon postgres db

There are no timeouts set on the requests to the history archives. It is possible that a request to the history archive hangs for a prolonged period of time while the ingestion lock is held. Unfortunately, it's not easy to configure timeouts for the history archive requests because the primary operation we are doing is streaming large files from S3.

Although the lack of timeouts on history archive streaming operations is problematic, we perform history archive ingestion in rare circumstances (when the horizon db is empty or when we bump the version of the ingestion system), typically when starting the horizon ingestion service. If you run more than one horizon cluster, you can apply upgrades to the inactive cluster and, once its history archive ingestion is complete, swap that cluster in so it becomes active.


The ledger backend represents the source of ledgers which are fed into the horizon ingestion system. The ledger backend can be configured to be one of two implementations: DatabaseBackend or CaptiveCoreBackend. The DatabaseBackend obtains new ledgers by querying the Stellar Core DB. The CaptiveCoreBackend obtains new ledgers by running Stellar Core as a subprocess and sharing them with horizon via a filesystem pipe.

There are no timeouts set on the DatabaseBackend operations but it should be relatively easy to add timeouts to those operations.

The CaptiveCoreBackend does not require us to add timeouts because it loads new ledgers in a separate goroutine from the main ingestion system thread. When the goroutine has finished reading a ledger from the filesystem pipe, it pushes the ledger onto a channel which is shared with the ingestion system. This design works well for us because it allows us to release the ingestion lock if there are no ledgers available in the buffer.
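
For illustration, a rough sketch of that producer/consumer shape might look like the following; the type and method names are made up for this example rather than taken from the actual CaptiveCoreBackend code:

```go
// Rough sketch of the producer/consumer shape described above. The type and
// method names are illustrative, not the actual CaptiveCoreBackend API.
package captivecore

type LedgerCloseMeta struct{ Sequence uint32 }

type backend struct {
	// buffered channel shared between the reader goroutine and ingestion
	ledgers chan LedgerCloseMeta
}

// run is the goroutine that keeps draining ledgers from the filesystem pipe.
func (b *backend) run(readLedgerFromPipe func() (LedgerCloseMeta, error)) {
	for {
		meta, err := readLedgerFromPipe()
		if err != nil {
			close(b.ledgers)
			return
		}
		b.ledgers <- meta
	}
}

// GetLedger returns immediately when the buffer is empty, so the caller can
// release the ingestion lock instead of blocking on ledger I/O.
func (b *backend) GetLedger() (LedgerCloseMeta, bool) {
	select {
	case meta, ok := <-b.ledgers:
		return meta, ok
	default:
		return LedgerCloseMeta{}, false
	}
}
```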


The output of the horizon ingestion system is stored in postgres. During ingestion, horizon executes batches of updates which populate postgres with the latest data derived from the incoming ledger. If the database connection for a horizon instance is closed, postgres should roll back the ingestion transaction and release the ingestion lock. Then another node in the horizon cluster will reattempt to ingest the latest ledger.

However, it is possible for a database connection to not be closed when a machine has crashed or rebooted. In the event that the horizon node which has the ingestion lock crashes or reboots, the lock will not be released and ingestion won't be able to fail over to the other nodes in the horizon cluster. To prevent this scenario we can configure the keep alive settings of the postgres server to detect dead client connections. Once the dead client is detected postgres should rollback the ingestion transaction and thereby release the ingestion lock.

The other case is that the postgres machine crashes or reboots. In this scenario, ingestion is completely blocked until postgres becomes available again. Once postgres is up again, the horizon nodes might still think the old connections are valid and not be able to resume ingestion. The ideal solution to this problem would be to configure the keep alive settings on the horizon side to detect when connections to postgres are dead.

@tamirms
Contributor Author

tamirms commented Mar 25, 2021

As an alternative to configuring keep alive settings on the horizon side, we could add timeouts to each of the db queries and statements executed by horizon while it has the ingestion lock (see #2570). This would mitigate the scenario where the postgres machine reboots.
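
For example, a statement-level timeout could be approximated with a context deadline around each query; this is only a sketch of the idea, with an illustrative helper name and signature, not the actual change proposed in #2570:

```go
package ingestion

import (
	"context"
	"database/sql"
	"time"
)

// execWithTimeout wraps a single statement in its own context deadline so a
// hung postgres connection cannot stall ingestion indefinitely.
func execWithTimeout(ctx context.Context, tx *sql.Tx, timeout time.Duration, query string, args ...interface{}) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	_, err := tx.ExecContext(ctx, query, args...)
	return err
}
```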

@bartekn
Contributor

bartekn commented Mar 30, 2021

Great analysis, thanks @tamirms!

Although the lack of timeouts on history archive streaming operations is problematic, we perform history archive ingestion in rare circumstances (when the horizon db is empty or when we bump the version of the ingestion system), typically when starting the horizon ingestion service.

I agree we can ignore this case, also because ledger-by-ledger ingestion is stopped while state ingestion is performed.

When it comes to Horizon DB:

  • I think support/db: Add timeout on horizon db ingestion statements #2570 makes sense, but I'm afraid of three things:
    • In the worst-case scenario, every query finishes just before the timeout, which means the cumulative ingestion time will still be long.
    • It can be hard to find a universal timeout that would prevent locking issues across different organizations. For example, at SDF we use a pretty high-spec DB so we can set smaller timeouts, but somebody else may run an average machine where those timeouts won't work. The solution could be to set a larger timeout, but then we're vulnerable to the situation explained in the previous point.
    • If the DB is overloaded, query times can be longer. In that situation we'd rather continue ingestion at a slower rate than stop it completely due to timeouts.
  • I'm thinking the TCP keepalive idea is much better because it kicks in exactly when connection issues arise. I did some research on this solution and it's super easy to set on the DB side (tcp_keepalives_* in the Postgres config) but not so easy in Go, because the package we currently use (lib/pq) doesn't allow setting the value.

I was also thinking about two issues not listed here:

  • If the Horizon <-> DB connection and the DB itself behave correctly but the ingesting instance is slow due to CPU usage etc., the ideas here won't help. I wonder whether https://github.com/cirello-io/pglock, suggested by @2opremio, would help in such a case.
  • In the case of two ingesting instances where one has persistent connection issues, it would be great to have a sleep period (maybe with exponential backoff?) during which the problematic instance does not try to acquire the lock. This is tricky because we need to know whether there is more than one ingesting instance (we shouldn't sleep if there's just one) and whether it's really a problem with a single instance rather than, for example, the DB.

To sum it up, I think in the short term we should probably:

  • Set TCP keepalive on the DB (staging first, to test). In Horizon we can add a flag that, when enabled (disabled by default), checks whether TCP keepalive is set on the DB and, if not, exits with an error (a rough sketch of such a check follows this list). This will allow admins to make sure the DB config is correct when using distributed ingestion.
  • On the client side, it looks like support/db: Add timeout on horizon db ingestion statements #2570 is the best option on the table currently. We just need to make sure the timeout is reasonable.
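
As a rough sketch of the startup check mentioned in the first bullet (the function name and error wording are illustrative), Horizon could query pg_settings and refuse to run distributed ingestion if the server-side keepalive settings are left at 0, which means "use the operating system default":

```go
package ingestion

import (
	"context"
	"database/sql"
	"fmt"
)

// checkServerKeepalives verifies that the postgres server has TCP keepalives
// configured; a setting of 0 means the operating system default is used.
func checkServerKeepalives(ctx context.Context, db *sql.DB) error {
	rows, err := db.QueryContext(ctx,
		`SELECT name, setting FROM pg_settings
		 WHERE name IN ('tcp_keepalives_idle', 'tcp_keepalives_interval', 'tcp_keepalives_count')`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var name, setting string
		if err := rows.Scan(&name, &setting); err != nil {
			return err
		}
		if setting == "0" {
			return fmt.Errorf("postgres %s is 0 (OS default): configure TCP keepalives before enabling distributed ingestion", name)
		}
	}
	return rows.Err()
}
```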

@tamirms
Contributor Author

tamirms commented Mar 30, 2021

On the client side, it looks like #2570 is the best option on the table currently. We just need to make sure the timeout is reasonable.

@bartekn what about configuring tcp keep alive on the horizon side at the operating system level? The only downside I see is that it requires configuration outside of horizon. If that's a blocker, we could have a goroutine which periodically calls db.Ping() on the database connection and, if there are consecutive failures, cancels a context for the transaction.
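
Something like the following sketch is what I have in mind; the interval, timeout, and failure threshold are placeholders:

```go
package ingestion

import (
	"context"
	"database/sql"
	"time"
)

// pingWatchdog pings the DB on a ticker and cancels the ingestion context
// after several consecutive failures, so the transaction is rolled back and
// the ingestion lock is released for another node to take over.
func pingWatchdog(ctx context.Context, cancelIngestion context.CancelFunc, db *sql.DB) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	failures := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Bound each ping with its own deadline so the watchdog itself
			// cannot hang on a dead connection.
			pingCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
			err := db.PingContext(pingCtx)
			cancel()

			if err == nil {
				failures = 0
				continue
			}
			failures++
			if failures >= 3 {
				cancelIngestion()
				return
			}
		}
	}
}
```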

@2opremio
Contributor

Regarding the history archive timeouts, if the file size can be obtained in advance, we could dynamically adjust the timeout based on it (although maybe we cannot rely on that in general).

@bartekn
Contributor

bartekn commented Mar 30, 2021

@bartekn what about configuring tcp keep alive on the horizon side at the operating system level?

This is a good idea for an internal network but, as you said, it requires some config outside Horizon, which isn't ideal for anyone else. It will also affect all other services running on the host.

goroutine which periodically calls db.Ping() on the database connection

Good idea. I just wonder if, in case of connection issues, db.Ping would also hang indefinitely.

@tamirms
Contributor Author

tamirms commented Mar 30, 2021

Good idea. I just wonder if, in case of connection issues, db.Ping would also hang indefinitely.

We can add a timeout to the ping operation via a context deadline

@bartekn
Contributor

bartekn commented Mar 30, 2021

We can add a timeout to the ping operation via a context deadline

Nice! Maybe we should do that instead of #2570 then?

@tamirms
Contributor Author

tamirms commented Mar 30, 2021

Nice! Maybe we should do that instead of #2570 then?

I think that makes sense

@tamirms
Contributor Author

tamirms commented Mar 30, 2021

@bartekn I just thought of one possible issue with the Ping() approach. When you call Ping() there's no guarantee that the request will happen on the same connection that is used by the ingestion code. I think the pq library maintains a connection pool, so it's possible that the ping operation is performed on a new connection.

My concern is that postgres might reboot without closing its connections but come back online quickly enough to respond to the Ping() heartbeat. In that scenario ingestion is stalled but we aren't able to detect it.
