Audit ingestion lock critical section #3474

Closed · tamirms opened this issue Mar 18, 2021 · 10 comments
Labels: horizon, ingest (New ingestion system)

@tamirms
Contributor

tamirms commented Mar 18, 2021

Horizon supports a mode of distributed ingestion where a SELECT FOR UPDATE lock is used to facilitate leader election. Once a Horizon instance holds the lock, it attempts to ingest the next available ledger sequence from stellar core. Afterwards, the horizon instance releases the lock and the next round of leader election begins.
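
To make the discussion concrete, here is a minimal sketch (in Go) of that acquire-ingest-release cycle; the `ingest_lock` table name and the function signature are illustrative, not Horizon's actual schema or code:

```go
// Minimal sketch of the SELECT FOR UPDATE leader-election cycle described
// above. The ingest_lock table is illustrative, not Horizon's actual schema.
package ingestion

import (
	"context"
	"database/sql"
)

func ingestNextLedger(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	// Rolling back after a successful commit is a no-op; this guarantees the
	// lock is released if anything below fails.
	defer tx.Rollback()

	// Leader election: only one Horizon instance at a time can hold this row
	// lock; the others block here until it is released.
	if _, err := tx.ExecContext(ctx, "SELECT 1 FROM ingest_lock FOR UPDATE"); err != nil {
		return err
	}

	// ... ingest the next available ledger and write the results inside tx ...

	// Committing (or rolling back) releases the lock and starts the next
	// round of leader election.
	return tx.Commit()
}
```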

We should audit the code path that is executed when Horizon has the SELECT FOR UPDATE lock. In particular, we should make sure that all asynchronous or I/O-bound operations have appropriate timeouts. Otherwise, such an operation could block indefinitely, causing Horizon ingestion to stall.

We may exempt operations on the Horizon db from having timeouts because there might be cases where we are ingesting very large ledgers and we expect those operations to be time-consuming.

See #2570

tamirms added the ingest (New ingestion system) label Mar 18, 2021
tamirms self-assigned this Mar 24, 2021
@tamirms
Contributor Author

tamirms commented Mar 24, 2021

The horizon ingestion system interacts with 3 components to perform I/O operations:

  1. history archives
  2. ledger backend
  3. horizon postgres db

There are no timeouts set on the requests to the history archives. It is possible that a request to the history archive hangs for a prolonged period of time while the ingestion lock is held. Unfortunately, it's not easy to configure timeouts for the history archive requests because the primary operation we are doing is streaming large files from S3.

Although the lack of timeouts on history archive streaming operations is problematic, we perform history archive ingestion in rare circumstances (when the horizon db is empty or when we bump the version of the ingestion system), typically when starting the horizon ingestion service. If you run more than one horizon cluster, you can apply upgrades to the inactive cluster and, once its history archive ingestion is complete, swap that cluster in so it becomes active.


The ledger backend represents the source of ledgers which are fed into the horizon ingestion system. The ledger backend can be configured to be one of two implementations: DatabaseBackend or CaptiveCoreBackend. The DatabaseBackend obtains new ledgers by querying the Stellar Core DB. The CaptiveCoreBackend obtains new ledgers by running Stellar Core as a subprocess and sharing them with horizon via a filesystem pipe.

There are no timeouts set on the DatabaseBackend operations but it should be relatively easy to add timeouts to those operations.

The CaptiveCoreBackend does not require us to add timeouts because it loads new ledgers in a separate goroutine from the main ingestion system thread. When the goroutine has finished reading a ledger from the filesystem pipe, it pushes the ledger onto a channel which is shared with the ingestion system. This design works well for us because it allows us to release the ingestion lock if there are no ledgers available in the buffer.
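
For illustration, a rough sketch of that producer/consumer shape might look like the following; the type and method names are made up for this example rather than taken from the actual CaptiveCoreBackend code:

```go
// Rough sketch of the producer/consumer shape described above. The type and
// method names are illustrative, not the actual CaptiveCoreBackend API.
package captivecore

type LedgerCloseMeta struct{ Sequence uint32 }

type backend struct {
	// buffered channel shared between the reader goroutine and ingestion
	ledgers chan LedgerCloseMeta
}

// run is the goroutine that keeps draining ledgers from the filesystem pipe.
func (b *backend) run(readLedgerFromPipe func() (LedgerCloseMeta, error)) {
	for {
		meta, err := readLedgerFromPipe()
		if err != nil {
			close(b.ledgers)
			return
		}
		b.ledgers <- meta
	}
}

// GetLedger returns immediately when the buffer is empty, so the caller can
// release the ingestion lock instead of blocking on ledger I/O.
func (b *backend) GetLedger() (LedgerCloseMeta, bool) {
	select {
	case meta, ok := <-b.ledgers:
		return meta, ok
	default:
		return LedgerCloseMeta{}, false
	}
}
```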


The output of the horizon ingestion system is stored in postgres. During ingestion, horizon executes batches of updates which populate postgres with the latest data derived from the incoming ledger. If the database connection for a horizon instance is closed, postgres should roll back the ingestion transaction and release the ingestion lock. Then another node in the horizon cluster will reattempt to ingest the latest ledger.

However, it is possible for a database connection to not be closed when a machine has crashed or rebooted. In the event that the horizon node which has the ingestion lock crashes or reboots, the lock will not be released and ingestion won't be able to fail over to the other nodes in the horizon cluster. To prevent this scenario we can configure the keep alive settings of the postgres server to detect dead client connections. Once the dead client is detected postgres should rollback the ingestion transaction and thereby release the ingestion lock.

The other case is that the postgres machine crashes or reboots. In this scenario, ingestion is completely blocked until postgres becomes available again. Once postgres is up again, the horizon nodes might still think the old connections are valid and not be able to resume ingestion. The ideal solution to this problem would be to configure the keep alive settings on the horizon side to detect when connections to postgres are dead.

@tamirms
Contributor Author

tamirms commented Mar 25, 2021

As an alternative to configuring keep alive settings on the horizon side, we could add timeouts to each of the db queries and statements executed by horizon while it has the ingestion lock (see #2570). This would mitigate the scenario where the postgres machine reboots.
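
For example, a statement-level timeout could be approximated with a context deadline around each query; this is only a sketch of the idea, with an illustrative helper name and signature, not the actual change proposed in #2570:

```go
package ingestion

import (
	"context"
	"database/sql"
	"time"
)

// execWithTimeout wraps a single statement in its own context deadline so a
// hung postgres connection cannot stall ingestion indefinitely.
func execWithTimeout(ctx context.Context, tx *sql.Tx, timeout time.Duration, query string, args ...interface{}) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	_, err := tx.ExecContext(ctx, query, args...)
	return err
}
```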

@bartekn
Contributor

bartekn commented Mar 30, 2021

Great analysis, thanks @tamirms!

Although the lack of timeouts on history archive streaming operations is problematic, we perform history archive ingestion in rare circumstances (when the horizon db is empty or when we bump the version of the ingestion system), typically when starting the horizon ingestion service.

I agree we can ignore this case, also because ledger-by-ledger ingestion is stopped while state ingestion is performed.

When it comes to Horizon DB:

  • I think support/db: Add timeout on horizon db ingestion statements #2570 makes sense, but I'm afraid of three things:
    • In the worst-case scenario, every query finishes just before the timeout, which means the cumulative ingestion time will still be long.
    • It can be hard to find a universal timeout that would prevent locking issues across different organizations. For example, at SDF we use a pretty high-spec DB so we can set smaller timeouts, but somebody else may run an average machine where those timeouts won't work. The solution could be to set a larger timeout, but then we're vulnerable to the situation explained in the previous point.
    • If the DB is overloaded, query times can be longer. In that situation we'd rather continue ingestion at a slower rate than stop it completely due to timeouts.
  • I'm thinking the TCP keepalive idea is much better because it kicks in exactly when connection issues arise. I did some research on this solution and it's super easy to set on the DB side (tcp_keepalives_* in the Postgres config) but not so easy in Go, because the package we currently use (lib/pq) doesn't allow setting the value.

I was also thinking about two issues not listed here:

  • If the Horizon <-> DB connection and the DB itself behave correctly but the ingesting instance is slow due to CPU usage etc., the ideas here won't help. I wonder whether https://github.com/cirello-io/pglock, suggested by @2opremio, would help in such a case.
  • In the case of two ingesting instances where one has persistent connection issues, it would be great to have a sleep period (maybe with exponential backoff?) during which the problematic instance does not try to acquire the lock. This is tricky because we need to know whether there is more than one ingesting instance (we shouldn't sleep if there's just one) and whether it's really a problem with a single instance rather than, for example, the DB.

To sum it up, I think in the short term we should probably:

  • Set TCP keepalive on the DB (staging first, to test). In Horizon we can add a flag that, when enabled (disabled by default), checks whether TCP keepalive is set on the DB and, if not, exits with an error (a rough sketch of such a check follows this list). This will allow admins to make sure the DB config is correct when using distributed ingestion.
  • On the client side, it looks like support/db: Add timeout on horizon db ingestion statements #2570 is the best option on the table currently. We just need to make sure the timeout is reasonable.
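
As a rough sketch of the startup check mentioned in the first bullet (the function name and error wording are illustrative), Horizon could query pg_settings and refuse to run distributed ingestion if the server-side keepalive settings are left at 0, which means "use the operating system default":

```go
package ingestion

import (
	"context"
	"database/sql"
	"fmt"
)

// checkServerKeepalives verifies that the postgres server has TCP keepalives
// configured; a setting of 0 means the operating system default is used.
func checkServerKeepalives(ctx context.Context, db *sql.DB) error {
	rows, err := db.QueryContext(ctx,
		`SELECT name, setting FROM pg_settings
		 WHERE name IN ('tcp_keepalives_idle', 'tcp_keepalives_interval', 'tcp_keepalives_count')`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var name, setting string
		if err := rows.Scan(&name, &setting); err != nil {
			return err
		}
		if setting == "0" {
			return fmt.Errorf("postgres %s is 0 (OS default): configure TCP keepalives before enabling distributed ingestion", name)
		}
	}
	return rows.Err()
}
```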

@tamirms
Contributor Author

tamirms commented Mar 30, 2021

On the client side, it looks like #2570 is the best option on the table currently. We just need to make sure the timeout is reasonable.

@bartekn what about configuring tcp keep alive on the horizon side at the operating system level? The only downside I see is that it requires configuration outside of horizon. If that's a blocker, we could have a goroutine which periodically calls db.Ping() on the database connection and, if there are consecutive failures, cancels a context for the transaction.
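
Something like the following sketch is what I have in mind; the interval, timeout, and failure threshold are placeholders:

```go
package ingestion

import (
	"context"
	"database/sql"
	"time"
)

// pingWatchdog pings the DB on a ticker and cancels the ingestion context
// after several consecutive failures, so the transaction is rolled back and
// the ingestion lock is released for another node to take over.
func pingWatchdog(ctx context.Context, cancelIngestion context.CancelFunc, db *sql.DB) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	failures := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Bound each ping with its own deadline so the watchdog itself
			// cannot hang on a dead connection.
			pingCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
			err := db.PingContext(pingCtx)
			cancel()

			if err == nil {
				failures = 0
				continue
			}
			failures++
			if failures >= 3 {
				cancelIngestion()
				return
			}
		}
	}
}
```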

@2opremio
Contributor

Regarding the history archive timeouts, if the file size can be obtained in advance, we could dynamically adjust the timeout based on it (although maybe we cannot rely on that in general).

@bartekn
Contributor

bartekn commented Mar 30, 2021

@bartekn what about configuring tcp keep alive on the horizon side at the operating system level?

This is a good idea for an internal network but, as you said, it requires some config outside Horizon, which isn't ideal for anyone else. It will also affect all other services running on the host.

goroutine which periodically calls db.Ping() on the database connection

Good idea. I just wonder if, in case of connection issues, db.Ping would also hang indefinitely.

@tamirms
Contributor Author

tamirms commented Mar 30, 2021

Good idea. I just wonder if, in case of connection issues, db.Ping would also hang indefinitely.

We can add a timeout to the ping operation via a context deadline

@bartekn
Contributor

bartekn commented Mar 30, 2021

We can add a timeout to the ping operation via a context deadline

Nice! Maybe we should do that instead of #2570 then?

@tamirms
Contributor Author

tamirms commented Mar 30, 2021

Nice! Maybe we should do that instead of #2570 then?

I think that makes sense

@tamirms
Contributor Author

tamirms commented Mar 30, 2021

@bartekn I just thought of one possible issue with the Ping() approach. When you call Ping() there's no guarantee that the request will happen on the same connection that is used by the ingestion code. I think the pq library maintains a connection pool, so it's possible that the ping operation is performed on a new connection.

My concern is that postgres might reboot without closing its connections but come back online quickly enough to respond to the Ping() heartbeat. In that scenario ingestion is stalled but we aren't able to detect it.
