RFC: Configurable Build Event Stores #53
base: master
Signed-off-by: Aidan Oldershaw <[email protected]>
Not sure why it's PR concourse#53, when the previous one was concourse#48...perhaps because of discussions? Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
* By default, there will be no change - Concourse will default to using a Postgres `EventStore` using the same main Postgres database.
* Operators will be able to configure a different store
* If they configure a new store, they won't be able to access any existing
I think this feature has to provide a migration solution, otherwise production clusters would be hard to move on external storage.
Yeah I agree. I'd love to hear any thoughts you have on the best way to approach this. I can think of a couple options:
1. Introduce a `concourse migrate-build-events` command that migrates everything over to the configured build event store. It could possibly default to migrating from the Postgres database to whatever event store is configured, but it could also take a separate `from` configuration in case you want to migrate from a non-Postgres backend to a new backend.
2. Migrate out of Postgres "on the fly", i.e. we could have a component like the build reaper that migrates old builds to the new store, a few at a time. In this case, we'd probably still want some way to view build logs from Postgres in case they haven't yet been migrated, for which we could construct a `FallbackEventStore` that does something like this:
```go
type FallbackEventStore struct {
	EventStore
	Fallback EventStore
}

func (f *FallbackEventStore) Get(build db.Build, requested int, cursor *Key) ([]event.Envelope, error) {
	// first, try the main event store:
	events, err := f.EventStore.Get(build, requested, cursor)
	if err == nil && len(events) > 0 {
		return events, nil
	}
	// the build wasn't found (or at least has no events in the main store), so try the Fallback (Postgres).
	// if this also errors, we'd probably want to return the err from the primary EventStore, but you get the idea
	return f.Fallback.Get(build, requested, cursor)
}
```
I suspect an approach like 1) may not be feasible when there's a huge number of build events, since it will likely take a really long time.
Either way, it probably makes sense to make `EventStore.Put` take a list of events so we can do batch puts, rather than needing to `Put` them all one-by-one.
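A toy sketch of what a batched `Put` could look like - the `Build` and `Envelope` types here are simplified stand-ins for Concourse's `db.Build` and `event.Envelope`, and the in-memory store is purely illustrative:

```go
package main

import "fmt"

// Envelope and Build are simplified placeholders for event.Envelope and db.Build.
type Envelope struct{ Event string }
type Build struct{ ID int }

// EventStore with a batched Put, as suggested: one call persists many events.
type EventStore interface {
	Put(build Build, events []Envelope) error
}

// memStore is a toy in-memory implementation, just to show batching.
type memStore struct {
	events map[int][]Envelope
}

func (m *memStore) Put(build Build, events []Envelope) error {
	// append the whole batch in one operation rather than one row at a time
	m.events[build.ID] = append(m.events[build.ID], events...)
	return nil
}

func main() {
	s := &memStore{events: map[int][]Envelope{}}
	batch := []Envelope{{"start"}, {"log"}, {"finish"}}
	_ = s.Put(Build{ID: 1}, batch)
	fmt.Println(len(s.events[1]))
}
```

A real Postgres implementation could turn one batch into a single multi-row `INSERT`, which is where the performance win would come from.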
Maybe just don't migrate. Instead, add a flag to the `build` table to indicate where the build's logs live, PG or ES, so that old build logs are read from PG and new build logs from ES.
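A minimal sketch of this flag-based routing - all names here are invented, and in Concourse the flag would presumably be a new column on the `builds` table:

```go
package main

import "fmt"

// Envelope is a simplified stand-in for event.Envelope.
type Envelope struct{ Event string }

type EventStore interface {
	Get(buildID int) ([]Envelope, error)
}

// mapStore is a toy in-memory EventStore.
type mapStore map[int][]Envelope

func (m mapStore) Get(buildID int) ([]Envelope, error) { return m[buildID], nil }

// RoutingEventStore picks the backing store based on a per-build flag,
// which would live in a new column on the builds table.
type RoutingEventStore struct {
	storeFlag map[int]string // buildID -> "pg" or "es"
	pg, es    EventStore
}

func (r *RoutingEventStore) Get(buildID int) ([]Envelope, error) {
	if r.storeFlag[buildID] == "es" {
		return r.es.Get(buildID)
	}
	// builds without a flag predate the switch, so read from Postgres
	return r.pg.Get(buildID)
}

func main() {
	pg := mapStore{1: {{"old-log"}}}
	es := mapStore{2: {{"new-log"}}}
	r := &RoutingEventStore{storeFlag: map[int]string{2: "es"}, pg: pg, es: es}
	a, _ := r.Get(1)
	b, _ := r.Get(2)
	fmt.Println(a[0].Event, b[0].Event)
}
```

The upside is that no data ever moves; the downside is that the old Postgres tables can never be dropped while any flagged-PG builds remain.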
Just a suggestion for another way to do the migration: I wonder if there could be a way to migrate all the build events over to the new event store without stopping the database? This would mean operators wanting to move to a new build event store don't need to wait for the migration to finish before starting Concourse.
I imagine something like this: the operator continues to have a running Concourse deployment, with build events fetched from the Postgres database as usual. While the deployment is running, some tool can be used to migrate all the build events within the Postgres database (the old build event store) into the new build event store. Once it has finished, the operator can redeploy Concourse and point it at the new build event store. There will need to be some method for discarding the old data (maybe Concourse can do that at startup if you configure a new event store? or maybe wiping out build events can be part of the tool?)
I imagine operators that configure a new build event store might want to benefit from being able to query or search within the build events. If we migrate on the fly, only some build events would have been migrated over at any given time. Taking that into account, and the fact that operators who configure a new build event store most likely want all their builds in the new store, I might be inclined to think that migrating all the data together is preferable? I could totally be wrong about this though.
@clarafu that definitely makes sense - having a separate tool that just points to the running Postgres is very much in line with #53 (comment).
FWIW the idea behind the "on the fly" approach is that it'd eventually migrate all build events over (not just for new events) - it'd essentially be doing what the tool you're suggesting is doing, but it's started by the ATC. I think it's beneficial to keep it separate though, for the reasons mentioned by @jchesterpivotal.
Ohh I see! Sorry, I missed where you clarified it by saying it would be the build reaper that migrates old builds to the new store. And yeah, good call that allowing the operator to migrate before upgrading is definitely a good idea.
Awesome!
Rather than when it's Started, it should be when the build is first created. Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
iiinteresting! I didn't finish reading it all yet though, will finish later 😁
```go
Get(build db.Build, requested int, cursor *Key) ([]event.Envelope, error)

Delete(builds []db.Build) error
DeletePipeline(pipeline db.Pipeline) error
```
it got me wondering if we should really have these 🤔 from what I understand, the distinction between `Delete` and `Delete(Pipeline|Team)` is simply that the one implementing the interface could do so in a more performant manner. Buut.. the Pipeline and Team concepts seem to be too Concourse-specific, kinda "leaking" into the interface here.
tbh, I don't have a suggestion 😅 just seemed like overloading the interface a bit
Yeah I agree, but couldn't really think of a better way to do it without losing the ability to `DROP` tables...
We could try to be smart with the Postgres implementation of the `EventStore`: if we call `Delete(builds)` where `builds` exactly matches the list of pipeline builds, drop the table instead - though I feel like this could be finicky.
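For illustration, the finicky part might boil down to a set comparison - this is a hypothetical sketch, not actual Concourse code:

```go
package main

import "fmt"

// shouldDropTable reports whether the set of builds being deleted is exactly
// the set of builds in a pipeline's partition table, in which case the
// Postgres EventStore could issue a single DROP TABLE instead of row deletes.
// Build IDs stand in for db.Build values here.
func shouldDropTable(toDelete, pipelineBuilds []int) bool {
	if len(toDelete) != len(pipelineBuilds) {
		return false
	}
	set := make(map[int]bool, len(toDelete))
	for _, id := range toDelete {
		set[id] = true
	}
	for _, id := range pipelineBuilds {
		if !set[id] {
			// at least one pipeline build would survive, so dropping
			// the whole table would lose data
			return false
		}
	}
	return true // safe to DROP TABLE pipeline_build_events_x
}

func main() {
	fmt.Println(shouldDropTable([]int{1, 2, 3}, []int{3, 2, 1}))
	fmt.Println(shouldDropTable([]int{1, 2}, []int{1, 2, 3}))
}
```

The finickiness is in keeping that comparison race-free: a new build could be created between computing the pipeline's build list and dropping the table.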
How about migrating build events to cold-storage after they're completed:
that's a pretty useful use case, especially for those businesses that require keeping logs for.. years.
assuming this would be implemented by, say, AWS S3, one can even place lifecycle policies on their bucket (https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-transition-general-considerations.html) and then get some sweet price reductions automatically for those super-old build events
Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
1. Storing build events in a system designed for handling text data (like Elasticsearch, for instance) allows you to do some pretty cool querying that isn't feasible in Postgres.
1. Gaining observability into build events isn't great. The syslog drainer
Agreed, I've had to do quite a lot of work when I've wanted to extract and transform this data for analysis.
* Operators will be able to configure a different store
* If they configure a new store, they won't be able to access any existing build events that are currently stored in Postgres without migrating them over (how should we recommend doing this?)
I think it should be a standalone tool, not part of the ATC. Otherwise you're forcing folks to upgrade to get it, or to get bugfixes to such a migration tool.
* Users won't be able to access build events that exist in the main Postgres database if they switch to a new backend store. How can we make the
I think there are two use cases being mixed here.
The first is "tell me what just broke", the thing Concourse UX is optimised for. That data works best in PostgreSQL, close to the ATC.
The second is "I need a complete archive of everything ever". Concourse enables that but isn't really putting it front and centre. The overhead has been historically bad enough that the archival need was at cross-purposes with the diagnostic goal. This is where draining to secondary stores is useful.
If you have a secondary store in addition to the main store, then setting short build histories on the main store, but not the secondary store, is a relatively safe compromise for most installations.
Yeah, that assessment makes sense. I think the proposed `EventStore` interface should be able to support this (similar to the "Cold Storage" example I provided, but it could drain to an arbitrary secondary `EventStore`). Pair that with build log reaping to clear out old Postgres logs, and we can keep recent builds close to the ATC in Postgres while archiving everything in a secondary store (which could be S3, Elasticsearch, etc.)
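One hedged sketch of such draining: a tee store that mirrors every write into a secondary archive store. All names are invented, and real error handling and retries are omitted:

```go
package main

import "fmt"

// Envelope is a simplified stand-in for event.Envelope.
type Envelope struct{ Event string }

type EventStore interface {
	Put(buildID int, events []Envelope) error
}

// mapStore is a toy in-memory EventStore.
type mapStore map[int][]Envelope

func (m mapStore) Put(buildID int, events []Envelope) error {
	m[buildID] = append(m[buildID], events...)
	return nil
}

// TeeEventStore mirrors writes to both stores. Build log reaping could then
// delete only from Primary (Postgres), leaving Archive as the full history.
type TeeEventStore struct {
	Primary, Archive EventStore
}

func (t *TeeEventStore) Put(buildID int, events []Envelope) error {
	if err := t.Primary.Put(buildID, events); err != nil {
		return err
	}
	// in a real design, archive failures would likely be queued and retried
	// asynchronously rather than failing the build's event stream
	return t.Archive.Put(buildID, events)
}

func main() {
	pg, archive := mapStore{}, mapStore{}
	tee := &TeeEventStore{Primary: pg, Archive: archive}
	_ = tee.Put(7, []Envelope{{"start"}, {"finish"}})
	fmt.Println(len(pg[7]), len(archive[7]))
}
```

Reads would go through something like the `FallbackEventStore` above: recent builds served from Postgres, reaped builds falling through to the archive.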
Signed-off-by: Aidan Oldershaw <[email protected]>
This header allows users to query the /api/v1/builds/{id}/events endpoint starting from an offset. Removing this may be a controversial decision. I discussed this as an open question in concourse/rfcs#53. Essentially, supporting this header means that `EventStores` will have to define how to unmarshal `Keys` (but not marshal them??). This header seems to be used nowhere internally, and isn't exposed by Fly or go-concourse, so my guess is it's *probably* no longer used by anyone. Signed-off-by: Aidan Oldershaw <[email protected]>
This commit adds the Postgres implementation for the `EventStore` interface as described in concourse/rfcs#53. While this change shouldn't noticeably change the outward behaviour of Concourse, there are a few changes: * Don't get from the build_events table twice for no reason in Events() * Some DB interactions now run in a separate transaction to accommodate `EventStores` that don't interact with Postgres * The build_events table is now created at runtime (rather than at migration time), and the `pipeline_build_events_x` and `team_build_events_x` tables are created when the first build is initialized (rather than when the pipeline/team is created) Signed-off-by: Aidan Oldershaw <[email protected]>
+ key marshalling Signed-off-by: Aidan Oldershaw <[email protected]>
One thing I haven't really explored is how operators would configure the backend store. I think it could be similar to how credential managers are configured - there are groups of independent configuration options for each possible `EventStore`. If certain flags are configured for a given backend, use that backend.
I think that it would be desirable to design build event stores in such a way that allows them to be built and released independently from the rest of Concourse, similar to how Terraform eventually spun providers out of its main source tree so that they could be built and released independently, typically written and maintained by the commercial entities backing the providers.
For example, a user might wish to export build logs to Coralogix, which is a log management and analysis platform. As Coralogix is a commercial platform, I would fully expect that a Coralogix build event store plugin would be written and maintained by Coralogix and not by the Concourse team / Pivotal.
(full disclosure: I work for Coralogix and we would probably (but I'm intentionally being non-committal here) be interested in building out a plugin for our own platform as part of dogfooding our own product)
That's very interesting - I hadn't put much thought into an architecture like that, mostly because the precedent seems to be to package and release everything from within Concourse (at least in terms of credential managers and metrics emitters). I think it'd be cool for build event stores to be written as plugins external to the Concourse binary, and it'd be really cool to have experts in the domain be the ones to implement/maintain them.
I guess one concern is the added operational complexity, though I imagine it's not too bad. Like Terraform, we could package some plugins directly into Concourse (it could just be the Postgres plugin, so operators don't need to bring their own plugin for the default behaviour).
Actually, my frame of reference wasn't credential managers and metrics emitters (though arguably they should be handled the same way), rather it was Concourse resources. Just as everything in https://resource-types.concourse-ci.org/ is written, packaged, and maintained outside of the main Concourse source tree, shouldn't build event stores (and credential managers and metrics emitters) also be written and maintained outside of the main Concourse source tree?
That the `concourse/registry-image-resource` and `concourse/git-resource` resource types are not part of `concourse/concourse` isn't something that makes me think of additional operational complexity (although I guess that's technically true), but rather resilience, flexibility, etc.
I was thinking about this a bit more. For some use cases of the proposed `EventStore`, having the implementations be invoked directly (as opposed to over something like gRPC) is preferable. For instance, invoking the Postgres `EventStore` in memory allows us to perform queries against other database objects (e.g. also deleting all of a team's pipelines when deleting a team). For certain applications of `EventStore` composition (e.g. for secret redaction), handling the operations in memory is beneficial.
So, what if `EventStore`s are just an in-memory type, but there's a `PluginEventStore` implementation that functions as a gRPC client? Which plugin (gRPC server) to use could be configured via flags to Concourse, and any environment variables such as `CONCOURSE_EVENTSTORE_PLUGIN_*` could be passed along to the plugin on startup to configure e.g. authentication (similar to how we support custom configuration for Guardian).
What do you think of that approach?
I doubt the in-memory EventStore, because that implies every ATC will store build logs in duplicate, which might not fit huge clusters where hundreds of pipelines generate huge amounts of build logs. But I think a plugin is a good idea: at the least, a PG plugin would allow us to store build logs in a separate database, and if the plugin is well designed, it may even support dispatching build logs to multiple databases, which would definitely benefit the scalability of Concourse. Also, users will have the flexibility to store build logs in various backends - ES, and whatever else.
```go
// perhaps run this in the background
return c.migrateToColdStorage(build)
```
I think this comment belies a lot of underlying complexity, where this design should go towards a plugin model per-provider, like Terraform (see comment below about adding providers), which uses an RPC architecture. See for example: https://www.terraform.io/docs/extend/how-terraform-works.html
This kind of multiple-binary architecture is necessary for both a) proper queuing and background processing of events as they are sent to external providers, and b) allowing additional providers to be written and developed independently of the upstream project.
Yeah, the comment was a bit of a half-baked idea. From what I can tell, any sort of background processing we have now is just done through separate goroutines, everything residing in the same process. For instance, metric emitters run on an `emitLoop` that reads from a Go `chan` that receives messages: https://github.com/concourse/concourse/blob/36cbf5ecf628dba92b6b4c08cfa536820638f313/atc/metric/emit.go#L129-L133
All that to say, we could (and currently do) perform some sort of background processing without the use of plugins running in their own process and communicating via RPC. That's definitely not to say we should do things this way (though, admittedly, I don't fully grasp all of the implications of taking one approach over the other).
How about migrating build events to cold-storage after they're completed:
I wonder if the desire to design for cold storage doesn't mean that `EventStore.Get` should be some kind of optional operation. Usually the trade-off with cold storage is that it reduces storage costs while increasing access costs, and we wouldn't want to subject users to a surprisingly high bill just because they accessed some part of Concourse in a way which invoked `EventStore.Get` on cold storage.
Maybe a better design would be one where `EventStore.Get` can return a link to where the build events can be found? Then the user can more intentionally make the decision to access cold storage and incur those costs.
Hmm, good point. I'm hesitant to modify the `EventStore` interface to explicitly make `Get` optional, since that seems pretty specific to this use case (and I think it generally makes sense for there to be `Get` functionality).
I guess this is also modifying the API (albeit in a different way), but we could introduce a special build event type that includes the link to cold storage (e.g. `event.InColdStorage`), and a ColdStorage implementation would return that single build event when called. The event consumers (the UI and Fly) could know to interpret that event type specially, perhaps prompting the user to confirm they want to fetch from cold storage. This prompt could also be bypassed with a special `Key` (passed as the `Last-Event-ID` to the build events API endpoint) to indicate "yes, I know I'm fetching from cold storage, and I'm okay with that".
I am very excited by the idea of offloading build logs to "another place", since as you've identified, we are bearing quite a burden from keeping them in Postgres exclusively.
I'd like to probe a bit on what's envisaged here, because I am uneasy about bringing a lot of fresh complexity into Concourse itself as far as log forwarding. There are existing tools like logstash, fluentd, or NiFi, which can do all sorts of things in terms of fanout, enrichment/transformation, and high-performance streaming. Pointing the syslog emitter at one of those should be more friendly than having Concourse grow the same capabilities on its side of the socket. Perhaps the syslog drainer does need some changes to allow build logs to be more continuously streamed, or that kind of thing, but it does feel like the forwarding part of the picture is more established.
In the same way, I'm not sure that it should be Concourse's job to decide when remote logs should be reaped. Once we've passed the data on, it's the other system's job to implement its own retention policy, according to whatever criteria happen to make sense. Those might not be readily accessible to Concourse (e.g. remote disk usage, or frequency of access). We may also not want to force log deletion just because a pipeline has been deleted - maybe those logs are still useful for analytics.
So I think the remaining non-existing piece is retrieving logs from their remote home, for display through Concourse. I'm imagining a world where logs always go to the local Postgres for at least a few days, but some operators will configure aggressive local reaping combined with forwarding into another store. Then Concourse just needs to know, for a given build whose logs do not exist locally, how to retrieve the logs from the remote place. If that does not work then we could display a tombstone, as we do now. This sidesteps some of the notification concerns since "live" build logs would always be served from local storage.
Then the implementation is that the `builds` DB table grows an extra column for the remote storage URL (maybe with other data as needed for retrieval, such as auth tokens or whatever). When build logs are requested, and not found locally, we fetch them via that URL instead. This could even be set by the log forwarding engine in some fashion, since some backends might not have URLs that are predictable a priori, or they may change over time if logs are further migrated to different storage tiers. (For example, Elasticsearch could be configured to invoke a webhook whenever it receives a `start-build` event, that reaches back into Concourse to set the remote storage URL for that build. Similar setups should be possible for other forwarding and storage choices.) In that case there is no need for Concourse-side configuration of how to construct that URL. This part is a little fuzzy to me since storage backends may have a variety of APIs and access control schemes, though that concern holds for any version of having remote storage.
There is a bunch of stuff to configure here, from an operator point of view, so we'd need to think through and document a few end-to-end scenarios. For example, can we use fluentd + `in_syslog` + `out_s3` in a sensible way, and have the remote-URL details set correctly? How should Concourse be expected to authenticate itself to a remote log store, given that not all users should see all logs, and remote systems have their own idea of how access control works? (That's still the case for the current proposal as well, since a given `EventStore` would have to implement that logic in some fashion.)
Thanks for your feedback @agurney! You bring up some really good points. Going to preface my response by saying I'm not an expert on this stuff, so please bear with me if I say anything that doesn't make sense 😄- if you disagree with anything, please let me know!
I looked briefly into logstash during my investigation into Elasticsearch, but I've never used any log forwarding tools myself so don't fully grasp the value-add for Concourse build events (besides performance, as I imagine they're quite well optimized). To me, it seems like one benefit is turning raw unstructured logs into structured data (I suppose that's the enrichment/transformation you bring up), but Concourse build events already are structured to a certain degree. Another benefit is the decoupling of inputs from the outputs - as you mention, Concourse could just emit to syslog, and logstash/fluentd/NiFi could forward the logs to the correct destination. However, given the fact that Concourse will need to retrieve logs from these destinations as well, there's an implicit coupling that still needs to exist somewhere:
I'm not quite sure what you mean by "storage URL" - do you mean something like a If you mean the former, I'm not convinced this will work in general. For instance, if we emit build events to Elasticsearch, we could store the If you mean the latter, wouldn't it be easier to just have an implementation of an Let me know if I completely misunderstood what you meant here 😄
This sounds a lot like what @jchesterpivotal was saying here: #53 (comment) (keep recent builds as close as possible to the ATC, and optionally configure an external event store for archiving builds). I think that's a good way to look at the problem. I did reply to his comment mentioning that I think this could be modelled with the proposed
That's a really good point, I hadn't considered that! Do you think there's value in Concourse letting the external system know "by the way, this pipeline no longer exists, so you may want to delete the build events", and the system can choose to delete the events or not? EDIT: I think if you want to keep the build events around, it probably makes sense to archive the pipeline rather than delete it (in which case we wouldn't try to delete the build events in the first place). It probably doesn't always make sense to enable the "build log reaper" - but that's opt-in, anyway.
I'm not 100% sure what types of access control remote systems have, so maybe you could speak more to that. I kind of envisioned it working like it does currently with Postgres: the ATC is provided credentials that grant access to all build events, and Concourse handles user access control by team at the API layer. So,
This header allows users to query the /api/v1/builds/{id}/events endpoint starting from an offset. Removing this may be a controversial decision. I discussed this as an open question in concourse/rfcs#53. Essentially, supporting this header means that `EventStores` will have to define how to unmarshal `Keys` (but not marshal them??). This header seems to be used nowhere internally, and isn't exposed by Fly or go-concourse, so my guess is it's *probably* no longer used by anyone. Signed-off-by: Aidan Oldershaw <[email protected]>
This commit adds the Postgres implementation for the `EventStore` interface as described in concourse/rfcs#53. While this change shouldn't noticably change the outward behaviour of Concourse noticably, there are a few changes: * Don't get from the build_events table twice for no reason in Events() * Some DB interactions now run in a separate transaction to accomodate `EventStores` that don't interact with Postgres * The build_events table is now created at runtime (rather than at migration time), and the `pipeline_build_events_x` and `team_build_events_x` tables are created when the first build is initialized (rather than when the pipeline/team is created) Signed-off-by: Aidan Oldershaw <[email protected]>
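To make the discussion below more concrete, here is a rough sketch of what an `EventStore` contract with cursor-based reads could look like, exercised by a toy in-memory implementation. The method names, signatures, and the `Key`/`Event` types are illustrative assumptions for this sketch, not Concourse's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// Key is an opaque cursor into a build's event stream (see the Key
// discussion below). Its concrete shape is one of the open questions.
type Key []int64

// Event is a stand-in for a build event envelope.
type Event struct {
	Payload string
}

// EventStore is a hypothetical sketch of the interface from this RFC.
type EventStore interface {
	// Put appends events for a build, returning the key of the last event.
	Put(buildID int, events []Event) (Key, error)
	// Get reads up to `requested` events after the given cursor, advancing
	// the cursor past the last event returned.
	Get(buildID int, requested int, cursor *Key) ([]Event, error)
}

// inMemoryStore is a toy implementation used to illustrate the contract.
type inMemoryStore struct {
	mu     sync.Mutex
	events map[int][]Event
}

func newInMemoryStore() *inMemoryStore {
	return &inMemoryStore{events: map[int][]Event{}}
}

func (s *inMemoryStore) Put(buildID int, events []Event) (Key, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events[buildID] = append(s.events[buildID], events...)
	return Key{int64(len(s.events[buildID]) - 1)}, nil
}

func (s *inMemoryStore) Get(buildID int, requested int, cursor *Key) ([]Event, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	all := s.events[buildID]
	start := 0
	if *cursor != nil {
		start = int((*cursor)[0]) + 1
	}
	end := start + requested
	if end > len(all) {
		end = len(all)
	}
	if end > start {
		*cursor = Key{int64(end - 1)}
	}
	return all[start:end], nil
}

func main() {
	var store EventStore = newInMemoryStore()
	store.Put(1, []Event{{"start"}, {"log"}, {"finish"}})

	var cursor Key
	batch, _ := store.Get(1, 2, &cursor)
	fmt.Println(len(batch), batch[0].Payload) // 2 start

	batch, _ = store.Get(1, 2, &cursor)
	fmt.Println(len(batch), batch[0].Payload) // 1 finish
}
```

The cursor-in/cursor-out shape is what makes the `Key` type load-bearing: whatever the backend, the key must round-trip through the API while preserving "resume after this event" semantics.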
# Open Questions

* Does `Key` need to be `interface{...}`, or can it be more specific like `uint`?
In my experimental PR that introduced an Elasticsearch implementation of an `EventStore`, I found that I needed a composite key of `(timestamp, tiebreaker)` (where both `timestamp` and `tiebreaker` are `int64`s) to uniquely identify events, so a single `uint` isn't good enough (without relying on Postgres to generate auto-incrementing IDs).

I was experimenting with plugin-based `EventStore` implementations at the suggestion of @ari-becker, and realized that if the plugin runs in a separate process (i.e. we use something other than Go plugins, which seem very limiting), we'll need some way of sending the `Key`s over the wire while retaining their ordering information. This isn't so easy with an arbitrary `Key` type, so I got back to thinking about restricting the `Key` type to something more concrete that has an implicit ordering.

From the Postgres and Elasticsearch implementations, I got the idea of using a `[]int64`. Using `timestamp` and `tiebreaker` would work for many backends (the `timestamp` comes from the event, and `tiebreaker` is an in-memory counter - more details in the PR) - but for backends that support a more native keying approach (like Postgres' auto-incrementing `event_id`, or line numbers in file-based storage like S3 or the local FS), we could make use of that instead. I wonder if there are backends that won't fit into the `[]int64` mold - if anyone has a system in mind, I'd love to hear about it. I also wonder if the `Key` could always be a `[2]int64`, or if there's value in having an arbitrary-length slice.

Another approach is to avoid the `Key`/cursor approach altogether, which I'd definitely be open to ideas on.
There's basically no way to establish sane global ordering without the whole nightmare of distributed consensus. I heartily discourage trying. Leave the business of sorting out ordering to consumers, who can choose what interpretations to apply.
But if what you need is uniqueness, that's much more straightforward. You have broadly two options:
- UUIDs, specifically v4. Basically a 122-bit random number, so they can be generated without coordination. They are a widely supported data type: PostgreSQL has a native `uuid` type, the uuid-ossp module can generate them in the database very fast, and they're supported in pretty much everything everywhere.
- Natural keys. That is, you determine what the distinguishing fields are and then you digest them. This is isomorphic to the database normalisation exercise of identifying the natural key for a table: what is the field or combination of fields which absolutely defines uniqueness? In this case it might be something as simple as `(something_identifying_the_database, event_id)`. As long as the database is unique and the event_id is unique within the database, it "should" be unique overall.
I prefer option 1 to option 2. It's easier to implement and more robust to oversights. Natural keys are easy to describe but once you pick one, it's hard to fix if you discover you made a mistake and have collisions. Plus people are liable to start trying to rely on the natural key components, meaning you wouldn't be able to change it.
tl;dr UUIDs errywhere
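For illustration, a v4 UUID can be generated without coordination from 16 random bytes by fixing the version and variant bits per RFC 4122. This is a sketch; a real implementation would more likely use an existing library such as github.com/google/uuid or uuid-ossp in the database:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 builds a random (version 4) UUID: 122 random bits plus
// fixed version and variant bits, formatted as 8-4-4-4-12 hex groups.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version nibble to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant bits (10xx)
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // random, e.g. xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
}
```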
I think that's good advice - in my experimental PR for Elasticsearch, I made the naïve assumption that I could trust the timestamp on build events for ordering, which obviously isn't the case. Builds are tracked by a single ATC at a time, so at least time-based inconsistencies aren't as prevalent, but in general, it's a hacky "solution" (that doesn't really solve the problem). So, I agree that we shouldn't try to achieve distributed consensus.
Leave the business of sorting out ordering to consumers, who can choose what interpretations to apply
This makes sense, but does this mean that Concourse will have to fetch all events for a build (and sort them in memory) before it can serve any events? Or am I misunderstanding what you mean?
Another option that I suggested as a possibility is to use natural keys of `(build_id, event_id)`. For backends that don't support auto-incrementing IDs (e.g. Elasticsearch), we could rely on Postgres sequences to generate these `event_id`s. This way, we can take advantage of the indexing options of tools like Elasticsearch to perform sorting/pagination outside of Concourse.
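The natural-key option above composes into a document ID directly. A minimal sketch (the ID format is a made-up example, and it assumes `event_id` is drawn from a Postgres sequence when the backend can't generate its own):

```go
package main

import "fmt"

// eventDocID builds a natural key from (build_id, event_id) for use as a
// unique document ID in a store like Elasticsearch. Uniqueness holds as
// long as event IDs are unique within a build (e.g. from a sequence).
func eventDocID(buildID, eventID int64) string {
	return fmt.Sprintf("build-%d-event-%d", buildID, eventID)
}

func main() {
	fmt.Println(eventDocID(42, 7)) // build-42-event-7
}
```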
This makes sense, but does this mean that Concourse will have to fetch all events for a build (and sort them in memory) before it can serve any events? Or am I misunderstanding what you mean?
Sorting in the database is fine, using the internal sequential IDs (i.e. the primary key). Those are typically monotonic, so they establish a local order. But they're only an ordinal scale (this-before-that), not an interval scale (this-10-seconds-before-that) or a ratio scale (this-10-percent-longer-than-that).
The motivation for supporting EventStore implementations defined in external binaries was suggested in concourse/rfcs#53. Removing the EventStore implementations from the core Concourse codebase allows for a few things:

1. EventStores can be versioned independent of Concourse
2. The burden of maintaining many EventStores doesn't fall on the Concourse team
3. The barrier to adding support for external tools is much lower, and we can hopefully see the open source community create EventStores for several platforms (like we've seen with Resource Types)
4. Using gRPC, EventStore authors aren't tied to Go (with a bit of extra effort to make it compatible with hashicorp/go-plugin)

If this is an approach we choose to take, it might make sense to introduce an SDK for building plugins like hashicorp/terraform-plugin-sdk. There are two main differences between in-process EventStores (like the Postgres implementation) and external plugin EventStores:

1. Plugin EventStores don't have access to the full DB models (`db.Build`, etc), and can't perform further queries on them. Instead, they have a serialized form that contains many of the fields
2. Plugin EventStores can't use arbitrary `Key`s - instead, they are restricted to `[]int64`. I posed a question on the RFC if using `[]int64` for all EventStores is feasible.

Signed-off-by: Aidan Oldershaw <[email protected]>
Rendered
Experimental PR: concourse/concourse#5651
Signed-off-by: Aidan Oldershaw [email protected]