Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Configurable Build Event Stores #53

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

aoldershaw
Copy link

@aoldershaw aoldershaw commented May 10, 2020

Rendered

Experimental PR: concourse/concourse#5651

Signed-off-by: Aidan Oldershaw [email protected]

@aoldershaw aoldershaw changed the title RFC #49 Configurable Build Event Stores RFC: Configurable Build Event Stores May 10, 2020
Not sure why it's PR concourse#53, when the previous one was concourse#48...perhaps
because of discussions?

Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
* By default, there will be no change - Concourse will default to using a
Postgres `EventStore` using the same main Postgres database.
* Operators will be able to configure a different store
* If they configure a new store, they won't be able to access any existing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this feature has to provide a migration solution, otherwise production clusters would be hard to move on external storage.

Copy link
Author

@aoldershaw aoldershaw May 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree. I'd love to hear any thoughts you have on the best way to approach this. I can think of a couple options:

  1. Introduce a concourse migrate-build-events command that migrates everything over to the configured build event store. It could possibly default to migrating from the Postgres database to whatever event store is configured, but it could also take a separate from configuration in case you want to migrate from a non-Postgres backend to a new backend.

  2. Migrate out of Postgres "on the fly". i.e. we could have a component like the build reaper that migrates old builds to the new store, a few at a time. In this case, we'd probably still want some way to view build logs from Postgres in case they haven't yet been migrated, for which we could construct a FallbackEventStore that does something like this:

type FallbackEventStore struct {
    EventStore
    Fallback EventStore
}

func (f *FallbackEventStore) Get(build db.Build, requested int, cursor *Key) ([]event.Envelope, error) {
    // first, try the main event store:
    events, err := f.EventStore.Get(build, requested, cursor)
    if err == nil && len(events) > 0 {
        return events, nil
    }
    // the build wasn't found (or at least has no events in the main store) - so try the Fallback (Postgres)
    // if this also errors, we'll probably want to return the err from the primary EventStore, but you get the idea
    return f.Fallback.Get(build, requested, cursor)
}

I suspect an approach like 1) may not be feasible when there's a huge number of build events, since it will likely take a really long time.

Either way, it probably makes sense to make EventStore.Put take a list of events so we can do batch puts, rather than needing to Put them all one-by-one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just not migrate. Instead, add a flag to table build to indicate where is build logs of the build, PG or ES. So that reads old build logs from PG and new build logs from ES.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion for another way to do the migration, I wonder if there could be a way to migrate over all the build events to their new event store without stopping the database? This will make it so that operators wanting to move over to a new build event store does not need to wait for the migrations to finish before starting concourse.

I imagine something like the operator would continue to have a running concourse deployment and the build events would currently be fetched from the postgres database. While the deployment is running, some tool can be used to migrate all the build events within the postgres database (old build event store) into the new build event store. Then once it has finished, the operator can redeploy concourse and point it at the new build event store. There will need to be some method for discarding the old data (maybe concourse can do that at startup if you configure a new event store? or maybe it can be part of the tool to wipe out build events?)

I wonder for operators that want to configure a new build event store, they might want to benefit from being able to query or search for things within the build events. If we were to do the method of migrating on the fly, that would mean only certain build events would be migrated over. So taking that into account and the fact that operators that configure a new build event store most likely would like all their builds to be in this new store, I might be inclined to thinking that migrating all the data together would be preferable? I could totally be wrong about this though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clarafu that definitely makes sense - having a separate tool that just points to the running Postgres is very much in line with #53 (comment).

FWIW the idea behind the "on the fly" approach is that it'd eventually migrate all build events over (not just for new events) - it'd essentially be doing what the tool you're suggesting is doing, but it's started by the ATC. I think it's beneficial to keep it separate though, for the reasons mentioned by @jchesterpivotal.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh I see! Sorry I missed where you clarified it by saying it will build reaper that migrates old builds to the new store. And yeah that is a good call that allowing the operator to migrate before upgrading is definitely a good idea.

@evanchaoli
Copy link
Contributor

Awesome!

Rather than when it's Started, it should be when the build is first
created.

Signed-off-by: Aidan Oldershaw <[email protected]>
Copy link
Member

@cirocosta cirocosta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiinteresting! I didn't finish reading it all yet though, will finish later 😁

053-configurable-event-store/proposal.md Outdated Show resolved Hide resolved
053-configurable-event-store/proposal.md Outdated Show resolved Hide resolved
053-configurable-event-store/proposal.md Outdated Show resolved Hide resolved
053-configurable-event-store/proposal.md Show resolved Hide resolved
Get(build db.Build, requested int, cursor *Key) ([]event.Envelope, error)

Delete(builds []db.Build) error
DeletePipeline(pipeline db.Pipeline) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it got me wondering if we should really have these 🤔 from what I understand, the distinction between Delete and Delete(Pipeline|Team) is simply that the the one implementing the interface could do so in a more performant manner. Buut .. the Pipeline and Team concepts seem to be too Concourse specific, kinda "leaking" into the interface here.

tbh, I don't have a suggestion 😅 just seemed like overloading the interface a bit

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree, but couldn't really think of a better way to do it without losing the ability to DROP tables...

We could try to be smart with the Postgres implementation of the EventStore, where if we call Delete(builds) where builds exactly matches the list of pipeline builds, drop the table instead - though I feel like this could be finicky.

}
```

How about migrating build events to cold-storage after they're completed:
Copy link
Member

@cirocosta cirocosta May 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a pretty useful use case, specially for those businesses that require keeping logs for .. years.

assuming this would be implemented by, say, aws' s3, one can even place lifecycle policies to their bucket (https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-transition-general-considerations.html) and then get some sweet price reduction automatically for those super old build events

Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
053-configurable-event-store/proposal.md Show resolved Hide resolved
053-configurable-event-store/proposal.md Show resolved Hide resolved
1. Storing build events in a system designed for handling text data (like
Elasticsearch, for instance) allows you to do some pretty cool querying that
isn't feasible in Postgres.
1. Gaining observability into build events isn't great. The syslog drainer

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I've had to do quite a lot of work when I've wanted to extract and transform this data for analysis.

* Operators will be able to configure a different store
* If they configure a new store, they won't be able to access any existing
build events that are currently stored in Postgres without migrating them
over (how should we recommend doing this?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be a standalone tool, not part of the ATC. Otherwise you're forcing folks to upgrade to get it, or to get bugfixes to such a migration tool.

053-configurable-event-store/proposal.md Show resolved Hide resolved
053-configurable-event-store/proposal.md Show resolved Hide resolved
Comment on lines +439 to +440
* Users won't be able to access build events that exist in the main Postgres
database if they switch to a new backend store. How can we make the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two use cases being mixed here.

The first is "tell me what just broke", the thing Concourse UX is optimised for. That data works best in PostgreSQL, close to the ATC.

The second is "I need a complete archive of everything ever". Concourse enables that but isn't really putting it front and centre. The overhead has been historically bad enough that the archival need was at cross-purposes with the diagnostic goal. This is where draining to secondary stores is useful.

If you have a secondary store in addition to the main store, then setting short build histories on the main store, but not the secondary store, is a relatively safe compromise for most installations.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that assessment makes sense. I think the proposed EventStore interface should be able to support this (similar to the "Cold Storage" example I provided, but could drain to an arbitrary secondary EventStore). Pair that with build log reaping to clear out old Postgres logs, and we can keep recent builds close to the ATC in Postgres while archiving everything in a secondary store (which could be S3, Elasticsearch, etc)

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 13, 2020
This header allows users to query the /api/v1/builds/{id}/events
endpoint starting from an offset.

Removing this may be a controversial decision. I discussed this as an
open question in concourse/rfcs#53. Essentially, supporting this header
means that `EventStores` will have to define how to unmarshal `Keys`
(but not marshal them??).

This header seems to be used nowhere internally, and isn't exposed by
Fly or go-concourse, so my guess is it's *probably* no longer used by anyone.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 13, 2020
This commit adds the Postgres implementation for the `EventStore`
interface as described in concourse/rfcs#53.

While this change shouldn't noticably change the outward behaviour of
Concourse noticably, there are a few changes:

* Don't get from the build_events table twice for no reason in Events()
* Some DB interactions now run in a separate transaction to accomodate
  `EventStores` that don't interact with Postgres
* The build_events table is now created at runtime (rather than at
  migration time), and the `pipeline_build_events_x` and
  `team_build_events_x` tables are created when the first build is
  initialized (rather than when the pipeline/team is created)

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 13, 2020
This commit adds the Postgres implementation for the `EventStore`
interface as described in concourse/rfcs#53.

While this change shouldn't noticably change the outward behaviour of
Concourse noticably, there are a few changes:

* Don't get from the build_events table twice for no reason in Events()
* Some DB interactions now run in a separate transaction to accomodate
  `EventStores` that don't interact with Postgres
* The build_events table is now created at runtime (rather than at
  migration time), and the `pipeline_build_events_x` and
  `team_build_events_x` tables are created when the first build is
  initialized (rather than when the pipeline/team is created)

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 14, 2020
This header allows users to query the /api/v1/builds/{id}/events
endpoint starting from an offset.

Removing this may be a controversial decision. I discussed this as an
open question in concourse/rfcs#53. Essentially, supporting this header
means that `EventStores` will have to define how to unmarshal `Keys`
(but not marshal them??).

This header seems to be used nowhere internally, and isn't exposed by
Fly or go-concourse, so my guess is it's *probably* no longer used by anyone.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 14, 2020
This commit adds the Postgres implementation for the `EventStore`
interface as described in concourse/rfcs#53.

While this change shouldn't noticably change the outward behaviour of
Concourse noticably, there are a few changes:

* Don't get from the build_events table twice for no reason in Events()
* Some DB interactions now run in a separate transaction to accomodate
  `EventStores` that don't interact with Postgres
* The build_events table is now created at runtime (rather than at
  migration time), and the `pipeline_build_events_x` and
  `team_build_events_x` tables are created when the first build is
  initialized (rather than when the pipeline/team is created)

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 23, 2020
This header allows users to query the /api/v1/builds/{id}/events
endpoint starting from an offset.

Removing this may be a controversial decision. I discussed this as an
open question in concourse/rfcs#53. Essentially, supporting this header
means that `EventStores` will have to define how to unmarshal `Keys`
(but not marshal them??).

This header seems to be used nowhere internally, and isn't exposed by
Fly or go-concourse, so my guess is it's *probably* no longer used by anyone.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 23, 2020
This commit adds the Postgres implementation for the `EventStore`
interface as described in concourse/rfcs#53.

While this change shouldn't noticably change the outward behaviour of
Concourse noticably, there are a few changes:

* Don't get from the build_events table twice for no reason in Events()
* Some DB interactions now run in a separate transaction to accomodate
  `EventStores` that don't interact with Postgres
* The build_events table is now created at runtime (rather than at
  migration time), and the `pipeline_build_events_x` and
  `team_build_events_x` tables are created when the first build is
  initialized (rather than when the pipeline/team is created)

Signed-off-by: Aidan Oldershaw <[email protected]>
+ key marshalling

Signed-off-by: Aidan Oldershaw <[email protected]>
@vito vito mentioned this pull request May 25, 2020
Comment on lines +415 to +419
One thing I haven't really explored is how operators would configure the
backend store. I think it could be similar to how Credential managers are
configured - there are groups of independent configuration options for each
possible `EventStore`. If certain flags are configured for a given backend, use
that backend.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it would be desirable to design build event stores in such a way that allows them to be built and released independently from the rest of Concourse, similar to how Terraform eventually spun providers out of its main source tree so that they could be built and released independently, typically written and maintained by the commercial entities backing the providers.

For example, a user might wish to export build logs to Coralogix, which is a log management and analysis platform. As Coralogix is a commercial platform, I would fully expect that a Coralogix build event store plugin would be written and maintained by Coralogix and not by the Concourse team / Pivotal.

(full disclosure: I work for Coralogix and we would probably (but I'm intentionally being non-committal here) be interested in building out a plugin for our own platform as part of dogfooding our own product)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very interesting - I hadn't put much thought into an architecture like that, mostly because the precedent seems to be package and release everything from within Concourse (at least in terms of credential managers and metrics emitters). I think it'd be cool to for build event stores to be written as plugins external to the Concourse binary, and it'd be really cool to have experts in the domain be the ones to implement/maintain it.

I guess one concern is the added operational complexity, though I imagine it's not too bad. Like Terraform, we could package some plugins directly into Concourse (it could just be the Postgres plugin so operators don't need to bring their own plugin for the default behaviour).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, my frame of reference wasn't credential managers and metrics emitters (though arguably they should be handled the same way), rather it was Concourse resources. Just as everything in https://resource-types.concourse-ci.org/ is written, packaged, and maintained outside of the main Concourse source tree, shouldn't build event stores (and credential managers and metrics emitters) also be written and maintained outside of the main Concourse source tree?

That concourse/registry-image-resource and concourse/git-resource resource types are not part of concourse/concourse isn't something that makes me think of additional operational complexity (although I guess that's technically true), but rather resilience, flexibility, etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this a bit more. For some use cases of the proposed EventStore, having their implementations be invoked directly (as opposed to over something like gRPC) is preferable. For instance, invoking the Postgres EventStore in memory allows us to perform queries database objects (e.g. also deleting all of a team's pipelines when deleting a team). For certain applications of EventStore composition (e.g. for secret redaction), handling the operations in memory is beneficial.

So, what if EventStores are just an in-memory type, but there's a PluginEventStore implementation that functions as a gRPC client? Which plugin (gRPC server) to use could be configured via flags to Concourse, and any environment variables such as CONCOURSE_EVENTSTORE_PLUGIN_* could be passed along to the plugin on startup to configure e.g. authentication (similar to how we support custom configuration for Guardian).

What do you think of that approach?

Copy link
Contributor

@evanchaoli evanchaoli Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt in-memory EventStore, because that implies every ATC will duplicately store build logs, which might not fit to huge clusters where hundreds pipelines generate huge amount of build logs. But I think plugin is a good idea, at least, a PG plugin would allow us to store build logs to a separate database, if the plugin is well designed, it may even support to dispatch build logs to multiple databases, which would definitely benefit scalability of Concourse. Also, users will have the flexibility to store build logs in various backends, ES, and whatever.

Comment on lines +252 to +253
// perhaps run this in the background
return c.migrateToColdStorage(build)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment belies a lot of underlying complexity, where this design should go towards a plugin model per-provider, like Terraform (see comment below about adding providers), which uses an RPC architecture. See for example: https://www.terraform.io/docs/extend/how-terraform-works.html

This kind of multiple-binary architecture is necessary for both a) proper queuing and background processing of events as they are sent to external providers, and b) allowing additional providers to be written and developed independently of the upstream project.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the comment was a bit of a half-baked idea. From what I can tell, any sort of background processing we have now is just done through separate goroutines, everything residing in the same process. For instance, metric emitters run on an emitLoop that reads from a Go chan that receives messages: https://github.com/concourse/concourse/blob/36cbf5ecf628dba92b6b4c08cfa536820638f313/atc/metric/emit.go#L129-L133

All that to say, we could (and currently do) perform some sort of background processing together without the use of plugins running in their own process and communicating via RPC. That's definitely not to say we should do things this way (though, admittedly, I don't fully grasp all of the implications of taking one approach over the other).

}
```

How about migrating build events to cold-storage after they're completed:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the desire to design for cold storage doesn't mean that EventStore.Get should be some kind of optional operation. Usually the trade-off with cold storage is that it reduces storage costs while increasing access costs, and we wouldn't want to subject users to a surprisingly high bill just because they accessed some part of Concourse in a way which invoked the EventStore.Get on cold-storage.

Maybe a better design would be one where EventStore.Get can return a link to where the build events can be found? Then the user can more intentionally make the decision to access cold storage and invoke those costs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good point. I'm hesitant to modify the EventStore interface to explicitly make Get optional, since that seems pretty specific to this use case (and I think it generally makes sense for there to be Get functionality).

I guess this is also modifying the API (albeit in a different way), but we could introduce a special build event type that includes the link to cold-storage (e.g. event.InColdStorage), and a ColdStorage implementation would return that single build event when called. The event consumers (the UI and Fly) could know to interpret that event type specially, perhaps prompting the user to confirm they want to fetch from cold storage. This prompt could also be bypassed with a special Key (passed as the Last-Event-ID to the build events API endpoint) to indicate "yes, I know I'm fetching from cold storage, and I'm okay with that".

Copy link

@agurney agurney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very excited by the idea of offloading build logs to "another place", since as you've identified, we are bearing quite a burden from keeping them in Postgres exclusively.

I'd like to probe a bit on what's envisaged here, because I am uneasy about bringing a lot of fresh complexity into Concourse itself as far as log forwarding. There are existing tools like logstash, fluentd, or NiFi, which can do all sorts of things in terms of fanout, enrichment/transformation, and high-performance streaming. Pointing the syslog emitter at one of those should be more friendly than having Concourse grow the same capabilities on its side of the socket. Perhaps the syslog drainer does need some changes to allow build logs to be more continuously streamed, or that kind of thing, but it does feel like the forwarding part of the picture is more established.

In the same way, I'm not sure that it should be Concourse's job to decide when remote logs should be reaped. Once we've passed the data on, it's the other system's job to implement its own retention policy, according to whatever criteria happen to make sense. Those might not be readily accessible to Concourse (e.g. remote disk usage, or frequency of access). We may also not want to force log deletion just because a pipeline has been deleted - maybe those logs are still useful for analytics.

So I think the remaining non-existing piece is retrieving logs from their remote home, for display through Concourse. I'm imagining a world where logs always go to the local Postgres for at least a few days, but some operators will configure aggressive local reaping combined with forwarding into another store. Then Concourse just needs to know, for a given build whose logs do not exist locally, how to retrieve the logs from the remote place. If that does not work then we could display a tombstone, as we do now. This sidesteps some of the notification concerns since "live" build logs would always be served from local storage.

Then the implementation is that the builds DB table grows an extra column for the remote storage URL (maybe with other data as needed for retrieval, such as auth tokens or whatever). When build logs are requested, and not found locally, we fetch them via that URL instead. This could even be set by the log forwarding engine in some fashion, since some backends might not have URLs that are predictable a priori, or they may change over time if logs are further migrated to different storage tiers. (For example, Elasticsearch could be configured to invoke a webhook whenever it receives a start-build event, that reaches back into Concourse to set the remote storage URL for that build. Similar setups should be possible for other forwarding and storage choices.) In that case there is no need for Concourse-side configuration of how to construct that URL. This part is a little fuzzy to me since storage backends may have a variety of APIs and access control schemes, though that concern holds for any version of having remote storage.

There is a bunch of stuff to configure here, from an operator point of view, so we'd need to think through and document a few end-to-end scenarios. For example, can we use fluentd + in_syslog + out_s3 in a sensible way, and have the remote-URL details set correctly? How should Concourse be expected to authenticate itself to a remote log store, given that not all users should see all logs, and remote systems have their own idea of how access control works? (That's still the case for the current proposal as well, since a given EventStore would have to implement that logic in some fashion.)

@aoldershaw
Copy link
Author

aoldershaw commented May 28, 2020

Thanks for your feedback @agurney! You bring up some really good points. Going to preface my response by saying I'm not an expert on this stuff, so please bear with me if I say anything that doesn't make sense 😄- if you disagree with anything, please let me know!

I'd like to probe a bit on what's envisaged here, because I am uneasy about bringing a lot of fresh complexity into Concourse itself as far as log forwarding. There are existing tools like logstash, fluentd, or NiFi, which can do all sorts of things in terms of fanout, enrichment/transformation, and high-performance streaming

I looked briefly into logstash during my investigation into Elasticsearch, but I've never used any log forwarding tools myself so don't fully grasp the value-add for Concourse build events (besides performance, as I imagine they're quite well optimized). To me, it seems like one benefit is turning raw unstructured logs into structured data (I suppose that's the enrichment/transformation you bring up), but Concourse build events already are structured to a certain degree. Another benefit is the decoupling of inputs from the outputs - as you mention, Concourse could just emit to syslog, and logstash/fluentd/NiFi could forward the logs to the correct destination. However, given the fact that Concourse will need to retrieve logs from these destinations as well, there's an implicit coupling that still needs to exist somewhere:

Then the implementation is that the builds DB table grows an extra column for the remote storage URL (maybe with other data as needed for retrieval, such as auth tokens or whatever). When build logs are requested, and not found locally, we fetch them via that URL instead.

I'm not quite sure what you mean by "storage URL" - do you mean something like a https://... API endpoint that Concourse can just call out to without needing to know what's on the other side, or something with a provider-specific scheme that Concourse has to interpret: e.g. elasticsearch://{elasticsearch-url}/builds/{build_id}.

If you mean the former, I'm not convinced this will work in general. For instance, if we emit build events to Elasticsearch, we could store the /{index}/_search API endpoint with an appropriate query, but how would we go about e.g. pagination, or even parsing the results into a common form?

If you mean the latter, wouldn't it be easier to just have an implementation of an EventStore that has a Get method (that can live outside of the core-Concourse codebase)?

Let me know if I completely misunderstood what you meant here 😄

So I think the remaining non-existing piece is retrieving logs from their remote home, for display through Concourse. I'm imagining a world where logs always go to the local Postgres for at least a few days, but some operators will configure aggressive local reaping combined with forwarding into another store

This sounds a lot like what @jchesterpivotal was saying here: #53 (comment) (keep recent builds as close as possible to the ATC, and optionally configure an external event store for archiving builds). I think that's a good way to look at the problem. I did reply to his comment mentioning that I think this could be modelled with the proposed EventStore interface, but it's probably worth revisiting the proposed design with that in mind.

I'm not sure that it should be Concourse's job to decide when remote logs should be reaped. Once we've passed the data on, it's the other system's job to implement its own retention policy, according to whatever criteria happen to make sense... We may also not want to force log deletion just because a pipeline has been deleted - maybe those logs are still useful for analytics

That's a really good point, I hadn't considered that! Do you think there's value in Concourse letting the external system know "by the way, this pipeline no longer exists, so you may want to delete the build events", and the system can choose to delete the events or not?

EDIT: I think if you want to keep the build events around, it probably makes sense to archive the pipeline rather than delete it (in which case we wouldn't try to delete the build events in the first place). It probably doesn't always make sense to enable the "build log reaper" - but that's opt-in, anyway.

How should Concourse be expected to authenticate itself to a remote log store, given that not all users should see all logs, and remote systems have their own idea of how access control works? (That's still the case for the current proposal as well, since a given EventStore would have to implement that logic in some fashion.)

I'm not 100% sure of what types of access control remote systems have, so maybe you could speak more to that. I kind of envisioned it as working like it does currently with Postgres: the ATC is provided credentials that would grant access to all build events, and Concourse implements handles user access controls by team at the API. So, EventStore implementations assume they're able to access build events for any build with the credentials they're configured with, and it's up to Concourse not to serve build events to users who aren't able to access them.

aoldershaw added a commit to concourse/concourse that referenced this pull request May 29, 2020
This header allows users to query the /api/v1/builds/{id}/events
endpoint starting from an offset.

Removing this may be a controversial decision. I discussed this as an
open question in concourse/rfcs#53. Essentially, supporting this header
means that `EventStores` will have to define how to unmarshal `Keys`
(but not marshal them??).

This header seems to be used nowhere internally, and isn't exposed by
Fly or go-concourse, so my guess is it's *probably* no longer used by anyone.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 29, 2020
This commit adds the Postgres implementation for the `EventStore`
interface as described in concourse/rfcs#53.

While this change shouldn't noticably change the outward behaviour of
Concourse noticably, there are a few changes:

* Don't get from the build_events table twice for no reason in Events()
* Some DB interactions now run in a separate transaction to accomodate
  `EventStores` that don't interact with Postgres
* The build_events table is now created at runtime (rather than at
  migration time), and the `pipeline_build_events_x` and
  `team_build_events_x` tables are created when the first build is
  initialized (rather than when the pipeline/team is created)

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 29, 2020
This header allows users to query the /api/v1/builds/{id}/events
endpoint starting from an offset.

Removing this may be a controversial decision. I discussed this as an
open question in concourse/rfcs#53. Essentially, supporting this header
means that `EventStores` will have to define how to unmarshal `Keys`
(but not marshal them??).

This header seems to be used nowhere internally, and isn't exposed by
Fly or go-concourse, so my guess is it's *probably* no longer used by anyone.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 29, 2020
This commit adds the Postgres implementation for the `EventStore`
interface as described in concourse/rfcs#53.

While this change shouldn't noticably change the outward behaviour of
Concourse noticably, there are a few changes:

* Don't get from the build_events table twice for no reason in Events()
* Some DB interactions now run in a separate transaction to accomodate
  `EventStores` that don't interact with Postgres
* The build_events table is now created at runtime (rather than at
  migration time), and the `pipeline_build_events_x` and
  `team_build_events_x` tables are created when the first build is
  initialized (rather than when the pipeline/team is created)

Signed-off-by: Aidan Oldershaw <[email protected]>

# Open Questions

* Does `Key` need to be `interface{...}`, or can it be more specific like `uint`?
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experimental PR that introduced an Elasticsearch implementation of an EventStore, I found that I needed a composite key of (timestamp, tiebreaker) (where both timestamp and tiebreaker are int64s) to uniquely identify events, so a single uint isn't good enough (without relying on Postgres to generate auto-incrementing IDs).

I was experimenting with plugin-based EventStore implementations at the suggestion of @ari-becker, and realized that if the plugin runs in a separate process (i.e. we use something other than Go plugins, which seem very limiting), we'll need some way of sending the Keys over the wire while retaining their ordering information. This isn't so easy with an arbitrary Key type, so I got back to thinking about restricting the Key type to something more concrete that has an implicit ordering.

From the Postgres and Elasticsearch implementations, I got the idea of using a []int64. Using timestamp and tiebreaker could be used for many backends (the timestamp comes from the event, and tiebreaker is an in-memory counter - more details in the PR) - but for backends that support a more native keying approach (like Postgres' auto incrementing event_id or line numbers in file-based storage like S3 or local FS), we could make use of that instead. I wonder if there are backends that won't fit into the []int64 mold - if anyone has a system in mind, I'd love to hear about it. I also wonder if the Key could always be a [2]int64, or if there's value in having an arbitrary length slice.

Another approach is to avoid the Key/cursor approach altogether, which I'd definitely be open to ideas on.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's basically no way to establish sane global ordering without the whole nightmare of distributed consensus. I heartily discourage trying. Leave the business of sorting out ordering to consumers, who can choose what interpretations to apply.

But if what you need is uniqueness, that's much more straightforward. You have broadly two options:

  1. UUIDs, specifically v4. Basically a 122-bit random number, so they can be generated without coordination. They are widely supported data type. PostgreSQL has a native type and the uuid-ossp module can generate them in the database very fast. They're supported in pretty much everything everywhere.
  2. Natural keys. That is, you determine what the distinguishing fields are and then you digest them. This is isomorphic to the database normalisation exercise of identifying the natural key for a table: what is the field or combination of fields which absolutely defines uniqueness? In this case it might be something as simple as (something_identifying_the_database, event_id). As long as the database is unique and the event_id is unique within the database, it "should" be unique overall.

I prefer option 1 to option 2. It's easier to implement and more robust to oversights. Natural keys are easy to describe but once you pick one, it's hard to fix if you discover you made a mistake and have collisions. Plus people are liable to start trying to rely on the natural key components, meaning you wouldn't be able to change it.

tl;dr UUIDs errywhere

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's good advice - in my experimental PR for Elasticsearch, I made the naïve assumption that I could trust the timestamp on build events for ordering, when obviously isn't the case. Builds are tracked by a single ATC at a time, so at least time-based inconsistencies aren't as prevalent, but in general, it's a hacky "solution" (that doesn't really solve the problem). So, I agree that we shouldn't try to achieve distributed consensus.

Leave the business of sorting out ordering to consumers, who can choose what interpretations to apply

This makes sense, but does this mean that Concourse will have to fetch all events for a build (and sort them in memory) before it can serve any events? Or am I misunderstanding what you mean?

Another option that I suggested as a possibility is to use natural keys of (build_id, event_id). For backends that don't support auto incrementing (e.g. Elasticsearch), rely on Postgres sequences to generate these event_ids. This way, we can take advantage of the indexing options of tools like Elasticsearch to perform sorting/pagination outside of Concourse.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense, but does this mean that Concourse will have to fetch all events for a build (and sort them in memory) before it can serve any events? Or am I misunderstanding what you mean?

Sorting in the database is fine, using the internal sequential IDs (ie the primary key). Those are typically monotonic, so they establish a local order. But they're only an ordinal scale (this-before-that), not an interval score (this-10-seconds-before-that) or ratio score (this-10-percent-longer-than-that).

aoldershaw added a commit to concourse/concourse that referenced this pull request May 31, 2020
The motivation for supporting using EventStore implementations defined
in external binaries was suggested in concourse/rfcs#53. Removing the
EventStore implementations from the core Concourse codebase allows for a
few things:

1. EventStores can be versioned independent of Concourse
2. The burden of maintaining many EventStores doesn't fall on the
   Concourse team
3. The barrier to adding support for external tools is much lower, and
   we can hopefully see the open source community create EventStores for
   several platforms (like we've seen with Resource Types)
4. Using gRPC, EventStore authors aren't tied to Go (with a bit of extra
   effort to make it compatible with hashicorp/go-plugin)

If this is an approach we choose to take, it might make sense to
introduce an SDK for building plugins like
hashicorp/terraform-plugin-sdk.

There are two main differences between in-process EventStores (like the
Postgres implementation) and external plugin EventStores:

1. Plugin EventStores don't have access to the full DB models
   (`db.Build`, etc), and can't perform further queries on them.
   Instead, they have a serialized form that contains many of the fields
2. Plugin EventStores can't use arbitrary `Key`s - instead, they are
   restricted to `[]int64`. I posed a question on the RFC if using
   `[]int64` for all EventStores is feasible.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 31, 2020
The motivation for supporting using EventStore implementations defined
in external binaries was suggested in concourse/rfcs#53. Removing the
EventStore implementations from the core Concourse codebase allows for a
few things:

1. EventStores can be versioned independent of Concourse
2. The burden of maintaining many EventStores doesn't fall on the
   Concourse team
3. The barrier to adding support for external tools is much lower, and
   we can hopefully see the open source community create EventStores for
   several platforms (like we've seen with Resource Types)
4. Using gRPC, EventStore authors aren't tied to Go (with a bit of extra
   effort to make it compatible with hashicorp/go-plugin)

If this is an approach we choose to take, it might make sense to
introduce an SDK for building plugins like
hashicorp/terraform-plugin-sdk.

There are two main differences between in-process EventStores (like the
Postgres implementation) and external plugin EventStores:

1. Plugin EventStores don't have access to the full DB models
   (`db.Build`, etc), and can't perform further queries on them.
   Instead, they have a serialized form that contains many of the fields
2. Plugin EventStores can't use arbitrary `Key`s - instead, they are
   restricted to `[]int64`. I posed a question on the RFC if using
   `[]int64` for all EventStores is feasible.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 31, 2020
The motivation for supporting using EventStore implementations defined
in external binaries was suggested in concourse/rfcs#53. Removing the
EventStore implementations from the core Concourse codebase allows for a
few things:

1. EventStores can be versioned independent of Concourse
2. The burden of maintaining many EventStores doesn't fall on the
   Concourse team
3. The barrier to adding support for external tools is much lower, and
   we can hopefully see the open source community create EventStores for
   several platforms (like we've seen with Resource Types)
4. Using gRPC, EventStore authors aren't tied to Go (with a bit of extra
   effort to make it compatible with hashicorp/go-plugin)

If this is an approach we choose to take, it might make sense to
introduce an SDK for building plugins like
hashicorp/terraform-plugin-sdk.

There are two main differences between in-process EventStores (like the
Postgres implementation) and external plugin EventStores:

1. Plugin EventStores don't have access to the full DB models
   (`db.Build`, etc), and can't perform further queries on them.
   Instead, they have a serialized form that contains many of the fields
2. Plugin EventStores can't use arbitrary `Key`s - instead, they are
   restricted to `[]int64`. I posed a question on the RFC if using
   `[]int64` for all EventStores is feasible.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request May 31, 2020
The motivation for supporting using EventStore implementations defined
in external binaries was suggested in concourse/rfcs#53. Removing the
EventStore implementations from the core Concourse codebase allows for a
few things:

1. EventStores can be versioned independent of Concourse
2. The burden of maintaining many EventStores doesn't fall on the
   Concourse team
3. The barrier to adding support for external tools is much lower, and
   we can hopefully see the open source community create EventStores for
   several platforms (like we've seen with Resource Types)
4. Using gRPC, EventStore authors aren't tied to Go (with a bit of extra
   effort to make it compatible with hashicorp/go-plugin)

If this is an approach we choose to take, it might make sense to
introduce an SDK for building plugins like
hashicorp/terraform-plugin-sdk.

There are two main differences between in-process EventStores (like the
Postgres implementation) and external plugin EventStores:

1. Plugin EventStores don't have access to the full DB models
   (`db.Build`, etc), and can't perform further queries on them.
   Instead, they have a serialized form that contains many of the fields
2. Plugin EventStores can't use arbitrary `Key`s - instead, they are
   restricted to `[]int64`. I posed a question on the RFC if using
   `[]int64` for all EventStores is feasible.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request Jun 5, 2020
This header allows users to query the /api/v1/builds/{id}/events
endpoint starting from an offset.

Removing this may be a controversial decision. I discussed this as an
open question in concourse/rfcs#53. Essentially, supporting this header
means that `EventStores` will have to define how to unmarshal `Keys`
(but not marshal them??).

This header seems to be used nowhere internally, and isn't exposed by
Fly or go-concourse, so my guess is it's *probably* no longer used by anyone.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request Jun 5, 2020
This commit adds the Postgres implementation for the `EventStore`
interface as described in concourse/rfcs#53.

While this change shouldn't noticably change the outward behaviour of
Concourse noticably, there are a few changes:

* Don't get from the build_events table twice for no reason in Events()
* Some DB interactions now run in a separate transaction to accomodate
  `EventStores` that don't interact with Postgres
* The build_events table is now created at runtime (rather than at
  migration time), and the `pipeline_build_events_x` and
  `team_build_events_x` tables are created when the first build is
  initialized (rather than when the pipeline/team is created)

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request Jun 11, 2020
This header allows users to query the /api/v1/builds/{id}/events
endpoint starting from an offset.

Removing this may be a controversial decision. I discussed this as an
open question in concourse/rfcs#53. Essentially, supporting this header
means that `EventStores` will have to define how to unmarshal `Keys`
(but not marshal them??).

This header seems to be used nowhere internally, and isn't exposed by
Fly or go-concourse, so my guess is it's *probably* no longer used by anyone.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request Jun 11, 2020
This commit adds the Postgres implementation for the `EventStore`
interface as described in concourse/rfcs#53.

While this change shouldn't noticably change the outward behaviour of
Concourse noticably, there are a few changes:

* Don't get from the build_events table twice for no reason in Events()
* Some DB interactions now run in a separate transaction to accomodate
  `EventStores` that don't interact with Postgres
* The build_events table is now created at runtime (rather than at
  migration time), and the `pipeline_build_events_x` and
  `team_build_events_x` tables are created when the first build is
  initialized (rather than when the pipeline/team is created)

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request Jul 3, 2020
This header allows users to query the /api/v1/builds/{id}/events
endpoint starting from an offset.

Removing this may be a controversial decision. I discussed this as an
open question in concourse/rfcs#53. Essentially, supporting this header
means that `EventStores` will have to define how to unmarshal `Keys`
(but not marshal them??).

This header seems to be used nowhere internally, and isn't exposed by
Fly or go-concourse, so my guess is it's *probably* no longer used by anyone.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit to concourse/concourse that referenced this pull request Jul 3, 2020
This commit adds the Postgres implementation for the `EventStore`
interface as described in concourse/rfcs#53.

While this change shouldn't noticably change the outward behaviour of
Concourse noticably, there are a few changes:

* Don't get from the build_events table twice for no reason in Events()
* Some DB interactions now run in a separate transaction to accomodate
  `EventStores` that don't interact with Postgres
* The build_events table is now created at runtime (rather than at
  migration time), and the `pipeline_build_events_x` and
  `team_build_events_x` tables are created when the first build is
  initialized (rather than when the pipeline/team is created)

Signed-off-by: Aidan Oldershaw <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants