Adrian Cole edited this page Jul 14, 2020 · 4 revisions

This design was completed in https://github.com/openzipkin/zipkin/pull/3018

Zipkin only stores traces for one week by default, but there are cases where people want to access them past their expiration date. The most common is adding a link to a trace in an issue or code review: the idea is that it'll be easier for people to see exactly what problem was being described. After one week, though, that link will be dead because the trace will have been deleted from storage.

The current UI lets you download a trace as JSON and re-upload it later. That works, and it's great for sharing a trace with people outside the org, but it's quite annoying when you just want to persist a trace. There are several steps involved: you have to download the JSON, upload it somewhere in the system you use to track issues, and then everyone who wants to see it has to download the JSON again and upload it back into the Zipkin UI. It's not terrible, but there are enough steps that most people I know end up sharing a screenshot of the useful part of the trace instead.

It'd be great to be able to just click a button and have the current trace persisted for a longer period of time.

Past efforts

This topic has been brought up before. The most common complaints against those proposals were worries about having to double the infrastructure, companies not mounting the write API in a place reachable from the UI, and questions about how you'd then query the saved data.

Proposal

UI / Javascript

Let's add a new optional "Archive trace" button to the UI.

Showing the button is controlled via a config option, so that people who haven't configured an archival endpoint won't see the button at all. This option would be disabled by default, which prevents shipping a broken button that would confuse users; it also means that companies that aren't interested, or are unable to use this for whatever reason, can just leave it off.

The config option would just be something like:

ARCHIVE_URL = https://zipkin.archive.mycompany.com/

Clicking the button would trigger an HTTP POST to the /api/v2/spans endpoint at that URL, with the current trace as the payload.

It'd also show a popup or a banner somewhere with a copyable link to the archived trace that you could share.
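The button's behavior can be sketched as two small steps: build a POST against the archive server's /api/v2/spans endpoint (the standard Zipkin span-ingestion endpoint), then build the shareable link. This is a minimal Python sketch, not the actual UI code; the helper names are hypothetical, and the /zipkin/traces/{id} link path is an assumption about the archive server's UI routing.

```python
import json
from urllib.parse import urljoin

# Assumed config value from this proposal (not an existing Zipkin setting):
ARCHIVE_URL = "https://zipkin.archive.mycompany.com/"


def build_archive_request(archive_url, trace):
    """Return (url, body) for POSTing the current trace to the archive server.

    /api/v2/spans accepts a JSON list of spans, which is exactly what the
    UI already holds for the trace being viewed.
    """
    url = urljoin(archive_url, "api/v2/spans")
    body = json.dumps(trace)
    return url, body


def archived_trace_link(archive_url, trace_id):
    """Shareable link to the archived trace (path is an assumption)."""
    return urljoin(archive_url, f"zipkin/traces/{trace_id}")


# Example with a minimal single-span trace:
trace = [{"traceId": "86154a4ba6e91385", "id": "86154a4ba6e91385",
          "name": "get /api", "timestamp": 1594652555000000, "duration": 1200}]
url, body = build_archive_request(ARCHIVE_URL, trace)
link = archived_trace_link(ARCHIVE_URL, trace[0]["traceId"])
print(url)   # -> https://zipkin.archive.mycompany.com/api/v2/spans
print(link)
```

In the real UI this would be a fetch() call followed by showing the link in a banner; the sketch only covers the request and link construction.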

Zipkin Server

No change would be needed to the Zipkin API.

You'd just need to run another zipkin-server instance reachable at that ARCHIVE_URL, ideally configured to write to a different C* keyspace or ES index. There's no need for a separate cluster since the amount of archived data would probably be very small, so a separate keyspace/index minimizes the extra infrastructure required.
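As a rough sketch, the second server could reuse the same storage cluster with zipkin-server's standard storage environment variables; the keyspace/index names below are examples, not defaults, and hostnames are placeholders.

```shell
# Second zipkin-server instance, the one reachable at ARCHIVE_URL.
# Cassandra: same cluster, separate keyspace for archived traces:
STORAGE_TYPE=cassandra3 \
CASSANDRA_CONTACT_POINTS=cassandra.mycompany.com \
CASSANDRA_KEYSPACE=zipkin2_archive \
java -jar zipkin.jar

# Or Elasticsearch: same cluster, separate index prefix:
STORAGE_TYPE=elasticsearch \
ES_HOSTS=https://es.mycompany.com:9200 \
ES_INDEX=zipkin-archive \
java -jar zipkin.jar
```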

This would also serve as the UI for the archived data.

Datastore

I don't really know how indexes work in Elasticsearch, but for Cassandra you'd need to create a new keyspace and set whatever default TTL you want on its tables (1 year, 10 years, forever). You could in theory reuse the same keyspace, but the compaction algorithm works best when all the data expires at the same time, so I'd be worried about messing that up since it already has enough problems under load.
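For Cassandra this could look roughly like the CQL below. Note that in Cassandra the default TTL is a per-table property (default_time_to_live), not a keyspace property; the keyspace, table, and datacenter names here assume the stock zipkin2 cassandra3 schema and are illustrative only.

```sql
-- Archive keyspace mirroring the zipkin2 schema (replication settings
-- are an example; match your site's topology):
CREATE KEYSPACE zipkin2_archive
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'};

-- After creating the schema's tables in this keyspace, raise the TTL:
ALTER TABLE zipkin2_archive.span
  WITH default_time_to_live = 31536000;  -- ~1 year; 0 means never expire
```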

How would this address past issues

Having to double the infrastructure

The only extra thing you'd really need to run is another zipkin-server, but most sites already run more than one to handle the load, so that doesn't seem like a big requirement. You'd also need an extra keyspace or index in the datastore, but that's only a config change and doesn't require new infrastructure.

Companies not mounting the write API in a place reachable from the UI

I don't have a solution for this, but in my opinion that's a tradeoff sites will need to consider if they don't want to expose the write API. Since they're explicitly saying that neither the UI nor anyone else may write anything, I think it's fine that they can't use this feature. And since it would be disabled by default, it wouldn't clutter their UI with a broken button anyway.

How you'd query saved data

I don't think there's a need for a common query interface over both the normal data and the archived data. In my opinion it would also look weird, since you might see duplicated traces. In the first week you don't really need the archived traces because you still have the normal ones; after one week, you only need the archived ones anyway.

Using the archive URL and its zipkin-server to search and view the archived traces seems like a perfectly fine option to me.

What if I'm fine with the status quo?

Since this would be disabled by default, you wouldn't notice any difference. If you're fine with downloading the JSON and uploading it, there's no need for a separate archival mechanism. That doesn't mean the status quo is something to strive for, though.

What about the sites that won't be able to use this?

I don't think this solution adds much burden to sites (or maintainers). Yes, companies that don't expose the write API or are unwilling to run a second zipkin-server won't be able to use it, but in my opinion this handles the most common case with the least amount of work for everyone.

Abandoning the effort in an attempt to come up with something that satisfies everyone (which is what happened with https://github.com/openzipkin/zipkin/pull/1747) doesn't benefit anyone, since the new effort will just be abandoned again. I'm much more in favor of doing something that works for most sites and handling the specific corner cases later.
