Store tracking events from Redis to Kafka. These events are both click events (generated by the click tracker) and in-app events (generated by the in-app tracker).
The purpose is that the kafkastore does the heavy lifting instead of the trackers:
SSL termination and load balancing are left to the trackers.
The string that is pushed to Redis is structured as follows (all values are separated by a single space):
<IP> <TIMESTAMP> <TOPIC> <EVENT TYPE> <QUERY STRING> <USER AGENT>
- Request IP, which is converted to a country by this code (GeoIP lookup).
- Timestamp in seconds since epoch when the request was received by the tracker.
- Kafka topic to store the message in.
- Event type to store the Kafka message with.
- Original query string of the request.
- User agent, appended to the end. Note: the user agent can contain spaces, so it is assumed to be everything after the query string. It is converted to device information here.
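A minimal sketch of how such a line could be parsed in Ruby (the helper name is an assumption, not the actual tracker or kafkastore code); the split limit of 6 keeps spaces inside the user agent intact:

    # Hypothetical parser for the Redis payload described above.
    # Splitting with a limit of 6 preserves spaces in the user agent.
    def parse_tracker_line(line)
      ip, timestamp, topic, event_type, query, user_agent = line.split(" ", 6)
      {
        ip:         ip,
        timestamp:  Integer(timestamp),
        topic:      topic,
        event_type: event_type,
        query:      query,
        user_agent: user_agent
      }
    end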
If this format should change, then the in-app tracker needs updating, along with the click tracker.
The message format is very simple for the following reasons:
- it is easy for a consumer to determine whether it needs to handle the event (the path is the event type and comes first).
- it is simple to extend the message with more parameters in the future.
- there is a clear separation between the meta details (i.e. the device and GeoIP information) and the original payload of the tracking event.
- it is human readable (i.e. a string), which makes debugging that much easier.
- decoding a message requires little computation (i.e. CGI encoding instead of JSON).
Of course, the assumption made here is that tracking calls will always be simple in nature (i.e. not binary values) and that the parameters will always be key/value pairs (i.e. CGI/URL parameters). This might not be the case if POST requests are used for tracking events, but that is out of scope here.
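As an illustration of the low-computation point, CGI-encoded key/value pairs can be built and decoded with the Ruby standard library alone (the keys here are taken from the example message below; this is not the actual kafkastore code):

    require "uri"

    # CGI/URL-encoded pairs are cheap to produce and to decode,
    # compared to serialising and parsing JSON.
    meta = URI.encode_www_form(country: "DE", device: "smartphone", klag: 1)
    # => "country=DE&device=smartphone&klag=1"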
An example of a typical Kafka message:
/t/ist bot_name&country=DE&device=smartphone&device_name=iPhone&ip=3160898477&klag=1&platform=ios&ts=1465287056 adid=ECC27E57-1605-2714-CCCC-13DC6DFB742E
- First comes the event type; the actual type is assumed to be everything after the final '/' (slash).
- Then the meta data, in the form of CGI encoded parameter/value pairs. The meta data is generated exclusively by the kafkastore and its values are based on the IP and user agent information. In addition, there is a klag value that represents the time (in seconds) the message waited in Redis before being pushed to Kafka.
- Finally, the query string of the original request. This is just passed through from the tracker, unmodified.
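A consumer might split such a message apart as follows (a sketch; the helper name is assumed and this is not the actual consumer code):

    require "cgi"

    # Decode a kafkastore message of the form <type> <meta> <params>.
    def decode_kafka_message(message)
      event_path, meta, params = message.split(" ", 3)
      {
        event_type: event_path.split("/").last, # everything after the final '/'
        meta:       CGI.parse(meta),            # e.g. {"country" => ["DE"], ...}
        params:     CGI.parse(params.to_s)      # the original query string
      }
    end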
If this format should change, then the consumers need updating. However, this is only the case if the format itself changes (i.e. <type> <meta> <params>), not if extra "meta" or "query" parameters are included.
Similar to how the consumers are started, there is also a scheduler here that runs as a Sidekiq cron job. It regularly starts the Kafka worker, and the worker in turn runs the inserter. The difference here is that the worker is given a batch size to pop off Redis and is only enqueued if there are events waiting in the Redis queue.
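A hedged sketch of that scheduler/worker relationship (class names, the queue key, and the batch size are illustrative assumptions, not the actual code):

    require "sidekiq"
    require "redis"

    BATCH_SIZE = 1000 # assumed batch size
    REDIS = Redis.new(url: ENV.fetch("REDIS_URL"))

    # Run regularly via sidekiq-cron; enqueues the worker only if
    # there are events waiting in the Redis queue.
    class KafkaScheduler
      include Sidekiq::Worker

      def perform
        KafkaWorker.perform_async(BATCH_SIZE) if REDIS.llen("events") > 0
      end
    end

    # Pops up to batch_size events off Redis and hands them to the inserter.
    class KafkaWorker
      include Sidekiq::Worker

      def perform(batch_size)
        events = Array.new(batch_size) { REDIS.lpop("events") }.compact
        Inserter.new(events).run unless events.empty?
      end
    end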
Since Redis is single threaded, there is little point in increasing the number of workers. Instead, to scale this, deploy as many kafkastores to Heroku as necessary.
Each kafkastore deployment gets its own Redis, and that Redis can be configured on the tracker side (and here), so that the trackers store events round robin across all instances of the kafkastore code. Configuring a new Redis for the trackers does not require redeploying the trackers.
The number of kafkastores that can be configured for a tracker is not limited; currently two are given as an example, however _3, _4, etc. are possible.
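For illustration, a tracker-side round robin over the configured Redis instances might look like the following (the REDIS_URL_n variable names and the events key are assumptions based on the _3, _4 suffixes above):

    require "redis"

    # Collect all configured kafkastore Redis instances, e.g.
    # REDIS_URL, REDIS_URL_2, REDIS_URL_3, ... (assumed naming scheme).
    CLIENTS = ENV.keys
                 .select { |k| k =~ /\AREDIS_URL(_\d+)?\z/ }
                 .sort
                 .map { |k| Redis.new(url: ENV.fetch(k)) }

    # Round robin: each event goes to the next Redis in turn.
    counter = 0
    store_event = lambda do |payload|
      CLIENTS[counter % CLIENTS.size].rpush("events", payload)
      counter += 1
    end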
Generate a .env and then fill it with values:
prompt> rake appjson:to_dotenv
prompt> $EDITOR .env
Start the worker and web frontend with:
prompt> foreman start web
prompt> foreman start worker
The easiest way to deploy this is to use Heroku!
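For example, a first deployment might look like this (assuming the Heroku CLI is installed and a Redis add-on has been provisioned):
prompt> heroku create
prompt> git push heroku master
prompt> heroku ps:scale web=1 worker=1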