hippo


Hippo is a data ingestor service for gRPC and REST based clients. It publishes your messages to a Kafka queue and eventually saves them to InfluxDB. It is built to be easily scalable and to prevent a SPOF (single point of failure) for your mission-critical data collection.

Why? Why not?

High-speed data ingestion is a big problem if you ask me. Take server-level logs as an example: say you have multiple microservices deployed in production, all of them producing large quantities of logs at a very high rate. How would you make sure that you capture every single message without losing any of them and, more importantly, save them to a place where you can actually query them later for analysis? This is where hippo shines. The gRPC/REST proxy and the Kafka brokers can be scaled based on the load, and the messages are eventually saved to a write-intensive InfluxDB, where you can take advantage of all the services built on top of InfluxDB for analysing the data (check this).

Also, while making this project I learnt so many things and it gave me an excuse to solve one of the problems that I face a lot, dealing with server logs.

Performance

These data points were obtained by running 1 instance of each service on a quad-core, 16 GB RAM PC.

[Benchmark results]

Benchmarking

I used ghz as the benchmarking tool (-c is the number of concurrent workers, -n the total number of requests, -z the maximum run duration):

ghz --insecure --proto kafka-pixy.proto --call KafkaPixy/Produce \
  -d "{\"topic\":\"foo\",\"message\":\"SSBhdGUgYSBjbG9jayB5ZXN0ZXJkYXksIGl0IHdhcyB2ZXJ5IHRpbWUtY29uc3VtaW5nLg==\"}" \
  -c 1000 -n 100 -z 5m 127.0.0.1:19091

Tech stack and its use

  • apache-zookeeper: Manages the Kafka brokers: choosing the leader node and replacing nodes during failover.
  • apache-kafka: The messaging queue. Messages are stored in a TOPIC; topics can span different partitions, producers put messages on the queue and consumers read them off.
  • kafka-pixy: The gRPC/REST proxy server that contacts the Kafka queue on behalf of your applications.
  • timberio/vector: A very generic piece of software that takes data from your SOURCE (files, Kafka, statsd, etc.), does aggregation, enrichment, etc. on it, and saves it to the SINK (a DB, S3, files, etc.).
  • influxDB: A time-series DB.

Architecture

[Architecture diagram: gRPC/REST clients → kafka-pixy → Kafka (managed by ZooKeeper) → vector → InfluxDB]

Alternative Methods

These are just the ones I can think of off the top of my head.

  • Use a third-party paid service to manage your data collection pipeline. But in the long run you become dependent on it, and as the data grows your cost increases manifold.

  • Create a proxy and start dumping everything into a cache (Redis etc.), with a worker that feeds it into the database. But a cache is expensive to maintain, and there is only so much you can store in it.

  • Feed the database directly. But you will be bounded by queries per second, and you will not get reliable gRPC support for major databases.

Configurations

In order to run hippo, you need to configure the following -

  1. kafka-pixy/kafka-pixy.yaml

Point it to your Kafka broker(s):

kafka:
  seed_peers:
    - kafka-server1:9092

Point it to your ZooKeeper:

zoo_keeper:
  seed_peers:
    - zookeeper:2181

  2. vector/vector.toml

This .toml file will be used by timberio/vector (which is ❤️ btw) to configure your source (apache-kafka) and your sink (influxDB).
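
For reference, here is a minimal sketch of what such a file could look like, assuming a Kafka source reading topic foo and an InfluxDB 1.x sink. The component names, addresses, and database below are placeholders; check vector/vector.toml in this repo and the vector docs for the exact option set.

# Read raw messages off the Kafka topic that kafka-pixy produces to.
[sources.kafka_in]
type = "kafka"
bootstrap_servers = "kafka-server1:9092"  # same broker(s) as in kafka-pixy.yaml
group_id = "hippo-vector"                 # consumer group used by vector
topics = ["foo"]                          # topic(s) to drain

# Write everything to InfluxDB.
[sinks.influx_out]
type = "influxdb_logs"
inputs = ["kafka_in"]                     # wire the source into the sink
endpoint = "http://influxdb:8086"         # InfluxDB HTTP API
database = "hippo"                        # target database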

Usage

The proxy (kafka-pixy) opens up 2 ports for communication.

Port Protocol
19091 gRPC
19092 HTTP

The official repository does a great job ❤️ of explaining how to connect to the proxy server, but for the sake of simplicity, let me explain here.

Using REST based clients

It can be done like this (here foo is the topic name and bar is the group_id):

  1. Produce a message on a topic:
curl -X POST 'localhost:19092/topics/foo/messages?sync' -d 'msg=sample message'

  2. Consume a message from a topic:
curl -G 'localhost:19092/topics/foo/messages?group=bar'
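
From application code the same REST call is a one-liner. Here is a minimal sketch in Go, assuming hippo is running locally on the default HTTP port, with the same topic foo and msg form field as above:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Produce a message on topic "foo" via kafka-pixy's REST endpoint.
	// ?sync makes the call block until Kafka acknowledges the message.
	resp, err := http.PostForm(
		"http://localhost:19092/topics/foo/messages?sync",
		url.Values{"msg": {"sample message"}},
	)
	if err != nil {
		log.Fatalf("produce: %v", err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%s body=%s\n", resp.Status, body)
}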

Using gRPC based clients

The official repository (which is awesome, btw) provides the .proto file for the client; just generate the stub for the language of your choice (they have helper libraries for Go and Python) and you will be ready to go.

For the sake of simplicity, I have provided a sample client.
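
As a rough sketch of what such a client looks like in Go, assuming stubs generated from kafka-pixy's kafka-pixy.proto: the import path below points at the upstream pre-generated Go package, and the message/field names come from that proto, so double-check both against your generated code.

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"

	pb "github.com/mailgun/kafka-pixy/gen/golang" // assumed stub location; adjust to yours
)

func main() {
	// Connect to kafka-pixy's gRPC port (19091 by default).
	conn, err := grpc.Dial("127.0.0.1:19091", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	client := pb.NewKafkaPixyClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Produce one message on topic "foo"; vector will eventually move it
	// from Kafka into InfluxDB.
	rs, err := client.Produce(ctx, &pb.ProdRq{
		Topic:   "foo",
		Message: []byte("sample message"),
	})
	if err != nil {
		log.Fatalf("produce: %v", err)
	}
	log.Printf("stored at partition=%d offset=%d", rs.Partition, rs.Offset)
}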

Running

Simply run docker-compose up to set up the whole thing.

How to use in production 🔥

Some things that I can think of off the top of my head -

  1. Add a reverse proxy (a load balancer, maybe?) that can route traffic to the pixy instances in different AZs.

  2. Configure TLS in pixy, add password protection to ZooKeeper, and configure proper policies for InfluxDB.

  3. Use a container orchestration tool (Kubernetes or Docker Swarm and all the fancy jazz) to scale up the servers whenever required.

Use cases

Hippo can be used to handle high-volume streaming data. Some of the use cases that I can think of are -

  1. Collecting server logs produced by different microservices

  2. IoT devices that generate large amounts of data

TBD (v2 maybe (?))

  • Right now, the messages sent do not enforce any schema; such a feature would be very useful when querying InfluxDB (add an enrichment layer to the connector itself, maybe?)

  • Provide benchmarks

Acknowledgement

This project wouldn't have been possible without some very, very kickass open-source projects 🔥

  1. kafka-pixy

  2. timberio/vector

  3. bitnami

  4. apache-kafka

  5. apache-zookeeper



Yash Mehrotra



if (repo.isAwesome || repo.isHelpful) {
  StarRepo();
}