Kafka: why and how we used it

Posted on Mar 18, 2017


Kafka sits at the front end of streaming data, acting as a messaging system to capture and publish feeds, which allows data to be “manipulated, enriched and analyzed before it is persisted to be used by an application,” as MemSQL CEO Eric Frenkiel wrote.

Use Case:

We have a function on AWS Lambda that gathers data from several sources; that data then needs to be processed and saved to our DB. There are a few problems:
  • Our app’s primary feature depends on an aggregation of data from multiple sources, which needs to be collected, processed, and saved to a database.
  • So the first decision: do we do this in the app? No, because that would overload the app, so we use AWS Lambda for data gathering instead.
  • Processing the data on Lambda is not an efficient use of that resource, because we pay for the amount of time the process runs. It is more efficient to collect the data and process it later.
  • When there is high usage in the app, a large number of Lambda requests will be started, all of which then try to write to the DB at once; without some sort of queuing system, we effectively end up DDoS’ing our own database.

Solution:

So, to manipulate data before storing it in the DB, we used the Kafka messaging system: it is a scalable, fault-tolerant, publish-subscribe messaging system that enables us to build distributed applications.

What is Kafka and how does it work?

This is a very good blog for getting the basics of Kafka.

Kafka is a publish-subscribe system that delivers ordered, persistent, scalable messaging. It has publishers, topics, and subscribers. It can also partition topics and enable massively parallel consumption.

Messages are persisted and replicated to peer brokers for fault tolerance, and they stay around for a configurable period of time (by default, they are kept for 7 days).
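Retention is set in the broker configuration, for instance in server.properties. The values below are just Kafka's defaults, not something specific to our setup:

```
# server.properties (broker config) – retention settings
log.retention.hours=168      # keep messages for 7 days (the default)
log.retention.bytes=-1       # no size-based retention limit
log.cleanup.policy=delete    # delete old log segments once retention expires
```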

Key features:

  • It lightens the load by not maintaining indexes that record what messages it has. There is no random access: consumers just specify offsets, and Kafka delivers the messages in order, starting from that offset (see the sketch after this list).
  • There are no deletes. Kafka keeps all parts of the log for the specified time.
  • It can efficiently stream messages to consumers using kernel-level I/O, without buffering the messages in user space.
  • It can leverage the operating system for file page caches and efficient write-back/write-through to disk.
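To illustrate the offset model from the first point, here is a minimal Node.js sketch using the kafka-node package; the broker address, topic name, and starting offset are our assumptions, not from this post:

```js
// Sketch: consuming from an explicit offset with kafka-node (assumed setup).
const kafka = require('kafka-node');

const client = new kafka.KafkaClient({ kafkaHost: 'localhost:9092' });

// fromOffset: true tells the consumer to start at the offset we specify
// in the payload, rather than at the last committed offset.
const consumer = new kafka.Consumer(
  client,
  [{ topic: 'topic', partition: 0, offset: 0 }],
  { autoCommit: false, fromOffset: true }
);

consumer.on('message', (message) => {
  // Messages arrive in order, starting from the requested offset.
  console.log(message.offset, message.value);
});
```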

For more information on why it is so popular, you can refer to this.

How we have implemented it (Node.js):

We are using a multi-broker (currently three) Kafka architecture with a partitioned topic, to handle more load properly. Each partition has a leader; messages are written to the partition leader and replicated across multiple brokers. In this case, we have three consumers, one consuming messages from each partition (all consumers are part of a single group).
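Such a topic would typically be created with Kafka's CLI tooling; the partition and replication counts below are inferred from the three-broker setup described above:

```
bin/kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --topic topic \
  --partitions 3 \
  --replication-factor 3
```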

A producer sends messages to a topic; here we have a topic named “topic”. So the process looks like this: the producer sends a message to the topic, and the consumer consumes messages from the topic.

Producer:

We have our producer on AWS Lambda; a request to Lambda is sent when a user requests KC from the app. At Lambda, a request is sent to Google for the top 10 URLs, and the resulting links, along with a random topic partition, are sent to Kafka.
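Here is a minimal sketch of what that Lambda producer could look like with kafka-node; the handler shape, broker addresses, event fields, and the fetchTopUrls helper are all hypothetical:

```js
'use strict';
const kafka = require('kafka-node');

// fetchTopUrls is a hypothetical helper standing in for the Google
// top-10 URL lookup described above.
const fetchTopUrls = require('./fetchTopUrls');

exports.handler = (event, context, callback) => {
  // Broker addresses are assumptions; connect to the three-broker cluster.
  const client = new kafka.KafkaClient({
    kafkaHost: 'broker1:9092,broker2:9092,broker3:9092'
  });
  const producer = new kafka.Producer(client);

  producer.on('ready', () => {
    fetchTopUrls(event.query, (err, links) => {
      if (err) return callback(err);

      const payloads = [{
        topic: 'topic',                            // the topic named "topic"
        messages: JSON.stringify({ user: event.userId, links }),
        partition: Math.floor(Math.random() * 3)   // random partition, as described
      }];

      producer.send(payloads, (sendErr, result) => callback(sendErr, result));
    });
  });

  producer.on('error', callback);
};
```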

Kafka server and Consumer:

At the server, the Kafka topic receives the messages and stores them. A consumer process running on the Node server waits for messages and consumes them in bulk when any are available. There, the KC is calculated and the data is stored in MongoDB and in our cache database.
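And a sketch of one consumer in the group, again with kafka-node; computeKC and saveResult are hypothetical stand-ins for the KC calculation and the MongoDB/cache writes:

```js
'use strict';
const kafka = require('kafka-node');

// Hypothetical helpers for the KC calculation and persistence described above.
const { computeKC, saveResult } = require('./kc');

// All three consumer processes share the same groupId, so Kafka assigns
// each of them one of the three partitions.
const consumer = new kafka.ConsumerGroup(
  {
    kafkaHost: 'broker1:9092,broker2:9092,broker3:9092',  // assumed addresses
    groupId: 'kc-consumers',
    fromOffset: 'latest'
  },
  ['topic']
);

consumer.on('message', (message) => {
  const { user, links } = JSON.parse(message.value);
  const kc = computeKC(links);   // calculate the KC from the gathered links
  saveResult(user, kc);          // persist to MongoDB and the cache
});

consumer.on('error', (err) => console.error(err));
```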

Overall, Kafka reduced the overhead on our single server, which ultimately reduces the load on our system.