Communication and Kafka


Take my warning to heart instead, and don't be so unyielding in future, you can't fight against this court, you must confess to guilt. Make your confession at the first chance you get. Until you do that, there is no possibility of getting out of their clutches, none at all.

-- Franz Kafka, Der Prozess

Let's move to the other side and consider Kafka from the perspective of patterns of communication.

The situation with patterns of communication is a little more complicated than it was with database systems and storage paradigms. With storage, you can pretty much bend every paradigm to every use case. It will not be efficient, mind you, and it may require a lot of work on your side of the program, especially if there is a great mismatch between the paradigm's optimal use cases and yours: for example, if you have data that you need to model extensively and join across, and you use Redis or another KV store, you will have a lot of work to do, and your solution might end up being subpar anyway.

However, in the case of communication patterns, sometimes you just don't have a choice. The trade-offs would either be too high, or you are integrating with a system or standard that dictates it for you.

Let's take a basic overview of some patterns, so we can finally get to how Kafka fits into the puzzle.

Request-response

This pattern is the one you are most likely to have practiced before when developing software. Also called request-reply, it is one of the most basic methods computers use to communicate with each other over a network.

One program sends a request for some data, and another one responds to the request. In more generic terms, it is a message exchange pattern in which a requestor sends a request message to a replier, and the replier validates and processes the request, finally returning a response message.

This is a simple messaging pattern that allows two parties to have a two-way conversation with one another over a channel, and you most commonly see it in client-server architectures (since nothing says that each party has to talk to only one other party exclusively).

As a result, the requestor is often called the client and the responder the server, interchangeably and at the cost of some specificity.

[Figure: a simplistic depiction of request-response]

Typically, this pattern is purely synchronous, and your most likely encounter with it is HTTP, which, to refresh our knowledge, stands for HyperText Transfer Protocol. (Technically, HTTP in recent versions can do more, but let's not complicate things here.)

The difference between this and one-way communication is that we await a response, whereas one-way communication does not.
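To make this concrete, here is a minimal sketch of a requestor in Rust, using only the standard library: it opens a TCP connection, hand-writes an HTTP/1.1 request, and then blocks until the responder sends a response back. The host is just a placeholder.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // The requestor opens a connection to the responder.
    let mut stream = TcpStream::connect("example.com:80")?;

    // Send the request message (a hand-written HTTP/1.1 GET).
    let request = "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n";
    stream.write_all(request.as_bytes())?;

    // Synchronously wait for the response message.
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    println!("{response}");

    Ok(())
}
```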

Request-response is great if you need two components to talk to each other, but if you want one component's message to be delivered to multiple destinations, you have to use another communication pattern.

Batch processing

This is essentially a one-way communication pattern. One party sends or stores data in some place, and another periodically (and the period might be quite long) comes to pick it up from there. The final destination of the processed data may be an entirely different component.

The processor then either stores the data elsewhere, or it may put it back into the original storage medium.

[Figure: an example of batch processing — a producer stores data, a batch processor fetches it periodically and passes the results on to a destination]

For the producer, this pattern of communication is definitely one-way because it doesn't care about what happens to the data it sends.

Some implementations might of course be two-way, if the store is asked to send back an ACK (acknowledgement) confirming the receipt of the data.
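As a rough sketch of the processor's side of this pattern, assuming the producer simply drops its data into a file (the paths here are hypothetical), a batch processor in Rust could look like this:

```rust
use std::fs;
use std::thread::sleep;
use std::time::Duration;

fn main() {
    loop {
        // Periodically pick up whatever the producer has stored so far.
        let batch = fs::read_to_string("incoming/temperatures.txt").unwrap_or_default();

        // "Process" the whole batch at once; here we just count the records.
        let count = batch.lines().count();

        // Put the result into a different destination.
        fs::write("processed/report.txt", format!("records so far: {count}\n"))
            .expect("could not write report");

        // The period between runs may be quite long: minutes, hours, even days.
        sleep(Duration::from_secs(60));
    }
}
```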

Fire-and-forget

This is the first half of the previous pattern taken to the extreme. In the fire-and-forget pattern of communication, we simply do not care. The message sender does not await any sort of response from the destination, and it doesn't care whether the destination received the message at all. In this model, the recipient has no relationship with the originator/sender of the message.

[Figure: fire-and-forget — a sender sends data to a recipient and moves on]

An example of a protocol that can be used in a fire-and-forget fashion is UDP, the User Datagram Protocol. We just send datagrams somewhere, and we care about literally nothing with regard to their delivery.

The benefit of fire-and-forget is that, from the sender's perspective, it is very fast. You just send the message somewhere, you are done, and you can focus on the next task.
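A minimal fire-and-forget sketch in Rust using a UDP socket from the standard library (the destination address is just a placeholder):

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // Bind to any free local port; we will never read from this socket.
    let socket = UdpSocket::bind("0.0.0.0:0")?;

    // Send the datagram and move on: no acknowledgement, no retry,
    // and no idea whether it ever arrived.
    socket.send_to(b"temperature=21.5", "192.0.2.10:9999")?;

    Ok(())
}
```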

Point-to-point

In the point-to-point pattern, the sender sends messages to only one receiver, even if there are many receivers listening on the same message channel/queue. Oftentimes, there is an intermediary involved that handles the routing of the message.

[Figure: point-to-point — a sender's message travels through a channel to exactly one of several receivers]

For example, we can consider an email to a single person: you are sending your email to an SMTP server (which serves as our channel), and there are many receivers attached to this channel, yet the email is delivered only to the person you intend it for (unless it's some sketchy, malicious implementation, of course).

The difference between point-to-point and fire-and-forget is that point-to-point cares about message delivery success, whereas, as we discussed earlier, fire-and-forget does not.

Pipes and Filters

The pipes and filters pattern stands betwixt an architectural pattern and a communication pattern. Here, we have a source and a destination, just like in the point-to-point pattern, but there are a couple of differences.

For starters, we would now prefer the term pipe to refer to a communication channel between the original source and the final destination, and the major difference is that there are now a couple of stops along the way. These stops are the titular filters. Each filter is responsible for a single transformation or data operation. The messages, or data, are streamed from one stop to the next as quickly as possible, and all of the operations run in parallel.

Ideally, the filters are loosely-coupled and so they can be used to create more than one pipeline. One filter always receives data from only one filter/original sender and only delivers transformed data to one filter/final destination.

We can see some similarities with the batch processing pattern as depicted above. The difference is that the stages of a batch-sequential model operate in turn, that is, one at a time, whereas pipes and filters all run concurrently, in parallel.

The origin and the final destination are often simply called the source and the sink.

[Figure: pipes and filters — source → filter F1 → filter F2 → sink]

If you have used a Unix command-line pipeline, then you have observed the pipes-and-filters pattern in action.

For example:

grep "name" < Cargo.lock | tr -d ' ' | cut -d'=' -f2 > res

But keep in mind that not all UNIX pipelines are technically examples of pipes-and-filters. If we changed the previous example only slightly, it would no longer be one:

grep "name" < Cargo.lock | tr -d ' ' | cut -d'=' -f2 | sort > res

The sort command requires its entire input before it can emit anything, and so it is more batch processing than anything else.
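The same idea can be sketched in Rust with threads and channels standing in for the filters and pipes: every filter runs concurrently, receives data from exactly one upstream pipe, and delivers its results to exactly one downstream pipe.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Pipes: channels connecting the source, the filters and the sink.
    let (src_tx, f1_rx) = mpsc::channel::<String>();
    let (f1_tx, f2_rx) = mpsc::channel::<String>();
    let (f2_tx, sink_rx) = mpsc::channel::<usize>();

    // Filter 1: trim whitespace. Runs concurrently with everything else.
    thread::spawn(move || {
        for line in f1_rx {
            f1_tx.send(line.trim().to_string()).unwrap();
        }
    });

    // Filter 2: measure the length of each record.
    thread::spawn(move || {
        for line in f2_rx {
            f2_tx.send(line.len()).unwrap();
        }
    });

    // Source: feed a few messages into the first pipe.
    for msg in ["  hello ", " pipes and filters  "] {
        src_tx.send(msg.to_string()).unwrap();
    }
    drop(src_tx); // closing the pipe lets the downstream filters finish

    // Sink: consume the final results.
    for len in sink_rx {
        println!("processed record of length {len}");
    }
}
```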

Publish-subscribe (Pub-sub)

The publish-subscribe pattern is one where the senders of messages, here called publishers, do not send messages with the intent of delivering them directly to a specific receiver. Instead, the published messages are sorted into some sort of classes, and the receivers, in this pattern called subscribers, express interest in certain classes and only receive messages that are of interest to them, without the publishers knowing, or being able to influence, which subscribers, if any at all, receive the messages.

The process that determines which messages are delivered to a particular subscriber is called filtering. There are two basic types of filtering:

  • topic-based
  • content-based

Under content-based filtering, subscribers subscribe to a particular pattern to be found inside the content of the message. It is typically the subscribers that are responsible for classifying the messages.

The topic-based model has publishers categorize their messages under topics, which are named logical channels.

Furthermore, in most realizations, a broker stands between publishers and subscribers.

[Figure: publish-subscribe — a message published under the topic 't1' is delivered by the broker to the subscribers of 't1', but not to the subscriber of 't2']

Push-pull (also known as Message Queue)

Push-pull, or message queue, is a pattern very similar (in structure and components) to point-to-point and pub-sub. However, here, instead of messages from a sender being delivered to every receiver that "subscribes" to them, they are distributed evenly among the receivers.

The terminology here is that we call the sender component a producer, and the receiver components we call consumers.

While the methods of distributing the messages may differ, the typical default behavior is to perform a round-robin between all of the consumers.

We can deduce from this that sending messages in a push-pull manner is best if we want to distribute tasks between destinations that all do the same thing with the data, thereby scaling a single component horizontally, whereas publish-subscribe is what we need if we want to send a particular message to all components that require it, generally because they each need to do something different with it.

For example, going back to the old example of you being a weather station, you might need things to be distributed enough that you break down your statistics application as follows:

  • logger - saves the temperatures into a publicly available database
  • averager - continually calculates different temperature averages
  • min/max - watches for minimum and maximum temperatures over a particular time

All of these require the same data on input: whatever the temperature sensors record. That suits the pub-sub pattern of communication.

However, if the logging task were lagging behind because the machine you rented has too little upload bandwidth, and you wanted to spread it across two rented machines, then you might want to distribute messages to two machines with separate internet connections, so that the data can flow freely into the final database.

[Figure: push-pull with a round-robin approach — the broker splits messages 1-5 from the producer across three consumers: 1,4 / 2,5 / 3]

You should also be able to deduce an important feature of application systems that is key for us: communication patterns are not exclusive; we can use pub-sub and push-pull at once to tackle two different scaling problems.

If we needed to scale all of the components, there is nothing preventing us from overlaying pub-sub over a push-pull architecture, so that we can scale every component horizontally while also making our system distributed. This flexible approach facilitates effective scaling out: we can meet customer/user demand without having to invent NASA-level servers, and instead throw many weaker, more cost-efficient machines at the problem. If we design our system this way, we can use the fact that we have a big application cluster to distribute our systems across the world, reducing latency for our users and also following the old backup rule of not keeping all copies of data colocated.

Kafka at last

Now that we have observed some of the most common communication patterns, you might be wondering how Kafka fits into all of this. As a matter of fact, that's why we are here.

Well, we examined Kafka as a storage medium in the previous chapter, but Kafka is a communication medium as well.

In some of the previous communication patterns, one component kept showing up: the broker. Kafka, which describes itself as a distributed event streaming platform, is essentially a broker from the perspective of communication.

It has some defaults. By default, Kafka saves all messages in a log, as we discussed. This is great for persistence, because we may get into a situation where a program's state depends on all previous messages (from the weather station example, consider a naively implemented average temperature component).

Another default is the publish-subscribe architecture. All messages sent by producers to a particular topic will be received by all consumers that subscribe to said topic. However, push-pull is also built in.

Keep in mind that there is no real way to do point-to-point in Kafka: producers cannot know anything about consumers and vice versa, so they cannot choose a specific consumer to deliver a message to, even when multiple consumers are subscribed to the same topic.

Consumer groups

To be able to leverage Kafka to distribute our work among copies of the same consumer, i.e. if we want to scale only a particular component, we need to use something called consumer groups.

Consumer groups are a tool for grouping identical consumers together, and by default, messages are distributed round-robin between all of the consumers in a particular consumer group. Let's illustrate this with a system where we have one consumer group with three consumers and another consumer that either belongs to a different consumer group or doesn't have one specified at all.

[Figure: consumer groups — messages 1, 2, 3 are split between the three consumers in group g1, while the consumer in group g2 receives all of them]

However, we also need to keep in mind that there can be more than one producer producing a particular type of message. This means that we can horizontally scale producers as well. Here, there is no special behavior; messages are typically ordered by arrival.

[Figure: two producers feeding the same topic — their messages are again split between the consumers in group g1, while the consumer in group g2 receives all of them]

There is nothing limiting the number of producers or the number of consumers in distinct consumer groups. However, what is limited is the number of consumers per consumer group: the number of consumers that actively receive messages is bounded by the number of partitions the topic has.

In the Appendix, depending on how we set up Kafka, we either have a topic test with three partitions, three replicas and three brokers running, or only one of each in the case of the Docker-based setup with docker-compose.

We wouldn't be able to realize the previous example like this with the Docker setup because the g1 consumer group has three consumers, but the topic test only has one partition.
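As a sketch of how a consumer joins a consumer group, assuming the rust-rdkafka crate and the three-partition test topic from the Appendix (your client library from the previous chapter may differ): every copy of this program started with the same group.id shares the topic's partitions with the other copies.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::Message;
use std::time::Duration;

fn main() {
    // Every copy of this program started with the same `group.id` joins the
    // same consumer group, and the topic's partitions are split between them.
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "g1")
        .set("auto.offset.reset", "earliest")
        .create()
        .expect("consumer creation failed");

    consumer.subscribe(&["test"]).expect("could not subscribe");

    loop {
        // Poll the broker for the next message assigned to this group member.
        if let Some(Ok(msg)) = consumer.poll(Duration::from_millis(500)) {
            let payload = msg.payload_view::<str>().and_then(Result::ok).unwrap_or("");
            println!("partition {} -> {}", msg.partition(), payload);
        }
    }
}
```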

Kafka patterns - 1-to-1

The simplest pattern we encounter when using Kafka is the one-to-one pattern, where we have, per topic, only one producer and one consumer. While this is significantly less useful than using Kafka to distribute data and scale horizontally, it still has some uses.

For one, it opens up the possibility of connecting something like Kafka Connect to pour the data from a topic into some other persistent storage, like an SQL database. This can be very useful: it lets you execute queries, or integrate with services that do not support Kafka when implementing that support yourself is out of your reach or out of your budget. You could also connect Secor, a tool for persisting your topic log into cloud storage such as Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, and OpenStack Swift.

Another benefit, apart from opening up possibilities for the future, is that the message log is persistent. Beyond what we spoke about before, this may also be useful if you encounter a sequence of data that breaks your application. Instead of trying to decipher what might have happened from your logs, which could frankly be incomplete, you can replay the exact sequence of messages that killed your application. We therefore get a debugging benefit as well.

Finally, although it may seem clear already, this makes your applications much more loosely coupled: even if your consumer dies, your producer will continue receiving input and producing messages into the topic, and this data will not get lost. Compare and contrast with a request-response architecture, where the data may not only be permanently lost, but the failure of a server might lead to a cascading failure of all the clients and bring the entire application down for everybody.

In production, especially if you have many users, this may be very costly to you, both in terms of missed profit and in terms of the customers' impression of your service (of course, the average Joe gets unbelievably angry at the slightest inconvenience; such is the way of life).

[Figure: the 1-to-1 Kafka pattern — one producer, Kafka, one consumer]

[Figure: the 1-to-1 pattern extended with Kafka Connect pouring the topic into PostgreSQL]
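For illustration, a minimal producer for the 1-to-1 case might look like the following sketch, again assuming the rust-rdkafka crate; the topic name and key are placeholders. A matching consumer would simply subscribe to the topic and poll in a loop, much like the consumer-group sketch above.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::{BaseProducer, BaseRecord, Producer};
use std::time::Duration;

fn main() {
    let producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .create()
        .expect("producer creation failed");

    // Enqueue a message for the `test` topic; the consumer on the other side
    // may be down right now and the data will still be waiting in the log.
    if let Err((e, _)) = producer.send(BaseRecord::to("test").key("sensor-1").payload("21.5")) {
        eprintln!("failed to enqueue message: {e}");
    }

    // Make sure everything buffered locally actually reaches the broker.
    let _ = producer.flush(Duration::from_secs(5));
}
```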

Even in a 1-to-1 scenario, Kafka's message delivery semantics can bring some peace of mind, but we shall speak about them a little later.

However, since you are here, you are most likely more interested in patterns that do more interesting things than delivering data from point A to point B.

Kafka patterns - Stream pipeline

In this pattern, we are essentially modeling something from the pipes and filters pattern. Components are encouraged to take the data they received from the topic, process it, and push the processed results to another topic, from whence it may either be stored somewhere or processed further, but that is no longer your problem as that particular cog of the machine.

Here, consumers are also producers, but care must be taken not to push messages to the same topic that you are reading from, or you are gonna have a bad time, even if the formats happen to be compatible.

This pattern encourages stream based thinking, and should be at least somewhat familiar to you, if you have ever played the Czech videogame Factorio.

Using Kafka to create this pipes-and-filters-like architecture has the benefit of letting you easily scale every component of the pipeline horizontally to meet the demands of your use case and the number of your users.
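A single filter stage of such a pipeline could be sketched like this, assuming the rust-rdkafka crate; the topic names and the "rounding" transformation are made up for illustration. The component consumes from one topic, transforms each message, and produces the result to a different topic.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::producer::{BaseProducer, BaseRecord};
use rdkafka::Message;
use std::time::Duration;

fn main() {
    // This stage plays the role of one filter: read raw temperatures,
    // round them, and republish them for the next stage.
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "rounding-filter")
        .create()
        .expect("consumer creation failed");

    let producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .create()
        .expect("producer creation failed");

    // Read from one topic, write to a *different* one.
    consumer
        .subscribe(&["raw-temperatures"])
        .expect("could not subscribe");

    loop {
        if let Some(Ok(msg)) = consumer.poll(Duration::from_millis(500)) {
            if let Some(Ok(text)) = msg.payload_view::<str>() {
                if let Ok(value) = text.parse::<f64>() {
                    let rounded = format!("{value:.1}");
                    let _ = producer.send(BaseRecord::to("rounded-temperatures").payload(&rounded));
                }
            }
        }
        // Let the producer service its delivery callbacks.
        producer.poll(Duration::from_millis(0));
    }
}
```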

In the following depiction, Kafka appears multiple times; this is a limitation of the chart DSL used, so just imagine that it is all one Kafka cluster:

[Figure: stream processing — producer → Kafka → consumer (filter) → Kafka → consumer (filter) → Kafka → consumer (sink)]

Kafka patterns - Many to Many

While this pattern should already make sense and be fairly clear, let's mention it anyway. The most common way to use Kafka is to scale out our producers and consumers, not to mention provide them with more durability and decoupling, as producers no longer have to care about the consumers' existence or numbers, and vice versa.

[Figure: many-to-many — multiple producers and multiple consumers communicating through Kafka]

Message delivery semantics

An important aspect of message delivery is the semantics, or rather the guarantees, that we can expect. Kafka provides three different models of message delivery:

  • At most once
  • At least once
  • Exactly once

Depending on which semantics we use, we can expect different things.

At most once

This strategy is also known as best-effort. Here, the producer sends messages to the broker without waiting for an acknowledgement, and if the broker is unreachable or the message is lost for one reason or another, there will be no attempt to resend it. In other words, the message will be delivered either once, in the best-case scenario, or not at all.

This is useful when the data isn't critical, but the throughput may be, and when we care more about progress than about a complete result. For example, if you have a service that registers clicks for creating a heat map of your website usage, you probably don't care if you miss one or two clicks every now and then. The same can be applied to other types of tracking, and even IoT sensors. You may be a very special weather station that just doesn't feel bad about missing some data.

This strategy should definitely not be used for data you really want to keep.
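On the producer side, this roughly corresponds to turning acknowledgements and retries off. A hedged sketch of the relevant configuration, assuming the rust-rdkafka crate:

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::BaseProducer;

fn main() {
    // "At most once" from the producer's side: don't wait for any broker
    // acknowledgement and never retry, trading durability for throughput.
    let _producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("acks", "0")    // fire the message off without waiting for an ACK
        .set("retries", "0") // and never attempt to resend it
        .create()
        .expect("producer creation failed");
}
```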

At least once

Here, the producer sends a message to the broker and expects an ACK (acknowledgement) to make sure that the message has been received successfully and added to the topic. If no acknowledgement is received before a timeout is reached, whether due to network latency or any other issue along the way, the producer will retry sending the message, acting under the assumption that the previous attempt was not received.

This ensures that our messages get through, but the downside is that it may lead to duplicates. If your use case doesn't have a problem with that, then this is a good strategy to use. For example, if you had a message "update user X's name to Y", then you would not mind if the message is processed twice: the end result is the same and the user will not notice anything.
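With rust-rdkafka, an at-least-once-leaning producer configuration might look like the following sketch: it waits for acknowledgements and retries when they don't arrive, which is exactly where the duplicates can come from.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::BaseProducer;

fn main() {
    // "At least once": wait until the broker acknowledges the write,
    // and retry if the ACK doesn't arrive in time.
    let _producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("acks", "all")             // require acknowledgement before a send counts as done
        .set("retries", "5")            // resend when no ACK arrives before the timeout
        .set("retry.backoff.ms", "200") // wait a little between attempts
        .create()
        .expect("producer creation failed");
}
```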

However, if you were processing financial transactions, you really do not want to double spend funds, or create invalid states by duplicating transactions, that would be quite bad.

In those cases, we must rely on the Exactly once strategy.

Exactly once

The final strategy essentially does what it says on the tin: every message is delivered precisely once. It offers an end-to-end exactly-once guarantee for read-process-write tasks such as stream processing applications. To support this guarantee, Kafka utilizes two features:

  • Idempotent delivery - this allows a producer to send a message exactly once; duplicated messages belonging to the same producer will be ignored by the broker (it tracks a producer id and sequence numbers to detect them)
  • Transactional delivery - producers can send data to multiple partitions in an atomic way, which means that either all events are successfully delivered, or none of them are. We spoke about the atomicity of transactions in the storage chapters; feel free to go refresh yourself on ACID

Keep in mind that exactly-once is a guarantee on the part of Kafka only. This covers consuming events, updating state stores, and producing events (messages). If you are trying to update state that is not part of Kafka, like a row in a database, or you make an API call, the guarantee will of course be weaker. We have already established that doing things exactly once is a rather complicated problem in computing.
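A minimal sketch of an idempotent, transactional producer with the rust-rdkafka crate is shown below; the topic names and the transactional id are made up for illustration. Consumers that should only see committed messages would additionally set isolation.level to read_committed.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::{BaseProducer, BaseRecord, Producer};
use std::time::Duration;

fn main() {
    // Idempotent, transactional producer: the broker de-duplicates retries,
    // and everything between begin and commit becomes visible atomically.
    let producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("enable.idempotence", "true")
        .set("transactional.id", "averager-1") // hypothetical id, unique per producer instance
        .create()
        .expect("producer creation failed");

    producer
        .init_transactions(Duration::from_secs(10))
        .expect("could not initialize transactions");
    producer.begin_transaction().expect("could not begin transaction");

    // Writes to several topics/partitions either all land, or none of them do.
    let _ = producer.send(BaseRecord::to("averages").payload("21.4"));
    let _ = producer.send(BaseRecord::to("minmax").payload("18.0,24.9"));

    producer
        .commit_transaction(Duration::from_secs(10))
        .expect("could not commit transaction");
}
```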

The Task

Take the producer and consumer from the previous chapter, and try scaling them horizontally.