The Domain of Communication and Storage
“But I’m not guilty,” said K. “There’s been a mistake. How is it even possible for someone to be guilty? We’re all human beings here, one like the other.” “That is true,” said the priest, “but that is how the guilty speak.”
-- Franz Kafka, Der Prozess
As the demands on applications increase, we must start thinking about how to scale properly. The algorithm, the idea, or the user-facing functionality is no longer the only thing that matters; we must also pay heed to what goes on behind the curtain.
Scaling is a ubiquitous issue that affects all industries, not least the IT industry, and at every step of the way. We have to scale our companies, our services, our hardware, our operations, our audience, and whatever else may be necessary. What are minutiae in the beginning start to matter as time goes on.
In software development, we talk about two types of scaling: horizontal and vertical. Vertical scaling, also known as scaling up, entails adding more resources to the system(s) running your application. In practice, this means using a more performant server, adding RAM or storage, or upgrading the server's network connection. From a developer's perspective, vertical scaling is easier, and may even require no action from the developer at all. Of course, exceptions exist: to benefit from a stronger CPU with more cores, for example, you have to write your software in a multi-threaded manner, otherwise the extra cores are useless.
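To make that caveat concrete, here is a minimal Python sketch (the `count_primes` helper and the chunk sizes are invented for illustration): a CPU-bound task only benefits from extra cores when the work is explicitly spread across them, here via `multiprocessing.Pool`.

```python
import multiprocessing as mp

def count_primes(bounds):
    """CPU-bound work: count the primes in [lo, hi)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    # Split the range into chunks and let a pool (one worker per core
    # by default) crunch them in parallel; without this, a single-threaded
    # loop would leave the extra cores idle.
    chunks = [(i, i + 25_000) for i in range(0, 100_000, 25_000)]
    with mp.Pool() as pool:
        total = sum(pool.map(count_primes, chunks))
    print(total)
```

The sequential version is the same code minus the pool; the upgrade to a bigger CPU only pays off because the program was written to exploit it.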
p.1: Vertical scaling
Horizontal scaling, also known as scaling out, is the practice of adding more nodes to your infrastructure to cope with increasing customer/user demand. This means running the application on more machines, with some mechanism in place to distribute the load. Oftentimes, to be able to scale horizontally, the application has to be written in a way that supports it. There are two fundamental ways of going about horizontal scaling, which may both be utilized at once (and often are):
- Running the same binary on multiple systems with a load balancer / load distributor
- Splitting the application into services, each performing a different part of the total functionality of the application.
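As a rough sketch of the first approach, this is what naive round-robin load distribution could look like (the class and the backend names are made up for illustration; a real deployment would use a dedicated load balancer rather than hand-rolled code):

```python
import itertools

class RoundRobinBalancer:
    """Hand each incoming request to the next backend in a fixed rotation."""

    def __init__(self, backends):
        self._backends = itertools.cycle(backends)

    def route(self, request):
        # Pick the next node in the cycle; all nodes run the same binary,
        # so any of them can serve any request.
        backend = next(self._backends)
        return backend, request

balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
assignments = [balancer.route(f"req-{i}")[0] for i in range(6)]
print(assignments)  # every node receives every third request
```

Note the hidden assumption: round-robin only works if the nodes are interchangeable, i.e. the application keeps no per-request state on a single machine.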
If you have ever heard of the monolithic vs. microservice architecture debate, which has become an especially popular topic in recent years, this practice shouldn't be entirely new to you.
p.2: Horizontal scaling
However, as we transform our applications from lonely monoliths into bustling communities of (micro)services, new problems arise that we, as developers, need to deal with in order to even be able to undertake the endeavor of horizontal scaling.
Communication between components wasn't a problem originally. Everything lived in the same binary; yes, you had some internal APIs and distinct modules or namespaces within your program, but all of it lived under the same roof. Communication was not an issue: you would just call your methods and functions, use your types, and so on. The problem was mostly designing internal APIs that were reasonable. In more advanced cases, things like binding to libraries dynamically, language interoperability, and calling conventions were encountered, but most applications can do without them, and if they arise in a library you depend on, the author has likely already solved them, so they are not an issue for you to deal with.
However, as we split the components of our application system into actual separate components, which may not be (and in production in fact rarely are) running on the same machine, we have to worry about how to communicate information between these components in a manner that is:
- safe
- efficient
- effective
It also helps if the mechanism is general enough (as opposed to a home-grown, ad hoc scheme made up for a particular set of programming languages and components), so that you can fully leverage the potential benefits of the microservice architecture, such as not caring about concepts like ABIs, architectures, and programming languages. This is not a trivial task, and even if you are convinced otherwise and set out to write your own binary protocol for serialization and communication between your services, remember these words when you, e.g., forget that endianness is an issue and data gets corrupted or outright lost.
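The endianness trap is easy to demonstrate with Python's standard `struct` module (the value below is arbitrary):

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)  # b'\x78\x56\x34\x12'
big = struct.pack(">I", value)     # b'\x12\x34\x56\x78'
assert little != big  # same number, different bytes on the wire

# A reader assuming the wrong byte order silently corrupts the value:
misread = struct.unpack(">I", little)[0]
assert misread == 0x78563412

# Agreeing on one explicit order ("!" = network order) avoids the trap:
assert struct.unpack("!I", struct.pack("!I", value))[0] == value
```

Established serialization formats and RPC frameworks settle questions like this once, for everyone, which is precisely why they are worth using.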
This task is better left to the experts whose job it is to design and develop these protocols; the less you have to worry about it, the more you can focus on delivering the product you are developing in a timely and effective manner. Very importantly, it is also a delegation of responsibility to the respective parties, and slightly more peace of mind for you as a developer: if there is a bug in gRPC, it's not your fault.
Multiple paradigms and patterns of communication have been conjured up over the last decades, each with its own use cases, pros, and cons. Selecting which pattern and which technology to develop your application with is an important decision when scaling horizontally.
A related problem is that of storage. At the inception of an application, when things such as scaling or data safety are not yet a worry, the selection is hardly restricted at all. Small applications may even get away with storing their data in plain files. However, as your service grows, you have to start worrying about three things:
- The amount
- Effective access
- Safety in terms of preservation and concurrent access
(interesting how the problem domains of communication and storage overlap, isn't it?)
It is real trouble if two services writing to the same destination can corrupt the data, or if the failure of a single disk (or, more generally, a single node) can put you out of business. If you store your data in text-based formats like JSON, you may also learn that the data you are storing swells up in file size.
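A quick way to see the swell for yourself, sketched in Python (the record layout is made up: 1,000 fake sensor readings serialized as JSON versus a fixed-width binary encoding):

```python
import json
import struct

# 1,000 fake (id, value) sensor readings -- an invented record layout.
readings = [(i, i * 0.5) for i in range(1000)]

# Text-based: every digit, key name, and brace costs bytes.
as_json = json.dumps(
    [{"sensor_id": i, "value": v} for i, v in readings]
).encode()

# Fixed-width binary: 4-byte int + 8-byte double per record.
as_binary = b"".join(struct.pack("<Id", i, v) for i, v in readings)

print(len(as_json), len(as_binary))
assert len(as_binary) < len(as_json)
```

The binary form here is a flat 12 bytes per record, while the JSON form repeats the field names in every single record.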
Effective concurrent access also becomes a critical topic. If the storage is not merely a final destination from which data is idly retrieved, but also a medium from which data is constantly pulled for further processing only to be put back in, you may run into issues with database responsiveness and performance degradation due to too many simultaneous connections.
In these situations, whether we like it or not, the database becomes a medium of communication. We want to be able to pass some sort of messages to a different part of the system, but those messages carry information of such value that we need to make them persistent. It may also be important to be able to replay these messages.
In the last paragraphs, I have been subtly coercing us to start thinking in terms of events, not objects. These events, persistent for all intents and purposes, lead us to think differently about service communication and application structure: in terms of messages, queues, and events triggering other events.
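A minimal sketch of this event-first mindset in Python (the `EventLog` class and its file format are invented here; Kafka's log is far more sophisticated, but the append-and-replay idea is the same):

```python
import json
import os
import tempfile

class EventLog:
    """Append-only log: events are persisted in order and can be replayed."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        # One JSON event per line; existing lines are never overwritten.
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")

    def replay(self, from_offset=0):
        """Yield (offset, event) pairs, starting at from_offset."""
        with open(self.path) as f:
            for offset, line in enumerate(f):
                if offset >= from_offset:
                    yield offset, json.loads(line)

log = EventLog(os.path.join(tempfile.mkdtemp(), "orders.log"))
log.append({"type": "order_placed", "id": 1})
log.append({"type": "order_shipped", "id": 1})

# Any consumer can re-read the full history, from any offset:
events = [e["type"] for _, e in log.replay()]
print(events)  # ['order_placed', 'order_shipped']
```

Notice that the log is simultaneously storage (the events persist) and communication (another service can consume them later, or again): the overlap from the earlier paragraphs, in a dozen lines.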
I am far from the first person to notice that there is a useful overlap between communication and storage, and in the following chapters, I hope to bridge these domains through a single technology: Apache Kafka.

This Kafka cycle, the beginning of which you are currently reading, is slightly different from the majority of Braiins university in that we shall explore the theoretical considerations that lead to deciding to use Kafka, and hopefully these texts will inspire you to use it properly.
We shall examine both of these perspectives, starting with storage paradigms and then moving on to communication patterns.