Instaclustr: 9 tips to improve Apache Kafka management

This is a guest post for the Computer Weekly Open Source Inside blog written by Ben Slater in his role as chief product officer at Instaclustr.

Instaclustr is known for its focus on providing the Cassandra database as a managed service in the cloud; the company is equally known for its work providing Apache Kafka, Apache Spark and Elasticsearch.

Slater writes on the subject of improving Apache Kafka management. Apache Kafka is an open source ‘stream processing’ software platform developed by the Apache Software Foundation and written in Scala and Java; it handles trillions of events every day.

Stream processing is useful in areas such as massively multiplayer online gaming and in more enterprise-level forms of high-volume data processing, such as trading and Internet of Things device log file processing.

Slater writes as follows…

Like the works of the famed novelist whose name it bears, Apache Kafka is fairly easy to get started with – but understanding its deeper nuances and capturing all that it has to offer can be a heck of a challenge.

Here are nine tips for making sure your Kafka deployment remains simple to manage and fully optimized:

1) Configure logs to keep them from getting out of hand.

log.segment.bytes, log.roll.ms and log.cleanup.policy (or their topic-level equivalents segment.bytes, segment.ms and cleanup.policy) are the parameters that allow you to control log behaviour. For example, if you have no need for past logs you can set cleanup.policy to “delete”, so that Kafka eliminates log files after a set time period or once they reach a pre-determined size. Alternatively, you can set the policy to “compact” so that Kafka retains at least the latest record for each message key, tailoring the parameters to fit your use case as needed.
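As a rough illustration of the two policies, the sketch below builds a set of topic-level overrides and prints them in key=value form. The config names (cleanup.policy, retention.ms, retention.bytes, segment.bytes) are standard Kafka topic settings; the helper functions and the example values are assumptions made for illustration, not recommendations.

```python
from typing import Dict, Optional


def topic_log_config(policy: str,
                     retention_ms: Optional[int] = None,
                     retention_bytes: Optional[int] = None,
                     segment_bytes: Optional[int] = None) -> Dict[str, str]:
    """Hypothetical helper: build topic-level log overrides for one cleanup policy."""
    if policy not in ("delete", "compact"):
        raise ValueError(f"unsupported cleanup.policy: {policy!r}")
    config = {"cleanup.policy": policy}
    # This sketch only attaches time/size retention limits to the "delete" policy.
    if policy == "delete":
        if retention_ms is not None:
            config["retention.ms"] = str(retention_ms)
        if retention_bytes is not None:
            config["retention.bytes"] = str(retention_bytes)
    if segment_bytes is not None:
        config["segment.bytes"] = str(segment_bytes)
    return config


def to_properties(config: Dict[str, str]) -> str:
    """Render the overrides as sorted key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000  # illustrative retention window

delete_cfg = topic_log_config("delete",
                              retention_ms=SEVEN_DAYS_MS,
                              segment_bytes=512 * 1024 * 1024)
compact_cfg = topic_log_config("compact")

print(to_properties(delete_cfg))
print(to_properties(compact_cfg))
```

The resulting pairs could then be applied with whichever admin tool or client you already use for topic configuration.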

2) Understand Kafka’s hardware needs.

Because Kafka is designed for horizontal scaling and doesn’t require a great deal of resources, you can run successful deployments while using affordable commodity hardware. Here’s a breakdown:

Kafka doesn’t require a powerful CPU, except when SSL and log compression are needed.
6 GB of RAM, used for heap space, allows Kafka to run optimally in most use cases. More is often helpful to assist with disk caching.
When it comes to the disk, non-SSD drives are often suitable due to Kafka’s typical sequential access pattern.

3) Make the most of Apache ZooKeeper.

Be sure to cap the number of Apache ZooKeeper nodes at five or fewer. ZooKeeper also pairs well with strong network bandwidth. In pursuing minimal latency, use optimal disks with logs stored elsewhere, isolate the ZooKeeper process with swap disabled, and monitor latency closely.
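One way to reason about ensemble size: ZooKeeper stays available only while a majority of its nodes are up, so the number of node failures an ensemble can survive follows from simple arithmetic. The function below is a minimal sketch of that calculation; its name is hypothetical.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Nodes that can fail while a strict majority of the ensemble remains."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return (ensemble_size - 1) // 2


for size in (1, 3, 4, 5):
    print(f"{size} nodes -> survives {tolerated_failures(size)} failure(s)")
```

Note that an even-sized ensemble tolerates no more failures than the odd size below it, which is why three or five nodes are the usual choices.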

4) Be smart in establishing replication & redundancy.

Kafka’s resilience depends on building in redundancy and reliability before disaster strikes. For example, a replication factor of three is advisable in most production deployments; the broker default (default.replication.factor) is only one.

5) Be careful with topic configurations.

Set topic configurations properly in the first place, and create a new topic if changes do become necessary.

6) Take advantage of parallel processing.

More partitions mean greater parallelisation and throughput, but also extra replication latency, rebalances, and open server files. Safely estimated, a single partition on a single topic can deliver 10 MB/s (the reality is more favourable); using this baseline you can determine the targeted total throughput for your system.
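The baseline above lends itself to a quick back-of-the-envelope calculation. The sketch below assumes the conservative 10 MB/s per-partition figure from the text; the function name and the 250 MB/s target are made up for illustration.

```python
import math


def partitions_needed(target_mb_per_s: float,
                      per_partition_mb_per_s: float = 10.0) -> int:
    """Minimum partition count to reach a target throughput at a given per-partition rate."""
    if target_mb_per_s <= 0 or per_partition_mb_per_s <= 0:
        raise ValueError("throughput values must be positive")
    return math.ceil(target_mb_per_s / per_partition_mb_per_s)


# Illustrative target: 250 MB/s across the whole topic.
print(partitions_needed(250))
```

Because the per-partition figure is deliberately conservative, the result is best treated as a starting point to validate with load testing rather than a final number.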

7) Secure Kafka through proper configuration & isolation.

The 0.9 release of Kafka added an array of useful security features, including support for authentication both between Kafka and clients and between Kafka and ZooKeeper. It also added support for TLS encryption, which is a key security precaution for systems with clients directly connecting from the public internet.

8) Set a high ulimit to avoid outages.

Setting your ulimit configuration is pretty straightforward: set a hard limit of 128,000 or higher for the maximum number of open files allowed for the user that runs Kafka (on most Linux systems this per-user limit lives in /etc/security/limits.conf, while /etc/sysctl.conf holds the system-wide fs.file-max ceiling), then restart. Doing so avoids the all-too-common scenario of experiencing what looks like a load issue with brokers going down, but is actually a simple “too many open files” error.
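To confirm that a new limit has actually taken effect, it helps to inspect it from inside a process. The sketch below uses Python's standard resource module (Unix only) to read the open-file limits of the current process; it is an assumption that you run it as the same user, and in the same kind of session, as the Kafka broker. The 128,000 threshold comes from the text and the helper name is hypothetical.

```python
import resource

MIN_OPEN_FILES = 128_000  # threshold suggested in the text


def limit_is_sufficient(limit: int, minimum: int = MIN_OPEN_FILES) -> bool:
    """True when the limit is unlimited or at least the required minimum."""
    return limit == resource.RLIM_INFINITY or limit >= minimum


# Soft and hard open-file limits for *this* process, not for the broker itself.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")
print("hard limit OK" if limit_is_sufficient(hard) else "hard limit too low")
```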

9) Utilise effective monitoring & alerts.

Kafka’s two key areas to monitor are 1) system metrics and 2) JVM stats. Monitoring system metrics means tracking open file handles, network throughput, load, memory, disk usage, and more. For JVM stats, be aware of any GC pauses and heap usage. Informative history tools and dashboards for swift debugging are your friends here.
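A minimal sketch of how such checks might be wired together is shown below. Everything in it is an assumption made for illustration: the snapshot fields, the threshold values and the alert wording are hypothetical, and in practice the numbers would come from your metrics system rather than being hard-coded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BrokerSnapshot:
    """Hypothetical point-in-time view of one broker's system and JVM metrics."""
    open_file_handles: int
    open_file_limit: int
    disk_used_fraction: float
    heap_used_fraction: float
    longest_gc_pause_ms: float


def check_snapshot(snap: BrokerSnapshot,
                   file_handle_ratio: float = 0.8,
                   disk_ratio: float = 0.85,
                   heap_ratio: float = 0.9,
                   gc_pause_ms: float = 1000.0) -> List[str]:
    """Return one alert message per threshold that the snapshot crosses."""
    alerts = []
    if snap.open_file_handles >= file_handle_ratio * snap.open_file_limit:
        alerts.append("open file handles near limit")
    if snap.disk_used_fraction >= disk_ratio:
        alerts.append("disk usage high")
    if snap.heap_used_fraction >= heap_ratio:
        alerts.append("JVM heap usage high")
    if snap.longest_gc_pause_ms >= gc_pause_ms:
        alerts.append("long GC pause")
    return alerts


healthy = BrokerSnapshot(20_000, 128_000, 0.40, 0.55, 120.0)
stressed = BrokerSnapshot(120_000, 128_000, 0.60, 0.95, 2500.0)

print(check_snapshot(healthy))
print(check_snapshot(stressed))
```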
