The Apache Kafka agent streams records into Anodot and is one of our most commonly used agents. This article provides an overview of the key elements involved in setting up your Kafka agent, including:
Config file settings
There are a number of source config properties you can set, including Kafka version, Kafka broker connection string, Topic list, and Number of threads. The following source config properties are mandatory:
- type (string)
- name (string)
- config (object)
Note: The following Kafka versions are supported: 0.10, 0.11, 1.0+, 2.0+
There are also a number of pipeline config and pipeline file config properties you can set, including Pipeline ID, Measurement name, and Timestamp config.
For a complete list of the properties you can set for these config files, as well as code samples, see Config List.
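Below is a minimal sketch of what a Kafka source config could look like, written as a Python dict and dumped to JSON. Only the three mandatory top-level properties (type, name, config) come from this article; the key names inside "config" (broker connection string, topic list, and so on) are illustrative placeholders, so check Config List for the exact property names.

```python
import json

# Hypothetical Kafka source config. The top-level properties
# type, name, and config are mandatory; the keys inside "config"
# are placeholders -- see Config List for the exact names.
source_config = [
    {
        "type": "kafka",                      # source type
        "name": "my_kafka_source",            # unique source name
        "config": {
            "version": "2.0+",                # supported: 0.10, 0.11, 1.0+, 2.0+
            "broker_connection_string": "broker1:9092,broker2:9092",
            "topic_list": ["metrics-topic"],
            "number_of_threads": 2,
        },
    }
]

print(json.dumps(source_config, indent=2))
```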
Authentication
There are two main methods of authentication: SSL/SASL and Kerberos. For further information and examples, see Kafka authentication.
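As a rough illustration of the SSL/SASL option, here is a sketch of a consumer connecting over SASL_SSL using the kafka-python client. The broker address, topic, credentials, and certificate path are placeholders, and the agent itself takes these values through its config rather than through this client; see Kafka authentication for the agent-specific settings.

```python
from kafka import KafkaConsumer

# Minimal SASL_SSL connection sketch with kafka-python.
# For Kerberos, the SASL mechanism would be GSSAPI instead of PLAIN.
consumer = KafkaConsumer(
    "metrics-topic",
    bootstrap_servers="broker1:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",          # or SCRAM-SHA-256 / SCRAM-SHA-512
    sasl_plain_username="agent-user",
    sasl_plain_password="agent-password",
    ssl_cafile="/etc/kafka/ca.pem",  # CA certificate used to verify the broker
)

for record in consumer:
    print(record.topic, record.partition, record.value)
```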
Load distribution
A single pipeline can consume about 1,000 events per second (EPS). If your Kafka topic carries more than 1,000 EPS, you need to split the load between pipelines so data reaches Anodot in real time. You can do that by correctly partitioning the data inside the topic, as described in Distributing the load.
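One way to partition the topic correctly is to key each record by its measurement and dimension values on the producer side, so every record of the same series lands in the same partition. The sketch below uses kafka-python, whose default partitioner hashes the record key; the topic and field names are illustrative.

```python
import json
from kafka import KafkaProducer

# Key records by measurement + dimensions so that identical series
# always hash to the same partition and stay ordered within it.
producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

record = {"measurement": "cpu_usage", "host": "web-01", "region": "us-east", "value": 73.2}

key = "|".join([record["measurement"], record["host"], record["region"]])
producer.send("metrics-topic", key=key, value=record)
producer.flush()
```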
Filtering
To learn more about filtering conditions, which can consist of multiple expressions, see Filtering and Transformations.
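To give a feel for a condition made of multiple expressions, here is a hypothetical illustration in Python; the agent's actual filter syntax is documented in Filtering and Transformations, and the field names here are made up.

```python
# Hypothetical filter: keep only production records with a positive value.
def keep(record: dict) -> bool:
    return record.get("env") == "prod" and record.get("value", 0) > 0

records = [
    {"env": "prod", "value": 12.5},
    {"env": "staging", "value": 3.0},
    {"env": "prod", "value": -1.0},
]
print([r for r in records if keep(r)])   # -> [{'env': 'prod', 'value': 12.5}]
```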
Use cases
In addition to real-time data streaming to Anodot, we've collected some examples of specific Kafka use cases, including:
- Running counters: If the value of your metrics is an ever-increasing counter, specify `running_counter` as the target type.
- Dynamic metric name: When the metric name is stored in the data itself, choose a non-static `what` option when configuring the pipeline.
- Metrics in a JSON array: Add metrics that are stored in a JSON array.
For more details, see our Use cases page.
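The two points below sketch the first two use cases with made-up sample data; they only illustrate the idea, and the actual pipeline settings are covered on the Use cases page.

```python
# Running counter: the raw values only ever increase, so the meaningful
# signal is the delta between consecutive samples, not the cumulative value.
samples = [100, 140, 155, 230]                      # counter read each interval
deltas = [b - a for a, b in zip(samples, samples[1:])]
print(deltas)                                        # -> [40, 15, 75]

# Dynamic metric name: the metric name ("what") lives inside the record,
# so the pipeline's "what" must point at that field instead of a static string.
record = {"metric_name": "cpu_usage", "host": "web-01", "value": 73.2}
what = record["metric_name"]                         # resolved per record
print(what)
```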
FAQ
What are the core Kafka capabilities that make this agent effective?
The agent is effective because data can be streamed in real time. Kafka also integrates with many other systems, so instead of building a separate collector for each system, we can use Kafka as a single source of data and have the other systems simply publish their data to Kafka.
What guidelines should we follow to ensure an efficient workflow for Kafka topics?
To enable ordered processing of the Kafka records, you need to make sure that:
- The number of partitions is greater than or equal to the number of threads, so that each thread handles one or more partitions, resulting in ordered handling of the records (a quick check is sketched after this list).
- The producers of a given combination of measurement and dimensions send such records to the same partition.
- You do not use the transformations feature because changing metrics after fetching them from Kafka may affect the ordering.
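As a quick sanity check for the first point, you can compare the topic's partition count against the agent's thread count, for example with kafka-python as below. The broker address, topic name, and thread count are placeholders.

```python
from kafka import KafkaConsumer

# Verify the topic has at least as many partitions as the agent's
# thread count, so each thread can own one or more partitions.
NUM_THREADS = 2
consumer = KafkaConsumer(bootstrap_servers="broker1:9092")
partitions = consumer.partitions_for_topic("metrics-topic") or set()
assert len(partitions) >= NUM_THREADS, (
    f"topic has {len(partitions)} partitions, fewer than {NUM_THREADS} threads"
)
```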
When should we NOT use the Kafka agent?
- When topics cannot be configured as described in the previous answer; this will likely cause data to arrive out of order.
- If real-time processing is not required, you can switch to other data sources that support aggregation queries, which can make data processing easier and reduce system resource usage.