How it works
Kafka Connect takes care of reading and deserialising the data before it is passed to the sink code. Any required data transformations can leverage Kafka Connect Single Message Transforms (SMTs) to adjust the data shape before it reaches the sink code.
Each Connect sink instance can read from one or more Kafka topics and publish to a single target table. To send data to more than one target, a separate connector has to be created per target.
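A connector is configured through standard Kafka Connect properties. The snippet below is an illustrative sketch expressed as a Scala map: the EMS-specific keys and the connector class name are assumptions based on typical configurations, so consult the connector's configuration reference for the exact names. It shows a single target table fed from two topics, plus a stock ReplaceField SMT that drops a field before the record reaches the sink code.

```scala
// Illustrative connector configuration as a map of Kafka Connect properties.
// The EMS-specific keys and the connector class name are assumptions for this sketch.
val connectorProps: Map[String, String] = Map(
  "name"            -> "ems-orders-sink",
  "connector.class" -> "com.celonis.kafka.connect.ems.sink.EmsSinkConnector", // assumed class name
  "topics"          -> "orders,order-items",          // one or more source topics
  "connect.ems.target.table" -> "ORDERS",             // a single target table per connector

  // Single Message Transform: drop a field before the record reaches the sink code
  "transforms"                      -> "dropInternal",
  "transforms.dropInternal.type"    -> "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.dropInternal.exclude" -> "internal_notes"
)
// To write to a second target table, a second connector instance would be created
// with its own topics and target-table settings.
```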
The connector leverages the EMS Continuous Push API, which requires the data to be uploaded as parquet files. Once in EMS, the files are processed on a time interval.
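As a rough illustration of what happens under the hood, the sketch below writes a couple of records to a local parquet file using parquet-avro and then uploads it over HTTP. The schema, file path, endpoint URL and authorization header are placeholders, and the exact request shape expected by the Continuous Push API is simplified here; it is not the connector's actual upload code.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.Paths
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// 1. Write accumulated records to a local parquet file (schema and values are placeholders).
val schema = SchemaBuilder.record("Order").fields()
  .requiredString("orderId")
  .requiredDouble("amount")
  .endRecord()

val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/orders-0.parquet"))
  .withSchema(schema)
  .build()

val row = new GenericData.Record(schema)
row.put("orderId", "o-1001")
row.put("amount", 42.0)
writer.write(row)
writer.close()

// 2. Upload the file to the EMS Continuous Push API. The URL and credential below are
//    placeholders; in the connector they come from the configuration.
val request = HttpRequest.newBuilder()
  .uri(URI.create("https://<ems-host>/continuous-push/api/v1/items")) // placeholder URL
  .header("Authorization", "AppKey <key>")                            // placeholder credential
  .POST(HttpRequest.BodyPublishers.ofFile(Paths.get("/tmp/orders-0.parquet")))
  .build()

val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
println(s"EMS responded with status ${response.statusCode()}")
```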
A Kafka topic represents a stream of data: an infinite sequence of records. To reduce network chattiness, the sink accumulates records into a local file for each topic-partition tuple involved. When the accumulated file reaches one of the configured thresholds (size, record count or time), it is uploaded to EMS. This introduces latency, which can be tuned according to needs. Once the file is uploaded, the process continues: a new file is created and records are appended to it.
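The flush decision can be pictured as follows. This is a minimal sketch; the names and thresholds are illustrative rather than the connector's actual configuration keys.

```scala
// Illustrative flush policy: a file is uploaded once any of the thresholds is reached.
final case class CommitPolicy(maxFileSizeBytes: Long, maxRecords: Long, maxIntervalMs: Long)

// State of the file currently being accumulated for one topic-partition.
final case class FileState(records: Long, sizeBytes: Long, firstWriteMs: Long)

def shouldFlush(state: FileState, policy: CommitPolicy, nowMs: Long): Boolean =
  state.sizeBytes >= policy.maxFileSizeBytes ||
    state.records >= policy.maxRecords ||
    (nowMs - state.firstWriteMs) >= policy.maxIntervalMs

// One accumulating file per topic-partition tuple.
val openFiles = scala.collection.mutable.Map.empty[(String, Int), FileState]
```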
Writing with exactly-once semantics is a challenge, given that there are writes across two systems: uploading the file and committing the Kafka offsets. In EMS, the data is processed as an UPSERT (an operation that inserts rows into a database table if they do not already exist, or updates them if they do). Because of this, the constraints can be relaxed to rely on eventual consistency: if a record is uploaded multiple times, the latest record data will eventually be stored.
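The sketch below illustrates why re-uploading the same records is safe under UPSERT semantics: rows are keyed by primary key, so applying the same batch more than once converges to the same final state. The types and the in-memory "table" are for illustration only.

```scala
// Illustrative only: duplicate uploads are harmless when the target applies an UPSERT,
// because rows are keyed by primary key and re-applying a batch leaves the table unchanged.
final case class Row(pk: String, payload: String)

def upsert(table: Map[String, Row], batch: Seq[Row]): Map[String, Row] =
  batch.foldLeft(table)((t, row) => t + (row.pk -> row))

val batch = Seq(Row("order-1", "v1"), Row("order-2", "v1"))
val once  = upsert(Map.empty, batch)
val twice = upsert(once, batch) // the same file uploaded a second time
assert(once == twice)           // eventual consistency: the latest record data wins
```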
A file written to EMS is considered “durable” in the sense that the data will eventually be processed and made available to the Process Query Language (PQL) engine. The sink code does not check the status of the EMS internal data-processing jobs for the uploaded files: a data job can at times take hours to complete, which does not play well with the continuous flow of data in Kafka and could easily trigger a connector rebalance. One of the main reasons an EMS internal job fails is data misalignment: the incoming file's data and schema are not aligned with the target table schema. At the moment there is no error reporting for this case; future releases will, however, address this gap.
Here is the logical flow for a Kafka message record within the connector:
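As a high-level summary, the flow can be sketched roughly as below. The object and method names are illustrative stand-ins, not the connector's actual internals.

```scala
import org.apache.kafka.connect.sink.SinkRecord

// Simplified sketch of the per-record flow described on this page.
object FlowSketch {
  def put(records: java.util.Collection[SinkRecord]): Unit =
    records.forEach { record =>
      // 1. The record arrives already deserialised and SMT-transformed by Kafka Connect.
      val parquetRecord = toParquetRecord(record)
      // 2. Append it to the local parquet file for this topic-partition.
      val file = fileFor(record.topic(), record.kafkaPartition())
      append(file, parquetRecord)
      // 3. When a size / record-count / time threshold is reached, upload and roll over.
      if (shouldFlush(file)) {
        uploadToEms(file) // EMS Continuous Push API
        rollOver(file)    // a new file is created and records keep being appended
      }
    }

  // Stubs so the sketch compiles; the real connector implements these steps.
  private def toParquetRecord(r: SinkRecord): AnyRef = r.value()
  private def fileFor(topic: String, partition: Int): String = s"$topic-$partition.parquet"
  private def append(file: String, rec: AnyRef): Unit = ()
  private def shouldFlush(file: String): Boolean = false
  private def uploadToEms(file: String): Unit = ()
  private def rollOver(file: String): Unit = ()
}
```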
Copyright © Celonis 2022