feat: Add Kafka Connect Cloud Bigtable sink connector #2466

prawilny · 2025-01-12T14:07:20Z

This PR adds Kafka Connect sink connector.

The code is to land in a different repository, but the repository hasn't been created yet, so we bring the code for early review here.
The fact that it's targetting another repo is the reason of the following:

modification of Github Actions CI so that it executes the sink's tests instead of adding it as a new maven submodule and plugging it into existing CI
the code is not a submodule in the root pom.xml nor does it use any information from outside of the directory
maven plugins' config is duplicated
there is another copy of the license in the directory

Things yet to be done (in future PRs):

Logical types support
More comprehensive integration tests
- they might include more detailed compatibility checks against the Confluent sink
Use of kokoro in CI

google-cla · 2025-01-12T14:07:25Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

generated-files-bot · 2025-01-12T14:07:30Z

Warning: This pull request is touching the following templated files:

.github/workflows/ci.yaml

brandtnewton

This is shaping up nicely! All my comments are pretty minor. I did not review the Integration tests yet. I'll get back to you soon on how to handle logical types.

brandtnewton · 2025-01-15T14:47:41Z

.../src/main/java/com/google/cloud/kafka/connect/bigtable/autocreate/BigtableSchemaManager.java

+   *       invalid, it is assumed that input {@link SinkRecord SinkRecord(s)} map to invalid values,
+   *       so all the {@link SinkRecord SinkRecord(s)} needing the resource whose creation failed
+   *       are returned.
+   *   <li>Other resource creation errors are logged.


Please elaborate more on the exception handling logic

Done. Is it appropriate now?

brandtnewton · 2025-01-15T14:48:28Z

.../src/main/java/com/google/cloud/kafka/connect/bigtable/autocreate/BigtableSchemaManager.java

+      Map<Fut, ResourceAndRecords<Id>> createdColumnFamilyFuturesAndRecords,
+      String errorMessageTemplate) {
+    Set<SinkRecord> dataErrors = new HashSet<>();
+    createdColumnFamilyFuturesAndRecords.forEach(


Rename this variable - these aren't necessarily column family resources

brandtnewton · 2025-01-15T15:46:13Z

...-connect-sink/src/main/java/com/google/cloud/kafka/connect/bigtable/mapping/ValueMapper.java

+      for (Map.Entry<Object, Object> field : getChildren(rootKafkaValue)) {
+        String kafkaFieldName = field.getKey().toString();
+        Object kafkaFieldValue = field.getValue();
+        if (kafkaFieldValue == null && nullMode == NullValueMode.IGNORE) {


Is this going to be the expected behavior, or will the user only expect the nullMode to effect root level fields

I think it should apply to all the values (including the nested ones).
Rationale:

we need NullValueMode.WRITE to behave like the Confluent's sink and it treats it (within values) as if it were empty byte arrays on all the nesting levels.

We got Gary's approval for introduction of this configuration arguing that nulls causing deletes by default would be a footgun for users migrating from the Confluent's sink. Do you agree or should we rethink this idea?

we need NullValue.DELETE to behave as described in the design doc, which means deletion on all the levels.

Alternatively, we could introduce per-nesting-level configuration, but it seems excessive to me. What do you think?

I tweaked the docstring a bit to make it a bit clearer that this config affects more than only root null.

brandtnewton · 2025-01-15T15:46:29Z

...-connect-sink/src/main/java/com/google/cloud/kafka/connect/bigtable/mapping/ValueMapper.java

+            ByteString kafkaSubfieldName =
+                ByteString.copyFrom(subfield.getKey().toString().getBytes(StandardCharsets.UTF_8));
+            Object kafkaSubfieldValue = subfield.getValue();
+            if (kafkaSubfieldValue == null && nullMode == NullValueMode.IGNORE) {


(same as above) Is this going to be the expected behavior, or will the user only expect the nullMode to effect root level fields

I think it should apply to all the values (including the nested ones).
Rationale:

we need NullValueMode.WRITE to behave like the Confluent's sink and it treats it (within values) as if it were empty byte arrays on all the nesting levels.

We got Gary's approval for introduction of this configuration arguing that nulls causing deletes by default would be a footgun for users migrating from the Confluent's sink. Do you agree or should we rethink this idea?

we need NullValue.DELETE to behave as described in the design doc, which means deletion on all the levels.

Alternatively, we could introduce per-nesting-level configuration, but it seems excessive to me. What do you think?

I tweaked the docstring a bit to make it a bit clearer that this config affects more than only root null.

brandtnewton · 2025-01-15T16:01:07Z

...-connect-sink/src/main/java/com/google/cloud/kafka/connect/bigtable/mapping/ValueMapper.java

+        }
+      }
+    } else {
+      if (defaultColumnFamily != null && defaultColumnQualifier != null) {


I hadn't considered this use case. Good catch!

FYI we talked with Gary and he approved this additional config.

brandtnewton · 2025-01-22T18:56:24Z