feat: Add Kafka Connect Cloud Bigtable sink connector #2466

Open · wants to merge 14 commits into main from kafka_connect_bigtable_sink
Conversation

prawilny

This PR adds a Kafka Connect sink connector for Cloud Bigtable.

The code is to land in a different repository, but that repository hasn't been created yet, so we're bringing the code here for early review.
The fact that it's targeting another repo is the reason for the following:

  • the GitHub Actions CI is modified to execute the sink's tests, rather than adding the sink as a new Maven submodule plugged into the existing CI
  • the code is not a submodule in the root pom.xml, nor does it use any information from outside its directory
  • the Maven plugins' configuration is duplicated
  • there is another copy of the license in the directory

Things yet to be done (in future PRs):

  • Logical type support
  • More comprehensive integration tests
    • these might include more detailed compatibility checks against the Confluent sink
  • Use of Kokoro in CI

@prawilny prawilny requested review from a team as code owners January 12, 2025 14:07

google-cla bot commented Jan 12, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@product-auto-label bot added the labels size: xl (Pull request size is extra large) and api: bigtable (Issues related to the googleapis/java-bigtable API) on Jan 12, 2025

Warning: This pull request is touching the following templated files:

  • .github/workflows/ci.yaml

@prawilny force-pushed the kafka_connect_bigtable_sink branch from e2f0361 to 40ae074 on January 13, 2025 11:20

@brandtnewton left a comment


This is shaping up nicely! All my comments are pretty minor. I did not review the Integration tests yet. I'll get back to you soon on how to handle logical types.

```java
 * invalid, it is assumed that input {@link SinkRecord SinkRecord(s)} map to invalid values,
 * so all the {@link SinkRecord SinkRecord(s)} needing the resource whose creation failed
 * are returned.
 * <li>Other resource creation errors are logged.
```


Please elaborate more on the exception handling logic

prawilny (Author)


Done. Is it appropriate now?
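
For reference, the behavior described in the docstring is roughly the following. This is only a minimal sketch: the `ResourceAndRecords` holder and the `isInvalidRequest` check are illustrative assumptions, not the connector's actual API.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import org.apache.kafka.connect.sink.SinkRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ResourceCreationErrorSketch {
  private static final Logger logger = LoggerFactory.getLogger(ResourceCreationErrorSketch.class);

  // Hypothetical holder pairing a resource id with the records that need it.
  record ResourceAndRecords<Id>(Id resource, Set<SinkRecord> records) {}

  static <Id> Set<SinkRecord> collectDataErrors(
      Map<Future<Void>, ResourceAndRecords<Id>> creationFuturesAndRecords) {
    Set<SinkRecord> dataErrors = new HashSet<>();
    creationFuturesAndRecords.forEach(
        (future, resourceAndRecords) -> {
          try {
            future.get();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } catch (ExecutionException e) {
            if (isInvalidRequest(e.getCause())) {
              // The resource could not possibly be created, so every record
              // that needed it is treated as an unrecoverable data error.
              dataErrors.addAll(resourceAndRecords.records());
            } else {
              // Any other creation failure is only logged; the records are
              // not reported as data errors here.
              logger.warn("Failed to create resource " + resourceAndRecords.resource(), e.getCause());
            }
          }
        });
    return dataErrors;
  }

  // Hypothetical classification of "invalid request" failures.
  static boolean isInvalidRequest(Throwable t) {
    return t instanceof IllegalArgumentException;
  }
}
```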

```java
Map<Fut, ResourceAndRecords<Id>> createdColumnFamilyFuturesAndRecords,
String errorMessageTemplate) {
Set<SinkRecord> dataErrors = new HashSet<>();
createdColumnFamilyFuturesAndRecords.forEach(
```


Rename this variable - these aren't necessarily column family resources

prawilny (Author)


Done.

```java
for (Map.Entry<Object, Object> field : getChildren(rootKafkaValue)) {
  String kafkaFieldName = field.getKey().toString();
  Object kafkaFieldValue = field.getValue();
  if (kafkaFieldValue == null && nullMode == NullValueMode.IGNORE) {
```


Is this going to be the expected behavior, or will the user only expect the nullMode to affect root-level fields?

prawilny (Author)


I think it should apply to all the values (including the nested ones).
Rationale:

  • we need NullValueMode.WRITE to behave like the Confluent sink, which treats nulls (within values) as empty byte arrays at all nesting levels.
    • We got Gary's approval for introducing this configuration, arguing that nulls causing deletes by default would be a footgun for users migrating from the Confluent sink. Do you agree, or should we rethink this idea?
  • we need NullValueMode.DELETE to behave as described in the design doc, which means deletion at all levels.

Alternatively, we could introduce per-nesting-level configuration, but that seems excessive to me. What do you think?

I tweaked the docstring a bit to make it clearer that this config affects more than only the root null.
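
To make the discussion concrete, here is a minimal sketch of the three modes applying uniformly at every nesting level. `NullValueMode` is the configuration from the PR; the surrounding method and the write/delete helpers are illustrative assumptions only.

```java
import java.nio.charset.StandardCharsets;

enum NullValueMode { IGNORE, WRITE, DELETE }

class NullHandlingSketch {
  // Called for every (possibly nested) field value, not only the root one.
  static void handleValue(String columnFamily, String qualifier, Object value, NullValueMode nullMode) {
    if (value == null) {
      switch (nullMode) {
        case IGNORE:
          // Skip the field entirely, leaving the cell untouched.
          break;
        case WRITE:
          // Confluent-sink-compatible behavior: write an empty byte array.
          writeCell(columnFamily, qualifier, new byte[0]);
          break;
        case DELETE:
          // Design-doc behavior: delete at the corresponding level.
          deleteCell(columnFamily, qualifier);
          break;
      }
      return;
    }
    writeCell(columnFamily, qualifier, serialize(value));
  }

  // Hypothetical stand-ins for the actual mutation-building code.
  static void writeCell(String family, String qualifier, byte[] value) {}
  static void deleteCell(String family, String qualifier) {}
  static byte[] serialize(Object value) {
    return value.toString().getBytes(StandardCharsets.UTF_8);
  }
}
```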

```java
ByteString kafkaSubfieldName =
    ByteString.copyFrom(subfield.getKey().toString().getBytes(StandardCharsets.UTF_8));
Object kafkaSubfieldValue = subfield.getValue();
if (kafkaSubfieldValue == null && nullMode == NullValueMode.IGNORE) {
```


(Same as above.) Is this going to be the expected behavior, or will the user only expect the nullMode to affect root-level fields?

prawilny (Author)


(Same response as in the previous thread: the mode should apply at all nesting levels, for the same reasons.)

```java
    }
  }
} else {
  if (defaultColumnFamily != null && defaultColumnQualifier != null) {
```


I hadn't considered this use case. Good catch!

prawilny (Author)


FYI we talked with Gary and he approved this additional config.
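
For context, a minimal sketch of the fallback under discussion. Only `defaultColumnFamily` and `defaultColumnQualifier` come from the excerpt; the method and helper below are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.connect.data.Struct;

class DefaultColumnSketch {
  // When the root Kafka value is not a struct or map, there are no field names
  // to derive a column family and qualifier from, so the value is written to
  // the configured defaults, if both are set; otherwise the record is skipped.
  static void writeRootValue(Object rootValue, String defaultColumnFamily, String defaultColumnQualifier) {
    if (rootValue instanceof Struct || rootValue instanceof Map) {
      // Nested values are flattened into families/qualifiers elsewhere (not shown).
      return;
    }
    if (defaultColumnFamily != null && defaultColumnQualifier != null) {
      writeCell(defaultColumnFamily, defaultColumnQualifier,
          rootValue.toString().getBytes(StandardCharsets.UTF_8));
    }
  }

  // Stand-in for the actual Bigtable mutation-building code.
  static void writeCell(String family, String qualifier, byte[] value) {}
}
```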

```java
final String fieldName = "Double";

List<Double> testValues =
    Arrays.asList(Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN);
```


good catch!

```java
        calculateKey(List.of(innerFieldIntegerName), DELIMITER, kafkaConnectInnerStruct),
        innerIntegerValue.toString().getBytes(StandardCharsets.UTF_8)));

Schema kafkaConnectMiddleSchema =
```


separate test please

prawilny (Author)


It'd need a lot of copying since nested structs are used widely here. For example, kafkaConnectInnerStruct is used throughout the whole test case.
Maybe I should restructure this test function instead? E.g., first schemas, then structs, then organized assertions with comments specifying what's being tested?
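
If it helps to picture that restructuring, a skeleton of the proposed layout could look roughly like this (the schema and field names are illustrative, not the ones in the PR):

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.junit.jupiter.api.Test;

class NestedStructValueTest {
  @Test
  void nestedStructs() {
    // 1. Schemas: build every schema up front.
    Schema innerSchema = SchemaBuilder.struct().field("integer", Schema.INT32_SCHEMA).build();
    Schema middleSchema = SchemaBuilder.struct().field("inner", innerSchema).build();

    // 2. Structs: one value per schema, reused by all assertions below.
    Struct inner = new Struct(innerSchema).put("integer", 42);
    Struct middle = new Struct(middleSchema).put("inner", inner);

    // 3. Assertions, grouped and commented by what they verify, e.g.:
    //    3a. key calculation over a single top-level field,
    //    3b. key calculation over a nested struct,
    //    3c. value conversion of the nested struct into column families/qualifiers.
  }
}
```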


```java
Schema kafkaConnectInnerSchema =
    SchemaBuilder.struct()
        .field(innerFieldStringName, Schema.STRING_SCHEMA)
```


nit: the "inner" prefix is redundant and made me think there were outer fields

prawilny (Author)


Done.
Sorry, I must've left it in after a refactor.

```java
Struct kafkaConnectStruct = new Struct(kafkaConnectSchema);
kafkaConnectStruct.put(nullableFieldName, nullableFieldValue);
kafkaConnectStruct.put(requiredFieldName, requiredFieldValue);
assertThrows(DataException.class, () -> calculateKey(List.of(), DELIMITER, kafkaConnectStruct));
```


nit: Please add a comment for why this exception is expected

prawilny (Author)


Done.


```java
Struct kafkaConnectStruct = new Struct(kafkaConnectSchema);
kafkaConnectStruct.put(nullableFieldName, null);
assertThrows(DataException.class, () -> calculateKey(List.of(), DELIMITER, kafkaConnectStruct));
```


nit: Please add a comment for why this exception is expected

prawilny (Author)


Done.

prawilny (Author)

@brandtnewton
I didn't resolve any of the conversations since I'm used to the reviewer doing that.

I also have a question: how do you want to review further commits (mainly logical types support, some more integration tests, and some minor tweaks throughout the codebase)? In this PR? In a new one? Or maybe do you want to create a new repository and have the PR(s) sent there? Please let me know.
