feat: Add Kafka Connect Cloud Bigtable sink connector #2466
base: main
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Warning: This pull request is touching the following templated files:
Force-pushed from e2f0361 to 40ae074.
This is shaping up nicely! All my comments are pretty minor. I did not review the Integration tests yet. I'll get back to you soon on how to handle logical types.
 * invalid, it is assumed that input {@link SinkRecord SinkRecord(s)} map to invalid values,
 * so all the {@link SinkRecord SinkRecord(s)} needing the resource whose creation failed
 * are returned.
 * <li>Other resource creation errors are logged.
Please elaborate more on the exception handling logic
Done. Is it appropriate now?
    Map<Fut, ResourceAndRecords<Id>> createdColumnFamilyFuturesAndRecords,
    String errorMessageTemplate) {
  Set<SinkRecord> dataErrors = new HashSet<>();
  createdColumnFamilyFuturesAndRecords.forEach(
Rename this variable - these aren't necessarily column family resources
Done.
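For readers following the thread, here is a hedged sketch of the error-handling pattern described in the Javadoc above. The future type, the `ResourceAndRecords` accessors, and the way data errors are distinguished from other failures are assumptions for illustration, not the PR's actual code:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hedged sketch: wait for each resource-creation future; if creation failed in a way that
// suggests invalid input, every record that needed that resource is treated as a data error,
// while other failures are only logged. All names below are assumed.
final class ResourceCreationErrorSketch {
  private static final Logger logger = LoggerFactory.getLogger(ResourceCreationErrorSketch.class);

  // Placeholder pairing of a resource id with the records that need that resource.
  static final class ResourceAndRecords<Id> {
    final Id resourceId;
    final Set<SinkRecord> records;

    ResourceAndRecords(Id resourceId, Set<SinkRecord> records) {
      this.resourceId = resourceId;
      this.records = records;
    }
  }

  static <Id> Set<SinkRecord> collectDataErrors(
      Map<CompletableFuture<Void>, ResourceAndRecords<Id>> creationFuturesAndRecords,
      String errorMessageTemplate) {
    Set<SinkRecord> dataErrors = new HashSet<>();
    creationFuturesAndRecords.forEach(
        (future, resourceAndRecords) -> {
          try {
            future.get();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } catch (ExecutionException e) {
            if (isCausedByInvalidInput(e.getCause())) {
              // Assumed: invalid input means all records needing this resource are data errors.
              dataErrors.addAll(resourceAndRecords.records);
            } else {
              // Other resource creation errors are only logged.
              logger.warn(String.format(errorMessageTemplate, resourceAndRecords.resourceId), e);
            }
          }
        });
    return dataErrors;
  }

  private static boolean isCausedByInvalidInput(Throwable cause) {
    // Placeholder classification; the real connector inspects the Bigtable API exception.
    return cause instanceof IllegalArgumentException;
  }
}
```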
for (Map.Entry<Object, Object> field : getChildren(rootKafkaValue)) {
  String kafkaFieldName = field.getKey().toString();
  Object kafkaFieldValue = field.getValue();
  if (kafkaFieldValue == null && nullMode == NullValueMode.IGNORE) {
Is this going to be the expected behavior, or will the user only expect the nullMode to affect root-level fields?
I think it should apply to all the values (including the nested ones).
Rationale:
- We need `NullValueMode.WRITE` to behave like Confluent's sink, which treats `null` values (within values) as empty byte arrays at all nesting levels.
  - We got Gary's approval for introducing this configuration, arguing that `null`s causing deletes by default would be a footgun for users migrating from Confluent's sink. Do you agree, or should we rethink this idea?
- We need `NullValueMode.DELETE` to behave as described in the design doc, which means deletion at all nesting levels.

Alternatively, we could introduce per-nesting-level configuration, but it seems excessive to me. What do you think?
I tweaked the docstring a bit to make it clearer that this config affects more than only a root `null`.
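A hedged sketch of the behaviour I mean, applied at every nesting level rather than only at the root. The `NullValueMode` values match the discussion above; the helper names and the empty-byte-array stand-in for WRITE are assumptions for illustration:

```java
import java.nio.charset.StandardCharsets;

// Hedged sketch of applying the null-handling mode at every nesting level, not just the root.
// NullValueMode mirrors the discussion above; everything else is an assumed illustration.
final class NullModeSketch {
  enum NullValueMode { IGNORE, WRITE, DELETE }

  static void handleField(
      String family, String qualifier, Object kafkaFieldValue, NullValueMode nullMode) {
    if (kafkaFieldValue != null) {
      setCell(family, qualifier, String.valueOf(kafkaFieldValue).getBytes(StandardCharsets.UTF_8));
      return;
    }
    switch (nullMode) {
      case IGNORE:
        break; // skip the field entirely; no mutation is produced
      case WRITE:
        setCell(family, qualifier, new byte[0]); // Confluent-compatible: null becomes an empty byte array
        break;
      case DELETE:
        deleteCells(family, qualifier); // design-doc behaviour: null deletes at this nesting level too
        break;
    }
  }

  // Placeholders standing in for the connector's mutation-building calls.
  private static void setCell(String family, String qualifier, byte[] value) {}

  private static void deleteCells(String family, String qualifier) {}
}
```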
ByteString kafkaSubfieldName =
    ByteString.copyFrom(subfield.getKey().toString().getBytes(StandardCharsets.UTF_8));
Object kafkaSubfieldValue = subfield.getValue();
if (kafkaSubfieldValue == null && nullMode == NullValueMode.IGNORE) {
(Same as above.) Is this going to be the expected behavior, or will the user only expect the nullMode to affect root-level fields?
Same answer as above: it should apply to all the values, including the nested ones.
      }
    }
  } else {
    if (defaultColumnFamily != null && defaultColumnQualifier != null) {
I hadn't considered this use case. Good catch!
FYI we talked with Gary and he approved this additional config.
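Roughly, this is the use case being discussed: a root value that is not a struct/map cannot be split into per-field cells, so it is written as a single cell addressed by defaults from the connector config. A hedged sketch, with `defaultColumnFamily`/`defaultColumnQualifier` and the method shape assumed rather than taken from the PR:

```java
import com.google.protobuf.ByteString;
import org.apache.kafka.connect.errors.DataException;

// Hedged sketch only: converts a non-struct root value into the bytes stored at the single cell
// (defaultColumnFamily, defaultColumnQualifier) taken from configuration. Names are assumptions.
final class DefaultColumnSketch {
  static ByteString toSingleCell(
      Object rootKafkaValue, String defaultColumnFamily, String defaultColumnQualifier) {
    if (defaultColumnFamily != null && defaultColumnQualifier != null) {
      // The whole value becomes one cell; family and qualifier come from configuration.
      return ByteString.copyFromUtf8(String.valueOf(rootKafkaValue));
    }
    // Without both defaults configured there is no cell to address the value with.
    throw new DataException("Cannot map a non-struct value without default column settings");
  }
}
```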
final String fieldName = "Double";

List<Double> testValues =
    Arrays.asList(Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN);
good catch!
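If it helps future readers, a minimal self-contained sketch of the edge-case check praised here. The 8-byte big-endian encoding and the bit-level comparison (so NaN compares equal to itself) are assumptions for illustration, not necessarily what the connector does:

```java
import static org.junit.Assert.assertEquals;

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.junit.Test;

// Hedged sketch: round-trips the special double values through an assumed 8-byte encoding.
public class DoubleEdgeCaseSketchTest {
  @Test
  public void specialDoubleValuesRoundTrip() {
    List<Double> testValues =
        Arrays.asList(Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN);
    for (Double value : testValues) {
      byte[] encoded = ByteBuffer.allocate(Double.BYTES).putDouble(value).array();
      double decoded = ByteBuffer.wrap(encoded).getDouble();
      // Compare raw bits so that the NaN case still passes.
      assertEquals(Double.doubleToLongBits(value), Double.doubleToLongBits(decoded));
    }
  }
}
```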
        calculateKey(List.of(innerFieldIntegerName), DELIMITER, kafkaConnectInnerStruct),
        innerIntegerValue.toString().getBytes(StandardCharsets.UTF_8)));

Schema kafkaConnectMiddleSchema =
separate test please
It'd need a lot of copying since nested structs are used widely here; for example, kafkaConnectInnerStruct is used throughout the whole test case.
Maybe I should restructure this test function instead? E.g., first schemas, then structs, then organized assertions with comments specifying what's being tested?
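Something like this layout is what I have in mind. A hedged sketch with placeholder schema, field, and test names, not the PR's actual schemas or assertions:

```java
import static org.junit.Assert.assertNotNull;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.junit.Test;

// Hedged sketch of the "schemas first, then structs, then grouped assertions" layout.
public class NestedStructLayoutSketchTest {
  @Test
  public void nestedStructs() {
    // 1. Schemas: inner is reused by middle, middle by outer.
    Schema innerSchema = SchemaBuilder.struct().field("s", Schema.STRING_SCHEMA).build();
    Schema middleSchema = SchemaBuilder.struct().field("inner", innerSchema).build();
    Schema outerSchema = SchemaBuilder.struct().field("middle", middleSchema).build();

    // 2. Structs, built once and shared by the assertions below.
    Struct inner = new Struct(innerSchema).put("s", "value");
    Struct middle = new Struct(middleSchema).put("inner", inner);
    Struct outer = new Struct(outerSchema).put("middle", middle);

    // 3. Assertions, grouped with comments stating what each block verifies.
    // (a) single-level access
    assertNotNull(middle.getStruct("inner"));
    // (b) two-level access
    assertNotNull(outer.getStruct("middle").getStruct("inner"));
  }
}
```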
Schema kafkaConnectInnerSchema =
    SchemaBuilder.struct()
        .field(innerFieldStringName, Schema.STRING_SCHEMA)
nit: the "inner" prefix is redundant and made me think there were outer fields
Done.
Sorry, I must've left it after a refactor.
Struct kafkaConnectStruct = new Struct(kafkaConnectSchema);
kafkaConnectStruct.put(nullableFieldName, nullableFieldValue);
kafkaConnectStruct.put(requiredFieldName, requiredFieldValue);
assertThrows(DataException.class, () -> calculateKey(List.of(), DELIMITER, kafkaConnectStruct));
nit: Please add a comment for why this exception is expected
Done.
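For illustration, a hedged sketch of the requested style: a one-line comment above assertThrows stating why the exception is expected. The stated reason (an empty key definition) is an assumption for illustration, and calculateKey() below is a placeholder, not the PR's real helper:

```java
import static org.junit.Assert.assertThrows;

import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.DataException;
import org.junit.Test;

// Hedged sketch only: demonstrates documenting why the exception is expected.
public class KeyCalculationCommentSketchTest {
  private static final String DELIMITER = "#"; // assumed delimiter

  @Test
  public void emptyKeyDefinitionIsRejected() {
    Schema schema = SchemaBuilder.struct().field("f", Schema.STRING_SCHEMA).build();
    Struct struct = new Struct(schema).put("f", "value");
    // Expected to throw: with no key fields configured there is nothing to build a row key from
    // (assumed semantics; the real calculateKey() in the PR may fail for a different reason).
    assertThrows(DataException.class, () -> calculateKey(List.of(), DELIMITER, struct));
  }

  // Placeholder standing in for the connector's key-mapping helper.
  private static byte[] calculateKey(List<String> keyFields, String delimiter, Struct value) {
    if (keyFields.isEmpty()) {
      throw new DataException("Key definition must contain at least one field");
    }
    return value.getString(keyFields.get(0)).getBytes(StandardCharsets.UTF_8);
  }
}
```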
Struct kafkaConnectStruct = new Struct(kafkaConnectSchema);
kafkaConnectStruct.put(nullableFieldName, null);
assertThrows(DataException.class, () -> calculateKey(List.of(), DELIMITER, kafkaConnectStruct));
nit: Please add a comment for why this exception is expected
Done.
@brandtnewton I also have a question: how do you want to review further commits (mainly logical types support, some more integration tests, and some minor tweaks throughout the codebase)? In this PR? In a new one? Or maybe do you want to create a new repository and have the PR(s) sent there? Please let me know.
This PR adds a Kafka Connect Cloud Bigtable sink connector.
The code is to land in a different repository, but that repository hasn't been created yet, so we're bringing the code here for early review.
The fact that it targets another repo is the reason for the following:
Things yet to be done (in future PRs):