Rename s3 sink object metadata config options #5041

Merged
merged 5 commits into from
Oct 14, 2024
Changes from 1 commit
@@ -0,0 +1,17 @@
/*
* Copyright OpenSearch Contributors
* SPDX-License-Identifier: Apache-2.0
*/

package org.opensearch.dataprepper.plugins.sink.s3;

import com.fasterxml.jackson.annotation.JsonProperty;
public class ObjectMetadata {
Member

I'm not sure I understand the purpose or context for this change

@JsonProperty("number_of_events_key")
private String numberOfEventsKey;

public String getNumberOfEventsKey() {
return numberOfEventsKey;
}

}
@@ -46,6 +46,9 @@ public class S3SinkConfig {
@JsonProperty("predefined_object_metadata")
private PredefinedObjectMetadata predefinedObjectMetadata;

@JsonProperty("object_metadata")
Member

I think this is locking our configuration in a way that may prevent future useful expansion.

Customers may want to add other S3 object metadata that is not here. I tend to think this should be inverted.

For example, the user might want a static value named pipeline_name and a more dynamic value like number_of_events.

object_metadata:
  my_pipeline_name: pipeline-123
  my_event_count: ${numberOfEvents}

I think Data Prepper does currently lack a good, consistent way for plugins to provide expressions that are specific to that plugin. But, we could follow the pattern used elsewhere where we look specifically for this string. It allows us to extend this in the future.
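
To make the inverted configuration concrete, here is a minimal sketch of how a user-supplied metadata map could be resolved at write time. The class name `ObjectMetadataTemplateSketch` and the `resolve` method are hypothetical and not part of Data Prepper; the `${numberOfEvents}` placeholder string is taken from the example in this comment and is only a proposal, not an existing feature.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; nothing here exists in Data Prepper.
final class ObjectMetadataTemplateSketch {
    // Assumed placeholder, copied from the example in the review comment.
    static final String NUMBER_OF_EVENTS_PLACEHOLDER = "${numberOfEvents}";

    private ObjectMetadataTemplateSketch() {
    }

    // Returns a copy of the configured metadata in which every occurrence of the
    // placeholder is replaced by the event count. Static values pass through unchanged.
    static Map<String, String> resolve(final Map<String, String> configured, final int eventCount) {
        final Map<String, String> resolved = new LinkedHashMap<>();
        for (final Map.Entry<String, String> entry : configured.entrySet()) {
            final String value = entry.getValue()
                    .replace(NUMBER_OF_EVENTS_PLACEHOLDER, Integer.toString(eventCount));
            resolved.put(entry.getKey(), value);
        }
        return resolved;
    }
}
```

With the two entries from the example above, `my_pipeline_name` would stay `pipeline-123` and `my_event_count` would become the string form of the count. Literal string replacement keeps the sketch small; a real implementation might prefer a shared expression mechanism if Data Prepper gains one.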

Collaborator Author

@dlvenable I do not think it will prevent future "dynamic" metadata. Once we add support for dynamic (expression-based) metadata, the old-style metadata can be deprecated. I am not sure there is any easy way to add dynamic metadata now.

Member

The problem I see is that object_metadata is the place to put dynamic metadata.

I say we keep what we have and then improve it later with the dynamic approach.

Member

This configuration was already released in Data Prepper 2.9, so this is a breaking change.

Collaborator Author

@dlvenable It is a breaking change, but no one is using it.

Member

I can't say whether that is the case or not.

private ObjectMetadata objectMetadata;

@AssertTrue(message = "You may not use both bucket and bucket_selector together in one S3 sink.")
private boolean isValidBucketConfig() {
return (bucketName != null && bucketSelector == null) ||
@@ -142,8 +145,8 @@ public ObjectKeyOptions getObjectKeyOptions() {
return objectKeyOptions;
}

- public PredefinedObjectMetadata getPredefinedObjectMetadata() {
- return predefinedObjectMetadata;
+ public ObjectMetadata getObjectMetadata() {
Member

I think you need to return either the predefined or the new one here.

Also, add an @AssertTrue to be sure they are not both set by the user.
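
A rough sketch of what this suggestion could look like, written in plain Java so it stands alone. `MetadataConfigSketch` is a hypothetical stand-in for the relevant part of `S3SinkConfig`, and both settings are simplified to the metadata key string they carry. In the real config class the validation method would presumably carry an `@AssertTrue` annotation, as `isValidBucketConfig` does in this diff.

```java
// Hypothetical stand-in for part of S3SinkConfig; names and types are simplified assumptions.
final class MetadataConfigSketch {
    // Key taken from the older predefined_object_metadata setting (released in 2.9).
    private final String predefinedNumberOfObjectsKey;
    // Key taken from the newer object_metadata setting introduced by this PR.
    private final String numberOfEventsKey;

    MetadataConfigSketch(final String predefinedNumberOfObjectsKey, final String numberOfEventsKey) {
        this.predefinedNumberOfObjectsKey = predefinedNumberOfObjectsKey;
        this.numberOfEventsKey = numberOfEventsKey;
    }

    // In the real class this would be annotated with @AssertTrue and a message
    // telling the user not to set both options in one S3 sink.
    boolean isValidMetadataConfig() {
        return predefinedNumberOfObjectsKey == null || numberOfEventsKey == null;
    }

    // Prefers the new setting and falls back to the old one, so existing
    // pipelines keep working. Returns null when neither is configured.
    String getEffectiveNumberOfEventsKey() {
        return numberOfEventsKey != null ? numberOfEventsKey : predefinedNumberOfObjectsKey;
    }
}
```

Falling back to the old setting is what would keep the rename from being a breaking change for anyone already using the 2.9 configuration.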

+ return objectMetadata;
}

/**
@@ -5,24 +5,24 @@

package org.opensearch.dataprepper.plugins.sink.s3.grouping;

- import org.opensearch.dataprepper.plugins.sink.s3.PredefinedObjectMetadata;
+ import org.opensearch.dataprepper.plugins.sink.s3.ObjectMetadata;
import java.util.Map;
import java.util.Objects;

class S3GroupIdentifier {
private final Map<String, Object> groupIdentifierHash;
private final String groupIdentifierFullObjectKey;

- private final PredefinedObjectMetadata predefinedObjectMetadata;
+ private final ObjectMetadata objectMetadata;
private final String fullBucketName;

public S3GroupIdentifier(final Map<String, Object> groupIdentifierHash,
final String groupIdentifierFullObjectKey,
- final PredefinedObjectMetadata predefineObjectMetadata,
+ final ObjectMetadata objectMetadata,
final String fullBucketName) {
this.groupIdentifierHash = groupIdentifierHash;
this.groupIdentifierFullObjectKey = groupIdentifierFullObjectKey;
- this.predefinedObjectMetadata = predefineObjectMetadata;
+ this.objectMetadata = objectMetadata;
this.fullBucketName = fullBucketName;
}

@@ -43,6 +43,6 @@ public int hashCode() {

public Map<String, Object> getGroupIdentifierHash() { return groupIdentifierHash; }

- public Map<String, String> getMetadata(int eventCount) { return predefinedObjectMetadata != null ? Map.of(predefinedObjectMetadata.getNumberOfObjects(), Integer.toString(eventCount)) : null; }
+ public Map<String, String> getMetadata(int eventCount) { return objectMetadata != null ? Map.of(objectMetadata.getNumberOfEventsKey(), Integer.toString(eventCount)) : null; }
Member

This approach is coupling the metadata with the group itself. We should have a more extensible approach that allows for getting the metadata elsewhere.

Also, this design is intrinsically connected to the count. But, the metadata may be more than just this.

I think having a class to get metadata for any given S3 Object write would make more sense.

Collaborator Author

We should have a more extensible approach. We can do that in the future. The current approach is NOT connected to the count. getMetadata() returns a map, which can hold more than the count. Currently, it is just a count.

Collaborator Author

The S3 PutObjectRequest API takes metadata as a map. The ObjectMetadata class is already being added in this PR. We can give it an API that populates and returns a map, instead of creating the map outside. I think this can be done in a future PR.
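
One possible shape for the decoupling discussed in this thread, offered as a sketch rather than the design either reviewer had in mind. `S3ObjectMetadataProvider` and `EventCountMetadataProvider` are hypothetical names. The only assumption carried over from the diff is that the result is a `Map<String, String>` built from a configured key and the event count.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical abstraction: something that can produce metadata for one S3 object write,
// independent of the group identifier.
interface S3ObjectMetadataProvider {
    Map<String, String> getMetadata(int eventCount);
}

// Hypothetical implementation covering what the PR supports today: a single event-count entry.
final class EventCountMetadataProvider implements S3ObjectMetadataProvider {
    private final String numberOfEventsKey;

    EventCountMetadataProvider(final String numberOfEventsKey) {
        this.numberOfEventsKey = numberOfEventsKey;
    }

    @Override
    public Map<String, String> getMetadata(final int eventCount) {
        // Design choice in this sketch: return an empty map instead of null when no key
        // is configured. The code in the diff returns null in that case.
        if (numberOfEventsKey == null) {
            return Collections.emptyMap();
        }
        return Map.of(numberOfEventsKey, Integer.toString(eventCount));
    }
}
```

Other providers (static values, expression-based values) could implement the same interface later without touching `S3GroupIdentifier`.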

public String getFullBucketName() { return fullBucketName; }
}
@@ -70,6 +70,6 @@ public S3GroupIdentifier getS3GroupIdentifierForEvent(final Event event) {
}


- return new S3GroupIdentifier(groupIdentificationHash, fullObjectKey, s3SinkConfig.getPredefinedObjectMetadata(), fullBucketName);
+ return new S3GroupIdentifier(groupIdentificationHash, fullObjectKey, s3SinkConfig.getObjectMetadata(), fullBucketName);
}
}