NIFI-14110 Support to limit content size in PackageFlowFile #9595

Open
wants to merge 3 commits into
base: main

EndzeitBegins (Contributor) commented Dec 25, 2024

Due to using a different API to retrieve the FlowFiles, the behaviour when working with multiple queues is no longer unspecified.

I had a circular dependency problem when depending on nifi-mock from nifi-utils, which is why I use an anonymous implementation of FlowFile inside the tests instead of MockFlowFile.

Summary

NIFI-14110

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

      return FlowFileFilterResult.ACCEPT_AND_CONTINUE;
  }

- if ((size + flowFile.getSize() > maxBytes) || (count + 1 > maxCount)) {
+ if (size > maxBytes || count > maxCount) {
Contributor

In the previous code, count and size were not changed in this case. The reject-and-terminate case is terminal: this FlowFileFilterResult will not be used further, nor will those values be referenced in any way, correct?

EndzeitBegins (Contributor, Author) commented Dec 28, 2024

Thank you for your review feedback @joewitt.
I'm not sure I correctly understood your comment. Please clarify if the following doesn't address your point.

My goal was to simplify the implementation by unifying the three points of computation (inside the first if statement, inside the predicate of the second if, and before the last return) into a single point.
I added test cases for a few selected scenarios before adjusting the implementation to ensure that the behavior remains the same.

The JavaDoc states:

Returns a new {@link FlowFileFilter} that will pull FlowFiles until the maximum file size has been reached, or the maximum FlowFile Count was been reached ...

Therefore, once either the maximum count or accumulated size limit is reached, all subsequent FlowFiles should be rejected.

Upon closer inspection, the old implementation contained a minor bug related to this behavior. If a FlowFile exceeding the size limit was initially rejected (with REJECT_AND_TERMINATE), and the caller subsequently passed an empty FlowFile to the filter, the filter would incorrectly return ACCEPT_AND_CONTINUE even though the limit had already been reached. While this behavior is technically incorrect, it's primarily a consequence of the caller not adhering to the FlowFileFilterResult contract and thus not a critical bug.
I enhanced one of the test cases to demonstrate this behavior.
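
The scenario can be reproduced with a minimal sketch. It uses hypothetical stand-in types (`Result`, `SizeFilter`) instead of the NiFi `FlowFileFilter` API, and `newStyle` is only an assumption about the shape of the refactored logic based on the description above, not the code in this pull request.

```java
public class SizeBasedFilterSketch {

    /** Hypothetical stand-in for the two FlowFileFilterResult values discussed here. */
    public enum Result { ACCEPT_AND_CONTINUE, REJECT_AND_TERMINATE }

    /** Hypothetical stand-in for FlowFileFilter; it only sees the FlowFile size in bytes. */
    public interface SizeFilter {
        Result filter(long flowFileSize);
    }

    /** Old shape: counters are updated in two places and left untouched on rejection. */
    public static SizeFilter oldStyle(final long maxBytes, final int maxCount) {
        return new SizeFilter() {
            private int count = 0;
            private long size = 0L;

            @Override
            public Result filter(final long flowFileSize) {
                if (count == 0) {
                    // The first FlowFile is always accepted.
                    count++;
                    size += flowFileSize;
                    return Result.ACCEPT_AND_CONTINUE;
                }
                if ((size + flowFileSize > maxBytes) || (count + 1 > maxCount)) {
                    return Result.REJECT_AND_TERMINATE;
                }
                count++;
                size += flowFileSize;
                return Result.ACCEPT_AND_CONTINUE;
            }
        };
    }

    /** Assumed new shape: counters are updated once, before any decision is made. */
    public static SizeFilter newStyle(final long maxBytes, final int maxCount) {
        return new SizeFilter() {
            private int count = 0;
            private long size = 0L;

            @Override
            public Result filter(final long flowFileSize) {
                count++;
                size += flowFileSize;
                if (count == 1) {
                    // The first FlowFile is always accepted.
                    return Result.ACCEPT_AND_CONTINUE;
                }
                if (size > maxBytes || count > maxCount) {
                    return Result.REJECT_AND_TERMINATE;
                }
                return Result.ACCEPT_AND_CONTINUE;
            }
        };
    }

    public static void main(final String[] args) {
        // 10-byte limit, up to 5 FlowFiles; the caller ignores the termination signal.
        final long[] sizes = {4L, 20L, 0L};
        final SizeFilter oldFilter = oldStyle(10L, 5);
        final SizeFilter newFilter = newStyle(10L, 5);
        for (final long s : sizes) {
            System.out.println(s + " bytes: old=" + oldFilter.filter(s) + ", new=" + newFilter.filter(s));
        }
    }
}
```

With a 10-byte limit and sizes 4, 20 and 0, the old shape accepts the trailing empty FlowFile after having rejected the 20-byte one, while the assumed new shape keeps rejecting because its counters already include the rejected FlowFile.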

Both the old and new implementations are susceptible to potential integer overflow issues for both count and size. I'm not sure if that's something worth addressing though.
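
To make the overflow concern concrete: a plain `long` addition wraps around to a negative value, so a check of the form `size > maxBytes` would pass again. One possible mitigation, shown purely as a sketch and not as part of this pull request, is a saturating add built on `Math.addExact`. The class and method names are hypothetical.

```java
public class SaturatingAddSketch {

    /** Adds two longs, clamping to Long.MAX_VALUE or Long.MIN_VALUE instead of wrapping around. */
    public static long saturatingAdd(final long a, final long b) {
        try {
            return Math.addExact(a, b);
        } catch (final ArithmeticException e) {
            // Overflow only happens when both operands share a sign, so b's sign picks the bound.
            return b > 0 ? Long.MAX_VALUE : Long.MIN_VALUE;
        }
    }

    public static void main(final String[] args) {
        final long size = Long.MAX_VALUE - 1;
        // Plain addition silently wraps around to a negative number.
        System.out.println("wrapping:   " + (size + 5));
        // The saturating variant stays at the upper bound, so a limit check keeps failing.
        System.out.println("saturating: " + saturatingAdd(size, 5));
    }
}
```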

I'm content with leaving the implementation of newSizeBasedFilter unchanged if that's preferred. I encountered the code while working on the desired changes for PackageFlowFile and deemed it worth adding some tests and a minor refactor.

mosermw (Member) commented Dec 30, 2024

I recommend modifying a MultiProcessorUseCase or creating another UseCase in order to make the documentation for combining the two Batch Size properties clear. It's very important to be clear that flowfiles will not be delayed in the input queue waiting for a batch size to be reached. It's also very important to support packaging exactly 1 flowfile.

While this improvement appears to be worthwhile, we should be very careful with configuration creep on PackageFlowFile. Its only justification for existence is to be easier to use than MergeContent for a specific use case. Too many features would ruin that justification.

EndzeitBegins (Contributor, Author)

Thank you for the useful feedback @mosermw. I've adjusted the documentation of the UseCases to clarify the batching behaviour of the processor.

PackageFlowFile in combination with UnpackContent is a useful pair of processors to transfer FlowFiles between NiFi clusters where the more robust approach using remote process groups is not applicable, e.g. due to network restrictions.
Packaging more than one FlowFile can improve efficiency both in storage and transmission.
In my opinion, when the content size of the FlowFiles to transfer can vary widely, being able to apply a soft constraint on the package size can be helpful.

Personally, I do not intend to add other properties to the processor at the moment.
