-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingestion/s3): ignore depth mismatched path #12326
base: master
Are you sure you want to change the base?
feat(ingestion/s3): ignore depth mismatched path #12326
Conversation
e6d168e
to
d5a4387
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
... and 5 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
c63a02e
to
a2690ac
Compare
a2690ac
to
98d9263
Compare
def _is_allowed_path(path_spec_: PathSpec, s3_uri: str) -> bool: | ||
allowed = path_spec_.allowed(s3_uri) | ||
if not allowed: | ||
logger.debug(f"File {s3_uri} not allowed and skipping") | ||
return allowed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this function to log when a file is ignored.
If you have a better approach, please feel free to suggest it. 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add this to the report as well -> something like this ->
self.report.report_dropped(datahub_dataset_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just make sure in the report you collect in a lossy list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
98d9263
to
651892a
Compare
for key, group in group_s3_objects_by_dirname(s3_objects).items(): | ||
file_size = 0 | ||
creation_time = None | ||
modification_time = None | ||
|
||
for item in group: | ||
file_path = self.create_s3_path(item.bucket_name, item.key) | ||
if not path_spec.allowed(file_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify my intention, I wrote it as qna format.
Q: Why path_spec.allowed
moved to L879?
A: Since path_spec.allowed
already implements depth comparison, I used it to determine whether each path should be skipped or not while iterating over s3_objects.
Q: Does moving this logic to L879 make any breaking changes?
A: No, because it retains the original behavior by continuing to skip each disallowed path, even after being moved to L879.
651892a
to
98366cd
Compare
98366cd
to
60d83cd
Compare
60d83cd
to
c1841f6
Compare
c1841f6
to
ab015f5
Compare
ab015f5
to
23db0f8
Compare
23db0f8
to
7e29893
Compare
7e29893
to
f65a443
Compare
821fc88
to
bd05768
Compare
f0a1541
to
ef6eff6
Compare
Changes
|
ef6eff6
to
2e58331
Compare
2e58331
to
d6ff3ac
Compare
d6ff3ac
to
3d5d49b
Compare
Summary
Feature Addition:
Ignores the paths whose depth dose not match to
path_spec.include
while ingesting S3.Changes
As-is
To-be
Background
path_spec.include
.Logs
How to reproduce
You can reproduce this issue by running the code below.
Checklist