Implement catalog filter for KedroDataCatalog #4449

Open
wants to merge 35 commits into
base: main

ElenaKhaustova
Contributor

Description

Implemented KedroDataCatalog.filter() to filter datasets by name and type.

Solves #3917

Reasoning: #3917 (comment)

Development notes

Implemented on top of #4448 <-- please review it first.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

ElenaKhaustova and others added 27 commits January 22, 2025 11:49
Signed-off-by: Elena Khaustova <[email protected]>
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review January 28, 2025 15:50
Member

@merelcht merelcht left a comment

Already discussed on the call that this looks good! 👍 Left some minor comments.

tests/io/test_kedro_data_catalog.py
kedro/io/kedro_data_catalog.py (outdated)
Contributor

@ankatiyar ankatiyar left a comment

manually tested, looks good! 👍🏾

@ElenaKhaustova
Contributor Author

ElenaKhaustova commented Jan 31, 2025

After the conversation with @idanov, we concluded that this method doesn't bring much new functionality, so it makes sense to keep the old naming. However, we're still unsure about its necessity.

Our initial thinking was that the original list() method doesn't give any value since it allows filtering only by dataset names, which can now be easily done by applying a regex filter to keys(). That’s why we suggested extending the functionality to filter by kind, as users might struggle to access dataset types given that now we have lazy and materialized datasets. So, there might be an additional value in “by kind” filtering.

Based on the above we still have two options:

  • Remove list() method
  • Keep list() method but extend it with by_type (which basically means just renaming filter() -> list() in the current implementation)

Please share your feedback on these two points (naming and necessity) 🙏

@datajoely
Contributor

I'd vote for "Keep list() method but extend it with by_type (which basically means just renaming filter() -> list() in the current implementation)".

The 3 things I don't like as a user today:

  • Unable to filter by type
  • Namespaces are weird (__ internal representation)
  • Factories aren't visible until used

@astrojuanlu
Member

which can now be easily done by applying a regex filter to keys()

I don't follow, could you elaborate? What would the code look like with and without this PR?

@ElenaKhaustova
Contributor Author

which can now be easily done by applying a regex filter to keys()

I don't follow, could you elaborate? What would the code look like with and without this PR?

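import re

# regex and regex_flags are supplied by the caller; catalog is an existing KedroDataCatalog
pattern = re.compile(regex, flags=regex_flags)
filtered_ds = [ds_name for ds_name in catalog.keys() if pattern.search(ds_name)]  # or iterate via values()/items()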

The point is that since dataset names can be easily accessed, it's also easy to apply custom filtering on top, and we don't need a dedicated method for it. However, it's a bit more complicated with dataset types, so filtering by types could add some value.
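
To make the contrast concrete, here is a minimal, self-contained sketch of both kinds of filtering. It is not the PR's implementation: the catalog is a plain dict standing in for KedroDataCatalog (assuming only the dict-like keys()/items() access mentioned above), the two Stub* classes and both helper functions are hypothetical, and lazy datasets are ignored entirely.

```python
import re


# Hypothetical stand-ins for dataset classes; not real Kedro classes.
class StubCSVDataset:
    pass


class StubMemoryDataset:
    pass


# Assumption: a plain dict is enough to mimic the dict-like access
# (keys()/items()) that the thread says the catalog exposes.
catalog = {
    "raw_cars": StubCSVDataset(),
    "model_input": StubMemoryDataset(),
    "raw_boats": StubCSVDataset(),
}


def filter_names(catalog, regex, flags=re.IGNORECASE):
    """Filtering by name: a one-liner once keys() is available."""
    pattern = re.compile(regex, flags=flags)
    return [name for name in catalog.keys() if pattern.search(name)]


def filter_by_type_name(catalog, type_regex, flags=0):
    """Filtering by type: each object's class has to be inspected first."""
    pattern = re.compile(type_regex, flags=flags)
    matched = []
    for name, dataset in catalog.items():
        cls = type(dataset)
        qualified_name = f"{cls.__module__}.{cls.__qualname__}"
        if pattern.search(qualified_name):
            matched.append(name)
    return matched


print(filter_names(catalog, "^RAW"))
print(filter_by_type_name(catalog, "MemoryDataset$"))
```

With a real catalog, the type-based variant would also have to decide how to treat datasets that have not been materialized yet, which is the part a dedicated method could hide from the user.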

@datajoely
Contributor

You're technically correct, but I think there is something to be said for the loss of DX / convenience in a function designed for use in interactive mode.

self,
name_regex: str | None = None,
name_regex_flags: int | re.RegexFlag = re.IGNORECASE,
type_regex: str | None = None,
Contributor

When I read the PR description about filtering by type, I thought it would basically be isinstance filtering that accepts a Type parameter, not a string-based class name. A fully qualified class name is also more laborious for a plugin developer to obtain than just doing data_catalog.filter(by_type=PluginSpecificDataset). WDYT?

Contributor

Oh I hadn't even considered that, I thought it would be the string representation people do in the YAML config! To be honest, both would be useful

Contributor Author

I was thinking about that as well, but for the case where we allow both a Type and a string type. So extending the suggested implementation sounds good to me.
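
A rough sketch of what accepting both forms could look like, under stated assumptions: filter_by_type and the Stub* classes are hypothetical, the catalog is a plain dict rather than KedroDataCatalog, a string is treated as a regex over the qualified class name, and lazy datasets are not handled.

```python
import re


# Hypothetical dataset classes used only for illustration.
class StubDataset:
    pass


class StubPluginDataset(StubDataset):
    pass


def filter_by_type(catalog, by_type):
    """Return dataset names whose dataset matches by_type.

    by_type may be a class (or tuple of classes), checked with isinstance,
    or a string, treated as a regex over the qualified class name.
    """
    if isinstance(by_type, str):
        pattern = re.compile(by_type)

        def matches(dataset):
            cls = type(dataset)
            return bool(pattern.search(f"{cls.__module__}.{cls.__qualname__}"))
    else:

        def matches(dataset):
            return isinstance(dataset, by_type)

    return [name for name, dataset in catalog.items() if matches(dataset)]


catalog = {"base": StubDataset(), "plugin": StubPluginDataset()}
print(filter_by_type(catalog, StubPluginDataset))
print(filter_by_type(catalog, "StubPluginDataset$"))
```

One design consequence worth noting: the isinstance branch also matches subclasses, while the string branch matches only the exact class name that the regex describes, so the two forms are not interchangeable.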

@astrojuanlu
Member

Thanks @ElenaKhaustova , that's what I imagined. In my view, given that our users aren't necessarily software engineers, our APIs shouldn't just be perfect partitions of the set of possible use cases, but allow for some convenience methods that, as @datajoely says, alleviate the burden a bit.

@merelcht
Member

merelcht commented Feb 3, 2025

I still feel like filter() is a more descriptive name and makes it more obvious that you can provide filter arguments. Whereas with list I wouldn't really expect to be able to add a regex argument.

So my vote would be: introduce filter now and remove list when we do 1.0.0
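
One possible shape for that transition, sketched on a toy class rather than on KedroDataCatalog: filter() carries the logic, and list() stays as a thin wrapper that warns until it is removed. StubCatalog, the parameter names of list(), and the warning text are all assumptions; only name_regex and name_regex_flags come from the signature under review.

```python
import re
import warnings


class StubCatalog:
    """Toy catalog (hypothetical) showing filter() plus a deprecated list()."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def filter(self, name_regex=None, name_regex_flags=re.IGNORECASE):
        # No regex means no filtering: every dataset name is returned.
        pattern = None
        if name_regex is not None:
            pattern = re.compile(name_regex, flags=name_regex_flags)
        return [
            name
            for name in self._datasets
            if pattern is None or pattern.search(name)
        ]

    def list(self, regex=None, flags=re.IGNORECASE):
        # Kept only for backwards compatibility; delegates to filter().
        warnings.warn(
            "list() is deprecated, use filter() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.filter(name_regex=regex, name_regex_flags=flags)


stub = StubCatalog({"raw_cars": object(), "model_input": object()})
print(stub.filter("^raw"))
```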

@ElenaKhaustova
Contributor Author

Thanks @ElenaKhaustova , that's what I imagined. In my view, given that our users aren't necessarily software engineers, our APIs shouldn't just be perfect partitions of the set of possible use cases, but allow for some convenience methods that, as @datajoely says, alleviate the burden a bit.

Could you please clarify whether you mean keeping the existing list implementation?

6 participants