Implement catalog filter for `KedroDataCatalog` #4449

ElenaKhaustova · 2025-01-28T15:47:12Z

Description

Implemented KedroDataCatalog.filter() to filter datasets by name and type.
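
To make the intended behaviour concrete, here is a minimal sketch of name-and-type filtering. It is not the PR's implementation: `filter_names` and `toy_catalog` are hypothetical, and a plain dict stands in for the catalog. Only the parameter names `name_regex`, `name_regex_flags` (defaulting to `re.IGNORECASE`) and `type_regex` come from the diff excerpt quoted later in this thread. Matching the type regex case-sensitively against the fully qualified class name is an assumption.

```python
import re
from collections import OrderedDict
from fractions import Fraction


def filter_names(datasets, name_regex=None, name_regex_flags=re.IGNORECASE, type_regex=None):
    """Return names of entries whose name and type both match (hypothetical helper).

    `datasets` is a plain dict of name -> object, standing in for a catalog.
    """
    name_pattern = re.compile(name_regex, flags=name_regex_flags) if name_regex else None
    type_pattern = re.compile(type_regex) if type_regex else None

    matched = []
    for name, dataset in datasets.items():
        if name_pattern and not name_pattern.search(name):
            continue
        # Fully qualified class name, e.g. "fractions.Fraction"
        cls = type(dataset)
        type_name = f"{cls.__module__}.{cls.__qualname__}"
        if type_pattern and not type_pattern.search(type_name):
            continue
        matched.append(name)
    return matched


# Standard-library objects stand in for real datasets.
toy_catalog = {"Cars": OrderedDict(), "boats": Fraction(1, 2), "car_model": Fraction(3, 4)}

print(filter_names(toy_catalog, name_regex="^car"))        # ['Cars', 'car_model']
print(filter_names(toy_catalog, type_regex="Fraction"))    # ['boats', 'car_model']
print(filter_names(toy_catalog, name_regex="^car", type_regex=r"fractions\."))  # ['car_model']
```

With no arguments the helper returns every name, so a plain listing falls out as the unfiltered case.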

Development notes

Implemented on top of #4448 <-- please review it first.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

merelcht

Already discussed on the call that this looks good! 👍 Left some minor comments.

tests/io/test_kedro_data_catalog.py

kedro/io/kedro_data_catalog.py

ankatiyar

manually tested, looks good! 👍🏾

ElenaKhaustova · 2025-01-31T16:15:10Z

After the conversation with @idanov, we concluded that this method doesn't bring much new functionality, so it makes sense to keep the old naming. However, we're still unsure about its necessity.

Our initial thinking was that the original list() method doesn't add much value, since it only allows filtering by dataset name, which can now be easily done by applying a regex filter to keys(). That's why we suggested extending the functionality to filter by kind: users might struggle to access dataset types now that we have both lazy and materialized datasets. So there might be additional value in "by kind" filtering.

Based on the above we still have two options:

Remove list() method
Keep list() method but extend it with by_type (which basically means just renaming filter() -> list() in the current implementation)

Please share your feedback on these two points (naming and necessity) 🙏

datajoely · 2025-01-31T16:25:56Z

I'd vote for "Keep list() method but extend it with by_type" (which basically means just renaming filter() -> list() in the current implementation).

The 3 things I don't like as a user today:

Unable to filter by type
Namespaces are weird (__ internal representation)
Factories aren't visible until used

astrojuanlu · 2025-02-03T07:01:06Z

> which can now be easily done by applying a regex filter to keys()

I don't follow, could you elaborate? What would the code look like with and without this PR?

ElenaKhaustova · 2025-02-03T10:19:30Z

> which can now be easily done by applying a regex filter to keys()

> I don't follow, could you elaborate? What would the code look like with and without this PR?

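import re

# `regex` and `regex_flags` are supplied by the user; `catalog` is the catalog instance
pattern = re.compile(regex, flags=regex_flags)
filtered_ds = [ds_name for ds_name in catalog.keys() if pattern.search(ds_name)]  # or iterate via values()/items()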

The point is that since dataset names can be easily accessed, it's also easy to apply custom filtering on top, and we don't need a dedicated method for it. However, it's a bit more complicated with dataset types, so filtering by types could add some value.
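
A rough sketch of why types are harder than names, under stated assumptions: `LazyEntry`, `type_path_of` and `toy_catalog` are hypothetical and do not reproduce Kedro's lazy dataset class. The sketch only assumes that a lazy entry knows its type as a configuration string, while a materialized entry is an ordinary object.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class LazyEntry:
    """Hypothetical placeholder that holds only the configured type string."""

    type_path: str  # e.g. "fractions.Fraction", as it might appear in config


def type_path_of(entry):
    """Return a fully qualified type name for lazy and materialized entries."""
    if isinstance(entry, LazyEntry):
        # Not instantiated yet: the type is only known as a string.
        return entry.type_path
    cls = type(entry)
    return f"{cls.__module__}.{cls.__qualname__}"


toy_catalog = {
    "prices": Fraction(3, 2),                       # materialized object
    "raw_prices": LazyEntry("fractions.Fraction"),  # lazy placeholder
    "price_notes": LazyEntry("builtins.str"),       # lazy placeholder
}

# Names: a one-liner over the keys.
price_names = [name for name in toy_catalog if "price" in name]

# Types: every entry needs the lazy/materialized distinction resolved first.
fraction_names = [
    name for name, entry in toy_catalog.items()
    if type_path_of(entry) == "fractions.Fraction"
]

print(price_names)     # ['prices', 'raw_prices', 'price_notes']
print(fraction_names)  # ['prices', 'raw_prices']
```

The extra branch in `type_path_of` is the part a user would otherwise have to write by hand, which is the argument for offering type filtering on the catalog itself.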

datajoely · 2025-02-03T10:58:24Z

You're technically correct, but I think there is something to be said about the loss of DX / convenience for a function designed for use in interactive mode.

marrrcin · 2025-02-03T12:35:28Z

kedro/io/kedro_data_catalog.py

+        self,
+        name_regex: str | None = None,
+        name_regex_flags: int | re.RegexFlag = re.IGNORECASE,
+        type_regex: str | None = None,


When I read the PR description about filtering by type, I thought it would basically be isinstance-style filtering that accepts a Type parameter, not a string-based class name (and a fully qualified one at that), which is more laborious for a plugin developer to obtain than just doing data_catalog.filter(by_type=PluginSpecificDataset). WDYT?

datajoely · 2025-02-03

Oh, I hadn't even considered that; I thought it would be the string representation people use in the YAML config! To be honest, both would be useful.

ElenaKhaustova · 2025-02-03

I was thinking about that as well, but in a version where we allow both a Type and a string type. So extending the suggested implementation sounds good to me.
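
One way the "both Type and string" idea could look, as a hedged sketch rather than the agreed design: `filter_by_type` and `toy` are hypothetical, a plain dict stands in for the catalog, and lazy (not yet instantiated) entries are deliberately left out. A class triggers an `isinstance` check, while a string is treated as a regex over the fully qualified class name.

```python
import re
from collections import OrderedDict
from fractions import Fraction


def filter_by_type(datasets, by_type):
    """Return names whose value matches `by_type` (hypothetical helper).

    `by_type` may be a class (isinstance check, so subclasses match) or a
    string regex searched in the fully qualified class name.
    """
    if isinstance(by_type, type):
        return [name for name, ds in datasets.items() if isinstance(ds, by_type)]
    pattern = re.compile(by_type)
    return [
        name for name, ds in datasets.items()
        if pattern.search(f"{type(ds).__module__}.{type(ds).__qualname__}")
    ]


# Standard-library objects stand in for real datasets.
toy = {"a": OrderedDict(), "b": {}, "c": Fraction(1, 3)}

print(filter_by_type(toy, dict))                  # ['a', 'b']  (OrderedDict subclasses dict)
print(filter_by_type(toy, "OrderedDict"))         # ['a']
print(filter_by_type(toy, r"^builtins\.dict$"))   # ['b']
```

The two modes are not equivalent: the class form follows inheritance, while the string form matches only what the pattern says, so documenting that difference would matter if both were supported.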

astrojuanlu · 2025-02-03T12:39:53Z

Thanks @ElenaKhaustova, that's what I imagined. In my view, given that our users aren't necessarily software engineers, our APIs shouldn't just be perfect partitions of the set of possible use cases, but allow for some convenience methods that, as @datajoely says, alleviate the burden a bit.

merelcht · 2025-02-03T13:41:46Z

I still feel like filter() is a more descriptive name and makes it more obvious that you can provide filter arguments. Whereas with list I wouldn't really expect to be able to add a regex argument.

So my vote would be: introduce filter now and remove list when we do 1.0.0
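
A minimal sketch of what "introduce filter now and remove list later" could look like in practice, assuming a deprecation warning is wanted in the interim (the thread does not say so). `ToyCatalog` and its `name_contains` parameter are hypothetical and unrelated to Kedro's actual classes.

```python
import warnings


class ToyCatalog:
    """Hypothetical stand-in for a catalog; not Kedro's class."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def filter(self, name_contains=None):
        """Return dataset names, optionally keeping those containing a substring."""
        return [
            name for name in self._datasets
            if name_contains is None or name_contains in name
        ]

    def list(self, name_contains=None):
        """Deprecated alias that delegates to filter()."""
        warnings.warn(
            "list() is deprecated; use filter() instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this method
        )
        return self.filter(name_contains)


catalog = ToyCatalog({"cars": 1, "boats": 2, "car_model": 3})
print(catalog.filter("car"))  # ['cars', 'car_model']
```

Callers of the old name keep working and are nudged toward the new one, which leaves the removal itself for the major release.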

ElenaKhaustova · 2025-02-03T14:47:07Z

> Thanks @ElenaKhaustova, that's what I imagined. In my view, given that our users aren't necessarily software engineers, our APIs shouldn't just be perfect partitions of the set of possible use cases, but allow for some convenience methods that, as @datajoely says, alleviate the burden a bit.

Could you please clarify whether you mean keeping the existing list implementation?

ElenaKhaustova and others added 27 commits January 22, 2025 11:49

Fixed catalog list for KedroDataCatalog

c328842

Signed-off-by: Elena Khaustova <[email protected]>

Replaced solution

0e4080b

Signed-off-by: Elena Khaustova <[email protected]>

Updated solution and made it on the catalog side

bf09541

Signed-off-by: Elena Khaustova <[email protected]>

Updated internal datasets access for KedroDataCatalog

2ec76e9

Signed-off-by: Elena Khaustova <[email protected]>

Fixed __getattribute__

43cafca

Signed-off-by: Elena Khaustova <[email protected]>

Added test template

4972a0c

Signed-off-by: Elena Khaustova <[email protected]>

Updated solution and test

424eea6

Signed-off-by: Elena Khaustova <[email protected]>

Fixed linter

bd36f24

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'main' into fix/4436-catalog-list

b48a274

Updated release notes

93cbbb1

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'main' into fix/4436-catalog-list

b02f5c0

Implemented a draft of filtering method

329a56c

Signed-off-by: Elena Khaustova <[email protected]>

Updated filter

9839bd8

Signed-off-by: Elena Khaustova <[email protected]>

Fixed lint

1fc083f

Signed-off-by: Elena Khaustova <[email protected]>

Updated old list method

6fcd7f4

Signed-off-by: Elena Khaustova <[email protected]>

Implemented tests for new filter

13a0b2b

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'main' into feature/3917-refactor-catalog-filter

ede073f

Signed-off-by: Elena Khaustova <[email protected]>

Added tests for lazy datasets

9aa62f7

Signed-off-by: Elena Khaustova <[email protected]>

Added docstrings and usage examples

f4befbf

Signed-off-by: Elena Khaustova <[email protected]>

Updated examples in the docstrings

392309f

Signed-off-by: Elena Khaustova <[email protected]>

Updated lazy dataset representation

f80ebaa

Signed-off-by: Elena Khaustova <[email protected]>

Updated unit tests

156a0d3

Signed-off-by: Elena Khaustova <[email protected]>

Updated tests to reach coverage

cdd4c8b

Signed-off-by: Elena Khaustova <[email protected]>

Updated release notes

9ec854d

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'update-lazy-dataset-repr' into feature/3917-refactor-ca…

cbe02a3

…talog-filter Signed-off-by: Elena Khaustova <[email protected]>

Updated _LazyDataset representation

80da4e8

Signed-off-by: Elena Khaustova <[email protected]>

Updated release notes

b26302c

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova mentioned this pull request Jan 28, 2025

Update lazy dataset representation #4448

Merged

7 tasks

ElenaKhaustova marked this pull request as ready for review January 28, 2025 15:50

ElenaKhaustova requested a review from merelcht as a code owner January 28, 2025 15:50

ElenaKhaustova requested review from datajoely, DimedS, merelcht, lrcouto and ankatiyar January 28, 2025 15:50

Added default value to the docstrings

f07f0a2

Signed-off-by: Elena Khaustova <[email protected]>

merelcht approved these changes Jan 29, 2025

tests/io/test_kedro_data_catalog.py (resolved)

kedro/io/kedro_data_catalog.py (outdated, resolved)

ElenaKhaustova added 7 commits January 29, 2025 16:19

Renamed _compile_pattern to _compile_regex_pattern

3884abe

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'main' into update-lazy-dataset-repr

70f61a6

Updated release notes

d9a0f1a

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'update-lazy-dataset-repr' into feature/3917-refactor-ca…

03e303d

…talog-filter

Updated release notes

690f105

Signed-off-by: Elena Khaustova <[email protected]>

Updated secrets baseline

cbeec99

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'main' into feature/3917-refactor-catalog-filter

2c9891d

Signed-off-by: Elena Khaustova <[email protected]>

ankatiyar approved these changes Jan 30, 2025

ElenaKhaustova requested review from marrrcin, deepyaman, noklam, rashidakanchwala, ravi-kumar-pilla and SajidAlamQB January 31, 2025 16:16

marrrcin reviewed Feb 3, 2025