Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate code snippets in datasets documentation - Part 1 #1962

Merged
merged 43 commits into from
Nov 9, 2022

Conversation

AhdraMeraliQB
Copy link
Contributor

@AhdraMeraliQB AhdraMeraliQB commented Oct 21, 2022

Description

This follows on from #2287. In this PR I have tested all the datasets that can be instantiated locally. The following datasets remain to be tested in part 2:

  • dask.ParquetDataSet
  • holoviews.HoloviewsWriter
  • pandas.GBQQueryDataSet
  • pandas.GBQTableDataSet
  • pandas.SQLQueryDataSet
  • pandas.SQLTableDataSet
  • redis.PickleDataSet
  • all Spark datasets
  • tensorflow.TensorFlowModelDataset

The main changes in this PR are as follows:

  • Some code blocks had duplicate lines for providing filepath arguments. I believe this was to illustrate both local and external (S3/GCS) filesytems, but after discussion with @MerelTheisenQB we have agreed this is a bit redundant.
  • Some comments rendered with weird line breaks. This was fixed.
  • There was a comment for a dependency to be installed within an example code block. This has been moved out to extras_require in setup.py
  • Added missing imports

Also warranting a flag 🚩: GraphMLDataSet and GMLDataSet appear to be identical in purpose and in implementation

Development notes

Both the YAML and Python examples have been tested under the following setup:

  • New conda environment
  • pip install kedro
  • pip install kedro[all]
For Python:

kedro jupyter lab
Code snippets were copied and pasted over
An additional cell was added to check <dataset name>.exists() returned True

For YAML:

kedro new -s spaceflights
The dataset was added to the data catalog in conf/base/catalog.yml and added to the pipeline arbitrarily.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@@ -28,17 +28,13 @@ class JSONDataSet(AbstractVersionedDataSet[Any, Any]):
>>> json_dataset:
>>> type: json.JSONDataSet
>>> filepath: data/01_raw/location.json
>>> load_args:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!

noklam and others added 26 commits October 21, 2022 13:19
* Update release version and release notes

Signed-off-by: Nok Chan <[email protected]>

* Update missing release notes

Signed-off-by: Nok Chan <[email protected]>

* update vresion

Signed-off-by: Nok Chan <[email protected]>

* update release notes

Signed-off-by: Nok Chan <[email protected]>

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
* Fix code snippets

Signed-off-by: Ahdra Merali <[email protected]>

* Separate code blocks

Signed-off-by: Ahdra Merali <[email protected]>

* Lint

Signed-off-by: Ahdra Merali <[email protected]>

Signed-off-by: Ahdra Merali <[email protected]>
* Update RELEASE.md

* fix broken link

* Update RELEASE.md

Co-authored-by: Merel Theisen <[email protected]>

Co-authored-by: Merel Theisen <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
* Deprecating `kedro test` and `kedro lint`

Signed-off-by: Nok Chan <[email protected]>

* Deprecate commands

Signed-off-by: Nok Chan <[email protected]>

* Make kedro looks prettier

* Update Linting

Signed-off-by: Nok <[email protected]>

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Nok <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Florian Gaudin-Delrieu <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
…related issues better (#1881)

* Update message for VersionNotFoundError

Signed-off-by: Ankita Katiyar <[email protected]>

* Add test for VersionNotFoundError for cloud protocols

* Update test_data_catalog.py

Update NoVersionFoundError test

* minor linting update

* update docs link + styling changes

* Revert "update docs link + styling changes"

This reverts commit 6088e00.

* Update test with styling changes

* Update RELEASE.md

Signed-off-by: ankatiyar <[email protected]>

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: ankatiyar <[email protected]>
Co-authored-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
* Add best_practices.md with introductory sections

Signed-off-by: Jannic Holzer <[email protected]>

* Add pytest and pytest-cov sections

Signed-off-by: Jannic Holzer <[email protected]>

* Add pytest-cov coverage report

Signed-off-by: Jannic Holzer <[email protected]>

* Add sections on pytest-cov

Signed-off-by: Jannic Holzer <[email protected]>

* Add automated_testing to index.rst

Signed-off-by: Jannic Holzer <[email protected]>

* Reformat third-party library names and clean grammar.

Signed-off-by: Jannic Holzer <[email protected]>

* Add link to virtual environment docs

Signed-off-by: Jannic Holzer <[email protected]>

* Add example of good test naming

Signed-off-by: Jannic Holzer <[email protected]>

* Improve link accessibility

Signed-off-by: Jannic Holzer <[email protected]>

* Improve pytest docs link accessibility

Signed-off-by: Jannic Holzer <[email protected]>

* Add reminder link to virtual environment docs

Signed-off-by: Jannic Holzer <[email protected]>

* Fix formatting in link to coverage docs

Signed-off-by: Jannic Holzer <[email protected]>

* Remove reference to /src under 'Run your tests'

Signed-off-by: Jannic Holzer <[email protected]>

* Modify references to <project_name> to <package_name>

Signed-off-by: Jannic Holzer <[email protected]>

* Fix sentence structure

Signed-off-by: Jannic Holzer <[email protected]>

* Fix broken databricks doc link

Signed-off-by: Jannic Holzer <[email protected]>

Signed-off-by: Jannic Holzer <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
* Capitalised kedro-viz

Signed-off-by: yash6318 <[email protected]>

* capitalised Kedro viz

Signed-off-by: yash6318 <[email protected]>

* Updated set_up_experiment_tracking.md

Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: yash6318 <[email protected]>

Signed-off-by: yash6318 <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Carla Vieira <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
* Change a file extention to match the previous article

Signed-off-by: dinotuku <[email protected]>

* Add a missing import

Signed-off-by: dinotuku <[email protected]>

* Change both preprocessed datasets to parquet files

Signed-off-by: dinotuku <[email protected]>

* Change data type to ParquetDataSet for parquet files

Signed-off-by: dinotuku <[email protected]>

* Add a note for installing seaborn if it is not installed

Signed-off-by: dinotuku <[email protected]>

Signed-off-by: dinotuku <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
* Add documentation for linting tools

Signed-off-by: Ankita Katiyar <[email protected]>

* Revert changes to commands_reference.md

Signed-off-by: Ankita Katiyar <[email protected]>

* Update linting docs with suggestions

Signed-off-by: Ankita Katiyar <[email protected]>

* Update linting doc

Signed-off-by: Ankita Katiyar <[email protected]>

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
* Create dependabot.yml configuration file

* Update dependabot.yml

Signed-off-by: SajidAlamQB <[email protected]>

* add target-branch

Signed-off-by: SajidAlamQB <[email protected]>

* Update dependabot.yml

Signed-off-by: SajidAlamQB <[email protected]>

* limit dependabot to just dependency folder

Signed-off-by: SajidAlamQB <[email protected]>

* Update test_requirements.txt

Signed-off-by: SajidAlamQB <[email protected]>

* Update MANIFEST.in

Signed-off-by: SajidAlamQB <[email protected]>

* fix e2e

Signed-off-by: SajidAlamQB <[email protected]>

* Update continue_config.yml

Signed-off-by: SajidAlamQB <[email protected]>

* Update requirements.txt

Signed-off-by: SajidAlamQB <[email protected]>

* Update requirements.txt

Signed-off-by: SajidAlamQB <[email protected]>

* fix link

Signed-off-by: SajidAlamQB <[email protected]>

* revert

Signed-off-by: SajidAlamQB <[email protected]>

* Delete requirements.txt

Signed-off-by: SajidAlamQB <[email protected]>

Signed-off-by: SajidAlamQB <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
SajidAlamQB and others added 9 commits October 21, 2022 13:19
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
* Update dependabot.yml

Signed-off-by: SajidAlamQB <[email protected]>

* pin jupyterlab_services to requirments

Signed-off-by: SajidAlamQB <[email protected]>

* lint

Signed-off-by: SajidAlamQB <[email protected]>

Signed-off-by: SajidAlamQB <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Updates the requirements on [pip-tools](https://github.com/jazzband/pip-tools) to permit the latest version.
- [Release notes](https://github.com/jazzband/pip-tools/releases)
- [Changelog](https://github.com/jazzband/pip-tools/blob/master/CHANGELOG.md)
- [Commits](jazzband/pip-tools@6.5.0...6.9.0)

---
updated-dependencies:
- dependency-name: pip-tools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Ahdra Merali <[email protected]>
Updates the requirements on [toposort]() to permit the latest version.

---
updated-dependencies:
- dependency-name: toposort
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sajid Alam <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
* remove a redundant function call

Signed-off-by: Nok Chan <[email protected]>

* Remove redundant resolove_load_version & fix test

Signed-off-by: Nok Chan <[email protected]>

* Fix HoloviewWriter tests with more specific error message pattern & Lint

Signed-off-by: Nok Chan <[email protected]>

* Rename tests

Signed-off-by: Nok Chan <[email protected]>

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
@merelcht
Copy link
Member

I found the PR where the GraphMLDataSet and GMLDataSet were added: #881 It looks like both of them were added because "Included GraphML in addition to GML. Some tools prefer GraphML as it's able to perfectly preserve internal property maps (ref)."

So let's just keep them both.

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on going through all these examples and testing them! 👍 ⭐

@@ -28,17 +28,13 @@ class JSONDataSet(AbstractVersionedDataSet[Any, Any]):
>>> json_dataset:
>>> type: json.JSONDataSet
>>> filepath: data/01_raw/location.json
>>> load_args:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!

setup.py Outdated
@@ -78,6 +78,7 @@ def _collect_requirements(requires):
"pandas.XMLDataSet": [PANDAS, "lxml~=4.6"],
"pandas.GenericDataSet": [PANDAS],
}
pickle_require = {"pickle.PickleDataSet": ["compress-pickle~=2.1.0"]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original comment said: "# Add "compress_pickle[lz4]" to requirements.txt" Do we need to specify [lz4] for this requirement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - this is what the pip list showed after manually installing compress_pickle[lz4], I'll make the change to follow the dask example further up

@AhdraMeraliQB
Copy link
Contributor Author

@MerelTheisenQB

I found the PR where the GraphMLDataSet and GMLDataSet were added: #881 It looks like both of them were added because "Included GraphML in addition to GML. Some tools prefer GraphML as it's able to perfectly preserve internal property maps (ref)."

So let's just keep them both.

I took a look at the ref but couldn't see the preference; is it the name that they prefer? As the implementation is carbon copies 😬

Copy link
Contributor

@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@merelcht
Copy link
Member

I took a look at the ref but couldn't see the preference; is it the name that they prefer? As the implementation is carbon copies 😬

The difference between GraphML and GML (apparently) is that GraphML preserves internal property maps, but GML doesn't. See this comment in the docs: "The only file formats which are capable of perfectly preserving the internal property maps are “gt” and “graphml”. Because of this, they should be preferred over the other formats whenever possible." https://graph-tool.skewed.de/static/doc/graph_tool.html#graph_tool.load_graph

Perhaps the question is whether our users make use of both the datasets or not. It's not an extreme maintenance burden so I'd just keep them both.

This was referenced Nov 7, 2022
@merelcht merelcht mentioned this pull request Nov 7, 2022
5 tasks
@AhdraMeraliQB AhdraMeraliQB self-assigned this Nov 8, 2022
@AhdraMeraliQB AhdraMeraliQB merged commit f52249b into main Nov 9, 2022
@AhdraMeraliQB AhdraMeraliQB deleted the fix/check-code-snippets branch November 9, 2022 08:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.