- Added a `MatplotlibS3Writer` dataset in `contrib` for saving Matplotlib images to S3.
- `ParallelRunner` now works with `SparkDataSet`.
- Allowed the use of nulls in `parameters.yml`.
- Fixed an issue where `%reload_kedro` wasn't reloading all user modules.
- Fixed `pandas_to_spark` and `spark_to_pandas` decorators to work with functions with kwargs.
- Renamed entry point for running pip-installed projects to `run_package()` instead of `main()` in `src/<package>/run.py`.
- `kedro jupyter` now gives the default kernel a sensible name.
- `Pipeline.name` has been deprecated in favour of `Pipeline.tags`.
- Reuse pipelines within a Kedro project using `Pipeline.transform`, which simplifies dataset and node renaming (see the sketch after this list).
- Added Jupyter Notebook line magic (`%run_viz`) to run `kedro viz` in a Notebook cell (requires `kedro-viz` version `3.0.0` or later).
- Added the following datasets:
  - `NetworkXLocalDataSet` in `kedro.contrib.io.networkx` to load and save local graphs (JSON format) via NetworkX. (by @josephhaaga)
  - `SparkHiveDataSet` in `kedro.contrib.io.pyspark.SparkHiveDataSet` allowing usage of Spark and insert/upsert on non-transactional Hive tables.
- `kedro.contrib.config.TemplatedConfigLoader` now supports name/dict key templating and default values.
- `get_last_load_version()` method for versioned datasets now returns the exact last load version if the dataset has been loaded at least once, and `None` otherwise.
- Fixed a bug in the `_exists` method for versioned `SparkDataSet`.
- Enabled the customisation of the `ExcelWriter` in `ExcelLocalDataSet` by specifying options under the `writer` key in `save_args` (see the example after this list).
- Fixed a bug in the IPython startup script that attempted to load context from the incorrect location.
- Removed capping the length of a dataset's string representation.
- Fixed `kedro install` command failing on Windows if `src/requirements.txt` contains a different version of Kedro.
- Enabled passing a single tag into a node or a pipeline without having to wrap it in a list (i.e. `tags="my_tag"`).
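A minimal sketch of the `Pipeline.transform` reuse mentioned above. The `datasets` mapping and `prefix` arguments reflect our reading of the 0.15.x API and should be treated as assumptions, as are the function and dataset names:

```python
from kedro.pipeline import Pipeline, node


def preprocess(raw):
    return raw

# A small pipeline we want to reuse elsewhere in the project.
base = Pipeline([node(preprocess, "raw_data", "clean_data")])

# Assumed API: rename specific datasets via a mapping...
renamed = base.transform(datasets={"raw_data": "raw_data_2019"})

# ...or prefix dataset and node names so two copies can coexist.
prefixed = base.transform(prefix="reporting")
```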
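And an illustrative use of the `ExcelLocalDataSet` writer options, passing a `writer` dict through `save_args` as described above; the filepath and the `xlsxwriter` engine choice are assumptions:

```python
from kedro.io import ExcelLocalDataSet

# Options under the "writer" key configure the underlying ExcelWriter;
# the engine value here is only an example.
data_set = ExcelLocalDataSet(
    filepath="data/02_intermediate/trains.xlsx",
    save_args={"writer": {"engine": "xlsxwriter"}},
)
```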
- Removed `_check_paths_consistency()` method from `AbstractVersionedDataSet`. Version consistency check is now done in `AbstractVersionedDataSet.save()`. Custom versioned datasets should modify their `save()` method implementation accordingly.
Joseph Haaga, Deepyaman Datta, Joost Duisters, Zain Patel, Tom Vigrass
- Narrowed the requirements for `PyTables` so that we maintain support for Python 3.5.
- Added `--load-version`, a `kedro run` argument that allows you to run the pipeline with a particular load version of a dataset.
- Support for modular pipelines in `src/`, allowing you to break the pipeline into isolated parts with reusability in mind.
- Support for multiple pipelines: the ability to have multiple entry-point pipelines and choose one with `kedro run --pipeline NAME`.
- Added a `MatplotlibWriter` dataset in `contrib` for saving Matplotlib images.
- An ability to template/parameterize configuration files with `kedro.contrib.config.TemplatedConfigLoader` (see the sketch after this list).
- Parameters are exposed as a context property for ease of access in IPython / Jupyter Notebooks with `context.params`.
- Added `max_workers` parameter for `ParallelRunner`.
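As a hedged sketch of the templating feature above: assuming the loader accepts a dictionary of replacement values (the `globals_dict` name and the `${...}` placeholder syntax are assumptions), configuration files can then reference templated keys:

```python
from kedro.contrib.config import TemplatedConfigLoader

# Assumed constructor: conf_paths plus a dict whose entries fill in
# ${placeholders} found in the configuration files, e.g. a catalog.yml
# line such as `filepath: ${data_dir}/01_raw/trains.csv`.
config_loader = TemplatedConfigLoader(
    conf_paths=["conf/base", "conf/local"],
    globals_dict={"data_dir": "s3://my-bucket/data"},
)
catalog_config = config_loader.get("catalog*.yml")
```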
- Users will override the `_get_pipeline` abstract method in `ProjectContext(KedroContext)` in `run.py` rather than the `pipeline` abstract property. The `pipeline` property is no longer abstract.
- Improved the error message shown when a versioned local dataset is saved and an unversioned path already exists.
- Added `catalog` global variable to `00-kedro-init.py`, allowing you to load datasets with `catalog.load()`.
- Enabled tuples to be returned from a node.
- Disallowed the `ConfigLoader` loading the same file more than once, and deduplicated the `conf_paths` passed in.
- Added a `--open` flag to `kedro build-docs` that opens the documentation on build.
- Updated the `Pipeline` representation to include the name of the pipeline, also making it readable as a context property.
- `kedro.contrib.io.pyspark.SparkDataSet` and `kedro.contrib.io.azure.CSVBlobDataSet` now support versioning.
- `KedroContext.run()` no longer accepts `catalog` and `pipeline` arguments.
- `node.inputs` now returns the node's inputs in the order required to bind them properly to the node's function.
Deepyaman Datta, Luciano Issoe, Joost Duisters, Zain Patel, William Ashford, Karlson Lee
- Extended `versioning` support to cover the tracking of environment setup, code and datasets.
- Added the following datasets:
  - `FeatherLocalDataSet` in `contrib` for usage with Pandas. (by @mdomarsaleem)
- Added `get_last_load_version` and `get_last_save_version` to `AbstractVersionedDataSet`.
- Implemented `__call__` method on `Node` to allow users to execute `my_node(input1=1, input2=2)` as an alternative to `my_node.run(dict(input1=1, input2=2))` (see the runnable sketch after this list).
- Added new `--from-inputs` run argument.
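A runnable sketch of the `Node.__call__` addition above; the function and dataset names are made up:

```python
from kedro.pipeline import node


def add(input1, input2):
    return input1 + input2

my_node = node(add, inputs=["input1", "input2"], outputs="sum")

# Both forms execute the node with the given inputs; per the entry
# above, the call syntax is an alternative to run().
assert my_node(input1=1, input2=2) == my_node.run(dict(input1=1, input2=2))
```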
- Fixed a bug in `load_context()` not loading context in non-Kedro Jupyter Notebooks.
- Fixed a bug in `ConfigLoader.get()` not listing nested files for `**`-ending glob patterns.
- Fixed a logging config error in Jupyter Notebook.
- Updated documentation in `03_configuration` regarding how to modify the configuration path.
- Documented the architecture of Kedro, showing how we think about library, project and framework components.
- `extras/kedro_project_loader.py` renamed to `extras/ipython_loader.py` and now runs any IPython startup scripts without relying on the Kedro project structure.
- Fixed a `TypeError` when validating a partial function's signature.
- After a node failure during a pipeline run, a resume command will be suggested in the logs. This command will not work if the required inputs are `MemoryDataSet`s.
Omar Saleem, Mariana Silva, Anil Choudhary, Craig
- Added `KedroContext` base class which holds the configuration and Kedro's main functionality (catalog, pipeline, config, runner).
- Added a new CLI command `kedro jupyter convert` to facilitate converting Jupyter Notebook cells into Kedro nodes.
- Added support for `pip-compile` and new Kedro command `kedro build-reqs` that generates `requirements.txt` based on `requirements.in`.
- Running `kedro install` will install packages to a conda environment if `src/environment.yml` exists in your project.
- Added a new `--node` flag to `kedro run`, allowing users to run only the nodes with the specified names.
- Added new `--from-nodes` and `--to-nodes` run arguments, allowing users to run a range of nodes from the pipeline.
- Added prefix `params:` to the parameters specified in `parameters.yml`, which allows users to differentiate between their different parameter node inputs and outputs (see the sketch after this list).
- Jupyter Lab/Notebook now starts with only one kernel by default.
- Added the following datasets:
  - `CSVHTTPDataSet` to load CSV using HTTP(s) links.
  - `JSONBlobDataSet` to load json (-delimited) files from Azure Blob Storage.
  - `ParquetS3DataSet` in `contrib` for usage with Pandas. (by @mmchougule)
  - `CachedDataSet` in `contrib` which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. (by @tsanikgr)
  - `YAMLLocalDataSet` in `contrib` to load and save local YAML files. (by @Minyus)
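To illustrate the `params:` prefix mentioned in the list above (the node, dataset and parameter names are hypothetical):

```python
from kedro.pipeline import node


def train_model(training_data, learning_rate):
    ...

# "params:learning_rate" refers to the `learning_rate` key in
# parameters.yml, distinguishing it from regular catalog datasets.
model_node = node(
    train_model,
    inputs=["training_data", "params:learning_rate"],
    outputs="model",
)
```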
- Documentation improvements including instructions on how to initialise a Spark session using YAML configuration.
- `anyconfig` default log level changed from `INFO` to `WARNING`.
- Added information on installed plugins to `kedro info`.
- Added style sheets for project documentation, so the output of `kedro build-docs` will resemble the style of `kedro docs`.
- Simplified the Kedro template in `run.py` with the introduction of `KedroContext` class.
- Merged `FilepathVersionMixIn` and `S3VersionMixIn` under one abstract class `AbstractVersionedDataSet` which extends `AbstractDataSet`.
- `name` changed to be a keyword-only argument for `Pipeline`.
- `CSVLocalDataSet` no longer supports URLs. `CSVHTTPDataSet` supports URLs.
This guide assumes that:
- The framework-specific code has not been altered significantly.
- Your project-specific code is stored in the dedicated Python package under `src/`.
The breaking changes were introduced in the following project template files:
- `<project-name>/.ipython/profile_default/startup/00-kedro-init.py`
- `<project-name>/kedro_cli.py`
- `<project-name>/src/tests/test_run.py`
- `<project-name>/src/<package-name>/run.py`
- `<project-name>/.kedro.yml` (new file)
The easiest way to migrate your project from Kedro 0.14.* to Kedro 0.15.0 is to create a new project (by using `kedro new`) and move code and files bit by bit as suggested in the detailed guide below:

- Create a new project with the same name by running `kedro new`.
- Copy the following folders to the new project:
  - `results/`
  - `references/`
  - `notebooks/`
  - `logs/`
  - `data/`
  - `conf/`
- If you customised your `src/<package>/run.py`, make sure you apply the same customisations to `src/<package>/run.py` in the new project (a sketch of these overrides follows this list):
  - If you customised `get_config()`, you can override the `config_loader` property in your `ProjectContext` derived class.
  - If you customised `create_catalog()`, you can override the `catalog` property in your `ProjectContext` derived class.
  - If you customised `run()`, you can override the `run()` method in your `ProjectContext` derived class.
  - If you customised the default `env`, you can override it in your `ProjectContext` derived class or pass it at construction. By default, `env` is `local`.
  - If you customised the default `root_conf`, you can override the `CONF_ROOT` attribute in your `ProjectContext` derived class. By default, the `KedroContext` base class has its `CONF_ROOT` attribute set to `conf`.
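A sketch of the overrides described above. The `kedro.context` import path and the constructor signature are assumptions about the 0.15.0 API; the attribute values are placeholders:

```python
from kedro.context import KedroContext
from kedro.pipeline import Pipeline


class ProjectContext(KedroContext):
    # Replaces a customised root_conf; "conf" is the default.
    CONF_ROOT = "conf"

    def __init__(self, project_path, env="local"):
        # env can also be overridden here instead of at construction.
        super().__init__(project_path, env=env)

    @property
    def config_loader(self):
        # Replaces a customised get_config(); return your own loader here.
        return super().config_loader

    @property
    def pipeline(self) -> Pipeline:
        # Your project pipeline; an empty one keeps the sketch minimal.
        return Pipeline([])

    def run(self, *args, **kwargs):
        # Replaces a customised run(); add project-specific logic here.
        return super().run(*args, **kwargs)
```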
- The following syntax changes are introduced in ipython or Jupyter notebook/labs:
  - `proj_dir` -> `context.project_path`
  - `proj_name` -> `context.project_name`
  - `conf` -> `context.config_loader`
  - `io` -> `context.catalog` (e.g., `io.load()` -> `context.catalog.load()`)
- If you customised your `kedro_cli.py`, you need to apply the same customisations to your `kedro_cli.py` in the new project.
- Copy the contents of the old project's `src/requirements.txt` into the new project's `src/requirements.in` and, from the project root directory, run the `kedro build-reqs` command in your terminal window.
If you defined any custom dataset classes which support versioning in your project, you need to apply the following changes (see the sketch after this list):

- Make sure your dataset inherits from `AbstractVersionedDataSet` only.
- Call `super().__init__()` with the appropriate arguments in the dataset's `__init__`. If storing on local filesystem, providing the filepath and the version is enough. Otherwise, you should also pass in an `exists_function` and a `glob_function` that emulate `exists` and `glob` in a different filesystem (see `CSVS3DataSet` as an example).
- Remove setting of the `_filepath` and `_version` attributes in the dataset's `__init__`, as this is taken care of in the base abstract class.
- Any calls to `_get_load_path` and `_get_save_path` methods should take no arguments.
- Ensure you convert the output of `_get_load_path` and `_get_save_path` appropriately, as these now return `PurePath`s instead of strings.
- Make sure `_check_paths_consistency` is called with `PurePath`s as input arguments, instead of strings.
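A sketch of a local custom versioned dataset following those rules. The class name is hypothetical and the import paths, plus the exact `super().__init__()` signature, are assumptions about Kedro 0.15.0:

```python
from pathlib import Path, PurePath

import pandas as pd

from kedro.io import AbstractVersionedDataSet, Version


class MyCSVDataSet(AbstractVersionedDataSet):
    def __init__(self, filepath: str, version: Version = None):
        # Local filesystem: filepath and version suffice; _filepath and
        # _version are now set by the base class, not here.
        super().__init__(PurePath(filepath), version)

    def _load(self) -> pd.DataFrame:
        # _get_load_path takes no arguments and returns a PurePath,
        # so convert it before passing it on.
        return pd.read_csv(str(self._get_load_path()))

    def _save(self, data: pd.DataFrame) -> None:
        save_path = Path(str(self._get_save_path()))
        save_path.parent.mkdir(parents=True, exist_ok=True)
        data.to_csv(str(save_path), index=False)

    def _describe(self):
        return dict(filepath=self._filepath, version=self._version)
```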
These steps should have brought your project to Kedro 0.15.0. There might be some more minor tweaks needed as every project is unique, but now you have a pretty solid base to work with. If you run into any problems, please consult the Kedro documentation.
Dmitry Vukolov, Jo Stichbury, Angus Williams, Deepyaman Datta, Mayur Chougule, Marat Kopytjuk, Evan Miller, Yusuke Minami
- Tab completion for catalog datasets in `ipython` or `jupyter` sessions. (Thank you @datajoely and @WaylonWalker)
- Added support for transcoding, an ability to decouple loading/saving mechanisms of a dataset from its storage location, denoted by adding '@' to the dataset name (see the sketch after this list).
- Datasets have a new `release` function that instructs them to free any cached data. The runners will call this when the dataset is no longer needed downstream.
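To make the transcoding notation concrete, here is a hedged sketch: two catalog entries share one storage location and differ only in the suffix after '@'. The dataset names, path and the choice of dataset classes are assumptions:

```python
from kedro.contrib.io.pyspark import SparkDataSet
from kedro.io import CSVLocalDataSet, DataCatalog

# "trains@pandas" and "trains@spark" point at the same file; nodes
# choose a loading/saving mechanism by referencing one or the other.
catalog = DataCatalog(
    {
        "trains@pandas": CSVLocalDataSet(filepath="data/trains.csv"),
        "trains@spark": SparkDataSet(filepath="data/trains.csv", file_format="csv"),
    }
)
```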
- Add support for pipeline nodes made up from partial functions (see the sketch after this list).
- Expand user home directory `~` for `TextLocalDataSet` (see issue #19).
- Add a `short_name` property to `Node`s for a display-friendly (but not necessarily unique) name.
- Add Kedro project loader for IPython: `extras/kedro_project_loader.py`.
- Fix source file encoding issues with Python 3.5 on Windows.
- Fix local project source not having priority over the same source installed as a package, leading to local updates not being recognised.
- Remove the `max_loads` argument from the `MemoryDataSet` constructor and from the `AbstractRunner.create_default_data_set` method.
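A small sketch of a node built from a partial function, as mentioned in the list above; the function, dataset and node names are illustrative:

```python
from functools import partial

from kedro.pipeline import node


def scale(data, factor):
    return data * factor

# Binding `factor` up front leaves a single-argument callable that the
# node can map onto one input dataset; an explicit name is given since
# partials carry no __name__ of their own.
scale_node = node(
    partial(scale, factor=2), inputs="raw", outputs="scaled", name="scale_node"
)
```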
Joel Schwarzmann, Alex Kalmikov
- Added Data Set transformer support in the form of `AbstractTransformer` and `DataCatalog.add_transformer`.
- Merged the `ExistsMixin` into `AbstractDataSet`.
- `Pipeline.node_dependencies` returns a dictionary keyed by node, with sets of parent nodes as values; `Pipeline` and `ParallelRunner` were refactored to make use of this for topological sort for node dependency resolution and running pipelines respectively.
- `Pipeline.grouped_nodes` returns a list of sets, rather than a list of lists.
- New I/O module `HDFS3DataSet`.
- Improved API docs.
- Template `run.py` will throw a warning instead of error if `credentials.yml` is not present.
None
The initial release of Kedro.
Jo Stichbury, Aris Valtazanos, Fabian Peters, Guilherme Braccialli, Joel Schwarzmann, Miguel Beltre, Mohammed ElNabawy, Deepyaman Datta, Shubham Agrawal, Oleg Andreyev, Mayur Chougule, William Ashford, Ed Cannon, Nikhilesh Nukala, Sean Bailey, Vikram Tegginamath, Thomas Huijskens, Musa Bilal
We are also grateful to everyone who advised and supported us, filed issues or helped resolve them, asked and answered questions and were part of inspiring discussions.