This is a completed version of the spaceflights tutorial project, including the extra tutorial sections on visualisation with Kedro-Viz. The project includes the data required to run it, and the code in this repository demonstrates best practice when working with Kedro and PySpark.
To create a project based on this starter, ensure you have installed Kedro into a virtual environment. Then use the following commands:

```bash
pip install kedro
kedro new --starter=spaceflights-pyspark-viz
```
After the project is created, navigate to the newly created project directory:

```bash
cd <my-project-name>  # change directory
```
Install the required dependencies:

```bash
pip install -r requirements.txt
```
Now you can run the project:

```bash
kedro run
```
While Spark allows you to specify many different configuration options, this starter uses `conf/base/spark.yml` as a single configuration location.
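For reference, here is a minimal sketch of what `conf/base/spark.yml` could contain, assuming you only want to set a few common Spark properties; the keys below are illustrative and the file in your generated project may differ:

```yaml
# Illustrative Spark properties; adjust or remove to suit your environment.
spark.driver.maxResultSize: 3g
spark.sql.execution.arrow.pyspark.enabled: true
spark.scheduler.mode: FAIR
```

Each key-value pair is passed to Spark as a configuration property when the session is created.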
This Kedro starter contains the initialisation code for `SparkSession` in `hooks.py` and takes its configuration from `conf/base/spark.yml`. Modify the `SparkHooks` code if you want to further customise your `SparkSession`, e.g. to use YARN.
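As a rough sketch (the hook shipped with the starter may differ in detail), such a hook typically reads the `spark.yml` entries through the project's config loader and builds a `SparkSession` from them:

```python
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession using the config in conf/base/spark.yml."""
        # Load the Spark options registered under the "spark" config pattern
        # (this assumes the pattern is declared in settings.py, as in the starter).
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # Build (or reuse) the session; customise the builder chain here,
        # e.g. add .master("yarn") to run on YARN.
        spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")
```

Alternatively, you can set `spark.master: yarn` directly in `spark.yml` rather than editing the builder chain.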
In some cases it can be desirable to handle one dataset in different ways, for example to load a parquet file into your pipeline using `pandas` and to save it using `spark`. In this starter, one of the input datasets, `shuttles`, is an Excel file. It's not possible to load an Excel file directly into Spark, so we use transcoding to save the file as a `pandas.CSVDataset` first, which then allows us to load it as a `spark.SparkDataset` further on in the pipeline.
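In the Data Catalog, transcoding is expressed by registering the same file under names that share a base and differ only in the suffix after `@`. Below is a sketch of what such entries might look like in `conf/base/catalog.yml`; the dataset names, file paths and load arguments are illustrative, so check the generated catalog for the exact entries:

```yaml
# Raw Excel input, read with pandas (Spark cannot read Excel directly).
shuttles:
  type: pandas.ExcelDataset
  filepath: data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl

# Transcoded intermediate dataset: written as CSV with pandas...
preprocessed_shuttles@pandas:
  type: pandas.CSVDataset
  filepath: data/02_intermediate/preprocessed_shuttles.csv

# ...and read back with Spark further down the pipeline.
preprocessed_shuttles@spark:
  type: spark.SparkDataset
  filepath: data/02_intermediate/preprocessed_shuttles.csv
  file_format: csv
  load_args:
    header: true
    inferSchema: true
```

Because both `@pandas` and `@spark` entries point at the same CSV file, nodes can save with one library and load with the other while Kedro still treats them as a single logical dataset in the pipeline.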