This repository has been archived by the owner on Jun 30, 2022. It is now read-only.
Version 0.2.1
The 0.2.1 release includes the following changes:
- Optimized performance for the following features:
  - Logging
  - Shuffle Writing
  - Using Coders
  - Compiling some of the worker modules with Cython
- Changed the default behavior for Cloud execution: Instead of downloading the SDK from a Cloud Storage bucket, you now download the SDK as a tarball from GitHub. When you run jobs using the Dataflow service, the SDK version used matches the version you've downloaded to your local environment. To override this behavior, use the --sdk_location pipeline option to provide an explicit tarball location (a Cloud Storage path or a URL).
- Fixed several pickling issues related to how Dataflow serializes user functions and data.
- Fixed several worker lease expiration issues experienced when processing large datasets.
- Improved validation to detect various common errors, such as access issues and invalid parameter combinations, much earlier.
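
The --sdk_location override described above can be sketched as a small helper that appends the option to a pipeline's command-line arguments. This is only an illustration under stated assumptions: `build_pipeline_args` is a hypothetical helper, not part of the SDK; the bucket path and the other option are placeholders; and the scheme check mirrors only the two location forms these notes mention (a Cloud Storage path or a URL), not the SDK's actual validation.

```python
from typing import List, Optional

# Location prefixes accepted by this sketch. Assumption: these cover the
# "Cloud Storage path or URL" forms named in the release notes.
_ALLOWED_PREFIXES = ("gs://", "http://", "https://")


def build_pipeline_args(base_args: List[str],
                        sdk_location: Optional[str] = None) -> List[str]:
    """Return a copy of base_args, adding --sdk_location when overridden.

    Hypothetical helper for illustration only; it is not an SDK function.
    """
    args = list(base_args)  # copy so the caller's list is not mutated
    if sdk_location is None:
        # No override: the default behavior from the notes applies.
        return args
    if not sdk_location.startswith(_ALLOWED_PREFIXES):
        # Fail early on a location this sketch does not recognize.
        raise ValueError("unsupported SDK location: %r" % sdk_location)
    args.append("--sdk_location=" + sdk_location)
    return args


if __name__ == "__main__":
    # Placeholder option and bucket path, not real resources.
    print(build_pipeline_args(["--some_option=value"],
                              "gs://example-bucket/sdk.tar.gz"))
```

The resulting list would then be passed to the pipeline in the same way as any other command-line options.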