Skip to content

Commit

Permalink
Merge pull request #33630 Expand documentation about YAML providers.
Browse files Browse the repository at this point in the history
  • Loading branch information
robertwb authored Jan 17, 2025
2 parents edc4766 + 16400af commit dd16104
Show file tree
Hide file tree
Showing 3 changed files with 110 additions and 68 deletions.
109 changes: 109 additions & 0 deletions website/www/site/content/en/documentation/sdks/yaml-providers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
---
type: languages
title: "Apache Beam YAML Providers"
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Providers

Though we aim to offer a large suite of built-in transforms, it is inevitable
that people will need to author their own. This is made possible
through the notion of Providers which leverage expansion services and
vend catalogues of schema transforms.

## Java

For example, you could build a jar that vends a
[cross language transform](https://beam.apache.org/documentation/sdks/python-multi-language-pipelines/)
or [schema transform](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.html)
and then use it in a transform as follows

```
pipeline:
type: chain
source:
type: ReadFromCsv
config:
path: /path/to/input*.csv
transforms:
- type: MyCustomTransform
config:
arg: whatever
sink:
type: WriteToJson
config:
path: /path/to/output.json
providers:
- type: javaJar
config:
jar: /path/or/url/to/myExpansionService.jar
transforms:
MyCustomTransform: "urn:registered:in:expansion:service"
```

A full example of how to build a java provider can be found
[here](https://github.com/apache/beam-starter-java-provider).

## Python

Arbitrary Python transforms can be provided as well, using the syntax

```
providers:
- type: pythonPackage
config:
packages:
- my_pypi_package>=version
- /path/to/local/package.zip
transforms:
MyCustomTransform: "pkg.module.PTransformClassOrCallable"
```

We offer a [python provider starter project](https://github.com/apache/beam-starter-python-provider)
that serves as a complete example for how to do this.

## YAML Provider listing files

One can reference an external listings of providers in the yaml pipeline file
via the syntax

```
providers:
- include: "file:///path/to/local/providers.yaml"
- include: "gs://path/to/remote/providers.yaml"
- include: "https://example.com/hosted/providers.yaml"
...
```

where `providers.yaml` is simply a yaml file containing a list of providers
in the same format as those inlined in this providers block.
See, for example, the provider listing [here](
https://github.com/apache/beam-starter-python-provider/blob/main/examples/provider_listing.yaml).

In fact, this is how many of the the built in transforms are declared,
see for example the [builtin io listing file](
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/standard_io.yaml).

Hosting these listing files (together with their required artifacts) allows
one to easily share catalogues of transforms that can be directly used
by others in their YAML pipelines.
68 changes: 0 additions & 68 deletions website/www/site/content/en/documentation/sdks/yaml.md
Original file line number Diff line number Diff line change
Expand Up @@ -623,74 +623,6 @@ options:
streaming: true
```
## Providers
Though we aim to offer a large suite of built-in transforms, it is inevitable
that people will want to author their own. This is made possible
through the notion of Providers which leverage expansion services and
schema transforms.
For example, you could build a jar that vends a
[cross language transform](https://beam.apache.org/documentation/sdks/python-multi-language-pipelines/)
or [schema transform](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.html)
and then use it in a transform as follows
```
pipeline:
type: chain
source:
type: ReadFromCsv
config:
path: /path/to/input*.csv

transforms:
- type: MyCustomTransform
config:
arg: whatever

sink:
type: WriteToJson
config:
path: /path/to/output.json

providers:
- type: javaJar
config:
jar: /path/or/url/to/myExpansionService.jar
transforms:
MyCustomTransform: "urn:registered:in:expansion:service"
```
A full example of how to build a java provider can be found
[here](https://github.com/Polber/beam-yaml-xlang).
Arbitrary Python transforms can be provided as well, using the syntax
```
providers:
- type: pythonPackage
config:
packages:
- my_pypi_package>=version
- /path/to/local/package.zip
transforms:
MyCustomTransform: "pkg.subpkg.PTransformClassOrCallable"
```
One can additionally reference an external listings of providers as follows
```
providers:
- include: "file:///path/to/local/providers.yaml"
- include: "gs://path/to/remote/providers.yaml"
- include: "https://example.com/hosted/providers.yaml"
...
```
where `providers.yaml` is simply a yaml file containing a list of providers
in the same format as those inlined in this providers block.
## Pipeline options
[Pipeline options](https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,7 @@
<li><a href="/documentation/sdks/yaml-combine/">Yaml Aggregation</a></li>
<li><a href="/documentation/sdks/yaml-errors/">Error handling</a></li>
<li><a href="/documentation/sdks/yaml-inline-python/">Inlining Python</a></li>
<li><a href="/documentation/sdks/yaml-providers/">Yaml Providers</a></li>
<li><a href="/documentation/sdks/yaml-join/">Yaml Join</a></li>
<li><a href="https://beam.apache.org/releases/yamldoc/current/" target="_blank">YAML API reference <img src="/images/external-link-icon.png"
width="14" height="14"
Expand Down

0 comments on commit dd16104

Please sign in to comment.