You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A bug was found that if a user is converting arbitrary byte sequences to strings, to get around being unable to use []byte as a key to a map. This leads to these strings to sometimes be non-UTF8 compliant, which will break on encoding/decoding.
Eg. Converting the byte sequences like [2 208 15] or [2 239 191 189 15] to strings simply can't be round-tripped correctly as JSON, so the encoded and decoded values do not match.
The check would be to recursively examine every exported field in a structural DoFn for use of string, and checking if it's utf8 compliant. The check could be skipped for subtypes that implement the MarshalJSON and UnmarshalJSON interface methods.
The vet runner which can be electively run before any given pipeline with the --beam_strict flag would be the appropriate place to add this sort of checking to avoid more expensive checks 100% of the time.
A complete fix would also add documentation to the website and GoDoc around the JSON encoding of DoFns, in particular calling out this issue (that is, emphasizing strings must be UTF8).
Issue Priority
Priority: 3 (nice-to-have improvement)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner
The text was updated successfully, but these errors were encountered:
@tobehardest Done! In the future, you can self assign an issue by commenting .take-issue and a bot will handle it. See the Beam contribution guide for more! https://beam.apache.org/contribute
What needs to happen?
A bug was found that if a user is converting arbitrary byte sequences to strings, to get around being unable to use
[]byte
as a key to a map. This leads to these strings to sometimes be non-UTF8 compliant, which will break on encoding/decoding.Eg. Converting the byte sequences like [2 208 15] or [2 239 191 189 15] to strings simply can't be round-tripped correctly as JSON, so the encoded and decoded values do not match.
The check would be to recursively examine every exported field in a structural DoFn for use of
string
, and checking if it's utf8 compliant. The check could be skipped for subtypes that implement the MarshalJSON and UnmarshalJSON interface methods.The vet runner which can be electively run before any given pipeline with the
--beam_strict
flag would be the appropriate place to add this sort of checking to avoid more expensive checks 100% of the time.A complete fix would also add documentation to the website and GoDoc around the JSON encoding of DoFns, in particular calling out this issue (that is, emphasizing strings must be UTF8).
Issue Priority
Priority: 3 (nice-to-have improvement)
Issue Components
The text was updated successfully, but these errors were encountered: