Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task][Go SDK]: Add String UTF8 check to vet runner for serialization. #24949

Closed
1 of 15 tasks
lostluck opened this issue Jan 9, 2023 · 4 comments · Fixed by #33949
Closed
1 of 15 tasks

[Task][Go SDK]: Add String UTF8 check to vet runner for serialization. #24949

lostluck opened this issue Jan 9, 2023 · 4 comments · Fixed by #33949

Comments

@lostluck
Copy link
Contributor

lostluck commented Jan 9, 2023

What needs to happen?

A bug was found that if a user is converting arbitrary byte sequences to strings, to get around being unable to use []byte as a key to a map. This leads to these strings to sometimes be non-UTF8 compliant, which will break on encoding/decoding.

Eg. Converting the byte sequences like [2 208 15] or [2 239 191 189 15] to strings simply can't be round-tripped correctly as JSON, so the encoded and decoded values do not match.

The check would be to recursively examine every exported field in a structural DoFn for use of string, and checking if it's utf8 compliant. The check could be skipped for subtypes that implement the MarshalJSON and UnmarshalJSON interface methods.

The vet runner which can be electively run before any given pipeline with the --beam_strict flag would be the appropriate place to add this sort of checking to avoid more expensive checks 100% of the time.


A complete fix would also add documentation to the website and GoDoc around the JSON encoding of DoFns, in particular calling out this issue (that is, emphasizing strings must be UTF8).

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@tobehardest
Copy link

Can you assign this task to me? I want to try.

@lostluck
Copy link
Contributor Author

@tobehardest Done! In the future, you can self assign an issue by commenting .take-issue and a bot will handle it. See the Beam contribution guide for more! https://beam.apache.org/contribute

@mohamedawnallah
Copy link
Contributor

It seems this issue has been inactive for a while, and no one has submitted a PR to address it. I'd like to take it on. 🙏

@mohamedawnallah
Copy link
Contributor

.take-issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants