Clarify uses of "transformation," "relation," "operation," and "operator" in the spec? #762

Open
ingomueller-net opened this issue Jan 6, 2025 · 0 comments

I realized that some terminology isn't as consistent as it could be in the spec. Let us first acknowledge that the following two concepts are central to the spec and distinct from each other:

  • The (type of the) data Substrait plans work on. Possible terms: dataset, table, relation, ...
  • The computations Substrait plans do. Possible terms: transformations, (relational) operators, (relational) operations, relations, ...
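
To make the distinction between the two concepts concrete, here is a minimal sketch in Python. It is not Substrait's API or protobuf schema; the names `Dataset`, `Operator`, `filter_op`, and `project_op` are hypothetical and chosen only for illustration. The data is modeled as a value, and the computations are modeled as functions from data to data.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Row = Tuple[object, ...]


@dataclass(frozen=True)
class Dataset:
    """Concept 1, the data: column names plus rows (hypothetical type)."""
    columns: Tuple[str, ...]
    rows: Tuple[Row, ...]


# Concept 2, the computation: something that maps a dataset to a dataset.
# (Assumption: single-input operators only, to keep the sketch short.)
Operator = Callable[[Dataset], Dataset]


def filter_op(predicate: Callable[[Row], bool]) -> Operator:
    """Build an operator that keeps only the rows satisfying `predicate`."""
    def apply(ds: Dataset) -> Dataset:
        return Dataset(ds.columns, tuple(r for r in ds.rows if predicate(r)))
    return apply


def project_op(keep: List[str]) -> Operator:
    """Build an operator that keeps only the named columns, in the given order."""
    def apply(ds: Dataset) -> Dataset:
        idx = [ds.columns.index(c) for c in keep]
        return Dataset(tuple(keep), tuple(tuple(r[i] for i in idx) for r in ds.rows))
    return apply


# A "plan" in this sketch is just a composition of operators applied to data.
people = Dataset(("name", "age"), (("ada", 36), ("bob", 17)))
adults = project_op(["name"])(filter_op(lambda r: r[1] >= 18)(people))
```

In this sketch, `people` and `adults` are instances of the first concept, while the values returned by `filter_op(...)` and `project_op(...)` are instances of the second. Whichever terms the spec settles on, each one should map to exactly one of these two roles.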

For example, the doc on "relation basics" says (annotation is mine):

Substrait is designed to allow a user to describe arbitrarily complex data transformations. These transformations are composed of one or more relational operations. Relational operations are well-defined transformation operations that work by taking zero or more input datasets and transforming them into zero or more output transformations [should be: datasets].

I think several points are less than perfect:

  • The spec uses too many terms for the computations. These few lines alone contain "transformations," "data transformations," "relational operations," and "transformation operations." Beyond that, "relations" is the core concept of the spec, yet "relational operators" is the section heading in the "basics" section, and the headings of the individual relations mix "operator" and "operation" (e.g., "Aggregate Operation" but "Reference Operator").
  • The spec uses "relation" for the computation -- whereas in all other places that I know, the word "relation" refers to the data.

I suggest we clean up the spec to make things easier to understand. What are people's preferred terms for the two concepts?

I feel pretty strongly about not using "relation" for the computation, but I expect a lot of headwind against changing that term at this point and have mixed feelings myself: the term has made it into the protobuf definition, from which we can't remove it, and using it there but not in the prose would itself be a source of potential confusion.
