Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TaxonomyParser schema version behavior not intuitive #44

Closed
khaledsulayman opened this issue Aug 6, 2024 · 5 comments
Closed

TaxonomyParser schema version behavior not intuitive #44

khaledsulayman opened this issue Aug 6, 2024 · 5 comments

Comments

@khaledsulayman
Copy link
Member

Caught this while reviewing instructlab/sdg#62. They initialize the parser using version=0, which allows the parser to read the version from the taxonomy file, however, this behavior is not documented in the class docstring here. Instead, it is just listed as int | None, where the None behavior simply uses the latest version. This behavior should at least be described more clearly in the doctoring.

Furthermore, I think the use of None to denote the latest version is misleading. I would suggest switching to version tags, so users can specify 'v1', 'v2', ..., 'latest', with 'latest' as the default, and then users can pass None to have the parser detect version from the yaml file.

@bjhargrave
Copy link
Contributor

Does

schema_version (int | None, optional): The version of the Taxonomy schema.
Specifying a version less than 1 will use the schema version specified by each YAML document's "version" key.
The highest schema version is used if None is specified.

not count as documentation?

Furthermore, I think the use of None to denote the latest version is misleading. I would suggest switching to version tags, so users can specify 'v1', 'v2', ..., 'latest', with 'latest' as the default, and then users can pass None to have the parser detect version from the yaml file.

Schema versions are ints. So I do not think using tag strings in the API is helpful. 'latest' is also not a tag string. Also consider that this API is used by taxonomy repo's check_yaml script (instructlab/taxonomy#1259) which generally will want to use the latest version and None is common for specifying or defaulting and used as such by that script.

@khaledsulayman
Copy link
Member Author

Ah, thanks, I'm not sure how I missed that. So it is documented.

I'm still not a fan of using None to mean 'latest' and < 1 to mean 'parse the yaml'. I'm not married to the version tags idea, but I think it might be worth coming up with something a bit more intuitive.

I'll think on this and share the issue with folks to see if we can get other ideas or overall thoughts on this.

@khaledsulayman khaledsulayman changed the title TaxonomyParser schema version behavior not clear/documented TaxonomyParser schema version behavior not intuitive Aug 6, 2024
@nathan-weinberg
Copy link
Member

Could we introduce some sort of mapping layer here, where a user can pass either a raw int or string and if the latter if either maps to the appropriate non-string value or fails if no mapping exists?

Could provide a better UX without needing to rework much of the existing logic.

@bjhargrave
Copy link
Contributor

Could we introduce some sort of mapping layer here, where a user can pass either a raw int or string and if the latter if either maps to the appropriate non-string value or fails if no mapping exists?

Not sure I understand.

The schema_version parameter is of type int. If you want version 1,2 or 3, you pass in 1, 2, or 3. If you want to latest schema version, i.e. taxonomy repo PR checking, you pass in some value we will call "X". If you want to use the schema version as stated in the yaml file, i.e. SDG, you pass in some value we will call "Y". So now we are discussing what is "X" and what is "Y". As the code now stands, "X" is None which is the default argument value, and "Y" is a non-positive integer (this is the set of invalid schema versions). Of course we can discuss swapping X and Y or whether the default for schema_version is 0 instead of the current None.

But I am not sure I see any utility for adding support for a str type argument with some code to "map" the str value into an int value or None.

Finally, this API (TaxonomyParser) is going to be used by a vanishingly small set of programmers. We can probably count the number on 1 hand :-) Right now it is the check_yaml script in the taxonomy repo, the taxonomy diff command, and the data generate command. Given there are likely to be so few users, we probably do not need to spend a large amount of time on getting this API "right" :-) It has to support the capabilities we need and be properly documented. I think both of those things are currently true.

@khaledsulayman
Copy link
Member Author

Thanks, BJ! I see your point here, might not be worth devoting cycles to this. I'll close this issue for now and if we find that it's a problem in the future we can revisit.

@khaledsulayman khaledsulayman closed this as not planned Won't fix, can't repro, duplicate, stale Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants