Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Central schema yaml #89

Merged
merged 10 commits into from
Dec 6, 2023

Conversation

stuartmcalpine
Copy link
Collaborator

The schema information is replicated in three places

  • When creating the schema using the script
  • Describing it in the docs
  • The command line args for the CLI

So we don't have to replicate all the entries each time in these three cases, I've put the schema in a yaml file, which all three sections above can read.

This tidies the code a lot, but also when we add a new row in a table we only have to do it in one place.

This contains all the tables and row entries in a central location.

Now the schema creation script, the CLI and the docs can use this, so we don't have to worry about missing somewhere when adding a new entry to a table.
@stuartmcalpine stuartmcalpine changed the base branch from main to u/stuart/database_generator December 2, 2023 15:23
@stuartmcalpine
Copy link
Collaborator Author

stuartmcalpine commented Dec 2, 2023

  • More CI test with CLI registering datasets, e.g., DateTime

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change all those occurrences of "row" (the ones I've marked and anything similar) to "column". That's the main thing. There are also one or two very minor things.

schema_yaml = load_schema()


def _get_rows(schema, table):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name should be _get_columns, or _get_column_definitions rather than _get_rows. There are no rows here, only column names and definitions. "Rows" normally refers to rows of data within a table. It's true that the properties of a column can be thought of as a row in a table, but, unless there is some extra explanatory text, that is not the first thing people will think of.

All of the remaining review comments for create_registry_db.py make essentially the same point. I see now that the code being replaced had the same confusing (to me) nomenclature; I apologize for not catching this earlier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've captured all instances of row -> column now

scripts/create_registry_db.py Outdated Show resolved Hide resolved
scripts/create_registry_db.py Outdated Show resolved Hide resolved
src/cli/cli.py Outdated Show resolved Hide resolved
src/dataregistry/schema/schema.yaml Outdated Show resolved Hide resolved
foreign_key_table: "dataset"
foreign_key_row: "dataset_id"
description: "Dataset this alias is linked to"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment here that, for any given row, exactly one of input_id and input_production_id is non-null

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case something like this comes up again: I was actually thinking of a yaml comment preceding the table definition, e.g.
# For each row in this table, exactly one of input_id and input_production_id is non-null
but the way you've done it is perfectly ok, perhaps preferable.

@stuartmcalpine
Copy link
Collaborator Author

stuartmcalpine commented Dec 6, 2023

I've changed all the instances of row->column where needed.

I've tweaked the register CLI a bit to make sure it's now able to accept all the properties (it was missing creation date before for example).

Also the only other change (other than your review comments) is in the last commit, where I reduce the register doc strings (which were pretty large). Basically any input args that are column names have a ** to refer to the docs for their description. The desciption of what the column represents doesn't help the code. Plus in the spirit of this PR its to reduce places in the code where these descriptions are duplicated.

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

foreign_key_table: "dataset"
foreign_key_row: "dataset_id"
description: "Dataset this alias is linked to"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case something like this comes up again: I was actually thinking of a yaml comment preceding the table definition, e.g.
# For each row in this table, exactly one of input_id and input_production_id is non-null
but the way you've done it is perfectly ok, perhaps preferable.

@stuartmcalpine stuartmcalpine merged commit 6a6635c into u/stuart/database_generator Dec 6, 2023
8 checks passed
@stuartmcalpine stuartmcalpine deleted the u/stuart/central_schema branch December 6, 2023 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants