Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Link schemas #170

Merged
merged 16 commits into from
Jan 16, 2025
Merged

Link schemas #170

merged 16 commits into from
Jan 16, 2025

Conversation

stuartmcalpine
Copy link
Collaborator

@stuartmcalpine stuartmcalpine commented Dec 13, 2024

Have the working and production schemas work more seamlessly together.

Now when connecting to the registry the user only specifies the working schema. Through the provenance table, the associated production schema is automatically known.

The fundamental operations are the same for registering/modifying/deleting entries. For production entries, when initiating the DataRegtistry instance, users must flag production_mode=True to set the production schema to be the target (the specified schema during initiation is always the "working" one).

Now when querying both schemas (working and production) are searched. This is done through two independent queries to each schema and the results concatenated. For this reason I've removed the sqlalchemy CursorResult as a return option, as there is no real easy way to combine them. Now the only returns are the property_dict and the dataframe.

Removed TableMetadata class, put the reflection and information within the DbConnection object.

Thoughts

  • Change instances of "schema" to "database" (in terms of wording/reference)? More familiar sounding, and now the idea of working with specific schemas is removed and pushed behind the scenes.
  • Have a flag to ignore searching the production schema during queries?

Todo

  • Fix test_query_between_columns test
  • Fix test_register_dataset_alias test
  • Update docs
  • Fix Sqlite tests
  • FIx get_db_versioning in query.py
  • Fix CLI queries

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some comments associated with particular lines, and I haven't yet looked in detail at all the files under dataregistry_cli or tests, though I expect they're probably ok, except for the code to test aliases. If the tests (which I originally wrote!) really exercised everything, something should have failed.
Among the alias routines, resolve_alias and find_aliasesat least need some revision. The calls to _render_filter are missing the new schema argument. We need to decide how to handle the possibility of two schemas. Either for aliases we can do the same as is now done for datasets (combine the results of two queries) or else add a schema argument. In the latter case, should it have a default? I'm leaning towards "no", because it doesn't seem like it has a natural default, but I could be convinced otherwise.

results = conn.execute(stmt)
r = results.fetchone()
if r is None:
warnings.warn(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of issuing a warning, why even make the query when called in the context of database creation? In the old way of doing things, there was the get_db_version flag to indicate whether or not the query should be made. Is it possible to do something similar here? One way might be to add another argument creation_mode defaulting to False when setting up a DbConnection. It would be set to True when create_registry_schema.py creates connections. (I haven't thought this through that carefully but I think it would do.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a creation_mode flag for the DbConnection which skips querying the provenance table during reflection. Now if nothing is found in the provenance table, an exception is raised rather than a warning.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Please add a comment for creation_mode under "Parameters" in the docstring for DbConnection.init

src/dataregistry/db_basic.py Outdated Show resolved Hide resolved
src/dataregistry/db_basic.py Show resolved Hide resolved
src/dataregistry/query.py Outdated Show resolved Hide resolved
src/dataregistry/query.py Show resolved Hide resolved
src/dataregistry/query.py Show resolved Hide resolved
src/dataregistry/query.py Outdated Show resolved Hide resolved
@stuartmcalpine
Copy link
Collaborator Author

I will move the renaming instances of "schema" to "database" to a new PR, and update the docs to reflect

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made some minor suggestions inline. Concerning resolve_alias, I think we need to provide a way to look in the production schema. I suggest for now we add an argument schema or perhaps a boolean, e.g. is_working which defaults to True. Then the returns should include either the schema name or a boolean indicating whether the entry was from the production schema or not. I'm not sure how best to indicate this for the sqlite case and the case when working schema = production schema. Maybe boolean for input argument but schema name for return?

provenance table. In the default mode both schemas
working/production are avaliable for queries, but new
entries/modifications are done to the working schema. To create new
entries/modifications to production entries, `production_mode` must
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If production_mode is True you can get away with a little less work. You've alreaady done metadata.reflect(self.engine, self.schema) at line 253 so you don't need metadata.reflect(self.engine, self._prod_schema). The code in lines 276 - 283 should be modified to recognize this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema that is passed to the DbConnection is always the working schema, though we might change this to collection as we discussed.

production_mode does not mean it only connects to the production schema, it just means it defaults to the production schema for entries and modifications during that instance. I've tried to make the doc string a bit clearer.

But in all cases, i.e., production_mode true or not, it always reflects/connects to both schemas, as querying always searches both schemas regardless.


if column.name in all_columns:
duplicates.add(column.name)
all_columns.append(column.name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be

         all_columns.append(column.name)

No need to append it if it's already there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by your recommendation (that is the line that is in there already). I assume you mean no need to all_columns to accumulate multiple instances of the same column name, I have made it a set rather than a list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that the code could read

if att not in column_list.keys():
    column_list[att] = [temp_column_list[att]]
else:
    column_list[att].extend(tmp_column_list[att])

but it doesn't matter much. It's ok the way it is.

src/dataregistry/query.py Show resolved Hide resolved
Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a couple new, very minor, comments, but basically it looks ok except for one thing: I believeresolve_alias does not cover a case I would like to see covered (working schema is the active but the user wishes to resolve a production alias). As far as I can tell one would have to make a separate DbConnection with production mode set in order to do it. I propose we leave addressing that issue for a separate PR.

@stuartmcalpine stuartmcalpine merged commit 84272e4 into main Jan 16, 2025
26 checks passed
@stuartmcalpine stuartmcalpine deleted the u/stuart/link_schemas branch January 16, 2025 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants