Link schemas #170
Conversation
There are some comments associated with particular lines, and I haven't yet looked in detail at all the files under `dataregistry_cli` or `tests`, though I expect they're probably ok, except for the code to test aliases. If the tests (which I originally wrote!) really exercised everything, something should have failed.

Among the alias routines, `resolve_alias` and `find_aliases` at least need some revision. The calls to `_render_filter` are missing the new `schema` argument. We need to decide how to handle the possibility of two schemas: either for aliases we do the same as is now done for datasets (combine the results of two queries), or else add a `schema` argument. In the latter case, should it have a default? I'm leaning towards "no", because it doesn't seem to have a natural default, but I could be convinced otherwise.
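The "combine the results of two queries" option could look roughly like the sketch below. All names and the data layout are illustrative only; the real routines build SQLAlchemy statements and pass the schema through to `_render_filter`.

```python
# Hypothetical sketch of combining alias matches from both schemas.
# rows_by_schema maps a schema name to that schema's alias rows.
def find_aliases(rows_by_schema, alias_name):
    """Return (schema, row) pairs matching alias_name across all schemas."""
    matches = []
    for schema, rows in rows_by_schema.items():
        matches.extend((schema, r) for r in rows if r["alias"] == alias_name)
    return matches
```

Tagging each match with its schema also sidesteps the default-schema question, since both schemas are always searched.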
src/dataregistry/db_basic.py (outdated)

```python
results = conn.execute(stmt)
r = results.fetchone()
if r is None:
    warnings.warn(
```
Instead of issuing a warning, why even make the query when called in the context of database creation? In the old way of doing things, there was the `get_db_version` flag to indicate whether or not the query should be made. Is it possible to do something similar here? One way might be to add another argument, `creation_mode`, defaulting to `False` when setting up a `DbConnection`; it would be set to `True` when `create_registry_schema.py` creates connections. (I haven't thought this through that carefully, but I think it would do.)
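A minimal sketch of the suggested flag, with placeholder internals (the real `DbConnection` does much more than this):

```python
# Hypothetical sketch: skip the provenance query when the connection is made
# by the schema-creation script, since the table does not exist yet.
class DbConnection:
    def __init__(self, schema, creation_mode=False):
        self.schema = schema
        self.creation_mode = creation_mode
        self.db_version = None
        if not creation_mode:
            self.db_version = self._query_provenance()

    def _query_provenance(self):
        # Stand-in for the real query against the provenance table.
        return "1.0.0"
```

`create_registry_schema.py` would then pass `creation_mode=True` when it sets up its connections.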
I've added a `creation_mode` flag for the `DbConnection` which skips querying the provenance table during reflection. Now if nothing is found in the provenance table, an exception is raised rather than a warning.
Looks good. Please add a comment for `creation_mode` under "Parameters" in the docstring for `DbConnection.__init__`.

I will move the renaming of instances of "schema" to "database" to a new PR, and update the docs to reflect this.
I've made some minor suggestions inline. Concerning `resolve_alias`, I think we need to provide a way to look in the production schema. I suggest for now we add an argument `schema`, or perhaps a boolean, e.g. `is_working`, which defaults to `True`. Then the returns should include either the schema name or a boolean indicating whether the entry was from the production schema or not. I'm not sure how best to indicate this for the sqlite case and the case when working schema = production schema. Maybe a boolean for the input argument but a schema name for the return?
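The proposed interface (boolean in, schema name out) might be sketched as follows; the function signature and data layout are purely illustrative, not the real implementation:

```python
# Hypothetical sketch: is_working selects which schema to search, and the
# return carries the schema *name* so the caller knows where the match lives.
def resolve_alias(aliases_by_schema, alias, working_schema, prod_schema,
                  is_working=True):
    """Return (dataset_id, schema_name) for alias, or (None, None)."""
    schema = working_schema if is_working else prod_schema
    dataset_id = aliases_by_schema.get(schema, {}).get(alias)
    return (dataset_id, schema if dataset_id is not None else None)
```

Returning the schema name rather than echoing the boolean also handles the sqlite case, where working and production may be the same schema.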
src/dataregistry/db_basic.py (outdated)

```
provenance table. In the default mode both schemas
working/production are available for queries, but new
entries/modifications are done to the working schema. To create new
entries/modifications to production entries, `production_mode` must
```
If `production_mode` is `True` you can get away with a little less work. You've already done `metadata.reflect(self.engine, self.schema)` at line 253, so you don't need `metadata.reflect(self.engine, self._prod_schema)`. The code in lines 276-283 should be modified to recognize this case.
The schema that is passed to the `DbConnection` is always the working schema, though we might change this to `collection` as we discussed. `production_mode` does not mean it only connects to the production schema; it just means it defaults to the production schema for entries and modifications during that instance. I've tried to make the docstring a bit clearer.

But in all cases, i.e., `production_mode` true or not, it always reflects/connects to both schemas, as querying always searches both schemas regardless.
src/dataregistry/db_basic.py (outdated)

```python
if column.name in all_columns:
    duplicates.add(column.name)
all_columns.append(column.name)
```
Shouldn't this be

```python
else:
    all_columns.append(column.name)
```

No need to append it if it's already there.
Not sure what you mean by your recommendation (that is the line that is in there already). I assume you mean there's no need for `all_columns` to accumulate multiple instances of the same column name; I have made it a set rather than a list.
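The set-based version of the duplicate-column check, extracted here as a free function for illustration (the real code works on reflected SQLAlchemy columns):

```python
# Membership tests on a set are O(1), and all_columns no longer
# accumulates repeated names.
def find_duplicate_columns(tables):
    """Given an iterable of per-table column-name lists, return the names
    that appear in more than one place."""
    all_columns, duplicates = set(), set()
    for columns in tables:
        for name in columns:
            if name in all_columns:
                duplicates.add(name)
            all_columns.add(name)
    return duplicates
```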
I meant that the code could read

```python
if att not in column_list:
    column_list[att] = list(tmp_column_list[att])
else:
    column_list[att].extend(tmp_column_list[att])
```

but it doesn't matter much. It's ok the way it is.
I made a couple of new, very minor, comments, but basically it looks ok except for one thing: I believe `resolve_alias` does not cover a case I would like to see covered (the working schema is the active one, but the user wishes to resolve a production alias). As far as I can tell, one would have to make a separate `DbConnection` with production mode set in order to do it. I propose we leave addressing that issue for a separate PR.
Have the working and production schemas work more seamlessly together.

Now when connecting to the registry, the user only specifies the working schema. Through the provenance table, the associated production schema is automatically known.

The fundamental operations are the same for registering/modifying/deleting entries. For production entries, when initiating the `DataRegistry` instance, users must flag `production_mode=True` to set the production schema as the target (the specified schema during initiation is always the "working" one).

Now when querying, both schemas (working and production) are searched. This is done through two independent queries, one to each schema, with the results concatenated. For this reason I've removed the SQLAlchemy `CursorResult` as a return option, as there is no real easy way to combine them. Now the only returns are the `property_dict` and the `dataframe`.

Removed the `TableMetadata` class; put the reflection and information within the `DbConnection` object.
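The query-then-concatenate scheme described above can be sketched roughly as below; the function name and the row-dict representation are illustrative, not the registry's actual internals.

```python
import pandas as pd

# Hypothetical sketch: each schema query yields a list of row dicts; the
# combined result is returned as a property dict or a DataFrame, the two
# remaining return options.
def combine_query_results(working_rows, production_rows, as_dataframe=False):
    rows = list(working_rows) + list(production_rows)
    if as_dataframe:
        return pd.DataFrame(rows)
    # property_dict style: column name -> list of values
    keys = rows[0].keys() if rows else []
    return {k: [r[k] for r in rows] for k in keys}
```

This is also why `CursorResult` had to go: two live cursors from two separate executions have no natural single-cursor combination, whereas materialized rows concatenate trivially.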
Todo

- `test_query_between_columns` test
- `test_register_dataset_alias` test
- `get_db_versioning` in `query.py`