-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enrich data source / asset association #584
Enrich data source / asset association #584
Conversation
e5954a2
to
94edde2
Compare
Co-authored-by: Padraic Shafer <[email protected]>
Co-authored-by: Padraic Shafer <[email protected]>
4668d7b
to
34b5d16
Compare
Testing migration on SQLite Before migration:
After migration:
|
Testing migration on Postgres Before migration:
After migration:
|
Re-verifying PG migration after adding a constraint in d58f8c6:
|
This has benefited from multiple high-level design conversations with @tacaswell and @dylanmcreynolds between October and today. I reviewed the current state with Dylan again today, and I think we should move ahead. I'll leave it for a day or so in case anyone has a chance to take another look at the details, but I will go ahead and merge this pretty soon. |
I ran through the migration one more time after the renaming in 9f2a2a3. I won't bother posting the output---same as above but with the new names. |
tiled/catalog/orm.py
Outdated
connection.execute( | ||
text( | ||
""" | ||
CREATE TRIGGER cannot_insert_num_null_if_num_int_exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we create an index on parameter, num and data_source_id to avoid a table scan on every insert? Even if we do, these triggers are going to be pretty heavy weight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a compound index of data_source_id, assset_id
. Will the condition
data_source_id = NEW.data_source_id
leverage that index and reduce the records to be scanned to just the assets for a single data source. Is that plausible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I made some comments for optional changes, but nothing else stood out to me.
assets.append( | ||
Asset( | ||
data_uri=ensure_uri(filepath), | ||
is_directory=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HDF5 virtual dataset is not treated as a directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A virtual database comprises one master
file and N data
files. The master
file must be handed to the Adapter for opening. The data
files are not handled directly by the adapter (parameter=None/NULL
) but they still ought to be tracked for purposes of data movement, accounting for data size, etc.
One could do one-dataset-per-directory. But like TIFF series in practice they are often mixed, so we address that general case and track them at the per-file level.
Contrast this to Zarr, where the files involves are always bundled by directory. We track Zarr at the directory level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the above as a comment in the test. Eventually it can make its way into docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so is_directory
indicates a proper directory, rather than directory-like containers. Thanks for clarifying that.
settings, | ||
) | ||
client = from_context(context) | ||
actual = list(client["image"][:, 0, 0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a single pixel from the image?
...which contains a value that matches the image name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, way too cute. I added a detailed comment.
for i in ordering: | ||
file = Path(tmpdir, f"image{i:05}.tif") | ||
files.append(file) | ||
tifffile.imwrite(file, i * data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the tests rely on the pixel values, it could be useful to import data
as ones_data
or something that clarifies that this is a block of 1
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed with a comment, afraid of breaking things with a rename.
tiled/adapters/hdf5.py
Outdated
@@ -31,7 +31,7 @@ class HDF5Adapter(collections.abc.Mapping, IndexersMixin): | |||
From the root node of a file given a filepath | |||
|
|||
>>> import h5py | |||
>>> HDF5Adapter.from_file("path/to/file.h5") | |||
>>> HDF5Adapter.from_filepath("path/to/file.h5") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the method from_filepath()
exist on HDF5Adapter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated docstring
tiled/catalog/orm.py
Outdated
connection.execute( | ||
text( | ||
""" | ||
CREATE TRIGGER cannot_insert_num_null_if_num_int_exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the trigger name is misleading because it does not matter whether the existing entry is null
or int
...just that it already exists, right?
Perhaps something like cannot_insert_num_null_if_num_exists
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion! Updated.
tiled/catalog/orm.py
Outdated
connection.execute( | ||
text( | ||
""" | ||
CREATE TRIGGER cannot_insert_num_null_if_num_int_exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See earlier comment about potentially misleading name for this trigger
"""Enrich DataSource-Asset association. | ||
|
||
Revision ID: a66028395cab | ||
Revises: 3db11ff95b6c | ||
Create Date: 2024-01-21 15:17:20.571763 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have only skimmed through this file, and not looked closely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very reasonable of you.
Here goes! |
Status Quo
The DataSource gives us a
mimetype
, which identifies (via a registry) which Adapter to use to read/write this data. The DataSource additionally provides some optionalparameters
to configure the Adapter.Through a many-to-many association table, each DataSource is associated with an unordered set of Assets. This might be one file (e.g. a CSV), a group of files (e.g. a TIFF sequence), or a directory with some internal structure (Zarr). When there are multiple Assets in a DataSource, they are passed to the Adapter's first argument as variadic args:
adapter(*asset_paths, **parameters)
.It is left to the Adapter to deal with:
This is messy and limiting. It makes mess of use cases like:
This PR
This adds to the
Asset
schema two fields:parameter
-- string corresponding to the Adapter's parameter name that this Asset should be pasesd tonum
-- int giving a position in a list (like a TIFF sequence number) if the parameter expects a listIf the
parameter
isNone
, the Asset is an indirect dependency and not passed to the Adapter. Ifnum
isNone
, that indicates that there is only one Asset for the given parameter, to be passed to the Adapter as a scalar; ifnum
is a integer, a list is given, sorted in ascending order.At the database level, these columns go in to the datasource--asset association table. At the HTTP API and Python object level, these fields are just flattened (denormalized) into the Asset for simplicity.
TO DO
Developer docs on this and on the catalog database schema generallydeferred to Should we normalize the 'structure' column into its own table? #576