Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enrich data source / asset association #584

Merged
merged 38 commits into from
Jan 23, 2024

Conversation

danielballan
Copy link
Member

@danielballan danielballan commented Oct 18, 2023

Status Quo

The DataSource gives us a mimetype, which identifies (via a registry) which Adapter to use to read/write this data. The DataSource additionally provides some optional parameters to configure the Adapter.

Through a many-to-many association table, each DataSource is associated with an unordered set of Assets. This might be one file (e.g. a CSV), a group of files (e.g. a TIFF sequence), or a directory with some internal structure (Zarr). When there are multiple Assets in a DataSource, they are passed to the Adapter's first argument as variadic args: adapter(*asset_paths, **parameters).

It is left to the Adapter to deal with:

  • Do these asset paths need to be sorted?
  • Do these asset paths have different roles and/or formats?
  • Are some of these assets indirect dependencies that the Adapter should not directly use?

This is messy and limiting. It makes mess of use cases like:

  • TIFF sequences that may not be in alphanumeric sort order
  • Data file with sidecar metadata file, like an image file with some associated YAML metadata
  • HDF5 "master" files with underling "data" files that should be tracked as dependencies of the DataSource, for the purposes of export or calculating size, but are not directly needed by the Adapter

This PR

This adds to the Asset schema two fields:

  • parameter -- string corresponding to the Adapter's parameter name that this Asset should be pasesd to
  • num -- int giving a position in a list (like a TIFF sequence number) if the parameter expects a list

If the parameter is None, the Asset is an indirect dependency and not passed to the Adapter. If num is None, that indicates that there is only one Asset for the given parameter, to be passed to the Adapter as a scalar; if num is a integer, a list is given, sorted in ascending order.

At the database level, these columns go in to the datasource--asset association table. At the HTTP API and Python object level, these fields are just flattened (denormalized) into the Asset for simplicity.

TO DO

  • Developer docs on this and on the catalog database schema generally deferred to Should we normalize the 'structure' column into its own table? #576
  • Tests that cover the use cases sketched above
  • Tests that verify the ordering is respected
  • All Adapters should accept a URI as input. (They may also accept a filepath.)
  • Database migration to update existing deployments

@danielballan danielballan mentioned this pull request Oct 18, 2023
10 tasks
@danielballan danielballan marked this pull request as ready for review October 31, 2023 18:18
@danielballan danielballan force-pushed the data-source-asset-assocation branch from e5954a2 to 94edde2 Compare November 20, 2023 15:18
@danielballan danielballan force-pushed the data-source-asset-assocation branch from 4668d7b to 34b5d16 Compare January 20, 2024 18:36
@danielballan
Copy link
Member Author

Testing migration on SQLite

Before migration:

sqlite> select data_sources.mimetype, assets.data_uri from data_sources JOIN data_source_asset_association ON data_sources.id == data_source_asset_association.data_source_id JOIN assets ON assets.id == data_source_asset_association.asset_id;
mimetype                                                           data_uri                                                               
-----------------------------------------------------------------  -----------------------------------------------------------------------
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet  file://localhost/home/dallan/Repos/bnl/tiled/files/tables.xlsx         
image/tiff                                                         file://localhost/home/dallan/Repos/bnl/tiled/files/a.tif               
text/csv                                                           file://localhost/home/dallan/Repos/bnl/tiled/files/another_table.csv   
image/tiff                                                         file://localhost/home/dallan/Repos/bnl/tiled/files/c.tif               
image/tiff                                                         file://localhost/home/dallan/Repos/bnl/tiled/files/b.tif               
multipart/related;type=image/tiff                                  file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0002.tif      
multipart/related;type=image/tiff                                  file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0003.tif      
multipart/related;type=image/tiff                                  file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0001.tif      
multipart/related;type=image/tiff                                  file://localhost/home/dallan/Repos/bnl/tiled/files/more/B0001.tif      
multipart/related;type=image/tiff                                  file://localhost/home/dallan/Repos/bnl/tiled/files/more/B0002.tif      
image/tiff                                                         file://localhost/home/dallan/Repos/bnl/tiled/files/more/even_more/f.tif
image/tiff                                                         file://localhost/home/dallan/Repos/bnl/tiled/files/more/even_more/e.tif

After migration:

sqlite> select data_sources.mimetype, data_source_asset_association.parameter, data_source_asset_association.num, assets.data_uri from data_sources JOIN data_source_asset_association ON data_sources.id == data_source_asset_association.data_source_id JOIN assets ON assets.id == data_source_asset_association.asset_id;
mimetype                                                           parameter  num  data_uri                                                               
-----------------------------------------------------------------  ---------  ---  -----------------------------------------------------------------------
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet  data_uri        file://localhost/home/dallan/Repos/bnl/tiled/files/tables.xlsx         
image/tiff                                                         data_uri        file://localhost/home/dallan/Repos/bnl/tiled/files/a.tif               
text/csv                                                           data_uri        file://localhost/home/dallan/Repos/bnl/tiled/files/another_table.csv   
image/tiff                                                         data_uri        file://localhost/home/dallan/Repos/bnl/tiled/files/c.tif               
image/tiff                                                         data_uri        file://localhost/home/dallan/Repos/bnl/tiled/files/b.tif               
multipart/related;type=image/tiff                                  data_uris  2    file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0002.tif      
multipart/related;type=image/tiff                                  data_uris  3    file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0003.tif      
multipart/related;type=image/tiff                                  data_uris  1    file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0001.tif      
multipart/related;type=image/tiff                                  data_uris  1    file://localhost/home/dallan/Repos/bnl/tiled/files/more/B0001.tif      
multipart/related;type=image/tiff                                  data_uris  2    file://localhost/home/dallan/Repos/bnl/tiled/files/more/B0002.tif      
image/tiff                                                         data_uri        file://localhost/home/dallan/Repos/bnl/tiled/files/more/even_more/f.tif
image/tiff                                                         data_uri        file://localhost/home/dallan/Repos/bnl/tiled/files/more/even_more/e.tif
sqlite> .schema data_source_asset_association
CREATE TABLE IF NOT EXISTS "data_source_asset_association" (
	data_source_id INTEGER, 
	asset_id INTEGER, 
	parameter VARCHAR(255), 
	num INTEGER, 
	CONSTRAINT parameter_num_unique_constraint UNIQUE (data_source_id, parameter, num), 
	FOREIGN KEY(data_source_id) REFERENCES data_sources (id) ON DELETE CASCADE, 
	FOREIGN KEY(asset_id) REFERENCES assets (id) ON DELETE CASCADE
);
sqlite> select * from sqlite_master where type = 'trigger';
trigger|cannot_insert_num_null_if_num_int_exists|data_source_asset_association|0|CREATE TRIGGER cannot_insert_num_null_if_num_int_exists
    BEFORE INSERT ON data_source_asset_association
    WHEN NEW.num IS NULL
    BEGIN
        SELECT RAISE(ABORT, 'Can only insert num=NULL if no other row exists for the same parameter')
        WHERE EXISTS
        (
            SELECT 1
            FROM data_source_asset_association
            WHERE parameter = NEW.parameter
            AND data_source_id = NEW.data_source_id
        );
    END
trigger|cannot_insert_num_int_if_num_null_exists|data_source_asset_association|0|CREATE TRIGGER cannot_insert_num_int_if_num_null_exists
    BEFORE INSERT ON data_source_asset_association
    WHEN NEW.num IS NOT NULL
    BEGIN
        SELECT RAISE(ABORT, 'Can only insert INTEGER num if no NULL row exists for the same parameter')
        WHERE EXISTS
        (
            SELECT 1
            FROM data_source_asset_association
            WHERE parameter = NEW.parameter
            AND num IS NULL
            AND data_source_id = NEW.data_source_id
        );
    END

@danielballan
Copy link
Member Author

Testing migration on Postgres

Before migration:

postgres=# select data_sources.mimetype, assets.data_uri from data_sources JOIN data_source_asset_association ON data_sources.id = data_source_asset_association.data_source_id JOIN assets ON assets.id = data_source_asset_association.asset_id;
                             mimetype                              |                                data_uri                                 
-------------------------------------------------------------------+-------------------------------------------------------------------------
 application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | file://localhost/home/dallan/Repos/bnl/tiled/files/tables.xlsx
 image/tiff                                                        | file://localhost/home/dallan/Repos/bnl/tiled/files/a.tif
 text/csv                                                          | file://localhost/home/dallan/Repos/bnl/tiled/files/another_table.csv
 image/tiff                                                        | file://localhost/home/dallan/Repos/bnl/tiled/files/c.tif
 image/tiff                                                        | file://localhost/home/dallan/Repos/bnl/tiled/files/b.tif
 multipart/related;type=image/tiff                                 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0002.tif
 multipart/related;type=image/tiff                                 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0003.tif
 multipart/related;type=image/tiff                                 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0001.tif
 multipart/related;type=image/tiff                                 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/B0001.tif
 multipart/related;type=image/tiff                                 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/B0002.tif
 image/tiff                                                        | file://localhost/home/dallan/Repos/bnl/tiled/files/more/even_more/f.tif
 image/tiff                                                        | file://localhost/home/dallan/Repos/bnl/tiled/files/more/even_more/e.tif
(12 rows)

After migration:

postgres=# select data_sources.mimetype, data_source_asset_association.parameter, data_source_asset_association.num, assets.data_uri from data_sources JOIN data_source_asset_association ON data_sources.id = data_source_asset_association.data_source_id JOIN assets ON assets.id = data_source_asset_association.asset_id;
                             mimetype                              | parameter | num |                                data_uri                                 
-------------------------------------------------------------------+-----------+-----+-------------------------------------------------------------------------
 application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/tables.xlsx
 image/tiff                                                        | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/a.tif
 text/csv                                                          | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/another_table.csv
 image/tiff                                                        | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/c.tif
 image/tiff                                                        | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/b.tif
 multipart/related;type=image/tiff                                 | data_uris |   3 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0003.tif
 multipart/related;type=image/tiff                                 | data_uris |   2 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0002.tif
 multipart/related;type=image/tiff                                 | data_uris |   1 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0001.tif
 multipart/related;type=image/tiff                                 | data_uris |   2 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/B0002.tif
 multipart/related;type=image/tiff                                 | data_uris |   1 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/B0001.tif
 image/tiff                                                        | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/more/even_more/f.tif
 image/tiff                                                        | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/more/even_more/e.tif
postgres-# \d+ data_source_asset_association
                                          Table "public.data_source_asset_association"
     Column     |          Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
----------------+------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 data_source_id | integer                |           |          |         | plain    |             |              | 
 asset_id       | integer                |           |          |         | plain    |             |              | 
 parameter      | character varying(255) |           |          |         | extended |             |              | 
 num            | integer                |           |          |         | plain    |             |              | 
Indexes:
    "parameter_num_unique_constraint" UNIQUE CONSTRAINT, btree (data_source_id, parameter, num)
Foreign-key constraints:
    "data_source_asset_association_asset_id_fkey" FOREIGN KEY (asset_id) REFERENCES assets(id) ON DELETE CASCADE
    "data_source_asset_association_data_source_id_fkey" FOREIGN KEY (data_source_id) REFERENCES data_sources(id) ON DELETE CASCADE
Triggers:
    cannot_insert_num_null_if_num_int_exists BEFORE INSERT ON data_source_asset_association FOR EACH ROW WHEN (new.num IS NULL) EXECUTE FUNCTION test_parameter_exists()
Access method: heap
postgres=# select * from information_schema.triggers;
 trigger_catalog | trigger_schema |               trigger_name               | event_manipulation | event_object_catalog | event_object_schema |      event_object_table       | action_order | action_condition  |             action_statement             | action_orientation | action_timing | action_reference_old_table | action_reference_new_table | action_reference_old_row | action_reference_new_row | created 
-----------------+----------------+------------------------------------------+--------------------+----------------------+---------------------+-------------------------------+--------------+-------------------+------------------------------------------+--------------------+---------------+----------------------------+----------------------------+--------------------------+--------------------------+---------
 postgres        | public         | cannot_insert_num_null_if_num_int_exists | INSERT             | postgres             | public              | data_source_asset_association |            1 | (new.num IS NULL) | EXECUTE FUNCTION test_parameter_exists() | ROW                | BEFORE        |                            |                            |                          |                          | 
(1 row)

@danielballan
Copy link
Member Author

Re-verifying PG migration after adding a constraint in d58f8c6:

postgres=# select data_sources.mimetype, data_source_asset_association.parameter, data_source_asset_association.num, assets.data_uri from data_sources JOIN data_source_asset_association ON data_sources.id = data_source_asset_association.data_source_id JOIN assets ON assets.id = data_source_asset_association.asset_id;
                             mimetype                              | parameter | num |                                data_uri                                 
-------------------------------------------------------------------+-----------+-----+-------------------------------------------------------------------------
 application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/tables.xlsx
 image/tiff                                                        | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/a.tif
 text/csv                                                          | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/another_table.csv
 image/tiff                                                        | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/c.tif
 image/tiff                                                        | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/b.tif
 multipart/related;type=image/tiff                                 | data_uris |   3 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0003.tif
 multipart/related;type=image/tiff                                 | data_uris |   2 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0002.tif
 multipart/related;type=image/tiff                                 | data_uris |   1 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/A0001.tif
 multipart/related;type=image/tiff                                 | data_uris |   2 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/B0002.tif
 multipart/related;type=image/tiff                                 | data_uris |   1 | file://localhost/home/dallan/Repos/bnl/tiled/files/more/B0001.tif
 image/tiff                                                        | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/more/even_more/f.tif
 image/tiff                                                        | data_uri  |     | file://localhost/home/dallan/Repos/bnl/tiled/files/more/even_more/e.tif
(12 rows)
postgres=# \d+ data_source_asset_association
                                          Table "public.data_source_asset_association"
     Column     |          Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
----------------+------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 data_source_id | integer                |           |          |         | plain    |             |              | 
 asset_id       | integer                |           |          |         | plain    |             |              | 
 parameter      | character varying(255) |           |          |         | extended |             |              | 
 num            | integer                |           |          |         | plain    |             |              | 
Indexes:
    "parameter_num_unique_constraint" UNIQUE CONSTRAINT, btree (data_source_id, parameter, num)
Foreign-key constraints:
    "data_source_asset_association_asset_id_fkey" FOREIGN KEY (asset_id) REFERENCES assets(id) ON DELETE CASCADE
    "data_source_asset_association_data_source_id_fkey" FOREIGN KEY (data_source_id) REFERENCES data_sources(id) ON DELETE CASCADE
Triggers:
    cannot_insert_num_int_if_num_null_exists BEFORE INSERT ON data_source_asset_association FOR EACH ROW WHEN (new.num IS NOT NULL) EXECUTE FUNCTION test_not_null_parameter_exists()
    cannot_insert_num_null_if_num_int_exists BEFORE INSERT ON data_source_asset_association FOR EACH ROW WHEN (new.num IS NULL) EXECUTE FUNCTION test_parameter_exists()
Access method: heap

@danielballan
Copy link
Member Author

This has benefited from multiple high-level design conversations with @tacaswell and @dylanmcreynolds between October and today. I reviewed the current state with Dylan again today, and I think we should move ahead.

I'll leave it for a day or so in case anyone has a chance to take another look at the details, but I will go ahead and merge this pretty soon.

@danielballan
Copy link
Member Author

I ran through the migration one more time after the renaming in 9f2a2a3. I won't bother posting the output---same as above but with the new names.

connection.execute(
text(
"""
CREATE TRIGGER cannot_insert_num_null_if_num_int_exists
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create an index on parameter, num and data_source_id to avoid a table scan on every insert? Even if we do, these triggers are going to be pretty heavy weight.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a compound index of data_source_id, assset_id. Will the condition

data_source_id = NEW.data_source_id

leverage that index and reduce the records to be scanned to just the assets for a single data source. Is that plausible?

Copy link
Contributor

@padraic-shafer padraic-shafer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I made some comments for optional changes, but nothing else stood out to me.

assets.append(
Asset(
data_uri=ensure_uri(filepath),
is_directory=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HDF5 virtual dataset is not treated as a directory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A virtual database comprises one master file and N data files. The master file must be handed to the Adapter for opening. The data files are not handled directly by the adapter (parameter=None/NULL) but they still ought to be tracked for purposes of data movement, accounting for data size, etc.

One could do one-dataset-per-directory. But like TIFF series in practice they are often mixed, so we address that general case and track them at the per-file level.

Contrast this to Zarr, where the files involves are always bundled by directory. We track Zarr at the directory level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the above as a comment in the test. Eventually it can make its way into docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so is_directory indicates a proper directory, rather than directory-like containers. Thanks for clarifying that.

settings,
)
client = from_context(context)
actual = list(client["image"][:, 0, 0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a single pixel from the image?

...which contains a value that matches the image name

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, way too cute. I added a detailed comment.

for i in ordering:
file = Path(tmpdir, f"image{i:05}.tif")
files.append(file)
tifffile.imwrite(file, i * data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the tests rely on the pixel values, it could be useful to import data as ones_data or something that clarifies that this is a block of 1s.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed with a comment, afraid of breaking things with a rename.

@@ -31,7 +31,7 @@ class HDF5Adapter(collections.abc.Mapping, IndexersMixin):
From the root node of a file given a filepath

>>> import h5py
>>> HDF5Adapter.from_file("path/to/file.h5")
>>> HDF5Adapter.from_filepath("path/to/file.h5")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the method from_filepath() exist on HDF5Adapter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated docstring

connection.execute(
text(
"""
CREATE TRIGGER cannot_insert_num_null_if_num_int_exists
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the trigger name is misleading because it does not matter whether the existing entry is null or int...just that it already exists, right?

Perhaps something like cannot_insert_num_null_if_num_exists?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! Updated.

connection.execute(
text(
"""
CREATE TRIGGER cannot_insert_num_null_if_num_int_exists
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See earlier comment about potentially misleading name for this trigger

Comment on lines +1 to +5
"""Enrich DataSource-Asset association.

Revision ID: a66028395cab
Revises: 3db11ff95b6c
Create Date: 2024-01-21 15:17:20.571763
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only skimmed through this file, and not looked closely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very reasonable of you.

@danielballan
Copy link
Member Author

Here goes!

@danielballan danielballan merged commit 97ad75b into bluesky:main Jan 23, 2024
8 checks passed
@danielballan danielballan deleted the data-source-asset-assocation branch January 23, 2024 01:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants