Approach to sharing metadata alongside files for reuse #153

Open
Bankso opened this issue Dec 12, 2024 · 3 comments

Comments

@Bankso
Contributor

Bankso commented Dec 12, 2024

Issue: Files can be added to Synapse Datasets, but the only metadata we can directly add to the table are file annotations. We should consider how we want to approach sharing experiment-related record metadata (Biospecimen, Model, Individual, GeoMx AOI info, etc.).

Tables can be easily subsetted or queried to extract experiment-specific information. However, if we are going to use table subsets to derive experiment-specific record metadata that can be packaged with files for download, what is the best way to go about this?

  • apply as file annotations and directly include in Synapse Datasets?
  • create a bunch of views, annotate as Datasets, and include in a Collection with file Datasets?
  • create a CSV that is added to the Dataset?
    • This would need to be regenerated if information is updated, but is a very straightforward approach
  • What else?
@Bankso
Contributor Author

Bankso commented Jan 24, 2025

Current thoughts on this:

  • Define a Collection as the full set of information shared for reuse. At minimum, this contains:
    • File Datasets
    • Metadata Tables
      • We can create metadata subsets by querying schematic-generated synapse_storage_manifest tables
      • Biospecimen Keys can be acquired from File View metadata
      • Model and/or Individual Keys can be acquired from Biospecimen metadata
  • Current metadata table slots in Collection model include Biospecimen, Model, Individual, Imaging Channel, and GeoMx ROI/Annotations
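
The key lookups described above could be sketched roughly as follows. This is only an illustration in plain Python: the column names (`BiospecimenKey`, `IndividualKey`, `ModelKey`), the Synapse IDs, and the row values are hypothetical placeholders, and in practice the rows would come from querying the File View and the manifest tables rather than from in-memory lists.

```python
def subset_by_keys(rows, key_column, wanted_keys):
    """Return the rows whose key_column value is in wanted_keys."""
    wanted = set(wanted_keys)
    return [row for row in rows if row.get(key_column) in wanted]


def collect_keys(rows, key_column):
    """Collect the distinct, non-empty values of key_column."""
    return {row[key_column] for row in rows if row.get(key_column)}


# Hypothetical rows standing in for a File View and two manifest tables.
file_view = [
    {"id": "syn001", "BiospecimenKey": "bio-1"},
    {"id": "syn002", "BiospecimenKey": "bio-2"},
]
biospecimen_table = [
    {"BiospecimenKey": "bio-1", "IndividualKey": "ind-1", "ModelKey": ""},
    {"BiospecimenKey": "bio-2", "IndividualKey": "", "ModelKey": "mod-1"},
    {"BiospecimenKey": "bio-3", "IndividualKey": "ind-2", "ModelKey": ""},
]
individual_table = [
    {"IndividualKey": "ind-1", "Species": "Human"},
    {"IndividualKey": "ind-2", "Species": "Human"},
]

# Files -> Biospecimen: keep only specimens referenced by the shared files.
biospecimen_subset = subset_by_keys(
    biospecimen_table, "BiospecimenKey", collect_keys(file_view, "BiospecimenKey")
)

# Biospecimen -> Individual: keep only individuals referenced by those specimens.
individual_subset = subset_by_keys(
    individual_table, "IndividualKey", collect_keys(biospecimen_subset, "IndividualKey")
)

# Model keys can be gathered the same way for a Model table lookup.
model_keys = collect_keys(biospecimen_subset, "ModelKey")
```

Each subset could then fill the matching metadata table slot in the Collection.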

@jaybee84
Collaborator

> Define a Collection as the full set of information shared for reuse

+1 to this. This is flexible enough to include multiple data modalities if needed.

> Current metadata table slots in Collection model include Biospecimen, Model, Individual, Imaging Channel, and GeoMx ROI/Annotations

I would recommend considering the following framework: all metadata from the files included in every dataset of a collection "rolls up" to the top-level collection. This functionality would be similar to the "Add all Annotations" function that currently exists in the fileview web UI, but would need to be implemented in the context of collections. The metadata file that is readily available for the collection would then contain all metadata attributes in one CSV file.
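
To make the roll-up idea concrete, here is a minimal sketch using only the Python standard library. It does not use any Synapse API: the `collection` dictionary, the dataset names, and the annotation keys are hypothetical stand-ins for whatever the real Collection and Dataset objects would provide.

```python
import csv
import io


def roll_up(collection):
    """Flatten {dataset_name: [file annotation dicts]} into rows plus a column list."""
    rows, columns = [], ["dataset"]
    for dataset_name, files in collection.items():
        for annotations in files:
            row = {"dataset": dataset_name, **annotations}
            rows.append(row)
            # Grow the column list so it is the union of all annotation keys.
            for column in row:
                if column not in columns:
                    columns.append(column)
    return rows, columns


def to_csv(rows, columns):
    """Write the rolled-up rows as CSV text; missing attributes become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical collection with two datasets whose files carry different annotations.
collection = {
    "imaging": [{"id": "syn001", "assay": "imaging"}],
    "geomx": [{"id": "syn002", "platform": "GeoMx"}],
}

rows, columns = roll_up(collection)
csv_text = to_csv(rows, columns)
```

Files from different datasets rarely share every attribute, so the sketch takes the union of all keys and leaves the gaps blank instead of dropping columns.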

@Bankso
Contributor Author

Bankso commented Jan 27, 2025

> All metadata from included files in all datasets of a collection "roll-up" to the top level collection

This makes sense and would be a nice feature to have!

The problem that I think we need to address is how to surface record-based metadata that we want to share alongside files.

Metadata types like Biospecimen and Individual are uploaded via their own manifests and stored in their own tables - they aren't tied to files or applied as file annotations. That way, we can have a database of the record-based info, which can be searched and associated with files, as needed. To associate files with this metadata, there is a system of primary and foreign keys that can be used to refer to the relevant entries.

Considering this comment:

> the metadata file that will be readily available for the collection will contain all metadata attributes in one csv file.

I have considered using Datasets/Collections to surface this info as annotations, and this concept actually makes that seem like the better idea here.

We can extract record-based metadata entries from the source table and apply them to the files in a Dataset, based on foreign key attributes in the file metadata. Even if this functionality is currently exclusive to Datasets, having a single manifest with the file, assay, and specimen info seems like a convenient way to share metadata.
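
As a rough sketch of that foreign-key step, assuming a hypothetical `BiospecimenKey` column on the file metadata and a made-up `TissueType` attribute in the Biospecimen table, the join could look something like this in plain Python:

```python
def build_manifest(file_rows, record_rows, key_column, prefix):
    """Left-join record metadata onto file rows via key_column.

    Record columns are prefixed so they cannot collide with file annotations.
    Files with no matching record are kept unchanged.
    """
    records_by_key = {row[key_column]: row for row in record_rows}
    manifest = []
    for file_row in file_rows:
        record = records_by_key.get(file_row.get(key_column), {})
        merged = dict(file_row)
        for column, value in record.items():
            if column != key_column:
                merged[f"{prefix}{column}"] = value
        manifest.append(merged)
    return manifest


# Hypothetical file metadata and Biospecimen records.
file_rows = [
    {"id": "syn001", "assay": "imaging", "BiospecimenKey": "bio-1"},
    {"id": "syn002", "assay": "GeoMx", "BiospecimenKey": "bio-9"},
]
biospecimen_rows = [
    {"BiospecimenKey": "bio-1", "TissueType": "Tumor"},
]

manifest = build_manifest(file_rows, biospecimen_rows, "BiospecimenKey", "Biospecimen_")
```

The join keeps every file (a left join) so that a missing or stale key shows up as absent specimen columns instead of silently removing the file from the manifest.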
