Updated download link curation: Datasets, Tools, Educational Resources #159

Open
aclayton555 opened this issue Jan 22, 2025 · 3 comments

aclayton555 commented Jan 22, 2025

Note: This ticket may be split into separate tickets.

This ticket captures curation-related needs emerging from the following v1 CCKP Homepage updates:

https://sagebionetworks.jira.com/browse/CPO-1409
https://sagebionetworks.jira.com/browse/CPO-1427

Overarching intent is to improve surfacing of resource-related DOIs for access/download, and clarify which resources users will be directed to on Synapse via external sites.

@jaybee84 is requesting curation-related work to be completed by mid Feb, if possible, to provide sufficient time for portal development work.

To be discussed during 25-2 sprint kick-off:

TIMELINE/PRIORITIES: the timeline for this work is expected to shift priorities for existing ongoing work

SCOPE: Need to decide whether this is a data model update and/or a table update, and in which table(s) these updates will occur (i.e. CCKP backend table, admin tables, or backpopulated all the way to the grant projects)

OVERVIEW:

Datasets - “externalLink” in backend table = “External link” on CCKP

  • Review these URLs

  • Where a DOI exists, pull this into a new DOI column

  • Where the URL is a Synapse link, pull this into a new Synapse Link column

  • Where the URL is a DOI or a non-Synapse link, leave this in the existing URL column

  • Where either a Synapse link or non-Synapse link exists, but no DOI, see if we can find a DOI and add this to the new DOI column

Desired outcome: every dataset has a curated DOI to an access or download site AND either a Synapse link (if it is stored in Synapse) or an external link
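
The sorting rules above could be sketched roughly as follows. This is a minimal illustration, not the curation script used for the portal: the function name `classify_link`, the slot names `doi`, `synapse_link`, and `url`, and the choice to store the DOI as a `https://doi.org/` URL are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Loose DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")
SYNAPSE_ID_RE = re.compile(r"syn\d+", re.IGNORECASE)
DOI_HOSTS = {"doi.org", "dx.doi.org", "www.doi.org"}


def classify_link(url):
    """Sort one URL into hypothetical doi / synapse_link / url slots."""
    slots = {"doi": None, "synapse_link": None, "url": None}
    url = (url or "").strip()
    if not url:
        return slots
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    doi = parsed.path.lstrip("/")
    is_synapse_host = host == "synapse.org" or host.endswith(".synapse.org")
    if host in DOI_HOSTS and DOI_RE.fullmatch(doi):
        # A DOI link fills the DOI column and also stays in the URL column.
        slots["doi"] = "https://doi.org/" + doi
        slots["url"] = url
    elif is_synapse_host and SYNAPSE_ID_RE.search(url):
        # A Synapse link moves to the Synapse Link column.
        slots["synapse_link"] = url
    else:
        # Anything else stays in the URL column; its DOI still needs curating.
        slots["url"] = url
    return slots
```

Rows that come back with no `doi` value are the ones that would still need a manual DOI search.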

Tools - “downloadUrl” in backend table = “Download link” on CCKP

  • Review these URLs

  • Where a DOI exists, pull this into a new DOI column

  • Unlikely that tools will have a Synapse link, but IF there is a URL that is a Synapse link, pull this into a new Synapse Link column

  • Where the URL is a DOI or a non-Synapse link, leave this in the existing URL column

  • Where a non-Synapse link exists, but no DOI, see if we can find a DOI and add this to the new DOI column

Desired outcome: every tool has a curated DOI to an access or download site AND an external link

Educational Resources - "link" in backend table = "link" on CCKP

  • Review these URLs

  • Where a DOI exists, pull this into a new DOI column

  • Where the URL is a Synapse link, pull this into a new Synapse Link column

  • Where the URL is a DOI or a non-Synapse link, leave this in the existing URL column

  • Where either a Synapse link or non-Synapse link exists, but no DOI, see if we can find a DOI and add this to the new DOI column

Desired outcome: every educational resource has a curated DOI to an access or download site AND either a Synapse link (if it is stored in Synapse) or an external link

Value:

  • highlighting links to resources in Synapse
  • ability to leverage download cart for resources in Synapse
  • expected improved user flow/call to action
@aclayton555
Author

With this curation, comment back on https://sagebionetworks.jira.com/browse/PORTALS-3209 to clarify which columns in the backend table will map to the “Cite as” button. For example:

  • Publications - existing DOI column maps to cite as
  • Datasets - new DOI column to map to cite as
  • Tools - new DOI column to map to cite as
  • Educational resource - new DOI column to map to cite as

@aclayton555
Author

25-2/3:

  • ideally do this as an inline transformation when we do portal sync, rather than update manifests. Existing external link attribute allows for string list of URLs, but the risk here is that a contributor may not provide more than one link (but most of the curation is done by us anyway).
  • Where data are stored in Synapse, DatasetAlias is the synapse link

Decision:

  • Add DOI to Datasets, Tools, and Educational Resources; already in Publications. To be added as a required field and included in End of 25-1 data model release #167
  • Ensure DOI is added to schemas
  • Add DOI to table schemas in Synapse (grant projects through to portals tables)
  • Parse out Synapse links from alias in portal sync stage
  • For any DOIs that are missing, curate these (this will likely extend beyond this sprint)
  • Assess whether crawler can pull DOI information (future sprint)
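
The "parse out Synapse links from alias" step might look something like the sketch below. It assumes the alias value contains one or more Synapse IDs of the form `syn` followed by digits, and that links use the `https://www.synapse.org/#!Synapse:<id>` form. The function name `synapse_links_from_alias` is hypothetical and is not taken from the portal sync scripts.

```python
import re

# Assumption: Synapse IDs look like "syn" followed by digits.
SYNAPSE_ID_RE = re.compile(r"\bsyn\d+\b", re.IGNORECASE)
# Assumption: entity pages are linked with this prefix.
SYNAPSE_URL_PREFIX = "https://www.synapse.org/#!Synapse:"


def synapse_links_from_alias(alias):
    """Return one Synapse link per distinct Synapse ID found in the alias."""
    syn_ids = []
    for match in SYNAPSE_ID_RE.findall(alias or ""):
        syn_id = match.lower()
        if syn_id not in syn_ids:  # keep first-seen order, drop repeats
            syn_ids.append(syn_id)
    return [SYNAPSE_URL_PREFIX + syn_id for syn_id in syn_ids]
```

An alias with no Synapse ID returns an empty list, so non-Synapse resources would simply get no Synapse link.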

@Bankso
Contributor

Bankso commented Feb 1, 2025

Notes on progress:

  • Doi columns have been defined and added to schema definitions for Dataset View, Tool View, and Educational Resources
    • All have validation rule url and required==TRUE
  • union_qc.py was updated to reflect current status of tables and used to download UNION tables in their current state
    • Output of this process is a merged manifest that can be processed for backpopulation
  • updated prep_backpopulation.py to add Doi columns for Dataset View, Tool View and add remaining missing columns for Educational Resources
  • updated replace_table_schema.py to reflect new columns and desired table schema order

Next, I reviewed the links and populated the DOI column, if possible
Dataset View

  • I reviewed the Dataset Url column and copied any DOIs I found to Dataset Doi column
  • If only a DOI was provided for datasets stored in Synapse or Zenodo, I used the Dataset Alias to build the Dataset Url
  • I updated any GEO links in Dataset Url that were missing an accession number
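
The GEO link check can be illustrated with a small sketch. It assumes GEO record links take the form `https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=<accession>` and that accessions start with GSE, GSM, GPL, or GDS. The helper names `geo_accession` and `geo_url` are hypothetical and were not part of the review described above.

```python
import re
from urllib.parse import urlparse, parse_qs

GEO_ACCESSION_RE = re.compile(r"G(SE|SM|PL|DS)\d+")
GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc="


def geo_accession(url):
    """Return the accession in a GEO link, or None if the link has none."""
    parsed = urlparse(url or "")
    if parsed.hostname != "www.ncbi.nlm.nih.gov" or not parsed.path.startswith("/geo"):
        return None
    for value in parse_qs(parsed.query).get("acc", []):
        candidate = value.strip().upper()
        if GEO_ACCESSION_RE.fullmatch(candidate):
            return candidate
    return None


def geo_url(accession):
    """Build a GEO record link from an accession such as a GSE ID."""
    return GEO_ACC_URL + accession.strip().upper()
```

A GEO link for which `geo_accession` returns None is one that would need an accession added, for example by rebuilding it with `geo_url` from the Dataset Alias if that alias holds the accession.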

Next steps for table updates and backpopulation:

  • add DOIs to DOI column and fix broken links, if present
    • Datasets
    • Tools
    • Educational Resources
  • split manifests via split_manifest_grants.py using --csv and -db flags
    • Datasets
    • Tools
    • Educational Resources
  • create an upload manifest using gen-mp-csv.py
    • Datasets
    • Tools
    • Educational Resources
  • update table schemas to reflect new columns using replace_table_schema.py
    • Datasets
    • Tools
    • Educational Resources
  • upload manifests using upload_manifests.py
    • Datasets
    • Tools
    • Educational Resources

Separately, syncing scripts for Dataset View, Tool View, and Educational Resources need to be updated to:
