
Features to allow sharing of full pipeline and results for paper publication #861

Open
4 of 6 tasks
lfrank opened this issue Mar 7, 2024 · 22 comments
Labels: enhancement (New feature or request), infrastructure (Unix, MySQL, etc. settings/issues impacting users)

Comments

lfrank (Contributor) commented Mar 7, 2024

I would like us to work toward a system that allows us to make the full set of pipelines and data files for a paper available publicly. We need to discuss the details, but the solution could look something like this.

  • (1) Select all table entries that are necessary to recreate the contents of a paper.

  • (2) Export either the full database or, ideally, a database containing just those entries. Possibly create a Docker image for that database.

  • (3) Assemble and upload all of the raw and analysis NWB files to DANDI.

  • (4) Add a table or functionality to allow reads of files from DANDI, respecting file name changes.

  • (5) Make the repo for project-specific notebooks/scripts public, with notebooks that create the figures.

  • (6) Work with DANDI to create a hub that allows people to start up the database and use Spyglass to replicate figures and other analyses.

CBroz1 (Member) commented Mar 7, 2024

I think (1), the selective dump of MySQL, is readily achievable. It would likely involve generating a list of the tables associated with the paper's data and then using DataJoint to generate the restriction to use during the mysqldump process.
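
As a rough sketch of what that could look like (not an implementation) — the restriction value is a placeholder, and Session stands in for whichever tables end up in the export list:

# Sketch only: build a per-table mysqldump command from a DataJoint restriction.
# A real export would repeat this for every table in the paper's subgraph.
from spyglass.common import Session

paper_restr = 'nwb_file_name LIKE "example2024%"'  # hypothetical restriction

keys = (Session & paper_restr).fetch("KEY")  # primary keys of the needed entries
where = " OR ".join(
    "(" + " AND ".join(f"{k}='{v}'" for k, v in key.items()) + ")" for key in keys
)
cmd = f'mysqldump {Session.database} {Session.table_name} --where="{where}"'
print(cmd)  # appended to a bash script run against the database server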

I predict some 'devil-in-the-details' when it comes to replication with uploaded files. If the Analysis files are edited to be DANDI compliant, issues may arise in any portion of our pipeline/analysis/visualization code downstream of those edits. Is it fair to say that DANDI compliance is a moving target? Is our Spyglass-native format fixed?

I can envision the following approaches:

  1. Periodic updates of all portions of Spyglass ingestion processes to detect format version and handle it accordingly.
  2. Periodic maintenance to bring us into DANDI compliance across (a) Spyglass pipelines, (b) server files, (c) MySQL checksum tables
  3. A single bidirectional translation mechanism to edit files prior to upload and reverse edits on download

We took option 1 for spatial series changes in NWB v2.5. It seems to have worked out well, but I know the permutations could balloon when we multiply across X data types, Y Spyglass tables, and Z NWB version changes over time.

Option 2 carries a heavy maintenance burden across a large set of files that may never be needed. This is achievable for a subset of the data selected for a given upload, but it would need to be accompanied by record keeping of when something was updated and for which DANDI/NWB version, to prevent issues with accessing existing data. I think @samuelbray32 has already started exploring this.

Option 3 aligns least with open-science ideals, but it narrows the maintenance burden to a single tool designed for this purpose. It would allow Spyglass development to continue without interruption and would be easier to maintain.
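
For reference, a minimal sketch of the shape of option 1, version detection at read time; the handler functions are hypothetical placeholders:

# Sketch of option 1: detect the format version recorded in the file and
# dispatch to the matching reader. Handlers are placeholders, not Spyglass code.
import h5py
from packaging.version import Version


def _read_spatial_series_pre_2_5(path):   # placeholder for the old layout
    ...


def _read_spatial_series_post_2_5(path):  # placeholder for the new layout
    ...


def read_spatial_series(nwb_path: str):
    with h5py.File(nwb_path, "r") as f:
        raw = f.attrs.get("nwb_version", "2.0.0")  # root attribute in NWB 2.x files
    version = Version(raw.decode() if isinstance(raw, bytes) else str(raw))
    if version >= Version("2.5.0"):
        return _read_spatial_series_post_2_5(nwb_path)
    return _read_spatial_series_pre_2_5(nwb_path)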

@CBroz1 CBroz1 added the infrastructure Unix, MySQL, etc. settings/issues impacting users label Mar 7, 2024
@edeno edeno added the enhancement New feature or request label Mar 7, 2024
lfrank (Contributor, Author) commented Mar 8, 2024 via email

CBroz1 (Member) commented Mar 19, 2024

Thinking through some ideas here with the help of diagrams ...

Strategy

In an effort to monitor an analysis and generate a corresponding export, I found I need to identify 'leaf' nodes in the resulting subgraph and the restriction applied to each leaf. If a user restricts a table while an export is active, I can keep track of the table as a leaf and the restriction applied, making the assumption that downstream tables are not needed. This has two requirements for the user: (1) all custom tables must inherit SpyglassMixin, and (2) any analyses that use all entries in a table must run a null restriction, (Table & True).

Q1: Are these assumptions valid? Are these acceptable requirements?

In order to generate the MySQL dump, I will need a list of all required tables in the graph and the restriction that gives me the subset of each table required for downstream foreign-key references.
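
A rough sketch of the tracking idea (not the actual mixin code): a hook on restriction that logs the table and restriction whenever an export is active. EXPORT_LOG and export_active are illustrative names, not existing Spyglass attributes.

# Rough sketch of the logging hook; a user table opts in by inheriting the
# mixin ahead of its DataJoint tier, e.g. class MyAnalysis(ExportTrackingMixin, dj.Computed).
import datajoint as dj

EXPORT_LOG = []        # list of (full_table_name, restriction) pairs
export_active = False  # toggled by hypothetical start/stop export calls


class ExportTrackingMixin:
    def __and__(self, restriction):
        result = super().__and__(restriction)
        if export_active:
            # treat the restricted table as a leaf; downstream tables assumed unneeded
            EXPORT_LOG.append((self.full_table_name, restriction))
        return result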

Restrictions

graph TD;
    classDef lightFill fill:#DDD,stroke:#999,stroke-width:2px;
    R2(Root2) --> M1(Mid1);
    M1 --> M2(Mid2);
    M2 --> M3(Mid3);
    M3 --> L1(Leaf1);
    M2 --> L2(Leaf2);
    R3(Root3) --> M3;
    R1 --> M1;
    R1(Root1) --> P1(Periph1);
    P1 -.-> M3;

A given leaf depends on all of its ancestors. By looking at the paths from a given leaf to each root among those ancestors, I have subgraphs through which I can usually cascade up a restriction:

mid3_restr = ((Leaf1 & restr) * Mid3).fetch(*Mid3.primary_key)

This restriction on Mid3 is the same whether Leaf1 is tracked up to Root2 or Root3. Tracking both Leaf1 and Leaf2 to Root2, however, results in two Mid2 restrictions that need to be compared and possibly joined, which has been a slow process. I'm working on reducing the redundancy of calculating Mid2 restrictions.
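
As a sketch of that cascade along one ordered path (table objects stand in for the diagram's nodes; the proj() handling of clashing secondary keys discussed below is omitted):

# Sketch: cascade a leaf restriction up a path leaf -> ... -> root, collecting
# a primary-key restriction for each table along the way.
def cascade_restriction(leaf, restr, path):
    restrictions = {}
    current = leaf & restr
    for parent in path:  # ordered from the leaf's parent up to the root
        keys = (current * parent).fetch(*parent.primary_key, as_dict=True)
        restrictions[parent.full_table_name] = keys
        current = parent & keys
    return restrictions

# e.g. cascade_restriction(Leaf1(), restr, [Mid3(), Mid2(), Mid1(), Root2()])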

Q2: Can I assume that Leaf1 -> Mid2 restriction is the same as Leaf2 -> Mid2 restriction? Or might these be different subsets?

Peripheral nodes

graph TD;
    classDef lightFill fill:#DDD,stroke:#999,stroke-width:2px;
    R2(Root2) --> M1(Mid1);
    M1 --> M2(Mid2);
    M2 --> M3(Mid3);
    M3 --> L1(Leaf1);
    M2 --> L2(Leaf2);
    R3(Root3) --> M3;
    R1 --> M1;
    R1(Root1) --> P1(Periph1);
    style P1 stroke:#A00
    P1 -.-> M3;
    class L2 lightFill;
    class R2 lightFill;
    class R3 lightFill;

Spyglass has a handful of tables I've referred to as 'peripheral nodes', like IntervalList, that are often inherited as secondary keys (dotted line above), which complicates the restriction process. Currently, I project (proj()) the join of leaf and parent to avoid clashing secondary keys, but this grabs all of Root1 rather than an appropriate restriction. To fix this, I could...

  1. Ensure the path accounts for primary- vs. secondary-key inheritance during the join.
  2. Ban peripheral nodes from path-finding, forcing Leaf1 -> Mid1 -> Root1, and then add peripheral nodes back as special-case pseudo-roots (I previously needed to ban these nodes when searching for paths that included multiple merge tables); see the sketch after this list.
  3. Ask users for restrictions on IntervalList as a special case.
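
A sketch of how option 2 could look using DataJoint's networkx-based dependency graph; the peripheral table name is illustrative:

# Sketch of option 2: copy the dependency graph, drop peripheral tables before
# path-finding, then handle them separately as pseudo-roots.
import datajoint as dj
import networkx as nx

PERIPHERAL = ["`common_interval`.`interval_list`"]  # illustrative full table names

conn = dj.conn()
conn.dependencies.load()
graph = nx.DiGraph(conn.dependencies)  # copy so the shared graph is untouched

graph.remove_nodes_from([n for n in PERIPHERAL if n in graph])

# Path-finding now routes Leaf1 -> Mid1 -> Root1 without detouring through
# peripheral nodes, which get their restrictions as a special case afterward.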

Q3: Are there any current patterns governing when IntervalList is inherited as a primary key vs. a secondary key?

Other peripheral nodes include common_nwbfile tables, Nwbfile and AnalysisNwbfile, but I use fetch_nwb to generate the file list for the export.

Q4: Will this upload include the Nwbfile, or just Analysis files? I remember it being the latter in our discussion.

lfrank (Contributor, Author) commented Mar 19, 2024

To address only Q4, the upload will also include the Nwbfile.

CBroz1 (Member) commented Mar 21, 2024

We discussed recording an export based on paper_id and analysis_id. This will allow users to revise an export based on an individual notebook/figure rather than starting over. It seems likely, however, that the MySQL export would be redundant across analyses.

Q5: Do users need to be able to generate separate mysqldump/docker containers for each analysis? Or is it safe to combine across analyses for a given paper export?
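
One hypothetical way to record this, sketched as a DataJoint table keyed by export with paper and analysis recorded alongside; the schema and field names are assumptions, not the tables that ship with Spyglass:

# Hypothetical sketch of recording exports by paper and analysis.
import datajoint as dj

schema = dj.schema("common_usage")  # assumed schema name


@schema
class ExportSelection(dj.Manual):
    definition = """
    export_id: int auto_increment
    ---
    paper_id: varchar(32)
    analysis_id: varchar(32)
    export_time=CURRENT_TIMESTAMP: timestamp
    """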

lfrank (Contributor, Author) commented Mar 21, 2024 via email

CBroz1 (Member) commented Mar 21, 2024

My current solution is to hook into dj.restrict with the mixin. I made some progress collecting and cascading restrictions to other tables, but I ran into the following issue: multiple restrictions on one line yield separate entries in my log.

  • table & {'a': 'b'} & 'c = "d"' logs as two entries, with rows that meet either restriction queued for export
  • table & 'a = "b" AND c = "d"' grabs intersection, with entries that meet both criteria queued for export

In the first case, this means exporting much more of the table than we intend.

To address this, I could ...

  1. Assume that two successive restrictions on a given table came from the first case and merge them.
  2. Document this as a limitation of the process and provide tools for the user to merge restrictions.
  3. Search for a new means of monitoring analyses that looks at the data the tables return.

(1) would work with my existing design, but there would be an edge case where I would mistakenly merge something like the snippet below. So long as the restrictions were mutually exclusive (e.g., field = a, field = b), I think I could catch this edge case. How likely is something like the snippet below? How likely is it that restr1 and restr2 are not mutually exclusive?

step1 = my_table & restr1
step2 = my_table & restr2
do_something(step1, step2)

(2) ensures no such edge cases, but requires changes to users' analysis scripts: table & a & b -> table & dj.AndList([a, b]). This requires more documentation/education on the limits of this process. This could include warnings like "I see you just restricted the same table twice. Remember to use AndList when..."

(3) will lead to more complicated restrictions based on fetched content, slowing down the process. Rather than being able to store the human-generated nwb_file_name LIKE "prefix%", I'd store nwb_file_name = 'prefix1' OR nwb_file_name = 'prefix2', and it would require a more complicated monitoring process on the mixin.
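
For option (2), the user-facing change is small: combine restrictions explicitly so the log sees one intersection rather than two entries. The restriction values here are placeholders:

# Option (2) from the user's side: one explicit AND-ed restriction instead of
# two chained ones.
import datajoint as dj

restr_a = {"nwb_file_name": "example20240307_.nwb"}
restr_b = 'interval_list_name = "pos 0 valid times"'

# Instead of:  my_table & restr_a & restr_b   (logged as two restrictions)
combined = dj.AndList([restr_a, restr_b])     # logged as a single intersection
# restricted = my_table & combined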

samuelbray32 (Collaborator) commented:

Just brainstorming: Would it make sense to monitor at the fetch function instead of restrict?

CBroz1 (Member) commented Mar 22, 2024

Is used data always fetched? I imagined that there might be operations without fetching, but I can try to hook into fetch.

samuelbray32 (Collaborator) commented:

I can't think of a case where I have used data in an analysis without a fetch. This might be a good question to float during group meeting to confirm, though.

lfrank (Contributor, Author) commented Mar 22, 2024 via email

edeno (Collaborator) commented Mar 23, 2024

Unrelated to the immediate conversation above, but one thought from a conversation with @samuelbray32 is that we should perhaps be recording spyglass.__version__ somewhere in the analysis NWB file so that we know which version created it. This would store the version, git hash, and date, which is nice. It wouldn't work retroactively, but it would certainly help with forensics later if someone were trying to reproduce an analysis.
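
A sketch of gathering that provenance at file-creation time; where exactly it lands in the NWB file (e.g., source_script or scratch) is left open, and the helper below is hypothetical:

# Sketch: collect version, git hash, and date for writing into an analysis file.
import subprocess
from datetime import datetime, timezone

import spyglass


def spyglass_provenance() -> str:
    try:
        git_hash = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=spyglass.__path__[0], text=True
        ).strip()
    except (subprocess.CalledProcessError, OSError):
        git_hash = "unknown"  # e.g., installed from a wheel without git metadata
    return (
        f"spyglass=={spyglass.__version__} git={git_hash} "
        f"written={datetime.now(timezone.utc).isoformat()}"
    )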

CBroz1 (Member) commented Mar 25, 2024

Do we have any preference for how these exports are organized? My current assumptions are ...

  • My process yields one export command per table (mysqldump schema table --where="restriction"). A set of commands will be appended to a single bash script: ExportSQL_{export_id}.sh (see the sketch after this list).
  • The resulting exports can all append to the same sql file (mysqldump ... >> Populate_{export_id}.sql)
  • These sh and sql files should be stored in a new spyglass-managed directory (e.g., SPYGLASS_EXPORT_DIR = /stelmo/nwb/export) with an export_id subfolder
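
A sketch of how those pieces might be assembled; the paths, log format, and example entry are assumptions:

# Sketch: turn logged (schema, table, where) tuples into the bash script and
# SQL file described above.
import os
from pathlib import Path

export_id = 1
export_dir = Path(os.environ.get("SPYGLASS_EXPORT_DIR", "/stelmo/nwb/export"))
paper_dir = export_dir / f"export_{export_id}"
paper_dir.mkdir(parents=True, exist_ok=True)

logged = [  # placeholder entries collected during the export session
    ("common_session", "session", "nwb_file_name='example20240307_.nwb'"),
]

lines = ["#!/bin/bash"]
for schema_name, table, where in logged:
    lines.append(
        f'mysqldump {schema_name} {table} --where="{where}" '
        f">> {paper_dir}/Populate_{export_id}.sql"
    )
(paper_dir / f"ExportSQL_{export_id}.sh").write_text("\n".join(lines) + "\n")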

Some protections are currently in place to prevent direct database access for all non-admin users. This means that, at present, the export scripts would need to be run by an admin who would...

  • upload the sh script to the server hosting the database instance
  • log in and mount the data directory
  • run the script

Do we want to reduce these protections? Should a user be able to run their own export?

lfrank (Contributor, Author) commented Mar 26, 2024 via email

CBroz1 mentioned this issue Mar 28, 2024 (4 tasks)
CBroz1 (Member) commented Mar 28, 2024

#875 🎉 progress

samuelbray32 (Collaborator) commented:

A low-priority note for a future issue that I'm going to place here:

Our current DANDI upload plan assumes all intermediate files are available locally where the export is happening. You could imagine a case where collaborators on different networks do different parts of an analysis that need to be exported and shared together. We can definitely use kachery to help with this once the needed analysis NWBs are known. Just putting the thought here for if/when the issue comes up.

samuelbray32 (Collaborator) commented:

I was working on (3) and hit a new issue with the export. While individual analysis files are valid with DANDI, the collection raises an error because each analysis file from the same Session inherits the same original object_id.

A future solution could be to give each analysis NWB file a unique object_id during creation in AnalysisNwbfile.create().

I'm not aware of any issues that changing the object_id in existing files would cause, but I would like to hear any feedback.
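
For reference, pynwb/hdmf containers can regenerate their object IDs; rewriting an existing file with fresh IDs looks roughly like the documented export pattern below, and the same generate_new_id() call could happen inside AnalysisNwbfile.create() before the initial write. File paths are placeholders.

# Sketch: rewrite an existing analysis file with fresh object IDs.
from pynwb import NWBHDF5IO

with NWBHDF5IO("analysis_file.nwb", mode="r") as read_io:
    nwbfile = read_io.read()
    nwbfile.generate_new_id()  # new IDs for the file and all contained objects
    with NWBHDF5IO("analysis_file_new_ids.nwb", mode="w") as export_io:
        export_io.export(src_io=read_io, nwbfile=nwbfile)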

lfrank (Contributor, Author) commented May 3, 2024 via email

samuelbray32 (Collaborator) commented May 3, 2024

There is an object_id for the whole NWB file, which is inherited. There are also object_ids for things like electrode groups that are inherited as well (essentially anything we don't pop off from the original file).

I need to verify, but the object_id we add at create should be unique, so at least from the Spyglass side we should be OK.

lfrank (Contributor, Author) commented May 4, 2024 via email

samuelbray32 mentioned this issue May 7, 2024 (7 tasks)
edeno (Collaborator) commented Jun 6, 2024

@samuelbray32 @CBroz1 As far as I'm aware, the remaining steps here are:

  1. Docker image with database.
  2. Testing on the ms stim dataset
  3. Coordinating with DANDI

Does that sound right?

edeno (Collaborator) commented Jul 30, 2024

Progress on step 1 in #1048
