Features to allow sharing of full pipeline and results for paper publication #861
Comments
I think (1) the selective dump of mysql is readily achievable. It might involve generating a list of the tables associated with paper data, and then using datajoint to generate the restriction for use during the mysqldump process.

I predict some 'devil-in-the-details' when it comes to replication with uploaded files. If the Analysis files are edited to be DANDI compliant, issues may arise in any portion of our pipeline/analysis/visualization code downstream of those edits. Is it fair to say that DANDI compliance is a moving target? Is our Spyglass-native format fixed? I can envision the following approaches:

1. Periodic updates of all portions of Spyglass ingestion processes to detect the format version and handle it accordingly.
2. Periodic maintenance to bring us into DANDI compliance across (a) Spyglass pipelines, (b) server files, (c) mysql checksum tables.
3. A single bidirectional translation mechanism to edit files prior to upload and reverse the edits on download.

We took option 1 for spatial series changes in NWB v2.5. It seems to have worked out well, but I know the permutations could balloon when we multiply across X data types, Y spyglass tables, and Z NWB version changes over time. Option 2 carries a heavy maintenance burden across a large set of files that may never be needed. This is achievable for a subset of the data selected for a given upload, but it would need to be accompanied by record keeping of when something was updated and for which DANDI/NWB version, to prevent issues with accessing existing data. I think @samuelbray32 has already started exploring this. Option 3 least aligns with open-science ideals, but it narrows the maintenance burden to a single tool designed for this purpose. It would allow Spyglass development to continue without interruptions and would be easier to maintain. |
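A minimal sketch of how that table list plus a DataJoint restriction could be turned into a selective dump, assuming an open DataJoint connection; the table and restriction in the usage comment are placeholders, and the SQL quoting here is deliberately naive:

```python
def _fmt(value):
    """Naively format a Python value for a SQL WHERE clause (no escaping)."""
    return f"'{value}'" if isinstance(value, str) else str(value)


def dump_command(table, restriction, out_file="Populate_example.sql"):
    """Build a mysqldump command limited to the rows one table contributes.

    `table` is a DataJoint table instance and `restriction` is anything
    DataJoint accepts (dict, string, another query). The surviving primary
    keys are fetched and turned into a --where clause.
    """
    keys = (table & restriction).fetch("KEY")  # list of primary-key dicts
    if not keys:
        return None
    where = " OR ".join(
        "(" + " AND ".join(f"{k}={_fmt(v)}" for k, v in key.items()) + ")"
        for key in keys
    )
    return (
        f'mysqldump {table.database} {table.table_name} '
        f'--where="{where}" >> {out_file}'
    )


# Hypothetical usage with a placeholder table and restriction:
# print(dump_command(SomeAnalysisTable(), {"nwb_file_name": "example.nwb"}))
```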
Great points; a few responses below:
On Mar 7, 2024, at 12:37 PM, Chris Brozdowski wrote:
> I think (1) the selective dump of mysql is readily achievable. It might involve generating a list of the tables associated with paper data, and then using datajoint to generate the restriction for use during the mysqldump process.
>
> I predict some 'devil-in-the-details' when it comes to replication with uploaded files. If the Analysis files are edited to be DANDI compliant, issues may arise in any portion of our pipeline/analysis/visualization code downstream of those edits. Is it fair to say that DANDI compliance is a moving target? Is our Spyglass-native format fixed?
A very good question. I think we need to schedule a conversation with the DANDI folks to get a better sense for this. In general I believe they are trying to keep any changes in requirements minimal so as to avoid annoying everyone. As far as our files go, the changes required, with the possible exception of the spatial series, should be fairly minimal.
> I can envision the following approaches:
> 1. Periodic updates of all portions of Spyglass ingestion processes to detect the format version and handle it accordingly.
> 2. Periodic maintenance to bring us into DANDI compliance across (a) Spyglass pipelines, (b) server files, (c) mysql checksum tables.
> 3. A single bidirectional translation mechanism to edit files prior to upload and reverse the edits on download.
>
> We took option 1 for spatial series changes in NWB v2.5. It seems to have worked out well, but I know the permutations could balloon when we multiply across X data types, Y spyglass tables, and Z NWB version changes over time.
> Option 2 carries a heavy maintenance burden across a large set of files that may never be needed. This is achievable for a subset of the data selected for a given upload, but would need to be accompanied by record keeping of when something was updated and for which DANDI/NWB version to prevent issues with accessing existing data. I think @samuelbray32 has already started exploring this.
> Option 3 least aligns with open-science ideals, but it narrows the maintenance burden to a single tool designed for this purpose. It would allow spyglass development to continue without interruptions and would be easier to maintain.
Another comment on option 3 is that we need to be able to read directly from DANDI without download. That’s possible with pynwb, so in that case we need to make sure that the DANDI compatible files are also Spyglass compatible. Exactly how hard that will be is to be determined, but I remain optimistic. I’m therefore leaning toward option 2, but it’s a very slight lean.
|
Thinking through some ideas here with the help of diagrams ...

**Strategy**

In an effort to monitor an analysis and generate a corresponding export, I found I would identify 'leaf' nodes in the resulting subgraph, and the restriction applied to each leaf. If a user restricts a table while an export is active, I can keep track of the table as a leaf and the restriction applied, making the assumption that downstream tables are not needed. This has two requirements for the user: (1) all custom tables must inherit ...

Q1: Are these assumptions valid? Are these acceptable requirements?

In order to generate the MySQL dump, I will need a list of all required tables in the graph and the restriction that gives me the subset of each table required for downstream fk references.

**Restrictions**

graph TD;
classDef lightFill fill:#DDD,stroke:#999,stroke-width:2px;
R2(Root2) --> M1(Mid1);
M1 --> M2(Mid2);
M2 --> M3(Mid3);
M3 --> L1(Leaf1);
M2 --> L2(Leaf2);
R3(Root3) --> M3;
R1 --> M1;
R1(Root1) --> P1(Periph1);
P1 -.-> M3;
A given leaf is dependent on all ancestors. By looking at the paths from a given leaf to each root in those ancestors, I have subgraphs through which I can usually cascade up a restriction:

mid3_restr = ((Leaf1 & restr) * Mid3).fetch(*Mid3.primary_key)

This restriction on Mid3 is the same whether Leaf1 is tracked up to Root2 or Root3. Tracking both Leaf1 and Leaf2 to Root2, however, results in two Mid2 restrictions that need to be compared and possibly joined, which has been a slow process. I'm working on reducing the redundancy of calculating Mid2 restrictions.

Q2: Can I assume that the Leaf1 -> Mid2 restriction is the same as the Leaf2 -> Mid2 restriction? Or might these be different subsets?

**Peripheral nodes**

graph TD;
classDef lightFill fill:#DDD,stroke:#999,stroke-width:2px;
R2(Root2) --> M1(Mid1);
M1 --> M2(Mid2);
M2 --> M3(Mid3);
M3 --> L1(Leaf1);
M2 --> L2(Leaf2);
R3(Root3) --> M3;
R1 --> M1;
R1(Root1) --> P1(Periph1);
style P1 stroke:#A00
P1 -.-> M3;
class L2 lightFill;
class R2 lightFill;
class R3 lightFill;
Spyglass has a handful of tables I've referred to as 'peripheral nodes', like IntervalList, that are often inherited as secondary keys (dotted line above), which complicates the restriction process. Currently, I project ...

Q3: Are there any current patterns to when IntervalList is pk vs sk inherited? Other peripheral nodes include ...

Q4: Will this upload include the Nwbfile, or just the Analysis files? I remember it being the latter in our discussion. |
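A hedged sketch of the restriction-cascading step described above, using the generic table names from the diagrams; `Leaf1`, `Mid3`, `Mid2`, and `restr` are illustrative placeholders, not real Spyglass tables:

```python
def cascade_restriction(child, child_restriction, parent):
    """Project a restriction on a child table up to one of its ancestors.

    Joining the restricted child with the parent and fetching the parent's
    primary key gives the subset of parent rows the child rows depend on,
    mirroring the Leaf1 -> Mid3 example above. Assumes the join is valid
    (no conflicting secondary attribute names).
    """
    keys = ((child & child_restriction) * parent).fetch(
        *parent.primary_key, as_dict=True
    )
    # Many child rows can reference the same parent row, so de-duplicate.
    return [dict(t) for t in {tuple(sorted(k.items())) for k in keys}]


# Hypothetical usage, repeating one level at a time toward the roots:
# mid3_restr = cascade_restriction(Leaf1(), restr, Mid3())
# mid2_restr = cascade_restriction(Mid3(), mid3_restr, Mid2())
```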
To address only Q4, the upload will also include the Nwbfile. |
We discussed recording an export based on paper_id and analysis_id. This will allow users to revise an export based on an individual notebook/figure rather than starting over. It seems likely, however, that the MySQL export would be redundant across analyses.

Q5: Do users need to be able to generate separate mysqldump/docker containers for each analysis? Or is it safe to combine across analyses for a given paper export? |
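For concreteness, one way the paper_id/analysis_id bookkeeping could be shaped as a DataJoint table; the schema name, table names, and fields are illustrative assumptions, not the actual Spyglass design:

```python
import datajoint as dj

schema = dj.schema("paper_export")  # illustrative schema name


@schema
class ExportSelection(dj.Manual):
    definition = """
    paper_id: varchar(32)      # shared across all analyses in one paper
    analysis_id: varchar(32)   # one entry per notebook/figure
    ---
    export_time = CURRENT_TIMESTAMP: timestamp
    """

    class Table(dj.Part):
        definition = """
        -> master
        table_id: int
        ---
        table_name: varchar(128)   # full schema.table name
        restriction: varchar(2048) # restriction logged during the analysis
        """
```

Keying on both ids would let a single analysis be re-recorded without touching the rest, while the MySQL dump itself could be generated once per paper_id.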
We definitely want a single database for a paper, not for each analysis.
|
My current solution is to hook into ...

This means exporting much more of the table than we intend. To address this, I could ...

(1) would work with my existing design, but there would be an edge case where I would mistakenly merge something like the snippet below. So long as the restrictions were mutually exclusive (e.g., field = a, field = b), I think I could catch this edge case. How likely is something like the snippet below? How likely is it that restr1 and restr2 are not mutually exclusive?

step1 = my_table & restr1
step2 = my_table & restr2
do_something(step1, step2)

(2) ensures no such edge cases, but requires changes to users' analysis scripts. (3) will lead to more complicated restrictions based on fetched content, slowing down the process. Rather than being able to store the human-generated ... |
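A small sketch of the overlap check implied in (1); `my_table`, `restr1`, and `restr2` come from the snippet above, and the helper names are hypothetical:

```python
def restrictions_overlap(table, restr1, restr2):
    """True if two restrictions on the same table select at least one common row.

    Chained restrictions are ANDed in DataJoint, so a non-empty intersection
    means restr1 and restr2 are not mutually exclusive.
    """
    return len(table & restr1 & restr2) > 0


def export_restriction(table, restr1, restr2):
    """Union of the rows the export needs; a list restriction is an OR in DataJoint."""
    return table & [restr1, restr2]
```

Either way, the union of rows covers what the dump needs; the check only matters for deciding whether two logged restrictions can safely be merged into one entry.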
Just brainstorming: Would it make sense to monitor at the fetch level? |
Is used data always fetched? I imagined that there might be operations without fetching, but no, I can try to hook into fetch. |
I can't think of a case where I have applied the data in analysis without a fetch. This might be a good question to float during group meeting to confirm though |
I agree; as far as I can see, tracking fetches should be sufficient.
|
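One way the fetch hook discussed above could look, as a rough sketch; the mixin, `export_is_active`, and the in-memory log are hypothetical stand-ins, not the actual Spyglass implementation:

```python
import datajoint as dj

export_log = []  # (full_table_name, restriction) pairs; illustrative in-memory store


def export_is_active():
    """Placeholder for checking whether an export is currently being recorded."""
    return True


class ExportLoggingMixin:
    """Hypothetical mixin that records every fetch made while an export is active.

    Placed ahead of the DataJoint base class in custom tables, it captures the
    table name and the restriction in effect at fetch time.
    """

    def fetch(self, *args, **kwargs):
        if export_is_active():
            export_log.append((self.full_table_name, repr(self.restriction)))
        return super().fetch(*args, **kwargs)

    def fetch1(self, *args, **kwargs):
        if export_is_active():
            export_log.append((self.full_table_name, repr(self.restriction)))
        return super().fetch1(*args, **kwargs)


# Hypothetical usage:
# @schema
# class MyTable(ExportLoggingMixin, dj.Manual):
#     definition = """..."""
```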
Unrelated to the immediate conversation above, but one thought from a conversation with @samuelbray32 is that we maybe should be recording the |
Do we have any preference for how these exports are organized? My current assumptions are ...

* My process yields 1 export command per table (mysqldump schema table --where="restriction"). A set of commands will be appended to a single bash script: ExportSQL_{export_id}.sh
* The resulting exports can all append to the same sql file (mysqldump ... >> Populate_{export_id}.sql)
* These sh and sql files should be stored in a new spyglass-managed directory (e.g., SPYGLASS_EXPORT_DIR = /stelmo/nwb/export) with an export_id subfolder

Some protections are currently in place to prevent direct database access for all non-admin users. This means that, at present, the export scripts would need to be run by an admin who would...

* upload the sh script to the server hosting the database instance
* log in and mount the data directory
* run the script
Do we want to reduce these protections? Should a user be able to run their own export? |
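A short sketch of the file layout described above; `write_export_script` and its arguments are hypothetical helpers layered on the naming scheme from the comment:

```python
import os
from pathlib import Path


def write_export_script(export_id, dump_commands):
    """Assemble per-table mysqldump commands into the proposed file layout.

    Writes ExportSQL_{export_id}.sh into an export_id subfolder of
    SPYGLASS_EXPORT_DIR; every command appends to Populate_{export_id}.sql.
    """
    base = Path(os.environ.get("SPYGLASS_EXPORT_DIR", "/stelmo/nwb/export"))
    export_dir = base / str(export_id)
    export_dir.mkdir(parents=True, exist_ok=True)
    sql_file = export_dir / f"Populate_{export_id}.sql"
    lines = ["#!/bin/bash", f": > {sql_file}  # start from an empty dump file"]
    # Each command looks like: mysqldump schema table --where="restriction"
    lines += [f"{cmd} >> {sql_file}" for cmd in dump_commands]
    script = export_dir / f"ExportSQL_{export_id}.sh"
    script.write_text("\n".join(lines) + "\n")
    return script
```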
I think it would be much easier if we could run the mysqldump remotely on a system that already mounts the data directory. It looks to me like it’s possible to do that; is that right or is there a reason we can’t do that?
As far as the user goes, we’re only going to do this a few times a year, so having the person running the script have admin privileges seems okay to me.
And then I assume we'd need to load the dumped files into a docker image or something similar, so at some point we should think about that...
|
#875 🎉 progress |
A low-priority "future issue" note I'm going to place here: our current DANDI upload plan assumes all intermediate files are available locally where the export is happening. You could imagine a case where collaborators on different networks do different parts of an analysis that need to be exported and shared together. We can definitely use kachery to aid in this once we know which analysis NWBs are needed. Just putting the thought here for if/when the bug occurs. |
Was working on (3) and hit a new issue with the export. While individual analysis files are valid with dandi, the collection contains an error because each analysis file from the same Session inherits the same original object_id.

A future solution could be to give each Analysis nwb file a unique object_id during creation in AnalysisNwbfile.create(). I'm not aware of any issues that changing object_id in existing files would cause, but I would like to hear any feedback. |
Just to clarify, which data element in the file has the same object_id, or is there an object_id for the whole file that I don’t know about?
As long as this is not the object ID of the data element that we add to that file, then there is no problem changing it.
|
There is an object_id for the whole nwbfile which is inherited. There are also object_ids for things like electrode groups that are inherited as well (essentially anything we don't pop off from the original file).

Need to verify, but the object_id we add at create should be unique. So at least from the spyglass side we should be ok. |
In that case we’re fine changing the object_id for the file going forward.
|
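To make the object_id discussion concrete, a hedged h5py sketch for spotting and reassigning duplicated file-level ids; it assumes the file-level id is stored as an `object_id` attribute on the root HDF5 group (as in recent NWB 2.x files) and does not touch the per-object ids that Spyglass records at create:

```python
import uuid
from collections import Counter

import h5py


def root_object_ids(paths):
    """Map each NWB file path to its file-level object_id (or None if absent)."""
    ids = {}
    for path in paths:
        with h5py.File(path, "r") as f:
            ids[path] = f["/"].attrs.get("object_id")
    return ids


def duplicated_root_ids(paths):
    """Group files that share a root object_id, i.e. the DANDI validation error above."""
    ids = root_object_ids(paths)
    counts = Counter(ids.values())
    return {oid: [p for p, i in ids.items() if i == oid]
            for oid, n in counts.items() if n > 1}


def reassign_root_id(path):
    """Overwrite the root object_id with a fresh UUID, editing the file in place."""
    with h5py.File(path, "r+") as f:
        f["/"].attrs["object_id"] = str(uuid.uuid4())
```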
@samuelbray32 @CBroz1 as far as I'm aware the last steps here are:
Does that sound right? |
Progress on step 1 in #1048 |
I would like us to work toward a system that allows us to make the full set of pipelines and data files for a paper available publicly. We need to discuss the details, but the solution could look something like this.
(1) Select all table entries that are necessary to recreate the contents of a paper.
(2) Export either the full database or, ideally, a database with just those entries. Possibly create a docker image for that database.
(3) Assemble and upload all of the raw and analysis NWB files to DANDI.
(4) Add a table or functionality to allow reads of files from DANDI, respecting file name changes.
(5) Make the repo for project-specific notebooks/scripts public, with notebooks that create figures.
(6) Work with DANDI to create a hub that allows people to start up a database and use Spyglass to replicate figures and other analyses.
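For point (4), a minimal sketch of reading an NWB file directly from DANDI without downloading it, following the commonly documented dandi/fsspec/pynwb streaming pattern; the dandiset id and asset path in the usage comment are placeholders:

```python
from dandi.dandiapi import DandiAPIClient
import fsspec
import h5py
import pynwb


def open_dandi_nwb(dandiset_id, asset_path, version="draft"):
    """Open an NWB file straight from DANDI without downloading it.

    Resolves the asset to its S3 URL, opens it over HTTP with fsspec, and
    hands the h5py file to pynwb. Keep the returned io open while using
    any lazily loaded data.
    """
    with DandiAPIClient() as client:
        asset = client.get_dandiset(dandiset_id, version).get_asset_by_path(asset_path)
        s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
    fs = fsspec.filesystem("http")
    h5_file = h5py.File(fs.open(s3_url, "rb"), "r")
    return pynwb.NWBHDF5IO(file=h5_file, load_namespaces=True)


# Hypothetical usage with placeholder ids:
# io = open_dandi_nwb("000000", "sub-01/sub-01_ses-01.nwb")
# nwbfile = io.read()
```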