-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added the two datasets from the surf publication (#195) #199
Conversation
* Added the two datasets from the surf publication Borylation dataset and minisci dataset. * Create test_file Adding the file to make a commit. It will be subsequently removed in a new commit. I think this will reset the check target branch test. * Delete data/test_file removing this file because it is has served its purpose. The check target branch test was reset to see the correct target branch.
Change summary:
|
Change summary:
|
@connorcoley @skearnes I was able to pull this through into open-reaction-database/ord-data:#195 without getting approval. It should probably be checked at this stage before it is approved to go into main. |
Yes, that's expected; we don't protect any branches except for |
Change summary:
|
@qai222 do you have time to take a look at these for correctness? |
Change summary:
|
Change summary:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the pbtxt files from #195, specifically borylation_ord.pbtxt and minisci_ord.pbtxt.
- not sure why the analysis type for yield is
isolated
analyses { key: "product_1_isolated" value { type: CUSTOM details: "isolated" }
- some reactions have yield higher than 100
identifiers { type: CUSTOM details: "rxn_id from SURF table" value: "lit_pub_bo_9" } ... measurements { analysis_key: "product_1_GC" type: YIELD percentage { value: 137.0 } }
For item 2, this is how it is recorded in the SURF table file. Since it is a GC yield I suspect it is just a calibration error. While not ideal, I think we need to report it as it is written. There are already >100% yields in the ORD from the USPTO data. This particular reaction has come from this paper: https://pubs.acs.org/doi/10.1021/acscatal.0c00152. Unfortunately the rxn does not appear in the SI, and I don't have access to the paper. |
Yeah the paper says "Yields were determined by gas chromatography and are based on moles of B 2 pin 2." in table 2 caption. I agree we should report as it is. |
Change summary:
|
Change summary:
|
Change summary:
|
@qai222 I fixed the surf2ord program to correctly label "isolated" yield types as the WEIGHT measurement type enum in ORD. The problem was some rogue capitalization in their code. As discussed above, point 2 (the >100% yields) are left as they were reported. I made a bit of a mess when replacing the minisci dataset file, but it looks like it has been resolved now, and the PR only shows the 2 datasets and the reaction count is correct. I downloaded them to confirm they are the correct versions. |
Change summary:
|
Change summary:
|
Change summary:
|
Borylation and minisci datasets from the SURF publication (ChemRxiv, 2024, 10.26434/chemrxiv-2023-nfq7h-v2 D O I: 10.26434/chemrxiv-2023-nfq7h-v2 [opens in a new tab]). These are reactions which have been collected from the literature and summarised in SURF format by @alexarnimueller.
The Jupyter Notebook used to convert the datasets is located at bdeadman/surf/surf2ord_troubleshooting.ipynb. The surf2ord.py script has been modified to output data into the latest ord-schema version and preferred style.
Notes: