Data dump to GitHub #355

joncison · 2018-09-04T11:27:47Z

Nightly dump of all content (in XML and JSON formats?) to GitHub, as a convenience (or at least, to begin with, just a one-off dump).

joncison · 2018-12-08T10:48:44Z

We already have a repo for this (https://github.com/bio-tools/bio.tools-content) but the name may be a bit crappy? How about:

Preferences? I'll need to spell out this is strictly for experimental purposes (like what I said here already).

And in what format:

XML (we already have the schema, so files can be validated directly; biotoolsSchema 3.0.0 XML will be supported by bio.tools in the next release) - my preference
JSON (format natively supported by bio.tools; preferred by web devs?) - requires a shim for conversion to XML/validation
YAML (format natively supported by bio.tools; most readable format?) - requires a shim for conversion to XML/validation

Preferences?

I'd personally prefer XML because it will make the validation direct and easier (and avoid any drift to using not very rigorous JSON schema equivalents of biotoolsSchema etc.)

And what about the structure - I propose one folder per tool, where the folder name is the bio.tools toolID - which allows for adding other tool descriptors / files / formats under a common directory. Also one XML with everything in.

Preferences?

cc @bgruening @hmenager @hansioan : what do you think?

bgruening · 2018-12-08T10:56:22Z

My gut feeling is https://github.com/bio-tools/tools. It's bio.tools so tools makes a lot of sense :)

I would prefer YAML, as this is currently the easiest format for people to edit in an editor or browser. This can change if we dump the final version and when we have a curation interface, but for now I would prefer YAML. The shim is hopefully not complicated to write and would be used on CI to 1) convert it to XML and 2) validate any changes.

Thanks @joncison for working on this.

joncison · 2018-12-08T11:01:49Z

OK thanks!

Just a note that https://dev.bio.tools/api/tool/ is currently serving a mess (in all of XML, JSON and YAML :) ) - but this has been sorted locally.

I plan to play more with shims next week, so let's see how this goes ...

And @bgruening - what about the directory structure; are you happy with folders as we talked about previously?

scapella · 2018-12-08T20:22:35Z

IMHO I'd go either for https://github.com/bio-tools/resources or https://github.com/bio-tools/content as there is more content than just tools - Regarding the structure of the repo I'd suggest to have a general folder for bio.tools at the repo and then one per tool. How do you want to handle versions? different subfolders in the same tools folders? Cheers, Salva

hansioan · 2018-12-10T11:39:29Z

@joncison @hmenager @bgruening
Why not all of them? I would prefer it to be JSON of course :) , but perhaps the best is to have all three (JSON, XML, YAML). bio.tools supports that.

https://bio.tools/api/t?page=1&format=json
https://bio.tools/api/t?page=1&format=yaml
https://bio.tools/api/t?page=1&format=xml

In the case of biotoolsSchema xml for now we only have that on a per tool basis (example shown on dev but will soon work on production too)
https://dev.bio.tools/api/signalp?format=xml
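
For anyone scripting against the list endpoint quoted above, a minimal sketch of building the per-page, per-format URLs follows. It only assembles URLs and makes no request; the helper name `page_url` is hypothetical, and the set of accepted formats is taken from the three links in this comment.

```python
from urllib.parse import urlencode

# List endpoint as quoted in the comment above.
API_ROOT = "https://bio.tools/api/t"


def page_url(page: int, fmt: str = "json") -> str:
    """Build the URL for one page of the tool list in the given format."""
    if fmt not in {"json", "yaml", "xml"}:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{API_ROOT}?{urlencode({'page': page, 'format': fmt})}"


# One URL per format for the first page.
urls = [page_url(1, fmt) for fmt in ("json", "yaml", "xml")]
```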

jlgelpi · 2018-12-10T11:45:11Z

I would go for a single format for the repository (one that can be easily checked against a schema). Having several formats may introduce inconsistencies. Perhaps we can accept any input format (XML, JSON, YAML) and convert the data on the pull request.

joncison · 2018-12-10T11:55:20Z

Please let us know what you think @hmenager then I'll write back addressing all comments above ...

bgruening · 2018-12-10T11:59:26Z

And @bgruening - what about the directory structure; are you happy with folders as we talked about previously?

Yes. Folders are good.

How do you want to handle versions? different subfolders in the same tools folders?

Most likely. Would make sense. Whatever we do, we can change this easily later on. So nothing is set in stone imho.

I would prefer it to be JSON of course

@hansioan any reason? JSON is a subset of YAML so that should be fine for both worlds and conversion is easy.

Perhaps we can accept any input format (XML, json, yaml) and convert the data on the pull request.

I guess the idea was to accept only one format and then add all validation on CI. This validation could happen by intermediate conversion to XML if @joncison thinks that's best.
I would prefer only one format in the main repo to not confuse users, but if other formats are needed we can have a bot that converts them automatically and syncs them to a bio.tools-json repo etc. ...
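
As a rough idea of what such a conversion shim could look like, here is a minimal sketch that turns a JSON record into XML and checks that the result is well-formed. The key-to-element mapping and the sample record are assumptions for illustration, not the biotoolsSchema mapping, and validation against the actual XSD would need an extra library that is not shown here.

```python
import json
import xml.etree.ElementTree as ET


def to_element(tag, value):
    """Recursively convert a JSON value into an XML element.

    Hypothetical mapping: dict keys become child elements and list items
    repeat the element. Lists nested directly in lists are not handled.
    """
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            if isinstance(child, list):
                for item in child:
                    elem.append(to_element(key, item))
            else:
                elem.append(to_element(key, child))
    elif value is not None:
        elem.text = str(value)
    return elem


def json_to_xml(text, root_tag="tool"):
    """Serialize one JSON record as an XML string."""
    return ET.tostring(to_element(root_tag, json.loads(text)), encoding="unicode")


# Illustrative record only; the field names are assumptions.
record = '{"name": "SignalP", "biotoolsID": "signalp", "toolType": ["Web application", "Command-line tool"]}'
xml_text = json_to_xml(record)

# Well-formedness check: raises xml.etree.ElementTree.ParseError on bad XML.
parsed = ET.fromstring(xml_text)
```

On CI, a step like this would run before schema validation, so that a malformed conversion fails early.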

redmitry · 2018-12-10T12:10:35Z

Hello,

I know that for a mere human being the form ?page=1&format=json is the natural way, as it allows using an ordinary browser for the GET requests, but in terms of REST architecture it is better to use headers:

Accept: application/json
Range: tools=10-30
Response:
Content-Type: application/json
Content-Range: tools 10-30/20000

The advantage of standard HTTP pagination is that a client knows the total size from the beginning (headers go before the body) and can calculate the number of pages in the table while loading only one page.

Of course, nothing prevents implementing both forms.
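
To illustrate the point about computing the page count from the headers alone, here is a minimal client-side sketch that parses the `Content-Range` value from the example above. The `tools` range unit comes from this comment and is not something bio.tools is known to serve; the function names are hypothetical.

```python
import math
import re

# Matches values such as "tools 10-30/20000" (unit, first, last, total).
CONTENT_RANGE = re.compile(
    r"^(?P<unit>\w+) (?P<first>\d+)-(?P<last>\d+)/(?P<total>\d+)$"
)


def parse_content_range(value):
    """Split a Content-Range header value into (unit, first, last, total)."""
    match = CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised Content-Range: {value!r}")
    return (
        match["unit"],
        int(match["first"]),
        int(match["last"]),
        int(match["total"]),
    )


def page_count(value):
    """Number of pages, assuming every page is as large as the one returned."""
    _, first, last, total = parse_content_range(value)
    page_size = last - first + 1  # ranges are inclusive at both ends
    return math.ceil(total / page_size)


unit, first, last, total = parse_content_range("tools 10-30/20000")
pages = page_count("tools 10-30/20000")
```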

hansioan · 2018-12-10T12:20:27Z

Regarding the versions... Technically in bio.tools we don't have a fine grained track of tool versions. In the new schema 3.0 which will go in this week if not today we allow multiple version annotations per tool. The reason for this is that if there is no difference in annotation for a set of tool versions, they all go in the same tool annotation.
If there IS a (significant) difference in tool functionality -> thus annotation between different tool versions, then that tool, along with the version will go into a separate tool entry, given its own tool id, with separate annotation and so on...

Thus I am not really sure if any of the folder structure is needed, at least not at the core of the tool descriptions. We can have the option of providing multiple copies of the same tool, separated by version, but that should be something that results from the initial structure, and not something that IS the initial structure.
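
The rule described above (versions with identical annotation share one entry, versions with different annotation get separate entries) can be sketched as a simple grouping. This is only an illustration under assumptions: the annotation is reduced to a hashable tuple, and the version numbers and function labels are invented.

```python
from collections import defaultdict


def group_versions(annotations):
    """Group versions that share the same annotation into one entry.

    `annotations` maps a version string to a hashable annotation
    (here a tuple of function labels, purely for illustration).
    """
    groups = defaultdict(list)
    for version, annotation in annotations.items():
        groups[annotation].append(version)
    return [
        {"annotation": annotation, "versions": versions}
        for annotation, versions in groups.items()
    ]


# Hypothetical tool: two versions are identical, a third adds a function.
entries = group_versions(
    {
        "1.2": ("alignment trimming",),
        "1.3": ("alignment trimming",),
        "2.0": ("alignment trimming", "format conversion"),
    }
)
```

Under this model the first two versions would share a single tool entry, while the third would get its own entry and tool ID.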

scapella · 2018-12-10T12:49:56Z

@hansioan We have been arguing for quite a while about versioning and the best way to track that information. I think keeping it in mind from the very beginning of this effort might save us from investing much more time and effort at a later stage. I agree with you that when there are no changes among versions, it is easy to handle. However, when there are major changes among versions of the same program, they should be modelled in the same entry rather than in an independent entry. For instance, if I look for trimAl and then go to the bio.tools entry, I'd like to have access to the latest version. It is quite likely that I'm not aware that there are two versions and would look for the generic tool name. Cheers, Salva

hansioan · 2018-12-10T13:15:32Z

Yes, but having version-specific information for each tool gets us back to 2 years ago when tools were accessed like https://bio.tools/toolid/version

This way was basically creating a tool whenever a new version appeared, and in 90% of all cases there were no (zero) differences between the annotations, except for the version property. We had a very famous example of a tool that appeared over 10 times in bio.tools with the same annotation, because the people were just going in and updating the version information whenever they released a new version (e.g. new tool between tool version 1.2.23 and 1.2.24).

There is no good way to do separate versions for each tool except modeling this in the API request, and even if there was we would still have to store versioned tools in the database.

While this can certainly apply to things like conda, containers and other projects that require the exact versions, I don't think it applies as much to bio.tools. We must remember that 90% of our users just want to find a tool that meets their scientific requirements (focus on find).

All this being said, I am not opposed to having a good solution that can work for everyone, it is just something which is complicated and not in our list of main tasks right now. We have opened the code and once all the remaining plumbing tasks are done and we are ready to accept pull requests, perhaps this can be one of the initial tasks for contributors.

bgruening · 2018-12-10T13:17:24Z

@redmitry JSON will be served from the web service. But this issue is about the data storage in GitHub; no one disagrees that JSON will be served from the web service, imho.

@scapella @hansioan don't let the version discussion stop the initial drop, please. Versioning can be added at any time. Subfolders are easy to add for anyone if they care about the version or if the difference between the tools/benchmarks is too big. You simply need to adjust your bio.tools-github-parser to traverse all folders recursively. And people who don't need this can simply assume that the latest version is in the root dir. Really not a big deal. We can add this later if we need to.
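
A minimal sketch of the traversal described above, assuming one folder per tool with optional version subfolders. The `.xml` suffix, the `3.0` subfolder name and both function names are assumptions made for the demo.

```python
import tempfile
from pathlib import Path


def all_descriptors(root, suffix=".xml"):
    """Every descriptor under root, including version subfolders."""
    return sorted(p for p in Path(root).rglob(f"*{suffix}") if p.is_file())


def latest_descriptors(root, suffix=".xml"):
    """Only files directly inside each tool folder (assumed latest version)."""
    return sorted(p for p in Path(root).glob(f"*/*{suffix}") if p.is_file())


# Demo on a throwaway directory tree with one tool and one version subfolder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "signalp" / "3.0").mkdir(parents=True)
    (root / "signalp" / "signalp.xml").write_text("<tool/>")
    (root / "signalp" / "3.0" / "signalp.xml").write_text("<tool/>")
    everything = [p.relative_to(root).as_posix() for p in all_descriptors(root)]
    latest = [p.relative_to(root).as_posix() for p in latest_descriptors(root)]
```

A parser that ignores versions would call only `latest_descriptors`, which is why adding subfolders later should not break it.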

hmenager · 2018-12-10T14:23:40Z

@joncison As far as I'm concerned YAML would probably be the best choice, because:

- it is the easiest format to track changes with git,
- it is easier to manually edit for many people.

For the repository, I would go for either https://github.com/bio-tools/tools or https://github.com/bio-tools/content - not resources.

scapella · 2018-12-10T15:33:34Z

Fine with me to keep the versioning stuff on the radar, but not to stop the dumping process. Salva

joncison · 2018-12-18T14:04:00Z

Quick update - will be revisiting this in new year - but for now a few points:

repo name: most likely https://github.com/bio-tools/content, because the bio.tools scope is broad: "tool" covers many types of software (command-line tools, Web applications, database portals, workflows etc.)
repo structure: most likely one folder per tool, with biotoolsIDs as folder names. Folders will allow other files and sub-folders to be added in future as needed, e.g. alternative formats, or even other tool descriptors, wrappers, test data etc.
metadata format: biotoolsSchema 3.0.0-compatible XML to begin with. The priority in the 1st instance is to achieve content integration with other projects (BioConda, BioContainers, Galaxy etc), hence the need to prioritise ease of validation and (I strongly suspect) updating biotoolsSchema to enable this integration
YAML format: can come later, once the integration use-case is advanced and individual developers are more of a priority. It has to wait for the shims, which I'll play with soon-ish.
initial dump: likely will be everything (easier)
version information: there are well-established guidelines on how tool versions are currently handled. The current model allows specification of version information in a pragmatic / flexible way, including for the entry itself, relevant downloads and publications. There is certainly scope to improve this model, but let's take that discussion here in 1st instance - with a view to a better rendering of version-specific info in bio.tools. In future (with the new content architecture), we could go further, but one thing at a time ...
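
The proposed layout (one folder per tool, named by biotoolsID) could be produced by a dump script along these lines. This is a sketch under assumptions: the records are taken to be already-serialized XML strings, and the `<biotoolsID>.xml` file name inside each folder is a placeholder, since the thread does not fix one.

```python
import tempfile
from pathlib import Path


def write_dump(records, root):
    """Write one folder per tool, named by its biotoolsID.

    `records` maps biotoolsID -> serialized XML text. The file name inside
    each folder is an assumption made for this sketch.
    """
    written = []
    for tool_id, xml_text in sorted(records.items()):
        folder = Path(root) / tool_id
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / f"{tool_id}.xml"
        target.write_text(xml_text, encoding="utf-8")
        written.append(target)
    return written


# Demo with two placeholder records in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_dump({"signalp": "<tool/>", "trimal": "<tool/>"}, tmp)
    layout = [p.relative_to(tmp).as_posix() for p in paths]
```

Keeping each tool in its own folder leaves room for the other descriptors, wrappers and test data mentioned above without changing the top-level structure.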

Let's keep this issue for the data dump and use this for technical discussions about a GitHub-based content architecture.

Pls. bear in mind the priority on the DK side is getting the deployment and open-dev process sorted, critical / high priority issues scheduled for the 2019 Q1 release, the website redesign, and other features with direct impact on end-users.

The new content architecture under discussion would be awesome, but depends on other components including an independent curators interface e.g. based on edamToolAnnotator and independent validation mechanism, e.g. biotoolsLint. It's a lot of work, hence a matter of priorities.

joncison · 2019-01-25T15:42:36Z

This issue was moved to research-software-ecosystem/content#2

joncison added the content Concerns bio.tools content. label Sep 4, 2018

joncison self-assigned this Sep 4, 2018

joncison changed the title ~~Data dump from GitHub~~ Data dump to GitHub Sep 4, 2018

This was referenced Sep 4, 2018

Can't get over 10000 results from the API #310

Closed

Add per_page parameters to API #111

Open

joncison added data model / integrity / quality Concerns the underlying data model (verification, validity etc.) discussion General discussion around bio.tools. and removed data model / integrity / quality Concerns the underlying data model (verification, validity etc.) labels Dec 8, 2018

joncison mentioned this issue Dec 18, 2018

Technical issues around GitHub-based content (placeholder discussion) #399

Closed

This was referenced Jan 25, 2019

Technical issues around GitHub-based content management (placeholder) research-software-ecosystem/content#1

Open

Data dump to GitHub research-software-ecosystem/content#2

Open

joncison closed this as completed Jan 25, 2019
