Data dump to GitHub #355
Comments
We already have a repo for this (https://github.com/bio-tools/bio.tools-content) but the names may be a bit crappy? How about:
Preferences? I'll need to spell out this is strictly for experimental purposes (like what I said here already). And in what format:
Preferences? I'd personally prefer XML because it will make the validation direct and easier (and avoid any drift to using not very rigorous JSON schema equivalents of biotoolsSchema etc.). And what about the structure? I propose one folder per tool, where the folder name is the bio.tools toolID, which allows for adding other tool descriptors / files / formats under a common directory. Also one XML file with everything in it. Preferences? cc @bgruening @hmenager @hansioan : what do you think? |
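For illustration only, the one-folder-per-tool layout proposed above could be sketched as follows. This is a minimal sketch under assumptions: the key name `biotoolsID`, the function name `dump_tools`, and the choice of JSON as the serialisation are all hypothetical (the format was still under discussion), and the toolID is assumed to be safe to use as a directory name.

```python
import json
import tempfile
from pathlib import Path


def dump_tools(tools, root):
    """Write each tool record to <root>/<toolID>/<toolID>.json."""
    root = Path(root)
    written = []
    for tool in tools:
        tool_id = tool["biotoolsID"]  # hypothetical key holding the bio.tools toolID
        tool_dir = root / tool_id     # one folder per tool, named after the toolID
        tool_dir.mkdir(parents=True, exist_ok=True)
        path = tool_dir / f"{tool_id}.json"
        # sort_keys keeps the output stable, so git diffs stay small
        path.write_text(json.dumps(tool, indent=2, sort_keys=True), encoding="utf-8")
        written.append(path)
    return written
```

Other descriptors for the same tool could then sit next to the dumped file in the same folder.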
My gut feeling is that I would prefer YAML, as this is currently the easiest format for people to edit in an editor or browser. This can change if we dump the final version and when we have a curation interface, but for now I would prefer YAML. The shim is hopefully not complicated to write and would be used on CI to 1) convert it to XML and 2) validate any changes. Thanks @joncison for working on this. |
OK thanks! Just a note that https://dev.bio.tools/api/tool/ is currently serving a mess (in all of XML, JSON and YAML :) ) - but this has been sorted locally. I plan to play more with shims next week, so let's see how this goes ... And @bgruening - what about the directory structure; are you happy with folders as we talked about previously ? |
IMHO I'd go for either https://github.com/bio-tools/resources or https://github.com/bio-tools/content, as there is more content than just tools.
Regarding the structure of the repo, I'd suggest having a general folder for bio.tools in the repo and then one per tool.
How do you want to handle versions? Different subfolders in the same tool's folder?
Cheers,
Salva
|
@joncison @hmenager @bgruening https://bio.tools/api/t?page=1&format=json In the case of biotoolsSchema xml for now we only have that on a per tool basis (example shown on dev but will soon work on production too) |
I would go for a single format for the repository (one that can be easily checked against a schema). Having several formats may introduce inconsistencies. Perhaps we can accept any input format (XML, JSON, YAML) and convert the data on the pull request. |
Please let us know what you think @hmenager then I'll write back addressing all comments above ... |
Yes. Folders are good.
Most likely. Would make sense. Whatever we do, we can change this easily later on. So nothing is set in stone imho.
@hansioan any reason? JSON is a subset of YAML so that should be fine for both worlds and conversion is easy.
I guess the idea was to accept only one format and then on CI add all validation. This validation could happen by intermediate conversion to XML if @joncison thinks that's best. |
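As a rough sketch of the "intermediate conversion to XML" step, the snippet below turns a JSON document into XML using only the Python standard library (JSON is used here because, as noted above, it is a subset of YAML). The element names simply mirror the input keys and are not claimed to match biotoolsSchema; `to_element` and `json_to_xml` are hypothetical helper names. Actual schema validation would need an XSD-capable validator and is not shown.

```python
import json
import xml.etree.ElementTree as ET


def to_element(tag, value):
    """Recursively convert a JSON-style value into an XML element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            if isinstance(child, list):
                # a list becomes one repeated element per item
                for item in child:
                    elem.append(to_element(key, item))
            else:
                elem.append(to_element(key, child))
    elif value is not None:
        # scalars become text content; nested lists are not handled specially
        elem.text = str(value)
    return elem


def json_to_xml(text, root_tag="tool"):
    """Parse a JSON string and serialise it as an XML string."""
    return ET.tostring(to_element(root_tag, json.loads(text)), encoding="unicode")
```

On CI, the resulting XML string would then be handed to whatever validator checks it against the schema.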
Hello, I know that for a human the form ?page=1&format=json is the natural way, as it allows using an ordinary browser for GET requests, but in terms of REST architecture it is better to use headers:
The advantage of standard HTTP pagination is that a client knows the total size from the beginning (headers come before the body) and can calculate the number of pages in the table while loading only one page. Of course, nothing prevents someone from implementing both forms. |
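The header example from this comment did not survive, so the following is only a guess at the idea: a client reading the total size from a `Content-Range` response header that uses a custom `items` unit. Both the unit name and the helper `page_count` are assumptions, and nothing here implies that the bio.tools API supports this.

```python
import math
import re

# Matches headers of the form "<unit> <first>-<last>/<total>", e.g. "items 0-9/12345"
_CONTENT_RANGE = re.compile(r"^(\w+) (\d+)-(\d+)/(\d+)$")


def page_count(content_range):
    """Return (total_items, number_of_pages) for a Content-Range header value."""
    match = _CONTENT_RANGE.match(content_range.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {content_range!r}")
    first, last, total = (int(g) for g in match.groups()[1:])
    page_size = last - first + 1
    if page_size <= 0:
        raise ValueError("range end must not precede range start")
    # the page size is inferred from the first page that was returned
    return total, math.ceil(total / page_size)
```

With this, a client can size its page table after the first response without downloading the rest of the data.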
Regarding the versions... Technically in bio.tools we don't have a fine grained track of tool versions. In the new schema 3.0 which will go in this week if not today we allow multiple version annotations per tool. The reason for this is that if there is no difference in annotation for a set of tool versions, they all go in the same tool annotation. Thus I am not really sure if any of the folder structure is needed, at least not at the core of the tool descriptions. We can have the option of providing multiple copies of the same tool, separated by version, but that should be something that results from the initial structure, and not something that IS the initial structure. |
@hansioan
We have been arguing for quite a while about versioning and which solution is best for tracking that information.
I think keeping that in mind from the very beginning of this effort might save us from investing much more time and effort at a later stage.
I agree with you that when there are no changes among versions, it is easy to handle.
However, when there are major changes among versions of the same program, they should be modelled in the same entry rather than in an independent entry.
For instance, if I look for trimAl and then go to the bio.tools entry, I'd like to have access to the latest version. It is quite likely that I'm not aware there are two versions, and I would look for the generic tool name.
Cheers,
Salva
…On Mon, Dec 10, 2018 at 1:20 PM Hans Ienasescu ***@***.***> wrote:
If there IS a (significant) difference in tool functionality -> thus
annotation between different tool versions, then that tool, along with the
version will go into a separate tool entry, given its own tool id, with
separate annotation and so on...
|
Yes, but having version-specific information for each tool takes us back to two years ago, when tools were accessed like https://bio.tools/toolid/version. This approach basically created a new tool entry whenever a new version appeared, and in 90% of all cases there were no (zero) differences between the annotations, except for the version property. We had a very famous example of a tool that appeared over 10 times in bio.tools with the same annotation, because people were just going in and updating the version information whenever they released a new version (e.g. a new tool entry between tool versions 1.2.23 and 1.2.24).
There is no good way to do separate versions for each tool except modeling this in the API request, and even if there were, we would still have to store versioned tools in the database. While this can certainly apply to things like conda, containers and other projects that require exact versions, I don't think it applies as much to bio.tools. We must remember that 90% of our users just want to find a tool that meets their scientific requirements (focus on find).
All this being said, I am not opposed to having a good solution that can work for everyone; it is just complicated and not on our list of main tasks right now. We have opened the code, and once all the remaining plumbing tasks are done and we are ready to accept pull requests, perhaps this can be one of the initial tasks for contributors. |
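To make the duplication problem concrete, here is a small sketch that collapses entries differing only in their version into a single entry carrying a list of versions. The field names and the helper `merge_versions` are hypothetical and are not taken from biotoolsSchema or the bio.tools code base.

```python
import json


def merge_versions(entries):
    """Collapse entries that differ only in their 'version' field."""
    merged = {}
    for entry in entries:
        annotation = {k: v for k, v in entry.items() if k != "version"}
        # a canonical JSON dump serves as a hashable fingerprint of the annotation
        key = json.dumps(annotation, sort_keys=True)
        slot = merged.setdefault(key, {**annotation, "version": []})
        if "version" in entry:
            slot["version"].append(entry["version"])
    return list(merged.values())
```

Entries whose annotation really differs between versions keep separate records, which matches the case where a version would get its own tool entry.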
@redmitry JSON will be served from the web service. But this issue is about the data storage in GitHub; no one disagrees that JSON will be served from the web service, imho. @scapella @hansioan don't let the version discussion stop the initial drop, please. Versioning can be added at any time. Subfolders are easy to add for anyone if they care about the version or if the difference between the tools/benchmarks is too big. You simply need to adjust your bio.tools-github-parser to traverse all folders recursively. And people who don't need this can simply assume that the latest version is in the root dir. Really not a big deal. We can add this later if we need to. |
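A parser along those lines might look like the sketch below. It assumes one folder per tool with optional version subfolders and a `.json` suffix; the function name `find_tool_files` and the returned structure are illustrative assumptions, not an existing bio.tools-github-parser API.

```python
import tempfile
from pathlib import Path


def find_tool_files(repo_root, suffix=".json"):
    """Map each tool folder to its root-level files and its versioned files."""
    index = {}
    for tool_dir in sorted(Path(repo_root).iterdir()):
        # skip plain files and hidden folders such as .git
        if not tool_dir.is_dir() or tool_dir.name.startswith("."):
            continue
        files = sorted(tool_dir.rglob(f"*{suffix}"))  # recursive traversal
        index[tool_dir.name] = {
            # files directly in the tool folder are treated as the latest version
            "latest": [f for f in files if f.parent == tool_dir],
            # anything deeper is assumed to live in a version subfolder
            "versions": [f for f in files if f.parent != tool_dir],
        }
    return index
```

Consumers that do not care about versions can read only the `latest` list and ignore the rest.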
@joncison As far as I'm concerned YAML would probably be the best choice,
because:
- it is the easiest format to track changes with git,
- it is easier to manually edit for many people.
For the repository, I would go for either https://github.com/bio-tools/tools
or https://github.com/bio-tools/content - not resources.
|
Fine with me to keep the versioning stuff on the radar, but not to let it stop the dumping process.
Salva
|
Quick update - will be revisiting this in new year - but for now a few points:
Let's keep this issue for the data dump and use this for technical discussions about a GitHub-based content architecture. Pls. bear in mind the priority on the DK side is getting the deployment and open-dev process sorted, critical / high priority issues scheduled for the 2019 Q1 release, the website redesign, and other features with direct impact on end-users. The new content architecture under discussion would be awesome, but depends on other components including an independent curators interface e.g. based on edamToolAnnotator and independent validation mechanism, e.g. biotoolsLint. It's a lot of work, hence a matter of priorities. |
This issue was moved to research-software-ecosystem/content#2 |
Nightly dump of all content (in XML and JSON formats?) to GitHub, as a convenience (or at least to begin, just a one-off dump)