Notes on the Korp developers’ meeting in Zoom on 2022-10-14
Jyrki Niemi edited this page Oct 21, 2022 · 2 revisions
Participants:
- Språkbanken Text:
- Martin Hammarstedt
- Maria Öhrman
- Kielipankki (The Language Bank of Finland):
- Anni Järvenpää (CSC)
- Martin Matthiesen (CSC)
- Jyrki Niemi (University of Helsinki)
- Martin H. and Maria presented Mink:
- Users can upload their own corpus material to Mink in the formats recognized by Sparv (XML, raw text, MS Word).
- A long-term goal is to make Mink also support lexica.
- Mink has a separate frontend.
- At this stage, Mink is tightly coupled with (the new) Sparv.
- The new Sparv has no frontend; instead, Mink will be used.
- The currently available Sparv frontend uses an older version of the Sparv pipeline.
- Martin M: Sparv could be something to adopt in Kielipankki, too, maybe to replace korp-make.
- Jyrki: Kielipankki’s corpus pipeline is less well integrated.
- Martin H: It would most likely be possible to have a Sparv plugin for the Kielipankki pipeline.
- Martin M: How does Sparv handle cases in which the processing of a couple of files fails?
- Jyrki: The failed files should be fixed and reprocessed.
- Martin H: Sparv works like Make, so changed files will be reprocessed.
- Martin M: We wish to be able to use our HPC environment and parallel processing.
- Jyrki: We could try to adapt our tools to Sparv.
- Maria: At first, some people will be invited to use Mink.
- Martin H: At present, users can only have their corpora in a separate mode in Korp, requiring login. In the future, users could contact the admins and ask to have their corpora included in the public Korp.
- A separate instance of the Korp backend is set up for Mink.
- Maria: Eventually Shibboleth or similar will be used for authentication.
- Martin M: We shouldn’t have competing implementations for Shibboleth support in Korp except for a good reason.
- Jyrki: We can share Kielipankki’s solution, but I’ve been slow in that.
- Maria: Språkbanken uses Shibboleth with JWTs, which need not be used with Shibboleth.
- Martin M.: The future of Shibboleth is uncertain because of the lack of API access, so Kielipankki is moving to OpenID Connect, which also uses JWTs.
- Martin M. described the solution of Kielipankki, with authorization via the Language Bank Rights service.
- Maria: Mink API uses tokens.
- Maria: Korp should be able to use different login solutions.
- Martin M: We should support OpenID Connect, which would also be a good thing from the CLARIN point of view.
- Martin H. has split the authentication and authorization parts of the Korp backend into plugins.
- It was agreed that we should collaborate more on authentication issues in the future.
- Maria: Support for specialized features could be plugins, but otherwise the changes made for Kielipankki can be incorporated in the code.
- Martin H. and Maria: Configuration keys should be more consistent, but there has been no time to make them so.
- Martin M: Have the configurations been versioned?
- Maria: No, but that should be done.
- Maria: It would be better to have types for attributes: more focused on the kind that an attribute is, instead of what features it has; for example, an integer or an attribute with ranking.
- This might be taken up in the spring.
- Maria: Plugins would be good, so that repository forks wouldn’t be needed for customizing Korp for different sites.
- Jyrki: How to make the plugin facility general enough, say, for Iceland?
- Martin: CLARIN Technical Centre Standing Committee could be a place to spread information.
- Martin M: If possible, avoid having separate Finnish and Swedish plugins for Shibboleth.
- Martin M: Plugins may also encourage people to implement their own solutions even when not needed.
- Martin M. described the situation in Kielipankki:
- Using Korp in a teaching setting easily leads to performance problems.
- Korp is currently running on a server with hard disks, but it will be moved to one with SSDs, which should make it faster.
- The old CGI-Korp appeared faster than the current Korp using Gunicorn.
- Maria and Martin H: Språkbanken’s Korp is also slow, even ridiculously slow in recent months.
- Martin H: MySQL queries are very slow.
- Martin M: The Kielipankki Korp uses a separate database server with SSDs, so it’s not so slow.
- Martin H. has tried to investigate a problem in which the search progress bar on the Korp frontend goes on (and finishes) but no results appear.
- Martin M: Have you thought of load-balancing the backend?
- Martin H: A previous systems administrator played around with that, but it was buggy and nothing came out of that. But it’s something that should be explored more.
- Martin M: It would be nice to be able to add extra backends automatically or even manually for the times of courses. As the Korp backend is virtually read-only, it should be easy.
- Martin M: What about getting rid of Corpus Workbench?
- Martin H: That would be a major change, as Korp is a frontend for CWB.
- Martin H: Some colleagues have been working on something that would be faster than CWB. It’s an alpha version and lacks many CWB features.
- Martin M: Why is it difficult to parallelize Korp/CWB?
- Martin H: Statistics needs to summarize and build a huge data structure in Python, which may slow it down.
- Martin H: For Mink, a new Korp backend will be set up using OpenShift, in the spring or even later.
- If it works, the main Korp might also be moved to OpenShift.
- Martin M: The CSC OpenShift (Rahti) instances have their own issues with GlusterFS.
- OpenShift would allow scaling up and down, but that's currently not properly implemented.
- Rahti will be updated to OpenShift 4, which should solve some issues.
- Maria: Statistics could be disabled in the frontend by default, which would eliminate some backend processing. That should be easy to implement.
- Maria: Fewer or smaller corpora could be preselected.
- Jyrki: Kielipankki’s Korp now has no preselected corpora.
- Martin H: Don’t users then choose “select all corpora”?
- Martin M: It has helped anyway.
- Martin H: The default search in the extended search searches for all words.
- Martin M: Kielipankki is sharing corpus data as downloadable VRT files, with a converter to JSON, as the latter is preferred by many users.
- Martin H: Språkbanken is sharing corpus data as XML, which corresponds to VRT but with token attributes encoded as the attributes of an XML element (not TEI-compatible).
- Martin M: Automated testing for Korp would be great.
- Martin H: It hasn't happened yet; there never seems to be enough time.
- Martin M: We could probably contribute to this; maybe Anni could do something.
- Martin M: With plugins, collaboration should be easier: less need to wait for each other.
- Jyrki: GitHub pull requests have worked well from my point of view.
- Maria: It's good to use GitHub facilities as much as possible, as information can be found there more easily than in emails.
- Martin M: A meeting like this is also very useful.
- It was agreed that the next meeting will be organized in January 2023; Maria promised to arrange it.