Notes on the Korp developers’ meeting in Zoom on 2022-10-14

Participants:

Språkbanken Text:
- Martin Hammarstedt
- Maria Öhrman
Kielipankki (The Language Bank of Finland):
- Anni Järvenpää (CSC)
- Martin Matthiesen (CSC)
- Jyrki Niemi (University of Helsinki)

Mink, Sparv and Korp

Introduction to Mink

Martin H. and Maria presented Mink:
- Users can upload their own corpus material to Mink in the formats recognized by Sparv (XML, raw text, MS Word).
- A long-term goal is to make Mink also support lexica.
- Mink has a separate frontend.
- At this stage, Mink is tightly coupled with (the new) Sparv.
- The new Sparv has no frontend; instead, Mink will be used.
- The currently available Sparv frontend uses an older version of the Sparv pipeline.

Sparv and Kielipankki?

Martin M: Sparv could be something to adopt in Kielipankki, too, maybe to replace korp-make.
- Jyrki: Kielipankki’s corpus pipeline is less well integrated.
- Martin H: It would most likely be possible to have a Sparv plugin for the Kielipankki pipeline.
- Martin M: How does Sparv handle cases in which the processing of a couple of files fails?
  - Jyrki: The failed files should be fixed and reprocessed.
  - Martin H: Sparv works like Make, so changed files will be reprocessed.
- Martin M: We wish to be able to use our HPC environment and parallel processing.
- Jyrki: We could try to adapt our tools to Sparv.
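
The Make-like behaviour that Martin H. mentions can be illustrated with a minimal sketch. This is not Sparv's actual implementation: the function names `needs_rebuild` and `process_changed` and the `.out` suffix are hypothetical, and the sketch only compares file modification times in the way Make does. A source file is processed again if its output is missing or older than the source; a file whose processing fails leaves no output behind, so it is retried on the next run once it has been fixed.

```python
from pathlib import Path
from typing import Callable, List


def needs_rebuild(source: Path, target: Path) -> bool:
    """Make-style check: rebuild if the target is missing or older than the source."""
    if not target.exists():
        return True
    return source.stat().st_mtime > target.stat().st_mtime


def process_changed(sources: List[Path], out_dir: Path,
                    process: Callable[[Path, Path], None]) -> List[Path]:
    """Run `process` only for stale sources; return the sources that succeeded."""
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for source in sources:
        target = out_dir / (source.name + ".out")  # hypothetical naming scheme
        if not needs_rebuild(source, target):
            continue  # output is up to date, skip this file
        try:
            process(source, target)
        except Exception:
            # Remove any partial output so that the file is retried on the next run.
            target.unlink(missing_ok=True)
            continue
        processed.append(source)
    return processed
```

Because each file is checked independently, a couple of failing files do not prevent the others from being processed, and only the fixed files need to be run again.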

Access to Mink

Maria: At first, some people will be invited to use Mink.
Martin H: At present, users can only have their corpora in a separate mode in Korp, requiring login. In the future, users could contact the admins and ask to have their corpora included in the public Korp.
A separate instance of the Korp backend is set up for Mink.

Authentication in Mink and Korp

Maria: Eventually Shibboleth or similar will be used for authentication.
Martin M: We shouldn’t have competing implementations for Shibboleth support in Korp except for a good reason.
Jyrki: We can share Kielipankki’s solution, but I’ve been slow in that.
Maria: Språkbanken uses Shibboleth with JWTs, but JWTs need not be tied to Shibboleth.
Martin M: The future of Shibboleth is uncertain because it lacks API access, so Kielipankki is moving to OpenID Connect, which also uses JWTs.
Martin M. described the solution of Kielipankki, with authorization via the Language Bank Rights service.
Maria: Mink API uses tokens.
Maria: Korp should be able to use different login solutions.
Martin M: We should support OpenID Connect, which would also be a good thing from the CLARIN point of view.
Martin H. has split the authentication and authorization parts of the Korp backend into plugins.
It was agreed that we should collaborate more on authentication issues in the future.

The YAML corpus configuration format

Maria: Support for specialized features could be plugins, but otherwise the changes made for Kielipankki can be incorporated in the code.
Martin H. and Maria: Configuration keys should be more consistent, but there has been no time to make them so.
Martin M: Have the configurations been versioned?
- Maria: No, but that should be done.
Maria: It would be better to have types for attributes, focusing on what kind of attribute it is rather than on what features it has; for example, an integer or an attribute with ranking.
- This might be taken up in the spring.

The Korp plugin facilities implemented for Kielipankki

Maria: Plugins would be good, so that repository forks wouldn’t be needed for customizing Korp for different sites.
Jyrki: How can the plugin facility be made general enough, for example, for Iceland?
- Martin: CLARIN Technical Centre Standing Committee could be a place to spread information.
Martin M: If possible, avoid having separate Finnish and Swedish plugins for Shibboleth.
Martin M: Plugins may also encourage people to implement their own solutions even when not needed.
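
As a rough illustration of how plugins can replace repository forks, the following sketch shows a hook-based plugin registry. It is not the plugin facility implemented for Kielipankki: the class `PluginRegistry`, its methods, the hook name `filter_result` and the example plugin are all made up for this example. The idea is only that the core code calls named hook points and site-specific code registers functions for them.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class PluginRegistry:
    """Maps hook point names to the plugin functions registered for them."""

    def __init__(self) -> None:
        self._hooks: DefaultDict[str, List[Callable[[Any], Any]]] = defaultdict(list)

    def register(self, hook_name: str) -> Callable:
        """Decorator that registers a function for the given hook point."""
        def decorator(func: Callable[[Any], Any]) -> Callable[[Any], Any]:
            self._hooks[hook_name].append(func)
            return func
        return decorator

    def call_chain(self, hook_name: str, value: Any) -> Any:
        """Pass the value through every function registered for the hook, in order."""
        for func in self._hooks.get(hook_name, []):
            value = func(value)
        return value


registry = PluginRegistry()


# A site-specific plugin (hypothetical): adds a field to each query result.
@registry.register("filter_result")
def add_site_info(result: dict) -> dict:
    return {**result, "site": "example"}
```

The core code would call `registry.call_chain("filter_result", result)` before returning a result; a site without plugins gets the result back unchanged.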

The slowness of Korp

Current situation

Martin M. described the situation in Kielipankki:
- Using Korp in a teaching setting easily leads to performance problems.
- Korp is currently running on a server with hard disks, but it will be moved to one with SSDs, which should make it faster.
- The old CGI-Korp appeared faster than the current Korp using Gunicorn.
Maria and Martin H: Språkbanken’s Korp is also slow, even ridiculously slow in recent months.
Martin H: MySQL queries are very slow.
- Martin M: The Kielipankki Korp uses a separate database server with SSDs, so it’s not so slow.
Martin H. has tried to investigate a problem in which the search progress bar on the Korp frontend goes on (and finishes) but no results appear.

Load balancing

Martin M: Have you thought of load-balancing the backend?
- Martin H: A previous systems administrator played around with that, but it was buggy and nothing came of it. But it’s something that should be explored more.
- Martin M: It would be nice to be able to add extra backends automatically or even manually for the times of courses. As the Korp backend is virtually read-only, it should be easy.
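
Because a read-only backend can answer any query from any instance, the simplest form of load balancing is round-robin selection over a list of backends that can be extended during courses. The sketch below only illustrates that idea; the class name and the backend URLs are hypothetical, and in practice the selection would more likely be done by a reverse proxy in front of the backends than in Python code.

```python
import threading
from typing import List


class RoundRobinBackends:
    """Hands out backend URLs in turn; backends can be added or removed at runtime."""

    def __init__(self, urls: List[str]) -> None:
        if not urls:
            raise ValueError("at least one backend is required")
        self._urls = list(urls)
        self._index = 0
        self._lock = threading.Lock()

    def add(self, url: str) -> None:
        """Add an extra backend, for example for the duration of a course."""
        with self._lock:
            if url not in self._urls:
                self._urls.append(url)

    def remove(self, url: str) -> None:
        """Remove a backend, but never the last remaining one."""
        with self._lock:
            if url in self._urls and len(self._urls) > 1:
                self._urls.remove(url)

    def next(self) -> str:
        """Return the backend that should handle the next request."""
        with self._lock:
            url = self._urls[self._index % len(self._urls)]
            self._index += 1
            return url
```

For example, `RoundRobinBackends(["http://backend1.example", "http://backend2.example"])` alternates between the two URLs on successive calls to `next()`.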

Replacing Corpus Workbench

Martin M: What about getting rid of Corpus Workbench?
- Martin H: That would be a major change, as Korp is a frontend for CWB.
- Martin H: Some colleagues have been working on something that would be faster than CWB. It’s an alpha version and still lacks many CWB features.

Parallelizing

Martin M: Why is it difficult to parallelize Korp/CWB?
- Martin H: Statistics needs to summarize and build a huge data structure in Python, which may slow it down.
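
The following sketch illustrates the pattern described above under simplifying assumptions: counts are computed per corpus in parallel and then merged into one structure. It is not the actual Korp code. The function `count_corpus` and the made-up data in `CORPORA` stand in for a query to CWB, and threads are used on the assumption that the real work would happen outside the Python process.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Made-up token lists standing in for per-corpus query results.
CORPORA: Dict[str, List[str]] = {
    "corpus_a": ["katt", "hund", "katt"],
    "corpus_b": ["hund", "hund", "fisk"],
}


def count_corpus(corpus_id: str) -> Counter:
    """Stand-in for a per-corpus frequency query."""
    return Counter(CORPORA[corpus_id])


def combined_statistics(corpus_ids: List[str], max_workers: int = 4) -> Counter:
    """Count each corpus in parallel, then merge the partial counts."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the partial results in the order of corpus_ids.
        for partial in pool.map(count_corpus, corpus_ids):
            total.update(partial)  # the merge itself is still sequential
    return total
```

Note that only the per-corpus counting runs in parallel here; merging into one large structure remains a single-threaded step, which is in line with the remark that building that structure in Python may be what slows statistics down.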

OpenShift

Martin H: For Mink, a new Korp backend will be set up using OpenShift, in the spring or even later.
- If it works, the main Korp might also be moved to OpenShift.
- Martin M: The CSC OpenShift (Rahti) instances have their own issues with GlusterFS.
  - It would allow scaling up and down, but that's currently not properly implemented.
  - Rahti will be updated to OpenShift 4, which should solve some issues.

Default searches in the frontend

Maria: Statistics could be disabled in the frontend by default, which would eliminate some backend processing. That should be easy to implement.
Maria: Fewer or smaller corpora could be preselected.
- Jyrki: Kielipankki’s Korp now has no preselected corpora.
- Martin H: Don’t users then choose “select all corpora”?
- Martin M: It has helped anyway.
Martin H: The default search in the extended search searches for all words.

Sharing corpus data

Martin M: Kielipankki is sharing corpus data as downloadable VRT files, with a converter to JSON, as the latter is preferred by many users.
Martin H: Språkbanken is sharing corpus data as XML, which corresponds to VRT but with token attributes encoded as the attributes of an XML element (not TEI-compatible).
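
As a rough illustration of a VRT-to-JSON conversion, the sketch below reads VRT text, in which each token is on its own line with tab-separated attributes and structures are marked with XML-like tags, and produces a list of sentences that can be serialized as JSON. It is not Kielipankki's converter: the function name, the output layout and the attribute names passed in `token_attrs` are hypothetical, and the sketch assumes that sentences are marked with `<sentence>` tags and that special characters in tokens are encoded as XML entities. Other structural tags are ignored for brevity.

```python
import html
import json
import re
from typing import Dict, List

# Matches name="value" pairs in a structural start tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def vrt_to_sentences(vrt: str, token_attrs: List[str]) -> List[Dict]:
    """Convert VRT text to a list of sentences with attributes and tokens."""
    sentences: List[Dict] = []
    current = None
    for line in vrt.splitlines():
        if not line.strip():
            continue
        if line.startswith("<sentence"):
            attrs = {name: html.unescape(value) for name, value in ATTR_RE.findall(line)}
            current = {"attrs": attrs, "tokens": []}
        elif line.startswith("</sentence"):
            if current is not None:
                sentences.append(current)
            current = None
        elif line.startswith("<"):
            continue  # other structural tags are ignored in this sketch
        elif current is not None:
            values = [html.unescape(value) for value in line.split("\t")]
            current["tokens"].append(dict(zip(token_attrs, values)))
    return sentences


def vrt_to_json(vrt: str, token_attrs: List[str]) -> str:
    """Serialize the converted sentences as JSON, keeping non-ASCII characters."""
    return json.dumps(vrt_to_sentences(vrt, token_attrs), ensure_ascii=False, indent=2)
```

Calling `vrt_to_json(vrt_text, ["word", "lemma", "pos"])` would name the three tab-separated columns of each token line accordingly; the column names depend on the corpus.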

Testing Korp

Martin M: Automated testing for Korp would be great.
Martin H: It hasn't happened yet; there never seems to be enough time.
Martin M: We could probably contribute to this; maybe Anni could do something.

General collaboration practices

Martin M: With plugins, collaboration should be easier: less need to wait for each other.
Jyrki: GitHub pull requests have worked well from my point of view.
Maria: It's good to use GitHub facilities as much as possible, as information can be found there more easily than in emails.
Martin M: A meeting like this is also very useful.
It was agreed that the next meeting will be organized in January 2023; Maria promised to arrange it.
