Notes on the Korp developers’ meeting in Zoom on 2022-10-14

Jyrki Niemi edited this page Oct 21, 2022 · 2 revisions

Participants:

  • Språkbanken Text:
    • Martin Hammarstedt
    • Maria Öhrman
  • Kielipankki (The Language Bank of Finland):
    • Anni Järvenpää (CSC)
    • Martin Matthiesen (CSC)
    • Jyrki Niemi (University of Helsinki)

Mink, Sparv and Korp

Introduction to Mink

  • Martin H. and Maria presented Mink:
    • Users can upload their own corpus material to Mink in the formats recognized by Sparv (XML, raw text, MS Word).
    • A long-term goal is to make Mink also support lexica.
    • Mink has a separate frontend.
    • At this stage, Mink is tightly coupled with (the new) Sparv.
    • The new Sparv has no frontend; instead, Mink will be used.
    • The currently available Sparv frontend uses an older version of the Sparv pipeline.

Sparv and Kielipankki?

  • Martin M: Sparv could be something to adopt in Kielipankki, too, maybe to replace korp-make.
    • Jyrki: Kielipankki’s corpus pipeline is less well integrated.
    • Martin H: It would most likely be possible to have a Sparv plugin for the Kielipankki pipeline.
    • Martin M: How does Sparv handle cases in which the processing of a couple of files fails?
      • Jyrki: The failed files should be fixed and reprocessed.
      • Martin H: Sparv works like Make, so changed files will be reprocessed.
    • Martin M: We wish to be able to use our HPC environment and parallel processing.
    • Jyrki: We could try to adapt our tools to Sparv.
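
The Make-like behaviour that Martin H. mentions can be illustrated with a minimal Python sketch. It is not Sparv's actual implementation: the function names, the directory layout and the `.out` suffix are hypothetical. An output is rebuilt only when it is missing or older than its source, so a fixed file gets reprocessed while untouched files are skipped.

```python
from pathlib import Path
from typing import List


def needs_processing(source: Path, target: Path) -> bool:
    """Return True if target is missing or older than source (Make's rule)."""
    if not target.exists():
        return True
    return source.stat().st_mtime > target.stat().st_mtime


def stale_sources(src_dir: Path, out_dir: Path, suffix: str = ".out") -> List[str]:
    """List the names of source files whose outputs must be (re)built.

    Assumption: each source file "name" has its output at out_dir/"name" + suffix.
    """
    return sorted(
        src.name
        for src in src_dir.iterdir()
        if src.is_file() and needs_processing(src, out_dir / (src.name + suffix))
    )
```

With this rule, a file whose processing failed has no output and is therefore picked up again on the next run, as is any source file edited after its output was written.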

Access to Mink

  • Maria: At first, some people are invited to use Mink.
  • Martin H: At present, users can only have their corpora in a separate mode in Korp, requiring login. In the future, users could contact the admins and ask for their corpora to be included in the public Korp.
  • A separate instance of the Korp backend is set up for Mink.

Authentication in Mink and Korp

  • Maria: Eventually Shibboleth or similar will be used for authentication.
  • Martin M: We shouldn’t have competing implementations for Shibboleth support in Korp except for a good reason.
  • Jyrki: We can share Kielipankki’s solution, but I’ve been slow in that.
  • Maria: Språkbanken uses Shibboleth with JWTs, which need not be used with Shibboleth.
  • Martin M: The future of Shibboleth is uncertain because of the lack of API access, so Kielipankki is moving to OpenID Connect, which also uses JWTs.
  • Martin M. described the solution of Kielipankki, with authorization via the Language Bank Rights service.
  • Maria: Mink API uses tokens.
  • Maria: Korp should be able to use different login solutions.
  • Martin M: We should support OpenID Connect, which would also be a good thing from the CLARIN point of view.
  • Martin H. has split the authentication and authorization parts of the Korp backend into plugins.
  • It was agreed that we should collaborate more on authentication issues in the future.

The YAML corpus configuration format

  • Maria: Support for specialized features could be plugins, but otherwise the changes made for Kielipankki can be incorporated in the code.
  • Martin H. and Maria: Configuration keys should be more consistent, but there has been no time to make them so.
  • Martin M: Have the configurations been versioned?
    • Maria: No, but that should be done.
  • Maria: It would be better to have types for attributes, focusing on what kind of attribute it is rather than on what features it has; for example, an integer or an attribute with ranking.
    • This might be taken up in the spring.

The Korp plugin facilities implemented for Kielipankki

  • Maria: Plugins would be good, so that repository forks wouldn’t be needed for customizing Korp for different sites.
  • Jyrki: How to make the plugin facility general enough, for example, for Iceland?
    • Martin: CLARIN Technical Centre Standing Committee could be a place to spread information.
  • Martin M: If possible, avoid having separate Finnish and Swedish plugins for Shibboleth.
  • Martin M: Plugins may also encourage people to implement their own solutions even when not needed.

The slowness of Korp

Current situation

  • Martin M. described the situation in Kielipankki:
    • Using Korp in a teaching setting easily leads to performance problems.
    • Korp is currently running on a server with hard disks, but it will be moved to one with SSDs, which should make it faster.
    • The old CGI-Korp appeared faster than the current Korp using Gunicorn.
  • Maria and Martin H: Språkbanken’s Korp is also slow, even ridiculously slow in recent months.
  • Martin H: MySQL queries are very slow.
    • Martin M: The Kielipankki Korp uses a separate database server with SSDs, so it’s not so slow.
  • Martin H. has tried to investigate a problem in which the search progress bar on the Korp frontend goes on (and finishes) but no results appear.

Load balancing

  • Martin M: Have you thought of load-balancing the backend?
    • Martin H: A previous systems administrator played around with that, but it was buggy and nothing came out of that. But it’s something that should be explored more.
    • Martin M: It would be nice to be able to add extra backends automatically or even manually for the times of courses. As the Korp backend is virtually read-only, it should be easy.

Replacing Corpus Workbench

  • Martin M: What about getting rid of Corpus Workbench?
    • Martin H: That would be a major change, as Korp is a frontend for CWB.
    • Martin H: Some colleagues have been working on something that would be faster than CWB. It’s an alpha version and lacks many CWB features.

Parallelizing

  • Martin M: Why is it difficult to parallelize Korp/CWB?
    • Martin H: Statistics needs to summarize and build a huge data structure in Python, which may slow it down.
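
One way to picture the bottleneck Martin H. describes is the sketch below. It is not Korp's code: `count_tokens` is a hypothetical stand-in for a per-corpus CWB query, and the token lists are invented. The per-corpus counts can be computed concurrently, but they still have to be merged into a single structure in Python, and that merge runs serially.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List


def count_tokens(tokens: Iterable[str]) -> Counter:
    """Hypothetical stand-in for querying one corpus: count its tokens."""
    return Counter(tokens)


def combined_statistics(corpora: Dict[str, List[str]], workers: int = 4) -> Counter:
    """Count each corpus concurrently, then merge the partial results."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_tokens, corpora.values()):
            # The merge is serial: every partial result passes through here.
            total.update(partial)
    return total
```

If the merged structure is very large, the merge step can dominate, which would limit the gain from running the per-corpus queries in parallel.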

OpenShift

  • Martin H: For Mink, a new Korp backend will be set up using OpenShift, in the spring or even later.
    • If it works, the main Korp might also be moved to OpenShift.
    • Martin M: The CSC OpenShift (Rahti) instances have their own issues with GlusterFS.
      • It would allow scaling up and down, but that's currently not properly implemented.
      • Rahti will be updated to OpenShift 4, which should solve some issues.

Default searches in the frontend

  • Maria: Statistics could be disabled in the frontend by default, which would eliminate some backend processing. That should be easy to implement.
  • Maria: Fewer or smaller corpora could be preselected.
    • Jyrki: Kielipankki’s Korp now has no preselected corpora.
    • Martin H: Don’t users then choose “select all corpora”?
    • Martin M: It has helped anyway.
  • Martin H: The default search in the extended search searches for all words.

Sharing corpus data

  • Martin M: Kielipankki is sharing corpus data as downloadable VRT files, with a converter to JSON, as the latter is preferred by many users.
  • Martin H: Språkbanken is sharing corpus data as XML, which corresponds to VRT but with token attributes encoded as the attributes of an XML element (not TEI-compatible).
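
As an illustration of what a VRT-to-JSON conversion involves, here is a minimal Python sketch. It is not Kielipankki's converter. It assumes the usual VRT layout (one token per line with tab-separated positional attributes, and structural tags on lines of their own), that sentences are marked with `<sentence>` tags, and that the attribute names are supplied by the caller. Other structural tags are skipped, and character entities such as `&amp;` are left as they are.

```python
import json
from typing import Dict, List, Sequence

Sentence = List[Dict[str, str]]


def vrt_to_sentences(vrt: str, attr_names: Sequence[str]) -> List[Sentence]:
    """Convert VRT text into a list of sentences, each a list of token dicts."""
    sentences: List[Sentence] = []
    current = None  # tokens of the sentence being read, or None outside one
    for line in vrt.splitlines():
        if not line.strip():
            continue
        if line.startswith("<sentence"):
            current = []
        elif line.startswith("</sentence"):
            if current is not None:
                sentences.append(current)
            current = None
        elif line.startswith("<"):
            continue  # other structural tags and comments are ignored here
        elif current is not None:
            # Token line: pair each tab-separated value with its attribute name.
            current.append(dict(zip(attr_names, line.split("\t"))))
    return sentences


def vrt_to_json(vrt: str, attr_names: Sequence[str]) -> str:
    """Serialize the sentences as JSON, keeping non-ASCII characters readable."""
    return json.dumps(vrt_to_sentences(vrt, attr_names), ensure_ascii=False)
```

For example, calling `vrt_to_json(text, ["word", "lemma", "pos"])` on a VRT file with three positional attributes yields one JSON array per sentence, with one object per token.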

Testing Korp

  • Martin M: Automated testing for Korp would be great.
  • Martin H: It hasn't happened yet; there never seems to be enough time.
  • Martin M: We could probably contribute to this; maybe Anni could do something.

General collaboration practices

  • Martin M: With plugins, collaboration should be easier: less need to wait for each other.
  • Jyrki: GitHub pull requests have worked well from my point of view.
  • Maria: It's good to use GitHub facilities as much as possible, as information can be found there more easily than in emails.
  • Martin M: A meeting like this is also very useful.
  • It was agreed that the next meeting will be organized in January 2023; Maria promised to arrange it.