UB-Mannheim/paraly

Paraly: An (annotated) dataset for exploring the concept of paralysis (fr. ‘paralysie’) in a digital corpus of French Literature

Structure

paraly/
├── data/
│   ├── training/
│   │   ├── train_fasttext_dataset.txt
│   │   ├── test_fasttext_dataset.txt
│   │   └── dev_fasttext_dataset.txt
│   ├── model/
│   │   ├── training.log
│   │   ├── test.tsv
│   │   ├── loss.tsv
│   │   └── dev.tsv
│   ├── corpus/
│   │   ├── 20_paraly_metadata.tsv
│   │   ├── 20_paraly_data_TEI.xml/
│   │   ├── 20_paraly_corpus.cec6
│   │   ├── 19_paraly_metadata.tsv
│   │   ├── 19_paraly_data_TEI.xml/
│   │   ├── 19_paraly_corpus.cec6
│   │   ├── 18_paraly_metadata.tsv
│   │   ├── 18_paraly_data_TEI.xml/
│   │   └── 18_paraly_corpus.cec6
│   └── annotations/
│       ├── 20_paraly_annotations.xlsx
│       ├── 20_paraly_annotations.csv
│       ├── 19_paraly_annotations.xlsx
│       ├── 19_paraly_annotations.csv
│       ├── 18_paraly_annotations.xlsx
│       └── 18_paraly_annotations.csv
├── code/
│   ├── training/
│   │   ├── train_fc.py
│   │   └── README_training.md
│   ├── splitting/
│   │   ├── prepare_training_data.py
│   │   └── README_splitting.md
│   ├── merging/
│   │   ├── merge.ipynb
│   │   └── README_merging.md
│   ├── extraction/
│   │   ├── starten.bat
│   │   ├── skript.cecs
│   │   ├── query.txt
│   │   └── README_extraction.md
│   └── collection/
│       ├── get_metadata_for_corpus.ipynb
│       ├── get_metadata_for_all_books.ipynb
│       ├── get_OCRed_books_from_gallica.ipynb
│       └── comment_metadata_in_html_files.ipynb
├── README.md
└── LICENSE

Collection

The full digital collection, organized by period, is available at https://gallica.bnf.fr/html/und/litteratures/les-classiques-de-la-litterature-acces-par-periode?mode=desktop. Our focus is on the following collections:

The OCR-ed books and their metadata were downloaded using the scripts in ./code/collection/.

Annotation

The annotated data in ./data/annotations/ is labeled "c" (concrete), "f" (figurative), or "cf" (an "inter"-category between the two).
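As a rough illustration of how the three labels could be tallied, the sketch below counts label values in an annotations table. The column name `label`, the helper `count_labels`, and the sample rows are assumptions made for illustration only; the actual column layout of the files in ./data/annotations/ may differ.

```python
import csv
import io
from collections import Counter

# Labels described in this README: concrete, figurative, and the "inter"-category.
VALID_LABELS = {"c", "f", "cf"}


def count_labels(rows, label_column="label"):
    """Count annotation labels in an iterable of dict rows.

    `label_column` is a hypothetical column name; adjust it to the real header.
    Values outside VALID_LABELS are collected under "<other>" so they are not
    silently dropped.
    """
    counts = Counter()
    for row in rows:
        label = (row.get(label_column) or "").strip().lower()
        if label in VALID_LABELS:
            counts[label] += 1
        else:
            counts["<other>"] += 1
    return counts


# Made-up placeholder rows, not taken from the dataset.
sample_csv = io.StringIO(
    "text,label\n"
    "placeholder sentence 1,c\n"
    "placeholder sentence 2,f\n"
    "placeholder sentence 3,cf\n"
    "placeholder sentence 4,f\n"
    "placeholder sentence 5,\n"
)
sample_counts = count_labels(csv.DictReader(sample_csv))
print(dict(sample_counts))
```

To run it against a real file, pass `csv.DictReader(open(path, newline="", encoding="utf-8"))` instead of the in-memory sample; the delimiter and encoding of the CSV files are likewise assumptions to check.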

Model

The multilabel classifier paraly_camembert_large_multilabel was trained with the flair library using the script in ./code/training/ and is openly available on Hugging Face.
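The file names in ./data/training/ (for example train_fasttext_dataset.txt) suggest the fastText classification format, in which each line starts with one or more `__label__<name>` tokens followed by the text. Assuming that convention holds for these files, a minimal sketch for reading such lines could look like the following. The helper `parse_fasttext_line` and the example line are hypothetical and are not taken from the repository's scripts.

```python
def parse_fasttext_line(line, prefix="__label__"):
    """Split one fastText-style line into (labels, text).

    Assumes all label tokens come first and are whitespace-separated,
    e.g. "__label__c __label__f some text". A multilabel example simply
    carries more than one leading label token.
    """
    tokens = line.strip().split()
    labels = []
    i = 0
    # Consume leading tokens that carry the label prefix.
    while i < len(tokens) and tokens[i].startswith(prefix):
        labels.append(tokens[i][len(prefix):])
        i += 1
    text = " ".join(tokens[i:])
    return labels, text


# Made-up example line, not taken from the dataset.
example_labels, example_text = parse_fasttext_line(
    "__label__c __label__f placeholder sentence about paralysie\n"
)
print(example_labels, example_text)
```

A line without any label token yields an empty label list, which makes unlabeled or malformed lines easy to spot before training.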

App

The app for using the classifier is openly available via Hugging Face Spaces.

License

This work is licensed under the MIT license (code) and Creative Commons Attribution 4.0 International license (for everything else). You are free to share and adapt the material for any purpose, even commercially, as long as you provide attribution.
