This repo is for sharing and organizing our material for the competition.
- 4 July 2024, 16:15: kick-off event (digital) DONE WE WERE THERE (see minutes from the kick-off meeting, 04.07.2024)
- July 2024, by 23:59: registration DONE
- October 2024, 16:15: interim report event (digital)
- November 2024, by 23:59: submission of results
- January 2025, 16:15: closing event and announcement of the winner (digital)
- March 2025: award ceremony at the DHd 2025 conference in Bielefeld
- PDF in DE or EN, 200-300 words, "Begründung und Kontextualisierung der Fragestellung" (justification and contextualization of the research question) ASAP
- Jupyter Notebook titled "_code.ipynb" (restricted to Python, R, Java) IN NOVEMBER
- PDF titled "_text.pdf" in DE or EN IN NOVEMBER
- A creative humanities research question that justifies and requires the analysis of big data.
- A critical and multifaceted analysis of the dataset.
- Use of creative and, as far as possible, generalizable data science methods to identify and remedy deficiencies in the dataset.
- For approach A: a solid argument for the scholarly answer to the posed question.
- For approach B: development of an efficient plan for the newspapers' digitization process that complies with the FAIR principles.
- German-language newspaper articles: extracted text + OCR (optional), incl. metadata, from 1914 to 1945, as pickled pandas DataFrames (see the loading sketch below)
- https://hessenbox.uni-marburg.de/getlink/fi5WMibFaZX2ueh4xBvqwM/Datensatz (159 GB)
- There is additional image data of up to 2 TB
All public information.
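
A minimal loading sketch, assuming the download unpacks into one pickle file per newspaper and year; the `data/` directory and the file naming are placeholders, not the actual layout of the dump:

```python
from pathlib import Path

import pandas as pd

# Hypothetical layout: one pickle per newspaper/year,
# e.g. data/reichsanzeiger_1914.pkl -- adjust to the real dump.
DATA_DIR = Path("data")

frames = []
for pkl in sorted(DATA_DIR.glob("*.pkl")):
    df = pd.read_pickle(pkl)        # each file is assumed to hold one DataFrame
    df["source_file"] = pkl.name    # keep provenance for faceting later
    frames.append(df)

articles = pd.concat(frames, ignore_index=True)
print(articles.shape)
```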
- We receive the raw data right after its OCR pre-processing; that is what we will work on
- They recommend not using the API because the data there is still changing; instead, they ask us to download the data from the link above
- If we do not have enough disk space, we can work with a fraction of the provided data instead of the full dataset (see the sampling sketch after this list)
- We must not enrich the data with other data sources, because of licensing etc.
- Everyone: think about interesting concepts that could be investigated, for the next meeting
- Everything goes into the GitHub repo
- We have to deliver two things: (1) a research question and (2) approach A or approach B
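
If disk space is tight, a reproducible sample could be drawn once and reused across experiments; this assumes the `articles` DataFrame from the loading sketch above, and the 5% fraction is an arbitrary placeholder:

```python
# Assumes the `articles` DataFrame from the loading sketch above.
# A fixed random_state makes the sample reproducible across runs.
sample = articles.sample(frac=0.05, random_state=42)
sample.to_pickle("articles_sample.pkl")  # reuse the same sample everywhere
```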
- Due to the OCR there is a lot of noise in the data, e.g. useless pieces and fragments of characters; a sliding-window approach will therefore be used to throw away gibberish (DT) (see the first sketch after this list)
- Due to the huge amount of data, we will work on a subset for further investigation: the data will be faceted, and a selection of journals and years will be derived from that (BS)
- Using LLMs or other useful methods, the articles are refined to obtain the following variables: journal outlet (extracted), publication year (extracted), title (generated), summary of article (generated), keyword/concept (generated) (MM)
- Using LLMs or other useful methods, the data is enriched to obtain the following variable: keyword/concept (generated) (LDK)
- After the data pre-processing, embedding experiments and then clustering experiments will be conducted (all) (see the second sketch after this list)
- In the next meeting we will see whether we can specify the visualization
- Larry J. Griffin and Robert R. Korstad (1998): Historical Inference and Event-Structure Analysis, https://www.jstor.org/stable/26405517
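
First sketch: a simplified, non-overlapping variant of the sliding-window gibberish filter; the window size and the letters-and-spaces threshold are assumptions to be tuned on the subset, and DT's actual implementation may differ:

```python
import re

def strip_gibberish(text: str, window: int = 40, min_alpha: float = 0.6) -> str:
    """Drop text windows whose share of letters and spaces is too low.

    Windows dominated by stray symbols (typical OCR debris) are discarded;
    the surviving windows are rejoined and whitespace is normalized.
    """
    kept = []
    for start in range(0, len(text), window):
        chunk = text[start:start + window]
        alpha = sum(ch.isalpha() or ch.isspace() for ch in chunk)
        if chunk and alpha / len(chunk) >= min_alpha:
            kept.append(chunk)
    return re.sub(r"\s+", " ", "".join(kept)).strip()

print(strip_gibberish("Die Zeitung erschien täglich am Morgen. %&)(#;~^°*+~~]|[<>§$"))
```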
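Second sketch: a minimal embedding-plus-clustering run, assuming `sentence-transformers` and scikit-learn; the model checkpoint, the toy inputs, and the cluster count are all placeholders:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Toy input; in practice this would be the LLM-generated summary per article.
summaries = [
    "Bericht über die Mobilmachung im August 1914.",
    "Kommentar zur Inflation und zu den Lebensmittelpreisen 1923.",
    "Meldung über die Eröffnung einer neuen Eisenbahnlinie.",
    "Leitartikel zur Reichstagswahl und den Parteien.",
]

# Model choice is an assumption; any German-capable embedding model works.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(summaries)

# k=2 only fits this toy example; the real k has to be tuned on the subset.
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(embeddings)
print(list(zip(labels, summaries)))
```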
- The sliding-window filtering and "beautification" of the data work. The code runs slowly, so this will be checked again (DT -> BS)
- The data subset for initial tests is Süddeutsche, Kölnischer and Reichsanzeiger.
- Pipeline: cleaning > reduction via LLM-generated summaries and concepts > embedding
- For generating the summaries and concepts, a modified chain-of-density prompt seems useful; the output will be checked again on the selected subset (LDK) (see the prompt sketch after this list)
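
A loose sketch of what the modified chain-of-density prompting could look like; the prompt wording, the `gpt-4o-mini` model name, and the OpenAI client are placeholder assumptions, not the team's settled choice:

```python
from openai import OpenAI  # placeholder client; any LLM backend works similarly

# Hypothetical condensation prompt, loosely adapted from chain-of-density.
PROMPT = """Article:
{article}

Write a 2-3 sentence summary. Then densify it twice, each time folding in
1-2 missing salient entities without increasing the length. Return only the
final, densest version plus 3-5 keywords."""

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(article_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in the team's choice
        messages=[{"role": "user", "content": PROMPT.format(article=article_text)}],
    )
    return response.choices[0].message.content
```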
- Data cleansing starts with the transformation from pandas pickles to JSON (see the sketch after this list)
- A sliding-window approach clears out random characters
- Rewriting the "old German" text into "modern German" is being tested as part of the data pre-processing step
- The cleansed and pre-processed data will be enriched with metadata and short summaries of the first page of each newspaper
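
A minimal sketch of the pickle-to-JSON step, reusing the placeholder directory layout from the loading sketch; writing JSON Lines (one object per article) keeps large files streamable:

```python
from pathlib import Path

import pandas as pd

SRC, DST = Path("data"), Path("json")  # placeholder directories
DST.mkdir(exist_ok=True)

for pkl in SRC.glob("*.pkl"):
    df = pd.read_pickle(pkl)
    # orient="records" + lines=True emits one JSON object per row (article);
    # force_ascii=False preserves German umlauts as-is.
    df.to_json(DST / f"{pkl.stem}.jsonl", orient="records", lines=True,
               force_ascii=False)
```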