DALLMAYER_ORGANIZATION


This repo is just for sharing and organizing stuff.

📝 Table of Contents

🧐 General Information

Timeline

    1 July 2024, 4:15 p.m.: Kick-off event (digital) DONE (we were there; see minutes from the kick-off meeting of 04.07.2024)
    1 July 2024, until 11:59 p.m.: Registration DONE
    1 October 2024, 4:15 p.m.: Interim report event (digital)
    1 November 2024, until 11:59 p.m.: Submission of results
    1 January 2025, 4:15 p.m.: Closing event and announcement of the winner (digital)
    1 March 2025: Award ceremony at the DHd-Tagung 2025 in Bielefeld

Expected deliverables

  • PDF in DE or EN, 200-300 words, "Begründung und Kontextualisierung der Fragestellung" (justification and contextualization of the research question) ASAP
  • Jupyter notebook titled "_code.ipynb" (restricted to Python, R, Java) IN NOVEMBER
  • PDF titled "_text.pdf" in DE or EN IN NOVEMBER

Evaluation criteria

  • A creative humanities research question that justifies and requires the analysis of big data.
  • A critical and multifaceted analysis of the dataset.
  • Use of creative and, as far as possible, generalizable data-science methods to identify and remedy deficiencies in the dataset.
  • For approach A: a solid argument for the discipline-specific answer to the question posed.
  • For approach B: development of an efficient plan for digitizing the newspapers that complies with the FAIR principles.

Further information

Contacts

Data

🧐 Meeting Minutes

All public information.

Minutes from kick-off meeting 04.07.2024

Data
  • We will receive the raw data right after OCR pre-processing; this is the data we are to work on
  • They recommend not using the API because the data there is still changing; instead, they ask us to download the data from the given link
  • If we do not have enough disk space, we can work with a fraction of the provided data instead of the full dataset
  • We must not enrich the data with other data sources, for licensing reasons among others
IP

Minutes from team kick-off meeting 16.07.2024

Research question
  • Everyone will think about interesting concepts that could be investigated, to discuss at the next meeting
Organisation
  • Everything goes into the GitHub repo
Expected Deliverables
  • We have to deliver two things: (1) the research question (Forschungsfrage/Fragestellung) and (2) approach A or approach B

Minutes from meeting 26.07.2024 - specification research question

Data Cleansing
  • Due to the OCR there is a lot of noise in the data, such as useless character fragments. A sliding-window approach will therefore be used to discard everything that is gibberish (DT)
  • Due to the huge amount of data, we will select a subset for further investigation. The data will be faceted, and a selection of journals and years will be derived from that (BS)
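The sliding-window cleansing idea could be sketched roughly as follows. This is a minimal illustration, not the project's actual code; the window size, the alphabetic-ratio threshold, and the function names are all assumptions chosen for the example:

```python
def is_wordlike(token: str, min_alpha_ratio: float = 0.7) -> bool:
    """A token counts as a word if most of its characters are letters."""
    if not token:
        return False
    alpha = sum(c.isalpha() for c in token)
    return alpha / len(token) >= min_alpha_ratio

def clean_ocr_tokens(tokens, window=2, min_ratio=0.5):
    """Drop OCR junk: non-wordlike tokens, plus wordlike tokens stranded
    in a sliding-window neighborhood that is mostly junk."""
    kept = []
    for i, tok in enumerate(tokens):
        if not is_wordlike(tok):
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:hi]
        if sum(is_wordlike(t) for t in context) / len(context) >= min_ratio:
            kept.append(tok)
    return kept
```

For example, `clean_ocr_tokens("Die Zeitung berichtet %$# !!§ heute über den Handel".split())` drops the two character fragments and keeps the words, while a lone word surrounded entirely by fragments is discarded along with them.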
Data Pre-Processing
  • Using LLMs or other useful methods, the articles are refined to obtain the following variables: journal outlet (extracted), publication year (extracted), title (generated), summary of the article (generated), keyword/concept (generated) (MM)
Data Enrichment and Embedding Experiments
  • Using LLMs or other useful methods, the data is enriched to obtain the following variable: keyword/concept (generated) (LDK)
  • Following the data pre-processing, embedding experiments and subsequent clustering experiments will be conducted (all)
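The embedding-then-clustering step could be prototyped along these lines. This is a toy sketch: character-trigram counts stand in for a real embedding model, and a greedy cosine-similarity pass stands in for whatever clustering algorithm the experiments settle on; none of these choices come from the minutes:

```python
from collections import Counter
from math import sqrt

def embed(text: str, n: int = 3) -> Counter:
    """Toy 'embedding': character n-gram counts (stand-in for a real model)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def greedy_cluster(texts, threshold=0.3):
    """Assign each text to the first cluster whose representative (its
    first member) is similar enough; otherwise open a new cluster."""
    reps, labels = [], []
    for vec in (embed(t) for t in texts):
        for ci, rep in enumerate(reps):
            if cosine(vec, rep) >= threshold:
                labels.append(ci)
                break
        else:
            reps.append(vec)
            labels.append(len(reps) - 1)
    return labels
```

Swapping `embed` for a real sentence-embedding model and `greedy_cluster` for, say, k-means would keep the same overall shape: texts in, cluster labels out.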
Data Visualization and testing against research questions
  • In the next meeting we will see whether we can specify the visualization
Sidenote

Minutes from meeting 08.08.2024 - status data preparation pipeline

Data Cleansing
  • The sliding-window approach and "beautification" of the data work. The code runs slowly, so this will be looked at again (DT -> BS)
  • The data subset for initial tests consists of Süddeutsche, Kölnischer and Reichsanzeiger
Data Pre-Processing
  • The pipeline is: cleaning > reduction through summary and concept creation with an LLM > embedding
  • For summary and concept creation, a modification of the chain-of-density prompt seems useful. The output for the selected subset will be checked again (LDK)
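A chain-of-density-style prompt for the summary-and-concept step could be templated roughly like this. The wording, round count, and summary length below are illustrative assumptions; the project's actual prompt is not recorded in these minutes:

```python
def build_summary_prompt(article_text: str, rounds: int = 3, max_words: int = 80) -> str:
    """Illustrative chain-of-density-style template: the model iteratively
    rewrites a fixed-length summary, adding missing salient entities each
    round, and finally lists the concepts it covered."""
    return (
        f"Summarize the newspaper article below in {rounds} rounds.\n"
        f"Round 1: write a {max_words}-word summary.\n"
        "Each later round: rewrite the previous summary at the same length,\n"
        "adding 1-2 salient entities that are still missing.\n"
        "Finally, output the last summary plus a comma-separated list of\n"
        "the key concepts it covers.\n\n"
        f"Article:\n{article_text}"
    )
```

The fixed-length, entity-densifying rewrite is the core of the chain-of-density idea; the "modification" mentioned above would mainly be the added concept list at the end.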
Data Enrichment and Embedding Experiments
  • Following the data pre-processing, embedding experiments and subsequent clustering experiments will be conducted (all)
Data Visualization and testing against research questions
  • In the next meeting we will see whether we can specify the visualization
Sidenote

Minutes from meeting 29.08.2024 - status data preparation pipeline

Data Cleansing
  • Data cleansing will include the transformation from pandas pickles to JSON
  • A sliding-window approach will be used to clear out random characters
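The pickle-to-JSON transformation could look roughly like this. A sketch under the assumption that each pickle holds a plain list of record dicts; a pickled pandas DataFrame would first need `df.to_dict(orient="records")`:

```python
import json
import pickle
from pathlib import Path

def pickle_to_json(pkl_path: str, json_path: str) -> int:
    """Convert a pickled list of article records (plain dicts) to JSON.
    Returns the number of records written."""
    records = pickle.loads(Path(pkl_path).read_bytes())
    Path(json_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records)
```

`ensure_ascii=False` keeps umlauts and other non-ASCII characters readable in the JSON output instead of escaping them.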
Data Pre-Processing
  • Rewriting the "old German" text into "modern German" as part of the data pre-processing step is being tested
Data Enrichment and Embedding Experiments
  • The cleansed and pre-processed data will be enriched with metadata and short summaries of the first page of each newspaper
Data Visualization and testing against research questions
  • xx
Sidenote
  • xx
