This repo is for sharing and organizing our material for the competition.
- 4 July 2024, 16:15: kick-off event (digital) DONE WE WERE THERE (see minutes from the kick-off meeting, 04.07.2024)
- July 2024, by 23:59: registration DONE
- October 2024, 16:15: interim report event (digital)
- November 2024, by 23:59: submission of results
- January 2025, 16:15: closing event and announcement of the winner (digital)
- March 2025: award ceremony at the DHd 2025 conference in Bielefeld
- PDF in DE or EN, 200-300 words, "Begründung und Kontextualisierung der Fragestellung" (justification and contextualization of the research question) ASAP
- Jupyter Notebook titled "_code.ipynb" (restricted to Python, R, Java) IN NOVEMBER
- PDF titled "_text.pdf" in DE or EN IN NOVEMBER
- A creative humanities research question that justifies and requires the analysis of big data.
- A critical and multifaceted analysis of the dataset.
- Use of creative and, as far as possible, generalizable data science methods to identify and remedy deficiencies in the dataset.
- For approach A: a solid argument for the scholarly answer to the posed question.
- For approach B: development of an efficient plan for the newspapers' digitization process that complies with the FAIR principles.
- German-language newspaper articles: extracted text + OCR (optional), incl. metadata, from 1914 to 1945, as pickled pandas DataFrames (see the loading sketch below)
- https://hessenbox.uni-marburg.de/getlink/fi5WMibFaZX2ueh4xBvqwM/Datensatz (159 GB)
- There is additional image data of up to 2 TB
All public information.
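
A minimal loading sketch, assuming the download unpacks into one pickle file per newspaper and year; the `data/` directory and the file naming are placeholders, not the actual layout of the dump:

```python
from pathlib import Path

import pandas as pd

# Hypothetical layout: one pickle per newspaper/year,
# e.g. data/reichsanzeiger_1914.pkl -- adjust to the real dump.
DATA_DIR = Path("data")

frames = []
for pkl in sorted(DATA_DIR.glob("*.pkl")):
    df = pd.read_pickle(pkl)        # each file is assumed to hold one DataFrame
    df["source_file"] = pkl.name    # keep provenance for faceting later
    frames.append(df)

articles = pd.concat(frames, ignore_index=True)
print(articles.shape)
```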
- We receive the raw data right after its OCR pre-processing; that is what we will work on
- They recommend not using the API because the data there is still changing; instead, they ask us to download the data from the link above
- If we do not have enough disk space, we can work with a fraction of the provided data instead of the full dataset (see the sampling sketch after this list)
- We must not enrich the data with other data sources, because of licensing etc.
- Everyone: think about interesting concepts that could be investigated, for the next meeting
- Everything goes into the GitHub repo
- We have to deliver two things: (1) a research question and (2) approach A or approach B
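
If disk space is tight, a reproducible sample could be drawn once and reused across experiments; this assumes the `articles` DataFrame from the loading sketch above, and the 5% fraction is an arbitrary placeholder:

```python
# Assumes the `articles` DataFrame from the loading sketch above.
# A fixed random_state makes the sample reproducible across runs.
sample = articles.sample(frac=0.05, random_state=42)
sample.to_pickle("articles_sample.pkl")  # reuse the same sample everywhere
```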
- Due to the OCR there is a lot of noise in the data, e.g. useless pieces and fragments of characters; a sliding-window approach will therefore be used to throw away gibberish (DT) (see the first sketch after this list)
- Due to the huge amount of data, we will work on a subset for further investigation: the data will be faceted, and a selection of journals and years will be derived from that (BS)
- Using LLMs or other useful methods, the articles are refined to obtain the following variables: journal outlet (extracted), publication year (extracted), title (generated), summary of article (generated), keyword/concept (generated) (MM)
- Using LLMs or other useful methods, the data is enriched to obtain the following variable: keyword/concept (generated) (LDK)
- After the data pre-processing, embedding experiments and then clustering experiments will be conducted (all) (see the second sketch after this list)
- In the next meeting we will see whether we can specify the visualization
- Larry J. Griffin and Robert R. Korstad (1998): Historical Inference and Event-Structure Analysis, https://www.jstor.org/stable/26405517
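
First sketch: a simplified, non-overlapping variant of the sliding-window gibberish filter; the window size and the letters-and-spaces threshold are assumptions to be tuned on the subset, and DT's actual implementation may differ:

```python
import re

def strip_gibberish(text: str, window: int = 40, min_alpha: float = 0.6) -> str:
    """Drop text windows whose share of letters and spaces is too low.

    Windows dominated by stray symbols (typical OCR debris) are discarded;
    the surviving windows are rejoined and whitespace is normalized.
    """
    kept = []
    for start in range(0, len(text), window):
        chunk = text[start:start + window]
        alpha = sum(ch.isalpha() or ch.isspace() for ch in chunk)
        if chunk and alpha / len(chunk) >= min_alpha:
            kept.append(chunk)
    return re.sub(r"\s+", " ", "".join(kept)).strip()

print(strip_gibberish("Die Zeitung erschien täglich am Morgen. %&)(#;~^°*+~~]|[<>§$"))
```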
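Second sketch: a minimal embedding-plus-clustering run, assuming `sentence-transformers` and scikit-learn; the model checkpoint, the toy inputs, and the cluster count are all placeholders:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Toy input; in practice this would be the LLM-generated summary per article.
summaries = [
    "Bericht über die Mobilmachung im August 1914.",
    "Kommentar zur Inflation und zu den Lebensmittelpreisen 1923.",
    "Meldung über die Eröffnung einer neuen Eisenbahnlinie.",
    "Leitartikel zur Reichstagswahl und den Parteien.",
]

# Model choice is an assumption; any German-capable embedding model works.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(summaries)

# k=2 only fits this toy example; the real k has to be tuned on the subset.
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(embeddings)
print(list(zip(labels, summaries)))
```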
- The sliding-window filtering and "beautification" of the data work. The code runs slowly, so this will be checked again (DT -> BS)
- The data subset for initial tests is Süddeutsche, Kölnischer and Reichsanzeiger.
- Pipeline: cleaning > reduction via LLM-generated summaries and concepts > embedding
- For generating the summaries and concepts, a modified chain-of-density prompt seems useful; the output will be checked again on the selected subset (LDK) (see the prompt sketch after this list)
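
A loose sketch of what the modified chain-of-density prompting could look like; the prompt wording, the `gpt-4o-mini` model name, and the OpenAI client are placeholder assumptions, not the team's settled choice:

```python
from openai import OpenAI  # placeholder client; any LLM backend works similarly

# Hypothetical condensation prompt, loosely adapted from chain-of-density.
PROMPT = """Article:
{article}

Write a 2-3 sentence summary. Then densify it twice, each time folding in
1-2 missing salient entities without increasing the length. Return only the
final, densest version plus 3-5 keywords."""

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(article_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in the team's choice
        messages=[{"role": "user", "content": PROMPT.format(article=article_text)}],
    )
    return response.choices[0].message.content
```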
- Data cleansing starts with the transformation from pandas pickles to JSON (see the sketch after this list)
- A sliding-window approach clears out random characters
- Rewriting the "old German" text into "modern German" is being tested as part of the data pre-processing step
- The cleansed and pre-processed data will be enriched with metadata and short summaries of the first page of each newspaper
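
A minimal sketch of the pickle-to-JSON step, reusing the placeholder directory layout from the loading sketch; writing JSON Lines (one object per article) keeps large files streamable:

```python
from pathlib import Path

import pandas as pd

SRC, DST = Path("data"), Path("json")  # placeholder directories
DST.mkdir(exist_ok=True)

for pkl in SRC.glob("*.pkl"):
    df = pd.read_pickle(pkl)
    # orient="records" + lines=True emits one JSON object per row (article);
    # force_ascii=False preserves German umlauts as-is.
    df.to_json(DST / f"{pkl.stem}.jsonl", orient="records", lines=True,
               force_ascii=False)
```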