Feature/1176 Separate fetching of German common names; Data overrides (#1186)

- [x] Separate fetching of German common names and merging of datasets, fixes #1176
- [x] Add apply-overrides functionality, fixes #726
- [x] Create PR with new data, overrides and README in [scraper-data repository](https://github.com/ElektraInitiative/scraper-data/)

Corresponding scraper-data PR:
ElektraInitiative/scraper-data#2

This PR supersedes #799 

<!--
Check relevant points but **please do not remove entries**.
-->

## Basics

<!--
These points need to be fulfilled for every PR.
-->

- [x] The PR is rebased with current master
- [x] I added a line to [changelog.md](/doc/changelog.md)
- [x] Details of what I changed are in the commit messages
- [x] References to issues, e.g. `close #X`, are in the commit messages
and changelog
- [ ] The buildserver is happy

<!--
If you have any troubles fulfilling these criteria, please write about
the trouble as comment in the PR.
We will help you, but we cannot accept PRs that do not fulfill the
basics.
-->

## Checklist

<!--
For documentation fixes, spell checking, and similar, none of the
points below need to be checked.
Otherwise, please check these points when getting a PR done:
-->

- [x] I fully described what my PR does in the documentation
- [x] I fixed all affected documentation
- [ ] I fixed the introduction tour
- [ ] I wrote migrations in a way that they are compatible with already
present data
- [ ] I fixed all affected decisions
- [ ] I added automated tests or a [manual test
protocol](../doc/tests/manual/protocol.md)
- [x] I added code comments, logging, and assertions as appropriate
- [ ] I translated all strings visible to the user
- [ ] I mentioned [every code or
binary](https://github.com/ElektraInitiative/PermaplanT/blob/master/.reuse/dep5)
not directly written or done by me in [reuse
syntax](https://reuse.software/)
- [ ] I created left-over issues for things that are still to be done
- [ ] Code is conforming to [our Architecture](/doc/architecture)
- [ ] Code is conforming to [our Guidelines](/doc/guidelines)
- [ ] Code is consistent to [our Design Decisions](/doc/decisions)
- [ ] Exceptions to any guidelines are documented

## First Time Checklist

<!--
These points are only relevant when creating a PR the first time.
-->

- [ ] I have installed and I am using [pre-commit
hooks](../doc/contrib/README.md#Hooks)
- [ ] I am using [Tailwind CSS
Linting](https://tailwindcss.com/blog/introducing-linting-for-tailwindcss-intellisense)

## Review

<!--
Reviewers can copy & check the following in their review.
The checklist above can also be used.
The PR creator should also check these points when getting a PR
done:
-->

- [ ] I've tested the code
- [ ] I've read through the whole code
- [ ] I've read through the whole documentation
- [ ] I've checked conformity to guidelines
- [ ] I've checked conformity to requirements
- [ ] I've checked that the requirements are tested
markus2330 authored Apr 4, 2024
2 parents 2737c9b + 815ebc3 commit 197207a
Showing 12 changed files with 504 additions and 84 deletions.
4 changes: 1 addition & 3 deletions ci/Jenkinsfile
@@ -286,9 +286,7 @@ lock("${env.NODE_NAME}-exclusive") {
dir('scraper') {
sh 'npm ci'
sh 'mkdir ./data/'
- sh 'cp /nextcloud/Database/scraper-data/mergedDatasets.csv ./data/'
- sh 'cp /nextcloud/Database/scraper-data/Companions.csv ./data/'
- sh 'cp /nextcloud/Database/scraper-data/Antagonist.csv ./data/'
+ sh 'cp /data/*.csv ./data/'
sh 'npm run insert'
sh 'rm -rf ./data/'
sh 'rm -rf ./node_modules/'
3 changes: 3 additions & 0 deletions doc/changelog.md
@@ -59,6 +59,9 @@ Syntax: `- short text describing the change _(Your Name)_`
- Fix the markdown so that mdbook tests pass _(Daniel Steinkogler)_
- _()_
- _()_
- Improved the scraper: Fixed a bug and improved cleaning for German common names _(temmey)_
- Scraper: Separate fetching of German common names from merging datasets _(Christoph Schreiner)_
- Scraper: Allow applying overrides to merged dataset _(Christoph Schreiner)_
- Make map geometry viewable and editable _(Moritz)_
- _()_
- Prevent propagating enter key on markdown editor _(Daniel Steinkogler)_
32 changes: 25 additions & 7 deletions scraper/README.md
@@ -29,7 +29,7 @@ cp .env.example .env.local

### Installation Option 1: With a single command

- The following command will fetch the data from the sources, merge the datasets and insert the data into the database:
+ The following command will fetch the data from the sources, merge the datasets, apply the overrides, and insert the data into the database:

```shell
npm run start:full
```

@@ -46,6 +46,7 @@ npm run start
1. `detail.csv` - scraped from PracticalPlants
2. `permapeopleRawData.csv` - scraped from Permapeople
3. `reinsaatRawData.csv` - scraped from Reinsaat and merged from `reinsaatRawDataEN.csv` and `reinsaatRawDataDE.csv`
4. `germanCommonNames.csv` - fetched from Wikidata

### Installation Option 2: Step by Step

@@ -76,6 +77,7 @@ The scraped data is stored in the `data` directory:
- `reinsaatRawDataEN.csv`: This file contains the raw data scraped from the English version of the Reinsaat webpage.
- `reinsaatRawDataDE.csv`: This file contains the raw data scraped from the German version of the Reinsaat webpage.
- `reinsaatRawData.csv`: This file contains the merged data scraped from the English and German versions of the Reinsaat webpage.
- `germanCommonNames.csv`: This file contains the German common names fetched from https://www.wikidata.org

2. Merge the scraped datasets

@@ -89,25 +91,41 @@

```shell
npm run merge:datasets
```

- 3. Correct data manually before the insertion (optional)
+ 3. Fetch German common names

This step goes through all unique names in `mergedDatasets.csv`, concurrently fetches the German common names from https://www.wikidata.org, and then merges them into `mergedDatasets.csv`.

If Wikidata starts responding with 429 (Too Many Requests) errors, reduce `MAX_CONCURRENT_REQUESTS` to a lower number, such as 10.

```shell
npm run fetch:germannames && npm run merge:germannames
```
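
For orientation, here is a minimal sketch of how such a lookup could work, assuming the `wbsearchentities` endpoint of the public Wikidata API; the scraper's actual logic lives in `src/fetch_german_names.js` and may differ:

```js
// Hypothetical sketch, not the scraper's actual implementation
// (see src/fetch_german_names.js for the real logic).
import axios from "axios";

const MAX_CONCURRENT_REQUESTS = 20; // lower this if Wikidata answers with 429

// Look up a German label for one unique (botanical) name.
async function fetchGermanName(uniqueName) {
  const response = await axios.get("https://www.wikidata.org/w/api.php", {
    params: {
      action: "wbsearchentities",
      search: uniqueName,
      language: "de",
      format: "json",
    },
  });
  // Use the label of the first match, if any.
  return response.data.search?.[0]?.label ?? null;
}

// Process names in batches to cap the number of in-flight requests.
async function fetchAllGermanNames(names) {
  const results = [];
  for (let i = 0; i < names.length; i += MAX_CONCURRENT_REQUESTS) {
    const batch = names.slice(i, i + MAX_CONCURRENT_REQUESTS);
    results.push(...(await Promise.all(batch.map(fetchGermanName))));
  }
  return results;
}
```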

4. Apply overrides

The scraped data can contain inconsistencies and errors.
- In order to correct these mistakes, we can manually correct the data i.e. change the values in the `mergedDatasets.csv` file.
- The corrected data in the new file should be stored in the same format as the generated data i.e. columns may not be changed.
+ In order to correct these mistakes, we can create override files.
+ `data/overrides` may contain any number of `.csv` files, which are applied consecutively to `mergedDatasets.csv` to create `finalDataset.csv`.

For details, see `data/overrides/README.md`.

```shell
npm run apply:overrides
```
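
As a hypothetical illustration (the column names and values below are made up; the authoritative format is described in `data/overrides/README.md`), an override file matches rows by `unique_name`, while the special `00_DELETIONS.csv` file removes whole rows:

```csv
unique_name,common_name_de
Daucus carota,"Karotte, Möhre"
Solanum lycopersicum,Tomate
```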

- 4. Insert the data into the database
+ 5. Insert the data into the database

The scraper also inserts the scraped data into the database:

```shell
npm run insert:plants
```

- 5. Insert relations into the database
+ 6. Insert relations into the database

The scraper inserts the relation data into the database.

- First you need to download the `Companions.csv` and `Antigonist.csv` file from the nextcloud server or export them yourself from the current `Plant_Relations.ods`.
+ First you need to download the `Companions.csv` and `Antagonist.csv` files from the Nextcloud server or export them yourself from the current `Plant_Relations.ods`.
Copy them into the /data directory and run:

```shell
npm run insert:relations
```
64 changes: 64 additions & 0 deletions scraper/package-lock.json

Some generated files are not rendered by default.

8 changes: 6 additions & 2 deletions scraper/package.json
@@ -10,18 +10,22 @@
"merge:datasets": "node src/merge_datasets.js",
"merge:reinsaat": "node src/merge_reinsaat.js",
"merge:csvfiles": "node src/helpers/merge_csv_files.js",
"fetch:germannames": "node src/fetch_german_names.js",
"merge:germannames": "node src/merge_german_names.js",
"apply:overrides": "node src/apply_overrides.js",
"insert:plants": "node src/insert_plants.js",
"insert:relations": "node src/insert_plant_relations.js",
"insert": "npm run insert:plants && npm run insert:relations",
"start:full": "npm run fetch:permapeople && npm run fetch:practicalplants && npm run fetch:reinsaat && npm run merge:reinsaat && npm run merge:datasets && npm run insert:plants",
"start": "npm run merge:datasets && npm run insert:plants"
"start:full": "npm run fetch:permapeople && npm run fetch:practicalplants && npm run fetch:reinsaat && npm run merge:reinsaat && npm run merge:datasets && npm run fetch:germannames && npm run merge:germannames && apply:overrides && npm run insert:plants",
"start": "npm run merge:datasets && npm run merge:germannames && apply:overrides && npm run insert:plants"
},
"keywords": [],
"author": "",
"license": "ISC",
"dependencies": {
"@playwright/test": "^1.32.0",
"axios": "^1.3.4",
"axios-retry": "^3.6.0",
"csvtojson": "^2.0.10",
"dotenv": "^16.0.3",
"json2csv": "^6.0.0-alpha.2",
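
The new `axios` and `axios-retry` dependencies suggest that the German-name fetcher retries failed HTTP requests. A typical wiring, as an assumption rather than a statement about how `src/fetch_german_names.js` actually configures it:

```js
import axios from "axios";
import axiosRetry from "axios-retry";

// Retry transient failures (e.g. network errors, 429/5xx responses)
// with exponential backoff before giving up.
axiosRetry(axios, {
  retries: 3,
  retryDelay: axiosRetry.exponentialDelay,
});
```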
76 changes: 76 additions & 0 deletions scraper/src/apply_overrides.js
@@ -0,0 +1,76 @@
import fs from "fs";
import path from "path";
import { parse as json2csv } from "json2csv";
import csv from "csvtojson";
import { cleanUpJsonForCsv } from "./helpers/helpers.js";
import { applyOverride } from "./helpers/override.js";

const deletionsFile = "00_DELETIONS.csv";

async function loadMergedDataset() {
  return csv().fromFile("data/mergedDatasets.csv");
}

async function applyDeletions(plants) {
  console.log(`[INFO] Deleting plants from data/overrides/${deletionsFile}`);

  const deletePlants = await csv().fromFile(`data/overrides/${deletionsFile}`);

  deletePlants.forEach((overridePlant) => {
    // find the plant
    const index = plants.findIndex(
      (plant) => plant.unique_name === overridePlant.unique_name
    );

    if (index === -1) {
      console.log(
        `[INFO] Could not find plant with unique_name '${overridePlant.unique_name}' in merged dataset.`
      );
      return;
    }

    // delete the plant
    plants.splice(index, 1);
  });

  return plants;
}

async function applyAllOverrides(plants) {
  const overridesDir = "data/overrides";
  if (!fs.existsSync(overridesDir)) {
    fs.mkdirSync(overridesDir);
  }

  // list all csv files in data/overrides
  const overrideFiles = fs.readdirSync(overridesDir);
  overrideFiles.sort();

  // apply all overrides
  for (const file of overrideFiles) {
    // skip non-csv files; deletions were already handled by applyDeletions
    if (path.extname(file) !== ".csv" || file === deletionsFile) {
      continue;
    }
    await applyOverride(plants, `${overridesDir}/${file}`);
  }

  return plants;
}

async function writePlantsToOverwriteCsv(plants) {
  console.log(
    `[INFO] Writing ${plants.length} plants to csv data/finalDataset.csv`
  );
  cleanUpJsonForCsv(plants);
  const csvFile = json2csv(plants);
  fs.writeFileSync("data/finalDataset.csv", csvFile);

  return plants;
}

loadMergedDataset()
  .then((plants) => applyDeletions(plants))
  .then((plants) => applyAllOverrides(plants))
  .then((plants) => writePlantsToOverwriteCsv(plants))
  .catch((error) => console.error(error));
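
The `applyOverride` helper imported from `./helpers/override.js` is not part of this diff. A minimal sketch of what it plausibly does, assuming rows are matched on `unique_name` and non-empty override columns replace the existing values:

```js
// Hypothetical sketch; the real implementation lives in
// scraper/src/helpers/override.js (not shown in this diff).
import csv from "csvtojson";

export async function applyOverride(plants, overrideFile) {
  console.log(`[INFO] Applying overrides from ${overrideFile}`);
  const overrides = await csv().fromFile(overrideFile);

  overrides.forEach((override) => {
    const plant = plants.find((p) => p.unique_name === override.unique_name);
    if (!plant) {
      console.log(
        `[INFO] Could not find plant with unique_name '${override.unique_name}' in merged dataset.`
      );
      return;
    }
    // Assumption: every non-empty column other than unique_name wins.
    for (const [key, value] of Object.entries(override)) {
      if (key !== "unique_name" && value !== "") {
        plant[key] = value;
      }
    }
  });

  return plants;
}
```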
