This repository has been archived by the owner on Apr 24, 2024. It is now read-only.

Feature/1176 separate fetching of german common names; Data overrides #1186

Merged
26 commits
10356bd
implemented improved german name scrapper
temmey Aug 6, 2023
6c40095
add axios-retry to package.json
temmey Aug 6, 2023
37bd431
removed debug logging.
temmey Aug 6, 2023
07ee9cb
added information to readme
temmey Aug 6, 2023
626ffb3
add entry to changelog. closes #750
temmey Aug 6, 2023
366d037
Merge branch 'master' into improve-scraper-750
temmey Aug 8, 2023
cd68a86
fixed scraper readme
temmey Aug 7, 2023
005c1f6
improve scraper, faster, more names, cleaner
temmey Aug 8, 2023
a51939d
updated readme
temmey Aug 8, 2023
196ebd9
removed hybrid x, small cleanup
temmey Aug 8, 2023
fe7579e
replaced "" with null for csv
temmey Aug 8, 2023
f8ce8b2
Merge branch 'master' into improve-scraper-750
temmey Aug 12, 2023
13c934f
fixed grammer in readme
temmey Aug 12, 2023
7363353
applied requested changes
temmey Aug 12, 2023
709a4d0
Merge branch 'master' into improve-scraper-750
temmey Aug 15, 2023
625b852
Separate fetching of german common names and merging of datasets; Add…
chr-schr Feb 12, 2024
23919a6
German names separate from other overrides; updated changelog.md
chr-schr Feb 18, 2024
40fee76
Merged master
chr-schr Feb 18, 2024
1fed6a4
Fix broken merge; Make sure unique_name has no leading/trailing white…
chr-schr Feb 18, 2024
afdd157
Fixed new unique_name override not being applied
chr-schr Feb 29, 2024
dbc75d2
Merge branch 'master' into feature/1176-separate-fetching-of-german-c…
chr-schr Feb 29, 2024
a40bab2
Incorporate reviews: Removed commented code; Improved code documentation
chr-schr Mar 28, 2024
88f54cf
Merge branch 'master' into feature/1176-separate-fetching-of-german-c…
chr-schr Mar 28, 2024
e7f8a0f
Merge branch 'master' into feature/1176-separate-fetching-of-german-c…
chr-schr Mar 29, 2024
ffd12ea
Merge branch 'master' into feature/1176-separate-fetching-of-german-c…
markus2330 Apr 4, 2024
815ebc3
use new folder /data, simply copy all csvs
Apr 4, 2024
4 changes: 1 addition & 3 deletions ci/Jenkinsfile
@@ -286,9 +286,7 @@ lock("${env.NODE_NAME}-exclusive") {
dir('scraper') {
sh 'npm ci'
sh 'mkdir ./data/'
- sh 'cp /nextcloud/Database/scraper-data/mergedDatasets.csv ./data/'
- sh 'cp /nextcloud/Database/scraper-data/Companions.csv ./data/'
- sh 'cp /nextcloud/Database/scraper-data/Antagonist.csv ./data/'
+ sh 'cp /data/*.csv ./data/'
sh 'npm run insert'
sh 'rm -rf ./data/'
sh 'rm -rf ./node_modules/'
3 changes: 3 additions & 0 deletions doc/changelog.md
@@ -59,6 +59,9 @@ Syntax: `- short text describing the change _(Your Name)_`
- Fix the markdown so that mdbook tests pass _(Daniel Steinkogler)_
- _()_
- _()_
+ - Improved the scraper: Fixed a bug and improved cleaning for German common names _(temmey)_
+ - Scraper: Separate fetching of German common names from merging datasets _(Christoph Schreiner)_
+ - Scraper: Allow applying overrides to merged dataset _(Christoph Schreiner)_
- Make map geometry viewable and editable _(Moritz)_
- _()_
- Prevent propagating enter key on markdown editor _(Daniel Steinkogler)_
32 changes: 25 additions & 7 deletions scraper/README.md
@@ -29,7 +29,7 @@ cp .env.example .env.local

### Installation Option 1: With a single command

- The following command will fetch the data from the sources, merge the datasets and insert the data into the database:
+ The following command will fetch the data from the sources, merge the datasets, apply the overrides and insert the data into the database:

```shell
npm run start:full
@@ -46,6 +46,7 @@ npm run start
1. `detail.csv` - scraped from PracticalPlants
2. `permapeopleRawData.csv` - scraped from Permapeople
3. `reinsaatRawData.csv` - scraped from Reinsaat and merged from `reinsaatRawDataEN.csv` and `reinsaatRawDataDE.csv`
+ 4. `germanCommonNames.csv` - scraped from Wikidata

### Installation Option 2: Step by Step

@@ -76,6 +77,7 @@ The scraped data is stored in the `data` directory:
- `reinsaatRawDataEN.csv`: This file contains the raw data scraped from the english version of the Reinsaat webpage.
- `reinsaatRawDataDE.csv`: This file contains the raw data scraped from the german version of the Reinsaat webpage.
- `reinsaatRawData.csv`: This file contains the merged data scraped from the english and german version of the Reinsaat webpage.
+ - `germanCommonNames.csv`: This file contains the German common names fetched from https://www.wikidata.org.

2. Merge the scraped datasets

Expand All @@ -89,25 +91,41 @@ This can be done with the following command:
npm run merge:datasets
```

- 3. Correct data manually before the insertion (optional)
+ 3. Fetch German common names
+
+ This step goes through all unique names in `mergedDatasets.csv`, concurrently fetches the German common names from https://www.wikidata.org, and then merges them into `mergedDatasets.csv`.
+
+ If it starts throwing 429 errors, reduce `MAX_CONCURRENT_REQUESTS` to a lower number, such as 10.
+
+ ```shell
+ npm run fetch:germannames && npm run merge:germannames
+ ```
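The concurrency limit described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual code of `src/fetch_german_names.js`; `mapWithConcurrency` and `fetchGermanName` are hypothetical names.

```javascript
// Illustrative sketch of concurrency-limited fetching (not the actual
// implementation of src/fetch_german_names.js; mapWithConcurrency and
// fetchGermanName are hypothetical names).
const MAX_CONCURRENT_REQUESTS = 10;

async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly pulls the next unprocessed item off the shared
  // counter until the queue is exhausted.
  async function run() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }

  // Start at most `limit` runners in parallel.
  const runners = Array.from(
    { length: Math.min(limit, items.length) },
    () => run()
  );
  await Promise.all(runners);
  return results;
}

// Hypothetical usage:
// const names = await mapWithConcurrency(uniqueNames, MAX_CONCURRENT_REQUESTS, fetchGermanName);
```

Lowering the limit simply means fewer runners are started, which is why reducing `MAX_CONCURRENT_REQUESTS` helps against 429 responses.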

+ 4. Apply overrides
+
The scraped data can contain inconsistencies and errors.
- In order to correct these mistakes, we can manually correct the data i.e. change the values in the `mergedDatasets.csv` file.
- The corrected data in the new file should be stored in the same format as the generated data i.e. columns may not be changed.
+ To correct these mistakes, we can create override files.
+ `data/overrides` may contain any number of `csv` files, which are applied consecutively to `mergedDatasets.csv` to create `finalDataset.csv`.
+
+ For details, see `data/overrides/README.md`.
+
+ ```shell
+ npm run apply:overrides
+ ```
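Conceptually, applying a single override row can be pictured like this. This is a simplified sketch under the assumption that rows are matched by `unique_name` and that empty cells leave existing values untouched; the actual logic lives in `src/helpers/override.js` and may differ, and `applyOverrideRow` is a hypothetical name.

```javascript
// Simplified sketch of applying one override row (the real logic lives in
// src/helpers/override.js; matching by unique_name and skipping empty
// cells are assumptions, and applyOverrideRow is a hypothetical name).
function applyOverrideRow(plants, overrideRow) {
  const plant = plants.find((p) => p.unique_name === overrideRow.unique_name);
  if (!plant) {
    return false; // no matching plant in the merged dataset
  }
  for (const [column, value] of Object.entries(overrideRow)) {
    // Only non-empty cells override existing data; unique_name is the key.
    if (column !== "unique_name" && value !== "") {
      plant[column] = value;
    }
  }
  return true;
}
```

Because files are applied in order, a later override file can revise values set by an earlier one.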

- 4. Insert the data into the database
+ 5. Insert the data into the database

The scraper also inserts the scraped data into the database:

```shell
npm run insert:plants
```

- 5. Insert relations into the database
+ 6. Insert relations into the database

The scraper inserts the relation data into the database.

- First you need to download the `Companions.csv` and `Antigonist.csv` file from the nextcloud server or export them yourself from the current `Plant_Relations.ods`.
+ First you need to download the `Companions.csv` and `Antagonist.csv` files from the nextcloud server or export them yourself from the current `Plant_Relations.ods`.
Copy them into the /data directory and run:

```shell
64 changes: 64 additions & 0 deletions scraper/package-lock.json

Some generated files are not rendered by default.

8 changes: 6 additions & 2 deletions scraper/package.json
@@ -10,18 +10,22 @@
"merge:datasets": "node src/merge_datasets.js",
"merge:reinsaat": "node src/merge_reinsaat.js",
"merge:csvfiles": "node src/helpers/merge_csv_files.js",
+ "fetch:germannames": "node src/fetch_german_names.js",
+ "merge:germannames": "node src/merge_german_names.js",
+ "apply:overrides": "node src/apply_overrides.js",
"insert:plants": "node src/insert_plants.js",
"insert:relations": "node src/insert_plant_relations.js",
"insert": "npm run insert:plants && npm run insert:relations",
- "start:full": "npm run fetch:permapeople && npm run fetch:practicalplants && npm run fetch:reinsaat && npm run merge:reinsaat && npm run merge:datasets && npm run insert:plants",
- "start": "npm run merge:datasets && npm run insert:plants"
+ "start:full": "npm run fetch:permapeople && npm run fetch:practicalplants && npm run fetch:reinsaat && npm run merge:reinsaat && npm run merge:datasets && npm run fetch:germannames && npm run merge:germannames && npm run apply:overrides && npm run insert:plants",
+ "start": "npm run merge:datasets && npm run merge:germannames && npm run apply:overrides && npm run insert:plants"
},
"keywords": [],
"author": "",
"license": "ISC",
"dependencies": {
"@playwright/test": "^1.32.0",
"axios": "^1.3.4",
+ "axios-retry": "^3.6.0",
"csvtojson": "^2.0.10",
"dotenv": "^16.0.3",
"json2csv": "^6.0.0-alpha.2",
76 changes: 76 additions & 0 deletions scraper/src/apply_overrides.js
@@ -0,0 +1,76 @@
import fs from "fs";
import path from "path";
import { parse as json2csv } from "json2csv";
import csv from "csvtojson";
import { cleanUpJsonForCsv } from "./helpers/helpers.js";
import { applyOverride } from "./helpers/override.js";

const deletionsFile = "00_DELETIONS.csv";

async function loadMergedDataset() {
return csv().fromFile("data/mergedDatasets.csv");
}

async function applyDeletions(plants) {
console.log(`[INFO] Deleting plants from data/overrides/${deletionsFile}`);

const deletePlants = await csv().fromFile(`data/overrides/${deletionsFile}`);

deletePlants.forEach((overridePlant) => {
// find the plant
const index = plants.findIndex(
(plant) => plant.unique_name === overridePlant.unique_name
);

if (index === -1) {
console.log(
`[INFO] Could not find plant with unique_name '${overridePlant.unique_name}' in merged dataset.`
);
return;
}

// delete the plant
plants.splice(index, 1);
});

return plants;
}

async function applyAllOverrides(plants) {
const overridesDir = "data/overrides";
if (!fs.existsSync(overridesDir)) {
fs.mkdirSync(overridesDir);
}

// list all csv files in data/overrides
const overrideFiles = fs.readdirSync(overridesDir);
overrideFiles.sort();

// apply all overrides
for (const file of overrideFiles) {
// deletions were handled separately
if (path.extname(file) !== ".csv" || file === deletionsFile) {
continue;
}
await applyOverride(plants, `${overridesDir}/${file}`);
}

return plants;
}

async function writePlantsToOverwriteCsv(plants) {
console.log(
`[INFO] Writing ${plants.length} plants to csv data/finalDataset.csv`
);
cleanUpJsonForCsv(plants);
const csvFile = json2csv(plants);
fs.writeFileSync("data/finalDataset.csv", csvFile);

return plants;
}

loadMergedDataset()
.then((plants) => applyDeletions(plants))
.then((plants) => applyAllOverrides(plants))
.then((plants) => writePlantsToOverwriteCsv(plants))
.catch((error) => console.error(error));
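For illustration, the deletion pass above can be exercised on an in-memory array instead of CSV files. This is a standalone sketch mirroring the `findIndex`/`splice` logic of `applyDeletions`; the plant names are made up.

```javascript
// Standalone sketch of the deletion pass from applyDeletions, operating
// on in-memory arrays instead of CSV files (plant names are made up).
function deleteByUniqueName(plants, deletions) {
  deletions.forEach((overridePlant) => {
    const index = plants.findIndex(
      (plant) => plant.unique_name === overridePlant.unique_name
    );
    if (index === -1) {
      return; // not present in the dataset, nothing to delete
    }
    plants.splice(index, 1);
  });
  return plants;
}
```

Rows in `00_DELETIONS.csv` that match nothing are simply skipped with a log message, so stale deletion entries are harmless.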