Feature/1176 Separate fetching of German common names; Data overrides (#1186)

- [x] Separate fetching of German common names and merging of datasets, fixes #1176
- [x] Add apply-overrides functionality, fixes #726
- [x] Create PR with new data, overrides and README in [scraper-data repository](https://github.com/ElektraInitiative/scraper-data/)

Corresponding scraper-data PR:
ElektraInitiative/scraper-data#2

This PR supersedes #799 

<!--
Check relevant points but **please do not remove entries**.
-->

## Basics

<!--
These points need to be fulfilled for every PR.
-->

- [x] The PR is rebased with current master
- [x] I added a line to [changelog.md](/doc/changelog.md)
- [x] Details of what I changed are in the commit messages
- [x] References to issues, e.g. `close #X`, are in the commit messages
and changelog
- [ ] The buildserver is happy

<!--
If you have any troubles fulfilling these criteria, please write about
the trouble as comment in the PR.
We will help you, but we cannot accept PRs that do not fulfill the
basics.
-->

## Checklist

<!--
For documentation fixes, spell checking, and similar, none of the
points below need to be checked.
Otherwise, please check these points when getting a PR done:
-->

- [x] I fully described what my PR does in the documentation
- [x] I fixed all affected documentation
- [ ] I fixed the introduction tour
- [ ] I wrote migrations in a way that they are compatible with already
present data
- [ ] I fixed all affected decisions
- [ ] I added automated tests or a [manual test
protocol](../doc/tests/manual/protocol.md)
- [x] I added code comments, logging, and assertions as appropriate
- [ ] I translated all strings visible to the user
- [ ] I mentioned [every code or
binary](https://github.com/ElektraInitiative/PermaplanT/blob/master/.reuse/dep5)
not directly written or done by me in [reuse
syntax](https://reuse.software/)
- [ ] I created left-over issues for things that are still to be done
- [ ] Code is conforming to [our Architecture](/doc/architecture)
- [ ] Code is conforming to [our Guidelines](/doc/guidelines)
- [ ] Code is consistent to [our Design Decisions](/doc/decisions)
- [ ] Exceptions to any guidelines are documented

## First Time Checklist

<!--
These points are only relevant when creating a PR the first time.
-->

- [ ] I have installed and I am using [pre-commit
hooks](../doc/contrib/README.md#Hooks)
- [ ] I am using [Tailwind CSS
Linting](https://tailwindcss.com/blog/introducing-linting-for-tailwindcss-intellisense)

## Review

<!--
Reviewers can copy & check the following in their review.
The checklist above can also be used.
The PR creator should also check these points when getting a PR
done:
-->

- [ ] I've tested the code
- [ ] I've read through the whole code
- [ ] I've read through the whole documentation
- [ ] I've checked conformity to guidelines
- [ ] I've checked conformity to requirements
- [ ] I've checked that the requirements are tested
markus2330 authored Apr 4, 2024
2 parents 2737c9b + 815ebc3 commit 197207a
Showing 12 changed files with 504 additions and 84 deletions.
4 changes: 1 addition & 3 deletions ci/Jenkinsfile
@@ -286,9 +286,7 @@ lock("${env.NODE_NAME}-exclusive") {
dir('scraper') {
sh 'npm ci'
sh 'mkdir ./data/'
- sh 'cp /nextcloud/Database/scraper-data/mergedDatasets.csv ./data/'
- sh 'cp /nextcloud/Database/scraper-data/Companions.csv ./data/'
- sh 'cp /nextcloud/Database/scraper-data/Antagonist.csv ./data/'
+ sh 'cp /data/*.csv ./data/'
sh 'npm run insert'
sh 'rm -rf ./data/'
sh 'rm -rf ./node_modules/'
3 changes: 3 additions & 0 deletions doc/changelog.md
@@ -59,6 +59,9 @@ Syntax: `- short text describing the change _(Your Name)_`
- Fix the markdown so that mdbook tests pass _(Daniel Steinkogler)_
- _()_
- _()_
- Improved the scraper: Fixed a bug and improved cleaning for German common names _(temmey)_
- Scraper: Separate fetching of German common names from merging datasets _(Christoph Schreiner)_
- Scraper: Allow applying overrides to merged dataset _(Christoph Schreiner)_
- Make map geometry viewable and editable _(Moritz)_
- _()_
- Prevent propagating enter key on markdown editor _(Daniel Steinkogler)_
32 changes: 25 additions & 7 deletions scraper/README.md
@@ -29,7 +29,7 @@ cp .env.example .env.local

### Installation Option 1: With a single command

- The following command will fetch the data from the sources, merge the datasets and insert the data into the database:
+ The following command will fetch the data from the sources, merge the datasets, apply the overrides, and insert the data into the database:

```shell
npm run start:full
```

@@ -46,6 +46,7 @@ npm run start
1. `detail.csv` - scraped from PracticalPlants
2. `permapeopleRawData.csv` - scraped from Permapeople
3. `reinsaatRawData.csv` - scraped from Reinsaat and merged from `reinsaatRawDataEN.csv` and `reinsaatRawDataDE.csv`
4. `germanCommonNames.csv` - fetched from Wikidata

### Installation Option 2: Step by Step

@@ -76,6 +77,7 @@ The scraped data is stored in the `data` directory:
- `reinsaatRawDataEN.csv`: This file contains the raw data scraped from the English version of the Reinsaat webpage.
- `reinsaatRawDataDE.csv`: This file contains the raw data scraped from the German version of the Reinsaat webpage.
- `reinsaatRawData.csv`: This file contains the merged data scraped from the English and German versions of the Reinsaat webpage.
- `germanCommonNames.csv`: This file contains the German common names fetched from https://www.wikidata.org

2. Merge the scraped datasets

@@ -89,25 +91,41 @@

```shell
npm run merge:datasets
```

- 3. Correct data manually before the insertion (optional)
+ 3. Fetch German common names

This step goes through all unique names in `mergedDatasets.csv`, concurrently fetches the German common names from https://www.wikidata.org, and then merges them into `mergedDatasets.csv`.

If Wikidata starts responding with 429 (Too Many Requests) errors, reduce `MAX_CONCURRENT_REQUESTS` to a lower number, such as 10.

```shell
npm run fetch:germannames && npm run merge:germannames
```
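
For orientation, here is a minimal sketch of how such a lookup could work, assuming the `wbsearchentities` endpoint of the public Wikidata API; the scraper's actual logic lives in `src/fetch_german_names.js` and may differ:

```js
// Hypothetical sketch, not the scraper's actual implementation
// (see src/fetch_german_names.js for the real logic).
import axios from "axios";

const MAX_CONCURRENT_REQUESTS = 20; // lower this if Wikidata answers with 429

// Look up a German label for one unique (botanical) name.
async function fetchGermanName(uniqueName) {
  const response = await axios.get("https://www.wikidata.org/w/api.php", {
    params: {
      action: "wbsearchentities",
      search: uniqueName,
      language: "de",
      format: "json",
    },
  });
  // Use the label of the first match, if any.
  return response.data.search?.[0]?.label ?? null;
}

// Process names in batches to cap the number of in-flight requests.
async function fetchAllGermanNames(names) {
  const results = [];
  for (let i = 0; i < names.length; i += MAX_CONCURRENT_REQUESTS) {
    const batch = names.slice(i, i + MAX_CONCURRENT_REQUESTS);
    results.push(...(await Promise.all(batch.map(fetchGermanName))));
  }
  return results;
}
```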

4. Apply overrides

The scraped data can contain inconsistencies and errors.
- In order to correct these mistakes, we can manually correct the data i.e. change the values in the `mergedDatasets.csv` file.
- The corrected data in the new file should be stored in the same format as the generated data i.e. columns may not be changed.
+ In order to correct these mistakes, we can create override files.
+ `data/overrides` may contain any number of `.csv` files, which are applied consecutively to `mergedDatasets.csv` to create `finalDataset.csv`.

For details, see `data/overrides/README.md`.

```shell
npm run apply:overrides
```
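
As a hypothetical illustration (the column names and values below are made up; the authoritative format is described in `data/overrides/README.md`), an override file matches rows by `unique_name`, while the special `00_DELETIONS.csv` file removes whole rows:

```csv
unique_name,common_name_de
Daucus carota,"Karotte, Möhre"
Solanum lycopersicum,Tomate
```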

- 4. Insert the data into the database
+ 5. Insert the data into the database

The scraper also inserts the scraped data into the database:

```shell
npm run insert:plants
```

- 5. Insert relations into the database
+ 6. Insert relations into the database

The scraper inserts the relation data into the database.

- First you need to download the `Companions.csv` and `Antigonist.csv` file from the nextcloud server or export them yourself from the current `Plant_Relations.ods`.
+ First you need to download the `Companions.csv` and `Antagonist.csv` files from the Nextcloud server or export them yourself from the current `Plant_Relations.ods`.
Copy them into the /data directory and run:

```shell
npm run insert:relations
```
64 changes: 64 additions & 0 deletions scraper/package-lock.json

Some generated files are not rendered by default.

8 changes: 6 additions & 2 deletions scraper/package.json
@@ -10,18 +10,22 @@
"merge:datasets": "node src/merge_datasets.js",
"merge:reinsaat": "node src/merge_reinsaat.js",
"merge:csvfiles": "node src/helpers/merge_csv_files.js",
"fetch:germannames": "node src/fetch_german_names.js",
"merge:germannames": "node src/merge_german_names.js",
"apply:overrides": "node src/apply_overrides.js",
"insert:plants": "node src/insert_plants.js",
"insert:relations": "node src/insert_plant_relations.js",
"insert": "npm run insert:plants && npm run insert:relations",
"start:full": "npm run fetch:permapeople && npm run fetch:practicalplants && npm run fetch:reinsaat && npm run merge:reinsaat && npm run merge:datasets && npm run insert:plants",
"start": "npm run merge:datasets && npm run insert:plants"
"start:full": "npm run fetch:permapeople && npm run fetch:practicalplants && npm run fetch:reinsaat && npm run merge:reinsaat && npm run merge:datasets && npm run fetch:germannames && npm run merge:germannames && apply:overrides && npm run insert:plants",
"start": "npm run merge:datasets && npm run merge:germannames && apply:overrides && npm run insert:plants"
},
"keywords": [],
"author": "",
"license": "ISC",
"dependencies": {
"@playwright/test": "^1.32.0",
"axios": "^1.3.4",
"axios-retry": "^3.6.0",
"csvtojson": "^2.0.10",
"dotenv": "^16.0.3",
"json2csv": "^6.0.0-alpha.2",
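
The new `axios` and `axios-retry` dependencies suggest that the German-name fetcher retries failed HTTP requests. A typical wiring, as an assumption rather than a statement about how `src/fetch_german_names.js` actually configures it:

```js
import axios from "axios";
import axiosRetry from "axios-retry";

// Retry transient failures (e.g. network errors, 429/5xx responses)
// with exponential backoff before giving up.
axiosRetry(axios, {
  retries: 3,
  retryDelay: axiosRetry.exponentialDelay,
});
```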
76 changes: 76 additions & 0 deletions scraper/src/apply_overrides.js
@@ -0,0 +1,76 @@
import fs from "fs";
import path from "path";
import { parse as json2csv } from "json2csv";
import csv from "csvtojson";
import { cleanUpJsonForCsv } from "./helpers/helpers.js";
import { applyOverride } from "./helpers/override.js";

const deletionsFile = "00_DELETIONS.csv";

async function loadMergedDataset() {
  return csv().fromFile("data/mergedDatasets.csv");
}

async function applyDeletions(plants) {
  console.log(`[INFO] Deleting plants from data/overrides/${deletionsFile}`);

  const deletePlants = await csv().fromFile(`data/overrides/${deletionsFile}`);

  deletePlants.forEach((overridePlant) => {
    // find the plant
    const index = plants.findIndex(
      (plant) => plant.unique_name === overridePlant.unique_name
    );

    if (index === -1) {
      console.log(
        `[INFO] Could not find plant with unique_name '${overridePlant.unique_name}' in merged dataset.`
      );
      return;
    }

    // delete the plant
    plants.splice(index, 1);
  });

  return plants;
}

async function applyAllOverrides(plants) {
  const overridesDir = "data/overrides";
  if (!fs.existsSync(overridesDir)) {
    fs.mkdirSync(overridesDir);
  }

  // list all csv files in data/overrides
  const overrideFiles = fs.readdirSync(overridesDir);
  overrideFiles.sort();

  // apply all overrides
  for (const file of overrideFiles) {
    // skip non-csv files; deletions were already handled by applyDeletions
    if (path.extname(file) !== ".csv" || file === deletionsFile) {
      continue;
    }
    await applyOverride(plants, `${overridesDir}/${file}`);
  }

  return plants;
}

async function writePlantsToOverwriteCsv(plants) {
  console.log(
    `[INFO] Writing ${plants.length} plants to csv data/finalDataset.csv`
  );
  cleanUpJsonForCsv(plants);
  const csvFile = json2csv(plants);
  fs.writeFileSync("data/finalDataset.csv", csvFile);

  return plants;
}

loadMergedDataset()
  .then((plants) => applyDeletions(plants))
  .then((plants) => applyAllOverrides(plants))
  .then((plants) => writePlantsToOverwriteCsv(plants))
  .catch((error) => console.error(error));
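
The `applyOverride` helper imported from `./helpers/override.js` is not part of this diff. A minimal sketch of what it plausibly does, assuming rows are matched on `unique_name` and non-empty override columns replace the existing values:

```js
// Hypothetical sketch; the real implementation lives in
// scraper/src/helpers/override.js (not shown in this diff).
import csv from "csvtojson";

export async function applyOverride(plants, overrideFile) {
  console.log(`[INFO] Applying overrides from ${overrideFile}`);
  const overrides = await csv().fromFile(overrideFile);

  overrides.forEach((override) => {
    const plant = plants.find((p) => p.unique_name === override.unique_name);
    if (!plant) {
      console.log(
        `[INFO] Could not find plant with unique_name '${override.unique_name}' in merged dataset.`
      );
      return;
    }
    // Assumption: every non-empty column other than unique_name wins.
    for (const [key, value] of Object.entries(override)) {
      if (key !== "unique_name" && value !== "") {
        plant[key] = value;
      }
    }
  });

  return plants;
}
```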
