This subproject (part of the QADO question answering dataset RDFizer) addresses the need for a good understanding of the characteristics of Knowledge Graph Question Answering (KGQA) benchmark datasets. For this purpose, it provides a tool that creates CSV files containing statistics about the RDFized data in a triplestore.
- Docker needs to be installed on your system.
- Your triplestore needs to be accessible via a SPARQL endpoint (cf. Section Configuration).
./createSpreadsheet.sh
This script creates new CSV files with all statistics and stores them following the pattern QADO-statistics-${statistic_name}-${current_date}.csv in the folder /tmp/QADO-statistics/ (by default).
To change the output directory, you can provide a parameter to the script, e.g.:
./createSpreadsheet.sh myOutputDirectory
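The naming pattern and the output directory parameter can be illustrated with a short Python sketch. This is not the script's own code: the function name `statistics_path`, the statistic name in the example, and the ISO date format are assumptions made for illustration only.

```python
from datetime import date
from pathlib import Path

# Default output folder as documented above.
DEFAULT_OUTPUT_DIR = "/tmp/QADO-statistics/"


def statistics_path(statistic_name, output_dir=DEFAULT_OUTPUT_DIR, current_date=None):
    """Build the CSV path for one statistic, following the documented pattern."""
    # Assumption: the date is rendered as YYYY-MM-DD; the real script may differ.
    day = (current_date or date.today()).isoformat()
    return Path(output_dir) / f"QADO-statistics-{statistic_name}-{day}.csv"


# Hypothetical statistic name, written to a custom output directory.
print(statistics_path("questions-per-language", "myOutputDirectory", date(2024, 1, 31)))
```

Passing no `output_dir` falls back to the default folder, which mirrors calling the script without a parameter.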
Edit config.py to select the triplestore where your data is stored.
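Before pointing the tool at a triplestore, it can help to confirm that the endpoint answers SPARQL queries at all. The following is a minimal sketch using only the Python standard library; it is independent of the tool, and the endpoint URL as well as the helper names `build_ask_url` and `endpoint_is_reachable` are hypothetical.

```python
import urllib.parse
import urllib.request


def build_ask_url(endpoint):
    """Return a GET URL that sends a trivial ASK query to the endpoint."""
    query = "ASK { ?s ?p ?o }"
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urllib.parse.urlencode({"query": query})


def endpoint_is_reachable(endpoint, timeout=10):
    """Return True if the endpoint answers the ASK query with HTTP 200."""
    try:
        request = urllib.request.Request(
            build_ask_url(endpoint),
            headers={"Accept": "application/sparql-results+json"},
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # OSError covers connection failures, HTTP errors and timeouts;
        # ValueError covers strings that are not valid URLs.
        return False


# Hypothetical endpoint URL; replace it with the one of your triplestore.
print(build_ask_url("http://localhost:8890/sparql"))
```

Calling `endpoint_is_reachable(...)` with your endpoint URL then tells you whether the triplestore responds; a `False` result does not distinguish between a wrong URL, a stopped service, and an endpoint that requires authentication.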
A number of existing statistics are already implemented.
To create your own statistics, store your SPARQL queries in the queries directory.
This tool creates a new CSV file for each SPARQL query file.
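To make the step from one SPARQL query to one CSV file more concrete, the sketch below converts a SELECT result in the standard SPARQL 1.1 Query Results JSON format into CSV text. It is an illustration under assumptions, not the tool's actual implementation: the function name `bindings_to_csv` is hypothetical, and the example result is made up rather than taken from real QADO data.

```python
import csv
import io


def bindings_to_csv(sparql_json):
    """Convert a SPARQL 1.1 JSON SELECT result into CSV text."""
    variables = sparql_json["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(variables)  # header row: one column per query variable
    for binding in sparql_json["results"]["bindings"]:
        # Unbound variables are absent from a binding; write them as empty cells.
        writer.writerow([binding.get(var, {}).get("value", "") for var in variables])
    return buffer.getvalue()


# Made-up illustrative result, not real QADO output.
example = {
    "head": {"vars": ["benchmark", "questions"]},
    "results": {"bindings": [
        {"benchmark": {"type": "literal", "value": "example-benchmark"},
         "questions": {"type": "literal", "value": "3"}},
        {"benchmark": {"type": "literal", "value": "other-benchmark"}},
    ]},
}
print(bindings_to_csv(example))
```

The column order follows the order of the variables in the query's SELECT clause, so a custom query controls the layout of its CSV file through that clause.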
By default, the tool generates the following statistics for RDFized Question Answering benchmarks:
- Query length per benchmark
- Query modifiers per benchmark
- Question length per
  - answer type and benchmark (boxplot)
  - benchmark (boxplot)
  - language and benchmark (boxplot)
- Questions per
  - answer type
  - benchmark
  - language
  - language and benchmark
- Question type per benchmark
- Statistics of used resources inside the SPARQL queries per benchmark