SPARQL dataset statistics

This subproject (part of the QADO question answering dataset RDFizer) addresses the need for a good understanding of the characteristics of Knowledge Graph Question Answering (KGQA) benchmark datasets. For this purpose, it provides a tool that creates CSV files containing statistics about the RDFized data in a triplestore.

Requirements

Usage

./createSpreadsheet.sh

This script creates a new CSV file for each statistic and stores it under the name pattern QADO-statistics-${statistic_name}-${current_date}.csv in the folder /tmp/QADO-statistics/ (by default). To change the output directory, pass it as a parameter to the script, e.g.:

./createSpreadsheet.sh myOutputDirectory
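
The naming pattern above can be illustrated with a minimal Python sketch. It is not taken from the script itself: the function name `statistics_path` is hypothetical, and the ISO date format (`YYYY-MM-DD`) for `${current_date}` is an assumption, since the README does not state the format.

```python
from datetime import date
from pathlib import PurePosixPath


def statistics_path(statistic_name, output_dir="/tmp/QADO-statistics", today=None):
    """Build a file path following QADO-statistics-${statistic_name}-${current_date}.csv.

    Assumption: the date is rendered in ISO format (YYYY-MM-DD).
    """
    today = today or date.today()
    file_name = f"QADO-statistics-{statistic_name}-{today.isoformat()}.csv"
    return PurePosixPath(output_dir) / file_name


# Default output directory versus a custom one passed as a parameter.
print(statistics_path("questions_per_language"))
print(statistics_path("questions_per_language", output_dir="myOutputDirectory"))
```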

Configuration

Script configuration

Edit config.py to select the triplestore where your data is stored.

Statistics

Several statistics are already implemented. To create your own statistics, store your SPARQL queries in the queries directory. The tool creates a new CSV file for each SPARQL query file.
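
The one-query-file-to-one-CSV-file mapping can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: `export_statistics` and the `run_query` callable are hypothetical names, the `.sparql` file extension is assumed for all query files, and the output file name is simplified to the query file's stem.

```python
import csv
from pathlib import Path


def export_statistics(queries_dir, output_dir, run_query):
    """Write one CSV file per SPARQL query file found in queries_dir.

    run_query is a hypothetical callable that takes the query text and
    returns (header, rows); in practice it would talk to the triplestore.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for query_file in sorted(Path(queries_dir).glob("*.sparql")):
        header, rows = run_query(query_file.read_text(encoding="utf-8"))
        target = output_dir / f"{query_file.stem}.csv"
        with target.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)   # column names from the SELECT clause
            writer.writerows(rows)    # one CSV row per result binding
        written.append(target)
    return written
```

Passing the query runner in as a parameter keeps the sketch independent of any particular SPARQL client library.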

Boxplot data

If you want to generate a boxplot chart, the name of your query file must end in _boxplot.sparql. Additionally, the query must select the data for the boxplot chart via (GROUP_CONCAT(?YOUR_VARIABLE ; SEPARATOR = ",") AS ?concat).
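
To show why the values are concatenated with a comma separator, here is a minimal sketch of how a `?concat` value could be turned into the five numbers a boxplot needs. It is an assumption about downstream processing, not the tool's own implementation; `boxplot_summary` is a hypothetical name, and the quartile method (Python's "inclusive" method) is one of several common conventions.

```python
import statistics


def boxplot_summary(concat):
    """Compute the five-number summary from a comma-separated ?concat value.

    Assumes at least two numeric values, e.g. "12,7,9,15".
    """
    values = sorted(float(item) for item in concat.split(",") if item.strip())
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": values[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values[-1],
    }


print(boxplot_summary("1,2,3,4,5"))
```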

Default statistics

By default, the tool generates the following statistics for RDFized Question Answering benchmarks:

  • Query length per benchmark

  • Query modifiers per benchmark

  • Question length per

    • answer type and benchmark (boxplot)

    • benchmark (boxplot)

    • language and benchmark (boxplot)

  • Questions per

    • answer type

    • benchmark

    • language

    • language and benchmark

  • Question type per benchmark

  • Statistics of used resources inside the SPARQL queries per benchmark