SRTK is a toolkit for semantic-relevant subgraph retrieval from large-scale knowledge graphs. It currently supports Wikidata, Freebase and DBpedia.
A minimal walkthrough of the retrieval process starts with installing the package:
pip install srtk
SRTK offers five main subcommands, which together cover the whole pipeline of subgraph retrieval.
For retrieval:
srtk link
: Link entity mentions in texts to a knowledge graph. Currently Wikidata and DBpedia are supported out of the box.

srtk retrieve
: Retrieve semantic-relevant subgraphs from a knowledge graph with a trained retriever. It can also be used to evaluate a trained retriever.

srtk visualize
: Visualize retrieved subgraphs using a graph visualization tool.
For training a retriever:
srtk preprocess
: Preprocess a dataset for training a subgraph retrieval model.

srtk train
: Train a subgraph retrieval model on a preprocessed dataset.
Use srtk [subcommand] --help to see the detailed usage of each subcommand.
srtk retrieve [-h] -i INPUT -o OUTPUT [-e SPARQL_ENDPOINT] -kg {freebase,wikidata,dbpedia}
-m SCORER_MODEL_PATH [--beam-width BEAM_WIDTH] [--max-depth MAX_DEPTH]
[--evaluate] [--include-qualifiers]
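For example, to retrieve subgraphs from Wikidata with a trained scorer (the file paths and search budget below are illustrative):

```bash
srtk retrieve -i data/linked.jsonl -o data/subgraphs.jsonl \
    -e https://query.wikidata.org/sparql -kg wikidata \
    -m artifacts/scorer --beam-width 10 --max-depth 2
```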
The scorer-model-path argument accepts any Hugging Face pretrained encoder model. If it is a local path, please ensure the tokenizer is also saved along with the model.
srtk visualize [-h] -i INPUT -o OUTPUT_DIR [-e SPARQL_ENDPOINT]
[-kg {wikidata,freebase}] [--max-output MAX_OUTPUT]
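For instance, to render the retrieved subgraphs (paths again illustrative):

```bash
srtk visualize -i data/subgraphs.jsonl -o outputs/visualized \
    -e https://query.wikidata.org/sparql -kg wikidata --max-output 10
```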
A scorer is the model used to navigate the expanding path. At each expansion step, the relations scored highest by the scorer are picked as the relations for the next hop. The score is based on the embedding similarity between the to-be-expanded relation and the query (question + previous expansion path).

The model is trained in a distantly supervised fashion: given the question entities and the answer entities, the shortest paths between them serve as the supervision signal.
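As a conceptual illustration of the scoring step (this is not SRTK's internal code; the query format and candidate relations are made up, and the sentence-transformers package is an assumption here):

```python
# A minimal sketch of embedding-similarity scoring, not SRTK's actual
# implementation. Assumes the sentence-transformers package is installed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-small")

# Query = question + previously expanded path (separator format is illustrative).
query = "Which universities did Barack Obama graduate from? [SEP] educated at"
candidates = ["academic degree", "employer", "place of birth"]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Relations with the highest similarity would be expanded in the next hop.
scores = util.cos_sim(query_emb, cand_embs)[0]
for relation, score in sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1]):
    print(f"{relation}: {score:.3f}")
```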
- Prepare training samples where question entities and answer entities are known.
The training data should be saved in a jsonl file (e.g. data/grounded.jsonl). Each training sample should come in the following format:

{
    "id": "sample-id",
    "question": "Which universities did Barack Obama graduate from?",
    "question_entities": ["Q76"],
    "answer_entities": ["Q49122", "Q1346110", "Q4569677"]
}
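A small helper along these lines (illustrative, not part of SRTK) produces such a file:

```python
# Illustrative helper, not part of SRTK: write grounded samples as jsonl,
# one JSON object per line, in the format shown above.
import json

samples = [
    {
        "id": "sample-id",
        "question": "Which universities did Barack Obama graduate from?",
        "question_entities": ["Q76"],
        "answer_entities": ["Q49122", "Q1346110", "Q4569677"],
    },
]

with open("data/grounded.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```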
- Preprocess the samples with the srtk preprocess command.

srtk preprocess [-h] -i INPUT -o OUTPUT [--intermediate-dir INTERMEDIATE_DIR] -e SPARQL_ENDPOINT
                -kg {wikidata,freebase} [--search-path] [--metric {jaccard,recall}]
                [--num-negative NUM_NEGATIVE] [--positive-threshold POSITIVE_THRESHOLD]
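For example (illustrative paths; the output file feeds the training step below):

```bash
srtk preprocess -i data/grounded.jsonl -o data/train.jsonl \
    -e https://query.wikidata.org/sparql -kg wikidata --search-path
```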
Under the hood, it does four things:
- Find the shortest paths between the question entities and the answer entities.
- Score the searched paths by their Jaccard similarity with the answer entities (a minimal sketch of this metric follows the list).
- Perform negative sampling: at each expansion step, the negative samples are the false relations connected to the tracked entities.
- Generate the training dataset as a jsonl file.
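The Jaccard metric itself is the standard set similarity; a minimal sketch (the entity sets are made up for illustration):

```python
# Jaccard similarity between the entities a path reaches and the answer set.
# Illustrative sketch; SRTK computes this internally during preprocessing.
def jaccard(reached: set, answers: set) -> float:
    if not reached and not answers:
        return 0.0
    return len(reached & answers) / len(reached | answers)

# A path reaching two of the three answer entities scores 2/3.
print(jaccard({"Q49122", "Q1346110"}, {"Q49122", "Q1346110", "Q4569677"}))
```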
The scorer should be initialized from a pretrained encoder model on the Hugging Face hub. Here we use intfloat/e5-small, a BERT-based encoder checkpoint.
srtk train --data-file data/train.jsonl \
--model-name-or-path intfloat/e5-small \
--save-model-path artifacts/scorer
SRTK is compatible with any language encoder or encoder-decoder model from the Hugging Face hub. You only need to specify the model name or path via arguments like --model-name-or-path or --scorer-model-path.
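For instance, swapping in a different encoder only changes one argument (the model and save path below are merely illustrative):

```bash
srtk train --data-file data/train.jsonl \
    --model-name-or-path sentence-transformers/all-MiniLM-L6-v2 \
    --save-model-path artifacts/scorer-minilm
```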
Here we provide some trained models for subgraph retrieval.
| Model | Dataset | Base Model | Notes |
|---|---|---|---|
| drt/srtk-scorer | WebQSP, SimpleQuestionsWikidata, SimpleDBpediaQA | roberta-base | Jointly trained for Wikidata, Freebase and DBpedia. |
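To use the released scorer directly, pass its hub name as the scorer model path (input and output paths below are illustrative):

```bash
srtk retrieve -i data/linked.jsonl -o data/subgraphs.jsonl \
    -e https://query.wikidata.org/sparql -kg wikidata \
    -m drt/srtk-scorer
```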
- End-to-end Subgraph Retrieval
- Train a Retriever on Wikidata with Weak Supervision
- Train a Retriever on Freebase with Weak Supervision
- Supervised Training with Wikidata Simple Questions
- Extend SRTK to other Knowledge Graphs
@article{shen2023srtk,
  title={SRTK: A Toolkit for Semantic-relevant Subgraph Retrieval},
  author={Shen, Yuanchun},
  journal={arXiv preprint arXiv:2305.04101},
  year={2023}
}

@inproceedings{zhang2022subgraph,
  title={Subgraph Retrieval Enhanced Model for Multi-hop Knowledge Base Question Answering},
  author={Zhang, Jing and Zhang, Xiaokang and Yu, Jifan and Tang, Jian and Tang, Jie and Li, Cuiping and Chen, Hong},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={5773--5784},
  year={2022}
}
This project is licensed under the terms of the MIT license.