Tim L edited this page Jan 17, 2014

What is first

What we will cover

This page describes the concept and implementation of a technique to summarize arbitrary RDF graphs. We'll summarize the named graphs in as a running example.

Let's get to it!

The summarizer is invoked through a wrapper around the Java invocation that follows the situate shell paths pattern.

In a separate Prizms node, we set up the dataset "sparql" with source "ieeevis-tw-rpi-edu" at directory data/source/ieeevis-tw-rpi-edu/sparql.

usage: RepositorySummarizer { -(sysin) [reportURI | .] |
                              -r(emote) serverURL repositoryID <reportURI | .> [context-to-summarize ...] |
                              -d(irectory) path/to/sesame-native-dir/ [context-to-summarize ...] |
                              -f(ile) path/to/a.rdf <reportURI | .> }
   -(sysin):     Summarize the RDF on standard in; print summary report to standard out.
                 If reportURI or . are provided, print TRiG instead of RDF/XML.
   -r(remote):   Summarize listed specimenContexts in repositoryID at serverURL. 
                 If no specimenContexts listed, summarize all contexts in repository.
   -d(irectory): Summarize listed specimenContexts in sesame native directory. 
                 If no specimenContexts listed, summarize all contexts in directory.
   -f(ile):      Summarize the RDF in file; print summary report to standard out.

( version: 2013-Apr-03 )
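The usage text above describes several argument shapes. As a minimal sketch (not part of the project), the helper below assembles the argument list for the remote, directory, and file modes. The function name `summarizer_args` and all example values are hypothetical, and reading `-r(emote)` as accepting the short form `-r` is an assumption; only the argument order comes from the usage text.

```python
def summarizer_args(mode, *, server_url=None, repository_id=None,
                    path=None, report_uri=".", contexts=()):
    """Assemble RepositorySummarizer arguments for one mode of the usage text.

    Hypothetical helper: it builds a list of strings and runs nothing.
    """
    contexts = list(contexts)
    if mode == "remote":
        # -r(emote) serverURL repositoryID <reportURI | .> [context-to-summarize ...]
        if server_url is None or repository_id is None:
            raise ValueError("remote mode needs server_url and repository_id")
        return ["-r", server_url, repository_id, report_uri] + contexts
    if mode in ("directory", "file"):
        if path is None:
            raise ValueError(f"{mode} mode needs path")
        if mode == "directory":
            # -d(irectory) path/to/sesame-native-dir/ [context-to-summarize ...]
            return ["-d", path] + contexts
        # -f(ile) path/to/a.rdf <reportURI | .>
        return ["-f", path, report_uri]
    raise ValueError(f"unknown mode: {mode!r}")


# Example values are placeholders, not the endpoints used on this page.
args = summarizer_args(
    "remote",
    server_url="http://localhost:8080/openrdf-sesame",
    repository_id="example-repo",
    contexts=["http://example.org/graph/1"],
)
print(" ".join(args))
```

Leaving `contexts` empty corresponds to the "summarize all contexts" case described in the usage text.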


Summary description

The following is a sketch of the summary description. The implementation differs slightly.

@prefix sio: <> .

# We analyzed a graph with name that was
# provided by a SPARQL endpoint

   a sd:NamedGraph;
   sd:name <>;
   prov:hadLocation <>;

# We derived a few datasets during our analysis.

   a void:Dataset, vsr:SPOBalanceSet;
   void:subset <subjects>, <predicates>, <objects>;
   prov:wasDerivedFrom <>;
<resources> # This needs to be split up into S and O...
   a void:Dataset, vsr:ResourceSet;
   sio:count 99;
   sio:has-member <>,
                  ... 93 more ...
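To make the sketch above concrete, here is a small illustration of the kind of counting it describes. It assumes that the SPO balance amounts to the numbers of distinct subjects, predicates, and objects in a graph, and that the resource set is the union of subjects and objects; the page does not spell this out, so treat both as assumptions. The function name and the example triples are hypothetical.

```python
def spo_balance(triples):
    """Count distinct subjects, predicates, and objects in an iterable of triples."""
    subjects, predicates, objects = set(), set(), set()
    for s, p, o in triples:
        subjects.add(s)
        predicates.add(p)
        objects.add(o)
    return {
        "subjects": len(subjects),
        "predicates": len(predicates),
        "objects": len(objects),
        # Every term seen as a subject or an object; literals are not
        # treated specially in this sketch.
        "resources": len(subjects | objects),
    }


# Hypothetical triples, written as plain tuples of URI strings.
example_triples = [
    ("http://example.org/a", "http://example.org/vocab#knows", "http://example.org/b"),
    ("http://example.org/a", "http://example.org/vocab#likes", "http://example.org/c"),
    ("http://example.org/b", "http://example.org/vocab#knows", "http://example.org/c"),
]
balance = spo_balance(example_triples)
print(balance)
```

Keeping the subject and object sets separate before taking their union also gives the split into S and O that the comment in the sketch asks for.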

src/ wraps the call to

We use a Sesame repository, which can be started by running Tomcat: apache-tomcat-7.0.34/bin/

log.rtf contains implementation details.

Stereotyping predicate counts by preferred vocabularies

Color the predicate counts by those predicates that occur in a given curated list of vocabulary namespaces.
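One way to picture this step is a lookup of each predicate against the curated namespaces. The sketch below is an illustration only: the curated list, the function name `stereotype`, the example counts, and the "highlight"/"grey" labels are all hypothetical, and matching by URI prefix is an assumption about how "occurs in a namespace" would be decided.

```python
# Hypothetical curated list of preferred vocabulary namespaces.
PREFERRED_NAMESPACES = {
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def stereotype(predicate_counts, preferred=PREFERRED_NAMESPACES):
    """Label each (predicate, count) with the preferred prefix it falls under, or None."""
    labelled = []
    for predicate, count in sorted(predicate_counts.items()):
        prefix = next(
            (name for name, ns in preferred.items() if predicate.startswith(ns)),
            None,
        )
        labelled.append((predicate, count, prefix))
    return labelled


# Hypothetical predicate counts.
counts = {
    "http://purl.org/dc/terms/title": 6,
    "http://xmlns.com/foaf/0.1/name": 2,
    "http://example.org/vocab#weight": 1,
}
for predicate, count, prefix in stereotype(counts):
    color = "highlight" if prefix else "grey"
    print(f"{count:>3}  {color:<9}  {predicate}")
```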

Preferred vocabulary "word clouds"

Decorate the SPO balance with a "word cloud" of prefixes for the [preferred] namespaces that the graph uses. This aggregated information should be derivable from the SPO repository summary RDF description.


PREFIX rdfs: <>
PREFIX owl: <>
PREFIX sio: <>
PREFIX vsr: <>
PREFIX void: <>

select ?vocabulary ?predicate ?count
where {
    # ?spo stands in for the summarized dataset, whose URI is missing here.
    ?spo a vsr:SPODataset;
       void:subset [
          a vsr:PredicatesDataset;
          void:subset [
             a vsr:PredicateOccurrenceDataset;
             owl:hasValue ?predicate;
             sio:count    ?count
          ]
       ] .
    optional { ?predicate rdfs:isDefinedBy ?vocabulary }
}
# Rows are ordered (not grouped) by vocabulary: GROUP BY would require
# aggregating ?predicate and ?count.
order by ?vocabulary ?predicate ?count

The following query lists each predicate with its occurrence count:

PREFIX rdf:  <>
PREFIX rdfs: <>
PREFIX owl:  <>
PREFIX sio:  <>
PREFIX vsr:  <>
PREFIX void: <>

select distinct ?predicate ?count
where {
    # ?spo stands in for the summarized dataset, whose URI is missing here.
    ?spo a vsr:SPODataset;
       void:subset [    # </spo/p>
          a vsr:PredicatesDataset;
          void:subset [ # </spo/p/bin/1>, </spo/p/bin/2>, ...
             a vsr:PredicateOccurrenceDataset;
             owl:onProperty rdf:predicate;
             owl:hasValue  ?predicate;
             sio:count     ?count
          ]
       ] .
}
order by ?predicate ?count

If a dataset uses the following properties with the following frequencies, then we can model it as the following RDF, using void:vocabulary and void:propertyPartition.

# This was already provided by the SPO summary calculation:

    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <>;
    sio:count "1"^^xsd:int;

    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <>;
    sio:count "1"^^xsd:int;

    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <>;
    sio:count "2"^^xsd:int;

    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <>;
    sio:count "1"^^xsd:int;

    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <>;
    sio:count "6"^^xsd:int;

    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <>;
    sio:count "8"^^xsd:int;

<spo/p/ns/doap> # We'll start a new branch, and use prefixes when we have them, hash of ns o/w.
   owl:onProperty rdfs:isDefinedBy;
   owl:hasValue    <>;
   sio:count 4;
   a void:Dataset;
   void:vocabulary <>;
   void:propertyPartition </spo/p/bin/1>, # These predicate bins are already defined.

   owl:onProperty rdfs:isDefinedBy;
   owl:hasValue    <>;
   sio:count 15;
   a void:Dataset;
   void:vocabulary <>;
   void:propertyPartition </spo/p/bin/4>, # These predicate bins are already defined.
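The per-vocabulary datasets above can be read as the predicate bins grouped by namespace, with the bin counts summed. The sketch below illustrates that aggregation under stated assumptions: the RDF above ties a predicate to its vocabulary via rdfs:isDefinedBy, whereas this sketch substitutes a simpler stand-in and cuts the predicate URI at its last `#` or `/`. The counts 1, 1, 2 and 1, 6, 8 are those of the bins above, but the predicate URIs are hypothetical, and assigning the first three bins to doap is an assumption (one that matches the partition counts 4 and 15).

```python
from collections import defaultdict


def namespace_of(uri):
    """Return the URI up to and including its last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1]


def partition_by_namespace(bins):
    """Group (predicate, count) bins by namespace and sum their counts."""
    partitions = defaultdict(lambda: {"count": 0, "predicates": []})
    for predicate, count in bins:
        part = partitions[namespace_of(predicate)]
        part["count"] += count
        part["predicates"].append(predicate)
    return dict(partitions)


# Hypothetical predicates; only the counts mirror the bins above.
bins = [
    ("http://usefulinc.com/ns/doap#name", 1),
    ("http://usefulinc.com/ns/doap#homepage", 1),
    ("http://usefulinc.com/ns/doap#developer", 2),
    ("http://example.org/vocab#a", 1),
    ("http://example.org/vocab#b", 6),
    ("http://example.org/vocab#c", 8),
]
partitions = partition_by_namespace(bins)
for namespace, part in partitions.items():
    print(namespace, part["count"], len(part["predicates"]))
```

The per-namespace totals are the weights a "word cloud" of prefixes would need.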

Feature space to cluster graphs by similarity

The node

What is next

