Skip to content

Metadata Database Schema

Mihhail Sokolov edited this page Aug 13, 2020 · 41 revisions

This page will explain the schema that is used to store the metadata of the callgraphs.

Database

We decided to use PostgreSQL version 12 or higher because it has some very useful features that we utilize (like extensive JSON data type support).

Schema

Depicted below is the E/R diagram of the metadata database schema we have devised: FASTEN Metadata Database Schema

Explanation

  • The packages table contains basic information about the packages/products like the name of the package, forge, repository URL and etc.
  • Each package can have multiple versions, thus the table package_versions references packages and stores information like version of the package.
  • Each package version can have multiple dependencies to other packages. Therefore table dependencies references package_versions.id as source package and packages.id as a dependent package and has a version_range which specifies acceptable versions of dependency.
  • Each package_version can have multiple modules in them. For some languages, these modules can be viewed as classes. Thus the table modules references package_versions and stores such attributes as the namespace of the module, SHA256 hash of the module, timestamp, and metadata about the module in form of JSON. Currently, in a module's metadata, the following attributes are stored: superClasses and superInterfaces.
  • Each module can consist of multiple methods - callables. Therefore, table callables stores reference to modules, FASTEN URI of the method, boolean value if this callable is resolved/internal and timestamp with metadata. However, for external callable, a reference to modules is -1 which is a default module for all external callables. Thus, when saving the graph, all external callables are inserted in the callables table without reference to modules and is_resolved_call set to false.
  • The table edges stores the graph's edges. Each edge has a source and target which both reference callables table allowing to store both internal and external calls. Each edge also stores some metadata about the call (like if the call invokes a virtual method and etc).
  • Each module can also contain multiple files and one file can contain multiple modules. Therefore this many-to-many relationship is realized by means of module_contents table.
  • Each package version also has binary modules for some of the languages. binary_modules are very similar to module, but they are supposed to store data for binary modules and link to binary files.
  • Similarly to modules, binary_modules has many-to-many relationship realized by the binary_module_contents table.

Additional metadata

From POM Analyzer

POM Analyzer extracts a list of dependencies of an artifact, dependencyManagement of that artifact, repository URL, and commit tag, packaging type, project name if they are present, and generates sources URL. Each dependency is put into the dependencies table with detailed metadata. Here is an example of dependency's metadata:

{
   "type":"",
   "scope":"compile",
   "groupId":"com.yahoo.athenz",
   "optional":false,
   "artifactId":"athenz-zms-java-client",
   "classifier":"",
   "exclusions":[
      {
         "groupId":"org.slf4j",
         "artifactId":"slf4j-api"
      },
      {
         "groupId":"org.glassfish.jersey.core",
         "artifactId":"jersey-client"
      },
      {
         "groupId":"org.glassfish.jersey.media",
         "artifactId":"jersey-media-json-jackson"
      },
      {
         "groupId":"com.fasterxml.jackson.core",
         "artifactId":"jackson-core"
      },
      {
         "groupId":"com.fasterxml.jackson.core",
         "artifactId":"jackson-databind"
      },
      {
         "groupId":"com.fasterxml.jackson.core",
         "artifactId":"jackson-annotations"
      }
   ],
   "versionConstraints":[
      {
         "lowerBound":"1.7.28",
         "upperBound":"1.7.28",
         "isLowerHardRequirement":false,
         "isUpperHardRequirement":false
      }
   ]
}

The dependencyManagement information is put into the package_versions.metadata along with commit tag (a tag in the repository which should signify certain version), link to Maven sources JAR file, and packaging type of the artifact. Here is an example of a package version's metadata:

{
   "commitTag":"2.5.0",
   "sourcesUrl":"https://repo.maven.apache.org/maven2/junit/junit/4.12/junit-4.12-sources.jar",
   "packagingType":"jar",
   "dependencyManagement":{
      "dependencies":[
         {
            "type":"",
            "scope":"",
            "groupId":"com.yahoo.elide",
            "optional":false,
            "artifactId":"elide-core",
            "classifier":"",
            "exclusions":[

            ],
            "versionConstraints":[
               {
                  "lowerBound":"2.5.0",
                  "upperBound":"2.5.0",
                  "isLowerHardRequirement":false,
                  "isUpperHardRequirement":false
               }
            ]
         }
      ]
   }
}

From Metadata Plugin

Metadata plugin inserts metadata produced by call graph generator and inserts this metadata into the callables, edges and modules tables. Here is an example of module's metadata:

{
    "access": "public",
    "final": false,
    "superInterfaces": [],
    "superClasses": ["/java.lang/Object"]
}

Here is an example of callable's metadata:

{
    "access": "public",
    "last": 3,
    "first": 3,
    "defined": true
}

Here is an example of edge's metadata:

"0": {
    "receiver": "/name.space/SingleSourceToTarget",
    "line": 6,
    "type": "invokestatic"
}

P.S. Key ("0") is a program counter (pc)

More detailed information about this metadata can be found here

SQL to create database tables

Here is the SQL code that is used to create all tables described above in the diagram with all required keys, indexes, and constraints.

CREATE TABLE packages
(
    id           BIGSERIAL PRIMARY KEY,
    package_name TEXT NOT NULL,
    forge        TEXT NOT NULL,
    project_name TEXT,
    repository   TEXT,
    created_at   TIMESTAMP
);

CREATE TABLE package_versions
(
    id           BIGSERIAL PRIMARY KEY,
    package_id   BIGINT NOT NULL REFERENCES packages (id),
    version      TEXT   NOT NULL,
    cg_generator TEXT   NOT NULL,
    created_at   TIMESTAMP,
    metadata     JSONB
);

CREATE TABLE dependencies
(
    package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
    dependency_id      BIGINT NOT NULL REFERENCES packages (id),
    version_range      TEXT[] NOT NULL,
    metadata           JSONB
);

CREATE TABLE modules
(
    id                 BIGSERIAL PRIMARY KEY,
    package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
    namespace          TEXT   NOT NULL,
    created_at         TIMESTAMP,
    metadata           JSONB
);

CREATE TABLE files
(
    id                 BIGSERIAL PRIMARY KEY,
    package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
    path               TEXT   NOT NULL,
    checksum           BYTEA,
    created_at         TIMESTAMP,
    metadata           JSONB
);

CREATE TABLE module_contents
(
    module_id BIGINT NOT NULL REFERENCES modules (id),
    file_id   BIGINT NOT NULL REFERENCES files (id)
);

CREATE TABLE binary_modules
(
    id                 BIGSERIAL PRIMARY KEY,
    package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
    name               TEXT   NOT NULL,
    created_at         TIMESTAMP,
    metadata           JSONB
);

CREATE TABLE binary_module_contents
(
    binary_module_id BIGINT NOT NULL REFERENCES binary_modules (id),
    file_id          BIGINT NOT NULL REFERENCES files (id)
);

CREATE TABLE callables
(
    id               BIGSERIAL PRIMARY KEY,
    module_id        BIGINT  NOT NULL REFERENCES modules (id),
    fasten_uri       TEXT    NOT NULL,
    is_internal_call BOOLEAN NOT NULL,
    created_at       TIMESTAMP,
    metadata         JSONB
);

CREATE TABLE edges
(
    source_id BIGINT NOT NULL REFERENCES callables (id),
    target_id BIGINT NOT NULL REFERENCES callables (id),
    metadata  JSONB  NOT NULL
);

-- CREATE INDEX CONCURRENTLY package_versions_package_id ON package_versions USING btree (package_id);
-- CREATE INDEX CONCURRENTLY dependencies_package_version_id ON dependencies USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY dependencies_dependency_id ON dependencies USING btree (dependency_id);
-- CREATE INDEX CONCURRENTLY files_package_version_id ON files USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY modules_package_version_id ON modules USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY module_contents_module_id ON module_contents USING btree (module_id);
-- CREATE INDEX CONCURRENTLY module_contents_file_id ON module_contents USING btree (file_id);
-- CREATE INDEX CONCURRENTLY binary_modules_package_version_id ON binary_modules USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY binary_module_contents_binary_module_id ON binary_module_contents USING btree (binary_module_id);
-- CREATE INDEX CONCURRENTLY binary_module_contents_file_id ON binary_module_contents USING btree (file_id);
-- CREATE INDEX CONCURRENTLY callables_module_id ON callables USING btree (module_id);
-- CREATE INDEX CONCURRENTLY edges_source_id ON edges USING btree (source_id);
-- CREATE INDEX CONCURRENTLY edges_target_id ON edges USING btree (target_id);

CREATE UNIQUE INDEX CONCURRENTLY unique_package_forge ON packages USING btree (package_name, forge);
ALTER TABLE packages
    ADD CONSTRAINT unique_package_forge UNIQUE USING INDEX unique_package_forge;

CREATE UNIQUE INDEX CONCURRENTLY unique_package_version_generator ON package_versions USING btree (package_id, version, cg_generator);
ALTER TABLE package_versions
    ADD CONSTRAINT unique_package_version_generator UNIQUE USING INDEX unique_package_version_generator;

CREATE UNIQUE INDEX CONCURRENTLY unique_version_dependency_range ON dependencies USING btree (package_version_id, dependency_id, version_range);
ALTER TABLE dependencies
    ADD CONSTRAINT unique_version_dependency_range UNIQUE USING INDEX unique_version_dependency_range;

CREATE UNIQUE INDEX CONCURRENTLY unique_version_namespace ON modules USING btree (package_version_id, namespace);
ALTER TABLE modules
    ADD CONSTRAINT unique_version_namespace UNIQUE USING INDEX unique_version_namespace;

CREATE UNIQUE INDEX CONCURRENTLY unique_module_file ON module_contents USING btree (module_id, file_id);
ALTER TABLE module_contents
    ADD CONSTRAINT unique_module_file UNIQUE USING INDEX unique_module_file;

CREATE UNIQUE INDEX CONCURRENTLY unique_version_name ON binary_modules USING btree (package_version_id, name);
ALTER TABLE binary_modules
    ADD CONSTRAINT unique_version_name UNIQUE USING INDEX unique_version_name;

CREATE UNIQUE INDEX CONCURRENTLY unique_binary_module_file ON binary_module_contents USING btree (binary_module_id, file_id);
ALTER TABLE binary_module_contents
    ADD CONSTRAINT unique_binary_module_file UNIQUE USING INDEX unique_binary_module_file;

CREATE UNIQUE INDEX CONCURRENTLY unique_version_path ON files USING btree (package_version_id, path);
ALTER TABLE files
    ADD CONSTRAINT unique_version_path UNIQUE USING INDEX unique_version_path;

CREATE UNIQUE INDEX CONCURRENTLY unique_uri_call ON callables USING btree (module_id, fasten_uri, is_internal_call);
ALTER TABLE callables
    ADD CONSTRAINT unique_uri_call UNIQUE USING INDEX unique_uri_call;

CREATE UNIQUE INDEX CONCURRENTLY unique_source_target ON edges USING btree (source_id, target_id);
ALTER TABLE edges
    ADD CONSTRAINT unique_source_target UNIQUE USING INDEX unique_source_target;

ALTER TABLE callables
    ADD CONSTRAINT check_module_id CHECK ((module_id = -1 AND is_internal_call IS false) OR
                                          (module_id IS NOT NULL AND is_internal_call IS true));

INSERT INTO packages (id, package_name, forge)
VALUES (-1, 'external_callables_library', 'mvn')
ON CONFLICT DO NOTHING;

INSERT INTO package_versions (id, package_id, version, cg_generator)
VALUES (-1, -1, '0.0.1', 'OPAL')
ON CONFLICT DO NOTHING;

INSERT INTO modules (id, package_version_id, namespace)
VALUES (-1, -1, 'global_external_callables')
ON CONFLICT DO NOTHING;

SQL to initialise users and permissions

The following need to be executed once on the master using the Postgres superuser (usually, postgres) to initialise users and databases.

-- Create default read/write user and databases

create user fasten with encrypted password PASSWD; -- change me
ALTER USER fasten with CREATEDB;

create database fasten_java ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;
create database fasten_rust ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;
create database fasten_c ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;
create database fasten_python ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;

grant all privileges on database fasten_java to fasten;
grant all privileges on database fasten_rust to fasten;
grant all privileges on database fasten_c to fasten;
grant all privileges on database fasten_python to fasten;
-- Create a readonly user for querying

CREATE USER fastenro WITH ENCRYPTED PASSWORD 'fasten';

GRANT CONNECT ON DATABASE fasten_java TO fastenro;
GRANT CONNECT ON DATABASE fasten_rust TO fastenro;
GRANT CONNECT ON DATABASE fasten_c TO fastenro;
GRANT CONNECT ON DATABASE fasten_python TO fastenro;

ALTER DATABASE fasten_java OWNER to fasten;
ALTER DATABASE fasten_rust OWNER to fasten;
ALTER DATABASE fasten_c OWNER to fasten;
ALTER DATABASE fasten_python OWNER to fasten;

\c fasten_java
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;
\c fasten_rust
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;

\c fasten_c
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;

\c fasten_python
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;