
##########################
## Secondary Vertex GNN ##
##########################

-----------------------------------------------------------------------------

Prerequisites:

Make sure you have the following packages installed before attempting to run
the GNN scripts:

-> Python (min version 3.6)
-> ROOT (min version 6.18.0)
-> PyTorch (min version 1.4.0)
-> DGL (min version 0.5.3) - https://docs.dgl.ai/en/latest/
-> h5py (min version 2.10.0)

Running the GNN might work with older versions of some of these, but this
has not been tested (except for the fact that DGL itself requires at least
Python 3.6). ROOT and h5py are used for data processing and plotting, while
DGL (with PyTorch as its backend) is the core package with which the network
is built. All of these are fairly standard except for DGL, whose installation
instructions can be found at the website listed above.
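
As a quick sanity check, the following snippet (a minimal sketch, not part of
this repository) prints the installed versions of the required packages so
that they can be compared against the minimums listed above:

    import ROOT
    import torch
    import dgl
    import h5py

    # Print the installed versions of the packages required by the GNN scripts.
    print("ROOT:   ", ROOT.gROOT.GetVersion())
    print("PyTorch:", torch.__version__)
    print("DGL:    ", dgl.__version__)
    print("h5py:   ", h5py.__version__)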

-----------------------------------------------------------------------------

Running the GNN:

In order to generate input files, a modified version of the
"FlavourTagPerformanceFramework" is used (which can be found at
https://gitlab.cern.ch/jmwagner/FlavourTagPerformanceFramework). This
produces ROOT ntuples from DAODs that are directly compatible with the GNN.
Further instructions on how to generate these ntuples can be found in the
README of the aforementioned repository.

Once all desired ntuples are generated, running the GNN is relatively
straightforward. The whole pipeline consists of two steps: data processing
and training/evaluation. These can be run successively with the
"run_processing.sh" and "run_GNN.sh" scripts. All variable arguments are set
either within these two bash scripts or within the "scripts/options.py" file.
The bash scripts contain the paths to input/output data as well as some
higher level options that affect the pipeline in general (such as whether or
not input data should be normalized). The Python script, on the other hand,
contains a large number of smaller options that modify the behavior of the
GNN; this is where hyperparameters and cuts are set, as well as some plotting
options. Apart from these three places, there should be no need to edit any
other scripts unless you want to add features or change GNN behavior more
significantly. All scripts are intended to be as modular and flexible as
possible in order to ensure ease of use.
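
Concretely, the two stages are simply run one after the other. The tiny
sketch below uses Python's subprocess module purely for illustration (it is
equivalent to calling the two scripts from a shell); all paths and options
are assumed to have already been set inside the scripts themselves:

    import subprocess

    # Run the two pipeline stages in order, stopping if either one fails.
    subprocess.run(["bash", "run_processing.sh"], check=True)  # data processing
    subprocess.run(["bash", "run_GNN.sh"], check=True)         # training/evaluation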

-----------------------------------------------------------------------------

Description of individual scripts:

PROCESSING
o) process_ntuple.py: Processes a single ROOT ntuple, extracts the
   information relevant to the GNN and saves it in the form of HDF5 files.
   Track information is saved directly, while truth particle information is
   used within the script to derive further quantities (such as track
   labels). Also performs cuts on the data (cut jets are removed entirely,
   cut tracks are marked but kept for plotting purposes).
o) create_graphs.py: Takes a single HDF5 file created with process_ntuple.py
   and turns it into a DGL-compatible graph file that can be used directly
   for training (a minimal sketch of this step is shown after this list).
   Also saves a text file containing some values derived from the data that
   are used further down the line.
o) combine_graphs.py: Imports and shuffles graphs from multiple DGL files and
   splits them into training, testing and validation sets, which are all
   saved in separate files. Note: because of this, this script is still used
   even if only a single ntuple is used to generate a dataset.
o) norm_graphs.py: This script is OPTIONAL. It normalizes all GNN input
   features in the training, testing and validation files by subtracting the
   mean of each feature and dividing by its standard deviation, both of which
   are calculated from the training data only (see the normalization sketch
   after this list). Some features are normalized slightly differently
   (details are in the script).
o) prune_graphs.py: Final script to be run before GNN training. Fully removes
   all tracks marked as cut from graphs in training, testing and validation
   files and writes a text file containing some parameters of the training
   data relevant for GNN training (such as the proportion of true to false
   edges).
o) truth_functions.py: Contains functions that help with processing truth
   information (such as generating dictionaries of truth particles/tracks).
   Used extensively in process_ntuple.py.
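
The following two sketches illustrate the graph-building and normalization
steps described above. They are simplified stand-ins, not the code in
create_graphs.py or norm_graphs.py: the feature key ('features'), the fully
connected topology and the helper functions are assumptions made purely for
illustration.

    import itertools

    import numpy as np
    import torch
    import dgl
    from dgl.data.utils import save_graphs

    def build_jet_graph(track_features):
        # One node per track; every ordered pair of distinct tracks becomes a
        # directed edge (a common choice, the real topology may differ).
        ntracks = track_features.shape[0]
        src, dst = zip(*itertools.permutations(range(ntracks), 2))
        graph = dgl.graph((torch.tensor(src), torch.tensor(dst)),
                          num_nodes=ntracks)
        graph.ndata['features'] = torch.as_tensor(track_features,
                                                  dtype=torch.float32)
        return graph

    # Example: one jet with 3 tracks and 4 (made up) features per track.
    jet_graph = build_jet_graph(np.random.rand(3, 4))
    save_graphs("example_graphs.bin", [jet_graph])

Normalization then amounts to standardizing every node feature with the mean
and standard deviation computed from the training graphs only:

    import torch

    def normalize_features(train_graphs, other_graphs, key='features'):
        # Compute per-feature statistics from the training set only ...
        train_feats = torch.cat([g.ndata[key] for g in train_graphs])
        mean = train_feats.mean(dim=0)
        std = train_feats.std(dim=0)
        # ... and apply them to the training, testing and validation graphs.
        for g in train_graphs + other_graphs:
            g.ndata[key] = (g.ndata[key] - mean) / std
        return mean, std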

GNN TRAINING
o) GNN_main.py: Contains the core training and testing loops of the GNN
   (a minimal, illustrative training step is sketched after this list). Uses
   DGL graph files as input and generates modified DGL graph files containing
   the training results, which are used in the other evaluation scripts. Also
   calculates simple GNN performance metrics and generates some rudimentary
   performance plots.
o) GNN_model.py: Defines the actual GNN model used in GNN_main.py.
o) compare_performance.py: Turns GNN output labels into actual secondary
   vertices by associating tracks based on their edge scores via a
   deterministic algorithm (one possible approach is sketched after this
   list). Based on these secondary vertices, this script then performs
   comparisons with SV1 (a different SV reconstruction algorithm) and
   generates a large number of plots. Note: this script is only written to
   work with the GNN in binary classification mode.
o) GNN_eval.py: Contains functions used to evaluate GNN performance and
   reconstruct secondary vertices (used in compare_performance.py).
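
As a rough illustration of the training setup, the sketch below builds a
deliberately small edge classifier in DGL/PyTorch and runs a single training
step on a batched graph. It is not the model defined in GNN_model.py, and the
feature/label names ('features', 'labels') are assumptions for illustration:

    import dgl
    import dgl.function as fn
    import torch
    import torch.nn as nn

    class EdgeClassifier(nn.Module):
        # One round of mean-aggregation message passing followed by a scorer
        # that classifies every edge from the features of its two endpoints.
        def __init__(self, in_feats, hidden_feats):
            super().__init__()
            self.node_update = nn.Linear(2 * in_feats, hidden_feats)
            self.edge_scorer = nn.Linear(2 * hidden_feats, 1)

        def forward(self, graph, feats):
            with graph.local_scope():
                graph.ndata['h'] = feats
                # Aggregate neighboring track features with a mean.
                graph.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'agg'))
                h = torch.relu(self.node_update(
                    torch.cat([graph.ndata['h'], graph.ndata['agg']], dim=1)))
                graph.ndata['h'] = h
                # Score every edge from the updated endpoint representations.
                graph.apply_edges(lambda edges: {'score': self.edge_scorer(
                    torch.cat([edges.src['h'], edges.dst['h']], dim=1))})
                return graph.edata['score'].squeeze(-1)

    model = EdgeClassifier(in_feats=4, hidden_feats=16)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    def train_step(batched_graph):
        # One optimization step on a batch of jet graphs with binary edge labels.
        scores = model(batched_graph, batched_graph.ndata['features'])
        loss = loss_fn(scores, batched_graph.edata['labels'].float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Turning edge scores into secondary vertices can then be done in many ways;
one simple possibility (not necessarily the deterministic algorithm used in
compare_performance.py) is to keep the edges whose score passes a threshold
and group the connected tracks with a union-find:

    def reconstruct_vertices(num_tracks, edge_list, edge_scores, threshold=0.5):
        # Union-find over tracks, merging the endpoints of high-scoring edges.
        parent = list(range(num_tracks))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for (a, b), score in zip(edge_list, edge_scores):
            if score > threshold:
                parent[find(a)] = find(b)

        groups = {}
        for track in range(num_tracks):
            groups.setdefault(find(track), []).append(track)
        # Only groups with at least two tracks form a candidate vertex.
        return [tracks for tracks in groups.values() if len(tracks) > 1]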

PLOTTING
o) plot_data.py: Run automatically after data processing. Generates plots
   showing the distributions of the input features in the generated dataset,
   categorized by the truth label of each track (a minimal PyROOT example is
   shown after this list). Also makes plots showing the effects of cuts on
   the data.
o) plot_results.py: Run automatically after GNN training. Generates plots
   comparing feature distributions of jets the GNN performed poorly on with
   the rest of the dataset. This is again categorized by the truth labels of
   each track.
o) plot_functions.py: Contains functions to generate consistent plots of
   several types. Used in all other plotting scripts.
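
Since ROOT is used for plotting, a typical plot of an input feature split by
track category looks roughly like the (purely illustrative) PyROOT example
below; the variable, binning and category names are placeholders and not the
ones used in plot_functions.py:

    import ROOT

    # Overlay the distribution of one track feature for two track categories.
    hist_sv = ROOT.TH1F("hist_sv", "track d0;d0 [mm];tracks", 50, -5.0, 5.0)
    hist_other = ROOT.TH1F("hist_other", "track d0;d0 [mm];tracks", 50, -5.0, 5.0)

    for d0, from_sv in [(0.1, True), (2.3, False), (0.4, True)]:  # dummy data
        (hist_sv if from_sv else hist_other).Fill(d0)

    canvas = ROOT.TCanvas("canvas", "", 800, 600)
    hist_sv.SetLineColor(ROOT.kRed)
    hist_other.SetLineColor(ROOT.kBlue)
    hist_sv.Draw("HIST")
    hist_other.Draw("HIST SAME")
    canvas.SaveAs("example_d0.png")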

OTHER SCRIPTS
o) options.py: Defines global variables used across all scripts.

-----------------------------------------------------------------------------

written by Johannes Wagner ([email protected])
