Git repository drift analysis based on merge distances

convidev-tud/driftool

driftool :: Git Branch Inconsistency Analysis

This repository is part of a research project of the Software Technology Group at Dresden University of Technology. Contact Karl Kegel (KKegel) for further information.

The driftool computes a drift analysis for git repositories. It automatically simulates merges between all pairs of branches to find potential conflicts. It generates both a scalar drift metric and a 3D view of the repository drift. The pairwise distance between two branches is the number of lines in merge conflicts. The base measure of the point cloud is the mean absolute deviation around a central point.
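As a rough illustration of the base measure described above, the following sketch computes the mean distance of a 3D point cloud from a central point. It assumes that the central point is the centroid and that deviation is measured as Euclidean distance. The function name `mean_deviation` and the sample coordinates are hypothetical and are not taken from the driftool sources.

```python
import math


def mean_deviation(points):
    """Mean Euclidean distance of 3D points from their centroid.

    Assumption: the "central point" is the centroid of the cloud.
    """
    n = len(points)
    if n == 0:
        return 0.0
    centroid = tuple(sum(p[k] for p in points) / n for k in range(3))
    return sum(math.dist(p, centroid) for p in points) / n


# Hypothetical branch positions in the 3D view
branches = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0), (1.0, -3.0, 0.0)]
print(mean_deviation(branches))
```

A tightly clustered cloud yields a small value, while branches that lie far apart from each other push the value up.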

The results of the drift computation indicate how well or poorly a repository is managed. High drift indicates large inconsistencies between branches. The drift is an absolute metric that always has to be interpreted in the context of the repository size. A repository with dozens of collaborators and branches naturally has more drift than a project with three developers. However, the evolution of the drift over time gives useful insights into the project health.

⚡ Running the driftool without reading this README may cause severe problems.

Key Metrics

The driftool calculates the Drift measure g (gamma). Depending on the calculation strategy, different Drift Flavours are possible.

  • Statement Drift := a measure for the merge complexity based on the git merge conflict line count
  • Conflict Drift := a measure for the merge complexity based on the git merge conflict occurrence count
  • File Drift := a measure for the merge complexity based on the git merge conflicting files count

In general, higher numbers (or an increase over time) indicate that the repository is more difficult to manage.
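The three flavours differ only in what is counted after a simulated merge. The sketch below counts all three from the text of merged files that still contain git conflict markers. It assumes the default conflict style (`<<<<<<<`, `=======`, `>>>>>>>`) and counts every line between the outer markers except the separator. Whether the driftool counts lines in exactly this way is an assumption, and `conflict_counts` is a hypothetical helper.

```python
def conflict_counts(merged_files):
    """Count conflicting lines, conflict hunks and conflicting files.

    merged_files: dict mapping a file path to its text after a merge attempt.
    """
    lines = hunks = files = 0
    for text in merged_files.values():
        in_conflict = False
        file_hunks = 0
        for line in text.splitlines():
            if line.startswith("<<<<<<<"):
                in_conflict = True
                file_hunks += 1
            elif in_conflict and line.startswith(">>>>>>>"):
                in_conflict = False
            elif in_conflict and not line.startswith("======="):
                lines += 1  # a line that belongs to one side of the conflict
        hunks += file_hunks
        if file_hunks:
            files += 1
    return {"statement": lines, "conflict": hunks, "file": files}


example = {
    "a.txt": "keep\n<<<<<<< HEAD\nx\n=======\ny\nz\n>>>>>>> other\nkeep\n",
    "b.txt": "no conflict here\n",
}
print(conflict_counts(example))
```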

💡 By default, a git merge is a symmetric operation, meaning A into B produces the same conflicts as B into A. However, certain git operations (resets, force operations, messed-up history) can make the merge asymmetric. For example, A into B has 2 conflicting lines, while B into A has 4 conflicting lines. As this is extremely rare, we ignore this issue for performance reasons (2x speedup).
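The speedup follows from evaluating each unordered pair of branches only once and mirroring the result. A minimal sketch, assuming a hypothetical `merge_distance(a, b)` callback that stands in for the simulated merge:

```python
from itertools import combinations


def pairwise_distances(branches, merge_distance):
    """Build a symmetric distance matrix, merging each unordered pair once."""
    n = len(branches)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = merge_distance(branches[i], branches[j])
        matrix[i][j] = d
        matrix[j][i] = d  # assume B into A gives the same result as A into B
    return matrix


calls = []


def fake_distance(a, b):
    # Hypothetical stand-in for a simulated merge; records each call.
    calls.append((a, b))
    return abs(len(a) - len(b))


matrix = pairwise_distances(["main", "dev", "feature", "fix"], fake_distance)
print(len(calls))  # 6 merges instead of 12 for 4 branches
```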

Report Generation

TODO

Getting Started

Important Terms

There are some important terms to keep in mind while reading the following instructions.

The driftool repository is the repository you are currently in (or its clone on your system). The root of the driftool repository is the place where this README.md is located.

The gradle root is the driftool_kt/ directory within the driftool repository. That's the place where the main implementation is rooted.

The repository under analysis is the repository you want to analyze with the driftool in git mode.

The volume is a folder called volume/ that is used as the input directory in docker mode. This folder is mounted automatically by the driftool docker container. The driftool repository comes with a default volume folder. If the driftool is used without docker, the volume folder does not play any role.

The volume path is the absolute path to the volume folder on your local system. If you use the default setup, it is the absolute path to the volume folder of the driftool repository.

The volume root is the top level of the volume folder. Paths within the volume are given relative to it. If, for example, there is a file .../volume/configs/config.yaml then its path within the volume is configs/config.yaml.
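The relationship between the volume path and a path within the volume can be sketched as follows. The helper `path_in_volume` and the `/home/driftool/volume` location are hypothetical and only serve as an illustration.

```python
from pathlib import PurePosixPath


def path_in_volume(volume_path, absolute_path):
    """Return the path relative to the volume root.

    Raises ValueError if absolute_path is not located inside the volume.
    """
    return str(PurePosixPath(absolute_path).relative_to(PurePosixPath(volume_path)))


print(path_in_volume("/home/driftool/volume",
                     "/home/driftool/volume/configs/config.yaml"))
```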

Docker Setup

The driftool runs as a docker container application. We recommend the docker setup for all users. It is, in theory, possible to run the driftool without docker directly on a linux system. If you wish to do so, please read the code documentation and have a look at the driftool_kt/testrun.sh script, which starts a local run on a unix system based on example input.

Quick Setup

  1. Have docker (or docker desktop) installed and ready.
  2. Clone this repository.
  3. Make run.sh and build.sh executable via chmod.
  4. Execute ./build.sh with sudo privileges in the root directory of the driftool repository.
  5. Move the repository under analysis and the repository config into the ./volume directory of the driftool repository.
  6. Execute ./run.sh [repo path (in volume)] [output path (in volume)] [config path (in volume)] [RAM alloc] [threads] [mode] with sudo privileges.
  7. The analysis results are written to the ./volume/[output path]

Note that the analysis of a repository with many branches can take from a few seconds up to several hours! The runtime grows quadratically with the number of branches. For example, if 10 branches of your repository take 100 seconds, 30 branches would take about 900 seconds.
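The quadratic growth can be turned into a rough estimate from one measured run. This is a back-of-the-envelope sketch only: `estimate_seconds` is a hypothetical helper, and real runtimes also depend on repository size, thread count and hardware.

```python
def estimate_seconds(measured_seconds, measured_branches, target_branches):
    """Scale a measured runtime quadratically with the number of branches."""
    factor = (target_branches / measured_branches) ** 2
    return measured_seconds * factor


# The example from the text: 10 branches take 100 seconds
print(estimate_seconds(100, 10, 30))  # 900.0
```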

For advanced users: you can execute run.sh wherever you want as long as the driftool image is available in your docker image list. If you run the driftool outside the cloned repository, there must be a volume folder in the location where the driftool (through run.sh) is executed.

Build & Run Scripts

  • build.sh is the buildscript of the docker image. Execute ./build.sh to generate the docker image. If you want to configure the image creation, e.g., name, tag or location, you can modify the build command defined in the file. You must execute this command only in the source root directory of this repository!
  • run.sh is the entry script to start the driftool container from the image. Beforehand, build.sh must be executed once. If build.sh was executed without modifications, run.sh works out of the box. Otherwise, the script must be modified accordingly. The run script takes six positional parameters (have a look into the run.sh file or the quick setup instructions above). Output reports are placed in the ./volume directory. The container is destroyed after each run.
  • (dockerfile) is the standard entry point for docker build and describes the composition of the docker image. If you are a driftool user, you do not have to execute or touch this file.
  • (deb_run.sh) is the entrypoint script of the created container. It prepares the ramdisk and starts the driftool's main method. If you are a driftool user, you do not have to execute or touch this file.

run.sh Arguments

./run.sh [repo path (in volume)] [output path (in volume)] [config path (in volume)] [RAM alloc] [threads] [mode]

repo path is the path to the repository under analysis. It must be located in the volume.

output path is the path where the analysis reports are saved. It must be located in the volume.

config path is the path to the configuration .yaml file used for analysis. It must be located in the volume.

RAM alloc is the amount of RAM to allocate.

threads is the number of threads for parallel analysis.

mode is the analysis mode. Use "git" for default git repository analysis. A "matrix" mode for an existing custom distance matrix is planned, but right now only git mode is supported!
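If you drive the driftool from a script, the six positional arguments can be assembled in the documented order. The sketch below only builds the argument list and does not execute anything. `build_run_command` is a hypothetical helper, and the values are taken from the example in this README.

```python
import shlex


def build_run_command(repo, output, config, ram_alloc, threads, mode="git"):
    """Assemble the run.sh call in the documented positional order."""
    if mode != "git":
        # The README states that only git mode is supported right now.
        raise ValueError("only mode 'git' is currently supported")
    return ["./run.sh", repo, output, config, str(ram_alloc), str(threads), mode]


cmd = build_run_command("your-repo", "./", "your-repo.yaml", 65, 12)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` from the directory that contains run.sh and the volume folder.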

Development Setup

To execute the driftool from sources on your local file system, execute the following command in the directory where the build.gradle file is located. This assumes you have a current version of gradle (8+) and Java (JDK 17+) installed on your system. In addition, you need a python3 installation available as python in the path. The packages pip and scikit-learn must be globally available in the execution context of the driftool.

gradle run --args="..."

Please take the argument order and descriptions from the Main.kt file located in the driftool sources or just run the command without arguments and look at the error messages. A good first step is the execution of the unit testsuite via gradle test or gradle build.

A somewhat easier way is to use the already prepared testrun.sh scripts within the driftool_kt folder. There are a variety of options available. Before executing the test runs, make sure to execute the setup.sh located in the same place.

For meaningful repository tests, we provide an example repository with documented merge conflicts in https://github.com/convidev-tud/conflict-example

Environment Settings

Number of Threads

You can configure the number of threads used to parallelize the analysis. The analysis gets faster the more threads are used, particularly for large repositories. If the configured number of threads is larger than your system can physically provide, the analysis slows down. Make sure to configure an appropriate RAM size.

RAM Size

You can specify the amount of RAM to be used by the container as the fourth argument of run.sh. Note that this is not the actual RAM size. The docker container will use at least the amount of RAM it actually needs to run, even if the parameter is set to 0 (which is not recommended because it will lead to a runtime crash).

The configured additional RAM size must be at least the number of threads, times two, times the repository size in GB. If your host system has less RAM, the number of threads must be reduced.
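The sizing rule can be written down as a small calculation. Both helpers below are hypothetical and simply restate the rule above (2 × threads × repository size in GB).

```python
import math


def min_ram_gb(threads, repo_size_gb):
    """Minimum additional RAM in GB for the given thread count."""
    return 2 * threads * repo_size_gb


def max_threads(available_ram_gb, repo_size_gb):
    """Largest thread count that still satisfies the RAM rule."""
    return math.floor(available_ram_gb / (2 * repo_size_gb))


# Hypothetical 2.5 GB repository
print(min_ram_gb(12, 2.5))   # 60.0
print(max_threads(64, 2.5))  # 12
```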

Alternatively, you can disable the additional RAM usage (which slows down the analysis at least by half). To do so, open deb_run.sh, remove the lines marked by comments, and execute the build script again. You still need to provide the fourth argument (RAM alloc) when executing run.sh, but it has no impact. However, if the container runs in the "no ramdisk" mode, the required space will be allocated on the default hard drive.

Configuration

Starting the driftool requires a repository-specific config file. All arguments are mandatory.

Example

Assume this directory structure on your system:

  • /home/
    • driftool/
    • your-repo/
    • your-repo.yaml

  1. Copy the repository and the config into the volume:

  • /home/
    • driftool/volume/
      • your-repo/
      • your-repo.yaml
    • ...

  2. Execute run.sh in the /home/driftool directory (cd into it). Assume 12 threads and 64GB of free RAM:

./run.sh "your-repo" "./" "your-repo.yaml" 65 12 "git"

  3. After successful execution, you will find the reports in the specified output directory:

  • /home/
    • driftool/volume/
      • your-repo/
      • your-repo.yaml
      • report_your-repo.html
      • report_your-repo.json
    • ...

"git" Mode Configuration .yaml

  • jsonReport: Boolean If true, a JSON report will be generated and saved in the report directory.
  • htmlReport: Boolean If true, an HTML report will be generated and saved in the report directory.
  • ignoreBranches: List<String> List of branches that should be ignored. This is useful for branches that are not relevant for the analysis. The list can contain regex patterns that are searched for in the branch name. Important: We use regex search (not regex match) to find the pattern anywhere in the branch name, e.g. the pattern "feature" will match "feature/branch" and "branch/feature". If the list is empty, no branches will be ignored.
  • fileWhiteList: List<String> List of files that should be analyzed exclusively. This is useful if only particular file types should be included in the analysis. The file whitelist is applied before the file blacklist. The list can contain regex patterns that are searched for in the file path. Important: We use regex search (not regex match) to find the pattern anywhere in the file path, e.g. the pattern "test" will match "src/test/file" and "file/test". If the list is empty, no files will be ignored.
  • fileBlackList: List<String> List of files that should be ignored. This is useful if particular file types should be excluded from the analysis. The file blacklist is applied after the file whitelist. The list can contain regex patterns that are searched for in the file path. Important: We use regex search (not regex match) to find the pattern anywhere in the file path, e.g. the pattern "test" will match "src/test/file" and "file/test". If the list is empty, no files will be ignored.
  • timeoutDays: Int The number of days within which a branch must have been active to be included in the analysis. For example, if timeoutDays is set to 30, only branches that were active in the last 30 days will be included. If timeoutDays is set to 0, all branches will be included. This is useful to exclude dead branches, as they might invalidate the analysis.
  • reportIdentifier: String The identifier (or title) for the report. If unset, a unique default identifier will be generated.
jsonReport: BOOL
htmlReport: BOOL
reportIdentifier: STRING
timeoutDays: INT
fileWhiteList:
    - STRING
fileBlackList: 
    - STRING
ignoreBranches:
    - STRING
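The search semantics and the whitelist-before-blacklist order described above can be sketched as follows. This is an illustration in Python, not the driftool implementation: the helper names are hypothetical, and Python's `re` syntax may differ in details from the regex engine the driftool uses.

```python
import re


def matches_any(patterns, text):
    """True if any pattern is found anywhere in text (search, not full match)."""
    return any(re.search(p, text) for p in patterns)


def select_branches(branches, ignore_branches):
    """Drop every branch whose name contains one of the ignore patterns."""
    return [b for b in branches if not matches_any(ignore_branches, b)]


def select_files(paths, white_list, black_list):
    """Apply the whitelist first (empty means keep all), then the blacklist."""
    kept = [p for p in paths if not white_list or matches_any(white_list, p)]
    return [p for p in kept if not matches_any(black_list, p)]


print(select_branches(["main", "feature/a", "b/feature", "release-1"],
                      ["feature", r"^release\-"]))
print(select_files(["src/A.java", "src/test/B.java", "README.md"],
                   [r"\.java"], ["test"]))
```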

Config Examples

jsonReport: true
htmlReport: true
reportIdentifier: Example Report
timeoutDays: 30
fileWhiteList: []
fileBlackList: 
    - "build\\/"
    - "dist\\/"
    - "gen\\/"
    - "\\.min\\.js"
    - "\\.lib\\.js"
    - "node\\-modules\\/"
    - "\\.pdf"
    - "javadoc\\/"
    - "\\.png"
ignoreBranches:
    - "^release\\-"
    - "^v\\."
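With timeoutDays: 30 as in the example above, a branch is kept only if it was active within the last 30 days. A minimal sketch of that check, assuming that "active" refers to the timestamp of the last commit on the branch (an assumption; `is_active` is a hypothetical helper):

```python
from datetime import datetime, timedelta, timezone


def is_active(last_commit, now, timeout_days):
    """True if the branch counts as active; timeout_days == 0 keeps all branches."""
    if timeout_days == 0:
        return True
    return now - last_commit <= timedelta(days=timeout_days)


# Hypothetical timestamps
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
recent = datetime(2024, 6, 10, tzinfo=timezone.utc)
stale = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_active(recent, now, 30), is_active(stale, now, 30), is_active(stale, now, 0))
```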

Whitelist Example

The following partial example analyses only Java files and HTML templates and ignores all other files.

```YAML
fileWhiteList:
    - "\\.java"
    - "\\.template\\.html"
```
