TRIC uses a graph-based alignment strategy based on non-linear retention time correction to integrate information from all available runs. The input consists of a set of csv files derived from a targeted proteomics experiment generated by OpenSWATH (using either mProphet or pyProphet) or generated by Peakview.
There are two basic running modes available. The first one uses a reference-based alignment where a single run is chosen as a reference and all other runs are aligned to it. This is a useful choice for a small number of runs that are chromatographically similar. The second mode generates a guidance tree based on chromatographic similarity of the input runs and uses this tree to align the targeted proteomics runs (the nodes in the tree are runs and the edges are pairwise alignments). Generally this mode is better for a large number of runs or for chromatographically dissimilar samples.
The first step in the algorithm is to compute the alignment order. If the tree-based alignment is used, a set of high-confidence anchor points is first used to estimate the pairwise chromatographic distance between all runs. This distance matrix is then used to compute a guidance tree (minimum spanning tree, MST) where the nodes represent LC-MS/MS runs and the edges represent pairwise alignments. If a reference-based approach is used, a reference run is selected first (the run with the most features) and a star-shaped tree is created with the reference run in the middle, connected to all other runs.
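As a rough illustration of the two tree shapes described above, the following sketch builds a minimum spanning tree from a distance matrix with Prim's algorithm and a star-shaped tree around the run with the most features. The function names and the example numbers are hypothetical and are not taken from the TRIC code base; how TRIC actually computes the distances is not shown here.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns the tree as a list of (run_i, run_j) edges.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge that leaves the current tree.
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


def star_tree(feature_counts):
    """Reference-based variant: connect the run with the most features to all others."""
    ref = max(range(len(feature_counts)), key=lambda r: feature_counts[r])
    return [(ref, r) for r in range(len(feature_counts)) if r != ref]


# Hypothetical pairwise chromatographic distances between four runs.
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 2.0, 6.0],
    [4.0, 2.0, 0.0, 3.0],
    [5.0, 6.0, 3.0, 0.0],
]
mst_edges = minimum_spanning_tree(dist)

# Hypothetical number of identified peakgroups per run.
star_edges = star_tree([120, 340, 290, 310])
```

In the MST each run is only connected to chromatographically close runs, whereas in the star every run is aligned directly against the reference.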
Then, for each edge in the tree, a pairwise non-linear transformation between the retention time (RT) domains of the two runs at the nodes is computed, using one of several available methods (e.g. local regression, spline fit, k-nearest neighbor).
Using the guidance tree from above (star-shaped or MST-based), the traversal of the global guidance tree for each measured targeted proteomics assay starts with a suitable starting point, or seed identification (an identification below the --target_fdr or --fdr_cutoff cutoff). During traversal, each edge of the tree is visited sequentially and a confident identification is mapped from one node (run) to the next. If the peakgroup found in the next run scores below --max_fdr_quality, it gets added to the result.
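The traversal can be pictured with a small sketch for a single assay. It assumes a breadth-first walk from the seed run, per-edge transformation functions, and peakgroups given as (rt, fdr) tuples; all names are hypothetical. Choosing the lowest-FDR candidate inside the window and carrying the expected RT forward when no candidate is found are assumptions of this sketch, not a statement about TRIC's internals.

```python
from collections import deque


def transfer_confidence(tree, transforms, peakgroups, seed_run,
                        max_rt_diff, max_fdr_quality):
    """Walk the tree from the seed run and select at most one peakgroup per run.

    tree:       dict run -> list of neighbouring runs
    transforms: dict (run_a, run_b) -> function mapping RT in run_a to RT in run_b
    peakgroups: dict run -> list of (rt, fdr) candidates for one assay
    """
    seed = min(peakgroups[seed_run], key=lambda pg: pg[1])
    selected = {seed_run: seed}
    rt_at = {seed_run: seed[0]}  # RT used to continue threading from each run
    visited = {seed_run}
    queue = deque([seed_run])
    while queue:
        source = queue.popleft()
        for target in tree[source]:
            if target in visited:
                continue
            visited.add(target)
            queue.append(target)
            # Map the RT across this edge into the target run's RT domain.
            expected_rt = transforms[(source, target)](rt_at[source])
            candidates = [
                pg for pg in peakgroups.get(target, [])
                if abs(pg[0] - expected_rt) <= max_rt_diff and pg[1] < max_fdr_quality
            ]
            if candidates:
                best = min(candidates, key=lambda pg: pg[1])
                selected[target] = best
                rt_at[target] = best[0]
            else:
                # Nothing acceptable here: keep threading with the expected RT.
                rt_at[target] = expected_rt
    return selected


# Hypothetical three-run chain A - B - C.
tree = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
transforms = {
    ("A", "B"): lambda rt: rt + 10.0,
    ("B", "C"): lambda rt: rt - 5.0,
}
peakgroups = {
    "A": [(100.0, 0.001)],
    "B": [(112.0, 0.03), (300.0, 0.002)],
    "C": [(200.0, 0.01)],
}
result = transfer_confidence(tree, transforms, peakgroups, "A",
                             max_rt_diff=30.0, max_fdr_quality=0.05)
```

In this example the peakgroup at 300.0 in run B is ignored despite its good score because it lies outside the RT window, and run C contributes nothing because its only candidate is far from the expected position.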
The size of the RT window during confidence transfer is given by --max_rt_diff. However, as different parts of the tree may have different alignment quality, it is possible to use adaptive retention time windows, derived from the quality of the alignment. This approach allows different parameters for confidence transfer on different parts of the tree, increasing robustness and decreasing the influence of outlier runs (see the --mst:Stdev_multiplier parameter).
TRIC contains a separate, optional requantification step in which runs in the guidance tree where no peakgroup passed the confidence filter can be revisited. In these cases, the software can infer the peak boundaries from the closest neighboring run and quantify the fragment ion signal within those boundaries, see TRIC requantification.
Please see the main README file for installation instructions.
To get an overview of all available options, please use
./analysis/alignment/feature_alignment.py --help
A sample run of the tool may look as follows:
./analysis/alignment/feature_alignment.py \
  --in file1_input.csv file2_input.csv file3_input.csv \
  --out aligned.csv \
  --method best_overall --realign_method diRT --max_rt_diff 90 \
  --target_fdr 0.01 --max_fdr_quality 0.05
This command will run alignment on 3 files using the (initial) linear iRT alignment and pick an appropriate peakgroup in each run within the aligned window using a reference-based alignment. In order to be reported, each peptide is required to have an identification below the 1 % q-value cutoff in at least one run, and each quantitative cell in the resulting data matrix is required to have a q-value below 5 %. The maximal RT deviation between the aligned runs is 90 seconds in the above example (you may choose a smaller value if you select one of the nonlinear alignment methods).
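The two q-value thresholds in this example act at different levels: one decides whether a peptide is reported at all, the other decides which cells of the data matrix are kept. A minimal sketch of that rule, ignoring the RT window and using hypothetical names and data, could look like this:

```python
def filter_matrix(qvalues, target_fdr=0.01, max_fdr_quality=0.05):
    """Apply the two cutoffs described above.

    qvalues: dict peptide -> dict run -> q-value of the selected peakgroup
             (runs without a peakgroup are simply absent).
    """
    result = {}
    for peptide, per_run in qvalues.items():
        # The peptide needs at least one identification below target_fdr.
        if not any(q < target_fdr for q in per_run.values()):
            continue
        # Each reported cell needs a q-value below max_fdr_quality.
        result[peptide] = {
            run: q for run, q in per_run.items() if q < max_fdr_quality
        }
    return result


# Hypothetical q-values for two peptides across three runs.
qvalues = {
    "PEPTIDEA": {"run1": 0.004, "run2": 0.03, "run3": 0.2},
    "PEPTIDEB": {"run1": 0.02, "run2": 0.04},
}
matrix = filter_matrix(qvalues)
```

Here PEPTIDEB is dropped entirely because no run reaches the 1 % cutoff, while PEPTIDEA keeps its run2 value at 3 % because run1 provides the confident seed.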
The individual parameters can be adjusted as follows:
- --method: selects either a reference-based alignment or a tree-based alignment (see below).
- --realign_method: selects the (non)-linear alignment strategy employed (see below).
- --max_rt_diff: the maximal shift in RT after alignment that is tolerated. If a peakgroup is shifted by more than this amount, it is excluded from the result (except if its FDR is below the set FDR threshold and a non-global strategy was selected in the reference-based approach). Note that this is a difference, thus the RT window for alignment is twice the size of this parameter (i.e. the window considered is expectedRT +/- max_rt_diff).
- --target_fdr: the desired FDR on assay level.
- --max_fdr_quality: the maximal FDR a value in the data matrix may have to still be considered for quantitation.
- --file_format: which input file format is used (openswath (default), mprophet or peakview). openswath is used for a file generated by the OpenSwath workflow (OpenSwath + mProphet / pyProphet) while mprophet is used for traditional SRM files generated by the mQuest + mProphet workflow. peakview is for PeakView files.
Several options are available for (non)-linear pairwise alignment. Generally, the alignment is performed by using a set of highly confident "anchor points" that are present in both runs and then computing a transformation function from the RT-space of one run into the RT-space of the other run.
The method for pairwise alignment can be selected using --realign_method.
The recommended method is lowess (or the faster lowess_cython) or SmoothLLDMedian.
The very simple or linear alignment methods are:
- diRT: uses the difference to the expected elution time of the assay computed by OpenSWATH
- linear: performs a linear alignment using the anchor points
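To make the idea of a transformation function concrete, here is a minimal sketch of a linear alignment: an ordinary least-squares line fitted through anchor points given as (RT in source run, RT in target run) pairs. The function name and the anchor values are hypothetical, and the sketch does not claim to reproduce how the linear method is implemented in TRIC.

```python
def fit_linear(anchors):
    """Least-squares line through (rt_source, rt_target) anchor points.

    Returns a function that maps an RT from the source run into the target run.
    """
    n = len(anchors)
    mean_x = sum(x for x, _ in anchors) / n
    mean_y = sum(y for _, y in anchors) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in anchors)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in anchors)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda rt: slope * rt + intercept


# Hypothetical anchor points (seconds) shared by two runs.
anchors = [(100.0, 110.0), (200.0, 215.0), (300.0, 320.0), (400.0, 425.0)]
to_target = fit_linear(anchors)
mapped = to_target(250.0)
```

A single line cannot follow local distortions of the gradient, which is why the non-linear methods usually allow a smaller --max_rt_diff.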
The more complex, non-linear alignment methods are:
- lowess: uses robust locally weighted regression for alignment (lowess smoother)
- splinePy: uses the Python native spline from scikits.datasmooth (slow!)
- nonCVSpline: computes a spline for alignment (no cross-validation)
- CVSpline: computes a spline for alignment (using cross-validation)
- WeightedNearestNeighbour: weighted interpolation using local linear differences of the k nearest neighbors
- SmoothLLDMedian: local median interpolation using local linear differences of the k nearest neighbors
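The nearest-neighbor methods can be pictured with a deliberately simplified sketch: for a query RT, take the k anchor points closest in the source run, compute their RT differences (target minus source), and shift the query by the median difference. This is only one possible reading of the one-line descriptions above; the function name, the default k and the data are hypothetical, and the actual SmoothLLDMedian implementation may differ.

```python
from statistics import median


def local_median_shift(anchors, rt, k=3):
    """Map rt into the target run using the median shift of the k nearest anchors.

    anchors: list of (rt_source, rt_target) pairs.
    """
    # The k anchors closest to the query in the source run's RT domain.
    nearest = sorted(anchors, key=lambda a: abs(a[0] - rt))[:k]
    # Median of the local RT differences (target minus source).
    return rt + median(y - x for x, y in nearest)


# Hypothetical anchor points (seconds) with a locally varying shift.
anchors = [
    (100.0, 104.0),
    (150.0, 156.0),
    (200.0, 207.0),
    (250.0, 262.0),
    (300.0, 309.0),
]
mapped = local_median_shift(anchors, 210.0, k=3)
```

Using the median rather than a weighted mean makes the local estimate less sensitive to a single badly placed anchor point.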
Several alignment methods require additional packages to be installed:
- splineR: performs alignment using the smooth.spline function in R (needs the rpy2 package)
- splineR_external: performs alignment using the smooth.spline function in R (starts an R process using the command line)
- Earth: uses Multivariate Adaptive Regression Splines (needs the py-earth package)
- lowess_cython: uses a faster lowess implementation (see the main README file, "Fast lowess" for install instructions)
The reference-based alignment selects the run with the most features (identified peakgroups) as the reference. Then all other runs are aligned against the reference in a pairwise fashion.
This mode can be enabled by choosing --method to be one of the following:
- best_overall
- best_cluster_score
- global_best_cluster_score
- global_best_overall
The recommended method is global_best_overall. Note that the two global options will align all peakgroups according to retention time whereas the other two methods will keep peakgroups below the FDR cutoff in all cases. This means that when using a global option, peakgroups below the FDR cutoff may be removed if they are not at the expected position in retention time (this is useful to remove spurious identifications but may lead to low identification numbers if the parameters are too strict).
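The difference between the global and non-global options can be summarized as a small decision rule. The sketch below is a simplification under stated assumptions: the function and parameter names are hypothetical, and it only encodes what the paragraph above says (confident peakgroups outside the RT window survive only with a non-global method).

```python
def keep_peakgroup(rt, expected_rt, fdr, max_rt_diff,
                   fdr_cutoff, max_fdr_quality, global_method):
    """Decide whether a peakgroup is kept in the reference-based approach."""
    in_window = abs(rt - expected_rt) <= max_rt_diff
    if in_window and fdr < max_fdr_quality:
        return True
    # Not acceptable by position: only non-global methods keep
    # peakgroups that are confident on their own.
    return (not global_method) and fdr < fdr_cutoff


# A confident peakgroup far from its expected position (hypothetical values).
kept_local = keep_peakgroup(500.0, 300.0, 0.001, 90.0, 0.01, 0.05,
                            global_method=False)
kept_global = keep_peakgroup(500.0, 300.0, 0.001, 90.0, 0.01, 0.05,
                             global_method=True)
```

With a global method the confident but misplaced peakgroup is dropped, which removes spurious identifications at the cost of lower identification numbers when --max_rt_diff is strict.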
The reference-based approach will try to automatically estimate a sensible value if you set --max_rt_diff to auto_3medianstdev.
Alternatively, a tree-based alignment is available where the input runs are arranged in a guidance tree (the nodes in the tree are runs and the edges are pairwise alignments). This approach is reference-free and means that each alignment step is purely local and each run is only aligned against runs that are chromatographically close. Generally this mode is better for a large number of runs or for chromatographically dissimilar samples.
This mode can be enabled by choosing --method to be one of the following:
- LocalMST
- LocalMSTAllCluster
The best choice here is to use LocalMST, which reports the best result for each assay. If you want to have a full output where multiple results (multiple clusters) per peakgroup may be reported, use LocalMSTAllCluster.
The tree-based alignment has several options specific to it:
- --mst:useRTCorrection: use the aligned peakgroup RT to continue threading in the MST algorithm. It is highly recommended to set this to "True".
- --mst:Stdev_multiplier: turns on adaptive RT tolerances and sets how many standard deviations the peakgroup can deviate in RT during the alignment (if the resulting tolerance is less than max_rt_diff, then max_rt_diff is used). It is recommended to set this to a value between 2.0 and 4.0.
- --mst:useLocalStdev: use the standard deviation of the local region of the chromatogram. This is experimental and may not work.
Adaptive RT tolerance can be very useful if not all alignments have the same quality. This allows the user to set an overall strict tolerance for the alignment while a few, particularly bad pairwise alignments are allowed to have a larger tolerance. These "bad" pairwise alignments may potentially be the edges in the tree that connect two sub-trees which may represent two batches for example.
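The adaptive tolerance can be sketched as the larger of max_rt_diff and a multiple of the standard deviation of one pairwise alignment. In the sketch below the standard deviation is taken over the residuals of the anchor points after alignment; that choice, the function name and the numbers are assumptions made for illustration.

```python
from statistics import stdev


def adaptive_tolerance(residuals, max_rt_diff, stdev_multiplier):
    """RT tolerance for one edge of the tree.

    residuals: RT differences (seconds) between aligned and observed
               anchor points for this pairwise alignment.
    """
    # Never go below max_rt_diff; widen only for poor alignments.
    return max(max_rt_diff, stdev_multiplier * stdev(residuals))


# A good edge and a poor edge (e.g. connecting two batches), hypothetical values.
good_edge = adaptive_tolerance([-2.0, 1.0, 0.0, 2.0, -1.0],
                               max_rt_diff=30.0, stdev_multiplier=3.0)
bad_edge = adaptive_tolerance([-30.0, 20.0, 0.0, 25.0, -15.0],
                              max_rt_diff=30.0, stdev_multiplier=3.0)
```

The good edge keeps the strict 30 second tolerance, while the poor edge is given a wider window, so a strict overall setting does not cut the tree apart at the one alignment that connects two batches.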
Thus, a sample command for a tree-based alignment may look like this:
./analysis/alignment/feature_alignment.py \
  --in file1_input.csv file2_input.csv file3_input.csv \
  --out aligned.csv \
  --method LocalMST --realign_method lowess_cython --max_rt_diff 60 \
  --mst:useRTCorrection True --mst:Stdev_multiplier 3.0 \
  --target_fdr 0.01 --max_fdr_quality 0.05
- --disable_isotopic_grouping: disables grouping of isotopic variants by peptide_group_label, thus disabling matching of isotopic variants of the same peptide across channels. If grouping is turned off, each isotopic channel will be matched independently of the other. If grouping is enabled, the more certain identification will be used to infer the location of the peak in the other channel.
- --use_dscore_filter: enables the filter by d-score (this is mainly for speedup).
- --dscore_cutoff: quality cutoff to still consider a feature for alignment using the d_score: everything below this d-score is discarded (this is mainly for speedup).
- --nr_high_conf_exp: number of experiments in which the peptide needs to be identified with high confidence (e.g. above fdr_cutoff).
- --readmethod: read full or minimal transition groups (minimal, full).
- --tmpdir: temporary directory location.
- --alignment_score: minimal score needed for a feature to be considered for alignment between runs (e.g. score needed to be considered an "anchor point" for pairwise alignment).
- --fdr_cutoff: a fixed m-score cutoff which does not take into account the number of runs (use target_fdr instead).
Even after alignment, a complete data matrix will not be achieved. There is a last step in the TRIC-based workflow that allows requantification of signal across the integration border derived from alignment. This is implemented as a second script after TRIC since for this step, the chromatograms generated by OpenSWATH are needed.
./analysis/alignment/requantAlignedValues.py \
  --do_single_run run_n_chromatograms.mzML \
  --peakgroups_infile aligned_peakgroups.csv \
  --out requantified_output.csv \
  --realign_runs linear \
  --method singleShortestPath
Note that the --do_single_run input file is a chromatogram mzML file generated by OpenSWATH. If you have n files, you should run the above command n times, once for each mzML file, and then concatenate the resulting output files.
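Running the script once per file and merging the results is easy to script. The sketch below only builds the argument lists from the flags shown in the command above and merges output text; the file naming scheme and the assumption that every output file starts with a single header line are hypothetical and should be checked against your own files.

```python
def build_requant_command(mzml_file, peakgroups_file, out_file):
    """Argument list for one requantification run (flags as in the command above)."""
    return [
        "./analysis/alignment/requantAlignedValues.py",
        "--do_single_run", mzml_file,
        "--peakgroups_infile", peakgroups_file,
        "--out", out_file,
        "--realign_runs", "linear",
        "--method", "singleShortestPath",
    ]


def concatenate_outputs(texts):
    """Join per-run outputs, keeping the header line of the first one only.

    Assumes each output starts with exactly one header line.
    """
    merged = []
    for i, text in enumerate(texts):
        lines = text.splitlines()
        merged.extend(lines if i == 0 else lines[1:])
    return "\n".join(merged) + "\n"


# Hypothetical file names for three runs.
commands = [
    build_requant_command(
        f"run_{i}_chromatograms.mzML",
        "aligned_peakgroups.csv",
        f"requantified_output_{i}.csv",
    )
    for i in (1, 2, 3)
]
# Each command could then be executed with subprocess.run(cmd, check=True).

merged = concatenate_outputs(["id\tintensity\nA\t1\n", "id\tintensity\nB\t2\n"])
```

Since each invocation only needs one mzML file and the shared peakgroups file, the n runs are independent of each other and can be executed in parallel.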
The individual parameters can be adjusted as follows:
- --peakgroups_infile: infile containing peakgroups (outfile from feature_alignment.py).
- --file_format: which input file format is used for --peakgroups_infile (openswath (default), mprophet or peakview). openswath is used for a file generated by the OpenSwath workflow (OpenSwath + mProphet / pyProphet) while mprophet is used for traditional SRM files generated by the mQuest + mProphet workflow. peakview is for PeakView files.
- --method: which method to use (singleShortestPath or singleClosestRun are recommended).
- --realign_runs: same as realign_method above, see (Non)-linear pairwise alignment options.
There are multiple alignment approaches available, which can be controlled with --method:
- singleShortestPath (tree-based alignment): the integration borders are taken from the run that is closest to the current run in the guidance tree
- singleClosestRun (tree-based alignment): the integration borders are transferred from the single closest run, disregarding the guidance tree
- reference (reference-based alignment): the integration borders are aggregated across all runs (see advanced parameters)
If singleShortestPath or singleClosestRun is given, a tree-based alignment is chosen, while with reference, a reference-based alignment is chosen. Note that using a tree-based alignment is recommended; the reference-based approach is currently not recommended.
- --border_option (only in effect when --method is reference): how to determine the integration border for the aggregate alignment (possible values: max_width, mean, median). All integration borders will be computed across all runs and then an aggregate is computed, using either the maximal width, the mean or the median. Max width will use the maximal possible width (most conservative since it will overestimate the background signal).
- --cache_in_memory: cache data from a single run in memory.
- --disable_isotopic_grouping: disables grouping of isotopic variants by peptide_group_label, thus disabling matching of isotopic variants of the same peptide across channels. If grouping is turned off, each isotopic channel will be matched independently of the other. If grouping is enabled, the more certain identification will be used to infer the location of the peak in the other channel.
- --disable_isotopic_transfer: disables the transfer of isotopic boundaries in all cases. If the transfer is enabled (default), the best (best score) isotopic channel dictates the peak boundaries and all other channels use those boundaries. This ensures consistency in peak picking in all cases.
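The three --border_option values can be illustrated with a short sketch that aggregates per-run integration borders given as (left, right) RT pairs. Interpreting max_width as the earliest left border combined with the latest right border is an assumption of this sketch, as are the function name and the example values.

```python
from statistics import mean, median


def aggregate_borders(borders, option):
    """Aggregate (left, right) integration borders from several runs."""
    lefts = [left for left, _ in borders]
    rights = [right for _, right in borders]
    if option == "max_width":
        # Assumed meaning: widest possible window covering all runs.
        return min(lefts), max(rights)
    if option == "mean":
        return mean(lefts), mean(rights)
    if option == "median":
        return median(lefts), median(rights)
    raise ValueError(f"unknown border option: {option}")


# Hypothetical integration borders (seconds) from three aligned runs.
borders = [(100.0, 120.0), (98.0, 125.0), (103.0, 121.0)]
widest = aggregate_borders(borders, "max_width")
central = aggregate_borders(borders, "median")
```

The widest window integrates the most background, which is why the text above calls max_width the most conservative choice.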