Skip to content

subset and spaced seed design tool

License

BSD-3-Clause, Unknown licenses found

Licenses found

BSD-3-Clause
LICENSE
Unknown
COPYING
Notifications You must be signed in to change notification settings

bonsai-team/iedera

This branch is 5 commits ahead of laurentnoe/iedera:master.

Folders and files

NameName
Last commit message
Last commit date

Latest commit

278f434 · May 28, 2024
Apr 5, 2024
Jun 22, 2023
Jun 22, 2023
Jan 8, 2017
Dec 17, 2016
May 2, 2024
Dec 17, 2016
Dec 17, 2016
May 2, 2024
Jan 15, 2017
May 7, 2022
May 7, 2022
Jun 22, 2023
May 2, 2024
May 7, 2022
May 13, 2022
May 13, 2022
Jun 22, 2023
Jun 22, 2023
Apr 10, 2018
Jun 29, 2021
Apr 10, 2018
Apr 10, 2018
Apr 10, 2018
May 23, 2022
May 20, 2022
May 20, 2022
Sep 30, 2020
Apr 10, 2018

Repository files navigation

Build Status Build Status Coverage Website

iedera

(more at <http://bioinfo.univ-lille.fr/yass/iedera.php>)

iedera is a tool to select and design spaced seeds, transition constrained spaced seeds, or more generally subset seeds, and vectorized subset seed patterns.

Installation

(more at <http://bioinfo.univ-lille.fr/yass/iedera.php#downloadiedera>)

Binaries for Windows (x64) and OS X (x64) are available at <https://github.com/laurentnoe/iedera/releases>.

Otherwise, you need a C++ compiler and the autotools. On Linux, you can install g++, autoconf, automake. On Mac, you can install xcode, or the command line developer tools (or you can use macports to install g++-mp-5 for example).

Using the command line, type:

git clone https://github.com/laurentnoe/iedera.git
cd iedera
./configure
make

or:

git clone https://github.com/laurentnoe/iedera.git
cd iedera
autoreconf
./configure
automake
make

you can install iedera to a standard /local/bin directory:

sudo make install

or copy the binary directly to your homedir:

cp src/iedera ~/.

Command-line

(more at <http://bioinfo.univ-lille.fr/yass/iedera.php#quick>)

First, use one of these two parameters :

-spaced for spaced seeds
-transitive for transitive spaced seeds

since they are shortcuts for quite long command lines.

Then you can change the weight, span, and number of seeds being designed:

-w <N,N> for the weight range, where N = [1..16] seems reasonable
-s <N,N> for the span range, where N = [1..32] seems reasonable
-n <N> for the number of seeds, where N = [1..32]

as well as the length of the alignment:

-l <N> where N = [1..64] seems reasonable

NOTE : since enumeration of all the combination of multiple seeds may take time, if "-n" is chosen with a value greater than one, please consider the two following:

-r <N> to run the tool on N randomly generated seed patterns
-k to activate the hill-climbing algorithm on previous parameter -r

(more at <http://bioinfo.univ-lille.fr/yass/iedera.php#details>)

Examples

Spaced seeds

A very small example where the seed weight is set to 11, and the span is at most 18 (full enumeration):

iedera -spaced -w 11,11 -s 11,18

will give the classical PatternHunter 1 spaced seed

###-#--#-#--##-###    0.999999761581      0.467122       0.532878
(SEED PATTERN)        (selectivity)       (SENSITIVITY)  (distance to 1,1)

A second example where the number of seeds is now set to 2, the alignment length is set to 50, and 10000 seeds will be tested with the hill-climbing algorithm activated:

iedera -spaced -n 2 -w 11,11 -s 11,22 -l 50 -r 10000 -k

Transition seeds

A very small example for transition seeds (hill climbing):

iedera -transitive -w 11,11 -s 11,22 -r 10000 -k

Lossless seeds

A very small example for lossless seeds (from Burkhard&Karkkainen) : find a lossless seed of weight 12, span at most 19, on alignments of length 25 with 2 mismatches:

iedera -spaced -s 12,19 -w 12,12 -l 25 -L 1,0 -X 2

A second example for lossless seeds (from Kucherov,Noe&Roytberg) on the previous problem, but with two seeds of weight 14, and span between 20 and 21 (to ease the search):

iedera -spaced -l 25 -L 1,0 -X 2 -n 2 -s 20,21 -w 14,14  -r 100..some.zeros..00 -k

IUPAC seeds

IUPAC filtered seeds could challenge minimizer based techniques <https://www.biorxiv.org/content/10.1101/2020.07.24.220616v2>, so we have extended the iedera tool to support such seeds

First getting the alignment probabilities, out of the TAM92 model <https://pubmed.ncbi.nlm.nih.gov/1630306/>, then launching the optimization for a starting shape, and with the given probabilities:

iedera -iupac -s 5,17 -m "RYYNNNNN,RRYNNNNN" -i shuffle  -r 10000 -k -z 100 -f  `./tam92.py -p 20 -k 1 --gc 50`

YNYRNNnnNN,RNYRNnnNNN       0.9999961853027 0.912921        0.087079

Here :

  • N is a mach symbol (equivalent to #)
  • n is a dont care symbol (equivalent to -)
  • R and Y (uppercase) are respectively Purine and Pyrimine Matches (e.g. R is A-A or T-T matches but not A-T or T-A; use downcase symbols to allow all)

Input/Ouput and reoptimization

Sometimes, it may be helpful to rerun several times the same experiment, and keep the best result of all runs. This can be easily done with input/ouput:

-e <filename> for input file (filename can be a non existing file)
-o <filename> for output file (filename may be of same name as input)

so running this command-line multiple times:

iedera -spaced -l 25 -L 1,0 -X 2 -n 2 -w 14,14 -s 20,21 -r 10000 -k -e file_n2_w14_l25_x2_lossless.txt -o file_n2_w14_l25_x2_lossless.txt

will probably find a lossless set of two seeds. Running this command-line multiple times:

iedera -spaced -l 64 -n 2 -w 11,11 -s 11,22 -r 10000 -k -e file_n2_w11_l64_lossy.txt -o file_n2_w11_l64_lossy.txt

will also probably improve the sensitivity result.

Polynomial form

Bernoulli model

When the probability p to generate a match is not fixed (for example p=0.7 was set in all the previous examples), Mak & Benson have proposed to use a polynomial form and select what they called dominant seeds. We have noticed that this dominance applies as well for any other i.i.d criteria as the Hit Integration (Chung & Park), for Lossless seeds, and several discrete models ... (see <http://doi.org/10.1186/s13015-017-0092-1>) so the flag:

-p to activate dominant selection and output polynomial coefficients

is added in the current commited version of iedera (master branch).

Other multivariate models

When the probabilitic model is more complex compared to a simple Bernoulli model on a binary alphabet, it is possible to compute the probability as a multivariate polynomial form. For a given seed provided with the -m parameter, the output will contain this polynomial form set in square brackets. Selection of the best seeds is left as an exercice for the reader. The flag -pF <filename> activates the output of the multivariate polynomial on the given model. The next example gives sensitivity of the seed 1101 on alignments of length 8

iedera -spaced -pF model_bernoulli_simple_x_xp.txt  -m "##-#" -l 8

on the bernoulli model provided by the file model_bernoulli_simple_x_xp.txt

2
   0   1
      0   1
         0   x
      1   1
         0   xp
   1   0
      0   1
         1   x
      1   1
         1   xp

Tools provided with iedera

The iedera binary is located in src/iedera. The scripts plot_spaced_seeds.py and plot_mow_seeds.py are provided to plot :

  • the sensitivity for a 1st hit, on alignments generated with a (parameter-free) bernoulli model,
  • the frequency for a 1st hit, on alignments generated with an increasing frequency of matches, for a set of given seeds.

spaced seeds sensitivity or frequency on 1st hit

References

how to cite this tool:

Kucherov G., Noe L., Roytberg, M., A unifying framework for seed sensitivity and its application to subset seeds, Journal of Bioinformatics and Computational Biology, 4(2):553-569, 2006 <http://doi.org/10.1142/S0219720006001977>

Noe L., Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds, Algorithms for Molecular Biology, 12(1). 2017 <http://doi.org/10.1186/s13015-017-0092-1>

About

subset and spaced seed design tool

Resources

License

BSD-3-Clause, Unknown licenses found

Licenses found

BSD-3-Clause
LICENSE
Unknown
COPYING

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 75.7%
  • Makefile 11.9%
  • Shell 9.4%
  • Python 2.3%
  • M4 0.7%