A command-line tool for scanning DNA sequences and predicting transcription factor binding sites.
- 🧬 Batch processing of sequence files
- 📊 PWM/EWM-based binding site analysis
- 🔍 Configurable occupancy threshold filtering
- 📈 Multiple output formats (CSV, Parquet)
- ⚡ Parallel processing for large datasets
Install from crates.io:
cargo install motif-scanner
Or build from source:
git clone https://github.com/peter6866/tf-binding-rs
cd tf-binding-rs
cargo install --path motif-scanner
Basic usage:
motif-scanner input.csv motifs.meme output.csv
With options:
motif-scanner input.csv motifs.meme output.parquet --cutoff 0.3 --mu 12
DATA_FILE
: Input CSV file containing sequences (must have a 'sequence' column)
PWM_FILE
: MEME format file containing Position Weight Matrices
OUTPUT_FILE
: Path for output file (.csv or .parquet format)
--cutoff
: Minimum occupancy threshold (default: 0.2)
--mu
: Chemical potential parameter (default: 9)
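To make the roles of `--mu` and `--cutoff` more concrete, here is a minimal Python sketch. It assumes the common thermodynamic form in which a site's occupancy is a logistic function of its relative binding energy and the chemical potential. This README does not state the exact formula the tool uses, so the formula and the names `occupancy` and `passes_cutoff` are illustrative assumptions, not the tool's API.

```python
import math

def occupancy(energy: float, mu: float = 9.0) -> float:
    """Assumed logistic form: 1 / (1 + exp(energy - mu)).

    `energy` is a hypothetical relative binding energy for one site
    (lower means tighter binding); `mu` mirrors the --mu option.
    """
    return 1.0 / (1.0 + math.exp(energy - mu))

def passes_cutoff(energy: float, mu: float = 9.0, cutoff: float = 0.2) -> bool:
    """Keep a site only if its occupancy reaches the --cutoff value."""
    return occupancy(energy, mu) >= cutoff

# Under this assumption, a site whose energy equals mu is half occupied.
print(occupancy(9.0, mu=9.0))
# Raising mu increases the occupancy of the same site.
print(occupancy(10.0, mu=9.0) < occupancy(10.0, mu=12.0))
```

Under this assumed model, a higher `--mu` raises every site's occupancy, so more sites clear a given `--cutoff`. A higher `--cutoff` keeps only the strongest sites.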
The input CSV file must contain a column named 'sequence' with DNA sequences:
id,sequence
seq1,ATCGATCGTGCTAGCTA
seq2,GCTAGCTAGCTAGCTAG
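Before running a large batch, it can help to confirm that the input file has the required column. The following Python sketch checks for a 'sequence' column using only the standard library. The helper name `read_sequences` is hypothetical and is not part of motif-scanner.

```python
import csv
import io

def read_sequences(handle):
    """Return the 'sequence' column from a CSV file object as a list of strings."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "sequence" not in reader.fieldnames:
        raise ValueError("input CSV must have a 'sequence' column")
    return [row["sequence"] for row in reader]

# The sample rows shown above, held in memory for the example.
sample = io.StringIO(
    "id,sequence\n"
    "seq1,ATCGATCGTGCTAGCTA\n"
    "seq2,GCTAGCTAGCTAGCTAG\n"
)
sequences = read_sequences(sample)
print(sequences)
```

For a file on disk, pass `open("input.csv", newline="")` instead of the in-memory sample.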
The tool generates a table with the following columns:
label
: Sequence index from input file
position
: Position of the binding site
motif
: Name of the transcription factor
strand
: Binding strand (F/R)
length
: Length of the motif
occupancy
: Predicted occupancy score
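As a sketch of how a CSV result with these columns might be post-processed, the Python snippet below keeps the highest-occupancy site for each motif. It assumes the CSV header uses the column names exactly as listed. The rows and the motif names `TF_A` and `TF_B` are made-up placeholders, not real tool output, and `strongest_site_per_motif` is a hypothetical helper.

```python
import csv
import io

# Placeholder rows for illustration only; these are not real tool output.
example_output = io.StringIO(
    "label,position,motif,strand,length,occupancy\n"
    "0,3,TF_A,F,8,0.61\n"
    "0,10,TF_B,R,6,0.25\n"
    "1,5,TF_A,R,8,0.47\n"
)

def strongest_site_per_motif(handle, min_occupancy=0.0):
    """Return {motif: row} holding the highest-occupancy row for each motif."""
    best = {}
    for row in csv.DictReader(handle):
        occ = float(row["occupancy"])
        if occ < min_occupancy:
            continue  # drop sites below the requested occupancy
        current = best.get(row["motif"])
        if current is None or occ > float(current["occupancy"]):
            best[row["motif"]] = row
    return best

best = strongest_site_per_motif(example_output, min_occupancy=0.3)
for motif, row in best.items():
    print(motif, row["label"], row["position"], row["strand"], row["occupancy"])
```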
# Scan sequences with default parameters
motif-scanner sequences.csv pwm.meme results.csv
# Use stricter threshold and higher chemical potential
motif-scanner sequences.csv pwm.meme results.parquet --cutoff 0.4 --mu 15
# Process and save as Parquet format
motif-scanner data.csv motifs.meme output.parquet
The tool uses parallel processing for efficient scanning of large sequence datasets. Memory usage scales with the number of input sequences and motifs being scanned.
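The per-sequence parallelism described here can be illustrated with a small Python sketch. It is not the tool's implementation: exact string matching stands in for PWM scoring, the site `GCTAG` is a hypothetical example, and reporting reverse-strand hits by their forward-strand start index is an assumption about the position convention.

```python
from concurrent.futures import ThreadPoolExecutor

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def scan_one(task):
    """Find exact matches of `site` on both strands of one sequence.

    Returns (label, position, strand) tuples; a toy stand-in for PWM scoring.
    """
    label, seq, site = task
    hits = []
    for strand, query in (("F", site), ("R", reverse_complement(site))):
        start = seq.find(query)
        while start != -1:
            hits.append((label, start, strand))
            start = seq.find(query, start + 1)  # allow overlapping matches
    return hits

sequences = ["ATCGATCGTGCTAGCTA", "GCTAGCTAGCTAGCTAG"]
site = "GCTAG"  # hypothetical site, chosen only for the example

with ThreadPoolExecutor(max_workers=2) as pool:
    # Each sequence is an independent task; map() returns results in input order.
    tasks = [(i, s, site) for i, s in enumerate(sequences)]
    per_sequence = list(pool.map(scan_one, tasks))

hits = [hit for seq_hits in per_sequence for hit in seq_hits]
print(hits)
```

Because every sequence is scanned independently, the work divides cleanly across workers. The collected hits grow with both the number of sequences and the number of motifs, which is consistent with the memory behavior noted above.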