Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.4.2 #47

Merged
merged 13 commits into from
Mar 21, 2024
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/enhancement.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
---
name: Enhancement request
about: "Use this template to file enhancement requests for fibertools-rs. "
title: ""
labels: enhancement
assignees: ""
---

### Thanks for using `fibertools-rs`! To help with enhancement requests I need all the following items:

- A detailed description of the requested change to current behavior or development of a new command.
- A short description of your scientific use case for the enhancement.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ Please feel free to open PRs! But first make sure you code passes tests, and ple
```bash
cargo test --all-features
```
Also format your code and check it with clingy:
Also format your code and check it with clippy before submitting a PR:
```bash
cargo fmt
cargo clippy --workspace
Expand Down
3 changes: 2 additions & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ burn = ["dep:burn"]
# generated by 'cargo dist init'
[profile.dist]
inherits = "release"
split-debuginfo = "packed"
#split-debuginfo = "packed"

# generated by 'cargo wizard'
[profile.dev]
Expand All @@ -82,6 +82,7 @@ codegen-units = 256
incremental = true

[profile.release]
strip="debuginfo"
codegen-units = 1
debug = false
lto = true
Expand Down
31 changes: 9 additions & 22 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@
mamba install -c conda-forge -c bioconda fibertools-rs
```

However, due to size constraints in `bioconda` this version does not support m6a prediction or GPU acceleration. If you would like to use m6A prediction and GPU acceleration, you will need to install using the directions in the [INSTALL.md](/INSTALL.md) file.
However, due to size constraints in `bioconda` this version does not support contain the pytorch libraries or GPU acceleration for m6A predictions. m6A predictions will still work in the bioconda version but may be much slower. If you would like to use m6A prediction and GPU acceleration, you will need to install using the directions in the [INSTALL.md](/INSTALL.md) file.

# Usage

Expand All @@ -33,25 +33,26 @@ ft --help

[Help page for fibertools](/docs/ft--help.md)

# Subcommands for `fibertools-rs`
# Highlighted subcommands for `fibertools-rs`

## `ft predict-m6a`
### `ft predict-m6a`

Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. [Help page for predict-m6a](/docs/ft-predict-m6a-help.md).

## `ft add-nucleosomes`
### `ft add-nucleosomes`

Add nucleosomes to a bam that file already contains m6a predictions. Note, this process is also run in the background during `predict-m6a`, so it is unnecessary to run independently unless you want to try new parameters for nucleosome calling. [Help page for add-nucleosomes](/docs/ft-add-nucleosomes-help.md).

## `ft extract`
### `ft extract`

Extracts Fiber-seq data from a bam file into plain text. [Help page for extract](/docs/extract.md).
![Extract](/assets/img/ft-extract-all.png)

## `ft center`
### `ft center`

Center Fiber-seq reads (bam) around reference position(s). [Help page for center](/docs/center.md).
![Center](/assets/img/center.png)

### `ft footprint`
Footprint Fiber-seq reads (bam) around reference motifs(s). [Help page for footprint](/docs/footprint.md).

# Python API (`pyft`)

Expand All @@ -61,19 +62,5 @@ The python API is still in development and not stable; however, you can find the

**Jha, A.**, **Bohaczuk, S. C.**, Mao, Y., Ranchalis, J., Mallory, B. J., Min, A. T., Hamm, M. O., Swanson, E., Finkbeiner, C., Li, T., Whittington, D., Stergachis, A. B., & **Vollger, M. R.** (2023). DNA-m6A calling and integrated long-read epigenetic and genetic analysis with **fibertools**. _bioRxiv_. https://doi.org/10.1101/2023.04.20.537673

# TODO items

- [ ] Use new iterator for `ft extract` and group writes to try and improve the speed
- [ ] long format extract command
- [ ] Improve progress bar for predict-m6a.
- [ ] Get size of bam, say how far we are through the bam in terms of MB/GB?
- [x] Add a python API (see py-ft for progress)
- [ ] add default data viz
- [ ] add conversion to pandas data frame or maybe anndata
- [x] GPU support
- [ ] see if I can simplify or statically link PyTorch to get it onto bioconda
- [ ] Detect GPU memory to set batch size dynamically.
- [ ] Add unaligned, secondary, supplemental reads to the test bam.

# Contributing
If you would like to contribute to `fibertools-rs`, please see the [CONTRIBUTING.md](/CONTRIBUTING.md) file for more information.
4 changes: 3 additions & 1 deletion build.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ fn main() {

// Generate the model code and state file from the ONNX file.
use burn_import::onnx::ModelGen;
use burn_import::onnx::RecordType;
for x in &[
"src/m6a_burn/two_zero.onnx",
"src/m6a_burn/two_two.onnx",
Expand All @@ -24,7 +25,8 @@ fn main() {
ModelGen::new()
.input(x) // Path to the ONNX model
.out_dir("m6a_burn/") // Directory for the generated Rust source file (under target/)
//.embed_states(true)
.record_type(RecordType::Bincode)
.embed_states(true)
.run_from_script();
}
}
14 changes: 4 additions & 10 deletions docs/ft--help.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,18 +4,12 @@ Fiber-seq toolkit in rust
Usage: ft [OPTIONS] <COMMAND>

Commands:
predict-m6a Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. Also
adds nucleosome (nl, ns) and MTase sensitive patches (al, as) [aliases: m6A, m6a]
predict-m6a Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. Also adds nucleosome (nl, ns) and MTase sensitive patches (al, as) [aliases: m6A, m6a]
add-nucleosomes Add nucleosomes to a bam file with m6a predictions
fire Add FIREs (Fiber-seq Inferred Regulatory Elements) to a bam file with m6a predictions
footprint Add footprints to a bam file with m6a predictions
extract Extract fiberseq data into plain text files. See
https://fiberseq.github.io/fibertools-rs/docs/extract.html for a description of the outputs [aliases:
ex, e]
center This command centers fiberseq data around given reference positions. This is useful for making
aggregate m6A and CpG observations, as well as visualization of SVs. See
https://fiberseq.github.io/fibertools-rs/docs/center.html for a description of the output [aliases:
c, ct]
extract Extract fiberseq data into plain text files [aliases: ex, e]
center This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A and CpG observations, as well as visualization of SVs [aliases: c, ct]
footprint Infer footprints from fiberseq data
track-decorators Make decorated bed files for fiberseq data
clear-kinetics Remove HiFi kinetics tags from the input bam file
strip-basemods Strip out select base modifications
Expand Down
21 changes: 7 additions & 14 deletions docs/ft-add-nucleosomes-help.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,20 +8,13 @@ Arguments:
[OUT] Output bam file with nucleosome calls [default: -]

Options:
-n, --nucleosome-length <NUCLEOSOME_LENGTH>
Minium nucleosome length [default: 75]
-c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH>
Minium nucleosome length when combining over a single m6A [default: 100]
-m, --min-distance-added <MIN_DISTANCE_ADDED>
Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
-d, --distance-from-end <DISTANCE_FROM_END>
Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
--min-ml-score <MIN_ML_SCORE>
Minium score in the ML tag to use in predicting nucleosomes [default: 125]
-h, --help
Print help
-V, --version
Print version
-n, --nucleosome-length <NUCLEOSOME_LENGTH> Minium nucleosome length [default: 75]
-c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH> Minium nucleosome length when combining over a single m6A [default: 100]
-m, --min-distance-added <MIN_DISTANCE_ADDED> Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
-d, --distance-from-end <DISTANCE_FROM_END> Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
--min-ml-score <MIN_ML_SCORE> Minium score in the ML tag to use in predicting nucleosomes [default: 125]
-h, --help Print help
-V, --version Print version

Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Expand Down
60 changes: 41 additions & 19 deletions docs/ft-center-help.md
Original file line number Diff line number Diff line change
@@ -1,31 +1,53 @@
```
This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A and CpG
observations, as well as visualization of SVs. See https://fiberseq.github.io/fibertools-rs/docs/center.html for a
description of the output
This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A and CpG observations, as well as visualization of SVs.

See https://fiberseq.github.io/fibertools-rs/docs/center.html for a description of the output.

Usage: ft center [OPTIONS] <BAM> <BED>

Arguments:
<BAM> Aligned Fiber-seq bam file
<BED> Bed file on which to center fiberseq reads. Data is adjusted to the start position of the bed file and corrected for
strand if the strand is indicated in the 6th column of the bed file. The 4th column will also be checked for the
strand but only after the 6th is. If you include strand information in the 4th (or 6th) column it will orient data
accordingly and use the end position of bed record instead of the start if on the minus strand. This means that
profiles of motifs in both the forward and minus orientation will align to the same central position
<BAM>
Aligned Fiber-seq bam file

<BED>
Bed file on which to center fiberseq reads. Data is adjusted to the start position of the bed file and corrected for strand if the strand is indicated in the 6th column of the bed file. The 4th column will also be
checked for the strand but only after the 6th is. If you include strand information in the 4th (or 6th) column it will orient data accordingly and use the end position of bed record instead of the start if on the
minus strand. This means that profiles of motifs in both the forward and minus orientation will align to the same central position

Options:
-m, --min-ml-score <MIN_ML_SCORE> Minium score in the ML tag to include in the output [default: 125]
-d, --dist <DIST> Set a maximum distance from the start of the motif to keep a feature
-w, --wide Provide data in wide format, one row per read
-r, --reference Return relative reference position instead of relative molecular position
-s, --simplify Replace the sequence output column with just "N"
-h, --help Print help
-V, --version Print version
-m, --min-ml-score <MIN_ML_SCORE>
Minium score in the ML tag to include in the output

[default: 125]

-d, --dist <DIST>
Set a maximum distance from the start of the motif to keep a feature

-w, --wide
Provide data in wide format, one row per read

-r, --reference
Return relative reference position instead of relative molecular position

-s, --simplify
Replace the sequence output column with just "N"

-h, --help
Print help (see a summary with '-h')

-V, --version
Print version

Global-Options:
-t, --threads <THREADS> Threads [default: 8]
-t, --threads <THREADS>
Threads

[default: 8]

Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
-v, --verbose...
Logging level [-v: Info, -vv: Debug, -vvv: Trace]

--quiet
Turn off all logging
```
70 changes: 52 additions & 18 deletions docs/ft-extract-help.md
Original file line number Diff line number Diff line change
@@ -1,32 +1,66 @@
```
Extract fiberseq data into plain text files. See https://fiberseq.github.io/fibertools-rs/docs/extract.html for a description
of the outputs
Extract fiberseq data into plain text files.

See https://fiberseq.github.io/fibertools-rs/docs/extract.html for a description of the outputs.

Usage: ft extract [OPTIONS] [BAM]

Arguments:
[BAM] Input fiberseq bam file. If no path is provided extract will read bam data from stdin [default: -]
[BAM]
Input fiberseq bam file. If no path is provided extract will read bam data from stdin

[default: -]

Options:
-r, --reference Report in reference sequence coordinates
--molecular Report positions in the molecular sequence coordinates
-m, --min-ml-score <MIN_ML_SCORE> Minium score in the ML tag to include in the output [default: 125]
--m6a <M6A> Output path for m6a bed12
-c, --cpg <CPG> Output path for 5mC (CpG, primrose) bed12
--msp <MSP> Output path for methylation sensitive patch (msp) bed12
-n, --nuc <NUC> Output path for nucleosome bed12
-a, --all <ALL> Output path for a tabular format including "all" fiberseq information in the bam
-h, --help Print help
-V, --version Print version
-r, --reference
Report in reference sequence coordinates

--molecular
Report positions in the molecular sequence coordinates

-m, --min-ml-score <MIN_ML_SCORE>
Minium score in the ML tag to include in the output

[default: 125]

--m6a <M6A>
Output path for m6a bed12

-c, --cpg <CPG>
Output path for 5mC (CpG, primrose) bed12

--msp <MSP>
Output path for methylation sensitive patch (msp) bed12

-n, --nuc <NUC>
Output path for nucleosome bed12

-a, --all <ALL>
Output path for a tabular format including "all" fiberseq information in the bam

-h, --help
Print help (see a summary with '-h')

-V, --version
Print version

All-Format-Options:
-q, --quality Include per base quality scores in "fiber_qual"
-s, --simplify Simplify output by remove fiber sequence
-q, --quality
Include per base quality scores in "fiber_qual"

-s, --simplify
Simplify output by remove fiber sequence

Global-Options:
-t, --threads <THREADS> Threads [default: 8]
-t, --threads <THREADS>
Threads

[default: 8]

Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
-v, --verbose...
Logging level [-v: Info, -vv: Debug, -vvv: Trace]

--quiet
Turn off all logging
```
43 changes: 14 additions & 29 deletions docs/ft-fire-help.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,35 +8,20 @@ Arguments:
[OUT] Output file (bam by default, table if --feats_to_text is used, and bed9 + if --extract is used) [default: -]

Options:
-e, --extract
Output just FIRE elements in bed9 format
-s, --skip-no-m6a
Don't write reads with no m6A calls to the output bam
--min-msp <MIN_MSP>
Skip reads without at least `N` MSP calls [default: 0]
--min-ave-msp-size <MIN_AVE_MSP_SIZE>
Skip reads without an average MSP size greater than `N` [default: 0]
-w, --width-bin <WIDTH_BIN>
Width of bin for feature collection [default: 40]
-b, --bin-num <BIN_NUM>
Number of bins to collect [default: 9]
--best-window-size <BEST_WINDOW_SIZE>
Calculate stats for the highest X bp window within each MSP Should be a fair amount higher than the expected linker
length [default: 100]
-u, --use-5mc
Use 5mC data in FIREs
-m, --min-msp-length-for-positive-fire-call <MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL>
Minium length of msp to call a FIRE [default: 85]
--model <MODEL>
optional path to a model json file
--fdr-table <FDR_TABLE>
Optional path to a FDR table
-f, --feats-to-text
Output FIREs features for training in a table format
-h, --help
Print help
-V, --version
Print version
-e, --extract Output just FIRE elements in bed9 format
-s, --skip-no-m6a Don't write reads with no m6A calls to the output bam
--min-msp <MIN_MSP> Skip reads without at least `N` MSP calls [default: 0]
--min-ave-msp-size <MIN_AVE_MSP_SIZE> Skip reads without an average MSP size greater than `N` [default: 0]
-w, --width-bin <WIDTH_BIN> Width of bin for feature collection [default: 40]
-b, --bin-num <BIN_NUM> Number of bins to collect [default: 9]
--best-window-size <BEST_WINDOW_SIZE> Calculate stats for the highest X bp window within each MSP Should be a fair amount higher than the expected linker length [default: 100]
-u, --use-5mc Use 5mC data in FIREs
-m, --min-msp-length-for-positive-fire-call <MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL> Minium length of msp to call a FIRE [default: 85]
--model <MODEL> optional path to a model json file
--fdr-table <FDR_TABLE> Optional path to a FDR table
-f, --feats-to-text Output FIREs features for training in a table format
-h, --help Print help
-V, --version Print version

Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Expand Down
Loading
Loading