fiberseq · mrvollger · Mar 21, 2024 · Mar 21, 2024 · Mar 21, 2024 · Mar 21, 2024
diff --git a/.github/ISSUE_TEMPLATE/enhancement.md b/.github/ISSUE_TEMPLATE/enhancement.md
@@ -0,0 +1,12 @@
+---
+name: Enhancement request
+about: "Use this template to file enhancement requests for fibertools-rs. "
+title: ""
+labels: enhancement
+assignees: ""
+---
+
+### Thanks for using `fibertools-rs`! To help with enhancement requests I need all the following items:
+
+- A detailed description of the requested change to current behavior or development of a new command.
+- A short description of your scientific use case for the enhancement.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -3,7 +3,7 @@ Please feel free to open PRs! But first make sure you code passes tests, and ple
 ```bash
 cargo test --all-features
 ```
-Also format your code and check it with clingy:
+Also format your code and check it with clippy before submitting a PR:
 ```bash
 cargo fmt 
 cargo clippy --workspace

diff --git a/Cargo.toml b/Cargo.toml
@@ -66,7 +66,7 @@ burn = ["dep:burn"]
 # generated by 'cargo dist init'
 [profile.dist]
 inherits = "release"
-split-debuginfo = "packed"
+#split-debuginfo = "packed"
 
 # generated by 'cargo wizard'
 [profile.dev]
@@ -82,6 +82,7 @@ codegen-units = 256
 incremental = true
 
 [profile.release]
+strip="debuginfo"
 codegen-units = 1
 debug = false
 lto = true

diff --git a/README.md b/README.md
@@ -23,7 +23,7 @@
 mamba install -c conda-forge -c bioconda fibertools-rs
 ```
 
-However, due to size constraints in `bioconda` this version does not support m6a prediction or GPU acceleration. If you would like to use m6A prediction and GPU acceleration, you will need to install using the directions in the [INSTALL.md](/INSTALL.md) file.
+However, due to size constraints in `bioconda` this version does not support contain the pytorch libraries or GPU acceleration for m6A predictions. m6A predictions will still work in the bioconda version but may be much slower. If you would like to use m6A prediction and GPU acceleration, you will need to install using the directions in the [INSTALL.md](/INSTALL.md) file.
 
 # Usage
 
@@ -33,25 +33,26 @@ ft --help
 
 [Help page for fibertools](/docs/ft--help.md)
 
-# Subcommands for `fibertools-rs`
+# Highlighted subcommands for `fibertools-rs`
 
-## `ft predict-m6a`
+### `ft predict-m6a`
 
 Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. [Help page for predict-m6a](/docs/ft-predict-m6a-help.md).
 
-## `ft add-nucleosomes`
+### `ft add-nucleosomes`
 
 Add nucleosomes to a bam that file already contains m6a predictions. Note, this process is also run in the background during `predict-m6a`, so it is unnecessary to run independently unless you want to try new parameters for nucleosome calling. [Help page for add-nucleosomes](/docs/ft-add-nucleosomes-help.md).
 
-## `ft extract`
+### `ft extract`
 
 Extracts Fiber-seq data from a bam file into plain text. [Help page for extract](/docs/extract.md).
-![Extract](/assets/img/ft-extract-all.png)
 
-## `ft center`
+### `ft center`
 
 Center Fiber-seq reads (bam) around reference position(s). [Help page for center](/docs/center.md).
-![Center](/assets/img/center.png)
+
+### `ft footprint`
+Footprint Fiber-seq reads (bam) around reference motifs(s). [Help page for footprint](/docs/footprint.md).
 
 # Python API (`pyft`)
 
@@ -61,19 +62,5 @@ The python API is still in development and not stable; however, you can find the
 
 **Jha, A.**, **Bohaczuk, S. C.**, Mao, Y., Ranchalis, J., Mallory, B. J., Min, A. T., Hamm, M. O., Swanson, E., Finkbeiner, C., Li, T., Whittington, D., Stergachis, A. B., & **Vollger, M. R.** (2023). DNA-m6A calling and integrated long-read epigenetic and genetic analysis with **fibertools**. _bioRxiv_. https://doi.org/10.1101/2023.04.20.537673
 
-# TODO items
-
-- [ ] Use new iterator for `ft extract` and group writes to try and improve the speed
-- [ ] long format extract command
-- [ ] Improve progress bar for predict-m6a.
-  - [ ] Get size of bam, say how far we are through the bam in terms of MB/GB?
-- [x] Add a python API (see py-ft for progress)
-  - [ ] add default data viz
-  - [ ] add conversion to pandas data frame or maybe anndata
-- [x] GPU support
-  - [ ] see if I can simplify or statically link PyTorch to get it onto bioconda
-  - [ ] Detect GPU memory to set batch size dynamically.
-- [ ] Add unaligned, secondary, supplemental reads to the test bam.
-
 # Contributing
 If you would like to contribute to `fibertools-rs`, please see the [CONTRIBUTING.md](/CONTRIBUTING.md) file for more information.
diff --git a/build.rs b/build.rs
@@ -15,6 +15,7 @@ fn main() {
 
     // Generate the model code and state file from the ONNX file.
     use burn_import::onnx::ModelGen;
+    use burn_import::onnx::RecordType;
     for x in &[
         "src/m6a_burn/two_zero.onnx",
         "src/m6a_burn/two_two.onnx",
@@ -24,7 +25,8 @@ fn main() {
         ModelGen::new()
             .input(x) // Path to the ONNX model
             .out_dir("m6a_burn/") // Directory for the generated Rust source file (under target/)
-            //.embed_states(true)
+            .record_type(RecordType::Bincode)
+            .embed_states(true)
             .run_from_script();
     }
 }
diff --git a/docs/ft--help.md b/docs/ft--help.md
@@ -4,18 +4,12 @@ Fiber-seq toolkit in rust
 Usage: ft [OPTIONS] <COMMAND>
 
 Commands:
-  predict-m6a       Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. Also
-                        adds nucleosome (nl, ns) and MTase sensitive patches (al, as) [aliases: m6A, m6a]
+  predict-m6a       Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. Also adds nucleosome (nl, ns) and MTase sensitive patches (al, as) [aliases: m6A, m6a]
   add-nucleosomes   Add nucleosomes to a bam file with m6a predictions
   fire              Add FIREs (Fiber-seq Inferred Regulatory Elements) to a bam file with m6a predictions
-  footprint         Add footprints to a bam file with m6a predictions
-  extract           Extract fiberseq data into plain text files. See
-                        https://fiberseq.github.io/fibertools-rs/docs/extract.html for a description of the outputs [aliases:
-                        ex, e]
-  center            This command centers fiberseq data around given reference positions. This is useful for making
-                        aggregate m6A and CpG observations, as well as visualization of SVs. See
-                        https://fiberseq.github.io/fibertools-rs/docs/center.html for a description of the output [aliases:
-                        c, ct]
+  extract           Extract fiberseq data into plain text files [aliases: ex, e]
+  center            This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A and CpG observations, as well as visualization of SVs [aliases: c, ct]
+  footprint         Infer footprints from fiberseq data
   track-decorators  Make decorated bed files for fiberseq data
   clear-kinetics    Remove HiFi kinetics tags from the input bam file
   strip-basemods    Strip out select base modifications

diff --git a/docs/ft-add-nucleosomes-help.md b/docs/ft-add-nucleosomes-help.md
@@ -8,20 +8,13 @@ Arguments:
   [OUT]  Output bam file with nucleosome calls [default: -]
 
 Options:
-  -n, --nucleosome-length <NUCLEOSOME_LENGTH>
-          Minium nucleosome length [default: 75]
-  -c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH>
-          Minium nucleosome length when combining over a single m6A [default: 100]
-  -m, --min-distance-added <MIN_DISTANCE_ADDED>
-          Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
-  -d, --distance-from-end <DISTANCE_FROM_END>
-          Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
-      --min-ml-score <MIN_ML_SCORE>
-          Minium score in the ML tag to use in predicting nucleosomes [default: 125]
-  -h, --help
-          Print help
-  -V, --version
-          Print version
+  -n, --nucleosome-length <NUCLEOSOME_LENGTH>                    Minium nucleosome length [default: 75]
+  -c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH>  Minium nucleosome length when combining over a single m6A [default: 100]
+  -m, --min-distance-added <MIN_DISTANCE_ADDED>                  Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
+  -d, --distance-from-end <DISTANCE_FROM_END>                    Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
+      --min-ml-score <MIN_ML_SCORE>                              Minium score in the ML tag to use in predicting nucleosomes [default: 125]
+  -h, --help                                                     Print help
+  -V, --version                                                  Print version
 
 Global-Options:
   -t, --threads <THREADS>  Threads [default: 8]

diff --git a/docs/ft-center-help.md b/docs/ft-center-help.md
@@ -1,31 +1,53 @@
 ```
-This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A and CpG
-observations, as well as visualization of SVs. See https://fiberseq.github.io/fibertools-rs/docs/center.html for a
-description of the output
+This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A and CpG observations, as well as visualization of SVs.
+
+See https://fiberseq.github.io/fibertools-rs/docs/center.html for a description of the output.
 
 Usage: ft center [OPTIONS] <BAM> <BED>
 
 Arguments:
-  <BAM>  Aligned Fiber-seq bam file
-  <BED>  Bed file on which to center fiberseq reads. Data is adjusted to the start position of the bed file and corrected for
-         strand if the strand is indicated in the 6th column of the bed file. The 4th column will also be checked for the
-         strand but only after the 6th is. If you include strand information in the 4th (or 6th) column it will orient data
-         accordingly and use the end position of bed record instead of the start if on the minus strand. This means that
-         profiles of motifs in both the forward and minus orientation will align to the same central position
+  <BAM>
+          Aligned Fiber-seq bam file
+
+  <BED>
+          Bed file on which to center fiberseq reads. Data is adjusted to the start position of the bed file and corrected for strand if the strand is indicated in the 6th column of the bed file. The 4th column will also be
+          checked for the strand but only after the 6th is. If you include strand information in the 4th (or 6th) column it will orient data accordingly and use the end position of bed record instead of the start if on the
+          minus strand. This means that profiles of motifs in both the forward and minus orientation will align to the same central position
 
 Options:
-  -m, --min-ml-score <MIN_ML_SCORE>  Minium score in the ML tag to include in the output [default: 125]
-  -d, --dist <DIST>                  Set a maximum distance from the start of the motif to keep a feature
-  -w, --wide                         Provide data in wide format, one row per read
-  -r, --reference                    Return relative reference position instead of relative molecular position
-  -s, --simplify                     Replace the sequence output column with just "N"
-  -h, --help                         Print help
-  -V, --version                      Print version
+  -m, --min-ml-score <MIN_ML_SCORE>
+          Minium score in the ML tag to include in the output
+
+          [default: 125]
+
+  -d, --dist <DIST>
+          Set a maximum distance from the start of the motif to keep a feature
+
+  -w, --wide
+          Provide data in wide format, one row per read
+
+  -r, --reference
+          Return relative reference position instead of relative molecular position
+
+  -s, --simplify
+          Replace the sequence output column with just "N"
+
+  -h, --help
+          Print help (see a summary with '-h')
+
+  -V, --version
+          Print version
 
 Global-Options:
-  -t, --threads <THREADS>  Threads [default: 8]
+  -t, --threads <THREADS>
+          Threads
+
+          [default: 8]
 
 Debug-Options:
-  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
-      --quiet       Turn off all logging
+  -v, --verbose...
+          Logging level [-v: Info, -vv: Debug, -vvv: Trace]
+
+      --quiet
+          Turn off all logging
 ```
diff --git a/docs/ft-extract-help.md b/docs/ft-extract-help.md
@@ -1,32 +1,66 @@
 ```
-Extract fiberseq data into plain text files. See https://fiberseq.github.io/fibertools-rs/docs/extract.html for a description
-of the outputs
+Extract fiberseq data into plain text files.
+
+See https://fiberseq.github.io/fibertools-rs/docs/extract.html for a description of the outputs.
 
 Usage: ft extract [OPTIONS] [BAM]
 
 Arguments:
-  [BAM]  Input fiberseq bam file. If no path is provided extract will read bam data from stdin [default: -]
+  [BAM]
+          Input fiberseq bam file. If no path is provided extract will read bam data from stdin
+
+          [default: -]
 
 Options:
-  -r, --reference                    Report in reference sequence coordinates
-      --molecular                    Report positions in the molecular sequence coordinates
-  -m, --min-ml-score <MIN_ML_SCORE>  Minium score in the ML tag to include in the output [default: 125]
-      --m6a <M6A>                    Output path for m6a bed12
-  -c, --cpg <CPG>                    Output path for 5mC (CpG, primrose) bed12
-      --msp <MSP>                    Output path for methylation sensitive patch (msp) bed12
-  -n, --nuc <NUC>                    Output path for nucleosome bed12
-  -a, --all <ALL>                    Output path for a tabular format including "all" fiberseq information in the bam
-  -h, --help                         Print help
-  -V, --version                      Print version
+  -r, --reference
+          Report in reference sequence coordinates
+
+      --molecular
+          Report positions in the molecular sequence coordinates
+
+  -m, --min-ml-score <MIN_ML_SCORE>
+          Minium score in the ML tag to include in the output
+
+          [default: 125]
+
+      --m6a <M6A>
+          Output path for m6a bed12
+
+  -c, --cpg <CPG>
+          Output path for 5mC (CpG, primrose) bed12
+
+      --msp <MSP>
+          Output path for methylation sensitive patch (msp) bed12
+
+  -n, --nuc <NUC>
+          Output path for nucleosome bed12
+
+  -a, --all <ALL>
+          Output path for a tabular format including "all" fiberseq information in the bam
+
+  -h, --help
+          Print help (see a summary with '-h')
+
+  -V, --version
+          Print version
 
 All-Format-Options:
-  -q, --quality   Include per base quality scores in "fiber_qual"
-  -s, --simplify  Simplify output by remove fiber sequence
+  -q, --quality
+          Include per base quality scores in "fiber_qual"
+
+  -s, --simplify
+          Simplify output by remove fiber sequence
 
 Global-Options:
-  -t, --threads <THREADS>  Threads [default: 8]
+  -t, --threads <THREADS>
+          Threads
+
+          [default: 8]
 
 Debug-Options:
-  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
-      --quiet       Turn off all logging
+  -v, --verbose...
+          Logging level [-v: Info, -vv: Debug, -vvv: Trace]
+
+      --quiet
+          Turn off all logging
 ```
diff --git a/docs/ft-fire-help.md b/docs/ft-fire-help.md
@@ -8,35 +8,20 @@ Arguments:
   [OUT]  Output file (bam by default, table if --feats_to_text is used, and bed9 + if --extract is used) [default: -]
 
 Options:
-  -e, --extract
-          Output just FIRE elements in bed9 format
-  -s, --skip-no-m6a
-          Don't write reads with no m6A calls to the output bam
-      --min-msp <MIN_MSP>
-          Skip reads without at least `N` MSP calls [default: 0]
-      --min-ave-msp-size <MIN_AVE_MSP_SIZE>
-          Skip reads without an average MSP size greater than `N` [default: 0]
-  -w, --width-bin <WIDTH_BIN>
-          Width of bin for feature collection [default: 40]
-  -b, --bin-num <BIN_NUM>
-          Number of bins to collect [default: 9]
-      --best-window-size <BEST_WINDOW_SIZE>
-          Calculate stats for the highest X bp window within each MSP Should be a fair amount higher than the expected linker
-          length [default: 100]
-  -u, --use-5mc
-          Use 5mC data in FIREs
-  -m, --min-msp-length-for-positive-fire-call <MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL>
-          Minium length of msp to call a FIRE [default: 85]
-      --model <MODEL>
-          optional path to a model json file
-      --fdr-table <FDR_TABLE>
-          Optional path to a FDR table
-  -f, --feats-to-text
-          Output FIREs features for training in a table format
-  -h, --help
-          Print help
-  -V, --version
-          Print version
+  -e, --extract                                                                        Output just FIRE elements in bed9 format
+  -s, --skip-no-m6a                                                                    Don't write reads with no m6A calls to the output bam
+      --min-msp <MIN_MSP>                                                              Skip reads without at least `N` MSP calls [default: 0]
+      --min-ave-msp-size <MIN_AVE_MSP_SIZE>                                            Skip reads without an average MSP size greater than `N` [default: 0]
+  -w, --width-bin <WIDTH_BIN>                                                          Width of bin for feature collection [default: 40]
+  -b, --bin-num <BIN_NUM>                                                              Number of bins to collect [default: 9]
+      --best-window-size <BEST_WINDOW_SIZE>                                            Calculate stats for the highest X bp window within each MSP Should be a fair amount higher than the expected linker length [default: 100]
+  -u, --use-5mc                                                                        Use 5mC data in FIREs
+  -m, --min-msp-length-for-positive-fire-call <MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL>  Minium length of msp to call a FIRE [default: 85]
+      --model <MODEL>                                                                  optional path to a model json file
+      --fdr-table <FDR_TABLE>                                                          Optional path to a FDR table
+  -f, --feats-to-text                                                                  Output FIREs features for training in a table format
+  -h, --help                                                                           Print help
+  -V, --version                                                                        Print version
 
 Global-Options:
   -t, --threads <THREADS>  Threads [default: 8]