lbcb-sci · nkkarpov · Apr 12, 2024 · Apr 15, 2024
diff --git a/README.md b/README.md
@@ -82,9 +82,15 @@ Note: Read ids can be obtained with seqkit: ```seqkit seq -ni <reads> > <read_id
 
 3. Error-correction
 ```shell
-herro inference --read-alns <directory_alignment_batches> -t <feat_gen_threads_per_device> -d <gpus> -m <model_path> -b <batch_size> <preprocessed_reads> <fasta_output> 
+herro inference --read-alns <directory_alignment_batches> -t <feat_gen_threads_per_device> -d <gpus> -m <model_path> -b <batch_size> [-c <cluster>] <preprocessed_reads> <fasta_output> 
 ```
-Note: GPUs are specified using their IDs. For example, if the value of the parameter -d is set to 0,1,3, herro will use the first, second, and fourth GPU cards. Parameter ```-t``` is given **per device** - e.g., if ```-t``` is set to ```8``` and 3 GPUs are used, herro will create 24 feature generation theads in total. Recommended batch size is 64 for GPUs with 40 GB (possibly also for 32 GB) of VRAM and 128 for GPUs with 80 GB of VRAM. 
+Note: GPUs are specified using their IDs. For example, if the parameter ```-d``` is set to ```0,1,3```, herro will use the first, second, and fourth GPU cards. Parameter ```-t``` is given **per device** - e.g., if ```-t``` is set to ```8``` and 3 GPUs are used, herro will create 24 feature generation threads in total. Recommended batch size is 64 for GPUs with 40 GB (possibly also for 32 GB) of VRAM and 128 for GPUs with 80 GB of VRAM. With the optional flag `-c`, herro reads the description of a cluster and its neighborhood from the specified file, loads only the reads associated with that cluster and its neighborhood into RAM, and outputs corrected reads for the cluster. Each line of the file should be formatted as `0\t<ID>` for an ID within the cluster or `1\t<ID>` for an ID in the neighborhood. The partitioning into clusters can be generated using the command:
+
+```shell
+zstdcat *.paf.zst | cut -f1,6 | python scripts/create_clusters.py
+```
+
+This command writes the cluster files to the folder `clusters`.
 
 
 ## Results on HG002 data

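To make the `-c` file format from the README change concrete, here is a minimal Python sketch that splits such a file into cluster IDs and neighborhood IDs. It is not herro's parser (that lives in the Rust code and is not part of this diff). The function name `parse_cluster_file`, the sample read IDs, and the choice to skip blank lines and reject unknown flags are all assumptions made for illustration.

```python
import io


def parse_cluster_file(lines):
    """Split `<flag>\\t<ID>` lines into (cluster_ids, neighbourhood_ids).

    Hypothetical helper: flag `0` marks an ID inside the cluster and
    flag `1` marks an ID in its neighborhood, as the README describes.
    """
    cluster, neighbourhood = set(), set()
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        flag, sep, read_id = line.partition("\t")
        if not sep or not read_id:
            raise ValueError(f"line {lineno}: expected '<flag>\\t<ID>', got {line!r}")
        if flag == "0":
            cluster.add(read_id)
        elif flag == "1":
            neighbourhood.add(read_id)
        else:
            # assumption: any flag other than 0 or 1 is treated as an error
            raise ValueError(f"line {lineno}: unknown flag {flag!r}")
    return cluster, neighbourhood


# Made-up IDs; a real file would be one of those written to `clusters`.
example = io.StringIO("0\tread_a\n0\tread_b\n1\tread_c\n")
core, neighbours = parse_cluster_file(example)
print(sorted(core), sorted(neighbours))
```

A check like this can be run on a generated cluster file before passing it to `-c`, to catch lines that do not match the two documented forms.
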
diff --git a/src/lib.rs b/src/lib.rs
@@ -214,7 +214,7 @@ fn parse_reads<P: AsRef<Path>>(reads_path: P, window_size: u32, core: &Option<Fx
         let g = reads_path.as_ref().join("*").to_str().unwrap().to_owned();
         let reads : Vec<_> = glob(&g).unwrap()
             .filter_map(|p| p.ok().and_then(|path| path.to_str().map(|s| s.to_owned())))
-            .filter(|s| s.ends_with(".fastq") || s.ends_with(".fastq.gz"))
+            .filter(|s| s.ends_with(".fastq") || s.ends_with(".fastq.gz") || s.ends_with(".fq") || s.ends_with(".fq.gz"))
             .flat_map(|s| haec_io::get_reads(&s, window_size, core, neighbour))
             .collect();
         set_parse_reads_spinner_finish(reads.len(), spinner);