Good alternative to read_gtf? #427
Comments
I'm not aware of another GTF parsing utility in an R package, but if you find one, let us know. It's easy enough to load a GTF with
I (and ChatGPT) took an initial stab in #428 but won't have time to test it for a while. The strategy the AI came up with to get the attributes doesn't seem to work.
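For reference, one way the attribute parsing mentioned here could be approached in base R is sketched below. This is only a minimal sketch, not the code from #428 and not part of valr or rtracklayer: `parse_gtf_attributes()` and `read_gtf_sketch()` are hypothetical helper names, the two GTF records are made-up example data, and the code assumes attributes follow the usual `key "value";` layout in column 9. Repeated keys (for example several `tag` entries) would only return their first value.

```r
# Hypothetical helper: pull selected keys out of the GTF attribute column.
# Assumes each attribute looks like: key "value";
parse_gtf_attributes <- function(attr, keys) {
  out <- lapply(keys, function(key) {
    # key must start the string or follow a ';' so that e.g. "gene_id"
    # does not match inside a longer key name
    pattern <- paste0('(^|;)[[:space:]]*', key, '[[:space:]]+"([^"]*)"')
    m <- regmatches(attr, regexec(pattern, attr))
    # each match holds: full match, the anchor group, the quoted value
    vapply(m, function(x) if (length(x) == 3) x[3] else NA_character_,
           character(1))
  })
  names(out) <- keys
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Hypothetical reader: load the nine GTF columns, then expand the attributes.
read_gtf_sketch <- function(path,
                            keys = c("gene_id", "transcript_id", "gene_name"),
                            zero_based = TRUE) {
  cols <- c("chrom", "source", "type", "start", "end",
            "score", "strand", "phase", "attributes")
  gtf <- read.delim(
    path, header = FALSE, sep = "\t", quote = "", comment.char = "#",
    col.names = cols, na.strings = ".", stringsAsFactors = FALSE,
    colClasses = c("character", "character", "character", "integer",
                   "integer", "numeric", "character", "integer", "character")
  )
  if (zero_based) {
    # GTF starts are 1-based; shift to 0-based (BED-style) coordinates
    gtf$start <- gtf$start - 1L
  }
  cbind(gtf[setdiff(cols, "attributes")],
        parse_gtf_attributes(gtf$attributes, keys))
}

# Made-up two-record GTF written to a temporary file for illustration
gtf_lines <- c(
  'chr1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1"; gene_name "DDX11L1";',
  'chr1\thavana\texon\t11869\t12227\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "DDX11L1";'
)
tmp <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, tmp)
res <- read_gtf_sketch(tmp)
```

Keys that are absent from a record (here `transcript_id` on the gene line) come back as `NA`. A regex approach like this will not cover every GFF/GTF edge case, so it is best treated as a starting point rather than a drop-in replacement.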
OK, all good, and thanks for the swift response. It's the attribute parsing that indeed was elegant with
Best,
The read_gtf() function in valr was a very thin wrapper around a GTF reading function (
We had to remove this code because CRAN found some potential errors in the C code included within rtracklayer. The errors were unrelated to the GTF reading code.
read_gtf <- function(path, zero_based = TRUE) {
  # import the GTF/GFF with rtracklayer and flatten it to a data frame
  gtf <- rtracklayer::import(path)
  gtf <- as.data.frame(gtf)
  # convert factor columns (e.g. seqnames, strand) to character
  gtf <- dplyr::mutate_if(gtf, is.factor, as.character)
  res <- dplyr::rename(gtf, chrom = seqnames)
  if (zero_based) {
    # GTF starts are 1-based; shift to 0-based (BED-style) coordinates
    res <- dplyr::mutate(res, start = start - 1L)
  }
  tibble::as_tibble(res)
}
# get path to rtracklayer test gtf/gff file
gtf_path <- system.file("tests/genes.gff3", package = "rtracklayer")
read_gtf(gtf_path)
#> # A tibble: 31 × 15
#> chrom start end width strand source type score phase ID Name geneName
#> <chr> <int> <int> <int> <chr> <chr> <chr> <dbl> <int> <chr> <chr> <chr>
#> 1 chr10 92827 95504 2677 - rtrack… gene 5 NA Gene… TUBB8 tubulin…
#> 2 chr10 92827 95178 2351 - rtrack… mRNA NA NA 873 TUBB8 <NA>
#> 3 chr10 92827 95504 2677 - rtrack… mRNA NA NA 872 TUBB8 <NA>
#> 4 chr10 92827 94054 1227 - rtrack… exon NA NA <NA> <NA> <NA>
#> 5 chr10 92996 94054 1058 - rtrack… CDS NA NA <NA> <NA> <NA>
#> 6 chr10 94554 94665 111 - rtrack… exon NA NA <NA> <NA> <NA>
#> 7 chr10 94554 94665 111 - rtrack… exon NA NA <NA> <NA> <NA>
#> 8 chr10 94554 94615 61 - rtrack… CDS NA NA <NA> <NA> <NA>
#> 9 chr10 94554 94665 111 - rtrack… CDS NA NA <NA> <NA> <NA>
#> 10 chr10 94743 94852 109 - rtrack… exon NA NA <NA> <NA> <NA>
#> # ℹ 21 more rows
#> # ℹ 3 more variables: Alias <list>, genome <chr>, Parent <list>
Created on 2025-01-14 with reprex v2.1.1
I also took a stab at implementing read_gtf using Rcpp. I have a functional prototype, but it was still 2x slower than rtracklayer's import (which is written in C) and couldn't handle all of the edge cases that rtracklayer has already solved (e.g., it also reads GFF format). I'd prefer to refer users to rtracklayer for the import functionality rather than adding more C++ code to maintain.
Thanks a lot for the guidance, was just now looking at the small wrapper that uses
Thanks again,
Sounds good, closing for now.
Hi,
Thanks for a very useful package; it's been a very valuable workhorse for me. I just noticed the deprecation of read_gtf, a function I have really liked. I understand there are good reasons for its deprecation. Either way, do you by any chance know of any alternative way to read a GTF without having to go the route of transforming it to BED?
Kind regards,
Sigve