Good alternative to read_gtf? #427

Closed
sigven opened this issue Jan 14, 2025 · 6 comments

Comments

@sigven

sigven commented Jan 14, 2025

Hi,

Thanks for a very useful package; it's been a very valuable workhorse for me. I just noticed the deprecation of read_gtf, a function I have really liked. I understand there are good reasons for its deprecation. Either way, do you by any chance know of an alternative way to read a GTF without having to go the route of converting it to BED?

kind regards,
Sigve

@jayhesselberth
Member

jayhesselberth commented Jan 14, 2025

I'm not aware of another GTF parsing utility in an R package, but if you find one let us know.

It's easy enough to load a GTF with readr::read_tsv() but parsing the attributes is tedious.

I (and ChatGPT) took an initial stab in #428 but won't have time to test it for a while. The strategy the AI came up with to get the attributes doesn't seem to work.
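
As a rough illustration of the readr route mentioned above, the sketch below reads the nine standard GTF columns with readr::read_tsv() and pulls individual attributes out with a regular expression. The helper name gtf_attr(), the column names, and the two-record example file are assumptions made for illustration; this is not the code from #428 or anything shipped in valr. It also assumes that attribute keys contain no regex metacharacters and that each key appears at most once per record.

```r
library(readr)

# Column names for the nine tab-separated GTF fields (names chosen here
# for illustration; the last field holds the attribute string).
gtf_cols <- c("chrom", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")

# Hypothetical helper: extract the value of one attribute key
# (e.g. gene_id) from each attribute string; NA when the key is absent.
gtf_attr <- function(attributes, key) {
  pattern <- paste0('(?:^|;)\\s*', key, '\\s+"?([^";]*)"?')
  m <- regmatches(attributes, regexec(pattern, attributes, perl = TRUE))
  vapply(m, function(x) if (length(x) == 2) x[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# A tiny made-up GTF file so the example is self-contained.
gtf_path <- tempfile(fileext = ".gtf")
writeLines(c(
  "#!genome-build example",
  paste("chr1", "src", "gene", "100", "200", ".", "+", ".",
        'gene_id "G1"; gene_name "ABC";', sep = "\t"),
  paste("chr1", "src", "exon", "100", "150", ".", "+", ".",
        'gene_id "G1"; transcript_id "T1";', sep = "\t")
), gtf_path)

# quote = "" keeps the double quotes inside the attribute field intact;
# comment = "#" skips the header lines.
gtf <- read_tsv(gtf_path, col_names = gtf_cols, col_types = "ccciicccc",
                comment = "#", quote = "")

# Add one column per attribute of interest.
gtf$gene_id       <- gtf_attr(gtf$attributes, "gene_id")
gtf$transcript_id <- gtf_attr(gtf$attributes, "transcript_id")
```

Extracting only the keys you need keeps the parsing simple, but unlike a full parser it does not discover attribute names automatically or handle keys that repeat within one record.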

@sigven
Author

sigven commented Jan 14, 2025

OK, all good, and thanks for the swift response. It's indeed the attribute parsing that was elegant with valr::read_gtf. I'll have to make a workaround, I guess :).

best,
Sigve

@kriemo
Member

kriemo commented Jan 14, 2025

The read_gtf() function in valr was a very thin wrapper around a GTF reading function (import) from rtracklayer (available on Bioconductor). Shown below is the wrapper code if you'd like to restore functionality for your work.

We had to remove this code because CRAN found some potential errors in the C code included within rtracklayer. The errors were unrelated to the GTF reading code. rtracklayer is a heavily used dependency from Bioconductor, so it should be reliable to use for your work (as long as you don't include it in an R package on CRAN with compiled code).

read_gtf <- function(path, zero_based = TRUE) {
  # import the file as a GRanges object, then flatten it to a data frame
  gtf <- rtracklayer::import(path)
  gtf <- as.data.frame(gtf)
  # convert factor columns to character and use valr's `chrom` column name
  gtf <- dplyr::mutate_if(gtf, is.factor, as.character)
  res <- dplyr::rename(gtf, chrom = seqnames)

  # shift 1-based GTF starts to 0-based starts
  if (zero_based) {
    res <- dplyr::mutate(res, start = start - 1L)
  }

  tibble::as_tibble(res)
}

# get path to rtracklayer test GTF/GFF file
gtf_path <- system.file("tests/genes.gff3", package = "rtracklayer")

read_gtf(gtf_path)
#> # A tibble: 31 × 15
#>    chrom start   end width strand source  type  score phase ID    Name  geneName
#>    <chr> <int> <int> <int> <chr>  <chr>   <chr> <dbl> <int> <chr> <chr> <chr>   
#>  1 chr10 92827 95504  2677 -      rtrack… gene      5    NA Gene… TUBB8 tubulin…
#>  2 chr10 92827 95178  2351 -      rtrack… mRNA     NA    NA 873   TUBB8 <NA>    
#>  3 chr10 92827 95504  2677 -      rtrack… mRNA     NA    NA 872   TUBB8 <NA>    
#>  4 chr10 92827 94054  1227 -      rtrack… exon     NA    NA <NA>  <NA>  <NA>    
#>  5 chr10 92996 94054  1058 -      rtrack… CDS      NA    NA <NA>  <NA>  <NA>    
#>  6 chr10 94554 94665   111 -      rtrack… exon     NA    NA <NA>  <NA>  <NA>    
#>  7 chr10 94554 94665   111 -      rtrack… exon     NA    NA <NA>  <NA>  <NA>    
#>  8 chr10 94554 94615    61 -      rtrack… CDS      NA    NA <NA>  <NA>  <NA>    
#>  9 chr10 94554 94665   111 -      rtrack… CDS      NA    NA <NA>  <NA>  <NA>    
#> 10 chr10 94743 94852   109 -      rtrack… exon     NA    NA <NA>  <NA>  <NA>    
#> # ℹ 21 more rows
#> # ℹ 3 more variables: Alias <list>, genome <chr>, Parent <list>

Created on 2025-01-14 with reprex v2.1.1
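
For readers wondering what zero_based = TRUE changes, here is a minimal base R sketch of the coordinate shift. It uses the first row of the output above and assumes the original 1-based start was 92828, which is inferred from the default zero_based = TRUE rather than read from the file.

```r
# One GTF-style feature: 1-based start, both ends inclusive.
# (Start value inferred from the reprex output above.)
feat <- data.frame(chrom = "chr10", start = 92828L, end = 95504L)
feat$width <- feat$end - feat$start + 1L

# Shift to a 0-based start, as the wrapper does; the end is unchanged.
bed <- transform(feat, start = start - 1L)

# With 0-based, half-open coordinates the length is simply end - start,
# which matches the width column reported by rtracklayer.
stopifnot(bed$end - bed$start == bed$width)
```

Passing zero_based = FALSE to the wrapper leaves the starts as they appear in the file.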

@kriemo
Member

kriemo commented Jan 14, 2025

I also took a stab at implementing read_gtf using Rcpp. I have a functional prototype, but it was still 2x slower than rtracklayer's import (which is written in C) and couldn't handle all of the edge cases that rtracklayer has already solved (e.g., it also reads GFF format). I'd prefer to refer users to rtracklayer for the import functionality rather than adding more C++ code to maintain.

@sigven
Author

sigven commented Jan 14, 2025

Thanks a lot for the guidance. I was just now looking at the small wrapper that uses rtracklayer; it still works nicely, so all good here. I will only use it for data preparation, not in any CRAN package.

Thanks again,
Sigve

@jayhesselberth
Member

Sounds good, closing for now.
