Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow conversion between data.tree and Newick format #92

Open
cysouw opened this issue Apr 15, 2017 · 4 comments
Open

Slow conversion between data.tree and Newick format #92

cysouw opened this issue Apr 15, 2017 · 4 comments

Comments

@cysouw
Copy link

cysouw commented Apr 15, 2017

Is there a way to speed up this conversion, specifically with ToNewick?

library(ape)
data(bird.families)
system.time(bf <- as.Node(bird.families))
system.time(n <- ToNewick(bf))
@gluc
Copy link
Owner

gluc commented Apr 20, 2017

That is indeed a bit slow :-(
Will look into it at some point, but no promises on when. Very open for pull requests though!

@cysouw
Copy link
Author

cysouw commented Apr 22, 2017

Apparently, recursive programming is not ideal inside R :-(, e.g. http://dirk.eddelbuettel.com/blog/2011/09/08/#rcpp_for_recursion

So, there should be some workaround possible ("memoization") but that would sort-of destroy the whole approach of the data.tree package. So, pass everything to C?

@gluc
Copy link
Owner

gluc commented May 5, 2017

As we are talking a few hundred nodes, recursion is not the problem. Rather, the current DefaultPlotHeight implementation is recursive, and on top it's called recursively by ToNewick. This, of course, is highly sub-optimal.

If you don't need the height in your Newick, the simplest solution is to avoid it, like so:

n <- ToNewick(bf, heightAttribute = NULL)

If you need the plot height, a workaround is to cache the plot height attribute on the nodes, like so:

library(ape)
library(data.tree)
data(bird.families)
bf <- as.Node(bird.families)

SetPlotHeight <- function(node, rootHeight = 100) {
  
  #traverse from leaves towards root to calculate the height and store it in height2
  #Note: cannot call it height, as that already exists on `Node`
  Set(node$leaves, height2 = 1)
  node$Do(function(x) x$height2 <- Aggregate(x, "height2", max) + 1, traversal = "post-order", filterFun = isNotLeaf)
  
  #traverse from root to leaves to calculate the plotHeight
  #(where plotHeight uses the semantics of dendrogram/phylo, where leaves should have height 0 and root height = rootHeight. This meaning is not the same as in the data.tree package)
  node$plotHeight <- rootHeight
  node$Do(function(x) x$plotHeight <- x$parent$plotHeight * (1 - 1 / x$height2), filterFun = isNotRoot)
}

SetPlotHeight(bf)

Then, instead of providing a heightAttribute function, provide the name of the cached attribute:

n <- ToNewick(bf, heightAttribute = "plotHeight")

I need to figure out whether there is a clean way to fix this for good in the DefaultPlotHeight function, thought I doubt it. At least it should be documented in the ToNewick, as.dendrogram.Node, and as.phylo.Node

@geekyi
Copy link

geekyi commented Mar 2, 2020

This is exactly what I was looking for!! This cut down the time to convert a tree with 5971 tips and 1631 internal nodes, from several(!!) minutes to just 2 secs!:

If you don't need the height in your Newick, the simplest solution is to avoid it

Perhaps this is the cause of slowdowns in other conversions like ToDataFrameTable as well? cf. this gist. I'm wondering about the slowness in conversion (of seconds or more) even after pruning to a depth of 2-6. Whereas just almost any other operation like leafCount for example is almost instant for the whole tree.

For context, I'm trying to convert a json file via data.tree to tidytree format to do some visualizations. Any other alternative would also be very helpful.

In the long term, I wonder how hard it would to implement some parts in rcpp? I'm hoping this can be my go to package for hierarchical data. There really is a lack of tools in R for stuff like json. I'd be willing to spend some time if needed to work on this. Since I really need a way to work with and visualize large json structures, any suggestions for alternatives is fine too..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants