Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

window blocking performance #70

Open
bengoehring opened this issue Apr 20, 2023 · 1 comment
Open

window blocking performance #70

bengoehring opened this issue Apr 20, 2023 · 1 comment

Comments

@bengoehring
Copy link

Hello,

Thank you for making and maintaining such a helpful package.

I am reaching out with a conceptual question about the window blocking option in blockData --- and a possible performance improvement suggestion. This all stems from trying and failing to window block a dataset with a few million rows and a dataset with about 10 million rows using a cluster with 10 cores and 180GB of RAM. It timed out after 24 hours.

Based on the documentation of window blocking (i.e., "a given observation in dataset A will be compared to all observations in dataset B where the value of the blocking variable is within ±K of the value of the same variable in dataset A"), I would expect the window blocking option to return a list of N lists --- where N refers to the number of unique values of the window blocking variable in dataset A. Each of the N lists will then contain two vectors of indices. The first vector will include the indices of dataset A where the window blocking variable equals the nth unique value of the window blocking variable in dataset A. The second vector will include the indices of dataset B where the window blocking variable is +/- K the nth unique value of the window blocking variable in dataset A.

I hope that makes sense.

It appears, however, that the window blocking option is doing something different. For instance, If I run:

library(tidyverse)
library(microbenchmark)
library(fastLink)

# make test datasets for window blocking
test_a <- diamonds
test_b <- arrange(diamonds, cut)

window_fast <- blockData(test_a, 
                         test_b,
                         varnames = "depth",
                         window.block = 'depth')

length(unique(test_a$depth))
#84

length(window_fast)
#3390

The number of separate blocks (3390) is much higher than I would expect (184). Would you be able to expand upon where I am misunderstanding?

I am guessing I am just misunderstanding something, but if the logic above is (miraculously) correct, I went ahead and implemented it in a separate function. It appears that it outperforms the default window blocking option in terms of speed (~50 times faster in this example). Please just let me know If I am onto something and you would like me to submit a pull request. My apologies if I am totally off base with this!!

my_window_blocking <- function(data_a,
                               data_b,
                               window_blocking_var,
                               window_size = 1) {
  
  # return the vector containing the window blocking variable in each dataset 
  data_a_window_values <- pull(select(data_a,
                                      {{window_blocking_var}}))
  data_b_window_values <- pull(select(data_b,
                                      {{window_blocking_var}}))
  
  # unique values of the first datasets window blocking variable for looping
  data_a_unique_vals <- sort(unique(data_a_window_values))
  
  # find a and b indices 
  out_a_indices <- vector('list',
                          length = length(data_a_unique_vals))
  out_b_indices <- vector('list',
                          length = length(data_a_unique_vals))
  
  for(i in 1:length(data_a_unique_vals)) {
    # a indices
    out_a_indices[[i]] <- which(data_a_window_values == data_a_unique_vals[i])
    
    # b indices
    min_window_value <- data_a_unique_vals[i] - window_size
    max_window_value <- data_a_unique_vals[i] + window_size
    
    out_b_indices[[i]] <- which(data_b_window_values %in% seq(min_window_value, 
                                                              max_window_value))
  }
  
  # return final data
  final_list <- vector('list',
                       length = length(data_a_unique_vals))
  
  for (k in 1:length(data_a_unique_vals)) {
    final_list[[k]][[1]] <- out_a_indices[[k]]
    final_list[[k]][[2]] <- out_b_indices[[k]]
    
    names(final_list[[k]]) <- c('dfA.inds', 
                                'dfB.inds')
  }
  return(final_list)
}

microbenchmark(my_window_blocking(test_a, 
                                  test_b,
                                  depth),
               blockData(test_a, 
                         test_b,
                         varnames = "depth",
                         window.block = 'depth'),
               times = 10)

Thank you for your time and all of your hard work maintaining this great package. It is much appreciated.

Best,
Ben

@tedenamorado
Copy link
Collaborator

Thanks so much for sharing this with us! We will try your function but feel free to make a pull request.

In blockData we are using a function that goes after not the unique values but the unique pairs of values, that is why we end up with more windows, but blocking operations are linear, not quadratic, so your approach is better.

We are revising many of our functions and I will make sure this issue is addressed in our next release.

As always, if anything, do not hesitate to let us know.

Ted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants