Feature/issue 102 allow specify nt features and nt distances with partial name rather than a vector #114

venpopov
Owner

Summary

Allow users to specify model variable arguments such as nt_features and nt_distances via regular expressions.

  • Transform check_model into an S3 method
  • Add a regex argument to the mixture3p and IMM models
  • Add an attribute 'regex_vars' to the models, which lists the variables that can be specified via regular expressions
  • The check_model.bmmmodel method calls a new function, replace_regex_variables(model, data). If regex is FALSE, NULL, or missing, the function simply returns the model object, which keeps it backwards compatible with all models regardless of whether they have a regex argument. If regex = TRUE, the function checks which variables can be specified via regular expressions and then looks for matching variables among the column names of the data
  • Update vignettes and examples to document the new option
  • Add tests for the new functionality
  • Fix a mismatch between the parameters and the last figure of the IMM vignette
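
The lookup described in the list above can be pictured roughly as follows. This is a minimal base R sketch, not the code from this PR: `replace_regex_sketch()` and the list-based model object with its `vars`, `regex_vars`, and `regex` fields are hypothetical stand-ins. Only the idea of a `regex_vars` entry and the pass-through when regex is not TRUE come from the description.

```r
# Hypothetical sketch: expand regex-specified variables against column names.
replace_regex_sketch <- function(model, data) {
  # Return the model untouched when regex is FALSE, NULL, or absent
  if (!isTRUE(model$regex)) {
    return(model)
  }
  for (var in model$regex_vars) {
    pattern <- model$vars[[var]]
    # grep() returns the matches in the order the columns appear in the data
    model$vars[[var]] <- grep(pattern, colnames(data), value = TRUE)
  }
  model
}

# Toy data and a toy model object (both assumptions for illustration)
data <- data.frame(item1_color = 0, item2_color = 0, item3_color = 0, dev_rad = 0)
model <- list(
  vars = list(resp_err = "dev_rad", nt_features = "item.*_color"),
  regex_vars = "nt_features",
  regex = TRUE
)

expanded <- replace_regex_sketch(model, data)
expanded$vars$nt_features
```

In this sketch, a model without a `regex` field goes through unchanged, which mirrors the backwards-compatibility point made in the summary.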

Example usage:

data <- OberauerLin_2017
model <- IMMfull(resp_err = "dev_rad",
                 nt_features = 'col_nt',
                 nt_distances = 'dist_nt',
                 setsize = 'set_size',
                 regex = TRUE)

or

data <- OberauerLin_2017
model <- IMMfull(resp_err = "dev_rad",
                 nt_features = 'col_nt[1-7]',
                 nt_distances = 'dist_nt[1-7]',
                 setsize = 'set_size',
                 regex = TRUE)

or

data <- data.frame(item1_color = runif(10,-pi,pi),
                   item2_color = runif(10,-pi,pi),
                   item3_color = runif(10,-pi,pi),
                   dev_rad = rnorm(10,0,0.2))

model <- mixture3p(resp_err = 'dev_rad',
                   nt_features = 'item.*_color',
                   setsize = 4,
                   regex = TRUE)

Tests

[x] Confirm that all tests passed
[x] Confirm that devtools::check() produces no errors

Release notes

Collaborator

@GidonFrischkorn GidonFrischkorn left a comment

Generally, the changes look good and everything runs on my machine, too. However, I noticed that the regex option can also cause problems. For example, if I fit the model only to setsize = 4 in the OberauerLin_2017 data, I would have to delete all col_nt and dist_nt variables for non-targets 4 to 7. With paste0("col_nt", 1:3), this does not cause problems.

Maybe we can mention this when introducing the regex option: it only works if all variables that follow a given naming scheme are used in the model specification. Otherwise, as in the example above, I get an error that there are more nt_features than max(setsize) - 1.

I think this is not too big of an issue, but users might still be confused about it when trying to run models only on specific conditions without adapting their dataset.

@GidonFrischkorn
Collaborator

GidonFrischkorn commented Feb 22, 2024

I ran one more test and found another issue. When the variables are not sorted in the same order in the data, the regex syntax does not put them in the correct order. As shown in the example below, if the variables are scrambled, the nt_features variables are misaligned with the nt_distances variables: for example, mu2 ~ col_nt1, but theta2 uses dist_nt2 in the formula. I assume very few users would do this, as it would make for a pretty chaotic data set, but if possible it would be good to check whether there is a solution.

[Two images showing the example; not preserved in this text]
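
The misalignment described in this comment can be reproduced in miniature with base R. This is only a sketch under the assumption that matched variables are taken in the order the columns appear in the data, as `grep()` does; the column names are stand-ins modelled on the ones mentioned in the thread, not the actual test data.

```r
# Stand-in column names: feature columns scrambled, distance columns in order
cols <- c("col_nt2", "col_nt1", "col_nt3", "dist_nt1", "dist_nt2", "dist_nt3")

# Matches come back in column order, so the scrambling carries over
features  <- grep("col_nt", cols, value = TRUE)
distances <- grep("dist_nt", cols, value = TRUE)

# Pairing by position now puts col_nt1 next to dist_nt2
pairs <- data.frame(features, distances)
pairs
```

The second row pairs col_nt1 with dist_nt2, which is the same kind of mismatch as the mu2 and theta2 formulas described above.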

@venpopov
Owner Author

> Generally, the changes look good and everything runs on my machine, too. However, I noticed that the regex option can also cause problems. For example, if I fit the model only to setsize = 4 in the OberauerLin_2017 data, I would have to delete all col_nt and dist_nt variables for non-targets 4 to 7. With paste0("col_nt", 1:3), this does not cause problems.
>
> Maybe we can mention this when introducing the regex option: it only works if all variables that follow a given naming scheme are used in the model specification. Otherwise, as in the example above, I get an error that there are more nt_features than max(setsize) - 1.
>
> I think this is not too big of an issue, but users might still be confused about it when trying to run models only on specific conditions without adapting their dataset.

I thought about that too, but this is not a problem if you specify the regex with a limited number range. E.g., this will happen if you call

mixture3p(resp_err = 'dev_rad', nt_features = 'col_nt', setsize = 4, regex = TRUE)

but not if you call

mixture3p(resp_err = 'dev_rad', nt_features = 'col_nt[1-3]', setsize = 4, regex = TRUE)

This example is included in the vignettes. In general, we offer the regex option, but it's not our task to educate people about regex. It is there for those who want it and know how to use it, as in any other package that offers regex arguments (for example, a few functions in brms, posterior and tidybayes).
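
The difference between the two patterns can be checked against a plain vector of column names. The names below are stand-ins that follow the naming scheme mentioned in this thread (col_nt and dist_nt columns for seven non-targets), and base R `grep()` is used only for illustration; the package may match names differently.

```r
# Stand-in column names following the naming scheme discussed in the thread
cols <- c("dev_rad", "set_size", paste0("col_nt", 1:7), paste0("dist_nt", 1:7))

# The bare prefix matches all seven non-target feature columns
all_nt <- grep("col_nt", cols, value = TRUE)

# A character class restricts the match to the first three
first_three <- grep("col_nt[1-3]", cols, value = TRUE)

length(all_nt)
first_three
```

Note that the pattern is unanchored, so with ten or more columns `col_nt[1-3]` would also match a name such as col_nt10; ending the pattern with `$` avoids that.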

@venpopov
Owner Author

venpopov commented Feb 22, 2024

> I ran one more test and found another issue. When the variables are not sorted in the same order in the data, the regex syntax does not put them in the correct order. As shown in the example below, if the variables are scrambled, the nt_features variables are misaligned with the nt_distances variables: for example, mu2 ~ col_nt1, but theta2 uses dist_nt2 in the formula. I assume very few users would do this, as it would make for a pretty chaotic data set, but if possible it would be good to check whether there is a solution.

Good point. A few thoughts:

  • As you pointed out, this is an edge case that will rarely occur. People would have to have this unusual ordering in the dataset AND use regular expressions.
  • Any way we try to handle this might introduce other problems. I thought that we might sort the variable names, but that only works if they use coherent numbering, and it fails if they have more than 10 set sizes coded as "col_1", "col_2", ... "col_11", because the sort will put col_11 between col_1 and col_2. Another option is to extract the numbers and sort by them, but there are so many ways people might specify the numbers that it would be a nightmare and virtually impossible to handle them all, especially if the names contain other numbers that do not index the items. And even if we could handle all of that, in some edge cases there may be a reason people ordered the columns that way, and we would overwrite their choice.
  • I also thought about just giving a warning if we detect this issue, but the detection would suffer from the same problems as above.
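
Both sorting pitfalls mentioned in the list above are easy to see in base R. This is a small illustration with made-up column names, not code from the package; `method = "radix"` is used only so the character sort does not depend on the locale.

```r
# Made-up column names with more than ten items
cols <- paste0("col_", 1:11)

# A plain character sort puts col_10 and col_11 before col_2
sorted_chr <- sort(cols, method = "radix")
sorted_chr[1:4]
# "col_1" "col_10" "col_11" "col_2"

# Extracting the digits and ordering numerically works for this scheme ...
idx <- as.integer(gsub("[^0-9]", "", cols))
sorted_num <- cols[order(idx)]

# ... but breaks as soon as the names contain other digits
tricky <- c("exp2_col_1", "exp2_col_10")
tricky_idx <- as.integer(gsub("[^0-9]", "", tricky))
tricky_idx
# 21 210
```

The last result shows why number extraction is hard to make general: the "2" from the prefix is fused into the item index.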

So, given that this is potentially a very rare case and almost impossible to handle exhaustively, my vote is to leave the responsibility with the user. Maybe we can add a note in the @param section of those models saying that the order of the columns passed to nt_features and nt_distances should match the corresponding items.

@GidonFrischkorn
Collaborator

I see your points and agree that both the cases I highlighted will likely be rare. So, let's merge this for now and if additional problems pop up the more we use the feature we can re-evaluate if we need to change something about this.

@GidonFrischkorn GidonFrischkorn merged commit 5bc2f8b into develop Feb 22, 2024
3 checks passed
@GidonFrischkorn GidonFrischkorn deleted the feature/issue-102-allow-specify-nt_features-and-nt_distances-with-partial-name-rather-than-a-vector branch February 22, 2024 14:47
Successfully merging this pull request may close these issues.

allow specify nt_features and nt_distances with partial name rather than a vector?