Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use StatsAPI #236

Open
Tracked by #212
alyst opened this issue Jan 15, 2025 · 4 comments
Open
Tracked by #212

Use StatsAPI #236

alyst opened this issue Jan 15, 2025 · 4 comments

Comments

@alyst
Copy link
Contributor

alyst commented Jan 15, 2025

One problem that I have run into is that both SEM and StatsAPI define params(), so if the user is using both in the same session (StatsAPI e.g. via Distributions), there would be a collision.
One solution is just to import StatsAPI and use their params() method.
(The context is very similar, however params() in StatsAPI is supposed to return the actual values of the parameters, whereas SEM one returns their names. There is also StatsAPI.coefnames() method.)

The other methods of StatsAPI could be reused as well, e.g. StatsAPI.dof() (instead of SEM.df()), pvalue(), "fit()*, etc.
Overall, that would make SEM.jl better integrated into the Julia's ecosystem of statistical models.

22 tasks
@aaronpeikert
Copy link
Collaborator

@alyst I want to get a bit more involved/helpfull in SEM.jl and this seems like a issue I could tackle. Any PR of me would probably require a careful eye. Should I take a stab at it?

@alyst
Copy link
Contributor Author

alyst commented Jan 24, 2025

@aaronpeikert Sure, that would be great! I can help you if you need my review.

@aaronpeikert
Copy link
Collaborator

aaronpeikert commented Jan 25, 2025

OK this seems a bit complicated and some substantive decisions need to be made:

  • StatisticalModel --> abstract type AbstractSem <: StatsAPI.StatisticalModel end
  • RegressionModel --> dont use. all methods except for fitted dont make any sense (assume a single dependent variable)
  • params: StatsAPI defines this as all parameters, SEM.jl as all identifiers --> conform to StatsAPI
  • coef: StatsAPI defines this as all parameters that are required for prediction (e.g., it excludes residual variance for a linear model); distinction to params makes no sense in SEM --> define as synonym
  • coeftable --> synonym to parametertable, requires confint
  • confint --> implement CI calculation, maybe only availible when Distribtions is loaded via package extention
  • coefnames: StatsAPI's name for SEM.jl's params --> conform to StatsAPI
  • paramsnames: does not exist in StatsAPI but should exist if the define coef === params --> synonym to coefnames
  • isfitted: Was the model fitted to the data? --> implement
  • fitted --> check isfitted and return coef(model) otherwise argerror
  • predict --> implement as synonym for imply
  • fit what was again the reason for sem_fit instead of fit --> rename sem_fit
  • pvalue --> come up with implementation that does not depend on Distributions.jl
  • loglikelihood(model::StatisticalModel, observation) the contribution of a single sample (here called observation). Not sure this is easy.
  • nullloglikelihood what definition of null model do we use? Is is trivial to autogenerate?
  • score: gradient of the log-likelihood with respect to the coefficients --> implement
  • nobs --> Throw argument error, we use nsample
  • dof --> rename df
  • mss --> only return for least square loss otherwise argument error
  • informationmatrix --> return if hessian is availible
  • vcov Return the variance-covariance matrix for the coefficients of the model. --> return if hessian is availible
  • BIC, AIC, AICc --> only the last needs to be newly implemented?
  • islinear --> dispatch on imply should suffice
  • ALL other functions defined for StatisticalModel should throw a MethodError

Should I go ahead with this plan?

@alyst
Copy link
Contributor Author

alyst commented Jan 25, 2025

It is a nice plan! Some comments:

  1. StatsAPI.jl support could be implemented in phases. We can classify the items according to these phases
    • The phase 1 would be to resolve current name clashes, like params, because if I'm using params from SEM.jl in my script/package, and then I add StatsAPI.jl (or one of the packages that re-export it), the script/package cannot be compiled.
    • The phase 2 would be to rename existing SEM.jl functions to match StatsAPI.jl or add synonyms to the StatsAPI.jl functions (e.g. SEM.df does not clash, but it could be StatsAPI.dof)
    • The phase 3 would be to implement as much of StatsAPI.jl functions as possible (where it makes sense).
  2. Some specific items:
  • AbstractSem <: StatisticalModel -- would be nice if some SEM.jl types are derived from StatsAPI, but there is also Sem and SemFit, so that may require more considerations.
    It looks like StatsAPI assumes that fitting modifies the existing statistical model (maybe that is inspired by R). That's not ideal from the flow of information PoV and complicates both the internal and the user script logic.
    I like how SEM.jl distinguishes between model definition Sem and fitted model SemFit, so SEM.jl may diverge from StatsAPI in that aspect.
  • SEM.params -> SEM.paramnames or SEM.coefnames: renaming to paramnames makes sense to me. The internals of SEM.jl (e.g. ParamsMatrix) do not rely on parameter IDs, rather their indices.
  • coef as params synonym: that's "phase 2", so we can decide on it later
  • confint -- would be nice to have, esp. since the manuscript tutorial has the user code to calculate it, but it would be more convenient to have it in the package
  • isfitted: Sem (as the model definition) can return false or throw an exception that one has to run fit(Sem); isfitted(SemFit) would be always true
  • fit: I like the idea that fit(Sem) returns SemFit
  • pvalue: why it should be independent from Distributions.jl?
  • score: phase 3
  • loglikelihood/nullloglikelihood: that's phase 3
  • nobs: could be synonym to nsamples (the reason SEM.jl uses nsamples is to disambiguate from observed variables). SEM.jl tutorials may use nsamples only, but the code may allow nobs for those familiar with StatsAPI
  • AICc: that's phase 3
  • predict: I think predict(SemFit, data) could be used to predict latent variable values for each sample. In fact, the support for predict is in my staging branch. imply/implied is SEM-specific, I think it should not be synonymous to predict.
  • throw MethodError for other functions: that would be necessary if SEM.jl types are derived from StatisticalModel, but this decision could be postponed until phase 3.
  1. It will also be nice to get the feedback from the StatsAPI.jl maintainers. While they may not have considered use-cases like SEM.jl originally, they may have some ideas how StatsAPI.jl could be tweaked to support it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants