Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] [R] Replace vignettes and examples #11123

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

david-cortes
Copy link
Contributor

ref #9810
closes #10746

This PR adds a new introductory vignette which replaces most of the previous ones, and modifies the code examples throughout functions aimed at interactive usage to call xgboost() instead of xgb.train().

Motivation

Since the time that XGBoost was first published at CRAN, its adoption and mindshare have risen substantially, to the point that it has become the standard when it comes to boosted decision trees. In this day and age, I don't think the package needs to provide any introduction to the concepts of gradient-boosting, cross-validation, evaluation metrics, and so on - people who use R are already going to be familiar with those, and the things it compares against (like the package 'gbm') have become obsolete by now.

As well, the documentation and tutorials for XGBoost have mostly moved to the online docs - any R-specific documents become outdated rather soon, and are less likely to be seen by a random user. Most of the python examples and guides should in any event work with the R interface with very minimal modifications like dict->list.

Apart from becoming a standard-use library, the features supported by XGBoost have expanded over time, and lots of the materials that were there before, such as the first vignette, contained tips that are not applicable to the current state of the library, like manually one-hot encoding categorical features.

Hence, I decided to remove the previous vignettes and create a new one from scratch, which contains only examples around the usage of the R interface and its conventions.

Help needed

It would be ideal if this vignette could also get added to the online docs.

Thus, I created the vignette as a quarto file (.qmd), which has the option to render to both .html (what CRAN hosts) and .md (which can be included in the .rst files).

Only, getting it to render to .md required building the vignette with jupyter instead of knitr, which in turn requires installs of python, jupyter, ipykernel, and the "ir" kernel that runs R in jupyter, plus registering that kernel in the user-level config for jupyter. By adding that line "jupyter: ir", it additionally makes the default quarto render (e.g. as used by the "knit" button in RStudio) build the .html vignette using jupyter instead of knitr, which is most definitely not going to work out in CRAN servers. I don't know how to solve this.

Would also be nice if some CI job could be auto-building the .md file for the online docs from the .qmd source of the vignette.

(CCing @mayer79 and @jameslamb )

@jameslamb
Copy link
Contributor

jameslamb commented Dec 28, 2024

Thanks for the @. I haven't reviewed the content of this PR closely so am limiting my comment only to the topic of the "Help Needed" part above.

I know nothing about quarto or how to integrate it with CRAN's vignette-building and vignette-checking. Maybe these examples of other packages using it successfully on CRAN could lead to some good reference implementations: https://github.com/search?q=org%3Acran%20path%3A*.qmd&type=code

I can at least tell you how we're addressing these things LightGBM, maybe it will be helpful.

In {lightgbm}, we have a lightweight system that has been working well for us (which you alluded to in #9906 (comment)). Here's how we do this there:

  1. vignettes are written in .Rmd files with headmatter like the following, using {markdown} + {knitr}
output:
  markdown::html_format:
    options:
      toc: true
      number_sections: true
vignette: >
  %\VignetteIndexEntry{Basic Walkthrough}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}

(code link)

  1. The main docs site is built via Sphinx, and the Sphinx build builds a {pkgdown} site with the roxygen docs + vignettes + demos (code link)
  2. those docs are integrated into the readthedocs site using an absolute hyperlink (code link)
  3. these are linked from the "R API" nav item at https://lightgbm.readthedocs.io/en/latest/index.html, which directs to https://lightgbm.readthedocs.io/en/latest/R/reference/

It's a little awkward that the R docs are not technically part of the readthedocs site, but overall this has worked pretty well for us. Some notes:

  • {markdown} + {knitr} is a lighter-weight combination than using {rmarkdown} or {quarto} (fewer required dependencies to install)
  • there are many examples of using .Rmd files + this combination on CRAN, so staying CRAN-compliant requires very little ongoing maintenance
  • {pkgdown} is customizable enough to meet our needs, and it keeps up with changes to the RMarkdown format

Even if you don't pursue this specific mix, I do recommend not checking the .md version of the vignette into source control here. Instead, you could put any commands to generate it as part of the docs-site building into doc/conf.py or doc/Makefile. I shared a link above to how to run custom commands at build-time via conf.py. Here's an example of how to do it via the Makefile generate by sphinx: https://github.com/rapidsai/legate-boost/blob/ba062bf666845f82f5e8ab4690c97db4469ab42f/docs/Makefile#L17-L29

@david-cortes
Copy link
Contributor Author

Thanks for the suggestions, but while LightGBM's solution works in the sense of providing rendered artifacts, I don't think it's an ideal solution here - the doc page URL differs from the Sphinx one and is not as easily indexable/searchable as an embedded .md or .ipynb file in the same sphinx page.

I've added a script and instructions to update the .md file here, although I'm not yet 100% sure that the .qmd vignette with the settings it has would render at CRAN without errors. Will hand over to @trivialfis from here. If this is still too complicated, maybe the logic could be switched to having the vignette as .Rmd and the script modifying the header and extension to .qmd to render the .md instead.

Still waiting from comments from @mayer79 if there are any.

@trivialfis
Copy link
Member

Let me update the branch to prevent CI hang on mgpu tests. It's a AWS instance.

@jameslamb
Copy link
Contributor

I don't think it's an ideal solution here - the doc page URL differs from the Sphinx one and is not as easily indexable/searchable as an embedded .md or .ipynb file in the same sphinx page.

Sure, those are good points. You have to balance whether those benefits you listed are worth the added build complexity, heavier set of dependencies, and increased risk of CRAN rejections. Not my call to make here in xgboost, and I don't have anything else to add.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R] Introductory vignette is outdated
3 participants