**Table of Contents** *generated with DocToc*
- Instruction notebook content
- Data file management
- Development with `renv`
- Docker Image
- Automated Testing & Rendering
- Cheatsheets
## Instruction notebook content

Each notebook should begin with a "Learning objectives" section. This section contains a compact summary of the goals for the notebook, in the form of a bulleted list of objectives, preceded by the following header:
```markdown
## Objectives

This notebook will demonstrate how to:
```
Each element of the objective list should be a specific concept and/or skill we expect learners to come away from the notebook having gained. Phrase the objective with an "action verb" followed by a description of the skill. The Carpentries Curriculum Development Handbook section on learning objectives has some useful discussion about how to write effective learning objectives for courses like ours.
For most notebooks, 3-5 objectives should be sufficient.
The objectives list should be followed by a horizontal rule for visual distinction:

```markdown
---
```
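Putting these pieces together, an objectives section might look like the following (the objectives themselves are illustrative):

```markdown
## Objectives

This notebook will demonstrate how to:

- Import gene expression data with `readr`
- Filter low-count genes before further analysis
- Visualize expression distributions with `ggplot2`

---
```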
More to come, but for now, we should generally follow the style conventions we established in `refinebio-examples` for text.
All code chunks should be named. This is done to ease orientation and navigation during training.
Code chunks that will be blank at the start of a training session (for live coding) should be tagged with the `live = TRUE` argument.
These chunks will be stripped of code (but not full-line comments) when the `-live.Rmd` version of the notebook is created by the `make-live.R` script; see the example below.
This script is not usually run independently; it will usually run as part of the Make Live Notebooks GitHub Action via a `workflow_dispatch`.
That action will file a pull request with changes to any `-live.Rmd` notebooks and, optionally, rendered versions of the notebooks.
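For example, a named chunk set up for live coding might look like the following (the chunk name and code are illustrative):

````markdown
```{r summarize-counts, live = TRUE}
# calculate the total counts for each sample
# (full-line comments like this one are kept in the -live.Rmd version)
total_counts <- colSums(counts_matrix)
```
````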
We try to maintain good scholarship, citing our sources! Often our sources are vignettes or web pages, for which we usually link directly to the web page. When possible, we should include the author of the vignette in the text to give proper credit.
In the case of journal or preprint publications, we use the following citation conventions:
- For a single author: `([Author Date](url))`, with no reference to the journal except in the url (the url should be a DOI if possible).
- If there are two authors: `([Author1 and Author2 YEAR](url))`
- If 3 or more: `([Author1 _et al._ YEAR](url))`
- Citations in text: `As shown by [Author1 _et al._ (YEAR)](url).`
Each instruction notebook should run from start to finish, provided the previous notebook(s) in its module have been run. This means that modules can have dependencies on each other, but those dependencies should be satisfied when run in order. In general, input files that are not present in this repository will be linked as part of setting up the repository or user folder, as described in the Data file management section of this document.
An exception is when a notebook relies on the completion of tasks run outside the notebook (salmon mapping, for example).
In that case, any extra required files should be uploaded to S3 via the `syncup-s3.sh` script so that the notebook can run to completion during automated testing.
## Data file management

Each module should contain a `setup` directory that includes instructions and scripts to download and prepare any input data files used by the notebooks in the module.
For scripts designed to be run on the RStudio server (`rstudio.ccdatalab.org`), these input files will usually be placed in the `/shared/data/training-modules` directory and then linked for use, as described below.
When trainings are run from the RStudio server (`rstudio.ccdatalab.org`), we store large input data files in the `/shared/data/training-modules` directory.
This directory is the implicit "point of truth" for modules.
The organization within this directory should mirror the arrangement of the repository, so that we can easily link files and folders from this shared directory to mirrored paths within a clone of this repository.
Linking files from the shared directory to a cloned repository is done with the `scripts/link-data.sh` bash script, so this script should be kept up to date as any needed files are added to the `/shared/data/training-modules` directory.
When possible, link to enclosing directories rather than individual files to keep links simpler and allow users to browse a realistic directory context, but see an important caveat below.
Because this script is also used to set up directories for training, the links should not include all files needed for every notebook:
- Files that are created during a training session should not be included in this script.
- Directories that users will need to write to should not be links, or the user will not be able to write their own files.
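To illustrate, a link command in `scripts/link-data.sh` might look like the following (the module path and flags are illustrative); note that it links a whole directory rather than individual files, per the guidance above:

```sh
# link the shared copy of a module's data directory into the clone
# (run from the root of the repository)
ln -sfn /shared/data/training-modules/RNA-seq/data RNA-seq/data
```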
### Files stored on S3

To facilitate automated testing of training notebooks, all needed input files for training notebooks should be placed in the `ccdl-training-data` bucket on S3 and made publicly accessible.
This is facilitated by the `scripts/syncup-s3.sh` bash script, which includes the needed commands for upload/sync, and should include all directories and files needed to run the training notebooks.
You will need to set up your AWS credentials with `aws configure` before running the `syncup-s3.sh` script.
As input files are added or changed, those changes should be reflected in updates to `syncup-s3.sh`.
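As a sketch, the commands in `syncup-s3.sh` take this general form (the module path shown is illustrative):

```sh
# upload or update a module's input data in the public training bucket
aws s3 sync RNA-seq/data/ s3://ccdl-training-data/RNA-seq/data/
```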
## Development with `renv`

We use `renv` to manage R package dependencies for this project.
Using `renv` allows us to keep R packages in sync during development in multiple scenarios (on the RStudio Server, using the project Docker image, or even locally) and generates a lockfile (`renv.lock`) that we can use to install packages on the RStudio Server for participants when it's time for a workshop.
For `renv` to work as intended, you will need to work within the `training-modules` project.
Be careful not to open any module-specific `.Rproj` during development, as it will disrupt the `renv` environment!
The steps for development are:

1. Open up `training-modules.Rproj`.
2. If your library isn't set up or in sync with the lockfile, you'll be prompted to run `renv::restore()`. This will happen if you have just cloned the project or haven't been working within the Docker container in a while.
3. Develop as normal.
4. Run `renv::snapshot()` at the end of your session to capture the additional dependencies. Be careful if `renv::snapshot()` suggests removing packages! If there have been additions to `renv.lock` in another branch while you were working, you may need to run `renv::restore()` again before `renv::snapshot()`.
5. If there are dependencies you might want that are not captured automatically by `renv::snapshot()` (this may happen if a package is "recommended" by another, but not required), add them to `components/dependencies.R` with a call to `library()` and an explanatory comment, as in the sketch after this list. Then rerun `renv::snapshot()`.
6. Commit any changes to `renv.lock` and `dependencies.R`.
7. File a pull request with only the changes to `renv.lock` and `dependencies.R` before other changes (see the next section for tips on creating this PR). This is necessary because the automated render testing we do will fail if the Docker image has not been updated (see 'Pushing to Docker Hub via GitHub Actions' below).
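For example, an entry in `components/dependencies.R` might look like the following (the package shown is illustrative):

```r
# ggrepel is recommended (but not required) by our plotting packages,
# so load it explicitly to be sure renv::snapshot() captures it
library(ggrepel)
```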
Note that when you open `training-modules.Rproj`, the `.Rprofile` file ensures that the `renv` library is loaded and that the repositories in the `renv.lock` file are set with `options(repos = ...)`.
### Creating a `renv.lock`-only PR

For most cases, you can create your `renv.lock`-only changes PR by following these steps:

1. Create your `renv.lock`-only branch from the latest `master` branch.
2. In your `renv.lock`-only branch, check out the `renv.lock` file from your development branch (where you were generally doing steps 1-6 from the previous section) using `git checkout devbranch renv.lock`.
3. Commit the `renv.lock` changes you just checked out.
4. Push the changes and file your `renv.lock`-only PR.
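As a concrete sketch of steps 1-4 (the branch names are hypothetical):

```sh
git checkout master && git pull        # start from the latest master
git checkout -b update-renv-lock       # the renv.lock-only branch
git checkout my-dev-branch renv.lock   # take just the lockfile from the dev branch
git commit -m "Update renv.lock"
git push -u origin update-renv-lock    # then file the PR from this branch
```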
Note that the `renv::snapshot()` command will skip adding a package to `renv.lock` if it isn't used in a notebook or script.
If there are changes happening on multiple branches that require `renv.lock` changes, you may need to follow a slightly different set of steps:

1. Create your `renv.lock`-only branch from the latest `master` branch.
2. Run `renv::restore()`.
3. Install the packages needed on both branches (with `install.packages()` or similar).
4. Add those packages to `components/dependencies.R`.
5. Run `renv::snapshot()`.
6. Only commit the `renv.lock` changes to your branch.
7. Push the changes and file your `renv.lock`-only PR.
## Docker Image

We use the `renv.lock` file in this repository to install R dependencies on the image, per the `renv` documentation for creating Docker images.
Specifically, we copy the `renv.lock` file and run `renv::restore()`, which installs the packages in the lockfile into the Docker image's system library.
In practice, this means that you will not need to add individual R packages to the `Dockerfile`, but you may have to add system dependencies (e.g., via `apt-get`) required for the installation of those R packages.
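A minimal sketch of this pattern (the `apt-get` packages are illustrative; the actual `Dockerfile` may differ):

```dockerfile
# system libraries that R packages may need to compile against (illustrative)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcurl4-openssl-dev \
    libxml2-dev

# copy the lockfile and restore its packages into the image's system library
COPY renv.lock renv.lock
RUN Rscript -e "install.packages('renv')" && \
    Rscript -e "renv::restore()"
```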
To use the Docker image for development, pull from Docker Hub with:
```sh
docker pull ccdl/training_rstudio:edge
```
To run the container and mount a local volume, use the following from the root of this repository:
```sh
docker run \
  --mount type=bind,target=/home/rstudio/training-modules,source=$PWD \
  -e PASSWORD=<PASSWORD> \
  -p 8787:8787 \
  ccdl/training_rstudio:edge
```

replacing `<PASSWORD>` with the password of your choice.
You can then navigate to `localhost:8787` in your browser and log in with username `rstudio` and the password you just set via `docker run`.
To work on the project, you should then open `training-modules/training-modules.Rproj` on the RStudio server, then proceed as in the typical development workflow.
### Pushing to Docker Hub via GitHub Actions

When a pull request changes either the `Dockerfile` or `renv.lock`, a GitHub Actions workflow (`build-docker.yml`) will run to test that the image builds successfully.
When a pull request is merged into `master`, the `build-docker.yml` GitHub Actions workflow will be triggered, and the project Docker image will be rebuilt and pushed as `ccdl/training_rstudio:edge`.
When a new Git tag is created, the `build-docker.yml` workflow will also be triggered and will push a version of the image to Docker Hub as `ccdl/training_rstudio:<tag>`.
The most recent tagged version of the image will also be tagged as `ccdl/training_rstudio:latest`.
## Automated Testing & Rendering

We perform spell checking for every pull request to `master` as part of a GitHub Actions workflow (`spell-check.yml`); it is designed to fail if any spelling errors are detected.
You can see which errors were detected in `stdout` for the `Run spell check` step of the workflow.
This workflow uses a script, `scripts/spell-check.R`, to spell check `.md` and completed `.Rmd` files.
The custom dictionary we use for the project can be found at `components/dictionary.txt`.
To run spell check locally, you can run the following from the root of the repository:

```sh
Rscript --vanilla scripts/spell-check.R
```

The spelling errors will be listed in `spell_check_errors.tsv` in the root of the repo; this file is ignored by git.
Every pull request to `master` that changes `.Rmd` files (or one of the rendering scripts) will be tested via a GitHub Actions workflow (`render-rmds.yml`) to ensure that the `.Rmd` files can be run successfully.
This action first downloads input files for the notebooks from S3, so if there are changes to the input files, these should be made first, with associated changes as needed to `syncup-s3.sh` (see Files stored on S3).
After a pull request with changes to notebook files has been merged to `master`, we use the `make-live.yml` workflow to render current versions of the notebooks to HTML and to make the `-live.Rmd` versions of the files for training sessions.
This workflow then files a PR to `master` with the rendered and live files.
`make-live.yml` is currently triggered manually, but will likely change to running automatically on each PR with changes to notebook files in the near future.
## Cheatsheets

Training modules have corresponding cheatsheets in `module-cheatsheets`.
When choosing documentation links to incorporate in cheatsheets, we prefer to use https://rdrr.io/ when possible for Base R and Bioconductor functions, and we prefer to use https://www.tidyverse.org/ for `tidyverse` functions.
Cheatsheets are written in plain markdown and are converted to a shareable PDF format using the Node.js package `mdpdf`, with the default PDF style.
To render these PDFs, you will therefore have to first install `npm`, the Node.js package manager.
You can install Node.js and `npm` using Homebrew with `brew install node`, or you can install into a Conda environment with `conda install nodejs`.
With Node.js installed, you can install the `mdpdf` package into the default Node.js library with:

```sh
npm install -g mdpdf
```

In addition, cheatsheet tables of contents are created with the Node.js library `doctoc`, which can similarly be installed globally with:

```sh
npm install -g doctoc
```
To re-render a cheatsheet to PDF after making desired changes in its markdown, take the following steps:

1. Navigate in the terminal to the `module-cheatsheets` directory.
2. Run `doctoc` on the file to update its table of contents: `doctoc cheatsheet-file.md`
   - Alternatively, you can use a tool such as the VS Code extension Markdown All in One. Either way, please ensure that the table of contents is not duplicated.
3. Convert the markdown file to an updated PDF version: `mdpdf cheatsheet-file.md`
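Putting the two commands together, updating a single cheatsheet might look like this (the filename is illustrative):

```sh
cd module-cheatsheets
doctoc RNA-seq-cheatsheet.md   # refresh the table of contents in place
mdpdf RNA-seq-cheatsheet.md    # writes RNA-seq-cheatsheet.pdf next to the markdown
```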