How to use R on the Bioinformatics cluster
Correct social behaviour expected
DO NOT run jobs on the front-end (login) servers: this disturbs other users. Please always use sbatch or srun.
This also applies to the Positron editor.
Before contacting support, READ THE FAQ and Tutorials.
Objective
This tutorial aims at describing how to run R scripts and compile RMarkdown files on the Toulouse Bioinformatics cluster.
To do so, you need to have an account. Ask for an account if needed.
You can then connect to the cluster using the ssh command on Linux and macOS, and using MobaXterm on Windows.
Similarly, you can copy files between the cluster and your computer using the scp command on Linux and macOS, and using OpenSSH on Windows. The login address is genobioinfo.toulouse.inrae.fr.
Once you are connected, there are two ways to run a script: in batch mode or in an interactive session. A script must never be run on the first server you connect to. Also note that programs on the cluster are not available until you have loaded the corresponding module. Module management is explained in the next section.
Use of modules
All programs are made available by loading the corresponding modules. These are the main useful commands to work with modules:
- module avail: list all available modules
- search_module <TEXT>: find a module by keyword
- module load <MODULE_NAME>: load a module (for instance, to load R: module load statistics/R/4.3.0). This command is either used directly (in interactive mode) or included in the file that is used to run your R script in batch mode (see below)
- module purge: unload all previously loaded modules
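As a rough illustration, a typical module session might look like the sketch below. It uses the statistics/R/4.3.0 module mentioned above; `module list`, which shows the currently loaded modules, is a standard command of the module system that is not covered in this tutorial. The guard simply skips the commands on machines where `module` is not defined.

```shell
#!/bin/bash
# Sketch of a typical module session (module name taken from this tutorial).
R_MODULE="statistics/R/4.3.0"

if command -v module >/dev/null 2>&1; then
    module purge               # unload everything that was loaded before
    module load "$R_MODULE"    # make R and Rscript available
    module list                # show the modules that are now loaded
else
    echo "'module' is not available here: run these commands on the cluster." >&2
fi
```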
Run an R script in batch mode
To launch an R script on the Slurm cluster:
First, write an R script (HelloWorld.R):
print("Hello world!")
Second, write a bash script (myscript.sh):
#!/bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
# Purge all previously loaded modules
module purge
# Load the R module
module load statistics/R/4.3.0
# The command lines that I want to run on the cluster
Rscript HelloWorld.R
Finally, launch the script with the sbatch command:
sbatch myscript.sh
The scripts myscript.sh and HelloWorld.R are assumed to be located in the directory from which the sbatch command is launched. For .Rmd files, note that a document cannot be compiled if the .Rmd file is not in a writable directory.
sbatch options
Jobs can be launched with customized options (more memory, for instance). There are two ways to handle sbatch options:
- [RECOMMENDED] at the beginning of the bash script, with lines of the form: #SBATCH <OPTION> <VALUE>
- in the sbatch command: sbatch <OPTION1> <VALUE1> <OPTION2> <VALUE2> [...] myscript.sh
Many options are available. To see all options use sbatch --help. Useful options:
- -J, --job-name=jobname: name of the job
- -e, --error=err: file for the batch script's standard error
- -o, --output=out: file for the batch script's standard output
- --mail-type=BEGIN,END,FAIL: send an email at the beginning, end or failure of the script (the default email is your user email and can be changed with --mail-user=truc@bidule.fr, to use with care)
- -t, --time=HH:MM:SS: time limit (defaults to 04:00:00)
- --mem=XG: memory reservation (defaults to 4G)
- -c, --cpus-per-task=ncpus: number of CPUs required per task (defaults to 1)
- --mem-per-cpu=XG: maximum amount of real memory per allocated CPU required by the job
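To make the two approaches concrete, here is a minimal sketch that writes a job script with options in its header and then submits it. The file name myscript_options.sh, the error file name and the resource values (8G, one hour) are illustrative assumptions, not site recommendations.

```shell
#!/bin/bash
# Write a job script whose header carries the sbatch options (recommended way).
# File names and resource values below are illustrative only.
cat > myscript_options.sh <<'EOF'
#!/bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
#SBATCH -e error.out
#SBATCH --mem=8G
#SBATCH -t 01:00:00

module purge
module load statistics/R/4.3.0
Rscript HelloWorld.R
EOF

# Submit only where Slurm is installed (i.e. on the cluster).
if command -v sbatch >/dev/null 2>&1; then
    sbatch myscript_options.sh
fi

# Alternative: give the same options on the command line instead of the header:
#   sbatch --mem=8G --time=01:00:00 myscript.sh
```

When an option appears both in the header and on the command line, sbatch uses the command-line value.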
Job management
After a job has been launched, you can monitor it with squeue -u <USERNAME> or squeue -j <JOB_ID> and also cancel it with scancel <JOB_ID>.
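A possible monitoring workflow is sketched below. It relies on sbatch's --parsable flag, which is not mentioned above, to capture the job ID; myscript.sh stands for the job script from the previous sections.

```shell
#!/bin/bash
JOB_SCRIPT="myscript.sh"   # job script written earlier in this tutorial

if command -v sbatch >/dev/null 2>&1; then
    # --parsable prints only the job ID (followed by ";cluster" on some setups)
    JOB_ID=$(sbatch --parsable "$JOB_SCRIPT")
    JOB_ID=${JOB_ID%%;*}

    squeue -u "$USER"       # all of your pending and running jobs
    squeue -j "$JOB_ID"     # only the job just submitted

    # Uncomment to cancel the job:
    # scancel "$JOB_ID"
fi
```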
Use R in interactive mode
To use R in console mode, use srun --pty bash to connect to a node. Then run module load statistics/R/4.3.0 (or another available R version) and R to launch R.
srun: job 17758928 has been allocated resources
module load statistics/R/4.3.0
R
R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
Note
srun can be run with the same options as sbatch (CPU and memory reservations).
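For instance, an interactive session with more resources could be requested as in this short sketch; the values (2 CPUs, 8G) are arbitrary examples.

```shell
#!/bin/bash
# Illustrative reservation: 2 CPUs and 8G of memory for an interactive shell.
NCPUS=2
MEM="8G"

if command -v srun >/dev/null 2>&1; then
    srun --cpus-per-task="$NCPUS" --mem="$MEM" --pty bash
fi
```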
X11 sessions
X11 sessions are useful to display plots directly in an interactive session. Before using them, generate an SSH key on the cluster with ssh-keygen (if you do not already have one) and add its public key to your authorized_keys file.
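The commands below are one plausible way to do this; they are an assumption, since the exact commands are not given here. The sketch uses an ed25519 key at ~/.ssh/id_ed25519 with an empty passphrase (both are choices made for this example) and only runs where Slurm is present, as a rough check that you are on the cluster rather than on your own computer.

```shell
#!/bin/bash
KEY="$HOME/.ssh/id_ed25519"   # assumed key location; adapt if you use another one

if command -v srun >/dev/null 2>&1; then
    mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"

    # Generate the key pair only if it does not exist yet.
    if [ ! -f "$KEY" ]; then
        ssh-keygen -t ed25519 -N "" -f "$KEY"
    fi

    # Authorize the public key once, then restrict the file permissions.
    touch "$HOME/.ssh/authorized_keys"
    if ! grep -qxF "$(cat "$KEY.pub")" "$HOME/.ssh/authorized_keys"; then
        cat "$KEY.pub" >> "$HOME/.ssh/authorized_keys"
    fi
    chmod 600 "$HOME/.ssh/authorized_keys"
fi
```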
The interactive session is then launched by:
- Logging on to the cluster with ssh -X <USERNAME>@genobioinfo.toulouse.inrae.fr
- Running an interactive session with srun --x11 --pty bash
srun: job 17758928 has been allocated resources
module load statistics/R/4.3.0
R
plot(1:10)
R in a parallel environment
To use R in a parallel environment, the -c (or --cpus-per-task) option of sbatch and srun is needed. In the R script, the number of cores must be set to the SAME value.
Several packages, such as doParallel, BiocParallel, or future, provide parallel computation in R.
The following examples use doParallel and BiocParallel for 2 parallel jobs.
First, write an R script (TestParallel.R):
- With the doParallel package:
    library(doParallel)
    # specify the number of cores with makeCluster
    cl <- makeCluster(2)
    registerDoParallel(cl)
    foreach(i=1:3) %dopar% sqrt(i)
    # release the workers once the computation is done
    stopCluster(cl)
- or, with the BiocParallel package:
    library(BiocParallel)
    # specify the number of cores with workers = 2
    bplapply(1:10, print, BPPARAM = MulticoreParam(workers = 2))
Second, write a bash script:
#! /bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
#SBATCH -c 2
#Purge any previous modules
module purge
#Load the application
module load statistics/R/4.3.0
# My command lines I want to run on the cluster
Rscript TestParallel.R
Finally, launch the script with the sbatch command (assuming the bash script above is saved as myscript.sh):
sbatch myscript.sh
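One way to avoid a mismatch between the reserved CPUs and the number of cores used in R is to read the reservation from Slurm rather than hard-coding it. The variant below is a sketch: it relies on the SLURM_CPUS_PER_TASK environment variable, which Slurm sets when -c is given, and passes it to the R script as an argument. Reading that argument in TestParallel.R (for example with commandArgs, as described in the next section) is an adaptation that the scripts above do not include.

```shell
#!/bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
#SBATCH -c 2

# Slurm exports SLURM_CPUS_PER_TASK when -c is set; default to 1 otherwise.
NCORES="${SLURM_CPUS_PER_TASK:-1}"
echo "Cores reserved for this job: $NCORES"

if command -v module >/dev/null 2>&1; then
    module purge
    module load statistics/R/4.3.0
    # In TestParallel.R, the value would be read with, for example:
    #   ncores <- as.integer(commandArgs(trailingOnly = TRUE)[1])
    #   cl <- makeCluster(ncores)
    Rscript TestParallel.R "$NCORES"
fi
```

With this layout, changing the -c value in the header is enough to change the number of workers used by R.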
Arguments in a script
External arguments can be passed to an R script. The basic method is described below but the packages argparser or optparse provide ways to handle external arguments à la Python.
First, write an R script:
args <- commandArgs(trailingOnly=TRUE)
print(args[1])
Second, write a bash script:
#! /bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
#Purge any previous modules
module purge
#Load the application
module load statistics/R/4.3.0
# My command lines I want to run on the cluster
Rscript --vanilla HelloWorld.R "Hi!"
Finally, launch the script with the sbatch command (assuming the bash script above is saved as myscript.sh):
sbatch myscript.sh
Install packages in your own environment
Once in an interactive R session, R packages are installed (in a personal library) using the standard install.packages command line.
srun: job 17758928 has been allocated resources
module load statistics/R/4.3.0
R
install.packages("ggplot2")
Your personal library is usually located at the root of your personal directory (i.e. ~/R) whose allocated space is very limited. A simple solution consists in:
Some packages are already installed
A few R packages are already installed in each R version. For example, the package dplyr is already installed in the module statistics/R/4.3.0. You can check whether a package is pre-installed using the command search_R_package, as in this example.
Please, wait for output...
"dplyr 1.0.10" est installé dans statistics/R/3.4.3
...
"dplyr 1.1.4" est installé dans statistics/R/4.4.0
If you would like an additional package to be pre-installed in a given R version, you can request it from the support team.
Create and compile .Rmd (RMarkdown) files on the cluster (batch mode)
To compile a .Rmd file, two packages are needed: rmarkdown and knitr. You also need to load the module tools/Pandoc/3.1.2.
As for an R script, you can pass external arguments to a .Rmd document.
First, write a .Rmd script called MyDocument.Rmd with parameters in the header:
---
title: My Document
output: html_document
params:
  text: "Hi!"
---
What is your text?
```{r}
print(params$text)
```
Second, write an R script (TestRmd.R) to pass parameters:
rmarkdown::render("MyDocument.Rmd",
params = list(text = "Hola!"))
Third, write a bash script:
#!/bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
module purge
module load statistics/R/4.3.0
module load tools/Pandoc/3.1.2
Rscript --vanilla TestRmd.R
Finally, launch the script with the sbatch command (assuming the bash script above is saved as myscript.sh):
sbatch myscript.sh
Access R through conda
On the cluster, conda is a way to get additional versions of R. It is available through a module named devel/Miniforge/Miniforge3. For example, a conda environment with R 4.2.0 can be created from this module.
When finished, you can unload (deactivate) the environment.
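A minimal sketch of such a session, from creation to deactivation, is given below as an assumption of what the commands could look like. The environment name r-4.2.0 and the use of the conda-forge channel are choices made for this example. The line that sources conda.sh makes `conda activate` usable inside a script; it may be unnecessary in a shell that is already initialized for conda.

```shell
#!/bin/bash
ENV_NAME="r-4.2.0"   # hypothetical environment name

if command -v module >/dev/null 2>&1; then
    module load devel/Miniforge/Miniforge3

    # Make "conda activate" available in this non-interactive shell.
    . "$(conda info --base)/etc/profile.d/conda.sh"

    # Create the environment with R 4.2.0 from conda-forge, then use it.
    conda create -y -n "$ENV_NAME" -c conda-forge r-base=4.2.0
    conda activate "$ENV_NAME"
    R --version

    # When finished, leave the environment.
    conda deactivate
fi
```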
Integrated editors
Keep in mind that the purpose of the cluster is not to write or design code, but to run existing code.
Use RStudio with Open OnDemand (OOD)
See the dedicated page.
Use Positron
Positron can run R >= 4.2. It is an alternative to RStudio.
You can follow the VSCode-like tutorial to run Positron on the cluster.
Load modules with remoteSSH plugin
If you use Positron on the cluster through the remoteSSH plugin, you will have some difficulty using the R versions provided as modules.
Following this Positron issue, here is how to use them:
- Look at the documentation How_to_use_SLURM_R to find which R versions are available, and which modules are required to run them. You need to load the module compilers/gcc/12.2.0 in addition to the R 4.4.x module, and the module compilers/gcc/15.1.0 with the R 4.5.x module.
- On your own computer, edit your Positron settings (~/.config/Positron/User/settings.json) as follows to add the R versions you want from modules:
    {
      ...,
      "positron.environmentModules.environments": {
        "R 4.2.2": {
          "languages": ["r"],
          "modules": ["statistics/R/4.2.2"]
        },
        "R 4.3.3": {
          "languages": ["r"],
          "modules": ["statistics/R/4.3.3"]
        },
        "R 4.4.3": {
          "languages": ["r"],
          "modules": ["compilers/gcc/12.2.0", "statistics/R/4.4.3"]
        },
        "R 4.5.0": {
          "languages": ["r"],
          "modules": ["compilers/gcc/15.1.0", "statistics/R/4.5.0"]
        }
      },
      ...
    }
- Now you can connect with remoteSSH and select an R version available in modules as an "Interpreter Session". It must be tagged as a module in the list, like R 4.5.0 (Module: R 4.5.0). If it doesn't appear, please wait until Positron finishes listing the available environments. Sometimes you need to quit and relaunch Positron to refresh the list.
Troubleshooting:
When switching from one module to another, or starting another R session, you will get this error:
JEP 66 handshake failed for session r-db123456: Timeout waiting for handshake
You need to close Positron and run it again to fix the issue.
Use conda/pixi
You can install your own version of R with conda or pixi.
On the cluster, you can use the commands below to create, from scratch, a reproducible project using pixi and renv.
You must install pixi beforehand and enable pixi environments in Positron.
# We create a project directory with pixi (pixi must be installed beforehand)
pixi init my-pixi-r-project
# We install R in it
cd my-pixi-r-project
pixi add r-base
# (optional) We create an RStudio-compatible project
pixi add r-usethis
pixi run R -e 'usethis::create_project(".",rstudio=TRUE)'
pixi rm r-usethis # cleanup
# We use renv to track our dependencies
pixi run R -e 'install.packages("renv", repos=c("https://cloud.r-project.org")); renv::init()'
You can then open this project in Positron and work as usual.