How to use R on the Bioinformatics cluster
Correct social behaviour expected
DO NOT run jobs on the front-end (login) servers: this disturbs other users. Please always use sbatch or srun.
Before contacting support, READ THE FAQ
Objective
This tutorial describes how to run R scripts and compile RMarkdown files on the Toulouse Bioinformatics cluster.
To do so, you need to have an account. Ask for an account if needed.
You can then connect to the cluster using the ssh command on Linux and Mac OS X and using MobaXterm on Windows.
Similarly, you can copy files between the cluster and your computer using the scp command on Linux and Mac OS X and using OpenSSH on Windows. The login address is genobioinfo.toulouse.inrae.fr.
Once you are connected, you have two ways to run a script: in batch mode or in an interactive session. A script must never be run on the first server you connect to. Also note that the programs on the cluster are not available until you have loaded the corresponding module. How to manage modules is explained in the next section.
Use of modules
All programs are made available by loading the corresponding modules. These are the main useful commands to work with modules:
- module avail: list all available modules
- search_module <TEXT>: find a module by keyword
- module load <MODULE_NAME>: load a module (for instance, to load R, module load statistics/R/4.3.0). This command is either used directly (in interactive mode) or included in the file that is used to run your R script in batch mode (see below)
- module purge: unload all previously loaded modules
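As an illustration, the commands above can be chained as follows in an interactive session. This is only a sketch: it assumes that the `module` command is defined in the current shell (normally only the case on the cluster) and prints a message otherwise.

```shell
# Module used throughout this tutorial
R_MODULE="statistics/R/4.3.0"

if type module >/dev/null 2>&1; then
    module purge             # start from a clean environment
    module load "$R_MODULE"  # make R available
    module list              # show which modules are now loaded
else
    # Assumption: outside the cluster, `module` is not defined
    echo "module command not found: run this on the cluster" >&2
fi
```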
Run an R script in batch mode
To launch an R script on the Slurm cluster:
First, write an R script (HelloWorld.R):
print("Hello world!")
Second, write a bash script (myscript.sh):
#!/bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
# Purge all previously loaded modules
module purge
# Load the R module
module load statistics/R/4.3.0
# The command lines that I want to run on the cluster
Rscript HelloWorld.R
Finally, launch the script with the sbatch command:
sbatch myscript.sh
The scripts myscript.sh and HelloWorld.R are assumed to be located in the directory from which the sbatch command is launched. Note that a .Rmd file can only be compiled if it is located in a writable directory.
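When submitting from a script, it can be convenient to keep the job ID for later monitoring. The following is a minimal sketch, assuming it is run on the cluster from the directory that contains myscript.sh; the `JOB_ID` variable is only an illustrative name. It relies on the `--parsable` option of sbatch, which prints the job ID (followed by the cluster name, if any, after a semicolon).

```shell
if command -v sbatch >/dev/null 2>&1; then
    # Keep only the job ID, dropping an optional ";cluster" suffix
    JOB_ID=$(sbatch --parsable myscript.sh | cut -d ';' -f 1)
    echo "Submitted job ${JOB_ID}; output will be written to output.out"
else
    # Assumption: sbatch is only available on the cluster
    JOB_ID=""
    echo "sbatch not found: this must be run on the cluster" >&2
fi
```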
sbatch options
Jobs can be launched with customized options (more memory, for instance). There are two ways to handle sbatch options:
- [RECOMMENDED] at the beginning of the bash script with lines of the form: #SBATCH <OPTION> <VALUE>
- in the sbatch command: sbatch <OPTION1> <VALUE1> <OPTION2> <VALUE2> [...] myscript.sh
Many options are available. To see all options use sbatch --help. Useful options:
- -J, --job-name=jobname: name of job
- -e, --error=err: file for batch script's standard error
- -o, --output=out: file for batch script's standard output
- --mail-type=BEGIN,END,FAIL: send an email at the beginning, end or failure of the script (default email is your user email and can be changed with --mail-user=truc@bidule.fr, to use with care)
- -t, --time=HH:MM:SS: time limit (default to 04:00:00)
- --mem=XG: to change memory reservation (default to 4G)
- -c, --cpus-per-task=ncpus: number of cpus required per task (default to 1)
- --mem-per-cpu=XG: maximum amount of real memory per allocated cpu required by the job
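To make the recommended form concrete, here is a sketch of a job script combining several of the options listed above. The file name myscript_bigmem.sh and the chosen values (1 hour, 8G, 2 cpus) are purely illustrative assumptions, not site recommendations.

```shell
# Write a hypothetical job script with several #SBATCH directives
cat > myscript_bigmem.sh <<'EOF'
#!/bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
#SBATCH -e error.err
#SBATCH -t 01:00:00
#SBATCH --mem=8G
#SBATCH -c 2
#SBATCH --mail-type=END,FAIL
module purge
module load statistics/R/4.3.0
Rscript HelloWorld.R
EOF

# Count the #SBATCH directives that were written
grep -c '^#SBATCH' myscript_bigmem.sh
```

The same reservations could instead be given on the command line, for example `sbatch --mem=8G -c 2 myscript.sh`.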
Job management
After a job has been launched, you can monitor it with squeue -u <USERNAME> or squeue -j <JOB_ID> and also cancel it with scancel <JOB_ID>.
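For scripted workflows, the monitoring command can be wrapped in a small polling loop. The helper below is a hypothetical sketch (the name `wait_for_job` and the `POLL_SECONDS` variable are not part of Slurm): it assumes that `squeue -h -j <JOB_ID>` prints nothing once the job has left the queue.

```shell
# Hypothetical helper: block until a job is no longer listed by squeue
wait_for_job() {
    job_id="$1"
    # -h suppresses the header line, so the output is empty once the job is gone
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "${POLL_SECONDS:-30}"
    done
    echo "Job $job_id is no longer in the queue"
}
```

It would be called as `wait_for_job <JOB_ID>`; to stop a job instead of waiting for it, use scancel as described above.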
Use R in interactive mode
To use R in console mode, run srun --pty bash to connect to a node. Then run module load statistics/R/4.3.0 (or another available R version) followed by R to launch R.
srun: job 17758928 has been allocated resources
module load statistics/R/4.3.0
R
R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
Note
srun can be run with the same options as sbatch (cpu and memory reservations).
X11 sessions
X11 sessions are useful to display plots directly in an interactive session. Before using them, generate an SSH key on the cluster with ssh-keygen, if you do not already have one, and add it to the authorized_keys file.
The interactive session is then launched by:
- Logging in to the cluster with ssh -X <USERNAME>@genobioinfo.toulouse.inrae.fr
- Running an interactive session with srun --x11 --pty bash
srun: job 17758928 has been allocated resources
module load statistics/R/4.3.0
R
R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
plot(1:10)
R in a parallel environment
To use R in a parallel environment, the -c (or --cpus-per-task) option of sbatch or srun is needed. In the R script, the number of cores must be set to the SAME value.
Several packages, such as doParallel, BiocParallel, or future, provide parallel computation in R.
The following examples use doParallel and BiocParallel for 2 parallel jobs.
First, write an R script:
- With the doParallel package (TestParallel.R):
library(doParallel)
# specify the number of cores with makeCluster
cl <- makeCluster(2)
registerDoParallel(cl)
foreach(i=1:3) %dopar% sqrt(i)
# release the workers once the computation is done
stopCluster(cl)
- or, with the BiocParallel package (TestParallel.R):
library(BiocParallel)
# specify the number of cores with workers = 2
bplapply(1:10, print, BPPARAM = MulticoreParam(workers = 2))
Second, write a bash script:
#! /bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
#SBATCH -c 2
#Purge any previous modules
module purge
#Load the application
module load statistics/R/4.3.0
# My command lines I want to run on the cluster
Rscript TestParallel.R
Finally, launch the script with the sbatch command, passing it the name of the bash script.
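One way to keep the number of cores identical in both places is to let the job script pass the reserved value to R. The sketch below assumes the hypothetical file name parallel_job.sh; it relies on the SLURM_CPUS_PER_TASK environment variable, which Slurm sets inside the job when -c is given. It also assumes that TestParallel.R is adapted to read this value, for example with as.integer(commandArgs(trailingOnly=TRUE)[1]), instead of hard-coding 2.

```shell
# Write a hypothetical job script that forwards the reserved cpu count to R
cat > parallel_job.sh <<'EOF'
#!/bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
#SBATCH -c 2
module purge
module load statistics/R/4.3.0
# Fall back to 1 core if the variable is not set
Rscript TestParallel.R "${SLURM_CPUS_PER_TASK:-1}"
EOF
```

With this layout, changing the `-c` value in the header is enough to change the number of workers used by the R script.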
Arguments in a script
External arguments can be passed to an R script. The basic method is described below but the packages argparser or optparse provide ways to handle external arguments à la Python.
First, write an R script:
args <- commandArgs(trailingOnly=TRUE)
print(args[1])
Second, write a bash script:
#! /bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
#Purge any previous modules
module purge
#Load the application
module load statistics/R/4.3.0
# My command lines I want to run on the cluster
Rscript --vanilla HelloWorld.R "Hi!"
Finally, launch the script with the sbatch command, passing it the name of the bash script.
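The argument does not have to be hard-coded in the bash script: sbatch passes anything written after the script name on to the script itself. The following sketch, with the hypothetical file name args_job.sh and illustrative values, forwards those arguments to R and submits one job per value. It assumes sbatch is available and otherwise only prints what would be run.

```shell
# Hypothetical job script that forwards its own arguments to the R script
cat > args_job.sh <<'EOF'
#!/bin/bash
#SBATCH -J launchRscript
module purge
module load statistics/R/4.3.0
Rscript --vanilla HelloWorld.R "$@"
EOF

# Submit one job per value; %j in the output name is replaced by the job ID
for greeting in 'Hi!' 'Hola!'; do
    if command -v sbatch >/dev/null 2>&1; then
        sbatch -o "output_%j.out" args_job.sh "$greeting"
    else
        echo "sbatch not found, would run: sbatch -o output_%j.out args_job.sh $greeting"
    fi
done
```

Using a distinct output file per job avoids several jobs overwriting the same output.out.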
Install packages in your own environment
Once in an interactive R session, R packages are installed (in a personal library) using the standard install.packages command.
srun: job 17758928 has been allocated resources
module load statistics/R/4.3.0
R
R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
install.packages("ggplot2")
Your personal library is usually located at the root of your personal directory (i.e. ~/R) whose allocated space is very limited. A simple solution consists in:
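One possible approach, given here only as a sketch under assumptions, is to point R at a library directory located on a larger storage area through the R_LIBS_USER environment variable, which R reads at startup. The `WORK_DIR` variable and its default value `$HOME/work` are assumptions to adapt to your own storage layout.

```shell
# Assumption: WORK_DIR points to a storage area with more space than the home directory
WORK_DIR="${WORK_DIR:-$HOME/work}"

# R only uses R_LIBS_USER if the directory already exists, so create it first
R_LIBS_USER="$WORK_DIR/R/library"
mkdir -p "$R_LIBS_USER"
export R_LIBS_USER

echo "R packages will be installed in $R_LIBS_USER"
```

These lines would need to be run before starting R (and repeated in the bash script of batch jobs); inside R, .libPaths() shows which library directories are actually in use.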
Some packages are already installed
A few R packages are already installed inside an R version. For example, the package dplyr is already installed inside the module statistics/R/4.3.0. You can check whether a package is pre-installed with the search_R_package command, as in this example for dplyr:
Please, wait for output...
"dplyr 1.0.10" est installé dans statistics/R/3.4.3
...
"dplyr 1.1.4" est installé dans statistics/R/4.4.0
If you would like an additional package to be pre-installed in a given R version, you can request it from support.
Create and compile .Rmd (RMarkdown) files on the cluster (batch mode)
To compile a .Rmd file, two packages are needed: rmarkdown and knitr. You also need to load the module tools/Pandoc/3.1.2.
As for an R script, you can pass external arguments to a .Rmd document.
First, write a .Rmd script called MyDocument.Rmd with parameters in the header:
---
title: My Document
output: html_document
params:
  text: "Hi!"
---
What is your text?
```{r}
print(params$text)
```
Second, write an R script (TestRmd.R) to pass parameters:
rmarkdown::render("MyDocument.Rmd",
params = list(text = "Hola!"))
Third, write a bash script:
#!/bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
module purge
module load statistics/R/4.3.0
module load tools/Pandoc/3.1.2
Rscript --vanilla TestRmd.R
Finally, launch the script with the sbatch command, passing it the name of the bash script.
Access R through conda
On the cluster, conda is a way to get additional versions of R. It is available through a module named devel/Miniforge/Miniforge3. The following commands are an example of how to create a conda environment with R 4.2.0.
When finished, you can deactivate the environment this way
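Putting the steps of this section together, a session could look like the sketch below. It is not a definitive recipe: the environment name r-4.2.0 is hypothetical, the conda-forge channel is assumed to provide r-base 4.2.0, and the commands are meant to be run on a compute node (for instance inside an srun session), not on the login server.

```shell
ENV_NAME="r-4.2.0"   # hypothetical environment name

if type module >/dev/null 2>&1 \
        && module load devel/Miniforge/Miniforge3 \
        && command -v conda >/dev/null 2>&1; then
    # Create the environment with R 4.2.0 (assumes conda-forge provides r-base 4.2.0)
    conda create -y -n "$ENV_NAME" -c conda-forge r-base=4.2.0
    # Make `conda activate` usable in a non-interactive shell
    . "$(conda info --base)/etc/profile.d/conda.sh"
    conda activate "$ENV_NAME"
    R --version
    # Leave the environment when finished
    conda deactivate
else
    # Assumption: module and conda are only available on the cluster
    echo "module or conda not available: run this on the cluster" >&2
fi
```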