How to use R on the Genotoul-Bioinfo cluster (advanced)
Objective
This tutorial describes how to write advanced R scripts, compile RMarkdown files, and manage your own R version on the Genotoul-Bioinfo cluster.
Correct social behaviour expected
DO NOT run jobs on the frontal servers: you would be a nuisance to other users. Always use sbatch or srun.
This also applies to the Positron editor.
Before contacting support, READ THE FAQ and the tutorials.
Prerequisite
You need to have an account. Ask for an account if needed.
Advanced scripts
Arguments in a script
External arguments can be passed to an R script. The basic method is described below, but the argparser and optparse packages provide ways to handle external arguments à la Python.
First, write an R script:
HelloWorld.R
Second, write a bash script:
myscript.sh
Finally, launch the script with the sbatch command:
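A minimal sketch of these three steps, assuming the basic method relies on base R's commandArgs(). The bodies of HelloWorld.R and myscript.sh, the job name and the argument "world" are illustrative assumptions; the module name is the one used elsewhere in this tutorial.

```bash
# Step 1: an R script that reads its first command-line argument (assumed content).
cat > HelloWorld.R <<'EOF'
args <- commandArgs(trailingOnly = TRUE)
name <- if (length(args) >= 1) args[1] else "nobody"
cat("Hello", name, "\n")
EOF

# Step 2: a submission script that forwards one argument to Rscript (assumed content).
cat > myscript.sh <<'EOF'
#!/bin/bash
#SBATCH -J HelloWorld
#SBATCH -o output.out

# Purge any previous modules, then load R
module purge
module load statistics/R/4.3.0

# "world" becomes args[1] in HelloWorld.R
Rscript HelloWorld.R world
EOF

# Step 3: submit the job (only possible where sbatch is available).
if command -v sbatch >/dev/null 2>&1; then
    sbatch myscript.sh
else
    echo "sbatch not found: run 'sbatch myscript.sh' on the cluster"
fi
```

With trailingOnly = TRUE, commandArgs() returns only the values placed after the script name, so the R code does not have to skip the options of Rscript itself.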
R in a parallel environment
To run R in parallel, the -c (or --cpus-per-task) option of sbatch or srun is required. In the R script, the number of cores must be set to the SAME value.
Several packages, such as doParallel, BiocParallel, or future, provide parallel computation in R.
The following examples use doParallel and BiocParallel with 2 parallel workers.
First, write an R script:
- With the doParallel package (TestParallel.R):

      library(doParallel)
      # specify the number of cores with makeCluster
      cl <- makeCluster(2)
      registerDoParallel(cl)
      foreach(i=1:3) %dopar% sqrt(i)
      # release the workers once the computation is done
      stopCluster(cl)

- or, with the BiocParallel package (TestParallel.R):

      library(BiocParallel)
      # specify the number of cores with workers = 2
      bplapply(1:10, print, BPPARAM = MulticoreParam(workers = 2))
Second, write a bash script:
myscript.sh
Finally, launch the script with the sbatch command:
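As a hedged sketch, a submission script for this case could look like the following. The job name and output file are assumptions; the key point is that the value given to -c matches the 2 workers set in TestParallel.R.

```bash
# Write the submission script (assumed content, modelled on the other scripts of this tutorial).
cat > myscript.sh <<'EOF'
#!/bin/bash
#SBATCH -J TestParallel
#SBATCH -o output.out
# Reserve 2 CPUs: must match makeCluster(2) or workers = 2 in TestParallel.R
#SBATCH -c 2

# Purge any previous modules, then load R
module purge
module load statistics/R/4.3.0

Rscript TestParallel.R
EOF

# Submit the job (only possible where sbatch is available).
if command -v sbatch >/dev/null 2>&1; then
    sbatch myscript.sh
else
    echo "sbatch not found: run 'sbatch myscript.sh' on the cluster"
fi
```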
Combine parallel environment and arguments
One drawback of the previous approach is that the number of CPUs must be changed in both TestParallel.R and myscript.sh. By passing it as an argument, TestParallel.R automatically uses the number set in myscript.sh:
- The script TestParallel.R:

      #!/usr/bin/env Rscript
      library(argparser, quietly=TRUE)
      library(BiocParallel, quietly=TRUE)

      # Create a parser
      p <- arg_parser("My super script")
      # Add command line arguments, here the --cpus argument
      p <- add_argument(p, "--cpus", help="number of cores used", default=1)
      # Parse the command line arguments
      argv <- parse_args(p)

      # Do something
      bplapply(1:10, print, BPPARAM = MulticoreParam(workers = argv$cpus))

- The submission script myscript.sh:

      #!/bin/bash
      #SBATCH -J launchRscript
      #SBATCH -o output.out
      #SBATCH -c 2
      # Purge any previous modules
      module purge
      # Load the application
      module load statistics/R/4.3.0
      # My command lines I want to run on the cluster
      Rscript TestParallel.R --cpus "$SLURM_CPUS_PER_TASK"
Now, whenever the number of CPUs is changed in the sbatch options, the R script uses the same value.
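Slurm only defines SLURM_CPUS_PER_TASK when -c (or --cpus-per-task) is given, so a submission script may want a fallback. The sketch below is an assumption about how one could guard against that case; the variable name CPUS is illustrative, and the command is only printed so that the snippet can be tried outside the cluster.

```bash
# Fall back to 1 core when SLURM_CPUS_PER_TASK is unset (no -c option given).
CPUS="${SLURM_CPUS_PER_TASK:-1}"

# In myscript.sh, remove the echo to actually run the R script.
echo "Rscript TestParallel.R --cpus ${CPUS}"
```

Options given on the command line take precedence over the #SBATCH lines, so `sbatch -c 4 myscript.sh` runs the same script with 4 cores without editing any file.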
How can I work with very big data on my own computer?
You can try the Apache Arrow package and its Parquet format to efficiently manipulate data that does not fit in your computer's memory (RAM). Note that every manipulation of the big data must go through this library; otherwise the data is loaded into R and fills your memory.
The arrow package can be used through dplyr.
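As an illustration, the sketch below writes a small R script that queries a Parquet dataset through dplyr verbs. The file name big_data.R, the directory my_data/ and the columns sample and value are hypothetical; only open_dataset() and collect() from arrow and the usual dplyr verbs are assumed.

```bash
# Write an example R script (hypothetical dataset path and column names).
cat > big_data.R <<'EOF'
library(arrow)
library(dplyr)

# Open a directory of Parquet files lazily: nothing is loaded into RAM yet
ds <- open_dataset("my_data/")

result <- ds %>%
  filter(value > 0) %>%
  group_by(sample) %>%
  summarise(mean_value = mean(value)) %>%
  collect()  # only the small summary table is brought into memory

print(result)
EOF

echo "Run it with: Rscript big_data.R"
```

The query is evaluated by arrow itself, and collect() is the step that returns a regular data frame, so it should come after the filtering and summarising steps that shrink the data.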
RMarkdown (.Rmd) in batch mode
To compile a .Rmd file, two packages are needed: rmarkdown and knitr. You also need to load the module tools/Pandoc/3.1.2.
As with an R script, you can pass external arguments to a .Rmd document.
First, write a .Rmd script called MyDocument.Rmd with parameters in the header:
MyDocument.Rmd
Second, write an R script to pass parameters:
TestRmd.R
Third, write a bash script:
myscript.sh
Finally, launch the script with the sbatch command:
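A minimal sketch of the whole workflow, assuming the parameter is declared in the params field of the YAML header and overridden through the params argument of rmarkdown::render(). The parameter name, its values, the job name and the R module version are illustrative assumptions.

```bash
# Step 1: an R Markdown document declaring a parameter in its header (assumed content).
cat > MyDocument.Rmd <<'EOF'
---
title: "My document"
output: html_document
params:
  name: "world"
---

Hello `r params$name`!
EOF

# Step 2: an R script that renders the document and overrides the parameter.
cat > TestRmd.R <<'EOF'
library(rmarkdown)
render("MyDocument.Rmd", params = list(name = "Genotoul"))
EOF

# Step 3: a submission script that loads R and Pandoc before rendering.
cat > myscript.sh <<'EOF'
#!/bin/bash
#SBATCH -J TestRmd
#SBATCH -o output.out

module purge
module load statistics/R/4.3.0
module load tools/Pandoc/3.1.2

Rscript TestRmd.R
EOF

# Step 4: submit the job (only possible where sbatch is available).
if command -v sbatch >/dev/null 2>&1; then
    sbatch myscript.sh
else
    echo "sbatch not found: run 'sbatch myscript.sh' on the cluster"
fi
```

Values passed through params replace the defaults of the header, so the same document can be rendered with different inputs without being edited.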
Manage your own R
Access R through conda
On the cluster, conda is a way to get additional versions of R. It is available through a module named devel/Miniforge/Miniforge3. The following commands are an example of how to create a conda environment with R 4.2.0.
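One possible sequence is sketched below. The environment name r-4.2.0 is hypothetical, R is assumed to come from the r-base package of the conda-forge channel, and the activation step assumes that the module sets up conda for your shell.

```bash
ENV_NAME="r-4.2.0"   # hypothetical environment name
R_VERSION="4.2.0"

if command -v module >/dev/null 2>&1; then
    # Make conda available
    module load devel/Miniforge/Miniforge3
    # Create the environment with the requested R version
    conda create --yes --name "$ENV_NAME" --channel conda-forge "r-base=${R_VERSION}"
    # Enter the environment and check which R is now used
    conda activate "$ENV_NAME"
    R --version
else
    echo "module command not found: run these commands on the cluster"
fi
```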
When finished, you can deactivate the environment this way:
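A hedged sketch of this cleanup, assuming the environment was entered with conda activate and that the Miniforge module is no longer needed afterwards:

```bash
MINIFORGE_MODULE="devel/Miniforge/Miniforge3"

if command -v module >/dev/null 2>&1; then
    # Leave the conda environment, then unload the module that provides conda
    conda deactivate
    module unload "$MINIFORGE_MODULE"
else
    echo "module command not found: run these commands on the cluster"
fi
```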
How can I have the same R version(s) on my computer and the cluster?
There are two ways:
- Use rig on your computer to install the same R version as the one used on the cluster. If you use Positron as an editor, you can also use a conda/pixi environment.
- Use conda/pixi on the cluster to install the same R version as the one available on your own computer.
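For the first option, a minimal sketch on your own computer could be the following. It assumes that rig is already installed and that the version to match is the 4.3.0 loaded by the statistics/R/4.3.0 module in the scripts of this tutorial; adapt the version to the module you actually use.

```bash
# R version to mirror locally (taken from the statistics/R/4.3.0 module)
CLUSTER_R_VERSION="4.3.0"

if command -v rig >/dev/null 2>&1; then
    # Install that R version, then list the versions managed by rig
    rig add "$CLUSTER_R_VERSION"
    rig list
else
    echo "rig not found: install it first, then run 'rig add ${CLUSTER_R_VERSION}'"
fi
```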