Gaëlle Lefort, Alyssa Imbert, Nathalie Vialaneix, Genotoul-Bioinfo

CC BY-NC-SA

How to use R on Genotoul-Bioinfo cluster (advanced)¶

Objective

This tutorial describes how to write advanced R scripts, compile RMarkdown files, and manage your own R version on the Genotoul-Bioinfo cluster.

Correct social behaviour expected

DO NOT run computations on the frontal (login) servers: doing so disturbs other users. Always use sbatch or srun. This also applies to the Positron editor.

Before contacting support, READ THE FAQ and the tutorials.

Prerequisite

You need to have an account. Ask for an account if needed.

Advanced scripts¶

Arguments in a script¶

External arguments can be passed to an R script. The basic method is described below; the packages argparser and optparse provide ways to handle external arguments à la Python.

First, write an R script:

HelloWorld.R
args <- commandArgs(trailingOnly=TRUE)

print(args[1])

Second, write a bash script:

myscript.sh
#! /bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out

#Purge any previous modules
module purge

#Load the application
module load statistics/R/4.3.0

# My command lines I want to run on the cluster
Rscript --vanilla HelloWorld.R "Hi!"

Finally, launch the script with the sbatch command:

sbatch myscript.sh
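
Because the shell splits the command line before R sees it, quoting decides what ends up in args[1]. The sketch below illustrates this in plain bash. It assumes a hypothetical function, show_args, that stands in for HelloWorld.R (it is not part of R or of the cluster tools) and simply reports the arguments it receives.

```shell
#!/bin/bash
# Hypothetical stand-in for the R script: prints how many arguments it
# received, then each argument on its own line.
show_args() {
    echo "count: $#"
    for arg in "$@"; do
        echo "arg: ${arg}"
    done
}

MESSAGE="Hi there"

# Quoted: the whole message arrives as a single argument (args[1] in R).
show_args "${MESSAGE}"

# Unquoted: the shell splits on whitespace, so two arguments arrive
# (args[1] and args[2] in R).
show_args ${MESSAGE}
```

In a submission script, quoting every argument passed to Rscript is therefore the safer habit when values may contain spaces.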

R in a parallel environment¶

To use R in a parallel environment, the -c (or --cpus-per-task) option of sbatch and srun is required. In the R script, the number of cores must be set to the SAME value.

Several packages, like doParallel, BiocParallel, or future, provide parallel computing in R. The following examples use doParallel and BiocParallel with 2 parallel workers.

First, write an R script.

With doParallel package:

TestParallel.R
library(doParallel)
# specify the number of cores with makeCluster
cl <- makeCluster(2)
registerDoParallel(cl)

foreach(i=1:3) %dopar% sqrt(i)

# stop the workers once the computation is done
stopCluster(cl)

or, with BiocParallel package:

TestParallel.R
library(BiocParallel)

# specify the number of cores with workers = 2
bplapply(1:10, print, BPPARAM = MulticoreParam(workers = 2))

Second, write a bash script:

myscript.sh
#! /bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
#SBATCH -c 2

#Purge any previous modules
module purge

#Load the application
module load statistics/R/4.3.0

# My command lines I want to run on the cluster
Rscript TestParallel.R

Finally, launch the script with the sbatch command:

sbatch myscript.sh

Combine parallel environment and arguments¶

One drawback of the previous approach is that the number of cpus must be changed in both TestParallel.R and myscript.sh. By using arguments, TestParallel.R can automatically use the number set in myscript.sh:

The script

TestParallel.R
#!/usr/bin/env Rscript

library(argparser, quietly=TRUE)
library(BiocParallel, quietly=TRUE)

# Create a parser
p <- arg_parser("My super script")

# Add command line arguments, here the --cpus argument
p <- add_argument(p, "--cpus", help="number of cores used", default=1)

# Parse the command line arguments
argv <- parse_args(p)

# Do something
bplapply(1:10, print, BPPARAM = MulticoreParam(workers = argv$cpus))

The submission script

myscript.sh
#! /bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out
#SBATCH -c 2

#Purge any previous modules
module purge

#Load the application
module load statistics/R/4.3.0

# My command lines I want to run on the cluster
Rscript TestParallel.R --cpus $SLURM_CPUS_PER_TASK

Now each time we set the number of cpus used with sbatch, it will be set correctly in our R script.

# We submit your script with 4 cpus
sbatch -c 4 myscript.sh
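
One caveat, offered as a sketch and not as the cluster's recommended practice: Slurm only defines SLURM_CPUS_PER_TASK when -c (or --cpus-per-task) is requested. If the #SBATCH -c line is removed and no -c is given to sbatch, the --cpus option receives no value. Assuming a hypothetical helper named resolve_cpus, the submission script could fall back to 1:

```shell
#!/bin/bash
# Hypothetical helper: use the value set by Slurm when it exists,
# otherwise fall back to a single core.
resolve_cpus() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

CPUS="$(resolve_cpus)"

# Shown with echo so that the sketch runs anywhere;
# in myscript.sh, drop the echo to actually launch R.
echo Rscript TestParallel.R --cpus "${CPUS}"
```

The fallback value matches the default=1 declared for --cpus in TestParallel.R, so both sides agree when no cpu count is requested.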

How can I work with very big data on my own computer?¶

You can try the Apache Arrow package (arrow) and its Parquet format to efficiently manipulate data that do not fit in your computer's memory (RAM). Note that every manipulation of the big data must go through this library; otherwise, you will fill your memory.

The arrow package can be used through dplyr.

RMarkdown (.Rmd) in batch mode¶

To compile a .Rmd file, two packages are needed: rmarkdown and knitr. You also need to load the module tools/Pandoc/3.1.2. As for an R script, you can pass external arguments to a .Rmd document.

First, write a .Rmd script called MyDocument.Rmd with parameters in the header:

MyDocument.Rmd
---
title: My Document
output: html_document
params:
    text: "Hi!"
---

What is your text?
```{r}
print(params$text)
```

Second, write an R script to pass parameters:

TestRmd.R
rmarkdown::render("MyDocument.Rmd", 
                  params = list(text = "Hola!"))

Third, write a bash script:

myscript.sh
#! /bin/bash
#SBATCH -J launchRscript
#SBATCH -o output.out

module purge
module load statistics/R/4.3.0
module load tools/Pandoc/3.1.2

Rscript --vanilla TestRmd.R

Finally, launch the script with the sbatch command:

sbatch myscript.sh
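
A batch job gives no visual feedback, so it can help to check that the report was actually produced. The following is a minimal sketch, assuming a hypothetical helper report_ready and the default behaviour of rmarkdown::render, which writes MyDocument.html next to MyDocument.Rmd when the output is html_document.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the file exists and is not empty.
report_ready() {
    [ -s "$1" ]
}

# Default output name for MyDocument.Rmd with 'output: html_document'.
REPORT="MyDocument.html"

if report_ready "${REPORT}"; then
    echo "Report written to ${REPORT}"
else
    echo "Report ${REPORT} is missing or empty" >&2
fi
```

Placed after the Rscript line of myscript.sh, this check writes its message to the job output, where it can be read once the job has finished.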

Manage your own R¶

Access R through `conda`¶

On the cluster, conda is a way to get additional versions of R. It is available through a module named devel/Miniforge/Miniforge3. The following commands show how to create a conda environment with R 4.2.0.

module load devel/Miniforge/Miniforge3

# We search for which R versions are available
conda search -c conda-forge r-base

# We create a conda env at ~/work/envs/r-4.2
# You can choose another path after '-p'
conda create -c conda-forge -p ~/work/envs/r-4.2 r-base=4.2.0

# We make R available by loading the env
# /!\ The command differs from the usual 'conda activate' command /!\
source activate ~/work/envs/r-4.2

# You can then launch R
R

When finished, you can unload the env this way:

conda deactivate
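
In a submission script, it can be useful to check that the environment exists before loading it. The sketch below assumes the ~/work/envs/r-4.2 path used in the example and a hypothetical helper env_exists. It relies on the fact that a conda environment directory contains a conda-meta subdirectory.

```shell
#!/bin/bash
# Path taken from the example; adjust it to your own environment.
ENV_PREFIX="${HOME}/work/envs/r-4.2"

# Hypothetical helper: a directory is treated as a conda environment
# when it contains a 'conda-meta' subdirectory.
env_exists() {
    [ -d "$1/conda-meta" ]
}

if env_exists "${ENV_PREFIX}"; then
    echo "Environment found at ${ENV_PREFIX}"
else
    echo "No environment at ${ENV_PREFIX}: create it first with 'conda create'"
fi
```

When the environment is found, the script can continue with the module load and source activate lines of the example.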

How can I have same `R` version(s) on my computer and the cluster?¶

There are two ways:

Use rig on your computer to install the same R version as the one used on the cluster. If you use Positron as an editor, you can also use a conda/pixi env.
Use conda/pixi on the cluster to install the same R version as the one available on your own computer.
