Bioinfo tips¶

How can I publish my sequencing reads?¶

If you want to publish results that implies sequencing, like long or short reads, you have to publish your data.

For genomic data we recommend to publish it to ENA, the European equivalent of SRA.
Count tables, like the ones obtained from RNASeq, can be submitted to ArrayExpress

Data can be alternatively published to NCBI. In this case, please use Short Read Archive (SRA) for sequences and Gene Expression Omnibus (GEO) for count tables.

How can I download dataset from ENA?¶

By using enaBrowserTools¶

If you have an SRA or an ENA id, you can use enaBrowserTools to fetch the dataset. In the following example, we will use the ENA id ERR2512031.

module load bioinfo/enaBrowserTools/1.7.1enaDataGet -f fastq -d ~/work/my-project/data/fastq ERR2512031Checking availability of https://www.ebi.ac.uk/ena/browser/api/xml/ERR2512031
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR251/001/ERR2512031/ERR2512031_1.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR251/001/ERR2512031/ERR2512031_2.fastq.gz
Completedtree ~/work/my-project/data/fastq/~/work/my-project/data/fastq/
└── ERR2512031
├── ERR2512031_1.fastq.gz
└── ERR2512031_2.fastq.gz

You can also rely on aspera to download dataset:

module load bioinfo/enaBrowserTools/1.7.1enaDataGet -f fastq -as ${VIRTUAL_ENV}/../aspera_settings.ini \ -d ~/work/my-project/data/fastq ERR2512031...
Completedtree ~/work/my-project/data/fastq/~/work/my-project/data/fastq/
└── ERR2512031
├── ERR2512031_1.fastq.gz
├── ERR2512031_2.fastq.gz
└── logs
├── aspera-scp-transfer.0.log
└── aspera-scp-transfer.log

By using ftp¶

Find your dataset at https://www.ebi.ac.uk/ena/.

Here we will use the dataset SRR1045853 as an example.

Find the FASTQ url in columns. When writing this documentation, the column name was "Generated FASTQ files: FTP". Right click on the link and copy it.

For the SRR1045853 example, it was:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/003/SRR1045853/SRR1045853.fastq.gz

From this link, you can download the file on Genotoul cluster

with wget:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/003/SRR1045853/SRR1045853.fastq.gz

Or with aspera:

module load tools/aspera/4.1.3.93ascp -QT -P33001 \ -i /tools/others_tools/aspera/aspera-connect_4.1.3.93/connect/etc/asperaweb_id_dsa.openssh \ era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR104/003/SRR1045853/SRR1045853.fastq.gz .

How can I download and convert SRA dataset?¶

If you have a SRA run id, you can use SRA Toolkit to fetch a dataset. In the following example, we will use the SRA id SRR1045854.

# Load module `SRA Toolkit`module load bioinfo/SRA-Toolkit/3.0.2# Download dataset in ~/work/my-project/data/SRA/prefetch -O ~/work/my-project/data/SRA SRR1045854...
1970-01-01T00:00:01 prefetch.3.0.2: 1) 'SRR1045854' was downloaded successfully# Your data will be available `~/work/my-project/data/SRA/SRR1045854/SRR1045854.sra`# You can convert .sra files to .fastq files this way:fastq-dump -O ~/work/my-project/data/fastq \ ~/work/my-project/data/SRA/SRR1045854/SRR1045854.sraRead 171355474 spots for /home/<username>/work/my-project/data/SRA/SRR1045854/SRR1045854.sra
Written 171355474 spots for /home/<username>/work/my-project/data/SRA/SRR1045854/SRR1045854.sra

After conversion, remove the .sra files to save disk space. You should use file from ENA as the protocol is much much much faster and files are already in fastq format.

Read the doc

There are some limitations like the maximum-size limit. Please read the SRA Toolkit documentation to know how to overcome them.

How can I generate a `sarray` for single fastq files?¶

This version is for single fastq files. For paired fastq files, please use the paired files version defined below.

Goal¶

The goal is to create a file with one job per line. It implies that all commands of a job must be written on the same line. The following example aims to explain how to construct such a file.

The construction logic is:

Get dependencies a job needs, like required modules.
For one fastqc file, compose the list of commands of the job and check it on one (small) file.
Transform the job into a template by using variables and check it again with the same (small) file.
Create a for loop to generate all the job from the template.
Run the jobs with sarray

Get dependencies¶

First, we search for module we need (here the FastQC module).

search_module fastqcbioinfo/FastQC/0.12.1

Compose job for one file¶

Then, we compose the commands for one file as usual.

# Load FastQC modulemodule load bioinfo/FastQC/0.12.1# We run the fastqc command but we stop it with CTRL+C after few seconds# We want to check that the command syntax is correctfastqc MT_rep1_1_Ch6.fastq.gz^C

From those commands, we create a script file, named generate_my_sarray.sh in this example, that does the same thing but in one line. For this purpose, we use the && between commands, which means the next command is run if the current one succeed. In code sections, click on (1) to get comments about commands.

When clicked, a comment appears.

generate_my_sarray.sh
#!/bin/bash
module load bioinfo/FastQC/0.12.1 && fastqc MT_rep1_1_Ch6.fastq.gz # (1)!

With the &&, fastqc MT_rep1_1_Ch6.fastq.gz command is only run if command module load bioinfo/FastQC/0.12.1 succeed.

We check that our script work as excepted.

# Make the script runablechmod +x ./generate_my_sarray.sh# As in previous step, we stop the script after few seconds.# The aim is to check that the command syntax is correct./generate_my_sarray.sh^C

Transform the job into a template¶

We transform now the script into a template by using a variable in place of the filename:

generate_my_sarray.sh
#!/bin/bash
FASTQ_FILE="MT_rep1_1_Ch6.fastq.gz" # (1)!
module load bioinfo/FastQC/0.12.1 && fastqc ${FASTQ_FILE} # (2)!

We declare a variable FASTQ_FILE that is set to the previous filename for testing purpose.
Same command as previous, but that uses the variable FASTQ_FILE instead the filename directly.

We check that the job always works correctly.

# As in previous step, we stop the script after few seconds.# The aim is (again) to check that the command syntax is correct./generate_my_sarray.sh^C

Now we confirm that our one liner job is correct, we want to get its computed list of commands. For this purpose, we use echo to display it.

generate_my_sarray.sh
#!/bin/bash
FASTQ_FILE="MT_rep1_1_Ch6.fastq.gz"
echo "module load bioinfo/FastQC/0.12.1 && fastqc ${FASTQ_FILE}" # (1)!

We surround our commands with double quotes " and put echo before.

We check that the command displayed is the one we except.

./generate_my_sarray.shmodule load bioinfo/FastQC/0.12.1 && fastqc MT_rep1_1_Ch6.fastq.gz

Apply template to all files¶

Now we can apply this template to all our fastq files by using a for loop.

generate_my_sarray.sh
#!/bin/bash
for FASTQ_FILE in *.fastq.gz; do  # (1)!
  echo "module load bioinfo/FastQC/0.12.1 && fastqc ${FASTQ_FILE}"
done

The part *.fastq.gz will list all the file ending by .fastq.gz. For each iteration of the for loop, the variable FASTQ_FILE will takes a value of this list.

We generate all the commands and write them into a script file named my_fastqc_commands.sh by using the shell redirection >. It enables to redirect the content display by generate_my_sarray.sh into the file my_fastqc_commands.sh instead of terminal.

./generate_my_sarray.sh > my_fastqc_commands.sh

Open the my_fastqc_commands.sh file content and check everything is good. Each line of the file must be the list of commands of one job as defined in template

Run the jobs¶

Finally, submit the fastqc commands in parallel with sarray.

sarray -J fastqcjob -o %j.out -e %j.err -t 01:00:00 --mem=2G \ --mail-type=BEGIN,END,FAIL my_fastqc_commands.sh

As usual, submission command must be adapted to your needs, and execution can be checked with squeue command

squeue -u <username>

How can I generate a `sarray` for paired fastq files?¶

This version is for paired fastq files. For single fastq files, please use the single file version defined above.

Goal¶

As in single fastq files version, the goal is to create a file with one job per line. It implies that all commands of a job must be written on the same line. The following example aims to explain how to construct such a file.

The construction logic is:

Get dependencies a job needs, like required modules.
For one fastqc file, compose the list of commands of the job and check it on one (small) file.
Transform the job into a template by using variables and check it again with the same (small) file.
Create a for loop to generate all the job from the template.
Run the jobs with sarray

Get dependencies¶

First, we search for module we need (here the STAR module).

search_module STARbioinfo/STAR/2.7.11b
bioinfo/STAR-Fusion/1.5.0

Compose job for one file¶

Then, we compose the commands for one file as usual.

# Load STAR modulemodule load bioinfo/STAR/2.7.11b# We run the STAR command but we stop it with CTRL+C after few seconds# We want to check that the command syntax is correctSTAR --genomeDir star-index \ --readFilesIn samplename.R1.fastq.gz samplename.R2.fastq.gz \ --outFileNamePrefix samplename [options]...^C

From those commands, we create a script file, named generate_my_sarray.sh in this example, that does the same thing but in one line. For this purpose, we use:

&& between commands, which means the next command is run if the current one succeed,
\ at the end of lines, which allows to spread a command over several lines and improves readability.

generate_my_sarray.sh
#!/bin/bash
module load bioinfo/STAR/2.7.11b && STAR --genomeDir star-index \
  --readFilesIn samplename.R1.fastq.gz samplename.R2.fastq.gz \
  --outFileNamePrefix samplename [options]...

We check that our script work as excepted.

# Make the script runablechmod +x ./generate_my_sarray.sh# As in previous step, we stop the script after few seconds.# The aim is to check that the command syntax is correct./generate_my_sarray.sh^C

Transform the job into a template¶

We transform now the script into a template by using a variable in place of the filename:

generate_my_sarray.sh
#!/bin/bash
R1="samplename.R1.fastq.gz" # (1)!
R2="samplename.R2.fastq.gz"
module load bioinfo/STAR/2.7.11b && STAR --genomeDir star-index \
  --readFilesIn ${R1} ${R2} \
  --outFileNamePrefix ${R1%.R1.fastq.gz} [options]... # (2)!

We declare variables R1 and R2 that are set to the previous filenames for testing purpose.
Same command as previous, but that uses the variables R1 and R2 as input to --readFilesIn option, and ${R1%.R1.fastq.gz} as output file prefix for option --outFileNamePrefix. The syntax ${R1%.R1.fastq.gz} produce samplename value, which is ${R1} value samplename.R1.fastq.gz where .R1.fastq.gz part is removed.

We check that the job always works correctly.

# As in previous step, we stop the script after few seconds.# The aim is (again) to check that the command syntax is correct./generate_my_sarray.sh^C

Now we confirm that our one liner job is correct, we want to get its computed list of commands. For this purpose, we use echo to display it.

generate_my_sarray.sh
#!/bin/bash
R1="samplename.R1.fastq.gz"
R2="samplename.R2.fastq.gz"
echo "module load bioinfo/STAR/2.7.11b && STAR --genomeDir star-index \
  --readFilesIn ${R1} ${R2} \
  --outFileNamePrefix ${R1%.R1.fastq.gz} [options]..." # (1)!

We surround our commands with double quotes " and put echo before.

We check that the command displayed is the one we except.

./generate_my_sarray.sh

Apply template to all files¶

As we work on pairs of files, we need to list them by pairs. This is the purpose of the paste command. The result of this command, for n samples would be:

ls /directory/*.fastq.gz | paste - -/directory/sample1.R1.fastq.gz /directory/sample1.R2.fastq.gz
/directory/sample2.R1.fastq.gz /directory/sample2.R2.fastq.gz
/directory/sample3.R1.fastq.gz /directory/sample3.R2.fastq.gz
...

We can then use this list of pairs of files with the template in such a way:

generate_my_sarray.sh
#!/bin/bash
FASTQ_PAIRS="$(ls /directory/*.fastq.gz | paste - -)" # (1)!
echo "${FASTQ_PAIRS}" | while read -r R1 R2 || [ -n "$R2" ]; do # (2)!
echo "module load bioinfo/STAR/2.7.11b && STAR --genomeDir star-index \
  --readFilesIn ${R1} ${R2} \
  --outFileNamePrefix ${R1%.R1.fastq.gz} [options]..."
done

We generate the list of pair of files as described before.
We loop over the list of pairs computed with the read function and attributes them respectively to variables R1 and R2. Contrary to for loop, the while read approach allows to manage multiple variables in an iteration.
In addition, the part || [ -n "$R2" ] manages last pair that can be forgotten in particular case. (1)
1. The read function uses new line to identify the end of a groups of variables. Depending how the list is generated, newline may missing at the end of the file listing and will be ignored.

We generate all the commands and write them into a script file named my_STAR_commands.sh by using the shell redirection >. It enables to redirect the content display by generate_my_sarray.sh into the file my_STAR_commands.sh instead of terminal.

./generate_my_sarray.sh > my_STAR_commands.sh

Open the my_STAR_commands.sh file content and check everything is good. Each line of the file must be the list of commands of one job as defined in template

Run the jobs¶

Finally, submit the STAR commands in parallel with sarray.

sarray -J startjob -o %j.out -e %j.err -t 01:00:00 --mem=2G \ --mail-type=BEGIN,END,FAIL my_STAR_commands.sh

As usual, submission command must be adapted to your needs, and execution can be checked with squeue command

squeue -u <username>

Alternative perl version¶

You can also use a one line perl program to create a list of commands from each pair of files.

ls /directory/*.fastq.gz | paste - - | \
perl -lane '($sample)=$F[0]=~/(.*).fastq.gz/; \
print "module load XXX; STAR --genomeDir star-index --readFilesIn $F[0] $F[1] --outFileNamePrefix $sample ...."' > my_STAR_commands.sh

Explanations:

The perl -lane part will run a perl command on each line. The 'enter' at end of line will be removed and the line will be split on spaces into an array named $F[]. In the example, each filename is then a cell of the array:
- The first cell $F[0] contains the R1 filename
- The second cell $F[1] contains the R2 filename
The perl command contains two instructions:
- The first one, ($sample)=$F[0]=~/(.*).fastq.gz/, will capture the sample name part from the R1 filename ($F[0]) by using a regular expression.
- The second one that will generate and print the command line by using variables: $F[0] (file R1), $F[1] (file R2) and $sample (sample name) obtained before.
Once the lines generated are what you want, redirect the output into a file named my_STAR_commands.sh by using >.

Bioinfo tips¶

How can I publish my sequencing reads?¶

How can I download dataset from ENA?¶

By using enaBrowserTools¶

By using ftp¶

How can I download and convert SRA dataset?¶

How can I generate a sarray for single fastq files?¶

Goal¶

Get dependencies¶

Compose job for one file¶

Transform the job into a template¶

Apply template to all files¶

Run the jobs¶

How can I generate a sarray for paired fastq files?¶

Goal¶

Get dependencies¶

Compose job for one file¶

Transform the job into a template¶

Apply template to all files¶

Run the jobs¶

Alternative perl version¶

How can I generate a `sarray` for single fastq files?¶

How can I generate a `sarray` for paired fastq files?¶