Bioinfo tips¶
How can I publish my sequencing reads?¶
If you want to publish results that implies sequencing, like long or short reads, you have to publish your data.
- For genomic data we recommend to publish it to ENA, the European equivalent of SRA.
- Count tables, like the ones obtained from RNASeq, can be submitted to ArrayExpress
Data can be alternatively published to NCBI. In this case, please use Short Read Archive (SRA) for sequences and Gene Expression Omnibus (GEO) for count tables.
How can I download dataset from ENA?¶
By using enaBrowserTools¶
If you have an SRA or an ENA id, you can use enaBrowserTools to fetch the dataset. In the following example, we will use the ENA id ERR2512031.
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR251/001/ERR2512031/ERR2512031_1.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR251/001/ERR2512031/ERR2512031_2.fastq.gz
Completedtree ~/work/my-project/data/fastq/~/work/my-project/data/fastq/
└── ERR2512031
├── ERR2512031_1.fastq.gz
└── ERR2512031_2.fastq.gz
You can also rely on aspera to download dataset:
Completedtree ~/work/my-project/data/fastq/~/work/my-project/data/fastq/
└── ERR2512031
├── ERR2512031_1.fastq.gz
├── ERR2512031_2.fastq.gz
└── logs
├── aspera-scp-transfer.0.log
└── aspera-scp-transfer.log
By using ftp¶
Find your dataset at https://www.ebi.ac.uk/ena/.
Here we will use the dataset SRR1045853 as an example.
Find the FASTQ url in columns. When writing this documentation, the column name was "Generated FASTQ files: FTP". Right click on the link and copy it.
For the SRR1045853 example, it was:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/003/SRR1045853/SRR1045853.fastq.gz
From this link, you can download the file on Genotoul cluster
- with
wget:
- Or with
aspera:
How can I download and convert SRA dataset?¶
If you have a SRA run id, you can use SRA Toolkit to fetch a dataset. In the following example, we will use the SRA id SRR1045854.
1970-01-01T00:00:01 prefetch.3.0.2: 1) 'SRR1045854' was downloaded successfully# Your data will be available `~/work/my-project/data/SRA/SRR1045854/SRR1045854.sra`# You can convert .sra files to .fastq files this way:fastq-dump -O ~/work/my-project/data/fastq \ ~/work/my-project/data/SRA/SRR1045854/SRR1045854.sraRead 171355474 spots for /home/<username>/work/my-project/data/SRA/SRR1045854/SRR1045854.sra
Written 171355474 spots for /home/<username>/work/my-project/data/SRA/SRR1045854/SRR1045854.sra
After conversion, remove the .sra files to save disk space.
You should use file from ENA as the protocol is much much much faster and files are already in fastq format.
Read the doc
There are some limitations like the maximum-size limit. Please read the SRA Toolkit documentation to know how to overcome them.
How can I generate a sarray for single fastq files?¶
This version is for single fastq files. For paired fastq files, please use the paired files version defined below.
Goal¶
The goal is to create a file with one job per line. It implies that all commands of a job must be written on the same line. The following example aims to explain how to construct such a file.
The construction logic is:
- Get dependencies a job needs, like required modules.
- For one fastqc file, compose the list of commands of the job and check it on one (small) file.
- Transform the job into a template by using variables and check it again with the same (small) file.
- Create a
forloop to generate all the job from the template. - Run the jobs with
sarray
Get dependencies¶
First, we search for module we need (here the FastQC module).
Compose job for one file¶
Then, we compose the commands for one file as usual.
From those commands, we create a script file, named generate_my_sarray.sh in this example, that does the same thing but in one line. For this purpose, we use the && between commands, which means the next command is run if the current one succeed. In code sections, click on (1) to get comments about commands.
- When clicked, a comment appears.
| generate_my_sarray.sh | |
|---|---|
1 2 | |
- With the
&&,fastqc MT_rep1_1_Ch6.fastq.gzcommand is only run if commandmodule load bioinfo/FastQC/0.12.1succeed.
We check that our script work as excepted.
Transform the job into a template¶
We transform now the script into a template by using a variable in place of the filename:
| generate_my_sarray.sh | |
|---|---|
1 2 3 | |
- We declare a variable
FASTQ_FILEthat is set to the previous filename for testing purpose. - Same command as previous, but that uses the variable
FASTQ_FILEinstead the filename directly.
We check that the job always works correctly.
Now we confirm that our one liner job is correct, we want to get its computed list of commands.
For this purpose, we use echo to display it.
| generate_my_sarray.sh | |
|---|---|
1 2 3 | |
- We surround our commands with double quotes
"and putechobefore.
We check that the command displayed is the one we except.
Apply template to all files¶
Now we can apply this template to all our fastq files by using a for loop.
| generate_my_sarray.sh | |
|---|---|
1 2 3 4 | |
- The part
*.fastq.gzwill list all the file ending by.fastq.gz. For each iteration of the for loop, the variableFASTQ_FILEwill takes a value of this list.
We generate all the commands and write them into a script file named my_fastqc_commands.sh by using the shell redirection >. It enables to redirect the content display by generate_my_sarray.sh into the file my_fastqc_commands.sh instead of terminal.
Open the my_fastqc_commands.sh file content and check everything is good. Each line of the file must be the list of commands of one job as defined in template
Run the jobs¶
Finally, submit the fastqc commands in parallel with sarray.
As usual, submission command must be adapted to your needs, and execution can be checked with squeue command
How can I generate a sarray for paired fastq files?¶
This version is for paired fastq files. For single fastq files, please use the single file version defined above.
Goal¶
As in single fastq files version, the goal is to create a file with one job per line. It implies that all commands of a job must be written on the same line. The following example aims to explain how to construct such a file.
The construction logic is:
- Get dependencies a job needs, like required modules.
- For one fastqc file, compose the list of commands of the job and check it on one (small) file.
- Transform the job into a template by using variables and check it again with the same (small) file.
- Create a
forloop to generate all the job from the template. - Run the jobs with
sarray
Get dependencies¶
First, we search for module we need (here the STAR module).
bioinfo/STAR-Fusion/1.5.0
Compose job for one file¶
Then, we compose the commands for one file as usual.
From those commands, we create a script file, named generate_my_sarray.sh in this example, that does the same thing but in one line. For this purpose, we use:
&&between commands, which means the next command is run if the current one succeed,\at the end of lines, which allows to spread a command over several lines and improves readability.
| generate_my_sarray.sh | |
|---|---|
1 2 3 4 | |
We check that our script work as excepted.
Transform the job into a template¶
We transform now the script into a template by using a variable in place of the filename:
| generate_my_sarray.sh | |
|---|---|
1 2 3 4 5 6 | |
- We declare variables
R1andR2that are set to the previous filenames for testing purpose. - Same command as previous, but that uses the variables
R1andR2as input to--readFilesInoption, and${R1%.R1.fastq.gz}as output file prefix for option--outFileNamePrefix. The syntax${R1%.R1.fastq.gz}producesamplenamevalue, which is${R1}valuesamplename.R1.fastq.gzwhere.R1.fastq.gzpart is removed.
We check that the job always works correctly.
Now we confirm that our one liner job is correct, we want to get its computed list of commands.
For this purpose, we use echo to display it.
| generate_my_sarray.sh | |
|---|---|
1 2 3 4 5 6 | |
- We surround our commands with double quotes
"and putechobefore.
We check that the command displayed is the one we except.
Apply template to all files¶
As we work on pairs of files, we need to list them by pairs. This is the purpose of the paste command. The result of this command, for n samples would be:
/directory/sample2.R1.fastq.gz /directory/sample2.R2.fastq.gz
/directory/sample3.R1.fastq.gz /directory/sample3.R2.fastq.gz
...
We can then use this list of pairs of files with the template in such a way:
| generate_my_sarray.sh | |
|---|---|
1 2 3 4 5 6 7 | |
- We generate the list of pair of files as described before.
-
We loop over the list of pairs computed with the
readfunction and attributes them respectively to variablesR1andR2. Contrary toforloop, thewhile readapproach allows to manage multiple variables in an iteration.In addition, the part|| [ -n "$R2" ]manages last pair that can be forgotten in particular case. (1)- The
readfunction uses new line to identify the end of a groups of variables. Depending how the list is generated, newline may missing at the end of the file listing and will be ignored.
- The
We generate all the commands and write them into a script file named my_STAR_commands.sh by using the shell redirection >. It enables to redirect the content display by generate_my_sarray.sh into the file my_STAR_commands.sh instead of terminal.
Open the my_STAR_commands.sh file content and check everything is good. Each line of the file must be the list of commands of one job as defined in template
Run the jobs¶
Finally, submit the STAR commands in parallel with sarray.
As usual, submission command must be adapted to your needs, and execution can be checked with squeue command
Alternative perl version¶
You can also use a one line perl program to create a list of commands from each pair of files.
perl -lane '($sample)=$F[0]=~/(.*).fastq.gz/; \
print "module load XXX; STAR --genomeDir star-index --readFilesIn $F[0] $F[1] --outFileNamePrefix $sample ...."' > my_STAR_commands.sh
Explanations:
- The
perl -lanepart will run aperlcommand on each line. The 'enter' at end of line will be removed and the line will be split on spaces into an array named$F[]. In the example, each filename is then a cell of the array:- The first cell
$F[0]contains the R1 filename - The second cell
$F[1]contains the R2 filename
- The first cell
- The
perlcommand contains two instructions:- The first one,
($sample)=$F[0]=~/(.*).fastq.gz/, will capture the sample name part from the R1 filename ($F[0]) by using a regular expression. - The second one that will generate and print the command line by using variables:
$F[0](file R1),$F[1](file R2) and$sample(sample name) obtained before.
- The first one,
- Once the lines generated are what you want, redirect the output into a file named
my_STAR_commands.shby using>.