What is the NCBI Sequence Read Archive (SRA) Toolkit?

  • The NCBI SRA Toolkit is a set of utilities for quickly downloading, viewing, and searching large volumes of high-throughput sequencing data from the NCBI SRA database

Applications

  • Efficiently download large volumes of high-throughput sequencing data (e.g., FASTQ, SAM)
  • Convert SRA files into other biological file formats (e.g., FASTA, ABI, SAM, QSEQ, SFF)
  • Retrieve small subsets of large files (e.g., sequences, alignments)
  • Search within SRA files and fetch specific sequences
  • Use the Aspera client (ascp) for much faster downloads (the Aspera client must be installed)

Download and install NCBI SRA toolkit

# I am using Ubuntu Linux 16.04.1 LTS
# download the latest compiled binaries of the NCBI SRA Toolkit
# (version 2.9.2, July 24, 2018) for Ubuntu Linux
# for compiled binaries for other operating systems, visit: https://github.com/ncbi/sra-tools/wiki/Downloads
$ wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.2/sratoolkit.2.9.2-ubuntu64.tar.gz
# extract the tar.gz file
$ tar -zxvf sratoolkit.2.9.2-ubuntu64.tar.gz
# add the binaries to PATH with export, or by editing the ~/.bashrc file
$ export PATH=$PATH:/home/renesh/software/sratoolkit.2.9.2-ubuntu64/bin
# the SRA binaries are now on PATH and ready to use

# if you want to use the Aspera client (ascp) with the SRA Toolkit, you need to install it
# download: https://downloads.asperasoft.com/connect2/
# extract and run the installer
# ascp will be installed in the $HOME/.aspera/connect/bin directory
$ ./ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.sh
# add ascp to PATH
$ export PATH=$PATH:/home/renesh/.aspera/connect/bin
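
If you re-run the export lines above in the same session, the same directory is appended to PATH again. Here is a minimal sketch of a helper that adds a directory only once and then reports which tools can be found. The function names add_to_path and check_tools and the SRA_BIN variable are made up for this illustration, and the default directory is an assumption about where you extracted the archive.

```shell
# Append a directory to PATH only if it is not already present.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already on PATH, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

# Report which of the named tools can be found on PATH.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: MISSING"
        fi
    done
}

# SRA_BIN is an assumed location; change it to wherever you extracted the toolkit
SRA_BIN="${SRA_BIN:-$HOME/software/sratoolkit.2.9.2-ubuntu64/bin}"
add_to_path "$SRA_BIN"
check_tools prefetch fastq-dump fasterq-dump vdb-validate
```

Any tool reported as MISSING usually means the bin directory in SRA_BIN does not match your install location.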

Download SRA datasets using NCBI SRA toolkit and Aspera client

# download files: prefetch downloads the SRA file for each SRR accession and saves it in the
# $HOME/ncbi/public/sra directory
$ prefetch  SRR5790106  # for a single file
$ prefetch  SRR5790106 SRR5790104  # multiple files
# prefetch with ascp
# the -a option takes 'path to ascp binary|path to key file'
# the output messages should mention that the download is going via FASP
# you can omit the -a option if ascp is on the system PATH
$ prefetch -a '/home/renesh/.aspera/connect/bin/ascp|/home/renesh/.aspera/connect/etc/asperaweb_id_dsa.openssh' SRR8296149

# convert to FASTQ: fastq-dump will convert SRR5790106.sra to SRR5790106.fastq
$ fastq-dump  SRR5790106  # single file
$ fastq-dump  SRR5790106  SRR5790104 # multiple files
# you can replace fastq-dump with fasterq-dump (available in version 2.9.2), which is much faster
# and more efficient for large datasets
$ fasterq-dump  SRR5790106  
# for paired-end data, use the --split-files (fastq-dump) or -S (fasterq-dump) option
$ fastq-dump --split-files SRR8296149
$ fasterq-dump -S SRR8296149
# download alignment files (SAM)
# make sure the corresponding accession has an alignment file in the SRA database
$ sam-dump --output-file SRR1236468.sam SRR1236468

NOTE: With fastq-dump and fasterq-dump, the prefetch step is unnecessary; you can download sequence data directly in FASTQ format
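
When you have many accessions, typing them on the command line gets tedious. The sketch below loops over a list file (one accession per line) and runs prefetch followed by fasterq-dump for each entry. The helper names run and fetch_accessions, the DRY_RUN variable, and the file name accessions.txt are all made up for this illustration. The prefetch step is kept here only so that a local .sra copy exists; drop it if you do not need one.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@" < /dev/null    # keep the tool from consuming the list on stdin
    fi
}

# Download and convert every accession listed in a file (one per line).
# Blank lines and lines starting with '#' are skipped.
fetch_accessions() {
    list_file=$1
    while IFS= read -r acc || [ -n "$acc" ]; do
        case "$acc" in
            ''|'#'*) continue ;;
        esac
        run prefetch "$acc" && run fasterq-dump "$acc" \
            || echo "failed: $acc" >&2
    done < "$list_file"
}
```

To preview the commands without downloading anything, set DRY_RUN=1 before calling fetch_accessions accessions.txt; unset it to run the downloads for real.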

Validation of downloaded SRA data integrity

It is essential to check the integrity and checksums of SRA datasets to confirm that the download succeeded

# download FASTQ file
$ fasterq-dump  SRR5790104  
# check the integrity of the downloaded SRR5790104.sra file
# (vdb-validate checks the SRA file, not the FASTQ output)
# vdb-validate should report 'ok' and 'consistent' for all checks
# Note: make sure you have the .sra (not .cache) file for the corresponding accession in the
# $HOME/ncbi/public/sra directory
$ vdb-validate SRR5790104
2018-12-11T22:59:01 vdb-validate.2.9.2 info: Validating '/home/renesh/ncbi/public/sra/SRR5790104.sra'...
2018-12-11T22:59:01 vdb-validate.2.9.2 info: Database 'SRR5790104.sra' metadata: md5 ok
2018-12-11T22:59:01 vdb-validate.2.9.2 info: Table 'SEQUENCE' metadata: md5 ok
2018-12-11T22:59:01 vdb-validate.2.9.2 info: Column 'QUALITY': checksums ok
2018-12-11T22:59:01 vdb-validate.2.9.2 info: Column 'READ': checksums ok
2018-12-11T22:59:01 vdb-validate.2.9.2 info: Column 'READ_LEN': checksums ok
2018-12-11T22:59:01 vdb-validate.2.9.2 info: Column 'READ_START': checksums ok
2018-12-11T22:59:01 vdb-validate.2.9.2 info: Database '/home/renesh/ncbi/public/sra/SRR5790104.sra' contains only unaligned reads
2018-12-11T22:59:01 vdb-validate.2.9.2 info: Database 'SRR5790104.sra' is consistent
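
Reading the log by eye works for one file, but in a script you may want a pass/fail answer. Here is a rough sketch that scans a vdb-validate log and succeeds only if it ends up "consistent" with no error lines. The function name validation_ok is made up for this illustration. It assumes the log looks like the one shown above and that failures are flagged with an " err: " level, which is an assumption on my part, so check it against your own output.

```shell
# Read a vdb-validate log on stdin; succeed only if an "is consistent"
# line is present and no line carries the (assumed) " err: " level.
validation_ok() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q ' err: '; then
        return 1
    fi
    printf '%s\n' "$log" | grep -q 'is consistent'
}
```

A possible use is vdb-validate SRR5790104 2>&1 | validation_ok && echo "SRR5790104 looks fine". The 2>&1 is there because the log may be written to stderr rather than stdout. It is also worth checking the exit status of vdb-validate itself.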

Customized download of SRA datasets

You can use the SRA tools to produce customized output from large SRA datasets without downloading the complete datasets (NOTE: some options are not available in fasterq-dump)

# print the first 10 reads from a single-end FASTQ file
# the -Z option prints the output to the screen (STDOUT)
$ fastq-dump -X 10 -Z SRR5790106
# save the FASTQ file to a specified directory
$ fastq-dump -O temp SRR5790106
$ fasterq-dump -O temp SRR5790106
# compress the FASTQ file with gzip or bzip2
$ fastq-dump --gzip SRR5790106  
$ fastq-dump --bzip2 SRR5790106
# multithreading (-e sets the number of threads)
$ fasterq-dump -e 10 SRR5790106  
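
Since -Z writes FASTQ to STDOUT, you can pipe a small sample straight into other commands. As a quick sanity check, here is a small sketch that counts the reads and reports the mean read length. The function name fastq_summary is made up for this illustration, and it assumes plain four-line FASTQ records (header, sequence, '+', quality).

```shell
# Count reads and mean read length for 4-line FASTQ records on stdin.
fastq_summary() {
    awk '
        NR % 4 == 2 { n++; total += length($0) }   # sequence lines only
        END {
            if (n > 0) {
                printf "reads=%d mean_len=%.1f\n", n, total / n
            } else {
                print "reads=0"
            }
        }
    '
}
```

For example, fastq-dump -X 10 -Z SRR5790106 | fastq_summary should report 10 reads if the accession has at least that many.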

Convert SRA data into other biological formats

SRA tools allow you to convert SRA files into FASTA, ABI, Illumina native (QSEQ), and SFF formats

# convert to FASTA
$ fastq-dump --fasta SRR5790106  
# convert to ABI (CSFASTA and QVAL)
$ abi-dump  SRR5790106  
# convert to QSEQ 
# SRA database should have alignment information submitted for corresponding accession 
$ illumina-dump --qseq 2 SRR1236472 # 2 for paired-end and 1 for single-end
# convert to SFF 
# SFF is a binary file format related to 454 high-throughput sequencing
$ sff-dump SRR996630
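
If you switch between formats often, it can help to keep the commands from this post in one place. The sketch below only prints the command for a given format and accession, and it does not run anything. The function name dump_command and the format keywords are made up for this illustration; the commands themselves are the ones used in this post.

```shell
# Print the conversion command for a given output format and accession.
# Nothing is executed; copy the printed command once it looks right.
dump_command() {
    fmt=$1
    acc=$2
    case "$fmt" in
        fastq) echo "fasterq-dump $acc" ;;
        fasta) echo "fastq-dump --fasta $acc" ;;
        abi)   echo "abi-dump $acc" ;;
        qseq)  echo "illumina-dump --qseq 2 $acc" ;;
        sff)   echo "sff-dump $acc" ;;
        sam)   echo "sam-dump --output-file $acc.sam $acc" ;;
        *)     echo "unsupported format: $fmt" >&2; return 1 ;;
    esac
}
```

For example, dump_command fasta SRR5790106 prints the fastq-dump --fasta command for that accession, and an unknown format returns a non-zero status.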

Search within SRA files

You can search for specific sequences or subsets of sequences in SRA files

# search within SRA files
# output will be sequence read IDs 
$ sra-search  GATGCCGCGCC SRR5790104
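
To spot-check a hit on a FASTQ file you already have, a plain substring search is often enough. Here is a minimal sketch that prints the IDs of reads containing a motif. The function name fastq_grep is made up for this illustration. It is not how sra-search works internally; it only finds exact matches on the read as written and assumes four-line FASTQ records.

```shell
# Print the header (without the leading "@") of every read on stdin
# whose sequence contains the given motif as an exact substring.
fastq_grep() {
    awk -v motif="$1" '
        NR % 4 == 1 { id = substr($0, 2) }             # remember the read header
        NR % 4 == 2 && index($0, motif) { print id }   # sequence line contains motif
    '
}
```

For example, fastq-dump -X 1000 -Z SRR5790104 | fastq_grep GATGCCGCGCC lists matching reads among the first 1000 without saving a file.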

NOTE: For any SRA tool, you can list all options with the -h parameter (e.g., fasterq-dump -h)

How to cite?

Bedre, R. “How to use NCBI SRA Toolkit effectively?” Renesh Bedre (blog), December 12, 2018, https://reneshbedre.github.io/blog/fqutil.html.

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

Last updated: December 12, 2018