HTSeqQC: Quality control analysis for high-throughput sequencing data (HTS)

Renesh Bedre        6 minute read

What is HTSeqQC?

  • HTSeqQC is an automated quality control analysis tool for a single and paired-end high-throughput sequencing data (HTS) generated from Illumina sequencing platforms.
  • HTSeqQC can be used to filter and/or trim sequence reads for low-quality sequences, adapter or primer contamination, and uncalled bases (N) to obtain high-quality data for downstream bioinformatics analysis

HTSeqQC Features

  • Simultaneously filter and/or trim reads for adapter or primer contamination, uncalled bases (N), and low-quality reads
  • Supports single and paired-end reads
  • Analyze multiple samples simultaneously
  • Parallel computation for accelerating the speed of analysis
  • Visualization and statistics
  • Docker image is available
  • Available on CyVerse Discovery Environment (DE)
  • No dependency on an external open-source tool

Getting Started

Prerequisites

HTSeqQC requires

  • Python 3 (tested on 3.6 and 3.7)
  • NumPy
  • pysam
  • termcolor
  • datetime
  • matplotlib

Installing

Install using git

# download and install HTSeqQC v1.0
git clone https://github.com/reneshbedre/HTSeqQC.git
cd HTSeqQC
python setup.py install

How to use

Print help message to see all required and optional parameters,

filter.py -h 

usage: filter.py [-h] [-a INPUT_FILES_1] [-b INPUT_FILES_2] [-c QUAL_FMT]
                 [-e N_CONT] [-f ADPT_SEQS] [-d MIN_SIZE] [-g ADPT_MATCH]
                 [-i QUAL_THRESH] [-n TRIM_OPT] [-p WIND_SIZE]
                 [-r MIN_LEN_FILT] [-q CPU] [-m OUT_FMT] [-v VIS_OPT]
                 [--version]

Quality control analysis of single and paired-end sequence data

optional arguments:
  -h, --help            show this help message and exit
  -a INPUT_FILES_1, --p1 INPUT_FILES_1
                        Single end input files or left files for paired-end
                        data (.fastq, .fq). Multiple sample files must be
                        separated by comma or space
  -b INPUT_FILES_2, --p2 INPUT_FILES_2
                        Right files for paired-end data (.fastq, .fq).
                        Multiple files must be separated by comma or space
  -c QUAL_FMT, --qfmt QUAL_FMT
                        Quality value format [1= Illumina 1.8, 2= Illumina
                        1.3,3= Sanger]. If quality format not provided, it
                        will automatically detect based on sequence data
  -e N_CONT, --nb N_CONT
                        Filter the reads containing given % of uncalled bases
                        (N)
  -f ADPT_SEQS, --adp ADPT_SEQS
                        Trim the adapter and truncate the read sequence
                        (multiple adapter sequences must be separated by
                        comma)
  -d MIN_SIZE, --msz MIN_SIZE
                        Filter the reads which are lesser than minimum size
  -g ADPT_MATCH, --per ADPT_MATCH
                        Truncate the read sequence if it matches to adapter
                        sequence equal or more than given percent (0.0-1.0)
                        [default=0.9]
  -i QUAL_THRESH, --qthr QUAL_THRESH
                        Filter the read sequence if average quality of bases
                        in reads is lower than threshold (1-40) [default:20]
  -n TRIM_OPT, --trim TRIM_OPT
                        If trim option set to True, the reads with low quality
                        (as defined by option --qthr) will be trimmed instead
                        of discarding [True|False] [default: False]
  -p WIND_SIZE, --wsz WIND_SIZE
                        The window size for trimming (5->3) the reads. This
                        option should always set when -trim option is defined
                        [default: 5]
  -r MIN_LEN_FILT, --mlk MIN_LEN_FILT
                        Minimum length of the reads to retain after trimming
  -q CPU, --cpu CPU     Number of CPU [default:2]
  -m OUT_FMT, --ofmt OUT_FMT
                        Output file format (fastq/fasta) [default:fastq]
  -v VIS_OPT, --no-vis VIS_OPT
                        No figures will be produced [True|False]
                        [default:False]
  --version             show program's version number and exit

HTSeqQC Analysis commands and output

Download the test paired-end dataset (Published in Kandel et al., 2019)

Run HTSeqQC on paired-end sequence data,

# once successfully downloaded, use HTSeqQC for quality filtering
# with default command and --cpu 6
filter.py --cpu 18 --p1 1_S1_L001_R1_001.fastq --p2 1_S1_L001_R2_001.fastq
# verbose output while program running
# [2020-03-30 21:49:59.738239] The fastq quality format is not provided therefore detecting the fastq variant...
# [2020-03-30 21:50:00.677720] The fastq quality format is illumina 1.8+
# [2020-03-30 21:50:01.618438] Preparing the data for analysis...
# [2020-03-30 21:50:16.445633] Started Filtering of the reads data...
# [2020-03-30 21:54:22.070608] Finished filtering of data successfully...
# [2020-03-30 21:54:22.071288] Output saved in /scratch/user/ren_net/software/HTSeqQC/test/1_S1_L001_R1_001_filtering_out

Once the quality control analysis is completed, HTSeqQC generates following files for quality evaluations in the folder 1_S1_L001_R1_001_filtering_out

Files Description
Statistics.txt Detailed statistics of quality control evaluations for provided parameters
1_S1_L001_R1_001_Clean.fastq Cleaned sequence data in FASTQ format (left file for paired-end data)
1_S1_L001_R2_001_Clean.fastq Cleaned sequence data in FASTQ format (right file for paired-end data)
1_S1_L001_R1_001_Qualdist.png Comparative sequence PHRED quality distribution for raw and cleaned data (for left file)
1_S1_L001_R2_001_Qualdist.png Comparative sequence PHRED quality distribution for raw and cleaned data (for right file)
1_S1_L001_R1_001_GCdist.png Comparative percentage GC content distribution for raw and cleaned data (for left file)
1_S1_L001_R2_001_GCdist.png Comparative percentage GC content distribution for raw and cleaned data (for right file)
1_S1_L001_R1_001_QualGroup.png Group-wise sequence PHRED quality distribution for cleaned data (for left file)
1_S1_L001_R2_001_QualGroup.png Group-wise for sequence PHRED quality distribution for cleaned data (for right file)
1_S1_L001_R1_001_Basedist.png Nucleotide base content for raw and cleaned data (for left file)
1_S1_L001_R2_001_Basedist.png Nucleotide base content for raw and cleaned data (for right file)
Command.log Commands used for the HTSeqQC analysis

Detailed statistics of quality control evaluations (Statistics.txt),

Comparative sequence PHRED quality distribution for raw and cleaned data (1_S1_L001_R1_001_Qualdist.png and 1_S1_L001_R2_001_Qualdist.png),

Comparative percentage GC content distribution for raw and cleaned data (1_S1_L001_R1_001_GCdist.png and 1_S1_L001_R2_001_GCdist.png),

Group-wise sequence PHRED quality distribution for cleaned data (1_S1_L001_R1_001_QualGroup.png and 1_S1_L001_R2_001_QualGroup.png),

Nucleotide base content for raw and cleaned data (1_S1_L001_R1_001_Basedist.png and 1_S1_L001_R2_001_Basedist.png),

Run HTSeqQC by changing the parameters for different quality thresholds, adapters, and uncalled bases content. All parameters can be checked using filter.py -h

# for paired end data with quality threshold, adapter sequences, and uncalled based parameters 
filter.py --cpu 18 --qthr 25 --nb 5 --adp AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT 
    --p1 1_S1_L001_R1_001.fastq --p2 1_S1_L001_R2_001.fastq

Run HTSeqQC for multiple files simultaneously

# for paired end data with default parameter and multiple samples
filter.py --cpu 18 --p1 1_S1_L001_R1_001.fastq,2_S1_L001_R1_001.fastq,3_S1_L001_R1_001.fastq 
    --p2 1_S1_L001_R2_001.fastq,2_S1_L001_R2_001.fastq,3_S1_L001_R2_001.fastq
    
# Note: 2 and 3 files are not provided with datatsets. These are provided in commands
# for illustration purpose only

Similarly, HTSeqQC can be used for running single-end datasets (just exclude --p2 parameter)

Run HTSeqQC using Nextflow

  • Nextflow can also be used to run HTSeqQC analysis. HTSeqQC package also comes with Netxflow template (nextflow_template.nf) for running the analysis
  • If you have not installed Nextflow, you can install it as curl -s https://get.nextflow.io | bash. Nextflow works best with java version between 8 to 12. You can check the current java version as java -version
  • HTSeqQC can be used to run multiples files by providing each file name explicitly in the command line. Nextflow can automate this process by providing file patterns and thus avoiding providing each file name to HTSeqQC.
  • You can also run nextflow with distributed computing (run on several computing nodes) on high-performance computers (using mpirun)
  • For example, if you have three paired-end files (total 6 FASTQ files) with filename ending pattern of R1_001.fastq R2_001.fastq (check above HTSeqQC for multiple files commands), then you can edit the nextflow template like below,
// input data directory path
DATAP = "/scratch/user/ren_net/software/HTSeqQC/test"
// input file extension
FileExtension="fastq"
// Check paired end files (for single end files remove 2)
// first should check the file name structure before using
allReads="${DATAP}/*_{R1_001.fastq,R2_001.fastq}.${FileExtension}"


log.info """\
         HTSeqQC Quality Filtering
         =============================
         reads : ${DATAP}
         """
         .stripIndent()


// for paired end reads
Channel
    .fromFilePairs(allReads)
    .ifEmpty { error "Cannot find any reads matching: ${DATAP}" }
    .set { read_pairs }


// filter data
process filterData {
    tag "$pair_id"

    input:
    set pair_id, file(reads) from read_pairs

    script:
    """
    # you can change HTSeqQC parameter here
    filter.py --cpu 18 --p1 ${reads[0]} --p2 ${reads[1]}
    """
}

Once you edit the nextflow template file, run it as

./nextflow  nextflow_template.nf

Once nextflow run is completed, you can see the output files in the work directory generated in the same directory

Run HTSeqQC as GUI using CyVerse

Check detailed documentation here

How to cite:

  • Bedre R, Rajasekaran K, Mangu VR, Timm LE, Bhatnagar D, Baisakh N. Genome-wide transcriptome analysis of cotton (Gossypium hirsutum L.) identifies candidate gene signatures in response to aflatoxin producing fungus Aspergillus flavus. PLoS One. 2015;10(9).
  • Bedre R, Irigoyen S, Schaker PD, Monteiro-Vitorello CB, Da Silva JA, Mandadi KK. Genome-wide alternative splicing landscapes modulated by biotrophic sugarcane smut pathogen. Scientific reports. 2019 Jun 20;9(1):1-2.
  • Renesh Bedre. HTSeqQC: Quality control analysis for high-throughput sequencing data (HTS). https://reneshbedre.github.io/blog/htseqqc.html

Last updated: April 8, 2020

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.