Bioinformatics file readers and processing (FASTA, FASTQ, and VCF)

Renesh Bedre        1 minute read

Bioinformatics file readers

  • We will discuss how to read and process common bioinformatics files
  • We will use bioinfokit v0.8.4 or later
  • Check bioinfokit documentation for installation and documentation

FASTA reader

# you can use interactive python interpreter, jupyter notebook, google colab, spyder or python code
# I am using interactive python interpreter (Python 3.8.2)
from bioinfokit.analys import fasta
fasta_iter = fasta.fasta_reader(file='fasta_file')
# read fasta file
for record in fasta_iter:
    header, sequence = record
    print(header, sequence)
    # get sequence length
    sequence_len = len(sequence)
    # count A bases
    a_base = sequence.count('A')
    print(a_base)
    # reverse complementary
    rev_com_seq = fasta.rev_com(seq=sequence)
    print(rev_com_seq)

FASTQ reader

# you can use interactive python interpreter, jupyter notebook, google colab, spyder or python code
# I am using interactive python interpreter (Python 3.8.2)
from bioinfokit.analys import fastq
fastq_iter = fastq.fastq_reader(file='fastq_file')
# read fastq file
for record in fastq_iter:
    # get sequence headers, sequence, and quality values
    header_1, sequence, header_2, qual = record
    # get sequence length
    sequence_len = len(sequence)
    # count A bases
    a_base = sequence.count('A')
    print(sequence, qual, a_base, sequence_len)

VCF (variant call format) reader

# you can use interactive python interpreter, jupyter notebook, google colab, spyder or python code
# I am using interactive python interpreter (Python 3.8.2)
from bioinfokit.analys import marker
# id is for column name with chromosome identifier in VCF file
vcf_iter = marker.vcfreader(file='vcf_file', id='#CHROM')
# read vcf file
for record in vcf_iter:
    # get info lines, header, and variant records
    headers, info_lines, marker_lines = record
    # get variant records with specified chromosome name
    # change chr name as per need
    if marker_lines[0] == 'chr':
        print(marker_lines)
    # retrieve markers within specified positions
    # change chr, pos1, and pos2 variable as per need
    if marker_lines[0]=='SL4.0ch00' and int(marker_lines[1]>=764654) and int(marker_lines[1])<=1038399:
        print(marker_lines)

How to cite?
Renesh Bedre.(2020, July 29). reneshbedre/bioinfokit: Bioinformatics data analysis and visualization toolkit (Version v0.9). Zenodo. http://doi.org/10.5281/zenodo.3965241

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

Last updated: June 17, 2020

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.