bioinfokit documentation

Renesh Bedre

What is bioinfokit?

The bioinfokit toolkit aims to provide easy-to-use functions to analyze, visualize, and interpret biological data generated from genome-scale omics experiments.

How to install?

bioinfokit is developed in Python 3 and tested with Python versions >= 3.6

bioinfokit requires

  • NumPy
  • scikit-learn
  • seaborn
  • pandas
  • matplotlib
  • SciPy
  • matplotlib_venn
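
Before installing, it can be handy to see which of the packages listed above are already available. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and is not part of bioinfokit. Note that scikit-learn is imported as `sklearn`.

```python
from importlib.util import find_spec

# Import names of the packages listed above (scikit-learn imports as "sklearn").
REQUIRED = ["numpy", "sklearn", "seaborn", "pandas", "matplotlib", "scipy", "matplotlib_venn"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


# Hypothetical helper, not a bioinfokit function.
print("Missing packages:", missing_packages() or "none")
```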

bioinfokit can be installed using pip, easy_install, conda, or git.


Install using pip for Python 3 (easiest way)

# install
pip install bioinfokit

# upgrade to latest version
pip install bioinfokit --upgrade

# uninstall 
pip uninstall bioinfokit

Install using easy_install for Python 3

# install latest version
easy_install bioinfokit

# specific version
easy_install bioinfokit==0.3

# uninstall 
pip uninstall bioinfokit

Install using conda

conda install -c bioconda bioinfokit

Install using git

# download and install bioinfokit (Tested on Linux, Mac, Windows) 
git clone https://github.com/reneshbedre/bioinfokit.git
cd bioinfokit
python setup.py install

Check the version of bioinfokit

>>> import bioinfokit
>>> bioinfokit.__version__
'0.9.6'

How to use bioinfokit?

Gene expression analysis

Volcano plot

latest update v0.8.8

bioinfokit.visuz.gene_exp.volcano(df, lfc, pv, lfc_thr, pv_thr, color, valpha, geneid, genenames, gfont, dim, r, ar, dotsize, markerdot, sign_line, gstyle, show, figtype, axtickfontsize, axtickfontname, axlabelfontsize, axlabelfontname, axxlabel, axylabel, xlm, ylm, plotlegend, legendpos, figname, legendanchor, legendlabels)

Parameters Description
df Pandas dataframe table having at least gene IDs, log fold change, P-values or adjusted P-values columns
lfc Name of a column having log or absolute fold change values [string][default:logFC]
pv Name of a column having P-values or adjusted P-values [string][default:p_values]
lfc_thr Log or absolute fold change cutoff for up and downregulated genes [float][default:1.0]
pv_thr P-values or adjusted P-values cutoff for up and downregulated genes [float][default:0.05]
color Tuple of three colors [tuple or list][default: color=(“green”, “grey”, “red”)]
valpha Transparency of points on volcano plot [float (between 0 and 1)][default: 1.0]
geneid Name of a column having gene Ids. This is necessary for plotting gene label on the points [string][default: None]
genenames Tuple of gene Ids to label the points. The gene Ids must be present in the geneid column. If this option is set to “deg”, it will label all genes defined by lfc_thr and pv_thr [string, tuple, dict][default: None]
gfont Font size for genenames [float][default: 10.0]. gfont not compatible with gstyle=2.
dim Figure size [tuple of two floats (width, height) in inches][default: (5, 5)]
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
ar Rotation of X and Y-axis ticks labels [float][default: 90]
dotsize The size of the dots in the plot [float][default: 8]
markerdot Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: “o”]
sign_line Show grid lines on plot with defined log fold change (lfc_thr) and P-value (pv_thr) threshold value [True or False][default:False]
gstyle Style of the text for genenames. 1 for default text and 2 for box text [int][default: 1]
show Show the figure on console instead of saving in current folder [True or False][default:False]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: ‘Arial’]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: ‘Arial’]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
xlm Range of ticks to plot on X-axis [float (left, right, interval)][default: None]
ylm Range of ticks to plot on Y-axis [float (bottom, top, interval)][default: None]
plotlegend plot legend on volcano plot [True or False][default:False]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:”best”]
figname name of figure [string ][default:”volcano”]
legendanchor position of the legend outside of the plot. For more options see bbox_to_anchor parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [list][default:None]
legendlabels legend label names. If you provide custom label names keep the same order of label names as default [list][default:[‘significant up’, ‘not significant’, ‘significant down’]]
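
To make the roles of lfc_thr and pv_thr concrete, here is a minimal sketch of how a gene could be assigned to one of the three default legend labels, and where it would sit on the plot. The exact comparison rule (strict or non-strict inequalities) is an assumption, and `classify_gene` and `volcano_point` are hypothetical helpers, not bioinfokit functions.

```python
import math


def classify_gene(lfc, pval, lfc_thr=1.0, pv_thr=0.05):
    """Label one gene using the fold change and P-value cutoffs (assumed rule)."""
    if pval < pv_thr and lfc >= lfc_thr:
        return "significant up"
    if pval < pv_thr and lfc <= -lfc_thr:
        return "significant down"
    return "not significant"


def volcano_point(lfc, pval):
    """Return the (x, y) position of a gene: x = log fold change, y = -log10(P)."""
    return lfc, -math.log10(pval)


# Made-up values for illustration: gene -> (log fold change, P-value)
genes = {"geneA": (2.3, 0.001), "geneB": (-1.8, 0.01), "geneC": (0.4, 0.2)}
for name, (lfc, pval) in genes.items():
    print(name, classify_gene(lfc, pval), volcano_point(lfc, pval))
```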

Returns:

Volcano plot image in same directory (volcano.png)

Working example

Inverted Volcano plot

latest update v0.8.8

bioinfokit.visuz.gene_exp.involcano(table, lfc, pv, lfc_thr, pv_thr, color, valpha, geneid, genenames, gfont, gstyle, dotsize, markerdot, r, dim, show, figtype, axxlabel, axylabel, axlabelfontsize, axtickfontsize, axtickfontname, plotlegend, legendpos, legendanchor, figname, legendlabels, ar)

Parameters Description
table Pandas dataframe table having at least gene IDs, log fold change, P-values or adjusted P-values
lfc Name of a column having log fold change values [default:logFC]
pv Name of a column having P-values or adjusted P-values [default:p_values]
lfc_thr Log fold change cutoff for up and downregulated genes [default:1]
pv_thr P-values or adjusted P-values cutoff for up and downregulated genes [default:0.05]
color Tuple of three colors [tuple or list][default: color=(“green”, “grey”, “red”)]
valpha Transparency of points on volcano plot [float (between 0 and 1)][default: 1.0]
geneid Name of a column having gene Ids. This is necessary for plotting gene label on the points [string][default: None]
genenames Tuple of gene Ids to label the points. The gene Ids must be present in the geneid column. If this option is set to “deg”, it will label all genes defined by lfc_thr and pv_thr [string, tuple, dict][default: None]
gfont Font size for genenames [float][default: 10.0]
gstyle Style of the text for genenames. 1 for default text and 2 for box text [int][default: 1]
dotsize The size of the dots in the plot [float][default: 8]
markerdot Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: “o”]
dim Figure size [tuple of two floats (width, height) in inches][default: (5, 5)]
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
show Show the figure on console instead of saving in current folder [True or False][default:False]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
axlabelfontsize Font size for axis labels [float][default: 9]
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: ‘Arial’]
plotlegend plot legend on inverted volcano plot [True or False][default:False]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:”best”]
legendanchor position of the legend outside of the plot. For more options see bbox_to_anchor parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [list][default:None]
figname name of figure [string ][default:”involcano”]
legendlabels legend label names. If you provide custom label names keep the same order of label names as default [list][default:[‘significant up’, ‘not significant’, ‘significant down’]]
ar Rotation of X and Y-axis ticks labels [float][default: 90]

Returns:

Inverted volcano plot image in same directory (involcano.png)

Working example

MA plot

latest update v0.8.8

bioinfokit.visuz.gene_exp.ma(df, lfc, ct_count, st_count, lfc_thr, color, dim, dotsize, show, r, valpha, figtype, axxlabel, axylabel, axlabelfontsize, axtickfontsize, axtickfontname, xlm, ylm, fclines, fclinescolor, legendpos, legendanchor, figname, legendlabels, plotlegend, ar)

Parameters Description
df Pandas dataframe table having at least gene IDs, log fold change, and normalized counts (control and treatment) columns
lfc Name of a column having log fold change values [default:logFC]
ct_count Name of a column having count values for control sample [default:value1]
st_count Name of a column having count values for treatment sample [default:value2]
lfc_thr Log fold change cutoff for up and downregulated genes [default:1]
color Tuple of three colors [tuple or list][default: (“green”, “grey”, “red”)]
dotsize The size of the dots in the plot [float][default: 8]
markerdot Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: “o”]
valpha Transparency of points on plot [float (between 0 and 1)][default: 1.0]
dim Figure size [tuple of two floats (width, height) in inches][default: (5, 5)]
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
show Show the figure on console instead of saving in current folder [True or False][default:False]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
axlabelfontsize Font size for axis labels [float][default: 9]
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: ‘Arial’]
xlm Range of ticks to plot on X-axis [float (left, right, interval)][default: None]
ylm Range of ticks to plot on Y-axis [float (bottom, top, interval)][default: None]
fclines draw log fold change threshold lines as defined by lfc_thr [True or False][default:False]
fclinescolor color of fclines [string][default: ‘#2660a4’]
plotlegend plot legend on MA plot [True or False][default:False]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:”best”]
legendanchor position of the legend outside of the plot. For more options see bbox_to_anchor parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [list][default:None]
figname name of figure [string ][default:”ma”]
legendlabels legend label names. If you provide custom label names keep the same order of label names as default [list][default:[‘significant up’, ‘not significant’, ‘significant down’]]
ar Rotation of X and Y-axis ticks labels [float][default: 90]

Returns:

MA plot image in same directory (ma.png)
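
For orientation, an MA plot conventionally places the mean log2 expression (A) on the X-axis and the log2 fold change (M) on the Y-axis. The sketch below computes both from one pair of control and treatment counts using that textbook definition; whether bioinfokit transforms the ct_count and st_count columns in exactly this way is an assumption, and `ma_values` is a hypothetical helper.

```python
import math


def ma_values(ct_count, st_count):
    """Return (A, M) for one gene from control and treatment normalized counts.

    A = mean of the log2 counts, M = log2(treatment / control).
    Assumes both counts are positive.
    """
    log_ct = math.log2(ct_count)
    log_st = math.log2(st_count)
    return (log_ct + log_st) / 2, log_st - log_ct


# Control count 4 and treatment count 16 give A = 3.0 and M = 2.0
print(ma_values(4, 16))
```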

Working example

Heatmap

latest update v0.8.4

bioinfokit.visuz.gene_exp.hmap(table, cmap='seismic', scale=True, dim=(6, 8), rowclus=True, colclus=True, zscore=None, xlabel=True, ylabel=True, tickfont=(12, 12), show, r, figtype, figname)

Parameters Description
file CSV delimited data file. It should not have NA or missing values
cmap Color Palette for heatmap [string][default: ‘seismic’]
scale Draw a color key with heatmap [boolean (True or False)][default: True]
dim heatmap figure size [tuple of two floats (width, height) in inches][default: (6, 8)]
rowclus Draw hierarchical clustering for rows [boolean (True or False)][default: True]
colclus Draw hierarchical clustering for columns [boolean (True or False)][default: True]
zscore Z-score standardization of row (0) or column (1). It works when clus is True. [None, 0, 1][default: None]
xlabel Plot X-label [boolean (True or False)][default: True]
ylabel Plot Y-label [boolean (True or False)][default: True]
tickfont Fontsize for X and Y-axis tick labels [tuple of two floats][default: (14, 14)]
show Show the figure on console instead of saving in current folder [True or False][default:False]
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
figname name of figure [string ][default:”heatmap”]

Returns:

heatmap plot (heatmap.png, heatmap_clus.png)
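
The zscore option standardizes values before plotting. As a rough illustration of row-wise standardization (zscore=0), the sketch below rescales each row to mean 0 and standard deviation 1. It uses the population standard deviation, which is an assumption about the convention; `zscore_rows` is a hypothetical helper and rows with zero variance are not handled.

```python
from statistics import mean, pstdev


def zscore_rows(matrix):
    """Standardize each row to mean 0 and standard deviation 1 (population SD)."""
    scaled = []
    for row in matrix:
        mu, sd = mean(row), pstdev(row)
        scaled.append([(value - mu) / sd for value in row])
    return scaled


# Made-up expression values: two genes (rows) across three samples (columns)
print(zscore_rows([[1, 2, 3], [10, 20, 60]]))
```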

Working example

Clustering analysis

Scree plot

latest update v0.9.8

bioinfokit.visuz.cluster.screeplot(obj, axlabelfontsize, axlabelfontname, axxlabel, axylabel, figtype, r, show, dim)

Parameters Description
obj list of component name and component variance
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: ‘Arial’]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
r Figure resolution in dpi [int][default: 300]
show Show the figure on console instead of saving in current folder [True or False][default:False]
dim Figure size [tuple of two floats (width, height) in inches][default: (6, 4)]

Returns:

Scree plot image (screeplot.png will be saved in same directory)

Working Example

Principal component analysis (PCA) loadings plots

latest update v0.9.8

bioinfokit.visuz.cluster.pcaplot(x, y, z, labels, var1, var2, var3, axlabelfontsize, axlabelfontname, figtype, r, show, plotlabels, dim)

Parameters Description
x loadings (correlation coefficient) for principal component 1 (PC1)
y loadings (correlation coefficient) for principal component 2 (PC2)
z loadings (correlation coefficient) for principal component 3 (PC3)
labels original variables labels from dataframe used for PCA
var1 Proportion of PC1 variance [float (0 to 1)]
var2 Proportion of PC2 variance [float (0 to 1)]
var3 Proportion of PC3 variance [float (0 to 1)]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: ‘Arial’]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
r Figure resolution in dpi [int][default: 300]
show Show the figure on console instead of saving in current folder [True or False][default:False]
plotlabels Plot labels as defined by labels parameter [True or False][default:True]
dim Figure size [tuple of two floats (width, height) in inches][default: (6, 4)]

Returns:

PCA loadings plot 2D and 3D image (pcaplot_2d.png and pcaplot_3d.png will be saved in same directory)

Working Example

Principal component analysis (PCA) biplots

latest update v0.9.8

bioinfokit.visuz.cluster.biplot(cscore, loadings, labels, var1, var2, var3, axlabelfontsize, axlabelfontname, figtype, r, show, markerdot, dotsize, valphadot, colordot, arrowcolor, valphaarrow, arrowlinestyle, arrowlinewidth, centerlines, colorlist, legendpos, datapoints, dim)

Parameters Description
cscore principal component scores (obtained from PCA().fit_transform() function in sklearn.decomposition)
loadings loadings (correlation coefficient) for principal components
labels original variables labels from dataframe used for PCA
var1 Proportion of PC1 variance [float (0 to 1)]
var2 Proportion of PC2 variance [float (0 to 1)]
var3 Proportion of PC3 variance [float (0 to 1)]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: ‘Arial’]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
r Figure resolution in dpi [int][default: 300]
show Show the figure on console instead of saving in current folder [True or False][default:False]
markerdot Shape of the dot on plot. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: “o”]
dotsize The size of the dots in the plot [float][default: 6]
valphadot Transparency of dots on plot [float (between 0 and 1)][default: 1]
colordot Color of dots on plot [string or list ][default:”#4a4e4d”]
arrowcolor Color of the arrow [string ][default:”#fe8a71”]
valphaarrow Transparency of the arrow [float (between 0 and 1)][default: 1]
arrowlinestyle line style of the arrow. check more styles at https://matplotlib.org/3.1.0/gallery/lines_bars_and_markers/linestyles.html [string][default: ‘-‘]
arrowlinewidth line width of the arrow [float][default: 1.0]
centerlines draw center lines at x=0 and y=0 for 2D plot [bool (True or False)][default: True]
colorlist list of the categories to assign the color [list][default:None]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:”best”]
datapoints plot data points on graph [bool (True or False)][default: True]
dim Figure size [tuple of two floats (width, height) in inches][default: (6, 4)]

Returns:

PCA biplot 2D and 3D image (biplot_2d.png and biplot_3d.png will be saved in same directory)

Working Example

t-SNE plot

latest update v0.8.5

bioinfokit.visuz.cluster.tsneplot(score, colorlist, axlabelfontsize, axlabelfontname, figtype, r, show, markerdot, dotsize, valphadot, colordot, dim, figname, legendpos, legendanchor)

Parameters Description
score t-SNE component embeddings (obtained from TSNE().fit_transform() function in sklearn.manifold)
colorlist list of the categories to assign the color [list][default:None]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: ‘Arial’]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
r Figure resolution in dpi [int][default: 300]
show Show the figure on console instead of saving in current folder [True or False][default:False]
markerdot Shape of the dot on plot. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: “o”]
dotsize The size of the dots in the plot [float][default: 6]
valphadot Transparency of dots on plot [float (between 0 and 1)][default: 1]
colordot Color of dots on plot [string or list ][default:”#4a4e4d”]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:”best”]
legendanchor position of the legend outside of the plot. For more options see bbox_to_anchor parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [list][default:None]
dim Figure size [tuple of two floats (width, height) in inches][default: (6, 4)]
figname name of figure [string ][default:”tsne_2d”]

Returns:

t-SNE 2D image (tsne_2d.png will be saved in same directory)

Working Example

Normalization

RPM or CPM normalization

latest update v0.8.9

Normalize raw gene expression counts into Reads per million mapped reads (RPM) or Counts per million mapped reads (CPM)

bioinfokit.analys.norm.cpm(df)

Parameters Description
df Pandas dataframe containing raw gene expression values. Genes with missing expression values (NA) will be dropped.

Returns:

RPM or CPM normalized Pandas dataframe as class attributes (cpm_norm)
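
The arithmetic behind CPM is simple: each raw count is divided by the total count of its sample and multiplied by one million. The following is a plain-Python sketch of that formula, not the library's implementation; the function name and the dict-of-lists input are assumptions made for illustration.

```python
def cpm_sketch(counts):
    """counts: dict mapping sample name -> list of raw counts (one per gene)."""
    normalized = {}
    for sample, values in counts.items():
        total = sum(values)  # library size of this sample
        normalized[sample] = [value / total * 1_000_000 for value in values]
    return normalized


# Made-up counts for three genes in one sample
print(cpm_sketch({"sample1": [10, 30, 60]}))
```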

Working Example

RPKM or FPKM normalization

latest update v0.9

Normalize raw gene expression counts into Reads per kilobase per million mapped reads (RPKM) or Fragments per kilobase per million mapped reads (FPKM)

bioinfokit.analys.norm.rpkm(df, gl)

Parameters Description
df Pandas dataframe containing raw gene expression values. Genes with missing expression or gene length values (NA) will be dropped.
gl Name of a column having gene length in bp [string][default: None]

Returns:

RPKM or FPKM normalized Pandas dataframe as class attributes (rpkm_norm)
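
RPKM additionally corrects for gene length: count × 10^9 / (total count of the sample × gene length in bp). Below is a minimal sketch of that standard formula for a single sample; the function name and list inputs are hypothetical and this is not the library's code.

```python
def rpkm_sketch(counts, lengths_bp):
    """counts and lengths_bp: parallel lists (one entry per gene) for one sample."""
    total = sum(counts)  # library size of the sample
    return [count * 1e9 / (total * length) for count, length in zip(counts, lengths_bp)]


# Made-up counts and gene lengths (bp) for three genes
print(rpkm_sketch([10, 30, 60], [1000, 2000, 3000]))
```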

Working Example

TPM normalization

latest update v0.9.1

Normalize raw gene expression counts into Transcripts per million (TPM)

bioinfokit.analys.norm.tpm(df, gl)

Parameters Description
df Pandas dataframe containing raw gene expression values. Genes with missing expression or gene length values (NA) will be dropped.
gl Name of a column having gene length in bp [string][default: None]

Returns:

TPM normalized Pandas dataframe as class attributes (tpm_norm)
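
TPM reverses the order of the two corrections: counts are first divided by gene length (in kilobases), and the resulting rates are then scaled so that they sum to one million within the sample. A minimal sketch of that standard definition follows; the function name and inputs are hypothetical and this is not the library's code.

```python
def tpm_sketch(counts, lengths_bp):
    """counts and lengths_bp: parallel lists (one entry per gene) for one sample."""
    # Reads per kilobase for each gene
    rates = [count / (length / 1000) for count, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [rate / scale * 1_000_000 for rate in rates]


# Made-up counts and gene lengths (bp) for three genes
values = tpm_sketch([10, 30, 60], [1000, 2000, 3000])
print(values, sum(values))
```

Because of the final scaling step, TPM values always sum to one million per sample, which is not true of RPKM.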

Working Example

Variant analysis

Manhattan plot

latest update v0.9.2

bioinfokit.visuz.marker.mhat(df, chr, pv, color, dim, r, ar, gwas_sign_line, gwasp, dotsize, markeridcol, markernames, gfont, valpha, show, figtype, axxlabel, axylabel, axlabelfontsize, ylm, gstyle, figname)

Parameters Description
df Pandas dataframe object with at least SNP, chromosome, and P-values columns
chr Name of a column having chromosome numbers [string][default:None]
pv Name of a column having P-values. Must be numeric column [string][default:None]
color List of color names to be plotted. It accepts either two alternating colors or as many colors as there are chromosomes. If nothing (None) is provided, a random color is assigned to each chromosome [list][default:None]
gwas_sign_line Plot statistical significant threshold line defined by option gwasp [bool (True or False)][default: False]
gwasp Statistical significant threshold to identify significant SNPs [float][default: 5E-08]
dotsize The size of the dots in the plot [float][default: 8]
markeridcol Name of a column having SNPs. This is necessary for plotting SNP names on the plot [string][default: None]
markernames The list of SNPs to display on the plot. These SNPs should be present in the SNP column. It also accepts a dict of SNPs and their associated gene names. If this option is set to True, it will label all SNPs whose P-value is significant at the threshold defined by gwasp [string, list, tuple, dict][default: True]
gfont Font size for SNP names to display on the plot [float][default: 8]. gfont not compatible with gstyle=2.
valpha Transparency of points on plot [float (between 0 and 1)][default: 1.0]
dim Figure size [tuple of two floats (width, height) in inches][default: (6, 4)]
r Figure resolution in dpi [int][default: 300]
ar Rotation of X-axis labels [float][default: 90]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
show Show the figure on console instead of saving in current folder [True or False][default:False]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
axlabelfontsize Font size for axis labels [float][default: 9]
ylm Range of ticks to plot on Y-axis [float tuple (bottom, top, interval)][default: None]
gstyle Style of the text for markernames. 1 for default text and 2 for box text [int][default: 1]
figname name of figure [string][default:”manhatten”]

Returns:

Manhattan plot image in same directory (manhatten.png)
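
A Manhattan plot shows -log10(P) for each SNP, so the gwasp default of 5E-08 corresponds to a horizontal line at about 7.3. The sketch below picks out SNPs below the threshold and converts their P-values to plot heights. The strict "less than" comparison is an assumption, and `significant_snps` is a hypothetical helper rather than a bioinfokit function.

```python
import math


def significant_snps(pvalues, gwasp=5e-08):
    """pvalues: dict mapping SNP id -> P-value. Returns {snp: -log10(P)} for P < gwasp."""
    return {snp: -math.log10(p) for snp, p in pvalues.items() if p < gwasp}


# Height of the significance line on the Y-axis for the default threshold
threshold_line = -math.log10(5e-08)

# Made-up SNP ids and P-values
print(significant_snps({"rs1": 1e-09, "rs2": 0.01}), threshold_line)
```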

Working example

Variant annotation

latest update v0.9.3

Assign genetic features and function to the variants in VCF file

bioinfokit.analys.marker.vcf_anot(file, id, gff_file, anot_attr)

Parameters Description
file VCF file
id chromosome id column in VCF file [string][default=’#CHROM’]
gff_file GFF3 genome annotation file
anot_attr Gene function tag in attributes field of GFF3 file

Returns:

Tab-delimited text file with annotation (annotated text file will be saved in same directory)

Working Example

Concatenate VCF files

latest update v0.9.4

Concatenate multiple VCF files into single VCF file (for example, VCF files for each chromosome)

bioinfokit.analys.marker.concatvcf(file)

Parameters Description
file Multiple vcf files separated by comma

Returns:

Concatenated VCF file (concat_vcf.vcf)
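
Conceptually, concatenating per-chromosome VCF files means keeping the header lines (those starting with "#") once and appending the records of every file in order. The sketch below does this on in-memory strings; it is an illustration under that assumption, not the library's implementation, and `concat_vcf_text` is a hypothetical name.

```python
def concat_vcf_text(vcf_texts):
    """Join VCF contents, keeping header lines ('#') from the first file only."""
    out = []
    for index, text in enumerate(vcf_texts):
        for line in text.splitlines():
            if line.startswith("#") and index > 0:
                continue  # skip repeated headers from later files
            if line:
                out.append(line)
    return "\n".join(out) + "\n"


# Two tiny made-up VCF fragments
chr1 = "##fileformat=VCFv4.2\n#CHROM\tPOS\nchr1\t100\n"
chr2 = "##fileformat=VCFv4.2\n#CHROM\tPOS\nchr2\t200\n"
print(concat_vcf_text([chr1, chr2]))
```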

Working example

Split VCF file

bioinfokit.analys.marker.splitvcf(file, id)

Split a single VCF file containing variants for all chromosomes into individual files, one per chromosome

Parameters Description
file VCF file to split
id chromosome id column in VCF file [string][default=’#CHROM’]

Returns:

VCF files for each chromosome
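
The idea of splitting can be sketched as grouping records by the value in the chromosome column and repeating the header in every group. The example works on an in-memory string and assumes the chromosome is the first tab-separated field; `split_vcf_text` is a hypothetical helper, not the library's code.

```python
from collections import defaultdict


def split_vcf_text(text):
    """Group VCF records by the first column; header lines are copied to each group."""
    header, records = [], defaultdict(list)
    for line in text.splitlines():
        if line.startswith("#"):
            header.append(line)
        elif line:
            records[line.split("\t", 1)[0]].append(line)
    return {chrom: "\n".join(header + lines) + "\n" for chrom, lines in records.items()}


# Tiny made-up VCF fragment with two chromosomes
vcf = "##fileformat=VCFv4.2\n#CHROM\tPOS\nchr1\t100\nchr2\t200\nchr1\t300\n"
for chrom, content in split_vcf_text(vcf).items():
    print(chrom)
    print(content)
```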

Working example

High-throughput sequence analysis

FASTQ batch downloads from SRA database

latest update v0.9.7

bioinfokit.analys.fastq.sra_bd(file, t, other_opts)

FASTQ files will be downloaded using fasterq-dump. Make sure the latest version of the NCBI SRA toolkit (version 2.10.8) is installed and its binaries are added to the system path.

Parameters Description
file List of SRA accessions for batch download. All accession must be separated by a newline in the file.
t Number of threads for parallel run [int][default=4]
other_opts Provide other relevant options for fasterq-dump [str][default=None]
Provide the options as a space-separated string. You can get a detailed option for fasterq-dump using the -help option.

Returns:

FASTQ files for each SRA accession in the current directory unless specified by other_opts

Description and working example

FASTQ quality format detection

bioinfokit.analys.format.fq_qual_var(file)

Parameters Description
file FASTQ file to detect quality format [default: None]

Returns:

Quality format encoding name for FASTQ file (Supports only Sanger, Illumina 1.8+ and Illumina 1.3/1.4)
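
Quality encodings can often be told apart by the range of ASCII codes in the quality lines: Sanger and Illumina 1.8+ use an offset of 33, while Illumina 1.3/1.4 use an offset of 64. The following heuristic sketch guesses the offset from the lowest character seen. The cutoffs and the function name are assumptions for illustration and may differ from what fq_qual_var does internally.

```python
def guess_phred_offset(quality_strings):
    """Guess the Phred offset from the ASCII range of the quality characters."""
    lowest = min(ord(ch) for qual in quality_strings for ch in qual)
    if lowest < 59:
        return 33  # codes this low only occur with offset 33 (Sanger, Illumina 1.8+)
    if lowest >= 64:
        return 64  # consistent with offset 64 (Illumina 1.3/1.4); a heuristic guess
    return None  # ambiguous range


print(guess_phred_offset(["IIII!!"]))    # '!' has ASCII code 33
print(guess_phred_offset(["hhhhdddd"]))  # all codes are 100 or higher
```

A guess of 64 is only reliable when enough reads have been inspected, because high-quality offset-33 data can also lack low codes.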

Working Example

Sequencing coverage

latest update v0.9.7

bioinfokit.analys.fastq.seqcov(file, gs)

Parameters Description
file FASTQ file
gs Genome size in Mbp

Returns:

Sequencing coverage of the given FASTQ file
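
Sequencing coverage is usually estimated as the total number of sequenced bases divided by the genome size. A minimal sketch of that calculation, with gs given in Mbp as in the parameter table, is shown below; the function name and the list of read lengths are assumptions, not the library's interface.

```python
def sequencing_coverage(read_lengths, gs):
    """read_lengths: iterable of read lengths in bp; gs: genome size in Mbp."""
    total_bases = sum(read_lengths)
    return total_bases / (gs * 1_000_000)


# One million reads of 150 bp on a 5 Mbp genome
print(sequencing_coverage([150] * 1_000_000, 5))
```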

Description and Working example

Reverse complement of DNA sequence

latest update v0.9.8

bioinfokit.analys.fasta.rev_com(seq, file)

Parameters Description
seq DNA sequence to perform reverse complement
file DNA sequence in a fasta file

Returns:

Reverse complement of original DNA sequence
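
For reference, the operation itself can be sketched in a few lines of plain Python: complement each base, then reverse the string. This is an independent illustration (handling A, C, G, T, and N in both cases), not the library's implementation.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base and reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```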

Working example

File format conversions

bioinfokit.analys.format

Function Parameters Description
bioinfokit.analys.format.fqtofa(file) FASTQ file Convert FASTQ file into FASTA format
bioinfokit.analys.format.hmmtocsv(file) HMM file Convert HMM text output (from HMMER tool) to CSV format
bioinfokit.analys.format.tabtocsv(file) TAB file Convert TAB file to CSV format
bioinfokit.analys.format.csvtotab(file) CSV file Convert CSV file to TAB format

Returns:

Output will be saved in same directory

Working example

GFF3 to GTF file format conversion

latest update v0.9.8

bioinfokit.analys.gff.gff_to_gtf(file, mrna_feature_name)

Parameters Description
file GFF3 genome annotation file
mrna_feature_name Name of the feature (column 3 of GFF3 file) of protein coding mRNA if other than ‘mRNA’ or ‘transcript’

Returns:

GTF format genome annotation file (file.gtf will be saved in same directory)

Working Example

Bioinformatics file readers and processing (FASTA, FASTQ, and VCF)

Function Parameters Description
bioinfokit.analys.fasta.fasta_reader(file) FASTA file FASTA file reader
bioinfokit.analys.fastq.fastq_reader(file) FASTQ file FASTQ file reader
bioinfokit.analys.marker.vcfreader(file) VCF file VCF file reader

Returns:

File generator object (can be iterated only once) that can be parsed for the record
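
The "can be iterated only once" remark follows from how Python generators behave. The sketch below uses a small stand-in FASTA reader to show the effect; `read_fasta` is a hypothetical function and the record format it yields is an assumption, not necessarily what fasta_reader returns.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


records = read_fasta(io.StringIO(">seq1\nATGC\nGGTA\n>seq2\nTTAA\n"))
first_pass = list(records)   # consumes the generator
second_pass = list(records)  # already exhausted, so this is empty
print(first_pass, second_pass)
```

If the records are needed more than once, store them in a list on the first pass or create a new reader.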

Description and working example

Extract subsequence from FASTA files

latest update v0.9.8

bioinfokit.analys.fasta.ext_subseq(file, id, st, end, strand)

Extract the subsequence of a specified region from a FASTA file. If the target subsequence region is on the minus strand, the reverse complement of the subsequence will be printed.

Parameters Description
file FASTA file [file]
id The ID of sequence from FASTA file to extract the subsequence [string]
st Start integer coordinate of subsequence [int]
end End integer coordinate of subsequence [int]
strand Strand of the subsequence [‘plus’ or ‘minus’][default: ‘plus’]

Returns:

Subsequence to stdout
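
The extraction logic can be sketched as a slice followed by an optional reverse complement. The example assumes 1-based, inclusive st and end coordinates, which is a common convention but is not stated above; `extract_subseq` is a hypothetical helper operating on a sequence string rather than a file.

```python
def extract_subseq(seq, st, end, strand="plus"):
    """Return seq[st..end] (assumed 1-based, inclusive); reverse complement if minus."""
    sub = seq[st - 1:end]
    if strand == "minus":
        sub = sub.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]
    return sub


print(extract_subseq("AACCGGTT", 1, 4))           # plus strand
print(extract_subseq("AACCGGTT", 1, 4, "minus"))  # reverse complement of the same region
```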

Biostatistical analysis

Correlation matrix plot

bioinfokit.visuz.stat.corr_mat(table, corm, cmap, r, dim, show, figtype, axtickfontsize, axtickfontname)

Parameters Description
table Dataframe object with numerical variables (columns) to find correlation. Ideally, you should have three or more variables. Dataframe should not have identifier column.
corm Correlation method [pearson,kendall,spearman] [default:pearson]
cmap Color Palette for heatmap [string][default: ‘seismic’]. More colormaps are available at https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
dim Figure size [tuple of two floats (width, height) in inches][default: (6, 5)]
show Show the figure on console instead of saving in current folder [True or False][default:False]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
axtickfontsize Font size for axis ticks [float][default: 7]
axtickfontname Font name for axis ticks [string][default: ‘Arial’]

Returns:

Correlation matrix plot image in same directory (corr_mat.png)

Working example

Bar-dot plot

latest update v0.8.5

bioinfokit.visuz.stat.bardot(df, colorbar, colordot, bw, dim, r, ar, hbsize, errorbar, dotsize, markerdot, valphabar, valphadot, show, figtype, axxlabel, axylabel, axlabelfontsize, axlabelfontname, ylm, axtickfontsize, axtickfontname, yerrlw, yerrcw)

Parameters Description
df Pandas dataframe object
colorbar Color of bar graph [string or list][default:”#bbcfff”]
colordot Color of dots on bar [string or list][default:”#ee8972”]
bw Width of bar [float][default: 0.4]
dim Figure size [tuple of two floats (width, height) in inches][default: (6, 4)]
r Figure resolution in dpi [int][default: 300]
ar Rotation of X-axis labels [float][default: 0]
hbsize Horizontal bar size for standard error bars [float][default: 4]
errorbar Draw standard error bars [bool (True or False)][default: True]
dotsize The size of the dots in the plot [float][default: 6]
markerdot Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: “o”]
valphabar Transparency of bars on plot [float (between 0 and 1)][default: 1]
valphadot Transparency of dots on plot [float (between 0 and 1)][default: 1]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:’png’]
show Show the figure on console instead of saving in current folder [True or False][default:False]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: ‘Arial’]
ylm Range of ticks to plot on Y-axis [float tuple (bottom, top, interval)][default: None]
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: ‘Arial’]
yerrlw Error bar line width [float][default: None]
yerrcw Error bar cap width [float][default: None]

Returns:

Bar-dot plot image in same directory (bardot.png)

Working Example

One sample and two sample (independent and paired) t-tests

latest update v0.9.6

bioinfokit.analys.stat.ttest(df, xfac, res, evar, alpha, test_type, mu)

Parameters Description
df Pandas dataframe for appropriate t-test.
One sample: It should have at least the dependent (res) variable
Two sample independent: It should have independent (xfac) and dependent (res) variables
Two sample paired: It should have two dependent (res) variables
xfac Independent group column name with two levels [string][default: None]
res Dependent variable column name [string or list or tuple][default: None]
evar Perform the t-test assuming equal variances [bool (True or False)][default: True]
alpha Significance level for confidence interval (CI). If alpha=0.05, then 95% CI will be calculated [float][default: 0.05]
test_type Type of t-test [int (1,2,3)][default: None].
1: One sample t-test
2: Two sample independent t-test
3: Two sample paired t-test
mu Population or known mean for the one sample t-test [float][default: None]

Returns:

Summary output as class attribute (summary)
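
The three test types listed above differ only in how the t statistic and degrees of freedom are computed. The following is a minimal standard-library sketch of those calculations, not the bioinfokit implementation; the function names (one_sample_t, two_sample_t, paired_t) and the sample values are hypothetical and chosen for illustration only. It assumes that an unequal-variance test means Welch's t-test.

```python
import math
import statistics


def one_sample_t(values, mu):
    """Return (t, dof) for a one sample t-test against a known mean mu."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return (statistics.mean(values) - mu) / se, n - 1


def two_sample_t(a, b, equal_var=True):
    """Return (t, dof) for an independent two sample t-test.

    equal_var=True uses the pooled variance; equal_var=False uses Welch's
    approximation (an assumption about what an unequal-variance test means).
    """
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    diff = statistics.mean(a) - statistics.mean(b)
    if equal_var:
        pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
        return diff / math.sqrt(pooled * (1 / na + 1 / nb)), na + nb - 2
    se2 = va / na + vb / nb
    dof = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return diff / math.sqrt(se2), dof


def paired_t(a, b):
    """A paired t-test is a one sample t-test on the pairwise differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return one_sample_t(diffs, 0.0)


if __name__ == "__main__":
    before = [5.1, 4.9, 5.6, 5.8, 6.0]  # hypothetical measurements
    after = [4.2, 4.4, 4.9, 5.0, 5.3]
    print("one sample:", one_sample_t(before, mu=5.0))
    print("independent:", two_sample_t(before, after, equal_var=True))
    print("paired:", paired_t(before, after))
```

The sketch stops at the t statistic and degrees of freedom; P-values and confidence intervals additionally require the t distribution.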

Description and Working example

Chi-square test

latest update v0.9.5

bioinfokit.analys.stat.chisq(df, p)

Parameters Description
df Pandas dataframe. It should be a one- or two-dimensional contingency table.
p Theoretical expected probabilities for each group. They must be non-negative and sum to 1. If p is provided, a goodness-of-fit test will be performed [list or tuple][default: None]

Returns:

Summary and expected counts as class attributes (summary and expected_df)
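
To make the two modes concrete, here is a minimal standard-library sketch of the expected counts and the Pearson chi-square statistic for a two-dimensional table (test of independence) and for a one-dimensional table with theoretical probabilities (goodness of fit). It is not the bioinfokit implementation; the function names are hypothetical. It computes the uncorrected statistic, so for 2x2 tables the value can differ from software that applies a continuity correction.

```python
def expected_counts(table):
    """Expected counts under independence for a 2D table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chisq_independence(table):
    """Return (statistic, dof, expected) for a test of independence."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof, expected


def chisq_goodness_of_fit(observed, p):
    """Return (statistic, dof, expected) for a goodness-of-fit test.

    p holds the theoretical probabilities; this sketch requires them to be
    strictly positive so that no expected count is zero.
    """
    if any(pi <= 0 for pi in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be positive and sum to 1")
    total = sum(observed)
    expected = [total * pi for pi in p]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1, expected


if __name__ == "__main__":
    print(chisq_independence([[10, 20], [20, 10]]))  # hypothetical counts
    print(chisq_goodness_of_fit([30, 30, 40], [0.25, 0.25, 0.5]))
```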

Working example

Linear regression analysis

bioinfokit.visuz.stat.lin_reg(df, x, y)

Parameters Description
df Pandas dataframe object
x Name of column having independent X variables [list][default:None]
y Name of column having dependent Y variables [list][default:None]

Returns:

Regression analysis summary
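
For readers who want to see what a regression fit involves, the following is a minimal standard-library sketch of ordinary least squares with a single predictor. It is not the lin_reg implementation, and it does not cover the multiple-predictor case implied by the list-valued x parameter; the function name fit_line and the data are hypothetical. The fitted values it returns are the kind of predicted response that a y_hat column would hold.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared, yhat), where yhat is the list of
    predicted responses for each x.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    yhat = [intercept + slope * xi for xi in x]
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual sum of squares
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)  # total sum of squares
    return slope, intercept, 1 - ss_res / ss_tot, yhat


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    slope, intercept, r2, yhat = fit_line(x, y)
    print(f"slope={slope:.3f} intercept={intercept:.3f} r2={r2:.3f}")
```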

Working Example

Regression plot

bioinfokit.visuz.stat.regplot(df, x, y, yhat, dim, colordot, colorline, r, ar, dotsize, markerdot, linewidth, valphaline, valphadot, show, figtype, axxlabel, axylabel, axlabelfontsize, axlabelfontname, xlm, ylm, axtickfontsize, axtickfontname)

Parameters Description
df Pandas dataframe object
x Name of column having independent X variables [string][default:None]
y Name of column having dependent Y variables [string][default:None]
yhat Name of column having predicted response of Y variable (y_hat) from regression [string][default:None]
dim Figure size [tuple of two floats (width, height) in inches][default: (6, 4)]
r Figure resolution in dpi [int][default: 300]
ar Rotation of X-axis labels [float][default: 0]
dotsize The size of the dots in the plot [float][default: 6]
markerdot Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: “o”]
valphaline Transparency of regression line on plot [float (between 0 and 1)][default: 1]
valphadot Transparency of dots on plot [float (between 0 and 1)][default: 1]
linewidth Width of regression line [float][default: 1]
figtype Format of figure to save. Supported formats are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default: 'png']
show Show the figure on console instead of saving in current folder [True or False][default:False]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: ‘Arial’]
xlm Range of ticks to plot on X-axis [float tuple (bottom, top, interval)][default: None]
ylm Range of ticks to plot on Y-axis [float tuple (bottom, top, interval)][default: None]
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: ‘Arial’]

Returns:

Regression plot image in same directory (reg_plot.png)
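
The xlm and ylm parameters take a (bottom, top, interval) tuple. The sketch below shows one plausible way such a tuple expands into tick positions. It assumes that ticks start at bottom and advance by interval up to, but not including, top; that endpoint handling is an assumption and is not stated in the parameter table. The function name tick_positions is hypothetical.

```python
import math


def tick_positions(lim):
    """Expand (bottom, top, interval) into a list of tick positions.

    Assumption: bottom is included and top is excluded.
    """
    bottom, top, interval = lim
    if interval <= 0:
        raise ValueError("interval must be positive")
    if top <= bottom:
        raise ValueError("top must be greater than bottom")
    count = math.ceil((top - bottom) / interval)
    # Multiply rather than add repeatedly to avoid accumulating float error.
    return [bottom + i * interval for i in range(count)]


if __name__ == "__main__":
    print(tick_positions((0, 10, 2)))
    print(tick_positions((0, 1, 0.25)))
```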

Working Example

How to cite bioinfokit?

  • Renesh Bedre. (2020, July 29). reneshbedre/bioinfokit: Bioinformatics data analysis and visualization toolkit (Version v0.9). Zenodo. http://doi.org/10.5281/zenodo.3965241
  • Additionally, check Zenodo to cite a specific version of bioinfokit

References:

  • Travis E. Oliphant. A guide to NumPy, USA: Trelgol Publishing, (2006).
  • John D. Hunter. Matplotlib: A 2D Graphics Environment, Computing in Science & Engineering, 9, 90-95 (2007), DOI:10.1109/MCSE.2007.55 (publisher link)
  • Fernando Pérez and Brian E. Granger. IPython: A System for Interactive Scientific Computing, Computing in Science & Engineering, 9, 21-29 (2007), DOI:10.1109/MCSE.2007.53 (publisher link)
  • Michael Waskom, Olga Botvinnik, Joel Ostblom, Saulius Lukauskas, Paul Hobson, MaozGelbart, … Constantine Evans. (2020, January 24). mwaskom/seaborn: v0.10.0 (January 2020) (Version v0.10.0). Zenodo. http://doi.org/10.5281/zenodo.3629446
  • Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825-2830 (2011)

bioinfokit cited by:

  • Jennifer Gribble, Andrea J. Pruijssers, Maria L. Agostini, Jordan Anderson-Daniels, James D. Chappell, Xiaotao Lu, Laura J. Stevens, Andrew L. Routh, Mark R. Denison bioRxiv 2020.04.23.057786; doi: https://doi.org/10.1101/2020.04.23.057786
  • Greaney AM, Adams TS, Raredon MS, Gubbins E, Schupp JC, Engler AJ, Ghaedi M, Yuan Y, Kaminski N, Niklason LE. Platform Effects on Regeneration by Pulmonary Basal Cells as Evaluated by Single-Cell RNA Sequencing. Cell Reports. 2020 Mar 24;30(12):4250-65.


If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

Last updated: August 13, 2020

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.