t-SNE in Python

Renesh Bedre        8 minute read

What is t-SNE?

  • t-SNE (t-Distributed Stochastic Neighbor Embedding) is nonlinear dimensionality reduction technique in which interrelated high dimensional data (usually hundreds or thousands of variables) is mapped into low-dimensional data (like 2 or 3 variables) while preserving the significant structure (relationship among the data points in different variables) of original high dimensional data.
  • The resulting reduced 2 or 3-dimensional data represents the structure of high dimensional data and easy to visualize on the scatter plot. The proximal samples will be placed together and dissimilar samples at greater distances. t-SNE is mostly used for the visualization purposes only and not for detailed quantitative analysis.
  • For example, t-SNE is more suitable for single cell RNA-seq (scRNA-seq) as it produces the expression data for various cell classes which encompasses a biologically meaningful hierarchical structure.
  • Unlike PCA, t-SNE can be applied and work better with both linear and nonlinear well-clustered datasets and produces more meaningful clustering

t-SNE in Python

  • To run t-SNE in Python, we will use the digits dataset which is available in the scikit-learn package. I have a also used scRNA-seq data for t-SNE visualization (see below).
  • The digits dataset (representing an image of a digit) has 64 variables (D) and 1797 observations (N) divided into 10 different categories of digits
  • we will use sklearn and bioinfokit (v0.8.5 or later) packages for t-SNE and visualization
  • Check bioinfokit documentation for installation and documentation
# you can use interactive python interpreter, jupyter notebook, spyder or python code
# I am using interactive python interpreter (Python 3.7)
# import pandas dataframe formatted digits dataset for t-SNE analysis 
# (dataset available at scikit-learn)
>>> from bioinfokit.analys import get_data
>>> df = get_data('digits').data
>>> df.head()
   pixel_0_0  pixel_0_1  pixel_0_2  pixel_0_3  pixel_0_4  ...  pixel_7_4  pixel_7_5  pixel_7_6  pixel_7_7  class
0        0.0        0.0        5.0       13.0        9.0  ...       10.0        0.0        0.0        0.0      0
1        0.0        0.0        0.0       12.0       13.0  ...       16.0       10.0        0.0        0.0      1
2        0.0        0.0        0.0        4.0       15.0  ...       11.0       16.0        9.0        0.0      2
3        0.0        0.0        7.0       15.0       13.0  ...       13.0        9.0        0.0        0.0      3
4        0.0        0.0        0.0        1.0       11.0  ...       16.0        4.0        0.0        0.0      4

>>> df.shape
(1797, 65)

# run t-SNE
>>> from sklearn.manifold import TSNE
# perplexity parameter can be changed based on the input datatset
# dataset with larger number of variables requires larger perplexity
# set this value between 5 and 50 (sklearn documentation)
# verbose=1 displays run time messages
# set n_ite sufficiently high to resolve the well stabilized cluster
# get embeddings
>>> tsne_em = TSNE(n_components=2, perplexity=30.0, n_iter=1000, verbose=1).fit_transform(df)
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1797 samples in 0.040s...
[t-SNE] Computed neighbors for 1797 samples in 0.447s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1797
[t-SNE] Computed conditional probabilities for sample 1797 / 1797
[t-SNE] Mean sigma: 8.132731
[t-SNE] KL divergence after 250 iterations with early exaggeration: 61.424686
[t-SNE] KL divergence after 1000 iterations: 0.736327

# plot t-SNE clusters
>>> from bioinfokit.visuz import cluster
>>> cluster.tsneplot(score=tsne_em)
# plot will be saved in same directory (tsne_2d.png) 

Generated t-SNE plot,

Add colors to the cluster,

# get a list of categories
>>> color_class = df['class'].to_numpy()
>>> cluster.tsneplot(score=tsne_score, colorlist=color_class, legendpos='upper right', legendanchor=(1.15, 1) )

Generated t-SNE plot,

Add customized colors to the cluster,

# get a list of categories
>>> color_class = df['class'].to_numpy()
>>> cluster.tsneplot(score=tsne_score, colorlist=color_class, colordot=('#713e5a', '#63a375', '#edc79b', '#d57a66', '#ca6680', '#395B50', '#92AFD7', '#b0413e', '#4381c1', '#736ced'), 
    legendpos='upper right', legendanchor=(1.15, 1) )

Generated t-SNE plot,

t-SNE with single cell RNA-seq dataset

  • I have downloaded the subset of single cell gene expression dataset of Arabidopsis thaliana root cells processed by 10X genomics Cell Ranger pipeline (Ryu et al., 2019)
  • This scRNA-seq dataset contains 4406 cells with ~75K reads per cells
  • I have preprocessed this data (for expression cut-off, sequence depth normalization, log-transformation, and molecular feature selection) using Seurat R package and exported highly variable molecular features for t-SNE visualization.
# import scRNA-seq as pandas dataframe
>>> from bioinfokit.analys import get_data
>>> df = get_data('ath_root').data
>>> df = df.set_index(df.columns[0])
>>> dft = df.T
>>> dft = dft.set_index(dft.columns[0])
>>> dft.head()
gene                AT1G01070  RPP1A  HTR12  AT1G01453  ADF10  PLIM2B  SBTI1.1  GL22  GPAT2  AT1G02570  BXL2  IMPA6  ...  PER72  RAB18  AT5G66440  AT5G66580  AT5G66590  AT5G66800  AT5G66815  AT5G66860  AT5G66985  IRX14H  PER73  RPL26B
AAACCTGAGACAGACC-1       0.51   1.40  -0.26      -0.28  -0.24   -0.14    -0.13 -0.07  -0.29      -0.31 -0.23   0.66  ...  -0.25   0.64       0.61      -0.55      -0.41      -0.43       2.01       3.01      -0.24   -0.18  -0.34    1.16
AAACCTGAGATCCGAG-1      -0.22   1.36  -0.26      -0.28  -0.60   -0.51    -0.13 -0.07  -0.29      -0.31  0.81  -0.31  ...  -0.25   1.25      -0.48      -0.55      -0.41      -0.43      -0.24       0.89      -0.24   -0.18  -0.49   -0.68
AAACCTGAGTGTGAAT-1      -0.22   2.49  -0.26      -0.28  -0.60   -0.51    -0.13 -0.07  -0.29      -0.31 -0.23   0.99  ...  -0.25  -0.52      -0.48      -0.55       2.92      -0.43      -0.24       2.82      -0.24   -0.18  -0.49    1.60
AAACCTGCAAAGAATC-1       2.24   0.82  -0.26      -0.28  -0.60   -0.51    -0.13 -0.07  -0.29      -0.31 -0.23  -0.31  ...  -0.25  -0.52       0.91      -0.55      -0.41      -0.43      -0.24      -0.43      -0.24   -0.18  -0.49    1.95
AAACCTGCAAAGGAAG-1      -0.22  -0.51  -0.26      -0.28  -0.60   -0.51    -0.13 -0.07  -0.29      -0.31 -0.23  -0.31  ...   3.51  -0.52      -0.48       1.85      -0.41      -0.43      -0.24      -0.43       8.85   -0.18  -0.49    0.16

>>> dft.shape
(4406, 2000)

# as we have large number variables, we will first do to PCA to keep minimum number 
# of variables for t-SNE
>>> from sklearn.decomposition import PCA
>>> import pandas as pd
>>> pca_scores = PCA().fit_transform(dft)
# create a dataframe of pca_scores
>>> df_pc = pd.DataFrame(pca_scores)

# perform t-SNE on PCs scores
# we will use first 50 PCs but this can vary
>>> from sklearn.manifold import TSNE
>>> tsne_em = TSNE(n_components=2, perplexity=30.0, early_exaggeration=12, n_iter=1000, learning_rate=368, verbose=1).fit_transform(df_pc.loc[:,0:49])
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 4406 samples in 0.081s...
[t-SNE] Computed neighbors for 4406 samples in 1.451s...
[t-SNE] Computed conditional probabilities for sample 1000 / 4406
[t-SNE] Computed conditional probabilities for sample 2000 / 4406
[t-SNE] Computed conditional probabilities for sample 3000 / 4406
[t-SNE] Computed conditional probabilities for sample 4000 / 4406
[t-SNE] Computed conditional probabilities for sample 4406 / 4406
[t-SNE] Mean sigma: 4.812347
[t-SNE] KL divergence after 250 iterations with early exaggeration: 64.164688
[t-SNE] KL divergence after 1000 iterations: 0.840337
# here you can run TSNE multiple times to keep run with lowest KL divergence

# plot t-SNE clusters
>>> from bioinfokit.visuz import cluster
>>> cluster.tsneplot(score=tsne_em)
# plot will be saved in same directory (tsne_2d.png) 

Generated t-SNE plot,

Now, I will recognize the clusters using the DBSCAN algorithm. This will help to color and visualize clusters of similar data points

>>> from sklearn.cluster import DBSCAN
# here eps parameter is very important and optimizing eps is essential
# for well defined clusters. I have run DBSCAN with several eps values
# and got good clusters with eps=3
>>> get_clusters = DBSCAN(eps=3, min_samples=10).fit_predict(tsne_em)
# check unique clusters
# -1 value represents noisy points could not assigned to any cluster
>>> set(get_clusters)
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -1}
# get t-SNE plot with colors assigned to each cluster
>>> cluster.tsneplot(score=tsne_em, colorlist=get_clusters, 
    colordot=('#713e5a', '#63a375', '#edc79b', '#d57a66', '#ca6680', '#395B50', '#92AFD7', '#b0413e', '#4381c1', '#736ced', '#631a86', '#de541e', '#022b3a', '#000000'), 
    legendpos='upper right', legendanchor=(1.15, 1))

Generated t-SNE plot,

Interpretation

  • The points within the individual clusters are highly similar to each other and in distant to points in other clusters. The same pattern likely holds in a high-dimensional original dataset. In the digits dataset, t-SNE separated clusters of each digit class. In the context of scRNA-seq, these clusters represent the cells types with similar transcriptional profiles.

Recommendations for running t-SNE and hyperparameter optimization

  • t-SNE is a stochastic method and produces slightly different embeddings if run multiple times. These different results could affect the numeric values on the axis but do not affect the clustering of the points. Therefore, t-SNE can be run several times to get the embeddings with the smallest Kullback–Leibler (KL) divergence. The run with the smallest KL could have the greatest variation.
  • If the original high-dimensional dataset contains larger number variables , it is highly recommended first to reduce the variables to small numbers (e.g. 20 to 50) using another dimensionality reduction technique (e.g. PCA) for t-SNE. It will help to speed up the t-SNE computation time and suppresses the noisy data points. Additionally, you can also use other variants of t-SNE such as Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE) for larger datasets (Linderman et al., 2019).
  • In t-SNE, a most important parameter called perplexity, which measures the effective number of neighbors, controls the trade-off between global high-dimensional and local low-dimensional space, and possible to produce a more defined structure of the clusters. The number of variables in the original high dimensional data determines the perplexity parameter (standard range 10-100).
  • While t-SNE is good in visualizing the well-separated clusters, most of the time it fails to preserve the global geometry of the data. Kobak et al., 2019 suggested keeping large perplexity parameter (n/100; where n is the number of cells) for preserving the global geometry.
  • In addition to the perplexity parameter, other parameters such as the number of iterations, learning rate (set n/12 or 200 whichever is greater), and early exaggeration factor can also affect the visualization and should be optimized for larger datasets (Kobak et al., 2019).

References

  • Maaten LV, Hinton G. Visualizing data using t-SNE. Journal of machine learning research. 2008;9(Nov):2579-605.
  • Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature communications. 2019 Nov 28;10(1):1-4.
  • Cieslak MC, Castelfranco AM, Roncalli V, Lenz PH, Hartline DK. t-Distributed Stochastic Neighbor Embedding (t-SNE): A tool for eco-physiological transcriptomic analysis. Marine Genomics. 2019 Nov 26:100723.
  • Rich-Griffin C, Stechemesser A, Finch J, Lucas E, Ott S, Schäfer P. Single-cell transcriptomics: a high-resolution avenue for plant functional genomics. Trends in plant science. 2020 Feb 1;25(2):186-97.
  • Devassy BM, George S. Dimensionality reduction and visualisation of hyperspectral ink data Using t-SNE. Forensic Science International. 2020 Feb 12:110194.
  • Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature methods. 2019 Mar;16(3):243-5.
  • Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature biotechnology. 2018 May;36(5):411-20.
  • Ryu KH, Huang L, Kang HM, Schiefelbein J. Single-cell RNA sequencing resolves molecular relationships among individual plant cells. Plant physiology. 2019 Apr 1;179(4):1444-56.
  • C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their Applications to Handwritten Digit Recognition, MSc Thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University.

How to cite?
Renesh Bedre.(2020, July 29). reneshbedre/bioinfokit: Bioinformatics data analysis and visualization toolkit (Version v0.9). Zenodo. http://doi.org/10.5281/zenodo.3965241

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

Last updated: June 22, 2020

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.