One sample and two sample (independent and paired) t-test in Python

Renesh Bedre

One Sample t-test

  • The one sample t-test compares the mean of a random sample drawn from a population with a specific value (the hypothesized or known mean of the population).
  • For example, balls are manufactured with a nominal diameter of 5 cm, and we want to check whether the average diameter of a random sample (e.g. 50 balls) picked from the production line differs from that known size.

Assumptions

  • Dependent variable should have an approximately normal distribution (can be checked with the Shapiro-Wilk test)
  • Observations are independent of each other
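
The normality assumption can be checked before running the test. The following is a minimal sketch that uses `scipy.stats.shapiro` rather than bioinfokit; the sample is synthetic (an assumption for illustration, not the tutorial dataset).

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 50 ball diameters (cm), generated with a fixed seed.
rng = np.random.default_rng(42)
diameters = rng.normal(loc=5.0, scale=1.0, size=50)

# Shapiro-Wilk test: the null hypothesis is that the data come from a normal distribution.
w_stat, p_value = stats.shapiro(diameters)
print(f"W = {w_stat:.4f}, P = {p_value:.4f}")

# P > 0.05 means no evidence against normality at the 5% level.
looks_normal = p_value > 0.05
```

A non-significant result does not prove normality; it only means the test found no evidence against it.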

Hypotheses

  • Null hypothesis: The mean of the population from which the sample is drawn is equal to the hypothesized or known population mean
  • Alternative hypothesis (two-tailed): The mean is not equal to the hypothesized or known population mean
  • Alternative hypothesis (one-tailed): The mean is either greater than or less than the hypothesized or known population mean
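
To make the hypotheses concrete, the t statistic measures how many standard errors the sample mean lies from the hypothesized mean. The sketch below computes it by hand and compares it with `scipy.stats.ttest_1samp`. It uses only the first five `size` values printed by `df.head()` in the code section below, so the numbers will not match the full 50-row summary.

```python
import numpy as np
from scipy import stats

# First five rows of the 'size' column as printed by df.head(); not the full dataset.
size = np.array([5.739987, 5.254042, 5.152388, 4.870819, 3.536251])
mu = 5.0  # hypothesized population mean

n = size.size
se = size.std(ddof=1) / np.sqrt(n)      # standard error of the mean
t_manual = (size.mean() - mu) / se      # t = (sample mean - mu) / SE
dof = n - 1

p_two_tail = 2 * stats.t.sf(abs(t_manual), dof)  # both tails
p_one_tail = stats.t.sf(abs(t_manual), dof)      # tail in the observed direction

# Cross-check against SciPy's implementation (two-sided by default).
res = stats.ttest_1samp(size, popmean=mu)
print(t_manual, res.statistic)
print(p_two_tail, res.pvalue)
```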

How to perform one sample t-test in Python?

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has, they will be omitted
>>> df = get_data('t_one_samp').data
>>> df.head()
       size
0  5.739987
1  5.254042
2  5.152388
3  4.870819
4  3.536251

>>> res = stat()
>>> res.ttest(df=df, test_type=1, res='size', mu=5)
# output
>>> print(res.summary)

One Sample t-test

------------------  --------
Sample size         50
Mean                 5.05128
t                    0.36789
Df                  49
P-value (one-tail)   0.35727
P-value (two-tail)   0.71454
Lower 95.0%          4.77116
Upper 95.0%          5.3314
------------------  --------
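
The P-values and confidence limits in the summary above can be cross-checked from the reported sample size, mean, and t value alone. This is a sketch using `scipy.stats.t`, not the code bioinfokit runs internally; the standard error is backed out from the reported t value.

```python
from scipy import stats

# Values copied from the summary table above.
n, mean, t_stat, mu = 50, 5.05128, 0.36789, 5.0
dof = n - 1

p_one_tail = stats.t.sf(abs(t_stat), dof)
p_two_tail = 2 * p_one_tail

# Standard error recovered from t = (mean - mu) / SE.
se = (mean - mu) / t_stat

# 95% confidence interval for the mean: mean +/- t_crit * SE.
t_crit = stats.t.ppf(0.975, dof)
ci_lower = mean - t_crit * se
ci_upper = mean + t_crit * se

print(f"P (one-tail) = {p_one_tail:.5f}, P (two-tail) = {p_two_tail:.5f}")
print(f"95% CI = ({ci_lower:.5f}, {ci_upper:.5f})")
```

Because the interval contains the hypothesized value of 5, it agrees with the non-significant two-tailed P-value.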

Interpretation

The P-value obtained from the one sample t-test is not significant (P>0.05), and therefore, we fail to reject the null hypothesis: there is no evidence that the average diameter of the balls differs from 5 cm.

Two sample t-test (unpaired or independent t-test)

  • The two sample independent t-test is used to compare the means of two independent groups
  • For example, we have two different plant genotypes (genotype A and genotype B) and would like to test whether the yield of genotype A is significantly different from that of genotype B

Two sample t-test Hypotheses

  • Null hypothesis: The two group means are equal
  • Alternative hypothesis (two-tailed): The two group means are different
  • Alternative hypothesis (one-tailed): The mean of one group is either greater than or less than the mean of the other group

Two sample t-test Assumptions

  • Observations in the two groups have an approximately normal distribution (can be checked with the Shapiro-Wilk test)
  • Homogeneity of variances (variances are equal between treatment groups) (Levene or Bartlett Test)
  • The two groups are sampled independently from each other from the same population
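
The first two assumptions can be checked with SciPy before choosing between the equal-variance t-test and Welch's t-test. This is a minimal sketch; the yield values are hypothetical and chosen only for illustration.

```python
from scipy import stats

# Hypothetical yields for two genotypes (illustrative values, not the tutorial dataset).
yield_a = [78.0, 84.3, 81.0, 75.2, 79.9, 76.2]
yield_b = [88.0, 92.0, 85.5, 91.3, 87.1, 92.5]

# Normality within each group (null hypothesis: data are normally distributed).
shapiro_p = {
    "A": stats.shapiro(yield_a).pvalue,
    "B": stats.shapiro(yield_b).pvalue,
}

# Homogeneity of variances (null hypothesis: the group variances are equal).
lev_stat, lev_p = stats.levene(yield_a, yield_b)
bart_stat, bart_p = stats.bartlett(yield_a, yield_b)

print(shapiro_p)
print(f"Levene P = {lev_p:.4f}, Bartlett P = {bart_p:.4f}")

# If the variances differ significantly, Welch's t-test is the safer choice.
use_welch = lev_p < 0.05
```

Levene's test is less sensitive to departures from normality than Bartlett's test, which is why it is often preferred when normality is in doubt.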

How to perform Two sample t-test in Python?

  • We will use bioinfokit v0.9.6 or later
  • See the bioinfokit documentation for installation instructions and usage
  • Download the dataset for the two sample and Welch’s t-test

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has, they will be omitted
>>> df = get_data('t_ind_samp').data
>>> df.head()
  Genotype  yield
0        A   78.0
1        A   84.3
2        A   81.0
3        B   88.0
4        B   92.0

>>> res = stat()
# for unequal variance t-test (Welch's t-test) set evar=False
>>> res.ttest(df=df, xfac="Genotype", res="yield", test_type=2)
# output
>>> print(res.summary)

Two sample t-test with equal variance

------------------  -------------
Mean diff           -10.3
t                    -5.40709
Std Error             1.90491
df                   10
P-value (one-tail)    0.000149204
P-value (two-tail)    0.000298408
Lower 95.0%         -14.5444
Upper 95.0%          -6.05561
------------------  -------------

Parameter estimates

Level      Number    Mean    Std Dev    Std Error    Lower 95.0%    Upper 95.0%
-------  --------  ------  ---------  -----------  -------------  -------------
A               6    79.1    3.30817      1.35056        75.6283        82.5717
B               6    89.4    3.29059      1.34338        85.9467        92.8533
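
The test statistic above can be reproduced from the group summaries alone. The sketch below uses `scipy.stats.ttest_ind_from_stats` (not the bioinfokit call) with the means, standard deviations, and group sizes from the parameter estimates table, and also runs the unequal-variance (Welch) version for comparison.

```python
from scipy import stats

# Group summaries copied from the "Parameter estimates" table above.
mean_a, sd_a, n_a = 79.1, 3.30817, 6
mean_b, sd_b, n_b = 89.4, 3.29059, 6

# Student's t-test (equal variances assumed).
student = stats.ttest_ind_from_stats(mean_a, sd_a, n_a, mean_b, sd_b, n_b, equal_var=True)

# Welch's t-test (equal variances not assumed).
welch = stats.ttest_ind_from_stats(mean_a, sd_a, n_a, mean_b, sd_b, n_b, equal_var=False)

print(f"Student: t = {student.statistic:.5f}, P = {student.pvalue:.6g}")
print(f"Welch:   t = {welch.statistic:.5f}, P = {welch.pvalue:.6g}")
```

With equal group sizes the two versions give the same t statistic; only the degrees of freedom, and therefore the P-value, can differ.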

Note: A t-test can be performed when the sample sizes of the two groups are unequal, but for a fixed total number of observations, equal group sizes generally give the t-test the most power.
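
One way to see why balanced groups help: for a fixed total sample size and a common standard deviation, the standard error of the mean difference, SD * sqrt(1/n_A + 1/n_B), is smallest when the groups are equal. The total of 12 plants and the standard deviation of 3.3 below are hypothetical values chosen for illustration.

```python
import math

# Hypothetical setup: 12 plants in total, common standard deviation of 3.3.
total_n, pooled_sd = 12, 3.3

# Standard error of the mean difference for every split with at least 2 plants per group.
se_by_split = {}
for n_a in range(2, total_n - 1):
    n_b = total_n - n_a
    se_by_split[n_a] = pooled_sd * math.sqrt(1 / n_a + 1 / n_b)

for n_a, se in se_by_split.items():
    print(f"n_A = {n_a:2d}, n_B = {total_n - n_a:2d}, SE = {se:.4f}")

# The split with the smallest standard error gives the largest |t| for a given mean difference.
best_n_a = min(se_by_split, key=se_by_split.get)
```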

Interpretation

The P-value obtained from the t-test is significant (P<0.05), and therefore, we conclude that the yield of genotype A is significantly different from that of genotype B.

Paired t-test (dependent t-test)

  • The paired t-test is used to compare the differences between a pair of dependent measurements taken on the same subject
  • For example, we have plant variety A and would like to compare the yield of A before and after the application of some fertilizer
  • Note: The paired t-test is a one sample t-test on the differences between the two dependent variables
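
The equivalence in the note can be verified directly with SciPy. The sketch below uses only the first five BF/AF pairs printed by `df.head()` in the code section below, so its numbers differ from the full 65-pair summary.

```python
import numpy as np
from scipy import stats

# First five rows as printed by df.head(); the full dataset has 65 pairs.
bf = np.array([44.41, 46.29, 45.98, 43.35, 45.75])  # before fertilizer
af = np.array([47.99, 56.64, 48.90, 49.01, 48.41])  # after fertilizer

# Paired t-test on the two columns.
paired = stats.ttest_rel(af, bf)

# One sample t-test on the per-plant differences against a mean of 0.
one_sample = stats.ttest_1samp(af - bf, popmean=0.0)

print(paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sample.pvalue)
```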

Paired t-test Hypotheses

  • Null hypothesis: There is no difference between the two dependent variables (mean difference=0)
  • Alternative hypothesis (two-tailed): There is a difference between the two dependent variables
  • Alternative hypothesis (one-tailed): The mean difference between the two dependent variables is either greater than or less than zero

Paired t-test Assumptions

  • Differences between the two dependent variables follow an approximately normal distribution (can be checked with the Shapiro-Wilk test)
  • Each subject should have a pair of measurements (e.g. before and after treatment)
  • Differences between the two dependent variables should not have outliers
  • Observations (pairs) are sampled independently from each other
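
The normality and outlier assumptions apply to the per-subject differences, not to the raw columns. A minimal sketch of both checks follows; the before/after data are synthetic, and the 1.5 x IQR rule is one common outlier convention rather than something prescribed by the t-test.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after yields for 20 plants (synthetic, seeded for reproducibility).
rng = np.random.default_rng(0)
before = rng.normal(45.0, 2.0, size=20)
after = before + rng.normal(5.5, 3.0, size=20)
diff = after - before

# Normality of the differences (null hypothesis: differences are normally distributed).
w_stat, p_value = stats.shapiro(diff)

# Flag differences lying more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(diff, [25, 75])
iqr = q3 - q1
outliers = diff[(diff < q1 - 1.5 * iqr) | (diff > q3 + 1.5 * iqr)]

print(f"Shapiro-Wilk P = {p_value:.4f}, outliers found = {outliers.size}")
```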

How to perform paired t-test in Python?

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has, they will be omitted
>>> df = get_data('t_pair').data
>>> df.head()
      BF     AF
0  44.41  47.99
1  46.29  56.64
2  45.98  48.90
3  43.35  49.01
4  45.75  48.41

>>> res = stat()
>>> res.ttest(df=df, res=['AF', 'BF'], test_type=3)
# output
>>> print(res.summary)

Paired t-test

------------------  ------------
Sample size         65
Difference Mean      5.55262
t                   14.2173
Df                  64
P-value (one-tail)   8.87966e-22
P-value (two-tail)   1.77593e-21
Lower 95.0%          4.7724
Upper 95.0%          6.33283
------------------  ------------

Interpretation

The P-value obtained from the t-test is significant (P<0.05), and therefore, we conclude that the yield of plant variety A increased significantly after the application of fertilizer.

How to cite?
Renesh Bedre.(2020, July 29). reneshbedre/bioinfokit: Bioinformatics data analysis and visualization toolkit (Version v0.9). Zenodo. http://doi.org/10.5281/zenodo.3965241

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

Last updated: April 22, 2020

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.