# One sample and two sample (independent and paired) t-test in Python

## One Sample t-test

• The one sample t-test is used to compare the mean of a random sample drawn from a population with a specific value (the hypothesized or known mean of the population).
• For example, a ball is specified to have a diameter of 5 cm, and we want to check whether the average diameter of a random sample of balls (e.g. 50 balls) picked from the production line differs from this known size.

### Assumptions

• Dependent variable should have an approximately normal distribution (can be checked with the Shapiro-Wilk test)
• Observations are independent of each other
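
The normality assumption can be checked before running the test. The snippet below is a minimal sketch that uses `scipy.stats.shapiro` (SciPy, not bioinfokit) on simulated data; the array `diameters` and the random seed are assumptions made for illustration and are not the article's dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 50 simulated ball diameters in cm (not the article's dataset)
rng = np.random.default_rng(seed=0)
diameters = rng.normal(loc=5.0, scale=1.0, size=50)

# Shapiro-Wilk test: the null hypothesis is that the data come from a normal distribution
w_stat, p_value = stats.shapiro(diameters)
print(f"W = {w_stat:.4f}, P = {p_value:.4f}")

# P > 0.05 means there is no evidence against normality at the 5% level
if p_value > 0.05:
    print("No evidence against normality")
else:
    print("Normality is questionable; consider a non-parametric alternative")
```

With small samples the Shapiro-Wilk test has low power, so a histogram or Q-Q plot is a useful complement.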

### Hypotheses

• Null hypothesis: The mean of the population from which the sample was drawn is equal to the hypothesized or known population mean
• Alternative hypothesis: The mean is not equal to the hypothesized or known population mean (two-tailed)
• Alternative hypothesis: The mean is either greater than or less than the hypothesized or known population mean (one-tailed)
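
The two-tailed and one-tailed alternatives differ only in how the P-value is computed from the same t statistic. The sketch below illustrates this with `scipy.stats.ttest_1samp` and its `alternative` parameter (available in SciPy 1.6 or later, which is an assumption about your environment). It uses only the five preview rows of the example dataset shown later in this article, so the numbers will not match the full 50-ball analysis.

```python
import numpy as np
from scipy import stats

# Five preview rows of the example dataset (illustration only, not the full data)
size = np.array([5.739987, 5.254042, 5.152388, 4.870819, 3.536251])
mu = 5  # hypothesized population mean (cm)

# Same t statistic, three different alternative hypotheses
two_sided = stats.ttest_1samp(size, popmean=mu, alternative="two-sided")
greater = stats.ttest_1samp(size, popmean=mu, alternative="greater")
less = stats.ttest_1samp(size, popmean=mu, alternative="less")

print(f"t = {two_sided.statistic:.4f}")
print(f"P (mean != {mu}) = {two_sided.pvalue:.4f}")
print(f"P (mean >  {mu}) = {greater.pvalue:.4f}")
print(f"P (mean <  {mu}) = {less.pvalue:.4f}")
```

The two one-tailed P-values sum to 1, and the two-tailed P-value is twice the smaller of them.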

### How to perform one sample t-test in Python?

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has, they will be omitted
>>> df = get_data('t_one_samp').data
>>> df.head()
       size
0  5.739987
1  5.254042
2  5.152388
3  4.870819
4  3.536251

>>> res = stat()
>>> res.ttest(df=df, test_type=1, res='size', mu=5)
# output
>>> print(res.summary)

One Sample t-test

------------------  --------
Sample size         50
Mean                 5.05128
t                    0.36789
Df                  49
P-value (one-tail)   0.35727
P-value (two-tail)   0.71454
Lower 95.0%          4.77116
Upper 95.0%          5.3314
------------------  --------
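
To see where the quantities in the summary come from, the following sketch computes the t statistic, two-tailed P-value, and 95% confidence interval by hand and compares them with `scipy.stats.ttest_1samp`. This is a cross-check with SciPy, not a description of bioinfokit's internals, and it uses only the five preview rows printed above, so the values differ from the 50-ball summary.

```python
import numpy as np
from scipy import stats

# Five preview rows of the example dataset (illustration only)
size = np.array([5.739987, 5.254042, 5.152388, 4.870819, 3.536251])
mu = 5  # hypothesized population mean (cm)

n = size.size
mean = size.mean()
se = size.std(ddof=1) / np.sqrt(n)  # standard error of the mean
df = n - 1                          # degrees of freedom

# t = (sample mean - hypothesized mean) / standard error
t_manual = (mean - mu) / se
p_two_tail = 2 * stats.t.sf(abs(t_manual), df)
p_one_tail = p_two_tail / 2

# 95% confidence interval for the mean
t_crit = stats.t.ppf(0.975, df)
lower, upper = mean - t_crit * se, mean + t_crit * se

# Cross-check against SciPy's implementation
t_scipy, p_scipy = stats.ttest_1samp(size, popmean=mu)

print(f"manual: t = {t_manual:.5f}, P (two-tail) = {p_two_tail:.5f}")
print(f"scipy : t = {t_scipy:.5f}, P (two-tail) = {p_scipy:.5f}")
print(f"95% CI: {lower:.5f} to {upper:.5f}")
```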

### Interpretation

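The P-value obtained from the one sample t-test is not significant (two-tailed P = 0.71, P>0.05), and therefore, we fail to reject the null hypothesis: there is no evidence that the average diameter of the balls differs from 5 cm. This is consistent with the 95% confidence interval (4.77 to 5.33 cm), which contains 5 cm. Note that a non-significant result does not prove that the mean is exactly 5 cm.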

## Two sample t-test (unpaired or independent t-test)

• The two sample independent t-test is used to compare the means of two independent groups
• For example, we have two different plant genotypes (genotype A and genotype B) and would like to compare if the yield of genotype A is significantly different from genotype B

### Two sample t-test Hypotheses

• Null hypothesis: The two group means are equal
• Alternative hypothesis: The two group means are different (two-tailed)
• Alternative hypothesis: The mean of one group is either greater than or less than the mean of the other group (one-tailed)

### Two sample t-test Assumptions

• Observations in the two groups have an approximately normal distribution (can be checked with the Shapiro-Wilk test)
• Homogeneity of variances (variances are equal between treatment groups) (Levene or Bartlett Test)
• The two groups are sampled independently from each other from the same population
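
These assumptions can be checked with SciPy before choosing between the equal-variance t-test and Welch's t-test. The sketch below uses simulated yields (the arrays `yield_a` and `yield_b`, the seed, and the 0.05 cut-off are assumptions for illustration). Picking the test from a preliminary variance test is a common but debated practice, so treat this as one possible workflow rather than a rule.

```python
import numpy as np
from scipy import stats

# Hypothetical data: simulated yields for two genotypes (not the article's dataset)
rng = np.random.default_rng(seed=1)
yield_a = rng.normal(loc=79, scale=3.3, size=6)
yield_b = rng.normal(loc=89, scale=3.3, size=6)

# Normality within each group (Shapiro-Wilk)
_, p_shapiro_a = stats.shapiro(yield_a)
_, p_shapiro_b = stats.shapiro(yield_b)

# Homogeneity of variances (Levene is more robust to non-normality than Bartlett)
_, p_levene = stats.levene(yield_a, yield_b)
_, p_bartlett = stats.bartlett(yield_a, yield_b)

print(f"Shapiro-Wilk P: A = {p_shapiro_a:.3f}, B = {p_shapiro_b:.3f}")
print(f"Levene P = {p_levene:.3f}, Bartlett P = {p_bartlett:.3f}")

# Use the pooled-variance test only if there is no evidence of unequal variances;
# otherwise fall back to Welch's t-test (equal_var=False)
equal_var = bool(p_levene > 0.05)
result = stats.ttest_ind(yield_a, yield_b, equal_var=equal_var)
print(f"equal_var = {equal_var}, t = {result.statistic:.4f}, P = {result.pvalue:.6f}")
```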

### How to perform Two sample t-test in Python?

• We will use bioinfokit v0.9.6 or later
• See the bioinfokit documentation for installation instructions and usage
# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has, they will be omitted
>>> df = get_data('t_ind_samp').data
>>> df.head()
  Genotype  yield
0        A   78.0
1        A   84.3
2        A   81.0
3        B   88.0
4        B   92.0

>>> res = stat()
# for unequal variance t-test (Welch's t-test) set evar=False
>>> res.ttest(df=df, xfac="Genotype", res="yield", test_type=2)
# output
>>> print(res.summary)

Two sample t-test with equal variance

------------------  -------------
Mean diff           -10.3
t                    -5.40709
Std Error             1.90491
df                   10
P-value (one-tail)    0.000149204
P-value (two-tail)    0.000298408
Lower 95.0%         -14.5444
Upper 95.0%          -6.05561
------------------  -------------

Parameter estimates

Level      Number    Mean    Std Dev    Std Error    Lower 95.0%    Upper 95.0%
-------  --------  ------  ---------  -----------  -------------  -------------
A               6    79.1    3.30817      1.35056        75.6283        82.5717
B               6    89.4    3.29059      1.34338        85.9467        92.8533
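
The t statistic and P-value can be cross-checked from the summary statistics alone. The sketch below feeds the means, standard deviations, and group sizes from the "Parameter estimates" table into `scipy.stats.ttest_ind_from_stats`, which is a SciPy function and not part of bioinfokit. Because the inputs are rounded, expect agreement with the table only to a few decimal places.

```python
from scipy import stats

# Summary statistics copied from the "Parameter estimates" table above
mean_a, sd_a, n_a = 79.1, 3.30817, 6
mean_b, sd_b, n_b = 89.4, 3.29059, 6

# Student's t-test with pooled variance, computed from summary statistics only
result = stats.ttest_ind_from_stats(
    mean1=mean_a, std1=sd_a, nobs1=n_a,
    mean2=mean_b, std2=sd_b, nobs2=n_b,
    equal_var=True,
)
t_stat, p_two_tail = result.statistic, result.pvalue
df = n_a + n_b - 2  # degrees of freedom for the pooled test

print(f"Mean diff = {mean_a - mean_b:.1f}")
print(f"t = {t_stat:.5f}, df = {df}")
print(f"P-value (two-tail) = {p_two_tail:.9f}")
print(f"P-value (one-tail) = {p_two_tail / 2:.9f}")
```

Setting `equal_var=False` in the same call gives Welch's t-test, which does not assume equal variances.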

Note: Even though you can perform a t-test when the sample sizes of the two groups are unequal, for a fixed total number of observations (and equal variances) the power of the t-test is highest when the two groups have equal sample sizes.
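
One way to see why balanced groups help: with equal variances, the standard error of the mean difference is proportional to sqrt(1/n1 + 1/n2), and a smaller standard error gives a larger t statistic for the same difference. The sketch below evaluates this factor for every split of 12 observations (the total of 12 mirrors the example above; the variable names are illustrative).

```python
import math

total = 12  # total observations across both groups, as in the 6 + 6 example

# Standard error factor sqrt(1/n1 + 1/n2) for each possible allocation
factors = {}
for n1 in range(2, total - 1):
    n2 = total - n1
    factors[n1] = math.sqrt(1 / n1 + 1 / n2)
    print(f"n1 = {n1:2d}, n2 = {n2:2d}, SE factor = {factors[n1]:.4f}")

# The allocation with the smallest factor gives the most precise comparison
best_n1 = min(factors, key=factors.get)
print(f"Smallest SE factor at n1 = {best_n1}, n2 = {total - best_n1}")
```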

### Interpretation

The P-value obtained from the t-test is significant (P<0.05), and therefore, we conclude that the yield of genotype A is significantly different from that of genotype B. The mean difference (A minus B) is -10.3, i.e. genotype B yields more on average.

## Paired t-test (dependent t-test)

• The paired t-test is used to compare the differences between a pair of dependent variables measured on the same subject
• For example, we have plant variety A and would like to compare the yield of A before and after the application of some fertilizer
• Note: Paired t-test is a one sample t-test on the differences between the two dependent variables
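
The equivalence stated in the note can be verified numerically. The sketch below runs `scipy.stats.ttest_rel` on the paired values and `scipy.stats.ttest_1samp` on their differences; both are SciPy functions used here for illustration. It uses only the five BF/AF preview rows of the example dataset shown later in this article, so the results differ from the full 65-pair analysis.

```python
import numpy as np
from scipy import stats

# Five preview rows of the example dataset: yield before (BF) and after (AF) fertilizer
bf = np.array([44.41, 46.29, 45.98, 43.35, 45.75])
af = np.array([47.99, 56.64, 48.90, 49.01, 48.41])

# Paired t-test on the two dependent variables
paired = stats.ttest_rel(af, bf)

# One sample t-test on the pairwise differences against a mean of 0
diff = af - bf
one_sample = stats.ttest_1samp(diff, popmean=0)

print(f"paired    : t = {paired.statistic:.5f}, P = {paired.pvalue:.5f}")
print(f"one sample: t = {one_sample.statistic:.5f}, P = {one_sample.pvalue:.5f}")
```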

### Paired t-test Hypotheses

• Null hypothesis: There is no difference between the two dependent variables (difference=0)
• Alternative hypothesis: There is a difference between the two dependent variables (two-tailed)
• Alternative hypothesis: The difference between the two dependent variables is either greater than or less than zero (one-tailed)

### Paired t-test Assumptions

• Differences between the two dependent variables follow an approximately normal distribution (can be checked with the Shapiro-Wilk test)
• Each subject (independent variable) should have a pair of measurements (dependent variables)
• Differences between the two dependent variables should not have outliers
• Observations are sampled independently from each other
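
For the paired design, the normality and outlier checks apply to the differences, not to the raw measurements. The sketch below uses `scipy.stats.shapiro` and the 1.5 x IQR rule, which is one common convention for flagging outliers and is an assumption here. It runs on the five BF/AF preview rows of the example dataset only; with so few values the checks are illustrative, not conclusive.

```python
import numpy as np
from scipy import stats

# Five preview rows of the example dataset: yield before (BF) and after (AF) fertilizer
bf = np.array([44.41, 46.29, 45.98, 43.35, 45.75])
af = np.array([47.99, 56.64, 48.90, 49.01, 48.41])
diff = af - bf  # the paired t-test works on these differences

# Normality of the differences (Shapiro-Wilk)
w_stat, p_shapiro = stats.shapiro(diff)
print(f"Shapiro-Wilk on differences: W = {w_stat:.4f}, P = {p_shapiro:.4f}")

# Outliers among the differences using the 1.5 x IQR rule
q1, q3 = np.percentile(diff, [25, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = diff[(diff < lower_fence) | (diff > upper_fence)]
print(f"Fences: {lower_fence:.2f} to {upper_fence:.2f}")
print(f"Flagged differences: {outliers}")
```

A flagged value is a prompt to inspect that pair, not an automatic reason to drop it.
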

### How to perform paired t-test in Python?

# I am using interactive python interpreter (Python 3.7.4)
>>> from bioinfokit.analys import get_data, stat
# load dataset as pandas dataframe
# the dataset should not have missing (NaN) values. If it has, they will be omitted
>>> df = get_data('t_pair').data
>>> df.head()
      BF     AF
0  44.41  47.99
1  46.29  56.64
2  45.98  48.90
3  43.35  49.01
4  45.75  48.41

>>> res = stat()
>>> res.ttest(df=df, res=['AF', 'BF'], test_type=3)
# output
>>> print(res.summary)

Paired t-test

------------------  ------------
Sample size         65
Difference Mean      5.55262
t                   14.2173
Df                  64
P-value (one-tail)   8.87966e-22
P-value (two-tail)   1.77593e-21
Lower 95.0%          4.7724
Upper 95.0%          6.33283
------------------  ------------

### Interpretation

The P-value obtained from the t-test is significant (P<0.05), and therefore, we conclude that the yield of plant variety A increased significantly after the application of fertilizer (mean difference AF minus BF = 5.55, 95% confidence interval 4.77 to 6.33).

## References

• Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature methods. 2020 Mar;17(3):261-72.
• Kim TK, Park JH. More about the basic assumptions of t-test: normality and sample size. Korean journal of anesthesiology. 2019 Aug;72(4):331.

How to cite?
Renesh Bedre. (2020, July 29). reneshbedre/bioinfokit: Bioinformatics data analysis and visualization toolkit (Version v0.9). Zenodo. http://doi.org/10.5281/zenodo.3965241

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

Last updated: April 22, 2020