Ways to determine the statistical significance between two independent datasets


Suppose A and B are two datasets, each with perhaps 100 features. How do I perform hypothesis testing on these independent datasets to determine whether they differ significantly?

I tried to write some code in Python. I preprocessed both datasets and used Student's t-test, assuming the columns are normalized. The datasets are tabular with continuous values, and I one-hot encoded the categorical features. Using the scipy.stats library, I performed a t-test on a single numerical column from both datasets, but I can't figure out how to apply it across the entire dataset.


There are 3 answers

Sandipan Dey (BEST ANSWER)

The Kolmogorov-Smirnov test is a non-parametric statistical test that can be used to determine if two samples come from the same distribution.

One approach you can take is to perform a KS test for each feature (column) of datasets A and B, checking whether the corresponding columns come from the same distribution (using the scipy.stats.ks_2samp() function).

The following code shows an example using a pair of 2-column datasets, A and B. The first feature (column) of both datasets is sampled from the same (standard normal) distribution, but the second feature comes from different normal distributions (with different parameters).

import numpy as np
from scipy.stats import ks_2samp

n = 100 # number of samples

A = np.hstack((np.random.normal(loc=0, scale=1, size=n).reshape(-1,1), \
               np.random.normal(loc=0, scale=1, size=n).reshape(-1,1)))

B = np.hstack((np.random.normal(loc=0, scale=1, size=n).reshape(-1,1), \
               np.random.normal(loc=20, scale=5, size=n).reshape(-1,1)))

If you plot the histogram of the features for the two datasets, the first columns largely overlap while the second columns are clearly separated.

[Figure: per-feature histograms for datasets A and B]

Clearly the second feature is highly likely to have been drawn from different distributions. Let's verify this with the KS test.

for i in range(A.shape[1]):
    print(f'Kolmogorov-Smirnov test for feature column {i}')
    statistic, pvalue = ks_2samp(A[:,i], B[:,i])
    print(f"Test statistic: {statistic}")
    print(f"P-value: {pvalue}")

# Kolmogorov-Smirnov test for feature column 0
# Test statistic: 0.13
# P-value: 0.36818778606286096  # can't reject H0

# Kolmogorov-Smirnov test for feature column 1
# Test statistic: 1.0
# P-value: 2.2087606931995054e-59 # reject H0

As can be seen above, using the KS test,

  • we cannot reject the null hypothesis (at the 5% level of significance) that the first feature of datasets A and B came from the same distribution, since the p-value is high (0.368 > 0.05),
  • we can correctly reject the null hypothesis that the second feature of datasets A and B came from the same distribution, since the p-value is almost 0.

You can use the same approach on your 100-column datasets by comparing the corresponding columns pairwise.
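One caveat worth noting (not part of the original answer): when you run 100 per-feature tests, a few small p-values will appear by chance alone, so a multiple-testing correction is often applied. The sketch below uses a simple Bonferroni correction on the per-column KS p-values; the toy data and the choice of Bonferroni are illustrative assumptions, not something the answer prescribes.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
n, k = 100, 5  # samples per dataset, number of features

# Toy datasets: here every feature is drawn from the same distribution
A = rng.normal(size=(n, k))
B = rng.normal(size=(n, k))

# Raw per-feature KS p-values
p_values = np.array([ks_2samp(A[:, i], B[:, i]).pvalue for i in range(k)])

# Bonferroni correction: multiply each p-value by the number of tests, cap at 1
p_adjusted = np.minimum(p_values * k, 1.0)

print(p_values)
print(p_adjusted)
```

With the correction, a feature is flagged only if its adjusted p-value stays below the significance level, which guards against false positives across the 100 comparisons.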

Suraj Shourie

This might be a question for https://stats.stackexchange.com/.

But I'll try to give one approach using Python code. It uses Student's t-test, or Welch's t-test, which is a stricter variant in that it does not assume the two distributions have equal variances.

Note that this checks whether the means of the two distributions are statistically similar or not.

Example code in Python for dummy data:

import numpy as np
from scipy import stats
arr1 = np.random.normal(loc=1,size=(10000,2))
arr2 = np.random.normal(loc=1,size=(10000,2))
print(stats.ttest_ind(arr1, arr2, equal_var=True, axis=0))

Output:

TtestResult(statistic=array([-2.13993016,  0.87158797]), pvalue=array([0.03237248, 0.38344366]), df=array([19998., 19998.]))

The above code compares the two arrays column by column for equal means and reports the p-value (and t-statistic) for each column.

Arun Singh Babal

You can apply the t-test to all the features of the dataset as follows:

import numpy as np
from scipy import stats

# df1 and df2 are assumed to be NumPy arrays of equal width; for pandas
# DataFrames, use df1.iloc[:, i] (or df1.values) instead of df1[:, i]
p_values = []
for i in range(df1.shape[1]):
    _, p_value = stats.ttest_ind(df1[:, i], df2[:, i])
    p_values.append(p_value)
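To make the loop above runnable end to end, here is a self-contained version where df1 and df2 stand in for your real data; the synthetic arrays (three features, with only the last one shifted) are made up for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three features; df2's last column has a shifted mean, the rest match df1
df1 = rng.normal(loc=0, size=(200, 3))
df2 = np.column_stack([
    rng.normal(loc=0, size=200),
    rng.normal(loc=0, size=200),
    rng.normal(loc=5, size=200),  # only this column differs
])

p_values = []
for i in range(df1.shape[1]):
    _, p_value = stats.ttest_ind(df1[:, i], df2[:, i])
    p_values.append(p_value)

print(p_values)  # the last p-value should be near zero
```

Only the shifted column produces a tiny p-value, so scanning the list tells you which features differ in mean between the two datasets.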