# Goodness of fit test in R

This article is originally published at https://hameddaily.blogspot.com/

As a data scientist, you occasionally receive a dataset and would like to know which distribution generated it. In this post, I show how to answer that question in R. To do that, let's build an arbitrary dataset by sampling from a Gamma distribution. To make the problem a little more interesting, let's add Gaussian noise to simulate measurement noise:

num_of_samples = 1000

x <- rgamma(num_of_samples, shape = 10, scale = 3)

x <- x + rnorm(length(x), mean=0, sd = .1)

The workflow has four steps:

1. Visualization: plot the histogram of the data.
2. Guess which distribution would fit the data best.
3. Use a statistical test for goodness of fit.
4. Repeat steps 2 and 3 if the measure of goodness is not satisfactory.

p1 <- hist(x,breaks=50, include.lowest=FALSE, right=FALSE)
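For the "guess" step, one possible starting point is a method-of-moments estimate drawn over the histogram. The sketch below is not part of the original post: the seed and the names `shape_hat` and `scale_hat` are assumptions made for illustration, and it relies only on the Gamma identities mean = shape × scale and variance = shape × scale².

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
x <- rgamma(1000, shape = 10, scale = 3) + rnorm(1000, mean = 0, sd = 0.1)

# Method-of-moments guess for a Gamma distribution:
# mean = shape * scale and var = shape * scale^2
scale_hat <- var(x) / mean(x)
shape_hat <- mean(x)^2 / var(x)

# Density-scaled histogram with the guessed Gamma density drawn on top
hist(x, breaks = 50, freq = FALSE, right = FALSE, main = "Data vs. guessed Gamma")
curve(dgamma(x, shape = shape_hat, scale = scale_hat), add = TRUE, lwd = 2)
```

If the curve tracks the histogram reasonably well, the Gamma family is a sensible candidate to pass on to the formal tests.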

- H0 = The data is consistent with a specified reference distribution.
- H1 = The data is *NOT* consistent with a specified reference distribution.

To decide between the two hypotheses, the p-value returned by the test is compared with a threshold called the __significance level__ (or __statistical significance__). The value of the significance level depends on the application, but it is usually in the range **[.01, .1]**. If the result of the statistical test is above the level, we do not reject the null hypothesis. In other words, if the test result is above the threshold, we have no evidence that the observed sample frequencies differ from the expected frequencies specified in the null hypothesis.
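The decision rule can be written down in a couple of lines. This is only an illustrative sketch: the helper name `reject_h0` and the choice of 0.05 as the level are assumptions, not something the original post defines.

```r
alpha <- 0.05  # assumed significance level, inside the usual [.01, .1] range

# Hypothetical helper: TRUE when the null hypothesis is rejected at level alpha
reject_h0 <- function(p_value, alpha = 0.05) p_value < alpha

reject_h0(0.2, alpha)    # FALSE: H0 is not rejected
reject_h0(5e-4, alpha)   # TRUE: H0 is rejected
```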

- *Reference distribution* is the distribution that we __assume__ fits the data best. The hypothesis test checks whether this assumption is plausible.
- *Primary distribution* is the actual distribution that the data was sampled from. In practice this distribution is unknown, and we try to estimate it.

#### Chi Square test

- The candidate distribution needs to be a pmf whose entries sum to 1. If the probabilities are not normalized, set rescale.p to TRUE.
- By default the p-value comes from the asymptotic chi-square distribution, which can be inaccurate when some bins have small expected counts. To compute the p-value by Monte Carlo simulation instead, set simulate.p.value to TRUE. The number of replicates is set with B (default 2000).

a <- chisq.test(p1$counts, p=null.probs, rescale.p=TRUE, simulate.p.value=TRUE)

**How to create the null.probs**

library('zoo')

breaks_cdf <- pgamma(p1$breaks, shape=10, scale=3)

null.probs <- rollapply(breaks_cdf, 2, function(x) x[2]-x[1])
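Putting the pieces together, a self-contained version of the chi-square test might look like the sketch below. It uses base R's `diff()`, which gives the same consecutive differences as the `rollapply` call, so `zoo` is not needed here. The seed and the names `h`, `null_probs` and `fit` are assumptions made for illustration.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
x <- rgamma(1000, shape = 10, scale = 3) + rnorm(1000, sd = 0.1)

# plot = FALSE returns the bin counts and breaks without drawing anything
h <- hist(x, breaks = 50, right = FALSE, plot = FALSE)

# Probability of each bin under Gamma(10, 3):
# difference of the CDF at consecutive breaks
null_probs <- diff(pgamma(h$breaks, shape = 10, scale = 3))

# The bins only cover the observed range, so the sum is at most 1;
# this is why rescale.p = TRUE is needed below
sum(null_probs)

fit <- chisq.test(h$counts, p = null_probs, rescale.p = TRUE,
                  simulate.p.value = TRUE, B = 2000)
fit$p.value
```

Because the p-value is simulated, it changes slightly from run to run unless the seed is fixed.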

#### Cramér–von Mises criterion

Here a second, independent sample drawn from a Gamma distribution serves as the *reference distribution* for hypothesis testing. Note that since the second Gamma sample is the basis of the comparison, we use a large sample size to closely approximate the Gamma distribution.

num_of_samples = 100000

y <- rgamma(num_of_samples, shape = 10, scale = 3)

library(CDFt)  # provides CramerVonMisesTwoSamples()

res <- CramerVonMisesTwoSamples(x,y)

p_value <- 1/6*exp(-res)

The resulting p-value is above the *significance level*, and so the two distributions are close enough.
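As an alternative that needs no extra package and no second sample, the one-sample form of the Cramér–von Mises statistic can be computed directly against the Gamma(10, 3) CDF. Note that this is a different technique from the two-sample call above: the helper `cvm_stat`, the seed, and the Monte Carlo p-value are all assumptions of this sketch, not the method used in the post.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
x <- rgamma(1000, shape = 10, scale = 3) + rnorm(1000, sd = 0.1)

# One-sample Cramér–von Mises statistic on u = F(x):
# W2 = 1/(12n) + sum((u_(i) - (2i - 1)/(2n))^2), with u_(i) the sorted values
cvm_stat <- function(u) {
  n <- length(u)
  1 / (12 * n) + sum((sort(u) - (2 * seq_len(n) - 1) / (2 * n))^2)
}

# Observed statistic against the hypothesised Gamma(10, 3) CDF
w2_obs <- cvm_stat(pgamma(x, shape = 10, scale = 3))

# Under H0 the transformed values are Uniform(0, 1),
# so the null distribution of the statistic can be simulated
w2_null <- replicate(2000, cvm_stat(runif(length(x))))
p_value <- mean(w2_null >= w2_obs)
p_value
```

The simulated p-value is compared with the significance level in the same way as before.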

#### Kolmogorov–Smirnov test

The Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-__dimensional__ probability distributions. Like the Cramér–von Mises test, it compares the empirical distribution with a reference distribution, so we use it the same way as before:

num_of_samples = 100000

y <- rgamma(num_of_samples, shape = 10, scale = 3)

result = ks.test(x, y)
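`ks.test` also has a one-sample form that takes the name of a CDF and its parameters, so the large reference sample is optional. The sketch below runs both forms side by side. The seed and the names `ks_one` and `ks_two` are assumptions made for illustration.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
x <- rgamma(1000, shape = 10, scale = 3) + rnorm(1000, sd = 0.1)

# One-sample form: compare x directly with the Gamma(10, 3) CDF
ks_one <- ks.test(x, "pgamma", shape = 10, scale = 3)

# Two-sample form, as in the post: compare x with a large reference sample
y <- rgamma(100000, shape = 10, scale = 3)
ks_two <- ks.test(x, y)

c(one_sample = ks_one$p.value, two_sample = ks_two$p.value)
```

The two p-values are not expected to be identical, since the two-sample form also carries the sampling noise of `y`.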

#### Different Reference Distribution

Reference Distribution | Chi square test | Kolmogorov–Smirnov test | Cramér–von Mises criterion |
---|---|---|---|
Gamma(11, 3) | 5e-4 | 2e-10 | 0.019 |
N(30, 90) | 4e-5 | 2.2e-16 | 3e-3 |
Gamma(10, 3) | .2 | .22 | .45 |
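A comparison like the one in the table can be produced by looping over candidate CDFs. The sketch below covers the chi-square and Kolmogorov–Smirnov columns only, and it reads N(30, 90) as mean 30 and variance 90, which is an assumption. The seed and the names `candidates` and `p_values` are also illustrative. Because the samples are random, the numbers will not match the table exactly.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
x <- rgamma(1000, shape = 10, scale = 3) + rnorm(1000, sd = 0.1)
h <- hist(x, breaks = 50, right = FALSE, plot = FALSE)

# Candidate reference distributions, each represented by its CDF
candidates <- list(
  "Gamma(11, 3)" = function(q) pgamma(q, shape = 11, scale = 3),
  "N(30, 90)"    = function(q) pnorm(q, mean = 30, sd = sqrt(90)),  # 90 taken as the variance (assumption)
  "Gamma(10, 3)" = function(q) pgamma(q, shape = 10, scale = 3)
)

# One row per candidate, one column per test
p_values <- t(sapply(candidates, function(cdf) {
  chi <- chisq.test(h$counts, p = diff(cdf(h$breaks)), rescale.p = TRUE,
                    simulate.p.value = TRUE, B = 2000)
  c(chi_square = chi$p.value, ks = ks.test(x, cdf)$p.value)
}))
p_values
```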
