Big Data Solutions: A/B t test

by Simon Jackson · August 15, 2017

This article was originally published at https://drsimonj.svbtle.com

@drsimonj here to share my code for using Welch’s t-test to compare group means using summary statistics.

Motivation

I’ve just started working with A/B tests that use big data. Where once I’d whimsically run t.test(), now my data won’t fit into memory!

I’m sharing my solution here in the hope that it might help others.

In-memory data

As a baseline, let’s start with an in-memory case by testing whether automatic and manual cars differ in their average Miles Per Gallon ratings (using the mtcars data set).

t.test(mpg ~ am, data = mtcars)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  mpg by am
#> t = -3.7671, df = 18.332, p-value = 0.001374
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -11.280194  -3.209684
#> sample estimates:
#> mean in group 0 mean in group 1 
#>        17.14737        24.39231

Well… that was easy!

Big Data

The problem with big data is that we can’t pull it into memory and work with it in R.

Fortunately, we don’t need the raw data to run Welch’s t-test. All we need is the mean, variance, and sample size of each group. So our raw data might have billions of rows, but we only need six numbers.

Here are the numbers we need for the previous example:

library(dplyr)

grp_summary <- mtcars %>% 
  group_by(am) %>% 
  summarise(
    mpg_mean = mean(mpg),
    mpg_var  = var(mpg),
    n        = n()
  )

grp_summary
#> # A tibble: 2 x 4
#>      am mpg_mean  mpg_var     n
#>   <dbl>    <dbl>    <dbl> <int>
#> 1     0 17.14737 14.69930    19
#> 2     1 24.39231 38.02577    13

This is everything we need to obtain a t value, degrees of freedom, and a p value.
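
If a single group is too large to summarise in one pass, the same three numbers can be built up chunk by chunk. Here's a minimal sketch, assuming the pairwise merge formula of Chan et al.; the helper names `chunk_summary()` and `merge_summaries()` are hypothetical, and `mtcars$mpg` just stands in for data that arrives in pieces:

```r
# Summarise one chunk as (n, mean, m2), where m2 is the sum of
# squared deviations from the chunk's own mean.
chunk_summary <- function(x) {
  list(n = length(x), mean = mean(x), m2 = sum((x - mean(x))^2))
}

# Merge two summaries without revisiting the raw values.
merge_summaries <- function(a, b) {
  n     <- a$n + b$n
  delta <- b$mean - a$mean
  list(
    n    = n,
    mean = a$mean + delta * b$n / n,
    m2   = a$m2 + b$m2 + delta^2 * a$n * b$n / n
  )
}

# Pretend mtcars$mpg arrives in chunks of (at most) 10 values
chunks  <- split(mtcars$mpg, ceiling(seq_along(mtcars$mpg) / 10))
overall <- Reduce(merge_summaries, lapply(chunks, chunk_summary))

# Mean, sample variance (n - 1 denominator), and sample size
c(mean = overall$mean,
  var  = overall$m2 / (overall$n - 1),
  n    = overall$n)
```

In an A/B test you would keep one running summary per group.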

t value

Here we use the means, variances, and sample sizes to compute Welch’s t:

welch_t <- diff(grp_summary$mpg_mean) / sqrt(sum(grp_summary$mpg_var/grp_summary$n))

cat("Welch's t value of the mean difference is", welch_t)
#> Welch's t value of the mean difference is 3.767123

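This is the same value returned by t.test(), apart from the sign. The sign only reflects which group mean is subtracted from which (diff() takes the second minus the first), so it doesn’t matter for a two-tailed test.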
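
For reference, the calculation above can be written out as follows. The notation is my own: subscripts 1 and 2 refer to the first and second rows of grp_summary, with group means, variances, and sample sizes denoted by x-bar, s squared, and n.

```latex
\[
t = \frac{\bar{x}_2 - \bar{x}_1}
         {\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
\]
```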

Degrees of Freedom

Here, we use the variances and sample sizes to compute the degrees of freedom, which is estimated by the Welch–Satterthwaite equation:

welch_df <- ((sum(grp_summary$mpg_var/grp_summary$n))^2) /
            sum(grp_summary$mpg_var^2/(grp_summary$n^2 * (grp_summary$n - 1)))

cat("Degrees of Freedom for Welch's t is", welch_df)
#> Degrees of Freedom for Welch's t is 18.33225

Again, same as t.test().
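
Written out in the same style (subscripts 1 and 2 for the two groups, notation my own), the Welch–Satterthwaite calculation above is:

```latex
\[
\nu \approx
\frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}
     {\dfrac{s_1^4}{n_1^2 (n_1 - 1)} + \dfrac{s_2^4}{n_2^2 (n_2 - 1)}}
\]
```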

p value

We can now calculate the p value thanks to R’s pt(). Assuming we want to conduct a two-tailed test, here’s what we need to do:

welch_p <- 2 * pt(abs(welch_t), welch_df, lower.tail = FALSE)

cat("p-value for Welch's t is", welch_p)
#> p-value for Welch's t is 0.001373638

Same as t.test() again!
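
If a one-tailed test suits your A/B test better, only the call to pt() changes. This is a sketch rather than part of the original analysis: the t value and degrees of freedom are hard-coded from the output above so the snippet runs on its own, and the names `p_greater` and `p_less` are just illustrative. Note that the sign of t now matters.

```r
# Values computed earlier in the post (second group minus first group)
welch_t  <- 3.767123
welch_df <- 18.33225

# Two-tailed: a difference in either direction counts
p_two_sided <- 2 * pt(abs(welch_t), welch_df, lower.tail = FALSE)

# One-tailed: is the second group's mean greater than the first's?
p_greater <- pt(welch_t, welch_df, lower.tail = FALSE)

# One-tailed: is the second group's mean less than the first's?
p_less <- pt(welch_t, welch_df, lower.tail = TRUE)

c(two_sided = p_two_sided, greater = p_greater, less = p_less)
```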

All-in-one Function

Now that we know the math, let’s write a function that takes 2-element vectors of means, variances, and sample sizes, and returns the results in a data frame:

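welch_t_test <- function(sample_means, sample_vars, sample_ns) {
  # Welch's t: difference in means divided by its standard error
  t_val <- diff(sample_means) / sqrt(sum(sample_vars/sample_ns))

  # Welch–Satterthwaite approximation to the degrees of freedom
  df    <- ((sum(sample_vars/sample_ns))^2) /
            sum(sample_vars^2/(sample_ns^2 * (sample_ns - 1)))

  # Two-tailed p value
  p_val <- 2 * pt(abs(t_val), df, lower.tail = FALSE)

  # Return the results as a one-row data frame
  data.frame(t_val = t_val,
             df    = df,
             p_val = p_val)
}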

welch_t_test(grp_summary$mpg_mean,
             grp_summary$mpg_var,
             grp_summary$n)
#>      t_val       df       p_val
#> 1 3.767123 18.33225 0.001373638

Excellent!
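
The t.test() output also reports a confidence interval, and that can be recovered from the same six numbers. Here's a rough sketch using qt(); the function name `welch_ci()` and its `conf_level` argument are my own additions, not something from the steps above:

```r
welch_ci <- function(sample_means, sample_vars, sample_ns, conf_level = 0.95) {
  mean_diff <- unname(diff(sample_means))          # second group minus first
  se        <- sqrt(sum(sample_vars / sample_ns))  # standard error of the difference

  # Welch–Satterthwaite degrees of freedom (se^4 is the squared sum of var/n)
  df <- se^4 / sum(sample_vars^2 / (sample_ns^2 * (sample_ns - 1)))

  # Half-width of the interval from the t distribution's quantile
  margin <- qt(1 - (1 - conf_level) / 2, df) * se

  data.frame(mean_diff = mean_diff,
             lower     = mean_diff - margin,
             upper     = mean_diff + margin)
}

# Summary statistics for the mtcars example, computed in base R
groups <- split(mtcars$mpg, mtcars$am)
ci <- welch_ci(sapply(groups, mean),
               sapply(groups, var),
               sapply(groups, length))
ci
```

Because diff() subtracts the first group from the second, the interval should match the one from t.test() with the signs flipped.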

Back to Big Data

The point of all this was to help me conduct an A/B test with big data. Has it?

Of course! I don’t pull billions of rows from my database into memory. Instead, I create a table of the summary statistics within my big data ecosystem. These are easy to pull into memory.

How you create this summary table will vary depending on your setup, but here’s a mock Hive/SQL query to demonstrate the idea:

CREATE TABLE summary_tbl AS

SELECT
    group_var
  , AVG(outcome)      AS outcome_mean
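  , VAR_SAMP(outcome) AS outcome_variance  -- sample variance (n - 1), to match R's var(); VARIANCE() is a population variance in some engines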
  , COUNT(*)          AS n

FROM
  raw_tbl

GROUP BY
  group_var
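
One thing worth checking once the table is in R: R's var() divides by n - 1, but whether a SQL variance function divides by n or n - 1 depends on the engine. Here's a sketch of how a population variance could be rescaled before testing. It assumes the column names from the mock query above, and the `summary_tbl` data frame is built from mtcars purely as a stand-in for a table pulled from a database:

```r
# The function from earlier in the post, repeated so this runs on its own
welch_t_test <- function(sample_means, sample_vars, sample_ns) {
  t_val <- diff(sample_means) / sqrt(sum(sample_vars / sample_ns))
  df    <- ((sum(sample_vars / sample_ns))^2) /
            sum(sample_vars^2 / (sample_ns^2 * (sample_ns - 1)))
  p_val <- 2 * pt(abs(t_val), df, lower.tail = FALSE)
  data.frame(t_val = t_val, df = df, p_val = p_val)
}

# Stand-in for the pulled summary table, deliberately holding
# POPULATION variances (divided by n) to mimic such an engine
groups <- split(mtcars$mpg, mtcars$am)
summary_tbl <- data.frame(
  group_var        = names(groups),
  outcome_mean     = sapply(groups, mean),
  outcome_variance = sapply(groups, function(x) mean((x - mean(x))^2)),
  n                = sapply(groups, length)
)

# Rescale population variance to sample variance: multiply by n / (n - 1)
sample_var <- with(summary_tbl, outcome_variance * n / (n - 1))

result <- welch_t_test(summary_tbl$outcome_mean, sample_var, summary_tbl$n)
result
```

With billions of rows the correction is negligible, but for small groups it can be enough to shift the result.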

Happy testing!

Sign off

Thanks for reading and I hope this was useful for you.

For updates on recent blog posts, follow @drsimonj on Twitter, or email me at [email protected] to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

