To sample or not to sample…part 2
This article is originally published at https://blogs.oracle.com/r/compendium/rss
In my previous post, To sample or not to sample, we discussed some of the issues involved in sampling data for use in machine learning. In this post, we look at using the Oracle R Enterprise transparency layer to perform a few types of sampling: simple random sampling, with and without replacement, and stratified sampling.
When your data is too large to fit in memory, you're left with a paradox: you need to sample the data so it fits in memory, but you need to load it into memory before you can sample it. Of course, one can read subsets of the data, sample those, and repeat until the desired sample is achieved, but this requires additional coding and may or may not be general purpose.
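To give a sense of the "additional coding" involved, here is a minimal base R sketch of sampling from data that arrives in chunks. It uses reservoir sampling, which is one standard technique for this and not necessarily the approach alluded to above. The function name reservoir_sample and the in-memory list of chunks are hypothetical stand-ins for reading pieces of a large file.

```r
# Hypothetical helper: draw k rows uniformly at random, without replacement,
# from data seen one chunk at a time (reservoir sampling).
reservoir_sample <- function(chunks, k) {
  reservoir <- vector("list", k)
  seen <- 0L
  for (chunk in chunks) {
    for (i in seq_len(nrow(chunk))) {
      seen <- seen + 1L
      if (seen <= k) {
        # Fill the reservoir with the first k rows
        reservoir[[seen]] <- chunk[i, , drop = FALSE]
      } else {
        # Keep the new row with probability k/seen, replacing a random slot
        j <- sample.int(seen, 1L)
        if (j <= k) reservoir[[j]] <- chunk[i, , drop = FALSE]
      }
    }
  }
  do.call(rbind, reservoir)
}

set.seed(1)
full <- data.frame(a = 1:20, b = letters[1:20])
# Stand-in for four chunks read one after another from disk
chunks <- split(full, rep(1:4, each = 5))
s <- reservoir_sample(chunks, 5)
s
```

Even this small version needs its own bookkeeping and a row-by-row loop, which is the overhead that sampling directly in the database avoids.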
Enter Oracle R Enterprise, which enables users to sample directly from the database, without first having to load data into memory. This minimizes data movement and simplifies sampling, especially for larger data volumes. The transparency layer, via row indexing, makes this possible.
Simple random sampling
In "simple random sampling," a subset of the data rows is selected such that every row is equally likely to be chosen. There are two common variations of simple random sampling: without replacement and with replacement. Sampling without replacement guarantees that each row can be chosen at most once. Sampling with replacement allows a row to be chosen multiple times, effectively placing the row back into the pool of records for re-selection. For those interested in understanding why one would use one or the other, check out these resources: UTexas, ThoughtCo, Wikipedia.
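The difference between the two variations is easy to see with the base R sample function on plain row indexes. This is a small local sketch, not Oracle R Enterprise code; the variable names are illustrative.

```r
set.seed(1)
N <- 20

# Without replacement: each index can appear at most once
idxWithout <- sample(N, 5)

# With replacement: drawing more indexes than there are rows
# guarantees that at least one index repeats
idxWith <- sample(N, 30, replace = TRUE)

anyDuplicated(idxWithout) == 0   # TRUE: no repeats
anyDuplicated(idxWith) > 0       # TRUE: at least one repeat
```

Asking for 30 indexes from 20 without replacement would raise an error, since the pool runs out.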
So, how do we perform simple random sampling using Oracle R Enterprise? Consider the following simple example where a data.frame consists of 20 rows and 2 columns, one of numbers and one of letters. We push this R data.frame to the database using ore.push, which returns an ore.frame. The resulting ore.frame is automatically ordered, i.e., it can be used for row indexing. We sample 5 rows from this ore.frame without replacement, using the native R sample function to obtain the indexes of the rows we want to retrieve from the database. Note that the result in simpleRandomSample is also an ore.frame. This allows us to manipulate that data in the database before possibly pulling it to the client for further processing.
> set.seed(1)
> N <- 20
> myData <- data.frame(a=1:N,b=letters[1:N])
> MYDATA <- ore.push(myData)
> head(MYDATA)
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
> sampleSize <- 5
> simpleRandomSample <- MYDATA[sample(nrow(MYDATA),sampleSize), ,
+                              drop=FALSE]
> class(simpleRandomSample)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> simpleRandomSample
    a b
4   4 d
6   6 f
8   8 h
11 11 k
16 16 p

In the next example, we illustrate simple random sampling with replacement by setting the argument replace to TRUE in the sample function. A key difference in the result is that row 11 has been chosen twice, meaning that it was randomly selected twice by the sample function. Note also that ORE automatically created a unique row name, "11.1", for the duplicate row.

> set.seed(1)
> N <- 50
> myData <- data.frame(a=1:N,b=rep(letters[1:10],each=5)[1:N])
> MYDATA <- ore.push(myData)
> head(MYDATA)
  a b
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
6 6 b
> sampleSize <- 15
> simpleRandomSample <- MYDATA[sample(nrow(MYDATA),sampleSize, replace=TRUE), ,
+                              drop=FALSE]
> class(simpleRandomSample)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> simpleRandomSample
      a b
4     4 a
9     9 b
11   11 c
11.1 11 c
14   14 c
19   19 d
20   20 d
29   29 f
32   32 g
34   34 g
35   35 g
39   39 h
45   45 i
46   46 j
48   48 j

Notice that the sample above does not contain every distinct value of column b: the value e is missing. This is possible, if not typical, with simple random sampling. The implication is that a predictive model built using a sample that doesn't include all possible target values cannot predict values it hasn't seen in the training data. To ensure that every level of the target is represented in the sample, we can use stratified sampling.
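A quick way to check whether a sample covers every value of a column is to compare the distinct values in the sample with those in the full data. The sketch below does this in base R on a local data.frame, so it runs without a database connection; the name missingValues is illustrative, and the exact rows drawn depend on your R version's random number generator.

```r
set.seed(1)
N <- 50
myData <- data.frame(a = 1:N, b = rep(letters[1:10], each = 5)[1:N])

# Simple random sample with replacement, drawn locally
s <- myData[sample(nrow(myData), 15, replace = TRUE), , drop = FALSE]

# Values of b that exist in the data but never made it into the sample
missingValues <- setdiff(unique(myData$b), unique(s$b))
missingValues
```

If missingValues is non-empty and b were the target, a model trained on s would never see those values.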
Stratified sampling
With stratified sampling, the data are partitioned into groups, for example, using a specified column, such as the target. Then, within each group, simple random sampling can be used to select rows. In the following example, we construct a data.frame of 200 rows with a target column that takes seven distinct values. Because the target is drawn from a normal distribution with mean 4 and rounded to the nearest integer, the table output shows counts concentrated around 4 and thinning out toward 1 and 7. After pushing the data to create an ordered ore.frame, we use standard R syntax with split, lapply, and rbind to split the data, perform the sampling within each group, then combine the results into a single ore.frame. Note that after sampling, all seven values are represented with a similar distribution.
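The per-group sample sizes in this example can be worked out by hand. The sketch below takes the group counts reported by table(myData$target) in the transcript and applies the same allocation rule as the example code: a share of sampleSize proportional to the group size, with a minimum of one row. It assumes fractional sizes are truncated, which is consistent with the counts the article reports; groupCounts and groupSizes are illustrative names.

```r
# Group counts for target values 1..7, as reported in the article's table output
groupCounts <- c("1" = 2, "2" = 12, "3" = 36, "4" = 82, "5" = 53, "6" = 14, "7" = 1)
N <- sum(groupCounts)        # 200
sampleSize <- 50

# Proportional allocation, truncated, with at least one row per group
groupSizes <- pmax(floor(sampleSize * groupCounts / N), 1)
groupSizes
#  1  2  3  4  5  6  7
#  1  3  9 20 13  3  1
sum(groupSizes)              # 50
```

The minimum of one row is what keeps the rare values 1 and 7 in the sample.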
> set.seed(1)
> N <- 200
> myData <- data.frame(a=1:N,b=round(rnorm(N),2),
+                      target=round(rnorm(N,4),0))
> table(myData$target)

 1  2  3  4  5  6  7
 2 12 36 82 53 14  1
> MYDATA <- ore.push(myData)
> head(MYDATA)
  a     b target
1 1 -0.63      4
2 2  0.18      6
3 3 -0.84      6
4 4  1.60      4
5 5  0.33      2
6 6 -0.82      6
> sampleSize <- 50
> stratifiedSample <-
+   do.call(rbind,
+           lapply(split(MYDATA, MYDATA$target),
+                  function(y) {
+                    ny <- nrow(y)
+                    y[sample(ny, max(sampleSize*ny/N,1)), ,
+                      drop = FALSE]
+                  }))
> table(stratifiedSample$target)

 1  2  3  4  5  6  7
 1  3  9 20 13  3  1
> class(stratifiedSample)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> head(stratifiedSample)
        a     b target
1     161  0.43      1
2.189 189 -0.43      2
2.145 145 -1.12      2
2.5     5  0.33      2
3.102 102  0.04      3
3.119 119  0.49      3

There are variations on the stratified sampling theme as well, but this should give you an idea of how to use the transparency layer of Oracle R Enterprise for a few common sampling scenarios. For more sampling options, see this tutorial.
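As one illustration of such a variation, the sketch below draws up to a fixed number of rows from each group instead of a proportional share, which can help when rare target values need more weight. It is written in base R against a local data.frame and is an assumption about what a useful variation might look like, not code from the original post; perStratum and equalSample are illustrative names.

```r
set.seed(1)
N <- 200
myData <- data.frame(a = 1:N, b = round(rnorm(N), 2),
                     target = round(rnorm(N, 4), 0))

perStratum <- 3   # rows to draw from each group

equalSample <- do.call(rbind,
  lapply(split(myData, myData$target), function(y) {
    ny <- nrow(y)
    # Take perStratum rows, or every row if the group is smaller than that
    y[sample(ny, min(perStratum, ny)), , drop = FALSE]
  }))

table(equalSample$target)
```

The same split, sample, and rbind pattern applies; only the size passed to sample changes.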