Select Distinct Values with Pig
This article was originally published at https://statcompute.wordpress.com
First, I used a SQL statement with the sqldf package in R. It took ~51 seconds of user time to select 12 distinct rows out of 7 million.
library(sqldf)
a <- read.csv.sql('2008.csv2', sql = "select distinct V1, V2 from file", header = FALSE)
print(a)
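For readers who want to reproduce the timing, one option is to wrap the call in system.time(), which reports user, system, and elapsed seconds. The sketch below is only an illustration under stated assumptions: it runs on a tiny made-up file instead of the 7-million-row 2008.csv2, and the names `tmp`, `timing`, and `res` are hypothetical.

```r
library(sqldf)

# Hypothetical toy data standing in for 2008.csv2 (no header row).
tmp <- tempfile(fileext = ".csv")
writeLines(c("2008,1,3", "2008,1,4", "2008,2,5", "2008,2,6"), tmp)

# system.time() returns user, system, and elapsed time in seconds.
# With header = FALSE the columns are named V1, V2, V3, as in the query above.
timing <- system.time(
  res <- read.csv.sql(tmp, sql = "select distinct V1, V2 from file", header = FALSE)
)

print(res)                  # the distinct (V1, V2) pairs
print(timing["user.self"])  # the "user time" figure quoted in the text
```

On the full file, the same pattern should give a number comparable to the ~51 seconds reported above, though the exact figure depends on the machine.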
Next, I ran Apache Pig in local mode, which took ~36 seconds to return the same 12 rows.
a = LOAD '2008.csv2' USING PigStorage(',');
b = DISTINCT(FOREACH a GENERATE $0, $1);
dump b;
Although the purpose of this exercise was to learn Pig Latin through equivalent SQL statements, I am still very impressed by the efficiency of Apache Pig.