Removing Records by Duplicate Values
This article is originally published at https://statcompute.wordpress.com
Removing records from a data table based on duplicate values in one or more columns is a commonly used but important data cleaning technique. Below shows an example about how to accomplish this task by SAS, R, and Python respectively.
SAS Example
data _data_; input label $ value; datalines; A 4 B 3 C 6 B 3 B 1 A 2 A 4 A 4 ; run; proc sort data = _last_; by label value; run; data _data_; set _last_; by label; if first.label then output; run; proc print data = _last_ noobs; run; /* OUTPUT: label value A 2 B 1 C 6 */
R Example
> # INPUT DATA INT THE CONSOLE > df <- read.table(header = T, text = ' + label value + A 4 + B 3 + C 6 + B 3 + B 1 + A 2 + A 4 + A 4 + ') > # SORT DATA FRAME BY COLUMNS > df2 <- df[order(df$label, df$value), ] > print(df2) label value 6 A 2 1 A 4 7 A 4 8 A 4 5 B 1 2 B 3 4 B 3 3 C 6 > # DEDUP RECORDS > df3 <- df2[!duplicated(df2$label), ] > print(df3) label value 6 A 2 5 B 1 3 C 6
Python Example
In [1]: import pandas as pd In [2]: # INPUT DATA INTO DATAFRAME In [3]: df = pd.DataFrame({'label': ['A', 'B', 'C'] + ['B'] * 2 + ['A'] * 3, 'value': [4, 3, 6, 3, 1, 2, 4, 4]}) In [4]: # SORT DATA BY COLUMNS In [5]: df2 = df.sort(['label', 'value']) In [6]: print(df2) label value 5 A 2 0 A 4 6 A 4 7 A 4 4 B 1 1 B 3 3 B 3 2 C 6 In [7]: # DEDUP RECORDS In [8]: df3 = df2.drop_duplicates(['label']) In [9]: print(df3) label value 5 A 2 4 B 1 2 C 6
Thanks for visiting r-craft.org
This article is originally published at https://statcompute.wordpress.com
Please visit source website for post related comments.