Machine learning / R / R News / Rstats

xray: The R Package to Have X Ray Vision on your Datasets

by Pablo Seibelt · November 22, 2017

This article is originally published at https://blog.datascienceheroes.com/

xray: The R Package to Have X Ray Vision on your Datasets

This package lets you analyze the variables of a dataset, to evaluate how the data is shaped. Consider this the first step when you have your data for modeling, you can use this package to analyze all variables and check if there is anything weird worth transforming or even avoiding the variable altogether.

Installation

You can install xray from github with:

# install.packages("devtools")
devtools::install_github("sicarul/xray")

Usage

Anomaly detection

xray::anomalies analyzes all your columns for anomalies, whether they are NAs, Zeroes, Infinite, etc, and warns you if it detects variables with at least 80% of rows with those anomalies. It also warns you when all rows have the same value.

Example:

data(longley)
badLongley=longley
badLongley$GNP=NA
xray::anomalies(badLongley)
#> Warning in xray::anomalies(badLongley): Found 1 possible problematic variables: 
#> GNP
#> $variables
#>       Variable  q qNA  pNA qZero pZero qBlank pBlank qInf pInf qDistinct
#> 1          GNP 16  16 100%     0     -      0      -    0    -         1
#> 2 GNP.deflator 16   0    -     0     -      0      -    0    -        16
#> 3   Unemployed 16   0    -     0     -      0      -    0    -        16
#> 4 Armed.Forces 16   0    -     0     -      0      -    0    -        16
#> 5   Population 16   0    -     0     -      0      -    0    -        16
#> 6         Year 16   0    -     0     -      0      -    0    -        16
#> 7     Employed 16   0    -     0     -      0      -    0    -        16
#>      type anomalous_percent
#> 1 Logical              100%
#> 2 Numeric                 -
#> 3 Numeric                 -
#> 4 Numeric                 -
#> 5 Numeric                 -
#> 6 Integer                 -
#> 7 Numeric                 -
#> 
#> $problem_variables
#>   Variable  q qNA  pNA qZero pZero qBlank pBlank qInf pInf qDistinct
#> 1      GNP 16  16 100%     0     -      0      -    0    -         1
#>      type anomalous_percent
#> 1 Logical              100%
#>                                                              problems
#> 1 Anomalies present in 100% of the rows. Less than 2 distinct values.

Distributions

xray::distributions tries to analyze the distribution of your variables, so you can understand how each variable is statistically structured. It also returns a percentiles table of numeric variables as a result, which can inform you of the shape of the data.

distrLongley=longley
distrLongley$testCategorical=c(rep('One',7), rep('Two', 9))
xray::distributions(distrLongley)

#>       Variable      p_1     p_10     p_25     p_50     p_75    p_90
#> 1 GNP.deflator    83.78    88.35   94.525    100.6   111.25  114.95
#> 2          GNP 237.8537   258.74  317.881  381.427 454.0855 510.387
#> 3   Unemployed   187.93   201.55  234.825   314.35   384.25   434.4
#> 4 Armed.Forces   147.61    160.3    229.8   271.75  306.075  344.85
#> 5   Population 107.7616 109.2025 111.7885 116.8035  122.304  126.61
#> 6         Year  1947.15   1948.5  1950.75   1954.5  1958.25  1960.5
#> 7     Employed  60.1938  60.7225  62.7125   65.504  68.2905 69.4475
#>       p_99
#> 1   116.72
#> 2 549.3859
#> 3  478.725
#> 4  358.695
#> 5 129.7466
#> 6  1961.85
#> 7   70.403

Distributions along a time axis

xray::timebased also investigates into your distributions, but shows you the change over time, so if there is any change in the distribution over time (For example a variable stops or starts being collected) you can easily visualize it.

dateLongley=longley
dateLongley$Year=as.Date(paste0(dateLongley$Year,'-01-01'))
dateLongley$Data='Original'
ndateLongley=dateLongley
ndateLongley$GNP=dateLongley$GNP+10
ndateLongley$Data='Offseted'
xray::timebased(rbind(dateLongley, ndateLongley), 'Year')

#> [1] "7 charts have been generated."

If you are just starting to learn about Data Science or want to explore further, i encourage you to check this cool book made by my buddy Pablo Casas: The Data Science Live Book

Thanks for visiting r-craft.org
This article is originally published at https://blog.datascienceheroes.com/
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

xray: The R Package to Have X Ray Vision on your Datasets

You may also like...

Categories

xray: The R Package to Have X Ray Vision on your Datasets

Installation

Usage

Anomaly detection

Distributions

Distributions along a time axis

You may also like...

Getting an environment’s name in R: the envnames package

numpy exponential

I Just Wanted The Data : Turning Tableau & Tidyverse Tears Into Smiles with Base R (An Encoding Detective Story)

Categories