xray: The R Package to Have X Ray Vision on your Datasets
This article is originally published at https://blog.datascienceheroes.com/
This package lets you analyze the variables of a dataset to evaluate how the data is shaped. Consider this the first step once you have your data for modeling: use it to analyze all variables and check whether anything looks odd enough to warrant transforming a variable or dropping it altogether.
Installation
You can install xray from GitHub with:
# install.packages("devtools")
devtools::install_github("sicarul/xray")
Usage
Anomaly detection
xray::anomalies analyzes all your columns for anomalies, whether they are NAs, zeros, infinite values, etc., and warns you if it detects variables where at least 80% of the rows have those anomalies. It also warns you when all rows have the same value.
Example:
data(longley)
badLongley=longley
badLongley$GNP=NA
xray::anomalies(badLongley)
#> Warning in xray::anomalies(badLongley): Found 1 possible problematic variables:
#> GNP
#> $variables
#> Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
#> 1 GNP 16 16 100% 0 - 0 - 0 - 1
#> 2 GNP.deflator 16 0 - 0 - 0 - 0 - 16
#> 3 Unemployed 16 0 - 0 - 0 - 0 - 16
#> 4 Armed.Forces 16 0 - 0 - 0 - 0 - 16
#> 5 Population 16 0 - 0 - 0 - 0 - 16
#> 6 Year 16 0 - 0 - 0 - 0 - 16
#> 7 Employed 16 0 - 0 - 0 - 0 - 16
#> type anomalous_percent
#> 1 Logical 100%
#> 2 Numeric -
#> 3 Numeric -
#> 4 Numeric -
#> 5 Numeric -
#> 6 Integer -
#> 7 Numeric -
#>
#> $problem_variables
#> Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
#> 1 GNP 16 16 100% 0 - 0 - 0 - 1
#> type anomalous_percent
#> 1 Logical 100%
#> problems
#> 1 Anomalies present in 100% of the rows. Less than 2 distinct values.
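To make the 80% rule and the "same value" warning concrete, here is a minimal base R sketch of that kind of check. It is not xray's implementation: the helper name anomaly_share is hypothetical, and counting a row once even when it matches several anomaly types is an assumption.

```r
# Hypothetical helper (not part of xray): share of rows in each column
# that are NA, zero, infinite, or blank. A row is counted once even if it
# matches several conditions (an assumption made for this sketch).
anomaly_share <- function(df) {
  vapply(df, function(col) {
    bad <- is.na(col)
    if (is.numeric(col)) {
      bad <- bad | col == 0 | is.infinite(col)
    }
    if (is.character(col)) {
      bad <- bad | trimws(col) == ""
    }
    mean(bad)
  }, numeric(1))
}

data(longley)
badLongley <- longley
badLongley$GNP <- NA

shares <- anomaly_share(badLongley)

# Columns where at least 80% of rows are anomalous
flagged <- names(shares)[shares >= 0.8]

# Columns with fewer than 2 distinct values
constant <- names(badLongley)[
  vapply(badLongley, function(col) length(unique(col)) < 2, logical(1))
]

print(flagged)
print(constant)
```

With the badLongley data from the example, both vectors contain only GNP, in line with the problem_variables table above.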
Distributions
xray::distributions tries to analyze the distribution of your variables, so you can understand how each variable is statistically structured. It also returns a percentiles table for the numeric variables, which can inform you of the shape of the data.
distrLongley=longley
distrLongley$testCategorical=c(rep('One',7), rep('Two', 9))
xray::distributions(distrLongley)
#> Variable p_1 p_10 p_25 p_50 p_75 p_90
#> 1 GNP.deflator 83.78 88.35 94.525 100.6 111.25 114.95
#> 2 GNP 237.8537 258.74 317.881 381.427 454.0855 510.387
#> 3 Unemployed 187.93 201.55 234.825 314.35 384.25 434.4
#> 4 Armed.Forces 147.61 160.3 229.8 271.75 306.075 344.85
#> 5 Population 107.7616 109.2025 111.7885 116.8035 122.304 126.61
#> 6 Year 1947.15 1948.5 1950.75 1954.5 1958.25 1960.5
#> 7 Employed 60.1938 60.7225 62.7125 65.504 68.2905 69.4475
#> p_99
#> 1 116.72
#> 2 549.3859
#> 3 478.725
#> 4 358.695
#> 5 129.7466
#> 6 1961.85
#> 7 70.403
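A comparable percentiles table can be sketched in base R with quantile(). This is only an approximation of the table above, not the package's own code; the choice of R's default quantile method is an assumption, although the Year row (for example 1947.15 at the 1st percentile) is consistent with it.

```r
data(longley)

# Percentiles shown in the table above
pct <- c(1, 10, 25, 50, 75, 90, 99)

# Keep only the numeric columns
numeric_cols <- longley[vapply(longley, is.numeric, logical(1))]

# One row per variable, one column per percentile
percentiles <- t(vapply(
  numeric_cols,
  function(col) quantile(col, probs = pct / 100, names = FALSE),
  numeric(length(pct))
))
colnames(percentiles) <- paste0("p_", pct)

print(percentiles)
```

Comparing p_50 with p_25 and p_75 gives a quick read on skew: a median sitting much closer to one quartile than the other suggests an asymmetric distribution.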
Distributions along a time axis
xray::timebased also investigates your distributions, but shows how they change over time, so if the distribution shifts at some point (for example, a variable stops or starts being collected) you can easily visualize it.
dateLongley=longley
dateLongley$Year=as.Date(paste0(dateLongley$Year,'-01-01'))
dateLongley$Data='Original'
ndateLongley=dateLongley
ndateLongley$GNP=dateLongley$GNP+10
ndateLongley$Data='Offseted'
xray::timebased(rbind(dateLongley, ndateLongley), 'Year')
#> [1] "7 charts have been generated."
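The "variable stops being collected" case mentioned above can also be checked numerically. The following base R sketch is independent of xray and rests on an invented scenario: GNP is set to NA after 1955 purely for illustration, and the share of missing values is then computed per year.

```r
data(longley)
dated <- longley
dated$Year <- as.Date(paste0(dated$Year, "-01-01"))

# Assumption for illustration only: pretend GNP stopped being
# collected after 1955.
dated$GNP[dated$Year > as.Date("1955-01-01")] <- NA

# Share of missing GNP values in each year
period <- format(dated$Year, "%Y")
na_share_by_year <- tapply(is.na(dated$GNP), period, mean)

# First year in which every GNP value is missing
first_missing <- names(na_share_by_year)[which(na_share_by_year == 1)[1]]

print(na_share_by_year)
print(first_missing)
```

A jump in the missing share from 0 to 1 between consecutive periods is the numeric counterpart of the break you would see in a chart over time.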
If you are just starting to learn about data science or want to explore further, I encourage you to check out this cool book by my buddy Pablo Casas: The Data Science Live Book