# Data discretization made easy with funModeling

This article was originally published at https://blog.datascienceheroes.com/

**tl;dr**: Convert numerical variables into categorical ones.

⏳ *Reading time ~ 6 min.*

## Let's start!

The package `funModeling` (from version > 1.6.6) introduces two functions, `discretize_get_bins` and `discretize_df`, that work together to help us with the discretization task.

*If you were using version 1.6.6, please see the update note below (Jan-19-2018).*

```
# First we load the libraries
# install.packages("funModeling")
library(funModeling)
library(dplyr)
library(ggplot2) # needed for the plots further down
```

Let's see an example. First, we check current data types:

```
df_status(heart_disease, print_results = FALSE) %>%
  select(variable, type, unique, q_na) %>%
  arrange(type)
##                  variable    type unique q_na
## 1                  gender  factor      2    0
## 2              chest_pain  factor      4    0
## 3     fasting_blood_sugar  factor      2    0
## 4         resting_electro  factor      3    0
## 5                    thal  factor      3    2
## 6            exter_angina  factor      2    0
## 7       has_heart_disease  factor      2    0
## 8                     age integer     41    0
## 9  resting_blood_pressure integer     50    0
## 10      serum_cholestoral integer    152    0
## 11         max_heart_rate integer     91    0
## 12            exer_angina integer      2    0
## 13                  slope integer      3    0
## 14      num_vessels_flour integer      4    4
## 15 heart_disease_severity integer      5    0
## 16                oldpeak numeric     40    0
```

We've got factor, integer, and numeric variables: a good mix! The transformation has two steps. The first step gets the cuts, or threshold values, that delimit each segment. The second step uses those thresholds to convert the variables into categorical ones.

Two variables will be discretized in the following example: `max_heart_rate` and `oldpeak`. Also, we'll introduce some `NA` values into `oldpeak` to test how the function works with missing data.

```
# Introducing some missing values in the first 30 rows of the oldpeak variable
heart_disease$oldpeak[1:30]=NA
```

Step 1) Getting the bin thresholds for each input variable:

`discretize_get_bins` returns a data frame that needs to be used in the `discretize_df` function, which returns the final processed data frame.

```
d_bins = discretize_get_bins(data = heart_disease, input = c("max_heart_rate", "oldpeak"), n_bins = 5)
## [1] "Variables processed: max_heart_rate, oldpeak"
# Checking `d_bins` object:
d_bins
##         variable                cuts
## 1 max_heart_rate 131|147|160|171|Inf
## 2        oldpeak   0.1|0.3|1.1|2|Inf
```

Parameters:

- `data`: the data frame containing the variables to be processed.
- `input`: vector of strings containing the variable names.
- `n_bins`: the number of bins/segments to have in the discretized data.

We can see each threshold point (or upper boundary) for each variable.
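To make the `cuts` format concrete, here is a minimal base R sketch that splits one of the strings above into numeric upper boundaries. It is only an illustration of the format, not funModeling's internal code, and the names `cuts_string` and `upper_bounds` are made up for this example.

```r
# One `cuts` value copied from the d_bins output above
cuts_string <- "131|147|160|171|Inf"

# Split on the literal pipe character and convert to numbers;
# as.numeric() understands the string "Inf"
upper_bounds <- as.numeric(strsplit(cuts_string, "|", fixed = TRUE)[[1]])
upper_bounds

# Five upper boundaries describe five bins
length(upper_bounds)
```

Each number is the upper boundary of one bin, so five values describe the five bins requested with `n_bins=5`.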

**Update Jan-19-2018:** Some points that differ between version 1.6.6 and 1.6.7:

- `discretize_get_bins` doesn't create the `-Inf` threshold since that value was always considered to be the minimum.
- A one-value category is now represented as a range; for example, what used to be `"5"` is now `"[5, 6)"`.
- Bucket formatting may have changed; if you were using this function in production, you will need to check the new values.

Time to continue with the next step!

Step 2) Applying the thresholds for each variable:

```
# Now it can be applied on the same data frame or in a new one
# (for example, in a predictive model that changes data over time)
heart_disease_discretized = discretize_df(data = heart_disease,
                                          data_bins = d_bins,
                                          stringsAsFactors = TRUE)
## [1] "Variables processed: max_heart_rate, oldpeak"
```

Parameters:

- `data`: data frame containing the numerical variables to be discretized.
- `data_bins`: data frame returned by `discretize_get_bins`. If it is changed by the user, then each upper boundary must be separated by a pipe character (`|`) as shown in the example.
- `stringsAsFactors`: `TRUE` by default; the final variables will be factors (instead of character), which is useful when plotting.
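As a rough sketch of what changing `data_bins` by hand could look like, the snippet below builds a stand-in for `d_bins` and edits one row. The stand-in assumes two character columns named `variable` and `cuts`, as in the printed output; the name `d_bins_manual` and the new thresholds are hypothetical.

```r
# Hand-built stand-in for the data frame printed by `d_bins` above
# (assumption: two character columns named `variable` and `cuts`)
d_bins_manual <- data.frame(
  variable = c("max_heart_rate", "oldpeak"),
  cuts = c("131|147|160|171|Inf", "0.1|0.3|1.1|2|Inf"),
  stringsAsFactors = FALSE
)

# Replace the oldpeak thresholds with three custom bins;
# upper boundaries stay pipe-separated and end with Inf
d_bins_manual$cuts[d_bins_manual$variable == "oldpeak"] <- "1|2|Inf"
d_bins_manual
```

The edited data frame would then be passed as `data_bins` to `discretize_df`, exactly like the original `d_bins`.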

#### Final results and their plots

Before and after

Final distribution:

```
describe(heart_disease_discretized %>% select(max_heart_rate,oldpeak))
## heart_disease_discretized %>% select(max_heart_rate, oldpeak)
##
## 2 Variables 303 Observations
## ---------------------------------------------------------------------------
## max_heart_rate
## n missing distinct
## 303 0 5
##
## Value [-Inf, 131) [ 131, 147) [ 147, 160) [ 160, 171) [ 171, Inf]
## Frequency 63 59 62 62 57
## Proportion 0.208 0.195 0.205 0.205 0.188
## ---------------------------------------------------------------------------
## oldpeak
## n missing distinct
## 303 0 6
##
## Value [-Inf, 0.1) [ 0.1, 0.3) [ 0.3, 1.1) [ 1.1, 2.0) [ 2.0, Inf]
## Frequency 97 18 54 54 50
## Proportion 0.320 0.059 0.178 0.178 0.165
##
## Value NA.
## Frequency 30
## Proportion 0.099
## ---------------------------------------------------------------------------
# Bar plots of the two discretized variables, shown side by side
p5 = ggplot(heart_disease_discretized, aes(max_heart_rate)) +
  geom_bar(fill = "#0072B2") + theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
p6 = ggplot(heart_disease_discretized, aes(oldpeak)) +
  geom_bar(fill = "#CC79A7") + theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
gridExtra::grid.arrange(p5, p6, ncol = 2)
```

Showing final variable distribution:

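Sometimes it is not possible to get the same number of cases per bucket when computing **equal frequency** bins, as shown by the `oldpeak` variable. This typically happens when many rows share the same value, since those ties cannot be split across buckets.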
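The effect of ties can be reproduced with a small base R sketch. It uses toy data and plain `quantile()` plus `cut()` rather than funModeling, and it relies on base R's default right-closed intervals, unlike the left-closed ones printed above, so treat it as an illustration of the idea only.

```r
# Toy vector with many ties at 0 (hypothetical data, not heart_disease)
x <- c(rep(0, 10), 1:10)

# Quantile-based thresholds for 5 bins; unique() drops duplicated cut points
inner_cuts <- unique(quantile(x, probs = seq(0.2, 0.8, by = 0.2), names = FALSE))
breaks <- c(-Inf, inner_cuts, Inf)

# Right-closed intervals (the base R default)
bins <- cut(x, breaks = breaks)
table(bins)
```

Five bins were requested, but the ties at 0 collapse two thresholds into one, so only four bins remain, holding 10, 2, 4, and 4 cases.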

#### NA handling

Regarding the `NA` values, the new `oldpeak` variable has six categories: the five defined by `n_bins=5` plus the `NA.` value. Note the trailing dot, which indicates the presence of missing values.
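The idea of turning missing values into an explicit `NA.` category can be sketched in base R as follows. This is an assumption about how one might do it by hand, not funModeling's actual implementation; the data and thresholds are invented for the example.

```r
# Toy numeric vector with missing values (hypothetical data)
x <- c(0.0, 0.2, 1.5, NA, 2.3, NA)

# Discretize with fixed thresholds; NA inputs stay NA at this point
bins <- cut(x, breaks = c(-Inf, 0.1, 1.1, Inf), right = FALSE)

# Convert to character, relabel missing values as "NA.", and rebuild the factor
bins_chr <- as.character(bins)
bins_chr[is.na(bins_chr)] <- "NA."
bins_final <- factor(bins_chr, levels = c(levels(bins), "NA."))

table(bins_final)
```

After this step the factor has no real `NA` left, so the missing cases show up as their own bar when plotting or tabulating.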

#### More info

- `discretize_df` will never return an `NA` value without transforming it to the string `NA.`.
- `n_bins` sets the number of bins for all the variables.
- If `input` is missing, then it will run for all numeric/integer variables whose number of unique values is greater than the number of bins (`n_bins`).
- Only the variables defined in `input` will be processed, while the remaining variables will **not be modified at all**.
- `discretize_get_bins` returns just a data frame that can be changed by hand as needed, either in a text file or in the R session.

#### Discretization with new data

In our data, the minimum value for `max_heart_rate` is 71. The data preparation must be robust with new data; e.g., if a new patient arrives whose `max_heart_rate` is 68, then the current process will assign her/him to the lowest category.

In other functions from other packages, this preparation may return an `NA` because the value falls outside every segment.

As we pointed out before, as new data arrives over time, new minimum or maximum values are likely to appear. This can break our process. To solve this, `discretize_df` will always use `-Inf`/`Inf` as the min/max values; thus, any new value falling below the minimum or above the maximum will be added to the lowest or highest segment, as applicable.
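The difference between open-ended and closed bins can be sketched with base R's `cut()`. The thresholds are the `max_heart_rate` cuts shown earlier; the variable names are illustrative, and this is not funModeling's own code.

```r
# Thresholds taken from the max_heart_rate cuts shown earlier
upper_bounds <- c(131, 147, 160, 171, Inf)

# A hypothetical new patient below the observed minimum of 71
new_value <- 68

# Open-ended bins: -Inf is prepended, so the value lands in the lowest segment
open_bins <- cut(new_value, breaks = c(-Inf, upper_bounds), right = FALSE)
as.character(open_bins)

# Bins closed at the observed minimum: the same value falls outside every segment
closed_bins <- cut(new_value, breaks = c(71, upper_bounds), right = FALSE)
is.na(closed_bins)
```

With the open-ended breaks the new value is labeled `[-Inf,131)`, while the closed version yields `NA`.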

The data frame returned by `discretize_get_bins` must be saved in order to apply it to new data. If the discretization is not intended to run with new data, then there is no point in having two functions: one would be enough. In addition, there would be no need to save the results of `discretize_get_bins`.

Having this two-step approach, we can handle both cases.
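One possible way to save the thresholds and load them later is base R's `saveRDS()`/`readRDS()`. The sketch below uses a hand-built stand-in for `d_bins` (assuming the `variable` and `cuts` columns printed earlier) and a temporary file path, both of which are placeholders for illustration.

```r
# Stand-in for the object returned by discretize_get_bins
# (assumption: columns `variable` and `cuts`, as printed earlier)
d_bins_saved <- data.frame(
  variable = c("max_heart_rate", "oldpeak"),
  cuts = c("131|147|160|171|Inf", "0.1|0.3|1.1|2|Inf"),
  stringsAsFactors = FALSE
)

# Persist the thresholds; a temporary file is used here for illustration
bins_path <- tempfile(fileext = ".rds")
saveRDS(d_bins_saved, bins_path)

# Later, for example in a scoring script, load them back unchanged
d_bins_loaded <- readRDS(bins_path)
identical(d_bins_saved, d_bins_loaded)
```

The loaded object would then be passed as `data_bins` to `discretize_df` together with the new data.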

#### Conclusions about two-step discretization

Using `discretize_get_bins` + `discretize_df` provides quick data preparation, with a clean data frame that is ready to use. It clearly shows where each segment begins and ends, which is indispensable when making statistical reports.

The decision *not to fail* when dealing with a new min/max in new data is **just a decision**. In some contexts, failure would be the desired behavior.

**The human intervention**: The easiest way to discretize a data frame is to select the same number of bins to apply to every variable, just like in the example we saw. However, if tuning is needed, then some variables may need a **different number of bins**. For example, a variable with less dispersion can work well with a low number of bins.

Common values for the number of segments could be 3, 5, 10, or 20 (but no more). It is up to the data scientist to make this decision.

#### Bonus track: The trade-off art ⚖️

- A high number of bins => More noise captured.
- A low number of bins => Oversimplification, less variance.

Do these terms sound similar to any other ones in machine learning?

The answer: **Yes!** Just to mention one example: the trade-off between adding or subtracting variables from a predictive model.

- More variables: Overfitting alert (too detailed predictive model).
- Fewer variables: Underfitting danger (not enough information to capture general patterns).

*Just like oriental philosophy has pointed out for thousands of years, there is an art in finding the right balance between one value and its opposite.*

This article was adapted from the **Data Science Live Book** - *Handling Data Types* chapter: https://livebook.datascienceheroes.com/data-preparation.html#data_types. Please go there for deeper coverage.

Keep in touch: @pabloc_ds.

Thanks for reading.
