Deal with outliers by setting them to NA or by 'stopping' them at a certain value. Three methods are supported for flagging values as outliers: "bottom_top", "tukey" and "hampel". The parameters 'top_percent' and/or 'bottom_percent' are used only when method="bottom_top".

For a full reference, see the official documentation at: https://livebook.datascienceheroes.com/data-preparation.html#treatment_outliers

Setting NA (type='set_na') is recommended when doing statistical analysis. Stopping (type='stop') is recommended when creating a predictive model, so that outliers do not bias the result.

The function can take a data frame, in which case it returns the same data with the transformations specified in the input parameters applied. It can also take a single vector (through the same 'data' parameter), in which case it returns a vector.

prep_outliers(
  data,
  input = NA,
  type = NA,
  method = NA,
  bottom_percent = NA,
  top_percent = NA,
  k_mad_value = NA
)

Arguments

data

a data frame or a single vector. If it's a data frame, the function returns a data frame, otherwise it returns a vector.

input

string, input variable(s) to process (if empty, it runs for all numeric variables).

type

can be 'stop' or 'set_na'. With 'stop', all values falling outside the threshold are converted to the threshold; with 'set_na', those values are set to NA.
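
The difference between the two types can be illustrated with a minimal base R sketch. The helper `treat_sketch()` and its `bottom`/`top` arguments are hypothetical names used only for illustration; they are not part of funModeling, and the package's internal implementation may differ.

```r
# Hypothetical helper (not a funModeling function): applies a pair of
# already-computed thresholds to a numeric vector.
treat_sketch <- function(x, bottom, top, type = c("stop", "set_na")) {
  type <- match.arg(type)
  if (type == "set_na") {
    # Values outside [bottom, top] become NA; existing NAs are left alone
    out_of_range <- !is.na(x) & (x < bottom | x > top)
    x[out_of_range] <- NA
  } else {
    # Values outside [bottom, top] are moved to the nearest threshold
    x <- pmin(pmax(x, bottom), top)
  }
  x
}

x <- c(1, 2, 3, 4, 100)
treat_sketch(x, bottom = 1, top = 10, type = "stop")    # 1 2 3 4 10
treat_sketch(x, bottom = 1, top = 10, type = "set_na")  # 1 2 3 4 NA
```

With 'stop' the vector keeps its length and has no new missing values, which is convenient for modeling; with 'set_na' the extreme values drop out of summaries computed with na.rm = TRUE.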

method

indicates the method used to flag the outliers, it can be: "bottom_top", "tukey" or "hampel".
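
As an illustration of how a "tukey" style rule derives thresholds, here is a minimal base R sketch. It assumes thresholds of Q1 - 3*IQR and Q3 + 3*IQR; the multiplier of 3 is an assumption inferred from the comment in the Examples section (a maximum threshold of 100 for age), not a statement of the package's internals. The helper name and the `k_iqr` argument are hypothetical.

```r
# Hypothetical helper (not a funModeling function).
# Assumption: thresholds are Q1 - k_iqr * IQR and Q3 + k_iqr * IQR.
tukey_thresholds_sketch <- function(x, k_iqr = 3) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # first and third quartiles
  iqr <- q[2] - q[1]
  c(bottom = q[1] - k_iqr * iqr, top = q[2] + k_iqr * iqr)
}

x <- c(1:9, 100)
tukey_thresholds_sketch(x)  # bottom = -10.25, top = 21.25

# Values outside the thresholds would be flagged as outliers
th <- tukey_thresholds_sketch(x)
x[x < th["bottom"] | x > th["top"]]  # 100
```

To see the thresholds the package itself uses, the Examples section calls tukey_outlier() and hampel_outlier() directly on a variable.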

bottom_percent

value from 0 to 1, represents the lowest X percentage of values to treat. Valid only when method="bottom_top".

top_percent

value from 0 to 1, represents the highest X percentage of values to treat. Valid only when method="bottom_top".
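
The percentages above can be read as quantile cut-offs. The following base R sketch assumes that bottom_percent maps to the `bottom_percent` quantile and top_percent to the `1 - top_percent` quantile, which matches the `quantile(df$var1, 0.99)` check in the Examples section. The helper name is hypothetical and the package may compute the cut-offs differently.

```r
# Hypothetical helper (not a funModeling function).
# Assumption: NA means "do not treat this side", as in the usage defaults.
bottom_top_thresholds_sketch <- function(x, bottom_percent = NA, top_percent = NA) {
  x <- x[!is.na(x)]
  bottom <- if (is.na(bottom_percent)) -Inf else quantile(x, bottom_percent, names = FALSE)
  top <- if (is.na(top_percent)) Inf else quantile(x, 1 - top_percent, names = FALSE)
  c(bottom = bottom, top = top)
}

x <- 0:100
bottom_top_thresholds_sketch(x, bottom_percent = 0.01, top_percent = 0.01)
# bottom = 1, top = 99

# Treating only the top side leaves the bottom threshold unbounded
bottom_top_thresholds_sketch(x, top_percent = 0.05)
# bottom = -Inf, top = 95
```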

k_mad_value

only used when method='hampel'. Defaults to 3, which might seem quite restrictive. Set a higher number to flag fewer outliers.
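
The effect of k_mad_value can be sketched in base R, assuming the Hampel thresholds are median ± k_mad_value * MAD with MAD computed by stats::mad(). This assumption is consistent with the 29.31 lower bound for age mentioned in the Examples section, but it is not taken from the package source. The helper names are hypothetical.

```r
# Hypothetical helpers (not funModeling functions).
# Assumption: thresholds are median(x) -/+ k_mad_value * mad(x).
hampel_thresholds_sketch <- function(x, k_mad_value = 3) {
  x <- x[!is.na(x)]
  center <- median(x)
  spread <- mad(x)  # stats::mad() applies the 1.4826 constant by default
  c(bottom = center - k_mad_value * spread, top = center + k_mad_value * spread)
}

count_flagged <- function(x, k_mad_value) {
  th <- hampel_thresholds_sketch(x, k_mad_value)
  sum(x < th["bottom"] | x > th["top"], na.rm = TRUE)
}

x <- c(1:9, 20, 100)
count_flagged(x, k_mad_value = 3)  # 2 (both 20 and 100 are flagged)
count_flagged(x, k_mad_value = 5)  # 1 (only 100 is flagged)
```

A larger k_mad_value widens the interval around the median, so moderately unusual values such as 20 are no longer treated.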

Value

A data frame with the desired outlier transformation, or a vector if 'data' is a single vector.

Examples

if (FALSE) {
# Creating data frame with outliers
set.seed(10)
df=data.frame(var1=rchisq(1000,df = 1), var2=rnorm(1000))
df=rbind(df, 1135, 2432) # forcing outliers
df$id=as.character(seq(1:1002))

# for var1: mean is ~ 4.56, and max 2432
summary(df)

########################################################
### PREPARING OUTLIERS FOR DESCRIPTIVE STATISTICS
########################################################

#### EXAMPLE 1: Removing top 1% for a single variable
# checking the value for the top 1% of highest values (percentile 0.99), which is ~ 7.05
quantile(df$var1, 0.99)

# Setting type='set_na' sets to NA the highest values, as specified by top_percent.
# In this case the 'data' parameter is a single vector, thus it returns a single vector as well.
var1_treated=prep_outliers(data = df$var1, type='set_na', top_percent = 0.01,method = "bottom_top")

# now the mean (~ 1) is more accurate, and note that the 1st quartile, median and 3rd
# quartile remain very similar to those of the original variable.
summary(var1_treated)

#### EXAMPLE 2: Removing top and bottom 1% for the specified input variables.
vars_to_process=c('var1', 'var2')
df_treated3=prep_outliers(data = df, input = vars_to_process, type='set_na',
  bottom_percent = 0.01, top_percent = 0.01, method = "bottom_top")
summary(df_treated3)

########################################################
### PREPARING OUTLIERS FOR PREDICTIVE MODELING
########################################################

data_prep_h=funModeling::prep_outliers(data = heart_disease,
  input = c('age','resting_blood_pressure'),
  method = "hampel", type='stop')

# Using Hampel method to flag outliers:
summary(heart_disease$age);summary(data_prep_h$age)
# the min changed from 29 to 29.31, and the max remains the same at 77
hampel_outlier(heart_disease$age) # checking the thresholds

data_prep_a=funModeling::prep_outliers(data = heart_disease,
  input = c('age','resting_blood_pressure'),
  method = "tukey", type='stop')

max(heart_disease$age);max(data_prep_a$age)
# remains the same (77) because the max threshold for age is 100
tukey_outlier(heart_disease$age)
}