prep_outliers.Rd
Deal with outliers by setting them to NA or by 'stopping' them at a certain threshold. There are three supported methods to flag values as outliers: "bottom_top", "tukey" and "hampel". The parameters 'top_percent' and/or 'bottom_percent' are used only when method="bottom_top".
For a full reference please check the official documentation at: <https://livebook.datascienceheroes.com/data-preparation.html#treatment_outliers>. Setting NA (parameter: type='set_na') is recommended when doing statistical analysis. Stopping (parameter: type='stop') is recommended when building a predictive model, so that outliers do not bias the result.
The function accepts a data frame and returns the same data with the transformations applied to the variables specified in the 'input' parameter. Alternatively, it accepts a single vector (in the same 'data' parameter) and returns a vector.
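To make the difference between the two treatment types concrete, here is a minimal base R sketch of percentile-based treatment. The helper `treat_by_percentile` is hypothetical and is not part of funModeling; it assumes thresholds are taken with `quantile()` and that values strictly beyond a threshold count as outliers, which may differ from what `prep_outliers` does internally.

```r
# Hypothetical helper (NOT a funModeling function): illustrates 'set_na' vs 'stop'
# using percentile thresholds in the spirit of method = "bottom_top".
treat_by_percentile <- function(x, type = c("set_na", "stop"),
                                bottom_percent = NA, top_percent = NA) {
  type <- match.arg(type)

  # Assumption: an unset percentage means that side is left untreated.
  lower <- if (is.na(bottom_percent)) -Inf else
    quantile(x, bottom_percent, na.rm = TRUE, names = FALSE)
  upper <- if (is.na(top_percent)) Inf else
    quantile(x, 1 - top_percent, na.rm = TRUE, names = FALSE)

  if (type == "set_na") {
    # Replace values beyond the thresholds with NA.
    x[!is.na(x) & (x < lower | x > upper)] <- NA
  } else {
    # 'Stop' (clamp) values at the thresholds.
    x <- pmin(pmax(x, lower), upper)
  }
  x
}

x <- c(1:99, 1000)  # one obvious outlier
treated_na   <- treat_by_percentile(x, type = "set_na", top_percent = 0.01)
treated_stop <- treat_by_percentile(x, type = "stop",   top_percent = 0.01)
summary(treated_na)
summary(treated_stop)
```

With 'set_na' the extreme value becomes missing, so it drops out of summary statistics; with 'stop' the vector keeps its length and has no new missing values, which is often more convenient when feeding a model.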
prep_outliers( data, input = NA, type = NA, method = NA, bottom_percent = NA, top_percent = NA, k_mad_value = NA )
data | a data frame or a single vector. If it's a data frame, the function returns a data frame, otherwise it returns a vector. |
---|---|
input | string with the input variable name(s) (if empty, it runs for all numeric variables). |
type | can be 'stop' or 'set_na'. With 'stop', all values falling outside the threshold are converted to the threshold; with 'set_na', those values are set to NA. |
method | indicates the method used to flag the outliers, it can be: "bottom_top", "tukey" or "hampel". |
bottom_percent | value from 0 to 1, represents the lowest X percentage of values to treat. Valid only when method="bottom_top". |
top_percent | value from 0 to 1, represents the highest X percentage of values to treat. Valid only when method="bottom_top". |
k_mad_value | only used when method='hampel'. 3 by default, which might be quite restrictive; set a higher number to flag fewer outliers. |
A data frame (or a vector, when 'data' is a single vector) with the desired outlier transformation applied.
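As a rough illustration of how the "hampel" and "tukey" methods can flag values, the sketch below computes thresholds with the conventional textbook definitions: median plus or minus k times the MAD for Hampel, and quartiles plus or minus a multiple of the IQR for Tukey. The helpers `hampel_bounds` and `tukey_bounds` and the `k_iqr` multiplier are hypothetical and are not the package's code; use `hampel_outlier()` and `tukey_outlier()` to check the thresholds the package actually applies.

```r
# Hypothetical helpers (NOT funModeling functions), based on the conventional
# definitions of the Hampel and Tukey rules.
hampel_bounds <- function(x, k_mad_value = 3) {
  center <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)  # stats::mad, scaled by 1.4826 by default
  c(bottom = center - k_mad_value * spread,
    top    = center + k_mad_value * spread)
}

tukey_bounds <- function(x, k_iqr = 1.5) {
  # Assumption: k_iqr = 1.5 is the classic fence; the package may use another value.
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(bottom = q[1] - k_iqr * iqr,
    top    = q[2] + k_iqr * iqr)
}

x <- c(2, 4, 4, 5, 5, 5, 6, 6, 8, 50)

hb <- hampel_bounds(x)
tb <- tukey_bounds(x)

# Values flagged as outliers by each rule
x[x < hb["bottom"] | x > hb["top"]]
x[x < tb["bottom"] | x > tb["top"]]

# A larger k_mad_value widens the interval, so fewer values are flagged
hampel_bounds(x, k_mad_value = 5)
```

Both rules rely on robust statistics (median, MAD, quartiles), so the thresholds themselves are not dragged toward the extreme values they are meant to detect.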
if (FALSE) {
  # Creating data frame with outliers
  set.seed(10)
  df=data.frame(var1=rchisq(1000,df = 1), var2=rnorm(1000))
  df=rbind(df, 1135, 2432) # forcing outliers
  df$id=as.character(seq(1:1002))

  # for var1: mean is ~ 4.56, and max 2432
  summary(df)

  ########################################################
  ### PREPARING OUTLIERS FOR DESCRIPTIVE STATISTICS
  ########################################################

  #### EXAMPLE 1: Removing top 1% for a single variable
  # checking the value for the top 1% of highest values (percentile 0.99), which is ~ 7.05
  quantile(df$var1, 0.99)

  # Setting type='set_na' sets to NA the highest values, as specified by top_percent.
  # In this case the 'data' parameter is a single vector, thus it returns a single vector as well.
  var1_treated=prep_outliers(data = df$var1, type='set_na',
                             top_percent = 0.01, method = "bottom_top")

  # now the mean (~ 1) is more accurate, and note that the 1st quartile, median and
  # 3rd quartile remain very similar to those of the original variable.
  summary(var1_treated)

  #### EXAMPLE 2: Removing top and bottom 1% for the specified input variables.
  vars_to_process=c('var1', 'var2')
  df_treated3=prep_outliers(data = df, input = vars_to_process, type='set_na',
                            bottom_percent = 0.01, top_percent = 0.01,
                            method = "bottom_top")
  summary(df_treated3)

  ########################################################
  ### PREPARING OUTLIERS FOR PREDICTIVE MODELING
  ########################################################

  data_prep_h=funModeling::prep_outliers(data = heart_disease,
                                         input = c('age','resting_blood_pressure'),
                                         method = "hampel", type='stop')

  # Using Hampel method to flag outliers:
  summary(heart_disease$age);summary(data_prep_h$age)
  # the min changed from 29 to 29.31, and the max remains the same at 77
  hampel_outlier(heart_disease$age) # checking the thresholds

  data_prep_a=funModeling::prep_outliers(data = heart_disease,
                                         input = c('age','resting_blood_pressure'),
                                         method = "tukey", type='stop')

  max(heart_disease$age);max(data_prep_a$age)
  # remains the same (77) because the max threshold for age is 100
  tukey_outlier(heart_disease$age)
}