A handy function that returns vectors of variable names grouped by type, so you can quickly filter variables with NA values, categorical variables (factor / character), numerical variables and other types (boolean, date, posix). It also reports which variables have high cardinality. It returns an 'integrity' object, which contains 'status_now' (the output of the status function) and a 'results' list with the following elements:

vars_cat: Vector containing the categorical variables names (factor or character)

vars_num: Vector containing the numerical variables names

vars_char: Vector containing the character variables names

vars_factor: Vector containing the factor variables names

vars_other: Vector containing the other variables names (date time, posix and boolean)

vars_num_with_NA: Summary table for numerical variables with NA

vars_cat_with_NA: Summary table for categorical variables with NA

vars_cat_high_card: Summary table for high cardinality variables (where threshold = MAX_UNIQUE parameter)

vars_one_value: Vector containing the names of variables that have only one unique value
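To make the groups above concrete, here is a minimal base R sketch of how such a classification and NA summary could be computed. It is not the funModeling source: the helpers classify_vars and na_summary and the toy data frame are hypothetical, and the choice to count NA as a value when detecting single-value variables is an assumption. The q_na and p_na column names follow the output shown in the Examples section, where p_na appears to be the proportion of NA rows.

```r
# Hypothetical helper: group column names by type, in column order.
# Illustration only; not the implementation used by data_integrity().
classify_vars <- function(data) {
  pick <- function(f) names(data)[vapply(data, f, logical(1))]
  vars_num <- pick(is.numeric)
  vars_char <- pick(is.character)
  vars_factor <- pick(is.factor)
  vars_cat <- pick(function(x) is.character(x) || is.factor(x))
  list(
    vars_cat = vars_cat,
    vars_num = vars_num,
    vars_char = vars_char,
    vars_factor = vars_factor,
    # Everything that is neither numerical nor categorical (dates, logicals, ...)
    vars_other = setdiff(names(data), c(vars_num, vars_cat)),
    # Assumption: NA counts as a value here, because unique() keeps it
    vars_one_value = pick(function(x) length(unique(x)) == 1)
  )
}

# Hypothetical helper: count and proportion of NA for the given variables,
# keeping only the variables that actually contain NA.
na_summary <- function(data, vars) {
  q_na <- vapply(data[vars], function(x) sum(is.na(x)), integer(1))
  out <- data.frame(variable = vars, q_na = unname(q_na),
                    p_na = unname(q_na) / nrow(data),
                    stringsAsFactors = FALSE)
  out <- out[out$q_na > 0, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy data, invented for this example
toy <- data.frame(
  age = c(34, NA, 51, 46),
  city = c("Lima", "Oslo", NA, "Lima"),
  smoker = factor(c("yes", "no", "no", "yes")),
  enrolled = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01")),
  active = c(TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

groups <- classify_vars(toy)
num_na <- na_summary(toy, groups$vars_num)
cat_na <- na_summary(toy, groups$vars_cat)
```

In this toy data, age is the only numerical variable, city and smoker are categorical, enrolled and active fall under "other", and active is the only variable with a single unique value.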

Explore the NA and high cardinality variables with summary(integrity_object), or get a full summary with print(integrity_object).

data_integrity(data, MAX_UNIQUE = 35)

Arguments

data

data frame or a single vector

MAX_UNIQUE

Maximum number of unique values a categorical variable may have before it is flagged as high cardinality. Normally, above 35 values it is advisable to reduce the number of different values.
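As an illustration of how such a threshold could work, the following base R sketch flags categorical columns whose number of unique values exceeds MAX_UNIQUE. The helper high_card and the toy data are hypothetical, and the strict "greater than" comparison is an assumption based on the wording "above 35 values"; the package may compare differently.

```r
# Hypothetical helper: list categorical variables with more than MAX_UNIQUE
# unique values. Illustration only; not the funModeling implementation.
high_card <- function(data, MAX_UNIQUE = 35) {
  is_cat <- vapply(data, function(x) is.character(x) || is.factor(x), logical(1))
  vars <- names(data)[is_cat]
  n_unique <- vapply(data[vars], function(x) length(unique(x)), integer(1))
  out <- data.frame(variable = vars, unique = unname(n_unique),
                    stringsAsFactors = FALSE)
  # Assumption: strictly greater than the threshold
  out <- out[out$unique > MAX_UNIQUE, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy data, invented for this example: an id-like column and a two-level column
toy <- data.frame(
  user_id = paste0("u", 1:40),
  plan = rep(c("free", "pro"), 20),
  stringsAsFactors = FALSE
)

flagged_default <- high_card(toy)                  # threshold 35
flagged_relaxed <- high_card(toy, MAX_UNIQUE = 50) # threshold 50
```

With the default threshold, user_id (40 unique values) is flagged; raising MAX_UNIQUE to 50 leaves no variable flagged.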

Value

An 'integrity' object.

Examples

# Example 1:
data_integrity(heart_disease)
#> $vars_num_with_NA
#>            variable q_na       p_na
#> 1 num_vessels_flour    4 0.01320132
#>
#> $vars_cat_with_NA
#>   variable q_na       p_na
#> 1     thal    2 0.00660066
#>
#> $vars_cat_high_card
#> [1] variable unique
#> <0 rows> (or 0-length row.names)
#>
#> $MAX_UNIQUE
#> [1] 35
#>
#> $vars_one_value
#> character(0)
#>
#> $vars_cat
#> [1] "gender"              "chest_pain"          "fasting_blood_sugar"
#> [4] "resting_electro"     "thal"                "exter_angina"
#> [7] "has_heart_disease"
#>
#> $vars_num
#> [1] "age"                    "resting_blood_pressure" "serum_cholestoral"
#> [4] "max_heart_rate"         "exer_angina"            "oldpeak"
#> [7] "slope"                  "num_vessels_flour"      "heart_disease_severity"
#>
#> $vars_char
#> character(0)
#>
#> $vars_factor
#> [1] "gender"              "chest_pain"          "fasting_blood_sugar"
#> [4] "resting_electro"     "thal"                "exter_angina"
#> [7] "has_heart_disease"
#>
#> $vars_other
#> character(0)
#>
# Example 2:
# changing the default minimum threshold to flag a variable as high cardinality
data_integrity(data=data_country, MAX_UNIQUE=50)
#> $vars_num_with_NA
#> [1] variable q_na p_na
#> <0 rows> (or 0-length row.names)
#>
#> $vars_cat_with_NA
#> [1] variable q_na p_na
#> <0 rows> (or 0-length row.names)
#>
#> $vars_cat_high_card
#>   variable unique
#> 1  country     70
#>
#> $MAX_UNIQUE
#> [1] 50
#>
#> $vars_one_value
#> character(0)
#>
#> $vars_cat
#> [1] "country" "has_flu"
#>
#> $vars_num
#> [1] "person"
#>
#> $vars_char
#> [1] "country" "has_flu"
#>
#> $vars_factor
#> character(0)
#>
#> $vars_other
#> character(0)
#>