# Preprocessing

## Load packages

Explicitly load the packages that we need for this analysis.

```{r packages}
library(rio)
library(ck37r)
library(caret)
```

## Load the data

Load the heart disease dataset.

```{r load_data}
# Load the heart disease dataset using import() from the rio package.
data_original = import("data-raw/heart.csv")

# Preserve the original copy
data = data_original

str(data)
```

## Read background information and variable descriptions

https://archive.ics.uci.edu/ml/datasets/heart+Disease

## Data preprocessing

Data preprocessing is an integral first step in machine learning workflows. Because different algorithms sometimes require the moving parts to be coded in slightly different ways, always make sure you research the algorithm you want to implement so that you properly set up your $y$ and $x$ variables and split your data appropriately.

> NOTE: also, use the `save` function to save your variables of interest. In the remaining walkthroughs, we will use the `load` function to load the relevant variables.

### What is one-hot encoding?

One additional preprocessing aspect to consider: datasets that contain factor (categorical) features should typically be expanded out into numeric indicators (this is often referred to as [one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)). You can do this manually with the `model.matrix` R function. This makes it easier to code a variety of algorithms to a dataset, as many algorithms handle factors poorly (decision trees being the main exception). Doing this manually is always good practice. In general, however, functions like `lm` will do this for you automatically.
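As a minimal sketch of what this expansion looks like (using the built-in `iris` data rather than the heart dataset, so the example is self-contained):

```{r one_hot_sketch}
# A minimal sketch of one-hot encoding with model.matrix(), using the built-in
# iris data. The Species factor is expanded into 0/1 indicator columns, one per
# level (the first level is dropped as the reference category when an intercept
# is included).
head(model.matrix(~ Species, data = iris))

# Dropping the intercept keeps an indicator column for every level.
head(model.matrix(~ Species - 1, data = iris))
```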
## Handling missing data

Missing values need to be handled somehow. Listwise deletion (deleting any row with at least one missing value) is common, but this method throws out a lot of useful information. Many advocate for mean imputation, but arithmetic means are sensitive to outliers. Still others advocate for chained equation/Bayesian/expectation maximization imputation (e.g., the [mice](https://www.jstatsoft.org/article/view/v045i03/v45i03.pdf) and [Amelia II](https://gking.harvard.edu/amelia) R packages). K-nearest neighbor imputation can also be useful, but median imputation is demonstrated below. However, you will want to learn about [Generalized Low Rank Models](https://stanford.edu/~boyd/papers/pdf/glrm.pdf) for missing data imputation in your research. See the `impute_missing_values` function from the ck37r package to learn more - you might need to install an h2o dependency.

First, count the number of missing values across variables in our dataset.

```{r review_missingness}
colSums(is.na(data))
```

We have no missing values, so let's introduce a few to the "oldpeak" feature for this example to see how it works:

```{r}
# Add five missing values to oldpeak in row numbers 50, 100, 150, 200, 250.
data$oldpeak[c(50, 100, 150, 200, 250)] = NA

colSums(is.na(data))
colMeans(is.na(data))
```

There are now 5 missing values in the "oldpeak" feature. Now, median impute the missing values! We also want to create missingness indicators to inform us about the location of missing data. These are additional columns we will add to our data frame that represent the _locations_ within each feature that have missing values - 0 means data are present, 1 means there was a missing (and subsequently imputed) value.

```{r impute_missing_values}
result = ck37r::impute_missing_values(data, verbose = TRUE, type = "standard")
names(result)

# Use the imputed dataframe.
data = result$data

# View new columns. Note that the indicator feature "miss_oldpeak" has been
# added as the last column of our data frame.
str(data)

# No more missing data!
colSums(is.na(data))
```

Since the "ca", "cp", "slope", and "thal" features are currently integer type, convert them to factors. The other relevant variables are either continuous or are already indicators (just 1's and 0's).

```{r}
data = ck37r::categoricals_to_factors(data,
                                      categoricals = c("sex", "ca", "cp", "slope", "thal"),
                                      verbose = TRUE)

# Inspect the updated data frame
str(data)
```

## Defining *y* outcome vectors and *x* feature dataframes

### Convert factors to indicators

Now expand the "sex", "ca", "cp", "slope", and "thal" features out into indicators.

```{r factors_to_indicators}
result = ck37r::factors_to_indicators(data, verbose = TRUE)
data = result$data

str(data)
dim(data)
```

What happened?

## Regression setup

Now that the data have been imputed and properly converted, we can assign the regression outcome variable (`age`) to its own vector for the lasso **REGRESSION task**. Remember that lasso can perform classification as well.

### Set seed for reproducibility

Take the simple approach to data splitting and divide our data into training and test sets; 70% of the data will be assigned to the training set and the remaining 30% will be assigned to the holdout, or test, set.

### Random versus stratified random split

Splitting data into training and test subsets is a fundamental step in machine learning. Usually, the majority portion of the original dataset is partitioned to the training set, where the algorithms learn the relationships between the $x$ feature predictors and the $y$ outcome variable. Then, these models are given new data (the test set) to see how well they perform on data they have not yet seen.

Since age is a continuous variable and will be the outcome for the OLS and lasso regressions, we will not perform a stratified random split like we will for the classification tasks (see below). Instead, [let's randomly assign](https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function) 70% of the `age` values to the training set and the remaining 30% to the test set.

```{r}
# Create a list to organize our regression task
task_reg = list(
  data = data,
  outcome = "age"
)

# All variables can be used as covariates except the outcome ("age")
(task_reg$covariates = setdiff(names(data), task_reg$outcome))

names(task_reg)

# Define the sizes of training (70%) and test (30%) sets.
(training_size = floor(0.70 * nrow(task_reg$data)))

# Set seed for reproducibility.
set.seed(1)

# Partition the rows to be included in the training set.
training_rows = sample(nrow(task_reg$data), size = training_size)

task_reg$train_rows = training_rows

head(task_reg$train_rows)

# View our regression task list
names(task_reg)
```
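As a quick, optional sanity check (not part of the original task setup), we can verify that the random split covers about 70% of the rows and that the test set is simply the complement selected with negative indexing:

```{r check_reg_split}
# Optional sanity check: proportion of rows assigned to the training set
# (should be roughly 0.70).
length(task_reg$train_rows) / nrow(task_reg$data)

# The test set is everything not in train_rows, selected below with
# negative indexing (task_reg$data[-task_reg$train_rows, ]).
nrow(task_reg$data) - length(task_reg$train_rows)
```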
## Classification setup

Assign the outcome variable to its own vector for the decision tree, random forest, gradient boosted tree, and SuperLearner ensemble **CLASSIFICATION tasks**. However, keep in mind that these algorithms can also perform regression! This time, however, "target" will be our $y$ outcome variable (1 = person has heart disease, 0 = person does not have heart disease) - the others will be our $x$ features.

```{r}
# Store everything in a list like we did for the regression task
task_class = list(
  data = data,
  outcome = "target"
)

(task_class$covariates = setdiff(names(task_class$data), task_class$outcome))

# See the names of the list elements
names(task_class)
```

Our factors have already been converted to indicators during the regression setup! :)

### Stratified random split

For classification, we then use [stratified random sampling](https://stats.stackexchange.com/questions/250273/benefits-of-stratified-vs-random-sampling-for-generating-training-data-in-classi) to divide our data into training and test sets; 70% of the data will be assigned to the training set and the remaining 30% will be assigned to the holdout, or test, set.

```{r}
# Set seed for reproducibility.
set.seed(2)

# Create a stratified random split.
training_rows = caret::createDataPartition(task_class$data[[task_class$outcome]],
                                           p = 0.70, list = FALSE)

# Partition training dataset
task_class$train_rows = training_rows

mean(task_class$data[training_rows, "target"])
table(task_class$data[training_rows, "target"])

mean(task_class$data[-training_rows, "target"])
table(task_class$data[-training_rows, "target"])
```

## Define training and test x and y variables for regression and classification

### Regression variables

```{r}
# Pull out regression preprocessed data for easier analysis
train_x_reg = task_reg$data[task_reg$train_rows, task_reg$covariates]
train_y_reg = task_reg$data[task_reg$train_rows, task_reg$outcome]

test_x_reg = task_reg$data[-task_reg$train_rows, task_reg$covariates]
test_y_reg = task_reg$data[-task_reg$train_rows, task_reg$outcome]

# Look at ages of first 20 individuals
head(train_y_reg, n = 20)

# Look at features for the corresponding first 6 individuals
head(train_x_reg)
```

### Classification variables

```{r}
train_x_class = task_class$data[task_class$train_rows, task_class$covariates]
train_y_class = task_class$data[task_class$train_rows, task_class$outcome]

test_x_class = task_class$data[-task_class$train_rows, task_class$covariates]
test_y_class = task_class$data[-task_class$train_rows, task_class$outcome]
```

### Save our preprocessed data

We save our preprocessed data into an RData file so that we can easily load it in the later files.

```{r save_data}
save(data, data_original,
     task_reg, task_class,
     train_x_reg, train_y_reg, test_x_reg, test_y_reg,
     train_x_class, train_y_class, test_x_class, test_y_class,
     file = "data/preprocessed.RData")
```
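In the remaining walkthroughs these objects can then be restored with `load`. A minimal sketch of that step (assuming the working directory contains the `data/` folder used above):

```{r load_sketch, eval = FALSE}
# A minimal sketch of how a later walkthrough could restore the saved objects.
# load() brings each saved object back into the workspace under its original name.
load("data/preprocessed.RData")

# Confirm the task lists and train/test objects are available again.
ls()
```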