Procedure 5: Loading Data into H2O with R
Start by loading the FraudRisk.csv file into R using readr:
library(readr)
FraudRisk <- read_csv("C:/Users/Richard/Desktop/Bundle/Data/FraudRisk/FraudRisk.csv")
Run the block of script to console:
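Optionally, confirm that the file has loaded as expected by taking a quick look at the data frame; a minimal check would be:

nrow(FraudRisk)
head(FraudRisk)

Run the block of script to console to see the number of records and the first few rows.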
The training process will make use of a training dataset and a cross-validation dataset. The preferred method to randomly split a data frame is to create a vector of random values and append that vector to the data frame. Using vector subsetting, the data frame can then be split based on the random values.
Start by observing the length of the data frame by typing the following (any column of the data frame can be used):
length(FraudRisk$Dependent)
Run the line of script to console:
Having established that the data frame has 1827 records, use this value to create a vector of the same size containing random values between 0 and 1. The runif function is used to create vectors of a prescribed length with random values within a specified range:
RandomDigit <- runif(1827,0,1)
Run the line of script to console:
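Note that runif produces different values on every call, so the split created below will differ from run to run. If a reproducible split is required, a seed can be set immediately before generating the vector; the seed value shown here is an arbitrary example:

set.seed(42)
RandomDigit <- runif(1827,0,1)

Run the block of script to console: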
A vector containing random digits, of the same length as the data frame, has been created. Validate the vector by typing its name:

RandomDigit

Run the line of script to console:
The random digits are written out, showing that values have been created, on a random basis, between 0 and 1 with a high degree of precision. Append this vector to the data frame using dplyr and mutate:
library(dplyr)
FraudRisk <- mutate(FraudRisk,RandomDigit)
Run the block of script to console:
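To confirm that the RandomDigit column has been appended, list the column names of the data frame:

names(FraudRisk)

The RandomDigit column should appear at the end of the list.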
The RandomDigit vector is now appended to the FraudRisk data frame and can be used for subsetting and splitting. Create the cross-validation dataset by filtering and assigning the result to a new data frame:
CV <- filter(FraudRisk,RandomDigit < 0.2)
Run the line of script to console:
A new data frame by the name of CV has been created. Observe the CV data frame length:
length(CV$Dependent)
Run the line of script to console:
It can be seen that the data frame has 386 records, which is broadly 20% of the FraudRisk data frame's records. The task remains to create the training dataset, which is similar, albeit subsetting for the larger, opposing random digit filter:
Training <- filter(FraudRisk,RandomDigit >= 0.2)
Run the line of script to console:
Validate the length of the Training data frame:
length(Training$Dependent)
Run the line of script to console:
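If desired, the split proportions can be checked directly from the lengths already obtained; a short sketch would be:

length(CV$Dependent) / length(FraudRisk$Dependent)
length(Training$Dependent) / length(FraudRisk$Dependent)

The ratios should come out at broadly 0.2 and 0.8 respectively, reflecting the random filter that was applied.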
It can be observed that the Training dataset is 1463 records in length, which is broadly 80% of the file. So as not to accidentally use the RandomDigit vector in training, drop it from the Training and CV data frames:
CV$RandomDigit <- NULL
Training$RandomDigit <- NULL
Run the block of script to console:
H2O requires that the dependent variable is a factor, as this is after all a classification problem. Convert the dependent variable to a factor for both the training and cross-validation datasets:

Training$Dependent <- as.factor(Training$Dependent)
CV$Dependent <- as.factor(CV$Dependent)

Run the block of script to console:
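To verify the conversion and inspect the class balance of the dependent variable, the following optional checks can be run:

is.factor(Training$Dependent)
table(Training$Dependent)

is.factor should return TRUE, while table shows how many records fall into each class.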
At this stage, there now exists a randomly selected Training dataset as well as a randomly selected Cross Validation dataset. Keep in mind that H2O requires that the data frame is converted to the native hex format, achieved through the creation of a parsed data object for each dataset. This requires the h2o package to be loaded and a connection to the H2O server to be available; h2o.init will start a local instance or connect to one that is already running. Think of this process as being the loading of data into the H2O server, more so than a conversion to hex:
library(h2o)
h2o.init()
Training.hex <- as.h2o(Training)
CV.hex <- as.h2o(CV)
Run the block of script to console:
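To confirm that both frames have been loaded into the H2O server, the objects held by the cluster can be listed and a parsed frame summarised:

h2o.ls()
summary(Training.hex)

Run the block of script to console. h2o.ls lists the keys of the objects currently held in H2O, while summary reports per-column summary statistics for the parsed frame.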
All models that are available to be trained via the Flow interface are also available via the R interface, with the hex frames ready to be passed as parameters.
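As an illustration only, a Gradient Boosting Machine could be trained against these frames by passing the hex objects as the training and validation frames. The object names Response, Predictors and GBMModel below are arbitrary, and no tuning parameters are suggested:

Response <- "Dependent"
Predictors <- setdiff(colnames(Training.hex), Response)
GBMModel <- h2o.gbm(x = Predictors, y = Response, training_frame = Training.hex, validation_frame = CV.hex)

The h2o.gbm function returns a trained model object, with performance on the cross-validation frame reported via the validation_frame parameter.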