This article provides an example of how Data360 Analyze Desktop can leverage R to classify data using Machine Learning.
By leveraging the power of R, Data360 Analyze can be extended beyond its native analytic capabilities to address a larger problem set while retaining Data360 Analyze's other facilities for acquiring, federating, and publishing data.
There are many articles available online that detail how to use R Machine Learning techniques to build classification models, so this article does not attempt to repeat that general information. Instead, it focuses primarily on how Data360 Analyze can integrate with R and discusses the specifics of the particular Machine Learning technique used within the example.
If you want to build the example data flow as you follow this article the steps are included in the text. Alternatively, the complete data flow is attached to the article and can be imported into your Data360 Analyze installation.
What is R?
R is a popular and powerful Open Source statistical computing platform with an extensive library of analytic techniques, which allows data scientists to create complex computations and apply them to solve statistical and predictive analytic problems.
The R Node
The R node provides an interface to the R computation platform, allowing R scripts to be executed using Data360 Analyze data sets, producing tabular or graphical output which is returned to Data360 Analyze.
Disclaimer
Open Source R is available under separate open source software license terms and is not part of Data360 Analyze. As such, Open Source R is not within the scope of your license for Data360 Analyze. Open Source R is not supported, maintained, or warranted in any way by Infogix, Inc. Download and use of Open Source R is solely at your own discretion and subject to the free open source license terms applicable to Open Source R.
Obtaining the R Node
The R node archive for the latest version of Analyze can be downloaded from the Analyze Install page of the Infogix website. The page also includes instructions on how to install and configure R and the R node. You can also obtain the R node for your Analyze version from the Analyze Downloads page (sign-in required).
Using the R Node
The R node uses a TCP/IP connection to communicate with an R environment. This connection is provided by Rserve. For Analyze to communicate with the R environment, the Rserve server must be running on the machine hosting the R environment (which may be the same machine as Analyze or another machine in your IT infrastructure). This article assumes you are running R on your desktop.
To start the Rserve server open the R console and enter the following commands at the R prompt:
library(Rserve)
Rserve()
You should then see a message similar to “Starting Rserve …”.
Once the Rserve server is running you can quit the R console by entering the q() command and then selecting ‘No’ when asked about saving the workspace image. However, you may need to keep the console open for a while longer if you have not yet installed all the prerequisite packages (see the next section).
Note that the Rserve server will need to be restarted each time your machine boots.
Installing CRAN packages
The Comprehensive R Archive Network (CRAN) libraries offer over 7000 add-on software ‘Packages’ that provide a wealth of advanced statistical and predictive capabilities. In this example we will be using the ‘caret’ package which streamlines access to over 50 predictive algorithms.
For completeness, you can install the caret package and all of its prerequisite/suggested packages by entering the following command at the R console prompt. This installs quite a few packages, the majority of which are not used in this particular example. If you go down this route I’d suggest you grab a coffee at this point.
install.packages("caret", dependencies = c("Depends", "Suggests"))
Alternatively, you can install the minimal set of packages by entering the following command. You may also need to install the "MASS" and "e1071" packages if they are not already present in your R environment.
install.packages("caret")
You may be asked to select a ‘mirror’ to use for the current session – select one that is local to you. R will then download and install caret and the other packages. When this is complete you can then quit the R console session.
You may see some issues with missing packages. You can test whether caret is installed by using the following command to load the library into the workspace:
library(caret)
If you get an error relating to the "lazyeval" and "munsell" packages being missing, you can install them manually using the following command and then try loading the caret library again.
install.packages(c("munsell","lazyeval"))
Example Data
In this example we will be using the Letter Recognition data set from the UCI Machine Learning Repository. This data set provides a set of character image features that can be used in a machine learning model to identify (classify) the upper case letter (A – Z). An overview of the data set can be found on the UCI Machine Learning Repository website.
Downloading the Example Data
We are going to download the example data directly from the UCI repository. To do this, start Analyze and create a new data flow. From the ‘Interfaces and Adapters’ category of the node palette, drag an R node onto the canvas.
By default the R node does not have any input or output pins. We are going to add an output pin for the downloaded data. With the node selected in the canvas, click on the ‘Define’ tab:
Scroll down to the bottom of the properties panel so that the ‘Inputs’ and ‘Outputs’ sections are visible. Click into the ‘Output Name’ text area (where the “Type to add a new output” placeholder is displayed). Type the name for the output – in this case enter ‘out1’ (without the quotes) and press enter. The panel will change to show a new placeholder line and the R node on the canvas will be updated to display the new output pin.
Now scroll back to the top of the properties panel and select the ‘Configure’ tab.
Type a descriptive name in the ‘Name’ property. The indication area of the node properties shows that there is one mandatory value missing – the contents of the RScript property – into which we will insert the required R script.
Before we get into the specifics of the script for this example, it is worth noting that, when the R node is run, it will by default generate an error if it cannot locate a variable with the same name as an output pin – ‘out1’ in our case. When developing an R script within the node it is useful to assign a placeholder for any outputs when you create the pins. When outputting data, the data must be an R object of class ‘data.frame’. In the simplest case this can be a single string value, e.g.
out1 <- data.frame(placeholder="Hello world!")
Alternatively, if you had a node with an input pin (called ‘in1’) you could pass the data through the R node using:
out1 <- in1
Note that in R, variable names are case-sensitive. The variables you use must be valid R names.
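For example, the following assignments create two distinct variables; only the second matches an output pin named ‘out1’ (a minimal illustration):
Out1 <- data.frame(x = 1)    # does not match the output pin name
out1 <- data.frame(x = 1)    # matches the 'out1' pin, so this data is output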
Ok, back to our example. The R node can download data from a web source. So in the script the first statement defines the URL of the file (comment lines starting with # are ignored).
## Define the URL for the source data from the UCI Machine Learning Repository
data_URL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data"
The next R statement downloads the data from the specified URL and assigns it to the ‘df’ variable:
df <- read.csv(url(data_URL), header=FALSE)
As the file in the repository does not include the variable names, the header argument is set to FALSE.
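As an optional sanity check when developing the script interactively, base R functions can be used to inspect the imported data; for this data set they should report 20,000 observations of 17 variables:
str(df)     # structure: dimensions and the type of each column
head(df)    # the first six records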
Next we assign the correct names to the variables (see the data description for details). The names are defined as a vector and the data.frame column names are then assigned to the values in the vector:
var_Names <- c("lettr","x-box","y-box","width",
               "high","onpix","x-bar","y-bar",
               "x2bar","y2bar","xybar","x2ybr",
               "xy2br","x-ege","xegvy","y-ege","yegvx")
colnames(df) <- var_Names
As we want to output the downloaded data, we assign the ‘df’ data.frame to the ‘out1’ variable:
out1 <- data.frame(df)
As in this example the R environment is on the local machine with a default Rserve configuration, there is no need to configure the Rusername, Rpassword or change the Rport value.
Running the node retrieves the data and outputs it on the ‘out1’ pin:
Creating an Unseen Data set
There are 20,000 records in the ‘Letter Recognition’ data set. Rather than use all of these records in building the model, we will split out a subset of the records and retain them for use as ‘unseen’ new data in a final test of the model’s effectiveness. In this case a Transform node is added to the canvas and used to generate a simple sample by selecting 1 in 25 of the records. The Transform node has a second output pin ('out2') defined. The node is configured with a conditional expression that uses the Python modulus operator "%" to extract 800 records, writing each record to one of the two output pins depending on the value of the remainder:
if node.execCount % 25 != 0:
    out1 += in1
else:
    out2 += in1
Creating Train and Test Data Sets
We are going to use the remaining (19,200) records to train and validate the model. We could have used another Transform node to do this, but here we are going to use another R node. The R node is added to the canvas and configured with one input pin (‘in1’) and two output pins (‘out1’ and ‘out2’), in a similar manner to how we previously created an output pin on the R node used to download the data.
As before, a suitable name is given to the R node. In the node’s R script, the first statement simply assigns a reference variable to the input data. This is not strictly necessary, as we could refer to ‘in1’ throughout the rest of the script, but it imposes no significant overhead and is a useful reminder that we are working with a data.frame.
df <- in1
Next, we define the percentage of the records we want in the ‘train’ data set – in this case 75%. The nrow() function is used to get the number of records in the input data from which we calculate the number of records to be in the train data set.
Percent <- 0.75
total_Records <- nrow(df)
Train_numRows <- floor(Percent * total_Records)
Next we set the seed so that the selection derived from the random sample is repeatable over multiple runs.
set.seed(101)
We then create a vector containing the indices of the sample of records that are to be included in the train data set.
train_idx <- sample(seq_len(total_Records), size = Train_numRows)
This vector is then used to subset the data into the required train and test data sets
train <- df[train_idx, ]
test <- df[-train_idx, ]
Finally, we assign each of these two data sets on the named output pins.
out1 <- data.frame(train)
out2 <- data.frame(test)
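If you are developing the script interactively in an R console, a quick check (illustrative only) confirms the split: with 19,200 input records, a 75% split gives 14,400 train and 4,800 test records.
nrow(train)    # expected: 14400
nrow(test)     # expected: 4800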
When the node is run the two data sets are presented at the output:
Building the Model
To model the data we are going to use the Linear Discriminant Analysis (LDA) algorithm. This is just one of the many algorithms that can be used with the caret package.
Note:
When the LDA algorithm is used with the caret package, caret will automatically load the required packages. If the ‘MASS’ or ‘e1071’ packages are missing from your R environment, use the install.packages("MASS") or install.packages("e1071") command to install them, as previously described.
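Both packages can be installed with a single command if required:
install.packages(c("MASS", "e1071"))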
Another R node is added to the canvas and configured with two input pins (‘in1’ and ‘in2’) and one output pin (‘out1’). The two outputs of the ‘Create Train and Test Sets’ R node are then connected to the two input pins of the new R node.
Having named the node we can then configure the R script. As we are going to be using the installed caret package (library), we first configure the search paths to include the Analyze temporary directory, and set the R node to use the same directory as its working directory. These statements use the built-in ‘ls.brain.tempDir’ property reference – which will be substituted with the actual configured value of the path when the node is run.
myLibrary = "{{%ls.brain.tempDir%}}" # Version/platform independent path
.libPaths( c( .libPaths(), myLibrary) )
setwd("{{%ls.brain.tempDir%}}")
Next the caret library is loaded and the ‘train’ and ‘test’ variables are used to reference the input data sets on ‘in1’ and ‘in2’.
library(caret)
train <- in1
test <- in2
Next the ‘WriteModel’ variable is set to TRUE – this will be used later to indicate that we want the LDA model to be saved to a file. The directory where the file is to be saved can be specified – by default the node will use the Analyze temporary file directory. Note that when working on a Windows machine, the path separators should be either the forward slash character (‘/’) or the escaped backslash character (‘\\’).
WriteModel <- TRUE
#OutputDir <- "C:/temp"
OutputDir <- NULL
The additional logic instructs the node to use the Analyze temporary directory if the output directory was not specified (is NULL).
if(is.null(OutputDir)) {
    ModelOutDir = c("{{%ls.brain.tempDir%}}")
} else {
    ModelOutDir = OutputDir
}
A variable is defined with the name of the model.
ModelOutFilename <- "LDA_Model"
Ok, so now we can configure some of the attributes used to build the LDA model. Again, we set the seed for repeatable results.
set.seed(101)
The trainControl() function from the caret package is then used to build a trainControl object that is configured to use 10-fold Cross Validation (CV) to test the accuracy of the LDA model. The ‘Accuracy’ measure is configured as the metric for the CV tests.
trainCont <- trainControl(method="cv", number=10)
Metric <- "Accuracy"
Now we can (finally) go ahead and build the model using the train() function from the caret package. The LDA algorithm accepts a ‘model formula’ to specify which variables are to be included in the model. In this case the model formula is set to ‘lettr~.’, which is a shorthand means of saying model the ‘lettr’ variable as a function of all the other variables (‘.’) in the training data. The trControl and metric arguments are also included in the train function. The fitted LDA model is assigned to the ‘fit’ variable.
fit <- train(lettr~., data=train, method="lda", metric=Metric, trControl=trainCont)
The fitted model includes a number of attributes. We capture the output of the print function to obtain a summary of the model. This is assigned to a variable so it can be output later.
Summary <- paste(capture.output(print(fit)), collapse= "\r\n")
Next, we are going to use the ‘test’ data set to validate the model. The predict() function is used to generate class label predictions for the observations in the test data.
predictions <- predict(fit, test)
To assess the model’s classification accuracy the predicted class labels are compared with the actual class labels for the observations in the test data. This is used to generate the ‘Confusion Matrix’ object. We capture the summary of this and assign it to the ‘ConfMtx’ variable so we can output it.
ConfMatrixObj <- confusionMatrix(predictions, test$lettr)
ConfMtx <- paste(capture.output(print(ConfMatrixObj)), collapse= "\r\n")
The fitted LDA model is only available within the current R node. However, we may want to use the model within another node without having to regenerate it, and this can be achieved by saving (serializing) the model’s R object to a file. The path for the file is constructed from the previously specified directory and model name – plus a ‘.RDS’ file extension to aid identification. The serialized object is then (conditionally) saved to the file.
modelOutFP <- file.path(ModelOutDir, paste0(ModelOutFilename,".RDS"))
if (isTRUE(WriteModel)) {
    ## Save the serialized model (in uncompressed binary format)
    saveRDS(fit, modelOutFP, compress = FALSE)
}
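As an optional check while developing the script, you can confirm that the file was created (illustrative only):
file.exists(modelOutFP)    # returns TRUE if the serialized model file was written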
Finally, the results are output on the node’s ‘out1’ pin. The output data will include the file path of the serialized model file if it has been saved.
if (isTRUE(WriteModel)) {
    out1 <- data.frame(Summary, ConfusionMatrix=ConfMtx, ModelFilePath=modelOutFP)
} else {
    out1 <- data.frame(Summary, ConfusionMatrix=ConfMtx)
}
Examining the LDA Model Results
The output data includes three fields – the model summary, the confusion matrix summary and the file path of the serialized model. The first two fields are multi-line data (indicated by the yellow triangle in the top right of the cell). Hovering the mouse over the cell displays the contents of the cell. This can also be copied out to, say, Notepad using ‘Ctrl+C’ or the copy option of the right-click context menu.
The Summary provides details about the number of samples used to build the model (the overall total and the number in each of the 10 sample ‘folds’), the number of predictors used, and the number of class levels in the model (A–Z in this case). It also provides an indication of the accuracy of the model, based on the cross-validation results – in this case 70% (maybe it still needs more work but it is good enough for this example).
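If you want to use the accuracy figure programmatically, rather than reading it from the captured summary text, the cross-validation metrics are also available directly on the fitted model object. A minimal sketch:
## The resampled performance metrics are held in the 'results' element of the fitted model
cvAccuracy <- fit$results$Accuracy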
The Confusion Matrix summary provides a lot of information – too much to see in the hover text, so it is best to copy this out to Notepad. The Confusion Matrix shows the counts of the (mis-)classification of the test observations. The predicted class labels are shown in the rows and the actual (reference) class labels are shown in the columns. In the image below the results for the letter ‘C’ are highlighted. Ideally the intersection of the red rectangles should be the only cell to contain a non-zero value, as all other cells within the red rectangles represent mis-classifications for that character.
The Confusion Matrix output also contains statistics (e.g. confidence intervals and the P-value) and information on the model’s classification accuracy for each individual class label.
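These statistics can also be accessed directly from the confusion matrix object, rather than parsed from the captured text – for example:
ConfMatrixObj$overall    # overall statistics, e.g. Accuracy and its confidence interval
ConfMatrixObj$byClass    # per-class statistics, e.g. Sensitivity and Specificity per letter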
Consolidating the Logic
The R code logic described above for this example was split across multiple R nodes to break down the operation so that the created objects could be inspected at the output of the various stages. It also provides an example of a node that could be used in other projects to create training and test data sets.
In practice, the logic in the ‘Create Train and Test Sets’ node and the ‘Classification using caret Package’ node can be combined into a single node, which improves the performance as the train and test data does not need to be passed from one node to the other. This consolidated logic has been included in the ‘Linear Discriminant Analysis’ node (see the example data flow for details).
Using the Fitted Model to Make Predictions
So far we have used the train data to build the LDA model and the test data to assess the performance (accuracy) of the final model. Now we will build a node that uses the serialized model to predict the class labels for new (‘unseen’) data.
As before, this is going to use an R node. It will take two inputs – the file path of the serialized LDA model on the first input and the new data on the second input. The new data comprises the 800 records from the original data set that we retained as our ‘hold-out’ data to simulate new data.
The R script in this node is similar to that previously described for the classification logic.
The library paths and working directory are configured and the caret package is loaded.
myLibrary = "{{%ls.brain.tempDir%}}" # Version/platform independent path
.libPaths( c( .libPaths(), myLibrary) )
setwd("{{%ls.brain.tempDir%}}")
library(caret)
Next we reference the new data presented on the node’s second input pin.
newData <- in2
Then the file path for the serialized model is obtained from the ‘ModelFilePath’ field on the first input pin (as a character string rather than a factor).
ModelInFP <- as.character(in1$ModelFilePath)
The serialized model is then read in (de-serialized) and assigned to the ‘modelObject’ variable. The class of the modelObject is checked to confirm that it is a model created by the train() function.
modelObject <- readRDS(ModelInFP)
ObjClass <- class(modelObject)
if(ObjClass[1] != "train") {
    stop(paste("The object in file", ModelInFP, "is not an object of class 'train'\n"))
}
The seed is set again to give repeatable results and the predict() function is used to generate the class label predictions for the new (unseen) data. The predictions are a vector of class labels; these are combined with the data in the newData data frame to form the output data, which is then assigned so it can be output on the ‘out1’ pin.
set.seed(101)
predictions <- predict(modelObject, newData)
PredictDF <- data.frame(Prediction=predictions, newData)
out1 <- data.frame(PredictDF)
In this case, the output data includes both the actual class label (which, obviously, normally would not be present) and the predicted class label for the letter.
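As the actual labels happen to be present in this example, a quick overall accuracy figure for the unseen data could also be computed within the same R script before the data is output (an optional, illustrative step):
## Proportion of unseen records where the predicted letter matches the actual letter
unseenAccuracy <- mean(as.character(PredictDF$Prediction) == as.character(PredictDF$lettr))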
We can use an Aggregate node to aggregate the classification results for our new data, to illustrate how it performed on the unseen data. The Script properties in the Advanced tab are configured as shown below. The ‘GroupBy’ property is configured to group the aggregations by the ‘lettr’ field.
The ConfigureFields script in the Operations group of the Advanced tab defines two custom aggregates that count the numbers of correct and incorrect predictions.
outputs[0]["Correct"] = group.count(fields["Prediction"], lambda agg, newValue: newValue == fields.lettr)
outputs[0]["Wrong"] = group.count(fields["Prediction"], lambda agg, newValue: newValue != fields.lettr)
The ProcessRecords script is configured to output the values of the aggregates.
The aggregations identify the number of correct and incorrect predictions for each letter. A Calculate Fields node is used to calculate the percentage of correct values for each letter.
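The calculation itself is straightforward; expressed in R for illustration (the Calculate Fields node implements the equivalent expression in its own expression syntax):
## Percentage of correct predictions for a letter
Percent_Correct <- 100 * Correct / (Correct + Wrong)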
The results show that the performance in this case varies depending on the letter. The best accuracy is for ‘L’ and ‘M’ – at over 90% – whereas the accuracy for ‘G’, ‘S’ and ‘H’ is less than 50%, which we may want to improve on.
Conclusions
In this article we have seen how Data360 Analyze can leverage the capabilities of Open Source R and the add-on capabilities provided by one of the CRAN libraries to generate a machine learning model for classifying categorical data.
While the actual results from the classification may need to be improved for a real-world letter recognition system, the example illustrates how machine learning can be incorporated into an overall Data360 Analyze data flow. The choice of classification algorithm can affect the performance of the model. The use of LDA in this example was arbitrary and the results could possibly be improved by utilizing one of the other algorithms available with the caret package, such as Random Forest or Support Vector Machines. For instance, in other tests the K Nearest Neighbour algorithm was seen to perform better than LDA, with an accuracy of between 77% and 100% across the class labels.
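One of the conveniences of the caret package is that trying an alternative algorithm typically only requires changing the method argument passed to train(). For example, a sketch of fitting a K Nearest Neighbour model with the same controls (this assumes any packages required by the chosen method are installed in your R environment):
## Fit a K Nearest Neighbour model instead of LDA
fitKnn <- train(lettr~., data=train, method="knn", metric=Metric, trControl=trainCont)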
The logic in the Linear Discriminant Analysis node served to illustrate how Data360 Analyze can integrate with an R environment. It could be extended to make it more generic and reusable by, say, parameterizing the percentage of records allocated to the train/test sets, the definition of the model formula, and the name of the dependent variable – rather than hard-wiring these within the R code itself. The sampling could also be improved by using a stratified sample across each of the class labels, rather than drawing a sample from all the observations as a single data set – which could affect the outcome if there are some categories in the data set with fewer observations.
In addition to offering integration with Open Source R, Data360 Analyze includes a range of statistical and predictive nodes that provide pre-defined nodes for techniques such as Linear and Logistic Regression, supervised classification using Random Forest, unsupervised classification using K-Means and Hierarchical Clustering, Affinity analysis and Time Series analysis. These premium-licensed statistical and predictive nodes and the general purpose ‘Power R’ node leverage the capabilities of the embedded R engine deployed with the Data360 Analyze software rather than an Open Source R environment.
I hope you have enjoyed working through this Machine Learning example and would encourage you to investigate how you can use Data360 Analyze to extract insights from your data and make better business decisions.