### Data Exploration & Machine Learning, Hands-on

Practical Walkthroughs on Machine Learning, Data Exploration and Insight Finding

}

# Using Correlations To Understand Your Data

Practical walkthroughs on machine learning, data exploration and finding insight.

Resources

Packages Used in this Walkthrough

• {psych} - correlation matrix plotting
• {caret} - dummyVars function

A great way to explore new data is to use a pairwise correlation matrix. This will measure the correlation between every combination of your variables. It doesn’t really matter if you have an outcome (or response) variable at this point, it will compare everything against everything else.

For those not familiar with the correlation coefficient, it is simply a measure of similarity between two vectors of numbers. The measure value can range between 1 and -1, where 1 is perfectly correlated, -1 is perfectly inversly correlated, and 0 is not correlated at all:

``````print(cor(1:5,1:5))
## 1
print(cor(1:5,5:1))
## -1
print(cor(1:5,c(1,2,3,4,4)))
## 0.9701425
``````

To help us understand this process, let’s download the adult.data set from the UCI Machine Learning Repository. The data is from the 1994 Census and attempts to predict those with income under \$50,000 a year:

``````library(RCurl) # download https data
x <- getURL(urlfile, ssl.verifypeer = FALSE)

# if the above getURL command fails, try this:
``````

We fill in the missing headers for the UCI set and cast the outcome variable `Income` to a binary format of 1 and 0 (here I am reversing the orginal order, if it is under \$50k, it is 0, and above, 1; this will make the final correlation-matrix plot easier to understand):

``````names(adults)=c('Age','Workclass','FinalWeight','Education','EducationNumber',
'MaritalStatus','Occupation','Relationship','Race',
'Sex','CapitalGain','CapitalLoss','HoursWeek',
'NativeCountry','Income')

``````

We load the caret package to dummify (see my other walkthrough) all factor variables as the `cor` function only accepts numerical values:

``````dmy <- dummyVars(" ~ .", data = adults)
``````

We borrow two very useful functions from Stephen Turner: `cor.prob` and `flattenSquareMatrix`. cor.prob will create a correlation matrix along with p-values and flattenSquareMatrix will flatten all the combinations from the square matrix into a data frame of 4 columns made up of row names, column names, the correlation value and the p-value:

``````corMasterList <- flattenSquareMatrix (cor.prob(adultsTrsf))
``````
``````##                         i                       j       cor         p
## 1                     age            workclass...  0.042627 1.421e-14
## 2                     age  workclass..Federal.gov  0.051227 0.000e+00
## 3            workclass...  workclass..Federal.gov -0.042606 1.454e-14
## 4                     age    workclass..Local.gov  0.060901 0.000e+00
## 5            workclass...    workclass..Local.gov -0.064070 0.000e+00
## 6  workclass..Federal.gov    workclass..Local.gov -0.045682 2.220e-16
## 7                     age workclass..Never.worked -0.019362 4.759e-04
## 8            workclass... workclass..Never.worked -0.003585 5.178e-01
## 9  workclass..Federal.gov workclass..Never.worked -0.002556 6.447e-01
## 10   workclass..Local.gov workclass..Never.worked -0.003843 4.880e-01
``````

This final format allows you to easily order the pairs however you want - for example, by those with the highest absolute correlation value:

``````corList <- corMasterList[order(-abs(corMasterList\$cor)),]
``````
``````##                                      i                            j        cor p
## 1953                       sex..Female                    sex..Male -1.0000000 0
## 597                       workclass...                occupation...  0.9979854 0
## 1256 maritalStatus..Married.civ.spouse        relationship..Husband  0.8932103 0
## 1829                       race..Black                  race..White -0.7887475 0
## 527  maritalStatus..Married.civ.spouse maritalStatus..Never.married -0.6448661 0
## 1881             relationship..Husband                  sex..Female -0.5801353 0
## 1942             relationship..Husband                    sex..Male  0.5801353 0
## 1258      maritalStatus..Never.married        relationship..Husband -0.5767295 0
## 1306 maritalStatus..Married.civ.spouse  relationship..Not.in.family -0.5375883 0
## 497                                age maritalStatus..Never.married -0.5343590 0
``````

The top correlated pair (sex..Female and sex.Male), as seen above, won’t be of much use as they are the only two levels of the same factor. We need to process this a little further to be of practical use. We create a single vector of variable names by filtering those with an absolute correlation of 0.2 or higher against the outcome variable of `Income`:

``````selectedSub <- subset(corList, (abs(cor) > 0.2 & j == 'Income'))
print(selectedSub)
i      j        cor p
## 5811      MaritalStatus..Never.married Income -0.3184403 0
## 5832           Relationship..Own.child Income -0.2285320 0
## 5840                       Sex..Female Income -0.2159802 0
## 5818       Occupation..Exec.managerial Income  0.2148613 0
## 5841                         Sex..Male Income  0.2159802 0
## 5842                       CapitalGain Income  0.2233288 0
## 5844                         HoursWeek Income  0.2296891 0
## 5779                               Age Income  0.2340371 0
## 5806                  Education.Number Income  0.3351540 0
## 5829             Relationship..Husband Income  0.4010353 0
## 5809 MaritalStatus..Married.civ.spouse Income  0.4446962 0
``````

We save the most correlated features to the `bestSub` variable:

``````bestSub <- as.character(selectedSub\$i)
``````

Finally we plot the highly correlated pairs using the {psych} package’s `pair.panels` plot (this can also be done on the original data as `pair.panels` can handle factor and character variables):

``````library(psych)
``````

The pairs plot, and in particular the last `Income` column, tell us a lot about our data set. Being `never married` is the most negatively correlated with income over \$50,000/year and `Hours Worked` and `Age` are the most postively correlated.

Things to keep in mind

• If you have a huge number of features in your data set, then be ready for extra computing time,
• and don't bother plotting it via ``pair.panels``, it will end up hanging R.
• ``````library(RCurl) # download https data
x <- getURL(urlfile, ssl.verifypeer = FALSE)

# if the above getURL command fails, try this:

'MaritalStatus','Occupation','Relationship','Race',
'Sex','CapitalGain','CapitalLoss','HoursWeek',
'NativeCountry','Income')

library(caret)
dmy <- dummyVars(" ~ .", data = adults)

## Correlation matrix with p-values. See http://goo.gl/nahmV for documentation of this function
cor.prob <- function (X, dfr = nrow(X) - 2) {
R <- cor(X, use="pairwise.complete.obs")
above <- row(R) < col(R)
r2 <- R[above]^2
Fstat <- r2 * dfr/(1 - r2)
R[above] <- 1 - pf(Fstat, 1, dfr)
R[row(R) == col(R)] <- NA
R
}

## Use this to dump the cor.prob output to a 4 column matrix
## with row/column indices, correlation, and p-value.
## See StackOverflow question: http://goo.gl/fCUcQ
flattenSquareMatrix <- function(m) {
if( (class(m) != "matrix") | (nrow(m) != ncol(m))) stop("Must be a square matrix.")
if(!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
ut <- upper.tri(m)
data.frame(i = rownames(m)[row(m)[ut]],
j = rownames(m)[col(m)[ut]],
cor=t(m)[ut],
p=m[ut])
}

corList <- corMasterList[order(-abs(corMasterList\$cor)),]

corList <- corMasterList[order(corMasterList\$cor),]
selectedSub <- subset(corList, (abs(cor) > 0.2 & j == 'Income'))