# Feature Hashing (a.k.a. The Hashing Trick) With R

**Packages Used in this Walkthrough**

- **{FeatureHashing}** - Creates a Model Matrix via Feature Hashing with a Formula Interface
- **{RCurl}** - General network (HTTP/FTP/...) client interface for R
- **{caret}** - Classification and Regression Training
- **{glmnet}** - Lasso and elastic-net regularized generalized linear models

Feature hashing is a clever way of modeling data sets containing large amounts of factor and character data. It uses less memory and requires little pre-processing. In this walkthrough, we model a large healthcare data set by first using **dummy variables** and then **feature hashing**.

What’s the big deal? Well, normally, one would:

- **Dummify all factor, text, and unordered categorical data**: This creates a new column for each unique value and tags a binary value for whether or not an observation contains that particular value. For large data sets, this can drastically increase the dimensional space (adding many more columns).
- **Drop levels**: For example, taking the top x% most popular levels and neutralizing the rest, grouping levels by theme using **string distance**, or simply ignoring factors too large for a machine's memory.
- **Use a sparse matrix**: This can mitigate the size of these dummied data sets by dropping zeros.
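To make the first step concrete, here is a minimal sketch of dummifying a factor with base R's `model.matrix()` (the small data frame below is made up for illustration):

```r
# Toy data frame with one factor column (made up for illustration)
df <- data.frame(
  diag = factor(c("flu", "cold", "flu", "asthma")),
  age  = c(34, 51, 29, 62)
)

# model.matrix() turns each factor level into its own 0/1 column;
# the "- 1" drops the intercept so every level gets a column
dummies <- model.matrix(~ diag + age - 1, data = df)
print(dummies)

# diag has 3 levels, plus the numeric age column: 4 columns total.
# With tens of thousands of unique levels, this matrix becomes
# extremely wide -- the problem feature hashing sidesteps.
ncol(dummies)
```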

But a more complete solution, especially when there are tens of thousands of unique values, is the **‘hashing trick’**.
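To see why hashing avoids the dictionary of levels entirely, here is a toy sketch of the trick in base R. The hash function below is a deliberately simple stand-in for illustration; FeatureHashing itself uses MurmurHash3:

```r
# Map a feature value directly to a column index by hashing it,
# rather than first building a lookup table of all unique levels
hash_size <- 2^4  # the number of columns is fixed up front

# Toy hash: sum of character codes modulo the table size
# (a stand-in for the real MurmurHash3 used by FeatureHashing)
toy_hash <- function(value) {
  sum(utf8ToInt(value)) %% hash_size + 1
}

values <- c("flu", "cold", "flu", "asthma")

# Build a hash_size-wide indicator matrix, one row per observation
m <- matrix(0, nrow = length(values), ncol = hash_size)
for (i in seq_along(values)) {
  m[i, toy_hash(values[i])] <- 1
}

# Identical values always land in the same column, so no lookup
# table is needed, and a value never seen during training still
# maps to a valid column at scoring time
toy_hash("flu") == toy_hash("flu")  # TRUE
```

Note that with a small `hash_size`, two distinct values can collide into the same column; in practice the hash space is made large enough (e.g. 2^18) that collisions rarely hurt model accuracy.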

**Wush Wu** created the FeatureHashing package, available on CRAN. According to the package’s introduction on CRAN:

> “Feature hashing, also called the hashing trick, is a method to transform features to vector. Without looking up the indices in an associative array, it applies a hash function to the features and uses their hash values as indices directly. The method of feature hashing in this package was proposed in Weinberger et al. (2009). The hashing algorithm is the murmurhash3 from the digest package. Please see the README.md for more information.”

Feature hashing has numerous advantages in modeling and machine learning. Because it works with hashed address locations instead of the actual data, it processes data only when needed: the first value encountered in a column becomes a feature with one level; when a different value appears, the feature grows to two levels, and so on. It also requires no pre-processing of factor data; you feed it your factors in their raw state. This approach takes far less memory than a fully scanned and processed data set. There is plenty of theory out there for those who want a deeper understanding.
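A minimal sketch of the package's formula interface follows. The data frame is made up for illustration; `hashed.model.matrix()` and its `hash.size` argument are from the FeatureHashing package as documented on CRAN:

```r
library(FeatureHashing)

# Toy data frame standing in for the healthcare data (made up)
df <- data.frame(
  diag = c("flu", "cold", "flu", "asthma"),
  age  = c(34, 51, 29, 62)
)

# Hash the raw factors straight into a sparse model matrix.
# hash.size fixes the column count up front (here 1024), no matter
# how many unique levels diag turns out to have -- no dummification
# or level-dropping required.
m <- hashed.model.matrix(~ diag + age, data = df,
                         hash.size = 2^10, transpose = FALSE)

dim(m)    # one row per observation, hash.size columns
class(m)  # a sparse dgCMatrix, ready for glmnet or xgboost
```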

Its disadvantages include slower model runs and a certain obfuscation of the data: hashed column indices no longer reveal which original feature values they represent.

Manuel Amunategui - Follow me on Twitter: @amunategui