Data Exploration & Machine Learning, Hands-on

Practical walkthroughs on machine learning, data exploration and finding insight.

Databricks, SparkR and Distributed Naive Bayes Modeling

SparkR

One of the recent additions to SparkR is the Naive Bayes classification model. It's simple, fast and accurate - perfect for working with large data sets in distributed environments - yep, perfect for Spark! Here is a look at the model in action with some of its limitations and workarounds.

R and Azure ML - Your One-Stop Modeling Pipeline in The Cloud!

r and azure

At the risk of being accused of only using Amazon Web Services, here is a look at modeling using Microsoft Azure Machine Learning Studio along with the R programming language. It is chock-full of data munging, modeling, and delivery functions!

Get Your "all-else-held-equal" Odds-Ratio Story for Non-Linear Models!

pseudo coefficients

On one hand we have tree-based classifiers and deep-belief networks; on the other, linear regression models. What the latter lack in coolness and precision, they make up for in transparency and actionability. People just love their coefficients and odds ratios. Here is an approach to extract odds out of tree-based classifiers so you too can say 'all else held equal, a one-unit change in x results in a pseudo-coefficient change in y'. The bonus is that we capture non-linear movements - this can yield a lot of intelligence from your variables!

Predict Stock-Market Behavior using Markov Chains and R

Markov

We apply Markov chains to map and understand stock-market behavior using the R programming language. By using two transition matrices instead of one, we can weigh the probability of a binary outcome.
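The post builds this in R, but the two-matrix idea is language-agnostic: estimate one transition matrix from sequences that preceded each outcome, then ask which matrix makes a new sequence more likely. Here is a minimal Python sketch; the toy event sequences and function names are invented for illustration, not taken from the post.

```python
import math
from collections import defaultdict

def transition_matrix(sequences):
    """Estimate first-order Markov transition probabilities from event sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def log_likelihood(seq, matrix, floor=1e-6):
    """Score a sequence under a matrix; logs avoid underflow, floor handles unseen moves."""
    return sum(math.log(matrix.get(a, {}).get(b, floor))
               for a, b in zip(seq, seq[1:]))

# Toy event sequences that preceded "up" vs. "down" market days (made-up data)
up_seqs   = [list("AABBA"), list("ABBBA"), list("AABBB")]
down_seqs = [list("BBAAB"), list("BAAAB"), list("BBAAA")]

m_up, m_down = transition_matrix(up_seqs), transition_matrix(down_seqs)
new_seq = list("AABB")
prediction = "up" if log_likelihood(new_seq, m_up) > log_likelihood(new_seq, m_down) else "down"
```

Comparing log-likelihoods rather than raw probability products keeps long sequences from underflowing to zero.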

Big Data Surveillance: Use EC2, PostgreSQL and Python to Download all Hacker News Data!

Big Data Surveillance

We'll first look at the Algolia API and Max Woolf's scripts to download all Hacker News data using EC2 and PostgreSQL, then we'll look at the Firebase/Hacker News API web service to pull specific content by ID.

The Peter Norvig Magic Spell Checker in R

Peter Norvig Magic Spell Checker

Peter Norvig, Director of Research at Google, offers a clever way for any of us to create a good spell checker with nothing more than a few lines of code and some text data.

Actionable Insights: Getting Variable Importance at the Prediction Level in R

Actionable Insights

Here is an easy way to get the top and bottom features contributing to a prediction. This affords a level of transparency to the report reader in understanding why the model chose a particular probability for an observation.

Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R

Survival Ensembles

In this post we'll look at extracting AUC scores from survival models, blending and ensembling random forest survival with gradient boosting classification models, and measuring improvements on time-based predictions.

Anomaly Detection: Increasing Classification Accuracy with H2O's Autoencoder and R

Anomaly Detection

Use H2O's anomaly detection with R to separate data into easy and hard to model subsets and gain predictive insight.

H2O & RStudio Server on Amazon Web Services (AWS), the Easy Way!

H2O on AWS

See how easy it is to install H2O and RStudio Server on Amazon Web Services (AWS) from scratch. No need for customized AMIs or third-party tools - no training wheels here! Best of all, we can do everything from the Amazon Web Services wizard - no need to tunnel or PuTTY anywhere!

Analyze Classic Works of Literature from Around the World with Project Gutenberg and R

Project Gutenberg

Project Gutenberg offers an easy way to download over 50,000 classic works of literature from around the world in digital format using the R language.

Speak Like a Doctor - Use Natural Language Processing to Predict Medical Words in R

NLP

Using R, natural language processing (NLP), a medical corpus, and a Shiny application, we build an interactive tool to predict what a doctor will say next.

Supercharge R with Spark: Getting Apache's SparkR Up and Running on Amazon Web Services (AWS)

SparkR

See how easy it is to set up a few SparkR clusters and to control them from RStudio. In this first installment, we'll set up multiple clusters on Amazon's AWS EC2 and control them via RStudio.

R and Excel: Making Your Data Dumps Pretty with XLConnect

R and Excel

When it comes to exporting data, you have many formats to choose from. But if you're looking for something more sophisticated than a comma-delimited file yet aren't ready for an off-the-shelf content-management system, Excel may be what you need to present content in a more digestible format.

Going from an Idea to a Pitch: Hosting your Python Application using Flask and Amazon Web Services (AWS)

Flask and AWS

This walk-through demonstrates how easy it is to transform an idea into a web application. It is for those who want to quickly pitch their application to the world without getting bogged down by technical details - for the weekend warrior. If the application is a success, people with real skills will be brought in to do the job right; in the meantime, we want it fast, cheap, and easy. We'll use Python, Flask, and Amazon Web Services EC2 to migrate a program into a web application.

Getting PubMed Medical Text with R and Package {RISmed}

PubMed Medical Text

PubMed is a great source of medical literature. If you are working on a natural language processing (NLP) project and need hundreds or thousands of topic-based medical texts, the RISmed package can simplify and automate that process.

Find Variable Importance for any Model - Prediction Shuffling with R

Variable Importance

You model and predict once to get a benchmark score, then predict hundreds of times per variable, randomizing that variable each time. If randomizing a variable hurts the model's benchmark score, it's an important variable; if nothing changes, it's a useless one.
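The post implements this in R, but the shuffling loop itself is simple enough to sketch in a few lines of Python. The toy "model" below just echoes column 0, so column 0 should come out important and column 1 worthless; all names and data here are invented for illustration.

```python
import random

def permutation_importance(score, X, y, n_repeats=100, seed=42):
    """Shuffle one column at a time; the average drop from the
    benchmark score measures that column's importance."""
    rng = random.Random(seed)
    baseline = score(X, y)
    importances = {}
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - score(X_perm, y))
        importances[col] = sum(drops) / n_repeats
    return importances

# Toy "model": accuracy of using column 0 directly as the prediction,
# so column 0 matters and column 1 is pure noise.
def score(X, y):
    return sum(int(row[0] == t) for row, t in zip(X, y)) / len(y)

y = [0, 1] * 10
X = [[t, random.Random(i).randint(0, 1)] for i, t in enumerate(y)]
imp = permutation_importance(score, X, y)
```

Because only predictions are re-run (never re-training), this works with any fitted model that exposes a scoring function.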

Bagging / Bootstrap Aggregation with R

Bagging & Bootstrap

Bagging is the not-so-secret edge of the competitive modeler. By sampling and modeling a training data set hundreds of times and averaging its predictions, you may just get that accuracy boost that puts you above the fray.
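The post uses R; as a language-agnostic sketch of the bootstrap-and-average loop, here is a minimal Python version. The deliberately trivial "model" (predict the mean target of each bootstrap sample) and all names are invented for illustration.

```python
import random
import statistics

def bagging(train, test_x, fit, predict, n_bags=100, seed=7):
    """Draw n_bags bootstrap samples of the training data, fit a model
    on each, and average the per-model predictions for each test point."""
    rng = random.Random(seed)
    all_preds = []
    for _ in range(n_bags):
        sample = [rng.choice(train) for _ in train]   # sample with replacement
        model = fit(sample)
        all_preds.append([predict(model, x) for x in test_x])
    return [statistics.mean(col) for col in zip(*all_preds)]

# Deliberately tiny "model": predict the mean target of the bootstrap sample.
fit = lambda sample: statistics.mean(y for _, y in sample)
predict = lambda model, x: model

train = [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0), (5, 5.0)]
preds = bagging(train, [0], fit, predict)
```

Swapping in a real learner only means changing `fit` and `predict`; the sampling and averaging stay the same.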

Feature Hashing (a.k.a. The Hashing Trick) With R

Feature Hashing

Feature hashing is a clever way of modeling data sets containing large amounts of factor and character data. It uses less memory and requires little pre-processing. In this walkthrough, we model a large healthcare data set by first using dummy variables and then feature hashing.
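The walkthrough uses an R package for this; the trick itself fits in a few lines, sketched here in Python. The bucket count, sample tokens, and function name are invented for illustration.

```python
import hashlib

def hash_features(tokens, n_buckets=32):
    """Map arbitrary string features into a fixed-length numeric vector.
    A second hash bit picks the sign so colliding tokens don't always
    pile up in the same direction."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_buckets
        sign = 1.0 if int(digest[0], 16) % 2 == 0 else -1.0
        vec[bucket] += sign
    return vec

# One row of factor-style data becomes a fixed-width numeric row,
# with no dictionary of all possible levels required.
row = ["state=CA", "diagnosis=J45", "age_band=40-49"]
v = hash_features(row)
```

Unlike dummy variables, no pass over the data is needed to enumerate levels first, which is why hashing shines on high-cardinality factors.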

Yelp, httr and a Romantic Trip Across the United States, One Florist at a Time

Yelp & R

The title says it all: we are going to use Yelp to cross the United States from San Francisco, CA to New York City, NY, staying within 60 miles of a florist at all times.

Quantifying the Spread: Measuring Strength and Direction of Predictors with the Summary Function

Measuring Predictors

Use the summary() function to quickly and intuitively measure predictors. By splitting the data into two sets, one for each outcome, and summarizing each individually, we can plot and measure a variable's behavior towards the outcome. Simple, easy, and fast!

Downloading Data from Google Trends And Analyzing It With R

Google Trends

In this walkthrough, I introduce Google Trends by querying it directly through the web, downloading a comma-delimited file of the results, and analyzing it in R.

Using String Distance {stringdist} To Handle Large Text Factors, Cluster Them Into Supersets

stringdist

{stringdist} can help make sense of large, text-based factor variables by clustering them into supersets. This approach preserves some of the content's substance without having to resort to full-on, natural language processing.

SMOTE - Supersampling Rare Events in R

SMOTE

A brief introduction to SMOTE and over-sampling imbalanced data sets. SMOTE uses bootstrapping and k-nearest neighbors to synthetically create additional observations.
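The post uses R's SMOTE implementation; the core interpolation idea can be sketched from scratch in Python. This is a simplified stand-in, not the package's algorithm: the data, k value, and function name are all invented for illustration.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=1):
    """Generate synthetic minority-class points by interpolating between a
    random minority point and one of its k nearest minority neighbors --
    the core idea behind SMOTE, sketched without the original package."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbors = sorted((q for q in minority if q != p),
                           key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(neighbors)
        t = rng.random()  # how far along the segment to place the new point
        synthetic.append(tuple(pi + t * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=10)
```

Because new points land on segments between real neighbors, they stay inside the minority class's region instead of being plain duplicates.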

Let's Get Rich! See how {quantmod} And R Can Enrich Your Knowledge Of The Financial Markets!

Quantmod

See how easy it is to display great-looking, current stock charts in two lines of code, then download stock market data and use it all to build a complex market model.

How To Work With Files Too Large For A Computer’s RAM? Using R To Process Large Data In Chunks

Dealing with large data sets

Using the read.table() function, we break a file into chunks and process them one at a time. This lets us handle files of any size, well beyond what the machine's RAM could hold at once.
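The post loops read.table() with nrows= and skip= in R; the same pattern in Python looks like the sketch below. The function name, chunk size, and CSV layout are assumptions for illustration.

```python
def process_in_chunks(path, chunk_size=100_000, process=len):
    """Read a comma-delimited file a fixed number of rows at a time so
    memory use stays flat no matter how large the file is."""
    results = []
    with open(path) as f:
        f.readline()              # consume the header once, not per chunk
        chunk = []
        for line in f:
            chunk.append(line.rstrip("\n").split(","))
            if len(chunk) == chunk_size:
                results.append(process(chunk))
                chunk = []
        if chunk:                 # last, possibly short, chunk
            results.append(process(chunk))
    return results
```

Only one chunk is ever held in memory; `process` can aggregate counts, fit partial models, or write transformed rows back out.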

Predicting Multiple Discrete Values with Multinomials, Neural Networks and the {nnet} Package

Predicting Multiple Discrete Values

Using R and the multinom function from the nnet package, we can easily predict discrete values (factors) with more than two levels. We also use repeated cross-validation to get an accurate model score and to understand the importance of letting the model converge (reaching a global minimum).

Modeling 101 - Predicting Binary Outcomes with R, gbm, glmnet, and {caret}

Caret Variable Importance

This walkthrough shows how to easily model binary outcomes using caret models, how to evaluate the predictions, and how to display variable importance.

Reducing High Dimensional Data with Principal Component Analysis (PCA) and prcomp

Principal Component Analysis

In this R walkthrough, we'll see how PCA can reduce a 1000+ variable data set to just 10 variables while barely losing accuracy. It's incredible; every time I play around with this, I'm still amazed!

The Sparse Matrix and {glmnet}

sparse glmnet

A walkthrough of sparse matrices in R and basic use of the glmnet package: how to create a sparse matrix, how it handles categorical values, and how to get predicted probabilities out of a glmnet model.

Brief Walkthrough Of The dummyVars Function From {caret}

Caret dummyVars Function

The dummyVars function streamlines the creation of dummy variables by quickly hunting down character and factor variables and transforming them into binaries, with or without full rank.
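The caret function does this over a whole data frame; the underlying expansion is easy to show in a few lines of Python. This sketch only mirrors the idea (including a full-rank option that drops one level); the function name and toy rows are invented for illustration.

```python
def dummy_vars(rows, column, full_rank=False):
    """Expand one categorical column into binary indicator columns.
    With full_rank=True the first level is dropped, echoing the idea
    behind full-rank dummy encoding."""
    levels = sorted({row[column] for row in rows})
    if full_rank:
        levels = levels[1:]   # drop one level to avoid perfect collinearity
    out = []
    for row in rows:
        expanded = dict(row)
        value = expanded.pop(column)
        for lvl in levels:
            expanded[f"{column}.{lvl}"] = int(value == lvl)
        out.append(expanded)
    return out

rows = [{"color": "red", "n": 1}, {"color": "blue", "n": 2}]
encoded = dummy_vars(rows, "color")
```

Dropping one level matters for linear models, where a full set of indicators is perfectly collinear with the intercept.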

Ensemble Feature Selection On Steroids: {fscaret} Package

fscaret

Give fscaret an ensemble of models and some data, and it will have the ensemble vote on the importance of each feature to find the strongest ones. In this walkthrough, we use R and the classic Titanic data set to predict survivorship.

Mapping The United States Census With {ggmap}

ggmap example

ggmap lets you easily map data anywhere in the world as long as you give it geographical coordinates. Here we overlay census data on a Google map of the United States.

Using Correlations To Understand Your Data

correlations

A great way to explore new data in R is to use a pairwise correlation matrix. This will pair every combination of your variables and measure the correlation between them.
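In R this is one call to cor(); to make the mechanics concrete, here is a small Python sketch that pairs every combination of columns by hand. The function names and toy data are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two numeric vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pair every combination of variables, like cor(df) in R."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

data = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
cm = correlation_matrix(data)
```

Here "y" doubles "x" (correlation near +1) while "z" reverses it (near -1), which is exactly the kind of structure a quick correlation scan surfaces in new data.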

Brief Guide On Running RStudio Server On Amazon Web Services

RStudio Server on AWS

Steps you through installing pre-configured AMIs with RStudio Server on AWS EC2, interacting with the web interface, and uploading and downloading files to/from your instance.