Predict Stock-Market Behavior using Markov Chains and R

We apply Markov Chains to map and understand stock-market behavior using the R programming language. By using 2 transition matrices instead of one, we are able to weigh the probability of a binary outcome.
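
As a rough illustration of the two-matrix idea, here is a minimal sketch in Python rather than R. It is one plausible reading of the summary, not the post's own code: the state labels, the toy sequences, the smoothing value, and the helper names `transition_matrix` and `log_likelihood` are all assumptions made up for this example. One matrix is built from sequences that preceded an up move, another from sequences that preceded a down move, and a new sequence is scored against both.

```python
import math

def transition_matrix(sequences, states, smoothing=1.0):
    """Row-normalized transition probabilities with additive smoothing."""
    counts = {a: {b: smoothing for b in states} for a in states}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in states}
        for a in states
    }

def log_likelihood(seq, matrix):
    """Log-probability of the observed transitions under one matrix."""
    return sum(math.log(matrix[a][b]) for a, b in zip(seq, seq[1:]))

# Hypothetical binned price states: Low, Medium, High
states = ["L", "M", "H"]

# Toy sequences that were followed by an up move vs. a down move (made up)
up_sequences = [["L", "M", "H", "H"], ["M", "H", "H", "H"], ["L", "L", "M", "H"]]
down_sequences = [["H", "M", "L", "L"], ["H", "H", "M", "L"], ["M", "L", "L", "L"]]

up_matrix = transition_matrix(up_sequences, states)
down_matrix = transition_matrix(down_sequences, states)

# Score a new sequence against both matrices; a positive score favors "up"
new_seq = ["L", "M", "H"]
score = log_likelihood(new_seq, up_matrix) - log_likelihood(new_seq, down_matrix)
print("log-odds in favor of an up move:", round(score, 3))
```

Comparing the two log-likelihoods is what turns a pair of transition matrices into a weighted binary call: the sign gives the direction and the magnitude gives a sense of how strongly the sequence resembles one group over the other.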

August 31, 2016 | Tags : exploring modeling r


Big Data Surveillance: Use EC2, PostgreSQL and Python to Download all Hacker News Data!

We'll first look at the Algolia API and Max Woolf's scripts to download all Hacker News data using EC2 and PostgreSQL, then we'll look at the Firebase/Hacker News API web service to pull specific content by ID.

July 20, 2016 | Tags : exploring python ec2 aws


Check out my Udemy.com classes - coupons!

Check out my in-depth R classes on Udemy.com - Learn more and support the Data Exploration and Machine Learning Walk-Throughs. Click link for specials and discounts. Thanks for your support!

June 20, 2016 | Tags : udemy r quantmod spark aws ec2 exploring modeling


The Peter Norvig Magic Spell Checker in R

Peter Norvig, Director of Research at Google, offers a clever way for any of us to create a good spell checker with nothing more than a few lines of code and some text data.

June 16, 2016 | Tags : modeling r python


Actionable Insights: Getting Variable Importance at the Prediction Level in R

Here is an easy way to get the top and bottom features contributing to a prediction. This affords a level of transparency to the report reader in understanding why the model chose a particular probability for an observation.

May 2, 2016 | Tags : exploring modeling r


Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R

In this post we'll look at extracting AUC scores from survival models, blending and ensembling random forest survival with gradient boosting classification models, and measuring improvements on time-based predictions.

March 19, 2016 | Tags : modeling r


Anomaly Detection: Increasing Classification Accuracy with H2O's Autoencoder and R

Use H2O's anomaly detection with R to separate data into easy and hard to model subsets and gain predictive insight.

January 11, 2016 | Tags : h2o aws ec2 modeling r


H2O & RStudio Server on Amazon Web Services (AWS), the Easy Way!

See how easy it is to install H2O and RStudio Server on Amazon Web Services (AWS) from scratch. No need for customized AMIs or third-party tools - no training wheels here! And the best part is that we can do everything from the Amazon Web Services wizard, with no need to tunnel or use PuTTY anywhere!

December 27, 2015 | Tags : rstudioserver r ec2 aws h2o


Analyze Classic Works of Literature from Around the World with Project Gutenberg and R

Project Gutenberg offers an easy way to download over 50,000 classic works of literature from around the world in digital format using the R language.

December 12, 2015 | Tags : nlp r


Speak Like a Doctor - Use Natural Language Processing to Predict Medical Words in R

Using R, natural language processing (NLP), a medical corpus, and a Shiny application, we build an interactive tool to predict what a doctor will say next.

November 22, 2015 | Tags : nlp r


Supercharge R with Spark: Getting Apache's SparkR Up and Running on Amazon Web Services (AWS)

See how easy it is to set up a few SparkR clusters and to control them from RStudio. In this first installment, we'll set up multiple clusters on Amazon's AWS EC2 and control them via RStudio.

September 30, 2015 | Tags : rstudioserver modeling r ec2 aws


R and Excel: Making Your Data Dumps Pretty with XLConnect

When it comes to exporting data, one has many formats to choose from. But if you're looking for something more sophisticated than a comma-delimited file but aren't ready for an off-the-shelf content-management system, then Excel may be what you need in presenting content in a more digestible format.

July 7, 2015 | Tags : exploring visualizing r


Going from an Idea to a Pitch: Hosting your Python Application using Flask and Amazon Web Services (AWS)

This walk-through demonstrates how easy it is to transform an idea into a web application. This is for those who want to quickly pitch their application to the world without getting bogged down by technical details. This is for the weekend warrior. If the application is a success, people with real skills will be brought in to do the job right; in the meantime, we want it fast, cheap and easy. We'll use Python, Flask, and EC2 Amazon Web Services to migrate a program into a web application.

June 12, 2015 | Tags : python ec2 aws


Getting PubMed Medical Text with R and Package {RISmed}

PubMed is a great source of medical literature. If you are working on a Natural Language Processing (NLP) project and need hundreds or thousands of topic-based medical texts, the RISmed package can simplify and automate that process.

April 17, 2015 | Tags : exploring r


Find Variable Importance for any Model - Prediction Shuffling with R

You model and predict once to get a benchmark score, then predict hundreds of times for each variable while randomizing it each time. If randomizing the variable hurts the model's benchmark score, then it's an important variable. If nothing changes, then it's a useless variable.
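
The shuffling procedure described above can be sketched in a few lines. This is a minimal Python illustration rather than the post's R code; the toy model, the random data, and the helper names `accuracy` and `shuffle_importance` are assumptions invented for the example.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows where the model's prediction matches the label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(rows)

def shuffle_importance(model, rows, labels, n_rounds=100, seed=42):
    """Average drop from the benchmark score when each column is shuffled."""
    rng = random.Random(seed)
    benchmark = accuracy(model, rows, labels)
    drops = []
    for col in range(len(rows[0])):
        total_drop = 0.0
        for _ in range(n_rounds):
            # Shuffle one column, leaving every other column untouched
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [v] + row[col + 1:]
                        for row, v in zip(rows, shuffled)]
            total_drop += benchmark - accuracy(model, permuted, labels)
        drops.append(total_drop / n_rounds)
    return benchmark, drops

# Toy stand-in for a fitted model: it uses column 0 and ignores column 1
def toy_model(row):
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(0)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [toy_model(r) for r in rows]

benchmark, drops = shuffle_importance(toy_model, rows, labels, n_rounds=20)
print("benchmark:", benchmark)
print("average drop per column:", drops)
```

Because the toy model never looks at the second column, shuffling it leaves the score unchanged, while shuffling the first column costs a large share of the accuracy. The model is only scored, never refit, which is why the approach works with any model that can predict.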

March 27, 2015 | Tags : modeling r


Bagging / Bootstrap Aggregation with R

Bagging is the not-so-secret edge of the competitive modeler. By sampling and modeling a training data set hundreds of times and averaging its predictions, you may just get that accuracy boost that puts you above the fray.
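
To make the resample-and-average loop concrete, here is a small Python sketch (not the post's R code). The simple least-squares base learner, the synthetic data, and the names `fit_line` and `bagged_predict` are assumptions chosen to keep the example self-contained.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope

def bagged_predict(xs, ys, new_x, n_models=200, seed=1):
    """Fit one model per bootstrap sample and average the predictions."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        # Bootstrap: draw n row indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a + b * new_x)
    return sum(preds) / len(preds)

# Synthetic training data: y = 2x + 1 plus Gaussian noise
noise = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + noise.gauss(0, 0.5) for x in xs]

prediction = bagged_predict(xs, ys, new_x=3.0)
print("bagged prediction at x = 3.0:", round(prediction, 3))
```

A straight line is a stable learner, so bagging adds little here; the gains usually come with high-variance models such as deep trees. The loop itself stays the same whichever base model is plugged in.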

March 7, 2015 | Tags : modeling r


Feature Hashing (a.k.a. The Hashing Trick) With R

Feature hashing is a clever way of modeling data sets containing large amounts of factor and character data. It uses less memory and requires little pre-processing. In this walkthrough, we model a large healthcare data set by first using dummy variables and then feature hashing.

February 21, 2015 | Tags : modeling r


Yelp, httr and a Romantic Trip Across the United States, One Florist at a Time

The title says it all: we are going to use Yelp to cross the United States from San Francisco, CA to New York City, NY, and be within 60 miles of a florist at all times.

January 14, 2015 | Tags : visualizing exploring r


Quantifying the Spread: Measuring Strength and Direction of Predictors with the Summary Function

Use the summary() function to quickly and intuitively measure predictors. By splitting the data into two sets, one for each outcome, and summarizing them individually, we can plot and measure behaviors towards the outcome variable. Simple, easy, and fast!

December 27, 2014 | Tags : modeling r


Downloading Data from Google Trends And Analyzing It With R

In this walkthrough, I introduce Google Trends by querying it directly through the web, downloading a comma-delimited file of the results, and analyzing it in R.

December 17, 2014 | Tags : exploring visualizing r


Using String Distance {stringdist} To Handle Large Text Factors, Cluster Them Into Supersets

{stringdist} can help make sense of large, text-based factor variables by clustering them into supersets. This approach preserves some of the content's substance without having to resort to full-on natural language processing.

November 30, 2014 | Tags : exploring visualizing r


SMOTE - Supersampling Rare Events in R

Brief introduction of the SMOTE package and over-sampling imbalanced data sets. SMOTE uses bootstrapping and k-nearest neighbor to synthetically create additional observations.

November 13, 2014 | Tags : exploring modeling r


Let's Get Rich! See how {quantmod} And R Can Enrich Your Knowledge Of The Financial Markets!

See how easy it is to display great-looking current stock charts in 2 lines of code, and then use stock market data to build a complex market model.

November 10, 2014 | Tags : exploring modeling visualizing quantmod r


How To Work With Files Too Large For A Computer’s RAM? Using R To Process Large Data In Chunks

Using the function read.table(), we break a file into chunks in order to process them. This allows processing files of any size, beyond what the machine's RAM can handle.
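
The chunking idea is not specific to R. As a hedged sketch, the Python below reads a CSV a fixed number of rows at a time and keeps only a running total in memory; the file contents, the chunk size, and the helper name `iter_chunks` are assumptions for illustration. In R, the analogous approach passes `skip` and `nrows` to read.table().

```python
import csv
import itertools
import os
import tempfile

def iter_chunks(path, chunk_size):
    """Yield (header, rows) pairs with at most chunk_size rows each."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        while True:
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                break
            yield header, chunk

# Build a small demo file; a real use case would point at a file too big for RAM
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "value"])
    for i in range(1, 1001):
        writer.writerow([i, i * 2])
    demo_path = tmp.name

# Only one chunk of rows is held in memory at any time
total = 0
n_chunks = 0
for header, chunk in iter_chunks(demo_path, chunk_size=300):
    col = header.index("value")
    total += sum(int(row[col]) for row in chunk)
    n_chunks += 1

os.remove(demo_path)
print("chunks read:", n_chunks, "sum of value column:", total)
```

Any aggregate that can be updated incrementally (sums, counts, minima, maxima) fits this pattern; statistics that need the whole data set at once require a different strategy.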

November 4, 2014 | Tags : exploring r


Predicting Multiple Discrete Values with Multinomials, Neural Networks and the {nnet} Package

Using R and the multinom function from the nnet package, we can easily predict discrete values (factors) of more than 2 levels. We also use Repeated Cross Validation to get an accurate model score and to understand the importance of allowing the model to converge (reaching global minima).

November 1, 2014 | Tags : modeling r


Modeling 101 - Predicting Binary Outcomes with R, gbm, glmnet, and {caret}

This walkthrough shows how to easily model binary outcomes using caret models, how to evaluate the predictions, and how to display variable importance.

October 22, 2014 | Tags : visualizing modeling r


Modeling Ensembles with R and {caret}

If you can model, then you can model ensembles! It’s literally as simple as running multiple R models on the same data, collecting the predictions, and blending them using a final model. And, if all goes well, you should enjoy a bump in AUC score!

October 18, 2014 | Tags : modeling r


Reducing High Dimensional Data with Principal Component Analysis (PCA) and prcomp

In this R walkthrough, we'll see how PCA can reduce a 1000+ variable data set into 10 variables and barely lose accuracy! This is incredible, and every time I play around with this, I still get amazed!

October 13, 2014 | Tags : modeling r


The Sparse Matrix and {glmnet}

Walkthrough of sparse matrices in R and basic use of the glmnet package. This will show how to create them, find the best probabilities through the glmnet model, and how a sparse matrix deals with categorical values.

October 8, 2014 | Tags : modeling r


Brief Walkthrough Of The dummyVars Function From {caret}

The dummyVars function streamlines the creation of dummy variables by quickly hunting down character and factor variables and transforming them into binaries, with or without full rank.

October 2, 2014 | Tags : exploring r


Ensemble Feature Selection On Steroids: {fscaret} Package

Give fscaret an ensemble of models and some data, and it will have the ensemble vote on the importance of each feature to find the strongest ones. In this walkthrough, we use R and the classic Titanic data set to predict survivorship.

October 1, 2014 | Tags : modeling r


Mapping The United States Census With {ggmap}

ggmap enables you to easily map data anywhere around the world as long as you give it geographical coordinates. Here we overlay census data over a Google map of the United States.

September 29, 2014 | Tags : visualizing r


Using Correlations To Understand Your Data

A great way to explore new data in R is to use a pairwise correlation matrix. This will pair every combination of your variables and measure the correlation between them.
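
For readers who want to see the pairing spelled out, here is a minimal Python sketch of a pairwise correlation table (in R, `cor()` on a numeric data frame produces the full matrix in one call). The three toy columns and the helper name `pearson` are assumptions made up for this example.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data: "b" rises with "a", "c" falls as "a" rises
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 4, 3, 2, 1],
}

# Every unordered pair of variables gets one correlation score
pairs = {
    (p, q): pearson(columns[p], columns[q])
    for p, q in combinations(columns, 2)
}

for (p, q), r in pairs.items():
    print(p, q, round(r, 3))
```

Scores near 1 or -1 flag variables that move together or in opposition, which is usually the first thing worth checking in an unfamiliar data set.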

September 27, 2014 | Tags : exploring visualizing r


Brief Guide On Running RStudio Server On Amazon Web Services

Steps you through installing pre-configured AMIs with RStudio Server on AWS EC2, interacting with the web interface, and uploading and downloading files to/from your instance.

May 15, 2014 | Tags : rstudioserver r ec2 spark