Data Exploration & Machine Learning, Hands-on


"

Recommended free walkthrough, check it out and boost your career:


Boost your data science career

"







Yelp, httr and a Romantic Trip Across the United States, One Florist at a Time

Practical walkthroughs on machine learning, data exploration and finding insight.

plot of chunk cross-country-florist

Resources


Packages Used in this Walkthrough



Just in time for Valentine’s Day, if you happen to be planning a trip across the United States and want to offer your companion a rose at every degree of latitude traveled, then this walkthrough is for you! The title says it all, we’re going to use the Yelp API to cross the United States from San Francisco, CA to New York City, NY, and be 60 miles from a florist at all times.

This walkthrough has two parts, first exploring the basics of Yelp and httr, and finally looking at a mapping script to hop from florist to florist across the country. The itinerary will be plotted on ggmap. If all you need is to pull locations via Yelp, then the first part is all you need.

Part 1: Yelp and httr

We’ll look at two different ways of using the Yelp Search API. For either of them, you will need to sign up for a free Yelp account and request API keys.

The simplest and probably most common way of using the API, is to request a term for a specific city or address. This is what the URL would look like for the term food around San Franciso:

http://api.yelp.com/v2/search?term=food&location=San+Francisco

Unfortunately, its a bit more complicated than simply dropping that in a browser. To authenticate (i.e. be recognized as a registered user), Yelp uses OAuth v1.0a protocol.

This is a cleaver of way authenticating you without exposing your account or password. On a very high level, it is done with a two-step process, first we ask Yelp for a signature key through a GET command, then we use that returned temporary key to retrieve our locations. Check out Yelp’s authentication docs for more details.

This is a lot of work to do manually, but the handy httr package from Hadley Wickham handles a lot of the authentication details automatically. It’s a really cool package as it can ‘shake hands’ with many HTTP API’s (Twitter, Facebook, Linkedin, Google, etc).

After you get your keys from Yelp, swap them for the x’s in the code below:

consumerKey = "xxxx"
consumerSecret = "xxxx"
token = "xxxx"
tokenSecret = "xxxx"



Load up the httr library and call the oauth_app function passing it a name tag (can be anything you want), and your consumer credentials. Next call up the sign_oauth1.0, pass it the myapp object along with your token credentials:

require(httr)
myApp <- oauth_app("YELP", key=consumerKey, secret=consumerSecret)
mySignature <- sign_oauth1.0(myApp, token=token, token_secret=tokenSecret)

Hi there, this is Manuel Amunategui- if you're enjoying the content, find more at ViralML.com





Searching by Locations

Your signature is ready to be used on a Yelp search query:

yelpURL <- paste0("http://api.yelp.com/v2/search/?limit=3&term=food&location=San%20Francisco")
locationData <- GET(yelpURL, mySignature)



That is it! We entered the term food in San Francisco and asked to limit it to 3 result. Let’s use the JSON parser, jsonlite to extract the results the results:

require(jsonlite)
locationDataContent = content(locationData)
locationList=jsonlite::fromJSON(toJSON(locationDataContent))
results <- data.frame(locationList)
results$businesses.name
## [[1]]
## [1] "Ike's Place"
## 
## [[2]]
## [1] "The Italian Homemade Company"
## 
## [[3]]
## [1] "The Codmother Fish and Chips"



results$businesses.name returned 3 businesses, but offers a whole lot more:

names(results)
##  [1] "region.span.latitude_delta"      "region.span.longitude_delta"    
##  [3] "region.center.latitude"          "region.center.longitude"        
##  [5] "total"                           "businesses.is_claimed"          
##  [7] "businesses.rating"               "businesses.mobile_url"          
##  [9] "businesses.rating_img_url"       "businesses.review_count"        
## [11] "businesses.name"                 "businesses.snippet_image_url"   
## [13] "businesses.rating_img_url_small" "businesses.url"                 
## [15] "businesses.menu_date_updated"    "businesses.phone"               
## [17] "businesses.snippet_text"         "businesses.image_url"           
## [19] "businesses.categories"           "businesses.display_phone"       
## [21] "businesses.rating_img_url_large" "businesses.menu_provider"       
## [23] "businesses.id"                   "businesses.is_closed"           
## [25] "businesses.location"



Obviously, the terms food and San Francisco are way too vague, but you get the idea. Next we’ll search by geo-spatial coordinates.

** Searching by Geo-Spatial Coordinates**

This is very close to what we did above except we specify the location down to the latitude and longitude of the desired area. How about 3 bars by Montgomery and Market Streets in San Francisco:

yelpURL <- paste0("http://api.yelp.com/v2/search/?limit=3&ll=37.788022,-122.399797&term=bar")
locationData <- GET(yelpURL, mySignature)
require(jsonlite)
locationDataContent = content(locationData)
locationList=jsonlite::fromJSON(toJSON(locationDataContent))
results <- data.frame(locationList)
results$businesses.name
## [[1]]
## [1] "83 Proof"
## 
## [[2]]
## [1] "Topsy's Fun House"
## 
## [[3]]
## [1] "Rickhouse"



The difference is syntactical, instead of building the URL with location, we do it with ll. We’ll use a third way of pulling locations in part 2 - geo-spatial bounded boxes. That’s it for part 1.

Part 2: Hoping from Florist to Florist Across the United States

Let me tell you, this was a blast to design and build! I’ll go over the pertinent parts but if you want to see this in action yourself, copy the code at the end of the walkthrough and replace the Yelp credentials with yours and you’re good to go!

Though not a full-proof application (matter of fact it can only travel west to east), it will take a departure and destination set of latitudes and longitudes, a search term and attempt to link both points with a path using your search term every 60 miles (approximately 1 geo-coordinate degree). Here, we will go from San Francisco, CA to New York City, NY, florist by florist.

ggmap has a handy function called geocode. You can give it addresses, zip codes, even famous monuments, and it will return latitude and longitude coordinates.

require(ggplot2)
require(ggmap)
startingpoint <- geocode(c("Fishersman's Wharf, San Francisco, CA"))
print(startingpoint)
##      lon   lat
## 1 -122.4 37.81
endingpoint <- geocode(c("Chelsea Piers, NY"))
print(endingpoint)
##      lon   lat
## 1 -74.01 40.75



Now that we know our starting and ending points, let’s start our first point manually then generalize the tasks through functions.

We’ll use ggmap to import a map of the United States from Google (see Mapping The United States Census With {ggmap} for more details on ggmap). ggplot2 accepts a data frame, let’s make one with our start and end points and see the journey that lies ahead. We’ll assign them large sizes to differentiate them from the traveling points:

latitudes <- c(startingpoint$lat, endingpoint$lat)
longitudes <- c(startingpoint$lon, endingpoint$lon)

objdf <- data.frame('latitude'=latitudes,
                    'longitude'=longitudes,
                    'size'=c(10,10),
                    'color'=c('red','red'))

# get a Google map
map<-get_map(location='united states', zoom=4, maptype = "terrain",
             source='google',color='color')

objMap <- ggmap(map) + geom_point(
        aes(x=longitude, y=latitude, size=size, 
            show_guide = TRUE, colour = color), 
        data=objdf, alpha=.8, na.rm = T) 

# call the objMap object to see the plot 
objMap

plot of chunk startendpoint

I will keep the details short on the next few functions. MapIt takes geo-spatial coordinates, a size, and the ggmap object. The function will return the ggmap object and we need to always pass the same ggmap back if we want to update and track the journey’s history.

MapIt <- function(latitude, longitude, size, objggmap) {
        objdf <- data.frame('latitude'=latitude,
                            'longitude'=longitude,
                            'tsize'=size)
        
        # add new point to ggmap
        require("ggplot2")
        require("ggmap")
        objggmap <- objggmap + geom_point(
                aes(x=longitude, y=latitude, size=tsize, 
                    show_guide = TRUE), 
                data=objdf, alpha=.8, na.rm = T)  
        return (objggmap)



GetBestYelpLocation, as its name implies, takes a vector of two sets of geo-spatial coordinates, the upper left point and the lower right one, along with the search term and sends it to Yelp. It returns a data frame with the name, city, rating, state, latitude and longitude of the top location using the search term.

GetBestYelpLocation <- function(boundedcoordinates, term) {
        limit <- 1
        
        YelpUrl <- paste0("http://api.yelp.com/v2/search/?limit=20&bounds=",
                          boundedcoordinates[1],",",boundedcoordinates[2],
                          "|",boundedcoordinates[3],",",boundedcoordinates[4],"&term=",term)
         
        locationdata=GET(YelpUrl, mySignature)
        listMembersContent = content(locationdata)
        listMembers=jsonlite::fromJSON(toJSON(listMembersContent))
        
        yelpResults = tryCatch({
                data.frame(listMembers)
        }, error = function(e) { 
                NULL
        })
        if (!is.null(yelpResults)) {
                set1 <- data.frame("name"=yelpResults$businesses.name,
                   'city'=yelpResults$businesses.location$city,
                   'rating'=yelpResults$businesses.rating, 
                   'latitude'=yelpResults$businesses.location$coordinate$latitude,
                   'longitude'=yelpResults$businesses.location$coordinate$longitude,
                   'state'=yelpResults$businesses.location$state_code)
                 
                for (ind in seq(1:nrow(set1))) {
                        if ((set1$latitude[ind] <= boundedcoordinates[1]) &
                                (set1$latitude[ind] >= boundedcoordinates[3]) &
                                (set1$longitude[ind] >= boundedcoordinates[2]) &
                                (set1$longitude[ind] <= boundedcoordinates[4]))
                                return(set1[ind,])
                }
        }
        return(NULL)
}



GetPossibleCoordinates takes the current position of our traveler and calculates its bounded box. The default bounded box unit is 1 degree (equates loosely around 60 miles), so we add 30 miles in all four directions from the current point to created the bounded square (making it a 1x1 degree box, or 60x60 miles box).

plot of chunk newmove

It returns the next two possible moves, both forward but one higher and one lower (see image below).

GetPossibleCoordinates<-function(lat, lon, area=1) {
        # this is hard to keep track but each location's next move
        # equates attempting to move forward one square, the same area
        # as the previous one but half-way above and half-way below the
        # previous one.
        halfarea <- area/2
    
        # new forward top square area
        topArea_upperleft_latitude <- lat+area
        topArea_upperleft_longitude <- lon+halfarea
        
        topArea_lowerright_latitude <- lat
        topArea_lowerright_longitude <- lon+(halfarea+area)
        
        # new forward bottom square area
        bottomArea_upperleft_latitude <- lat
        bottomArea_upperleft_longitude <- lon+halfarea
        
        bottomArea_lowerright_latitude <- lat-area
        bottomArea_lowerright_longitude <- lon+(halfarea+area)
          
        
        rownames <- c('new_top_area','new_bottom_area')
        latitude_point1 <- c(topArea_upperleft_latitude, bottomArea_upperleft_latitude)
        longitude_point1 <- c(topArea_upperleft_longitude, bottomArea_upperleft_longitude)
        latitude_point2 <- c(topArea_lowerright_latitude, bottomArea_lowerright_latitude)
        longitude_point2 <- c(topArea_lowerright_longitude, bottomArea_lowerright_longitude)
        
        return (data.frame('direction'=rownames,
                'latitude_point1'=latitude_point1,
                           'longitude_point1'=longitude_point1,
                           'latitude_point2'=latitude_point2,
                           'longitude_point2'=longitude_point2))
         
}



Finally, MakeAMove is the central relay that communicates with the other functions. It will get the next coordinates via GetPossibleCoordinates, figure out if it should try up or down first depending on where it is in relation to the end point’s latitude, and call on the Yelp API. If it finds a florist, it will report back the pertinent information such as the florist’s name, state, and coordinates. If it can’t find one, it will report so by returning NULL.

MakeAMove <- function(lat,lon,sizebox, searchTerm, lat_endPoint) {
        possibleCoordinates <- GetPossibleCoordinates(lat,lon,sizebox)
        
        # go up or down first depending on latitude of end point
        searchOrder <- c('new_top_area','new_bottom_area')
        if (lat > lat_endPoint)
                searchOrder <- c('new_bottom_area','new_top_area')
        
        for (directiontogo in searchOrder) {
                coords <- possibleCoordinates[possibleCoordinates$direction==directiontogo,]
                print(paste('Scanning',directiontogo, 'for', searchTerm,'...'))
                foundLocation <- GetBestYelpLocation(as.numeric(as.vector(coords[2:5])), searchTerm)
                if (!is.null(foundLocation))
                        return (foundLocation)
        }
        return (NULL)
}



Let’s finally look at the main logic. Here we set a search term florist, and a bounded square size unit, 1 degree in this case. madeIt is the boolean flag that keeps the while loop going until we make it close to our final destination (or if we fail after 100 tries). Sys.sleep(0.5) will pause the code between each loop, this is critical so you don’t flood the Yelp server and get yourself banned.

searchTerm <- 'florist'
squareSize <- 1 # setting the bounded area to a square 

# start trip info vectors - we need to remember where we've been!
currentLatitude <- startingpoint$lat
currentLongitude <- startingpoint$lon

# let ggmap keep track of where we've been
objMap <- MapIt(currentLatitude, currentLongitude, 1, objMap)

madeIt=FALSE
safetyCount <- 0
foundCount <- 0
while(madeIt == FALSE) {
        safetyCount <- safetyCount + 1
        foundLocation <- MakeAMove(lat=currentLatitude,
                                   lon=currentLongitude,
                                   sizebox=squareSize,
                                   searchTerm=searchTerm,
                                   endingpoint$lat)
        
        if (!is.null(foundLocation)) {
                print (paste('Our new',searchTerm, 'is', foundLocation$name, 'in', 
                             foundLocation$city, foundLocation$state,
                             'with a', foundLocation$rating, 'start rating'))
                currentLatitude <- foundLocation$latitude
                currentLongitude <- foundLocation$longitude
                
                # reset temporary setting
                squareSize <- 1
                
                # let ggmap keep track of where we've been
                objMap <- MapIt(currentLatitude, currentLongitude, squareSize+3, objMap)
                
                # let's keep track how our successes!
                foundCount <- foundCount + 1
        } else {
                # increase squareSize
                print(paste("Can't find any", searchTerm,", enlarging square search area to",squareSize + 1))
                
                # temporary settings to get us out of desert
                squareSize <- squareSize + 1
        }  
        
        # have we arrived at our end point
        if ((currentLongitude < (endingpoint$lon + squareSize)) 
                & (currentLongitude > (endingpoint$lon - squareSize)))
        {
                print(paste('We made it!! It took',foundCount,'hops...'))
                break
        }
        if (safetyCount > 100) 
        {
                print(paste('Giving up!! Failed after',foundCount,'hops'))
                break
        }
                  
        # be considerate with your Yelp requests
        Sys.sleep(0.5)
}



This is the output you will see when you run the code. It will log each point traveled and print the name of the florist. In case it cannot find one in either the upper or lower forward areas, the system will increase the square size by a degree (making it a 2x2 box instead of a 1x1 box) until it can find a florist. This is why in the final map below you will see gaps in deserted or mountainous zones.

## [1] "Scanning new_top_area for florist ..."
## [1]   40.46699 -118.28123   39.46699 -117.28123
## [1] "Scanning new_bottom_area for florist ..."
## [1]   39.46699 -118.28123   38.46699 -117.28123
## [1] "Can't find any florist , enlarging square search area to 2"
## [1] "Scanning new_top_area for florist ..."
## [1]   41.46699 -117.78123   39.46699 -115.78123
## [1] "Our new florist is Second Street Seasonals in Winnemucca NV with a 4 start rating"

...

## [1] "Our new florist is Achin' Back Garden Center in Pottstown PA with a 4.5 start rating"
## [1] "Scanning new_top_area for florist ..."
## [1]  41.24341 -75.06065  40.24341 -74.06065
## [1] "Our new florist is Monday Morning Flower and Balloon Co. in Princeton NJ with a 4.5 start rating"
## [1] "We made it!! It took 41 hops..."



plot of chunk crosscountryplot

Conclusion

So, I didn’t quite deliver on my promise of being within 60 miles of a florist at all times, it may or may not be possible (and we certainly won’t find out with the state of my current code). Yet, we managed to be within those parameters during most of the journey. You can try with with different terms, different locations, even upgrade the algorithm to do cooler things - just don’t abuse the API as it may end the party for all of us.

A special thanks to Nathan Yau; this walkthrough was loosely based but strongly inspired by his In search of food deserts.




Full source code (also on GitHub):


######################## functions ################################
MapIt <- function(latitude, longitude, size, objggmap) {
        objdf <- data.frame('latitude'=latitude,
                            'longitude'=longitude,
                            'tsize'=size)
        
        # add new point to ggmap
        require("ggplot2")
        require("ggmap")
        objggmap <- objggmap + geom_point(
                aes(x=longitude, y=latitude, size=tsize, 
                    show_guide = TRUE), 
                data=objdf, alpha=.8, na.rm = T)  
        return (objggmap)
}

# yelp rest API call
GetBestYelpLocation <- function(boundedcoordinates, term) {
        limit <- 1
        print(boundedcoordinates)
        # Pick first florist in San Francisco, CA
        #YelpUrl <- paste0("http://api.yelp.com/v2/search/?limit=",limit,
        #                  "&location=San%20Francisco%20CA&term=florist")
        # or 10 bars by geo-coordinates
        #YelpUrl <- paste0("http://api.yelp.com/v2/search/?limit=",
        #       limit,"&ll=",latitude,",",longitude,"&term=florist")
        # or by bounded geo-spatial coordinates
        YelpUrl <- paste0("http://api.yelp.com/v2/search/?limit=20&bounds=",
                          boundedcoordinates[1],",",boundedcoordinates[2],
                          "|",boundedcoordinates[3],",",boundedcoordinates[4],"&term=",term)
         
        locationdata=GET(YelpUrl, mySignature)
        listMembersContent = content(locationdata)
        listMembers=jsonlite::fromJSON(toJSON(listMembersContent))
        
        yelpResults = tryCatch({
                data.frame(listMembers)
        }, error = function(e) { 
                NULL
        })
        if (!is.null(yelpResults)) {
                set1 <- data.frame("name"=yelpResults$businesses.name,'city'=yelpResults$businesses.location$city,
                                   'rating'=yelpResults$businesses.rating, 
                                   'latitude'=yelpResults$businesses.location$coordinate$latitude,
                                   'longitude'=yelpResults$businesses.location$coordinate$longitude,
                                   'state'=yelpResults$businesses.location$state_code)
                 
                for (ind in seq(1:nrow(set1))) {
                        if ((set1$latitude[ind] <= boundedcoordinates[1]) &
                                (set1$latitude[ind] >= boundedcoordinates[3]) &
                                (set1$longitude[ind] >= boundedcoordinates[2]) &
                                (set1$longitude[ind] <= boundedcoordinates[4]))
                                return(set1[ind,])
                }
        }
        return(NULL)
}
 
# Movement controls
GetPossibleCoordinates<-function(lat, lon, area=1) {
        # this is hard to keep track but each location's next move
        # equates attempting to move forward one square, the same area
        # as the previous one but half-way above and half-way below the
        # previous one.
        halfarea <- area/2
    
        # new forward top square area
        topArea_upperleft_latitude <- lat+area
        topArea_upperleft_longitude <- lon+halfarea
        
        topArea_lowerright_latitude <- lat
        topArea_lowerright_longitude <- lon+(halfarea+area)
        
        # new forward bottom square area
        bottomArea_upperleft_latitude <- lat
        bottomArea_upperleft_longitude <- lon+halfarea
        
        bottomArea_lowerright_latitude <- lat-area
        bottomArea_lowerright_longitude <- lon+(halfarea+area)
          
        
        rownames <- c('new_top_area','new_bottom_area')
        latitude_point1 <- c(topArea_upperleft_latitude, bottomArea_upperleft_latitude)
        longitude_point1 <- c(topArea_upperleft_longitude, bottomArea_upperleft_longitude)
        latitude_point2 <- c(topArea_lowerright_latitude, bottomArea_lowerright_latitude)
        longitude_point2 <- c(topArea_lowerright_longitude, bottomArea_lowerright_longitude)
        
        return (data.frame('direction'=rownames,
                'latitude_point1'=latitude_point1,
                           'longitude_point1'=longitude_point1,
                           'latitude_point2'=latitude_point2,
                           'longitude_point2'=longitude_point2))
         
}
 
MakeAMove <- function(lat,lon,sizebox, searchTerm, lat_endPoint) {
        possibleCoordinates <- GetPossibleCoordinates(lat,lon,sizebox)
        
        # go up or down first depending on latitude of end point
        searchOrder <- c('new_top_area','new_bottom_area')
        if (lat > lat_endPoint)
                searchOrder <- c('new_bottom_area','new_top_area')
        
        for (directiontogo in searchOrder) {
                coords <- possibleCoordinates[possibleCoordinates$direction==directiontogo,]
                print(paste('Scanning',directiontogo, 'for', searchTerm,'...'))
                foundLocation <- GetBestYelpLocation(as.numeric(as.vector(coords[2:5])), searchTerm)
                if (!is.null(foundLocation))
                        return (foundLocation)
        }
        return (NULL)
}

############################## YELP SETUP ###############################
# yelp credentials
consumerKey = "xxxxx"
consumerSecret = "xxxxx"
token = "xxxxx"
tokenSecret = "xxxxx"

require(httr)
require(jsonlite)

# authorization
myapp = oauth_app("Yelp", key=consumerKey, secret=consumerSecret)
mySignature=sign_oauth1.0(myapp, token=token,token_secret=tokenSecret)

############################### Manual Code ###############################
require(ggplot2)
require(ggmap)

# Get our starting and ending points
# Fisherman's Wharf, San Francisco, CA 94109
startingpoint <- geocode(c("Fishersman's Wharf, San Francisco, CA"))
# Chelsea Piers 10011
endingpoint <- geocode(c("Chelsea Piers, NY"))

# ggplots accepts a data frame, let's make with with start and end points
latitudes <- c(startingpoint$lat, endingpoint$lat)
longitudes <- c(startingpoint$lon, endingpoint$lon)
# importance holds the size and color (we'll leave it up to ggplot to choose)
objdf <- data.frame('latitude'=latitudes,
                    'longitude'=longitudes,
                    'importance'=c(10,10))

# sample ggplot/ggmap with start and end locations
# get a Google map
require(ggmap)
map<-get_map(location='united states', zoom=4, maptype = "terrain",
             source='google',color='color')

# plot it with ggplot2
require("ggplot2")
objMap <- NULL
objMap <- ggmap(map) + geom_point(
        aes(x=longitude, y=latitude, size=importance, 
            show_guide = TRUE, colour = importance), 
        data=objdf, alpha=.8, na.rm = T)  
 

################################### Logic #####################################

searchTerm <- 'florist'
squareSize <- 1 # setting the bounded area to a square 

# start trip info vectors - we need to remember where we've been!
currentLatitude <- startingpoint$lat
currentLongitude <- startingpoint$lon

# let ggmap keep track of where we've been
objMap <- MapIt(currentLatitude, currentLongitude, 1, objMap)

madeIt=FALSE
safetyCount <- 0
foundCount <- 0
while(madeIt == FALSE) {
        safetyCount <- safetyCount + 1
        foundLocation <- MakeAMove(lat=currentLatitude,
                                   lon=currentLongitude,
                                   sizebox=squareSize,
                                   searchTerm=searchTerm,
                                   endingpoint$lat)
        
        if (!is.null(foundLocation)) {
                print (paste('Our new',searchTerm, 'is', foundLocation$name, 'in', 
                             foundLocation$city, foundLocation$state,
                             'with a', foundLocation$rating, 'start rating'))
                currentLatitude <- foundLocation$latitude
                currentLongitude <- foundLocation$longitude
                
                # reset temporary setting
                squareSize <- 1
                
                # let ggmap keep track of where we've been
                objMap <- MapIt(currentLatitude, currentLongitude, squareSize+3, objMap)
                
                # let's keep track how our successes!
                foundCount <- foundCount + 1
        } else {
                # increase squareSize
                print(paste("Can't find any", searchTerm,", enlarging square search area to",squareSize + 1))
                
                # temporary settings to get us out of desert
                squareSize <- squareSize + 1
        }  
        
        # have we arrived at our end point
        if ((currentLongitude < (endingpoint$lon + squareSize)) 
                & (currentLongitude > (endingpoint$lon - squareSize)))
        {
                print(paste('We made it!! It took',foundCount,'hops...'))
                break
        }
        if (safetyCount > 100) 
        {
                print(paste('Giving up!! Failed after',foundCount,'hops'))
                break
        }
                  
        # be considerate with your Yelp requests
        Sys.sleep(0.5)
}
 
objMap

    


Manuel Amunategui - Follow me on Twitter: @amunategui