Accidental Dive Into Web Scraping

Hello Arby

I work at RBMTechnologies, a Ruby on Rails shop whose flagship product allows large companies to execute merchandising campaigns.

Anyways, we had recently adopted Campfire and I got extremely excited. I had read about how GitHub created Hubot and wanted something similar at RBM.

Since we were a Ruby on Rails shop, I figured I would create something similar in Ruby. If it was engineered correctly, it could be something fun that my fellow engineers and I could tinker with.

With a few hours' work, I had something up and running. The bot, affectionately named Arby M. (get it?), would listen in on our Campfire chatrooms and execute incoming messages that matched the format Arby <some_command>.

I'll write a blog post about Arby M. once it's open sourced on GitHub. Right now, it's on our private GitHub Enterprise server.

It's a Simple Command... Right?

A day or so after finishing the core Arby code and sample commands, like 'leave' and 'help', I decided to add something a little more ambitious and fun.

Let me preface this by saying that every once in a while, RBM's engineering team would go out for beers at some local bar. I thought it would be cool if Arby could suggest a random one for us. It was also a simple command that shouldn't take too long to write up, so I figured I'd bang it out when I got home and then work on my side project.

This simple command ended up consuming the rest of my evening.

The Price and Gains of Hubris

After getting home, I planted myself in my favorite (and only) armchair, opened up my laptop, and began hacking at this simple command.

First, I needed a way to get a list of local bars. I had contemplated using the Google Search gem but opted instead for Yelp. I chose Yelp because it's meant for displaying local businesses and I wouldn't have to crawl through Google search results to make sure I had indeed gotten a result for a bar that serves beer.

At this point I want to make a confession:

All the troubles and learnings that I am about to detail could have been avoided had I just made a Yelp account and used their API.

Now that we've got that out of the way, let's begin.

Attempt One

I wasn't going to over-engineer the command. I just wanted a list of bars near RBM, from which I would randomly select one. I had chosen Yelp, so the first thing I did was search Yelp for bars near Kendall Square. Now that I had the URL, the first step was to get the page's HTML.

Curb

I know there are a lot of ways to skin this particular cat, but after some digging through The Ruby Toolbox, I opted for Curb. Curb is basically a Ruby wrapper for curl.

I threw gem 'curb' into my Gemfile and ran $ bundle install.

The code to get the page was dead simple.

require 'curb'

... # Arby stuff...

yelp_url = 'http://www.yelp.com/search?find_desc=bars&fin...'  
http     = Curl.get(yelp_url)  
html     = http.body_str

... # Arby stuff...

Nokogiri

Now, I needed to get the URLs to the various bars listed on that page. I had used Nokogiri before at work so I decided to use it here. I added gem 'nokogiri' to my Gemfile, ran $ bundle install, and was on my way.

require 'curb'  
require 'nokogiri'

... # Arby stuff...

yelp_url = 'http://www.yelp.com/search?find_desc=bars&fin...'  
http     = Curl.get(yelp_url)  
html     = Nokogiri::HTML(http.body_str)

... # Arby stuff...

Chrome Inspector

I had the page's HTML loaded into a Nokogiri::HTML::Document and now just needed to find the specific tags containing the URLs to the various bars' Yelp profiles.

I opened up Chrome inspector and discovered that the main link to each bar's profile page had the CSS class biz-name, which was in a span tag with the class indexed-biz-name, which was in a div with the class biz-listing-large.

I figured this was specific enough that Nokogiri wouldn't find anything extra. I used Nokogiri::HTML::Document#css to find the tags via CSS selectors. Working from the outside in, I got 'div.biz-listing-large span.indexed-biz-name a.biz-name'.

require 'curb'  
require 'nokogiri'

... # Arby stuff...

yelp_url = 'http://www.yelp.com/search?find_desc=bars&fin...'  
http     = Curl.get(yelp_url)  
html     = Nokogiri::HTML(http.body_str)

a_tags = html.css('div.biz-listing-large span.indexed-biz-name a.biz-name')

... # Arby stuff...

Deceptive Success

From Nokogiri::HTML::Document#css, I got back a set of anchor tags. Each tag had the instance method #attributes, which returned a hash of HTML attributes keyed by the attribute name as a string. The 'href' key gave me a string containing the URL I needed. To be more precise, the strings were paths relative to http://www.yelp.com, like /biz/belly-wine-bar-cambridge.
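As a sanity check, here's that lookup against a hand-built approximation of the listing markup (the class names and example path are the ones from this post; the fragment itself is my own reconstruction):

require 'nokogiri'

# An approximation of one Yelp listing, per the structure found in Chrome inspector.
fragment = <<-HTML
<div class="biz-listing-large">
  <span class="indexed-biz-name">
    <a class="biz-name" href="/biz/belly-wine-bar-cambridge">Belly Wine Bar</a>
  </span>
</div>
HTML

doc   = Nokogiri::HTML(fragment)
a_tag = doc.css('div.biz-listing-large span.indexed-biz-name a.biz-name').first

a_tag.attributes['href'].value # => "/biz/belly-wine-bar-cambridge"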

With that in mind, I collected the href values, prefixed them with Yelp's root URL, and gathered them into an array of strings. From that array I chose a random index, which gave me a random bar URL that I could post to Campfire.

require 'curb'  
require 'nokogiri'

... # Arby stuff...

yelp_url = 'http://www.yelp.com/search?find_desc=bars&fin...'  
http     = Curl.get(yelp_url)  
html     = Nokogiri::HTML(http.body_str)

a_tags = html.css('div.biz-listing-large span.indexed-biz-name a.biz-name')

yelp_root_url = 'http://www.yelp.com'

urls = a_tags.collect do |tag|  
  yelp_root_url + tag.attributes['href'].value
end

random_generator = Random.new  
index            = random_generator.rand(urls.length)

random_bar_url = urls[index]  
# YAY NOW SEND IT TO CAMPFIRE!!!!

... # Arby stuff...

The above code worked great. I ran some tests, things looked kosher, and I was ready to consider the command done when I realized something: the Yelp listings were paginated. In other words, there were only 10 bar listings per page, so my array of URLs only contained 10 unique bars.

And So It Begins

No problem. I thought to myself: "Yelp probably uses some logical URL format for paginations."

I decided to click on the second page to see what changed in the URL.

http://www.yelp.com/search?fin...#start=10

I was right. Notice that the main search URL stayed the same; the only difference was the #start=10 at the end. Just to sate my curiosity, I also tested the URL with #start=0 and found that it gave me the first page of the listings.

Awesome.

The solution was simple: append #start=<some_multiple_of_10> to the end of the URL, then loop and collect each set of 10 URLs until I had an amount I deemed sufficient. For me, 100 potential bars was enough.

require 'curb'  
require 'nokogiri'

... # Arby stuff...

urls = []

yelp_root_url = 'http://www.yelp.com'  
yelp_url      = 'http://www.yelp.com/search?find_desc=bars&fin...'  
paged_count   = 0

while paged_count < 100 do  
  paged_url = "#{yelp_url}#start=#{paged_count}"

  http = Curl.get(paged_url)
  html = Nokogiri::HTML(http.body_str)

  a_tags = html.css('div.biz-listing-large span.indexed-biz-name a.biz-name')

  urls += a_tags.collect do |tag|
    yelp_root_url + tag.attributes['href'].value
  end

  paged_count += 10
end

random_generator = Random.new  
index            = random_generator.rand(urls.length)

random_bar_url = urls[index]  
# YAY NOW SEND IT TO CAMPFIRE!!!!

... # Arby stuff...

I ran the code in my console and the output was reassuring. I had gotten a bar suggestion that I knew was valid. I ran the code a few more times because I was proud of my handiwork and a tad paranoid that things were working out so well. After a couple years of programming, I have developed a paranoia about things that work on the first try.

Aye, There's The Rub

After a few runs of the code, I noticed something strange. I kept getting the same few bar suggestions. My first thought was that I needed to provide a new seed for every call to rand.

That isn't possible though, I told myself. I was getting different bar URLs across subsequent runs. Doing what most programmers would do, I shoved a print statement into my code to see what URLs were in my array.

My array of 100 URLs was just my first 10 URLs repeated 10 times.

Duh! I had forgotten to increment my page count... Wait, no! If that was the case, I would have hit an infinite loop in my code. Oh, maybe I incremented weirdly and Yelp only works on intervals of 10.

Nope.

Everything looked reasonable. My code ran, and I was incrementing from 0 to 100 in steps of 10.

The Devil in the Details

After an hour or so of annoyed debugging, Googling, and playing with the Yelp page, I noticed something. Every time I clicked on the link to the next page, the previous Yelp page did not disappear. Instead, a Yelp loading overlay appeared, and when it disappeared, the listings had been updated.

That's when I realized that Yelp was using JavaScript to load the new content. I immediately googled something only a newbie would:

'Curl after javascript runs'...

I'll let that sink in for a second.

JavaScript needs to be run on a JavaScript engine. Basically, when your browser goes to a website's URL, it gets HTML, CSS, and JavaScript back as part of an HTTP response. Your browser renders the HTML and CSS, then takes the JavaScript and runs it on its JavaScript engine, which is kind of like running $ ruby some_ruby_script.rb in the CLI.

What curl, and therefore Curb, does is grab the files at a given URL. It never runs any JavaScript.
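To make that concrete, here's the distinction in miniature (example.com is a stand-in URL; the comments are the point):

require 'curb'

# Curb hands back exactly the bytes the server sent in the HTTP response.
response = Curl.get('http://www.example.com')
html     = response.body_str

# `html` is the page as served. Anything a site builds client-side with
# JavaScript after the page loads will never show up in this string,
# because nothing here ever runs that JavaScript.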

It took me a second to realize this. Once I did, I realized that I needed a web driver.

Stubbornness

At this point, you are probably screaming something along the lines of:

"You DAMN IDIOT, just make a Yelp account and use their DAMN API!".

I thought the same thing too, until I looked at the clock and realized that I had spent two hours on this "simple" command. Sometimes, I can be a very stubborn man. This was one of those times. I said, "screw it, we are doing this command without Yelp's API!".

Attempt Two

A quick bit of Googling and I came across Watir. It's pronounced "water" by the way.

I had heard the gem mentioned briefly at BostonRB and decided to give it a shot. I added gem 'watir-webdriver' into my Gemfile and ran $ bundle install.

Be Like Watir

The 'Getting Going' section of Watir WebDriver's documentation gave me a good example of how to get things working:

  1. Create a Watir::Browser instance.
  2. Use said browser instance to navigate to my needed URL via the #goto instance method.
  3. Do whatever I need to do on that page.
  4. Close the browser, with the #close method, once you are done.

The last step isn't strictly necessary, nor was it in the example, but I have developed a habit of closing whatever I open so it didn't feel right to not include that step.
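Strung together, those four steps come out to something like this (a minimal sketch; the URL here is just a stand-in):

require 'watir-webdriver'

browser = Watir::Browser.new       # 1. create a Watir::Browser instance
browser.goto 'http://www.yelp.com' # 2. navigate to the URL you need
puts browser.title                 # 3. do whatever you need on the page
browser.close                      # 4. close the browser once you are done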

Selecting Links in Watir

I needed to collect all of the bar URLs from the page in Watir. Luckily, Watir had some examples for that too. From my previous attempt, I knew that the links I cared about had the class name biz-name, so I just needed to provide that to the Watir::Browser#links method.

I tested this method out in IRB and was pleased to discover that it returns a Watir::AnchorCollection, which has instance methods like #each and #collect. Furthermore, each element in the collection is a Watir::Anchor object, which has an #href instance method. I tried calling that method on the first anchor in the collection and found that it returned the complete URL to the bar's profile, already prefixed with http://www.yelp.com.
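A rough reconstruction of that IRB session (the return values are paraphrased from memory, not verbatim IRB output):

a_tags = browser.links :class => 'biz-name'
a_tags.class       # => Watir::AnchorCollection
a_tags.first.class # => Watir::Anchor

# Unlike the relative paths Nokogiri gave me, #href comes back fully qualified.
a_tags.first.href  # => "http://www.yelp.com/biz/..."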

Awesome.

With the information I needed, I updated my Arby command to use Watir.

require 'json' # You may need this if Ruby complains about JSON.  
require 'watir-webdriver'

... # Arby stuff...

urls = []

yelp_url    = 'http://www.yelp.com/search?find_desc=bars&fin...'  
paged_count = 0

browser = Watir::Browser.new

while paged_count < 100 do  
  paged_url = "#{yelp_url}#start=#{paged_count}"
  browser.goto paged_url

  a_tags = browser.links :class => 'biz-name'
  urls   += a_tags.collect do |tag|
    tag.href
  end

  paged_count += 10
end

browser.close

random_generator = Random.new  
index            = random_generator.rand(urls.length)

random_bar_url = urls[index]  
# YAY NOW SEND IT TO CAMPFIRE!!!!

... # Arby stuff...

Magic

This was new code to me, so I decided to test the Watir parts via IRB.

The first time I ran my Watir code, a huge smile appeared on my face. It didn't occur to me that Watir, by default, was not headless. A Firefox browser popped up on my screen with the Yelp page. It was awesome. I had never done something like this before and I was reminded of a feeling I got when I first started programming. I felt like I had just unlocked magical powers that enabled me to be the true master of my computer. I know it's a bit hyperbolic, but the feeling was exhilarating.

I stared at this new browser in amazement. I watched as it jumped from page to page until it stopped. I switched back to IRB and saw that my while loop had completed.

I had performed my first ever web scrape.

Waiting

I quickly checked my results. urls.length gave me 100. Things were looking hopeful. Then I ran urls.uniq.length and got 10. Damnit, what happened? I had watched the browser jump through the various pages.

I ran the code again in disbelief and watched the browser. I saw Yelp's loading spinner appear, and it dawned on me: Yelp was waiting for an AJAX request to complete before it could update the listing of bars. I needed to wait for Yelp to finish updating its page's data before collecting the links.

I went back to Watir's page and discovered that this was a common issue. Watir had an API for waiting.
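The two pieces of that API I ended up leaning on were Watir::Wait's block form and the per-element wait helpers. Roughly (the condition and selector here are just for illustration):

# Block form: polls until the block returns true or a timeout is hit.
Watir::Wait.until do
  browser.title.include?('Yelp')
end

# Per-element helper: blocks until the element shows up in the DOM.
browser.span(:class => 'indexed-biz-name').wait_until_present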

Success

After a few rounds of trial and error with the various waiting commands, I cobbled together a solution.

require 'json' # You may need this if Ruby complains about JSON.  
require 'watir-webdriver'

... # Arby stuff...

urls = []

yelp_url    = 'http://www.yelp.com/search?find_desc=bars&fin...'  
paged_count = 0

browser = Watir::Browser.new

while paged_count < 100 do  
  paged_url = "#{yelp_url}#start=#{paged_count}"
  browser.goto paged_url

  # First wait: poll until Yelp's AJAX spinner is gone.
  loading_div = browser.div :class => 'yelp-spinner'
  Watir::Wait.while do
    loading_div.present?
  end

  # Second wait: the span wrapping the links must exist before we touch them.
  span_tag = browser.span :class => 'indexed-biz-name'
  span_tag.wait_until_present

  a_tags = browser.links :class => 'biz-name'
  urls   += a_tags.collect do |tag|
    tag.href
  end

  paged_count += 10
end

browser.close

random_generator = Random.new  
index            = random_generator.rand(urls.length)

random_bar_url = urls[index]  
# YAY NOW SEND IT TO CAMPFIRE!!!!

... # Arby stuff...

Something really important that I want to point out is the issue of stale DOM elements. If the ordering of your waits isn't correct, you will end up trying to access DOM elements that don't exist yet. This is why my solution had two different waits. First, I waited for the Yelp spinner div to "disappear", which meant the AJAX request was complete. Second, I waited for the span element wrapping my links to appear, which meant the JavaScript updating the links had finished executing. Then, and only then, could I safely grab the links without my code randomly breaking with stale DOM exceptions.

The Upside of My Adventure

The upside is of course the learning.

I went the inefficient route to get what I needed. I'll probably rewrite this command to use Yelp's API once I get around to making a Yelp account. Because of this journey, though, I learned a bit about web drivers and about the gaps in my knowledge.

I also had a lot of fun.