Quick prototypes with client-side web scraping

Drawing the line in the sand for a minimum viable product can be an art. If you’re building apps that require external data, there’s almost no end to the technical foundation you could lay before you even start building an interface: API clients, crawlers, scrapers, databases, queues, caches, even more queues with failure modes and staggered retries…

In an app that’s expected to scale or perform any long-term analysis on stored data, your data harvesting needs to be reliable, idempotent, and repeatable. But for building an MVP, all of this can be overkill. Sometimes all you need is some basic jQuery.

Move quickly and leverage the hard work of others

In this example, I’m pulling data from a site that lists upcoming soccer matches. The data is listed in tables sorted by time. We want to grab things like team names, their logos, match times, and URLs so we can link back to the original page.
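To make the selectors in the code below concrete, here’s the kind of row markup I’m assuming the listing page has. The class names and structure are hypothetical; substitute whatever the real site uses. Conveniently, jQuery can query a detached HTML string the same way it queries the live DOM:

// hypothetical markup for one row of the listing page
var sampleRow =
  '<tr class="match">' +
  '<td class="homeName">Arsenal</td>' +
  '<td class="awayName">Chelsea</td>' +
  '<td class="matchTime">19:45</td>' +
  '<td><a href="/matches/12345">details</a></td>' +
  '</tr>';

// jQuery parses the string into DOM nodes we can query directly
console.log($(sampleRow).find('.homeName').text()); // "Arsenal"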

To start, we want to create a Match object. This will hold an individual match’s data and the functions needed to fetch and display it.

var Match = function (opts) {
  // set up initial properties from whatever the listing page gave us
  this.url = opts.url || '';
  this.homeName = opts.homeName || '';
  this.awayName = opts.awayName || '';
  this.matchTime = opts.matchTime || Date.now();

  // kick off any functions you want called on object creation
  this.initialize();
};

In this case, our initialize function is simple: go fetch the extra information and render the match as a row in a table. Since the fetch is asynchronous, the row renders immediately with the data we already have and re-renders once the extra data arrives.

Match.prototype.initialize = function () {
  // get the thing (asynchronously -- the success callback
  // will re-render the row once the extra data arrives)
  this.fetch();

  // display the thing
  this.render();
};

Our data source has a sub-page for each row in the table where extra information is kept. So our Match object will have its own fetch method to grab the sub-page and parse all the relevant metadata.

Match.prototype.fetch = function () {
  var that = this;

  $.ajax({
    url: this.url,
    success: function (data) {
      that.html = data;

      // parse the info you are looking for. we'll grab team logos here
      that.homeImage = $(that.html).find('.homeLogo img').attr('src');
      that.awayImage = $(that.html).find('.awayLogo img').attr('src');

      // ... and on, and on for other data you need

      // re-render the row now that the extra data is here
      that.render();
    }
  });
};

Instead of creating a whole separate view layer to handle display logic, we can just tack on a render function to get everything displayed quickly.

Match.prototype.render = function () {
  // initialize the container element if it's the first render
  // (this assumes the page already has a <table id="matches"> with a tbody)
  if (typeof this.el === 'undefined') {
    this.el = $('<tr></tr>').appendTo('#matches tbody');
  }

  // display whatever match information is relevant
  this.el.html('<td>' + this.homeName + '</td><td>' + this.awayName + '</td>');
};

Now we just need to scrape the initial page and create new Match objects, which take care of rendering themselves.

var parseMatchLinks = function (html) {
  // loop through each match row
  $(html).find('.match').each(function (i, el) {
    // parse relevant match information
    var homeName = $(el).find('.homeName').text().trim();
    var awayName = $(el).find('.awayName').text().trim();
    var matchTime = $(el).find('.matchTime').text().trim();
    var url = $(el).find('a').attr('href'); // assuming each row links to its sub-page

    // create our match object. from our code above, the Match object
    // will fetch its sub-page and display itself when it's ready
    var match = new Match({
      url: url,
      homeName: homeName,
      awayName: awayName,
      matchTime: matchTime
    });
  });
};

$(document).ready(function () {
  // fetch then parse
  $.ajax({ url: 'http://target_site_url' }).then(parseMatchLinks);
});

Done! Right? Not quite.

Proxying cross-origin requests

So far, you could deploy this whole thing on a static server. The issue with prototyping this sort of thing in the browser is that most sites won’t be fetchable due to security restrictions in browsers. Without cross-origin resource sharing (CORS) enabled on the server, you can’t use $.ajax for content outside of the current domain. I.e., if your site is hosted at http://domain.com, you can’t grab data from http://anotherdomain.com.
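To make the failure concrete, here’s roughly what the naive version looks like (URL hypothetical). The browser blocks the response before our code ever sees it, so the request falls through to the fail handler:

// a cross-origin request to a site that doesn't send CORS headers --
// the browser blocks the response, and the html never reaches our code
$.ajax({ url: 'http://anotherdomain.com/matches' })
  .then(parseMatchLinks)
  .fail(function (jqXHR) {
    // blocked cross-origin requests typically report a status of 0
    console.log('request failed with status ' + jqXHR.status);
  });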

To get around this, we’ll just create a proxy server that adds CORS headers and routes requests to the site we’re scraping. I’m going to go with Node.js for this example, but this is simple enough that it shouldn’t take much in any language.

This uses Express to handle requests and Cacheman to cache responses. Setting up a cache will allow us to serve the data quickly while also being good consumers (or at least, not the worst) of others’ data.

var express = require('express');
var request = require('request');
var Cacheman = require('cacheman');
var apiServerHost = 'http://targetdomain.com';

var app = express();
var cache = new Cacheman();

// add CORS headers so the browser will let our client app read the responses
app.use(function(req, res, next) {
  res.header("Access-Control-Allow-Origin", "*");
  res.header("Access-Control-Allow-Headers", "Origin, X-Requested-With, Content-Type, Accept");
  next();
});

app.use('/', function(req, res) {
  // transform the url so that all requests go to
  // the other site. i.e., if our client hits this
  // proxy server at http://localhost:4000/matches/match_id,
  // it will return html from http://targetdomain.com/matches/match_id
  var url = apiServerHost + req.url;

  // check the cache for the content
  cache.get(url, function (error, content) {
    if (error) { throw error; }

    if (content) {
      // cache hit: send the stored html straight to the client
      return res.send(content.html);
    }

    // cache miss: request the page from the target site
    request(url, function (error, response, body) {
      if (error) { return res.status(502).send('error fetching ' + url); }

      content = { html: body };

      // add the content to the cache using the url as
      // the cache key and an expiration of 5 minutes
      cache.set(url, content, 300);

      // send the content to the client
      res.send(content.html);
    });
  });
});

app.listen(4000);

Now, to route our requests through the proxy, we just need to change the URL used inside our $(document).ready() ajax call to point at http://localhost:4000.
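Assuming the proxy above is running locally on port 4000 and the match listing lives at the root of the target site, the kickoff code becomes:

$(document).ready(function () {
  // hit our local proxy instead of the target site; the proxy
  // forwards the path to the target and adds the CORS headers
  $.ajax({ url: 'http://localhost:4000/' }).then(parseMatchLinks);
});

The sub-page URLs the Match objects fetch need the same treatment: prefix the scraped paths with the proxy’s address so those requests are routed through it too.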

Blam-o. Done. Well, basically. It would also help to style this thing up a bit and actually have an index.html. And you might also want to build some kind of server or integration to pass your fresh, crisp data off to. But that’s for another post.