From web scraping to ETL

Without going into the legality of web scraping, it’s something most netizens have probably wished for at some point, i.e. being able to extract all the relevant data from a website for easier access. I have also dipped my toes into it in the past, e.g. I remember downloading some websites for offline viewing; but a few days ago, I wanted to do more (or, technically, less) than download a website in its entirety: I wanted to aggregate data from a couple of websites.

So, the goal was to have specific data from those websites accessible in one place. The data was spread across 100+ pages, so manually saving each page was not an option. I realized that I would essentially have to build an ETL pipeline to achieve my goal:

  • Extract: get all the relevant data from those websites.
  • Transform: aggregate the data.
  • Load: make the data available in an easy-to-access form.
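The three stages can be sketched as a tiny pipeline. This is only an illustration of the shape, not the actual scripts; the stage bodies below are placeholders:

```javascript
// A minimal sketch of the ETL shape; stage bodies are placeholders, not the real scripts.
const extract = async (page) => ({ page, length: page.length })                    // fetch + parse a page
const transform = (rows) => [...rows].sort((a, b) => a.page.localeCompare(b.page)) // aggregate/merge
const load = (rows) => JSON.stringify(rows)                                        // e.g. write a CSV or a sheet

async function runETL(pages) {
  const extracted = await Promise.all(pages.map((p) => extract(p))) // Extract
  return load(transform(extracted))                                 // Transform, then Load
}
```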

Based on my limited (read: zero) research on existing solutions, I decided to build my own ETL pipeline. I could see two ways to get the relevant data:

  • Fetch desired data via API endpoints, where API endpoints could be either of the following:
    • a publicly available API providing the same data as shown on the website.
    • a frontend API providing the data that is used by the web UI.
  • Download pages and extract desired data from them, where the downloading could be done using any of the following:
    • a web crawler e.g. HTTrack.
    • a headless browser e.g. PhantomJS.
    • a custom script e.g. code that requests all the pages.

The first option was unfortunately not possible, as there were no publicly available APIs, and both of those websites used a web UI framework that rendered pages on the server side.

So, I had to go with the second option, where I decided to write my own script to download the pages instead of using a web crawler or a headless browser. Since I would need to – somehow – load the downloaded pages into a DOM structure to be able to accurately extract all the desired data, it made sense to write a script that did both, i.e. download the pages and load them into a DOM structure. I wrote the script in JS/Node.js (for obvious reasons, i.e. familiar interaction with DOM elements):

import fetch from 'node-fetch'
import jsdom from 'jsdom'

const baseURL = ''

function extractInfo(dom) {
  const dataElement = dom.window.document.querySelector('#list > table > tbody')
  if (dataElement === null) {
    return ''
  }

  // Find the row whose first cell is the 'Tag:' label, and
  // return the content of the value cell next to it.
  for (const row of dataElement.children) {
    const cells = row.children
    if (cells.length > 1 && cells[0].textContent === 'Tag:') {
      return cells[1].textContent
    }
  }

  return ''
}

async function getInfo(name) {
  const url = `${baseURL}${name.replace(/ /g, '_')}`

  try {
    const res = await fetch(url)
    const dom = new jsdom.JSDOM(await res.text())
    const info = extractInfo(dom)

    // A redirected response means the page doesn't exist under this name.
    return url === res.url ? { url, info } : null
  } catch {
    return null
  }
}
At this stage, I had all the desired data in CSV files. I didn’t need to do much in terms of aggregating the data, other than combining the data from the corresponding pages of both websites. So, I did that in Google Sheets (I prefer it over MS Excel because I don’t own an Office 365 subscription, and want to be able to edit my sheets online).
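I did the combining in Google Sheets, but for the record, the same VLOOKUP-style merge could be sketched in code (the key and field names below are hypothetical):

```javascript
// Combine rows from two sources by a shared key (e.g. the page name).
// Rows from `b` are looked up by key and merged into the matching rows of `a`.
function combineByKey(a, b, key) {
  const index = new Map(b.map((row) => [row[key], row]))
  return a.map((row) => ({ ...row, ...(index.get(row[key]) || {}) }))
}
```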

Finally, I wanted the aggregated data to be more accessible than a Google Sheet. I could see two ways to present the data:

  • Create a website and show the data in a tabulated UI.
  • Generate an interactive report from the data sheet.

I opted to go down the second route, as it seemed faster and more maintainable. Also, I had some experience with exposing Google Sheets as an interactive tabulated UI (using a nifty web app called Awesome Table). Still, I wanted the flexibility of a web UI, e.g. the ability to visualize data, so I decided to try out Google’s Data Studio app. Hooking up my Google Sheet to a Data Studio report was a piece of cake, mainly due to both being Google products. In no time, I had a pleasant-looking report exposing almost-live data (staleness of up to 15 mins, to be exact) from my sheet. Eventually, I ended up creating a simple website to embed my report, just so I wouldn’t have to remember the long URL (I know I could’ve just used a URL shortener; I guess I had time to kill ¯\_(ツ)_/¯).
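The embed itself is just an iframe around the report’s embed URL, so the wrapper page can be generated with a few lines. The report URL below is a placeholder; Data Studio provides the actual embed URL through its embed-report option:

```javascript
// Generate a minimal wrapper page embedding a report iframe.
// reportURL is a placeholder; substitute the embed URL from Data Studio.
function embedPage(reportURL) {
  return [
    '<!doctype html>',
    '<html><head><title>Report</title></head>',
    `<body style="margin:0"><iframe src="${reportURL}"`,
    ' style="border:0;width:100vw;height:100vh"></iframe></body></html>',
  ].join('\n')
}
```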