Simple Web Scraping using node.js, cheerio, and request

Introduction

Web scraping, also called web data extraction, is a technique for extracting large amounts of data from websites; the extracted data is usually stored on a local computer in some file format. The purpose of such extraction might be to consume the data in another application, or to analyze and study it, say for competitive research. I got into web scraping while working on an application that needed to fetch all the blog posts from both of my WordPress sites, and scraping turned out to be a simple way to pull that data from any website. This is a very brief, hands-on article on web scraping. By the end of this tutorial, you should be able to go ahead and scrape/extract data from any website.

Here, I will scrape the home pages of both my WordPress sites and extract metadata such as each post's title and URL.

Requirements

The approach I am going to use is just one of many web scraping techniques, but it makes getting the data very easy with node.js. We will use node.js with cheerio and request to scrape the site, and then write the result to a JSON file using jsonfile. You can get the entire code at this GitHub repository.

Cheerio is a lightweight, fast, flexible and lean implementation of core jQuery designed specifically for the server.
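For a quick feel of the API, here is a tiny sketch (the HTML string is made up for illustration):

var cheerio = require('cheerio');

// Load a made-up HTML snippet and query it with jQuery-style selectors.
var $ = cheerio.load('<h1 class="entry-title"><a href="/hello-world">Hello World</a></h1>');

console.log($('h1.entry-title').text());         // Hello World
console.log($('h1.entry-title a').attr('href')); // /hello-world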

Request is designed to be the simplest way possible to make http calls.
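jsonfile makes reading and writing JSON files easy. A minimal sketch of the writeFile call we will use later (the file name here is just an example):

var jsonfile = require('jsonfile');

// Write an object to disk as JSON, pretty-printed with 2-space indentation.
jsonfile.writeFile('./example.json', { hello: 'world' }, {spaces: 2}, function (err) {
  if (err) console.error(err);
});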

Step 1:

Let’s start by installing our dependencies. Run the following command in your project directory:

npm install cheerio request jsonfile

Step 2:

Now we can require these packages and begin writing our script file, scrape.js:

var request = require('request');
var cheerio = require('cheerio');
var jsonfile = require('jsonfile');

Step 3:

Now we make an HTTP call to the desired website and log the HTML it returns. (The URL below is a placeholder; substitute the site you want to scrape.) We will load the HTML with cheerio in the next step.

var request = require('request');
var cheerio = require('cheerio');
var jsonfile = require('jsonfile');

// placeholder URL: replace with the site you want to scrape
request('https://your-site.example.com', function (error, response, html) {
  if (!error && response.statusCode === 200) {
    console.log(html);
  }
});

Now execute the following command and you will see the website's HTML content logged in your terminal.

node scrape.js

Step 4:

We successfully managed to load the HTML from the desired website. The next step is to identify the elements of interest in this markup. To do this, right-click on the page and choose 'Inspect Element'; this opens the browser's developer tools. Since we need the post title, inspect the title of any post on the page and you will see exactly which element contains it.

[Screenshot: developer tools showing a post title inside an h1 element with the class 'entry-title']

As you can see in the above image, the post title and URL live inside an h1 element with the class 'entry-title'. We can use this class to select those elements. Let's modify our script to load the HTML with cheerio, iterate over every matching element, and store each title and URL in an array of objects.

var request = require('request');
var cheerio = require('cheerio');
var jsonfile = require('jsonfile');

// placeholder URL: replace with the site you want to scrape
request('https://your-site.example.com', function (error, response, html) {
  if (!error && response.statusCode === 200) {
    var title, url, data = [];
    var $ = cheerio.load(html);

    // Iterate over every post title on the page.
    $('h1.entry-title').each(function (i, element) {
      title = $(this);            // the h1 element itself
      url = $(this).find('a');    // the anchor inside it
      data.push({
        'title': title.text(),
        'url': url.attr('href')
      });
    });
  }
});

Now that we have successfully extracted the required data, we can write it to a JSON file for further use.

var request = require('request');
var cheerio = require('cheerio');
var jsonfile = require('jsonfile');

// placeholder URL: replace with the site you want to scrape
request('https://your-site.example.com', function (error, response, html) {
  if (!error && response.statusCode === 200) {
    var title, url, data = [];
    var $ = cheerio.load(html);

    $('h1.entry-title').each(function (i, element) {
      title = $(this);
      url = $(this).find('a');
      data.push({
        'title': title.text(),
        'url': url.attr('href')
      });
    });

    // Write the array to tech.json, pretty-printed with 2-space indentation.
    jsonfile.writeFile('./tech.json', data, {spaces: 2}, function (err) {
      if (err) console.error(err);
    });
  }
});

A 'tech.json' file will now be created in your directory containing an array of objects, each with two key-value pairs: 'title' and 'url'. The file will look something like this:

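(The entries below are a made-up sample for illustration; your file will contain your own post titles and URLs.)

[
  {
    "title": "Hello World",
    "url": "https://your-site.example.com/hello-world"
  },
  {
    "title": "My Second Post",
    "url": "https://your-site.example.com/my-second-post"
  }
]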

As you can see, this was a very simple approach to scraping some data from a website. You can put this code to use for any web page by simply changing the URL passed to request. Then just identify the elements that contain the required content and adjust the selectors to match that markup, as in the sketch below.
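For example, if another site (hypothetically) marked up its posts as h2 elements with a class of 'post-title', only the selector inside the script would need to change (this reuses the $ and data variables from the script above):

// Hypothetical markup: <h2 class="post-title"><a href="...">...</a></h2>
$('h2.post-title').each(function (i, element) {
  data.push({
    'title': $(this).text(),
    'url': $(this).find('a').attr('href')
  });
});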

Hope this was easy, quick and helpful.
