In this tutorial we are going to build a scraper that, as a use case, visits Adobe Summit’s website and collects data about its speakers. Let's start by introducing Puppeteer:
Puppeteer is a Node.js library that provides an API to control the Chromium browser. By default Puppeteer runs in headless mode, but it can easily be configured to run in non-headless (headful) mode. With Puppeteer we can do nearly everything we can do manually in the Chrome browser and its DevTools.
Since Puppeteer is a Node.js library, make sure you have Node.js installed; you can download it from here.
Then follow these steps:
1. Create an empty directory for your project
2. Open the directory with the code editor of choice and run through terminal:
npm install puppeteer : to install the latest version
npm install puppeteer@19.5.2 : to install a specific version (e.g. 19.5.2)
3. A new folder named “node_modules” will be created in your working directory; it contains Puppeteer and all of its dependencies.
4. Start your scripts with the following line to include Puppeteer:
const puppeteer = require('puppeteer');
Before we begin our tutorial, let's take a look at the most common functions of Puppeteer. Assume `tab` is an object that represents a browser tab.
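For reference, the calls used throughout this tutorial look roughly like this (all of these are real Puppeteer Page methods; `tab` stands for a Page object and `browser` for the Browser object):

```javascript
await tab.goto(url);                   // navigate the tab to a URL
await tab.click(selector);             // click the first element matching a CSS selector
await tab.waitForSelector(selector);   // wait until an element matching the selector appears
const result = await tab.evaluate(fn); // run fn inside the page's context and return its result
await tab.screenshot({ path: 'page.png' }); // capture a screenshot of the page
await browser.close();                 // close the browser when done
```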
Find the full API Documentation here.
To start your first project, begin with the require statement mentioned above:
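That is:

```javascript
const puppeteer = require('puppeteer');
```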
Then add an asynchronous ES6 immediately invoked arrow function expression (IIFE):
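The skeleton of the script then looks like this:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // the scraping code goes here
})();
```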
Inside the function, declare the Puppeteer browser object and open a tab:
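Inside the IIFE, that looks like the following (headless is the default launch mode):

```javascript
const browser = await puppeteer.launch(); // start a headless Chromium instance
const tab = await browser.newPage();      // open a new tab in it
```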
Now that both the browser and the tab have been initialized, proceed with loading a URL:
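For example (the speakers URL below is a placeholder; substitute the actual Adobe Summit speakers page):

```javascript
// waitUntil: 'networkidle2' resolves once the page has (almost) stopped making requests
await tab.goto('https://summit.adobe.com/speakers/', { waitUntil: 'networkidle2' });
```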
For a first try, let’s just fetch the name of the first speaker from the loaded page and print it to the terminal.
To achieve that, we will first use the waitForSelector() function, which waits until a specific element has loaded, and then the evaluate() function, which takes a function as input and evaluates it in the page’s context.
So the full code for this will be the following:
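A sketch of firstSpeaker.js; note that '.speaker-name' is a placeholder selector, so inspect the live page with DevTools to find the real one:

```javascript
// firstSpeaker.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const tab = await browser.newPage();
  await tab.goto('https://summit.adobe.com/speakers/', { waitUntil: 'networkidle2' });

  // wait until the first speaker name has been rendered
  await tab.waitForSelector('.speaker-name');

  // run a function inside the page's context and return its result
  const firstSpeaker = await tab.evaluate(() => {
    return document.querySelector('.speaker-name').textContent.trim();
  });

  console.log(firstSpeaker);
  await browser.close();
})();
```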
To execute it, run the following command in the terminal: node firstSpeaker.js
In order to retrieve more information about a speaker, we have to click on the speaker's name; a pop-up window then appears with all the necessary information.
To simulate this with our scraper, we will use the click() function and then wait for the pop-up element to appear before retrieving the information, as shown below:
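A sketch of that step, again with placeholder selectors ('.speaker-name' for the link, '.speaker-modal' for the pop-up and the class names inside it):

```javascript
await tab.click('.speaker-name');            // open the speaker's pop-up
await tab.waitForSelector('.speaker-modal'); // wait for the pop-up to render

// collect the details from inside the pop-up
const details = await tab.evaluate(() => {
  const modal = document.querySelector('.speaker-modal');
  const text = (sel) => (modal.querySelector(sel)?.textContent || '').trim();
  return {
    name: text('.name'),
    company: text('.company'),
    jobTitle: text('.job-title'),
    bio: text('.bio'),
  };
});

console.log(details);
```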
Finally, let’s enhance the above code to iterate over all the speakers, fetch their data, and store it in a semicolon-delimited (;) text file whose columns represent:
1. Speaker Name
2. Company Name
3. Job Title
4. Bio
Our final scraper’s code will be the following:
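A sketch of the final scraper, under the same assumptions as before: all CSS selectors ('.speaker-name', '.speaker-modal', '.close', etc.) and the URL are placeholders to be replaced with the real ones from the live page:

```javascript
// scrapeSpeakers.js
const fs = require('fs');
const puppeteer = require('puppeteer');

// join one speaker's fields into a semicolon-delimited row,
// stripping stray semicolons and newlines from the values first
const toRow = (fields) => fields.map((f) => f.replace(/[;\r\n]+/g, ' ').trim()).join(';');

(async () => {
  const browser = await puppeteer.launch();
  const tab = await browser.newPage();
  await tab.goto('https://summit.adobe.com/speakers/', { waitUntil: 'networkidle2' });
  await tab.waitForSelector('.speaker-name');

  const rows = [];
  const speakers = await tab.$$('.speaker-name'); // element handles for every speaker link
  for (const speaker of speakers) {
    await speaker.click();                        // open this speaker's pop-up
    await tab.waitForSelector('.speaker-modal');  // wait for the pop-up to render

    // extract Speaker Name, Company Name, Job Title, Bio from the pop-up
    const data = await tab.evaluate(() => {
      const modal = document.querySelector('.speaker-modal');
      const text = (sel) => (modal.querySelector(sel)?.textContent || '').trim();
      return [text('.name'), text('.company'), text('.job-title'), text('.bio')];
    });
    rows.push(toRow(data));

    await tab.click('.speaker-modal .close');     // close the pop-up before the next click
  }

  fs.writeFileSync('speakers.txt', rows.join('\n'));
  await browser.close();
})();
```

Running node scrapeSpeakers.js writes one row per speaker to speakers.txt, with the four columns separated by semicolons.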