In this tutorial we are going to build a scraper that, as a use case, visits Adobe Summit’s website and collects data about its speakers. Let's start by introducing Puppeteer:
Puppeteer is a Node.js library that provides an API to control the Chromium browser. By default Puppeteer runs in headless mode, but it can easily be configured to run in non-headless (headful) mode. With Puppeteer we can do nearly everything we can do manually in the Chrome browser and its DevTools.
Since Puppeteer is a Node.js library, make sure you have Node.js installed; you can download it from here.
Then follow these steps:
1. Create an empty directory for your project
2. Open the directory with the code editor of choice and run through terminal:
npm install puppeteer : to install the latest version
npm install puppeteer@19.5.2 : to install a specific version (e.g. 19.5.2)
3. A new folder named “node_modules” will be created in your working directory; it contains Puppeteer and all of its dependencies.
4. Start your scripts with the following line to include Puppeteer:
const puppeteer = require('puppeteer');
Before we begin our tutorial, let's take a look at the most common functions of Puppeteer. Assume `tab` is an object that represents a browser tab.
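For reference, the calls used throughout this tutorial look roughly like this (all of these are real Puppeteer Page methods; `tab` stands for a Page object and `browser` for the Browser object):

```javascript
await tab.goto(url);                   // navigate the tab to a URL
await tab.click(selector);             // click the first element matching a CSS selector
await tab.waitForSelector(selector);   // wait until an element matching the selector appears
const result = await tab.evaluate(fn); // run fn inside the page's context and return its result
await tab.screenshot({ path: 'page.png' }); // capture a screenshot of the page
await browser.close();                 // close the browser when done
```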
Find the full API Documentation here.
To start your first project, begin with the require statement mentioned above:
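That is:

```javascript
const puppeteer = require('puppeteer');
```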
Then add an asynchronous ES6 immediately invoked arrow function expression (IIFE):
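The skeleton of the script then looks like this:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // the scraping code goes here
})();
```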
Inside the function, declare the Puppeteer browser object and open a tab:
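Inside the IIFE, that looks like the following (headless is the default launch mode):

```javascript
const browser = await puppeteer.launch(); // start a headless Chromium instance
const tab = await browser.newPage();      // open a new tab in it
```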
Now that both the browser and the tab have been initialized, proceed with loading a URL:
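For example (the speakers URL below is a placeholder; substitute the actual Adobe Summit speakers page):

```javascript
// waitUntil: 'networkidle2' resolves once the page has (almost) stopped making requests
await tab.goto('https://summit.adobe.com/speakers/', { waitUntil: 'networkidle2' });
```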
For a first try, let’s just fetch the name of the first speaker from the loaded page and print it to the terminal.
To achieve that, we will first use the waitForSelector() function, which waits until a specific element has loaded, and then the evaluate() function, which takes a function as input and evaluates it in the page’s context.
So the full code for this will be the following:
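A sketch of firstSpeaker.js; note that '.speaker-name' is a placeholder selector, so inspect the live page with DevTools to find the real one:

```javascript
// firstSpeaker.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const tab = await browser.newPage();
  await tab.goto('https://summit.adobe.com/speakers/', { waitUntil: 'networkidle2' });

  // wait until the first speaker name has been rendered
  await tab.waitForSelector('.speaker-name');

  // run a function inside the page's context and return its result
  const firstSpeaker = await tab.evaluate(() => {
    return document.querySelector('.speaker-name').textContent.trim();
  });

  console.log(firstSpeaker);
  await browser.close();
})();
```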
To execute it, run the following command in the terminal: node firstSpeaker.js
In order to retrieve more information about a speaker, we have to click on the speaker's name; a pop-up window then appears with all the necessary information.
To simulate this with our scraper, we will use the click() function and then wait for the pop-up element to appear before retrieving the information, as shown below:
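A sketch of that step, again with placeholder selectors ('.speaker-name' for the link, '.speaker-modal' for the pop-up and the class names inside it):

```javascript
await tab.click('.speaker-name');            // open the speaker's pop-up
await tab.waitForSelector('.speaker-modal'); // wait for the pop-up to render

// collect the details from inside the pop-up
const details = await tab.evaluate(() => {
  const modal = document.querySelector('.speaker-modal');
  const text = (sel) => (modal.querySelector(sel)?.textContent || '').trim();
  return {
    name: text('.name'),
    company: text('.company'),
    jobTitle: text('.job-title'),
    bio: text('.bio'),
  };
});

console.log(details);
```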
Finally, let’s enhance the above code to iterate over all the speakers, fetch their data, and store it in a semicolon-delimited (;) text file whose columns represent:
1. Speaker Name
2. Company Name
3. Job Title
4. Bio
Our final scraper’s code will be the following:
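A sketch of the final scraper, under the same assumptions as before: all CSS selectors ('.speaker-name', '.speaker-modal', '.close', etc.) and the URL are placeholders to be replaced with the real ones from the live page:

```javascript
// scrapeSpeakers.js
const fs = require('fs');
const puppeteer = require('puppeteer');

// join one speaker's fields into a semicolon-delimited row,
// stripping stray semicolons and newlines from the values first
const toRow = (fields) => fields.map((f) => f.replace(/[;\r\n]+/g, ' ').trim()).join(';');

(async () => {
  const browser = await puppeteer.launch();
  const tab = await browser.newPage();
  await tab.goto('https://summit.adobe.com/speakers/', { waitUntil: 'networkidle2' });
  await tab.waitForSelector('.speaker-name');

  const rows = [];
  const speakers = await tab.$$('.speaker-name'); // element handles for every speaker link
  for (const speaker of speakers) {
    await speaker.click();                        // open this speaker's pop-up
    await tab.waitForSelector('.speaker-modal');  // wait for the pop-up to render

    // extract Speaker Name, Company Name, Job Title, Bio from the pop-up
    const data = await tab.evaluate(() => {
      const modal = document.querySelector('.speaker-modal');
      const text = (sel) => (modal.querySelector(sel)?.textContent || '').trim();
      return [text('.name'), text('.company'), text('.job-title'), text('.bio')];
    });
    rows.push(toRow(data));

    await tab.click('.speaker-modal .close');     // close the pop-up before the next click
  }

  fs.writeFileSync('speakers.txt', rows.join('\n'));
  await browser.close();
})();
```

Running node scrapeSpeakers.js writes one row per speaker to speakers.txt, with the four columns separated by semicolons.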