Tips & Guides

Mastering Web Scraping with Puppeteer: A Comprehensive Tutorial

April 11, 2024
15 min read

Introduction

In this tutorial we are going to build a scraper that visits the Adobe Summit website and collects data about its speakers as a use case. Let's start by introducing Puppeteer:

Puppeteer is a Node.js library that provides an API to control the Chromium browser. By default, Puppeteer runs in headless mode, but it can easily be configured to run in non-headless mode. With Puppeteer we can do nearly everything that we can do manually in the Chrome browser and its DevTools.

Install Puppeteer and its dependencies

Since Puppeteer is a Node.js library, make sure that you have Node.js installed; you can download it from here.

Then follow these steps:

1. Create an empty directory for your project

2. Open the directory with the code editor of your choice and run in the terminal: 

 npm install puppeteer : to install the latest version

 npm install puppeteer@19.5.2 : to install a specific version (e.g. 19.5.2)

3. A new folder named “node_modules” will be created in your working directory; it contains Puppeteer and all of its dependencies.

4. Start your scripts with the following line to include Puppeteer:
const puppeteer = require('puppeteer');

Main Puppeteer Functions

Before we begin our tutorial, let's take a look at the most common Puppeteer functions. Assume that tab is an object representing a browser tab:

  • tab.goto(url) : Loads the given URL

  • tab.waitForSelector(css selector) : Waits until a specific DOM element loads

  • tab.click(css selector) : Clicks on an element

  • tab.type(css selector, text) : Types the given text in the selected field

  • tab.evaluate(function) : Evaluates the given function in the page’s context and returns the result; loosely, think of it as running the function in the DevTools Console

  • tab.waitForRequest() : Waits for a specific HTTP request

Find the full API Documentation here.
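
To get a quick feel for how these calls fit together, here is a hypothetical snippet; the URL, selectors, and request path are placeholders chosen for illustration and are not taken from the tutorial's target site:

const puppeteer = require('puppeteer');

// Hypothetical example: the URL, CSS selectors, and request path are placeholders.
(async () => {
  const browser = await puppeteer.launch();
  const tab = await browser.newPage();

  await tab.goto('https://example.com/search');   // load a page
  await tab.waitForSelector('#query');            // wait for the input to render
  await tab.type('#query', 'puppeteer');          // type into the field

  // Start waiting for the request before clicking so it is not missed.
  const [request] = await Promise.all([
    tab.waitForRequest((req) => req.url().includes('/api/search')),
    tab.click('#submit'),
  ]);
  console.log('Triggered request:', request.url());

  // Run code in the page's context, e.g. read the document title.
  const title = await tab.evaluate(() => document.title);
  console.log(title);

  await browser.close();
})();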

Let’s start our scraper

To start your first project, as mentioned above, start with:
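
const puppeteer = require('puppeteer');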

Then add an asynchronous ES6 immediately invoked arrow function expression:
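
(async () => {
  // our scraping code will go here
})();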

Declare the puppeteer browser object and open a tab inside the function:
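
(async () => {
  // Launch the browser (headless by default) and open a new tab.
  const browser = await puppeteer.launch();
  const tab = await browser.newPage();
})();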

Since both the browser and the tab have been initialized, proceed with loading a URL (the address used below is a placeholder for the Adobe Summit speakers page; substitute the real one):
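
  // Placeholder URL for the Adobe Summit speakers page; replace with the real address.
  await tab.goto('https://summit.adobe.com/speakers/');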

For the first try, let's just fetch the name of the first speaker from the loaded page and print it in the terminal.

To achieve that, we will first use the waitForSelector() function, which waits until a specific element loads, and then the evaluate() function, which takes a function as input and evaluates it in the page’s context.

So the full code for this step will look like the following (the URL and CSS selectors are placeholders; adapt them to the page’s actual markup):
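
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const tab = await browser.newPage();

  // Placeholder URL and selector; adjust them to the actual page markup.
  await tab.goto('https://summit.adobe.com/speakers/');
  await tab.waitForSelector('.speaker-name');

  // Run in the page's context and grab the first speaker's name.
  const firstSpeaker = await tab.evaluate(() => {
    return document.querySelector('.speaker-name').textContent.trim();
  });

  console.log(firstSpeaker);
  await browser.close();
})();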

To execute it, run the following command in the terminal: node firstSpeaker.js

In order to retrieve more information about a speaker, we have to click on the speaker's name; a pop-up window will then appear with all the necessary information.

To simulate this with our scraper, we will use the click() function and then wait for the pop-up element to appear before retrieving the information, as shown in the sketch below (the selectors are again placeholders):
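
  // Inside the async function, after the page has loaded.
  // Placeholder selectors for the speaker link and the pop-up; adjust as needed.
  await tab.click('.speaker-name');
  await tab.waitForSelector('.speaker-modal', { visible: true });

  const speaker = await tab.evaluate(() => {
    const modal = document.querySelector('.speaker-modal');
    const text = (selector) => modal.querySelector(selector).textContent.trim();
    return {
      name: text('.speaker-modal-name'),
      company: text('.speaker-modal-company'),
      jobTitle: text('.speaker-modal-title'),
      bio: text('.speaker-modal-bio'),
    };
  });

  console.log(speaker);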

Finally, let's enhance the above code to fetch data for all the speakers and store it in a semicolon (;) delimited text file where the columns represent:

1. Speaker Name

2. Company Name

3. Job Title

4. Bio

Our final scraper’s code will look like the following (again with a placeholder URL and placeholder selectors that you should adapt to the page’s actual markup):
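
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const tab = await browser.newPage();

  // Placeholder URL and selectors throughout; adjust them to the real page markup.
  await tab.goto('https://summit.adobe.com/speakers/');
  await tab.waitForSelector('.speaker-name');

  // Count how many speakers are listed on the page.
  const speakerCount = await tab.evaluate(
    () => document.querySelectorAll('.speaker-name').length
  );

  const rows = [];

  for (let i = 0; i < speakerCount; i++) {
    // Open the i-th speaker's pop-up.
    await tab.evaluate((index) => {
      document.querySelectorAll('.speaker-name')[index].click();
    }, i);

    await tab.waitForSelector('.speaker-modal', { visible: true });

    // Collect name, company, job title, and bio from the pop-up.
    const columns = await tab.evaluate(() => {
      const modal = document.querySelector('.speaker-modal');
      const text = (selector) => {
        const el = modal.querySelector(selector);
        return el ? el.textContent.trim() : '';
      };
      return [
        text('.speaker-modal-name'),
        text('.speaker-modal-company'),
        text('.speaker-modal-title'),
        text('.speaker-modal-bio'),
      ];
    });

    rows.push(columns.join(';'));

    // Close the pop-up before moving on to the next speaker.
    await tab.click('.speaker-modal-close');
  }

  // Write one semicolon-delimited line per speaker.
  fs.writeFileSync('speakers.txt', rows.join('\n'));
  await browser.close();
})();

Save it as, say, scraper.js and run node scraper.js; the results will be written to speakers.txt, one semicolon-delimited line per speaker.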
