How to Build a Scalable Crawler Service With Puppeteer?

Part 1: TypeScript development with Docker

Hoang Dinh
JavaScript in Plain English


A simple crawler service

Today I will talk about a story: how to build a scalable crawler service with Puppeteer. The service has only one public API; it receives a query string and returns a list of Google search result links.

In part 1, we will build a simple service that meets the base requirement. We will use Node.js as the runtime, Puppeteer with a headless Chrome browser, and TypeScript. We will also use Docker and Docker Compose to set up a local development environment.

Develop a TypeScript project with docker-compose

Let’s implement a simple HTTP service with Express and TypeScript.

As with any normal Node.js project, we start with npm's init command:

$ npm init

Then, install TypeScript and the other tooling as dev dependencies, and the runtime dependencies separately:

$ npm install typescript nodemon @types/express -D
$ npm install express -S

Generate a tsconfig.json file:

$ npx tsc --init

And update tsconfig.json to change some settings:

...
"target": "es2018",
"rootDir": "./src",
"outDir": "./dist"
...

TS files will live in the src folder, and compiled files in the dist folder.

In src, create an index.ts file; this file is the entry point of the project. Let’s implement a simple Express server:

// index.ts
import express from 'express';
import environments from './utils/environments';

const app = express();

app.get('/', (req, res) => {
  res.json({ message: 'Hello World!' });
});

app.listen(environments.apiPort, () => {
  console.log(`Server is running on: ${environments.apiPort}`);
});

With utils/environments.ts:

const apiPort = Number(process.env.API_PORT || 3000);

export default {
  apiPort,
};

We can control the server’s HTTP port with the API_PORT environment variable.
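As a side note, `Number(process.env.API_PORT || 3000)` yields `NaN` when the variable is set to a non-numeric value. A slightly more defensive variant (the helper name `numberFromEnv` is mine, not part of the article’s code) could look like this:

```typescript
// Read a numeric environment variable, falling back to a default
// when the variable is unset, empty, or not a valid number.
function numberFromEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw === undefined || raw === '') return fallback;
  const parsed = Number(raw);
  return Number.isNaN(parsed) ? fallback : parsed;
}

const apiPort = numberFromEnv('API_PORT', 3000);
```

With this guard, a typo such as `API_PORT=abc` falls back to the default instead of passing `NaN` to `app.listen`.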

Now, let’s verify our setup and the new server. First, we have to transpile the TypeScript code to JavaScript:

$ npx tsc

No output should be printed in the terminal, and a dist folder will be created.

Start the server:

$ node dist/index.js
> Server is running on: 3000

When we access http://localhost:3000, a message will be shown:

{
  "message": "Hello World!"
}

When we update a .ts file, we have to repeat these steps to make sure the server runs with the latest logic. Let’s make this easier by combining the watch features of nodemon and the TypeScript compiler.

Create nodemon.json to configure the nodemon process: listen only to the dist directory and only watch for .js file changes.

{
  "watch": ["dist/"],
  "ext": "js",
  "delay": 500
}

Now, create two npm scripts in the package.json file:

...
"scripts": {
  "dev": "nodemon ./dist/index.js",
  "build:watch": "rm -rf dist && tsc --watch"
},
...

We still need two terminal windows. One transpiles TS to JavaScript:

$ npm run build:watch

The other watches for changes in the dist directory and restarts the server:

$ npm run dev

Now, when we save a .ts file, the server restarts automatically.
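If two terminal windows feel heavy, the same two scripts could also be combined with the third-party `concurrently` package (an aside of mine; the article deliberately keeps them separate, which maps nicely onto the two Docker services in the next section):

```json
"scripts": {
  "dev": "nodemon ./dist/index.js",
  "build:watch": "rm -rf dist && tsc --watch",
  "dev:all": "concurrently \"npm run build:watch\" \"npm run dev\""
}
```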

Dockerize the development environment

Until now, we don’t need a custom Docker image; the stock node image is enough.

# docker-compose.yml
version: '3.9'
services:
  api:
    image: node:latest
    working_dir: /api
    volumes:
      - .:/api
    command: npm run dev
    ports:
      - '${API_PORT}:${API_PORT}'
    depends_on:
      compile:
        condition: service_healthy
    environment:
      - API_PORT=${API_PORT}
  compile:
    image: node:latest
    working_dir: /source
    volumes:
      - .:/source
    command: npm run build:watch
    healthcheck:
      test: bash -c "[ -f dist/index.js ]"
      interval: 10s
      timeout: 5s
      retries: 5

We mount all project files into both containers. The compile service runs the npm run build:watch command, and the api service runs the npm run dev command. The api service only starts once the dist/index.js file exists (which means the compile service is running successfully).
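The healthcheck command can also be exercised on its own to see exactly what Docker polls for. This hypothetical shell session (the path is illustrative) shows the predicate succeeding once the file exists:

```shell
# Simulate the compiler producing its first output file, then run the
# same predicate the compose healthcheck uses. It exits 0 only when
# dist/index.js exists, so `depends_on` holds the api service back
# until the first compile finishes.
mkdir -p /tmp/compile-demo/dist && touch /tmp/compile-demo/dist/index.js
cd /tmp/compile-demo && bash -c "[ -f dist/index.js ]" && echo "healthy"
```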

For now, we only need one terminal window:

$ docker-compose up

Crawl Google search results with Puppeteer

Puppeteer requires a Chrome browser instance, so we will install Chrome in the api Docker container. To do that, we need a custom Docker image.

# Dockerfile
FROM node:latest

# Install dependencies
RUN apt-get update -qq \
  && apt-get install -qq --no-install-recommends \
    ca-certificates \
    apt-transport-https \
  && apt-get upgrade -qq

# Install Chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
  && echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list \
  && apt-get update -qq \
  && apt-get install -qq --no-install-recommends \
    google-chrome-stable \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

This creates a new image based on the node image and installs the stable version of Chrome. The Chrome binary path is /usr/bin/google-chrome; we need this information to configure Puppeteer.

Update docker-compose.yml to change the api service’s image setting:

services:
  api:
    build: . # instead of image: node:latest

Now we can implement the search API: GET /search?query= .

Install the puppeteer-core package; the core package does not download Chrome, since we already installed Chrome ourselves in the Dockerfile.

$ npm install puppeteer-core -S

In the index.ts file, register a new request handler for the Express server:

...
import { ISearchResult } from './utils/interfaces';
import { getLinksByQuery } from './utils/scraper';
...
app.get('/search', async (req, res) => {
  try {
    const now = Date.now();
    const { query = '' } = req.query as { query: string };
    if (!query) {
      return res.json({
        query,
        took: (Date.now() - now) / 1000,
        links: [],
      } as ISearchResult);
    }
    const links = await getLinksByQuery(query);
    return res.json({
      query,
      took: (Date.now() - now) / 1000,
      links,
    } as ISearchResult);
  } catch (error) {
    console.error(error);
    return res.status(500).json({ message: (error as Error).message });
  }
});

We take the query string from the query object; if it is empty, we respond with an empty list. Otherwise, we try to get the result links with the getLinksByQuery function and return them to the client.

Interface definitions in ./utils/interfaces.ts:

export interface ILink {
  title: string;
  link: string;
}

export interface ISearchResult {
  query: string;
  took: number;
  links: ILink[];
}

The getLinksByQuery function from ./utils/scraper.ts:

import withPage from './browser';
import { ILink } from './interfaces';

export function getLinksByQuery(query: string): Promise<ILink[]> {
  return withPage(async (page) => {
    await page.goto('https://www.google.com/', {
      timeout: 0,
      waitUntil: 'networkidle2',
    });
    await page.type('input', query);
    await page.keyboard.press('Enter');
    await page.waitForSelector('div.yuRUbf > a');
    return page.evaluate(() => {
      const data: ILink[] = [];
      document.querySelectorAll('div.yuRUbf > a').forEach((ele) => {
        data.push({
          title: ele.querySelector('h3')?.textContent as string,
          link: (ele as HTMLAnchorElement).href,
        });
      });
      return data;
    });
  });
}

This function calls the withPage helper to get a new Chrome page instance, then performs the Google search for the query. It returns all result links from the first search results page.

Finally, the withPage helper function in ./utils/browser.ts:

import puppeteer, { Page } from 'puppeteer-core';

export default async function withPage<T>(func: (page: Page) => Promise<T>): Promise<T> {
  const browser = await puppeteer.launch({
    executablePath: '/usr/bin/google-chrome',
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-background-networking',
      '--disable-default-apps',
      '--disable-extensions',
      '--disable-sync',
      '--disable-translate',
      '--headless',
      '--hide-scrollbars',
      '--metrics-recording-only',
      '--mute-audio',
      '--no-first-run',
      '--safebrowsing-disable-auto-update',
      '--ignore-certificate-errors',
      '--ignore-ssl-errors',
      '--ignore-certificate-errors-spki-list',
      '--user-data-dir=/tmp',
    ],
  });
  const page = await browser.newPage();
  try {
    return await func(page);
  } finally {
    await page.close();
    await browser.close();
  }
}

withPage is a generic function that accepts one parameter: an “action” function. We launch a new headless browser with Puppeteer, create a new page instance from that browser, pass the page instance to the action function, and return its result. Finally, we close the page and the browser.

The await keyword when calling the action function is very important: it makes sure the page and browser are only closed after the action function is done.
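The same acquire/use/release shape works for any resource, and a minimal sketch with a stand-in resource (all names here are illustrative, not from the article) shows why the `await` inside `try` matters:

```typescript
// Generic acquire/use/release helper. Because we `await` the action
// inside the try block, the finally clause (and thus the release)
// only runs after the action's promise has fully settled, even if
// it rejects. Returning the bare promise instead would trigger the
// release before the action finished.
async function withResource<R, T>(
  acquire: () => Promise<R>,
  release: (resource: R) => Promise<void>,
  action: (resource: R) => Promise<T>,
): Promise<T> {
  const resource = await acquire();
  try {
    return await action(resource);
  } finally {
    await release(resource);
  }
}
```

In withPage, `acquire` corresponds to launching the browser and opening a page, and `release` to closing both.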

Now we can try the new API from a browser or any HTTP client:

$ curl 'http://localhost:3000/search?query=typescript%202022'

It will return something like this:

{
  "query": "typescript 2022",
  "took": 4.052,
  "links": [
    {
      "title": "100万行の大規模なJavaScript製システムをTypeScriptに移行 ...",
      "link": "https://developers.cyberagent.co.jp/blog/archives/34364/"
    },
    ...
  ]
}

That’s all for part 1!

Conclusion

I hope you enjoyed and took away something useful from this article. This is my favorite way to start with a Node.js + TypeScript project.
The source code used in this article is published on GitHub.
Thanks for reading!

