How to Build a Scalable Crawler Service with Puppeteer?
Part 1: TypeScript development with Docker
Today I will talk about how to build a scalable crawler service with Puppeteer. The service has only one public API: it receives a query string and returns a list of Google search result links.
In part 1, we just build a simple service to cover the base requirement. We will use Node.js as the runtime environment, Puppeteer with a headless Chrome browser, and TypeScript. We also use Docker and Docker Compose to set up a local development environment.
Develop a TypeScript project with docker-compose
Let’s implement a simple HTTP service with Express and TypeScript.
As with any Node.js project, we start with the npm init command:
$ npm init
Then, install TypeScript as a dev dependency along with the other dependencies:
$ npm install typescript nodemon @types/express -D
$ npm install express -S
Generate the tsconfig.json file:
$ npx tsc --init
And update tsconfig.json to change some settings:
...
"target": "es2018",
"rootDir": "./src",
"outDir": "./dist"
...
TS files will be placed in the src folder, and compiled files in the dist folder.
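For reference, a complete minimal tsconfig.json consistent with these settings might look like the sketch below. The extra options are common defaults I am assuming, not settings mandated by this article; esModuleInterop in particular is needed for the default-style express import we use in index.ts:

```json
{
  "compilerOptions": {
    "target": "es2018",
    "module": "commonjs",
    "esModuleInterop": true,
    "strict": true,
    "rootDir": "./src",
    "outDir": "./dist",
    "skipLibCheck": true
  }
}
```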
In src we create an index.ts file; this file is the entry point of the project. Let’s implement a simple Express server:
// index.ts
import express from 'express';
import environments from './utils/environments';

const app = express();

app.get('/', (req, res) => {
  res.json({ message: 'Hello World!' });
});

app.listen(environments.apiPort, () => {
  console.log(`Server is running on: ${environments.apiPort}`);
});
With utils/environments.ts:
const apiPort = Number(process.env.API_PORT || 3000);

export default {
  apiPort,
};
We can control the server HTTP port with the API_PORT environment variable.
Now, we will verify our setup and the new server. First, we have to transpile the TypeScript code to JavaScript:
$ npx tsc
No output should be printed in the terminal, and a dist folder will be created.
Start the server:
$ node dist/index.js
> Server is running on: 3000
When we access http://localhost:3000, a message will be shown:
{
  "message": "Hello World!"
}
When we update a TS file, we have to repeat these steps to make sure the server is running with the latest logic. Let’s make this easier by using the watch features of nodemon and the TypeScript compiler.
Create nodemon.json to configure the nodemon process: just watch the dist directory, and only for JS file changes.
{
  "watch": ["dist/"],
  "ext": "js",
  "delay": 500
}
Now, create 2 npm scripts in the package.json file:
...
"scripts": {
  "dev": "nodemon ./dist/index.js",
  "build:watch": "rm -rf dist && tsc --watch"
},
...
We still need 2 terminal windows. One window transpiles TS to JavaScript:
$ npm run build:watch
The other window watches for changes in the dist directory and restarts the server:
$ npm run dev
Now, when we save a TS file, the server will be restarted automatically.
Dockerize the development environment
So far, we don’t need a custom Docker image; the official node image is enough.
# docker-compose.yml
version: '3.9'
services:
  api:
    image: node:latest
    working_dir: /api
    volumes:
      - .:/api
    command: npm run dev
    ports:
      - '${API_PORT}:${API_PORT}'
    depends_on:
      compile:
        condition: service_healthy
    environment:
      - API_PORT=${API_PORT}
  compile:
    image: node:latest
    working_dir: /source
    volumes:
      - .:/source
    command: npm run build:watch
    healthcheck:
      test: bash -c "[ -f dist/index.js ]"
      interval: 10s
      timeout: 5s
      retries: 5
We mount all project files into both containers. The compile service runs the npm run build:watch command, and the api service runs the npm run dev command. The api service only starts once the dist/index.js file exists (which means the compile service is running successfully).
For now, we just need 1 terminal window:
$ docker-compose up
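Note that docker-compose substitutes ${API_PORT} from the shell environment or from a .env file next to docker-compose.yml; if the variable is unset, the ports mapping will fail. A minimal .env sketch (the port value is just an example):

```
# .env (read automatically by docker-compose)
API_PORT=3000
```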
Crawl Google search results with Puppeteer
Puppeteer requires a Chrome browser instance, so we will install Chrome in the api Docker container. To do that, we need a custom Docker image.
# Dockerfile
FROM node:latest

# Install dependencies
RUN apt-get update -qq \
  && apt-get install -qq --no-install-recommends \
    ca-certificates \
    apt-transport-https \
  && apt-get upgrade -qq

# Install chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
  && echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list \
  && apt-get update -qq \
  && apt-get install -qq --no-install-recommends \
    google-chrome-stable \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
We create a new image based on the node image and install the stable Chrome version. The Chrome binary path is /usr/bin/google-chrome; we will need this path to configure Puppeteer.
Update docker-compose.yml to change the api service image setting:
services:
  api:
    build: . # instead of image: node:latest
Now we can implement the search API: GET /search?query=.
Install the puppeteer-core package; the core package does not download Chrome. We have already installed Chrome ourselves in the Dockerfile.
$ npm install puppeteer-core -S
In the index.ts file, register a new request handler on the Express server:
...
import { ISearchResult } from './utils/interfaces';
import { getLinksByQuery } from './utils/scraper';
...

app.get('/search', async (req, res) => {
  try {
    const now = Date.now();
    const { query = '' } = req.query as { query: string };

    if (!query) {
      return res.json({
        query,
        took: (Date.now() - now) / 1000,
        links: [],
      } as ISearchResult);
    }

    const links = await getLinksByQuery(query);

    return res.json({
      query,
      took: (Date.now() - now) / 1000,
      links,
    } as ISearchResult);
  } catch (error) {
    console.error(error);
    return res.status(500).json({ message: (error as Error).message });
  }
});
We take the query string from the query object; if it is empty, we respond with an empty list. Otherwise, we try to get the result links with the getLinksByQuery function and return them to the client.
Interface definitions in ./utils/interfaces:
export interface ILink {
  title: string;
  link: string;
}

export interface ISearchResult {
  query: string;
  took: number;
  links: ILink[];
}
The getLinksByQuery function from ./utils/scraper:
import withPage from './browser';
import { ILink } from './interfaces';

export function getLinksByQuery(query: string): Promise<ILink[]> {
  return withPage(async (page) => {
    await page.goto('https://www.google.com/', {
      timeout: 0,
      waitUntil: 'networkidle2',
    });
    await page.type('input', query);
    await page.keyboard.press('Enter');
    await page.waitForSelector('div.yuRUbf > a');

    return page.evaluate(() => {
      const data: ILink[] = [];
      document.querySelectorAll('div.yuRUbf > a').forEach((ele) => {
        data.push({
          title: ele.querySelector('h3')?.textContent as string,
          link: (ele as HTMLAnchorElement).href,
        });
      });
      return data;
    });
  });
}
This function calls the withPage helper to get a new Chrome page instance, then performs the Google search for the query. It returns all result links from the first page of search results.
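Be aware that Google’s result markup changes frequently, so the div.yuRUbf > a selector may break at any time, and textContent can be null when a result has no h3 title. As a hedged sketch (this helper is not part of the article’s code), the extraction step can be written as a pure function that drops incomplete entries instead of casting a possibly-null title to string:

```typescript
interface ILink {
  title: string;
  link: string;
}

// Pure version of the logic inside page.evaluate(): given (title, href)
// pairs scraped from result anchors, keep only complete entries.
function toLinks(raw: Array<{ title: string | null; href: string }>): ILink[] {
  return raw
    .filter((r): r is { title: string; href: string } => r.title !== null && r.href.length > 0)
    .map((r) => ({ title: r.title, link: r.href }));
}
```

Keeping this logic pure also makes it easy to unit-test without launching a browser.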
Finally, the withPage helper function in ./utils/browser.ts:
import puppeteer, { Page } from 'puppeteer-core';

export default async function withPage<T>(func: (page: Page) => Promise<T>): Promise<T> {
  const browser = await puppeteer.launch({
    executablePath: '/usr/bin/google-chrome',
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-background-networking',
      '--disable-default-apps',
      '--disable-extensions',
      '--disable-sync',
      '--disable-translate',
      '--headless',
      '--hide-scrollbars',
      '--metrics-recording-only',
      '--mute-audio',
      '--no-first-run',
      '--safebrowsing-disable-auto-update',
      '--ignore-certificate-errors',
      '--ignore-ssl-errors',
      '--ignore-certificate-errors-spki-list',
      '--user-data-dir=/tmp',
    ],
  });

  const page = await browser.newPage();

  try {
    return await func(page);
  } finally {
    await page.close();
    await browser.close();
  }
}
withPage is a generic function: it accepts one parameter, an “action” function. We create a new headless browser with Puppeteer and a new page from that browser instance, pass the page to the “action” function, and return the result of the action function. Finally, we just close the page and the browser.
The await keyword when calling the action function is very important: it makes sure that the page and browser only close after the action function is done.
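The same acquire/use/release shape can be sketched without Puppeteer at all. This hypothetical withResource helper (the names are illustrative, not from the article’s code) shows why return await matters: without the await, the finally block would release the resource before the pending promise settles.

```typescript
// Generic acquire/use/release helper, mirroring the structure of withPage.
async function withResource<R, T>(
  acquire: () => Promise<R>,
  release: (r: R) => Promise<void>,
  func: (r: R) => Promise<T>,
): Promise<T> {
  const resource = await acquire();
  try {
    // `await` here guarantees release() runs only after func() has settled.
    return await func(resource);
  } finally {
    await release(resource);
  }
}

// Usage sketch: the "resource" is a number, and a log records open/close order.
const events: string[] = [];
withResource(
  async () => { events.push('open'); return 42; },
  async () => { events.push('close'); },
  async (n) => { events.push(`use:${n}`); return n * 2; },
).then((result) => {
  console.log(events.join(','), '->', result); // open,use:42,close -> 84
});
```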
Now we can try the new API with a browser or any HTTP client:
$ curl 'http://localhost:3000/search?query=typescript%202022'
It will return something like this:
{
"query": "typescript 2022",
"took": 4.052,
"links": [
{
"title": "100万行の大規模なJavaScript製システムをTypeScriptに移行 ...",
"link": "https://developers.cyberagent.co.jp/blog/archives/34364/"
},
...
]
}
That’s all for part 1!
Conclusion
I hope you enjoyed and took away something useful from this article. This is my favorite way to start with a Node.js + TypeScript project.
The source code used in this article is published on GitHub.
Thanks for reading!