Use Puppeteer to scrape data and save it as JSON

2024-07-11

Introduction to Puppeteer

Puppeteer is a Node.js library developed by the Google Chrome team that provides a high-level API for controlling Chrome or Chromium, typically in headless mode. It can handle a variety of tasks, including page navigation, content scraping, screenshots, and PDF generation.

Main features

Headless browser control: Runs tasks without opening a browser window.
Cross-platform: Supports Windows, Linux, and macOS.
Rich API: Provides methods for simulating user actions such as clicking, typing, and navigating.

Using Puppeteer for data scraping

Basic process

Start the browser: Launch a headless browser with Puppeteer.
Open a page: Create a new page instance and navigate to the target URL.
Wait for the page to load: Make sure the page has fully loaded before reading from it.
Scrape content: Use the Puppeteer API to extract the page content.
Log the result: Write the scraped content or related information to a log file.
Close the browser: Close the browser after the task is completed.
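
The six steps above can be sketched roughly as follows. This is a minimal illustration, not the article's own implementation: the function name `scrapeTitle`, the log line format, and the choice to pass the `puppeteer` module in as a parameter are all assumptions made so that the flow can be tried without a real browser. Only the page title is read here, as a stand-in for whatever content needs to be scraped.

```javascript
const fs = require('fs');

// Hypothetical helper: runs the basic process once for a single URL.
// `puppeteer` is injected (an assumption of this sketch) rather than required here.
async function scrapeTitle(puppeteer, url, logFile) {
  // 1. Start a headless browser.
  const browser = await puppeteer.launch({ headless: true });
  try {
    // 2. Open a new page and navigate to the target URL.
    const page = await browser.newPage();
    // 3. 'networkidle2' waits until network activity has mostly stopped.
    await page.goto(url, { waitUntil: 'networkidle2' });

    // 4. Read content from the page; here only the document title.
    const title = await page.title();

    // 5. Append one line per scrape to the log file.
    fs.appendFileSync(logFile, `${new Date().toISOString()} ${url} ${title}\n`);
    return title;
  } finally {
    // 6. Close the browser even if an earlier step throws.
    await browser.close();
  }
}
```

With Puppeteer installed, a call might look like `scrapeTitle(require('puppeteer'), 'https://example.com', 'scrape.log')`. Closing the browser in a `finally` block keeps a failed navigation from leaving a Chromium process running.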

Implementation process

Suppose we need to scrape the table data from a web page. Here are the steps to achieve it:

const puppeteer = require('puppeteer');
const http = require('http');

const proxyHost = "www.16yun.cn";
const proxyPort = "5445";
const proxyUser = "16QMSOML";
const proxyPass = "280651";

// Create an HTTP proxy server
const proxy = http.createServer((req, res) =
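
Separately from the proxy setup, the "save it as JSON" step for table data might look something like the sketch below. It assumes the table has already been read into an array of rows whose first row holds the column headers; the helper names `rowsToRecords` and `saveAsJson` are hypothetical and do not come from the original listing.

```javascript
const fs = require('fs');

// Turn a header row plus data rows into an array of plain objects.
// Cells missing from a short row are stored as null.
function rowsToRecords(rows) {
  const [header, ...body] = rows;
  return body.map((cells) =>
    Object.fromEntries(header.map((name, i) => [name, cells[i] ?? null]))
  );
}

// Write the records to disk as pretty-printed JSON.
function saveAsJson(records, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(records, null, 2), 'utf8');
}

// With a Puppeteer page, the rows could be collected along these lines
// (the 'table tr' selector is an assumption about the target page):
//   const rows = await page.$$eval('table tr', (trs) =>
//     trs.map((tr) =>
//       Array.from(tr.querySelectorAll('th, td'), (c) => c.textContent.trim())));
```

For example, `saveAsJson(rowsToRecords(rows), 'table.json')` would write one JSON object per table row, keyed by the header text.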
