
Use Puppeteer to scrape data and save it as JSON

2024-07-11



Introduction to Puppeteer

Puppeteer is a Node.js library developed by the Google Chrome team that provides a high-level API for controlling Chrome or Chromium, which it runs in headless mode by default. It can handle a variety of tasks, including page navigation, content scraping, screenshots, and PDF generation.

Main features

  • Headless browser control: Perform tasks without opening a browser window.
  • Cross-platform: Supports Windows, Linux, and macOS.
  • Rich API: Simulates user actions such as clicking, typing, and navigating between pages.

Using Puppeteer for data scraping

Basic process

  1. Start a browser: Launch a headless browser using Puppeteer.
  2. Open a page: Create a new page instance and navigate to the target URL.
  3. Wait for the page to load: Make sure the content you need is fully loaded.
  4. Scrape the content: Use the API provided by Puppeteer to extract the page content.
  5. Record the results: Write the captured content or related information to a log file.
  6. Close the browser: Close the browser after the task is completed.
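
The six steps above can be sketched roughly as follows. This is a minimal illustration, not the article's own implementation: the names `rowsToRecords` and `scrapeTable`, the `table tr` selector, and the output file names `table.json` and `scrape.log` are all assumptions made for the example. The sketch also assumes the first table row holds the column headers.

```javascript
const fs = require('fs');

// Hypothetical helper: convert scraped rows (first row = headers)
// into an array of plain objects that can be serialized as JSON.
function rowsToRecords(rows) {
  if (rows.length === 0) return [];
  const [headers, ...body] = rows;
  return body.map((cells) =>
    // Missing cells are filled with null so every record has the same keys.
    Object.fromEntries(headers.map((header, i) => [header, cells[i] ?? null]))
  );
}

// Hypothetical function following steps 1-6. The selector and file names
// are placeholders; adjust them to the page being scraped.
async function scrapeTable(url, options = {}) {
  const {
    selector = 'table tr',
    outFile = 'table.json',
    logFile = 'scrape.log',
  } = options;

  // Loaded lazily so rowsToRecords can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');

  const browser = await puppeteer.launch();                // 1. start a headless browser
  try {
    const page = await browser.newPage();                  // 2. open a page
    await page.goto(url, { waitUntil: 'networkidle2' });
    await page.waitForSelector(selector);                  // 3. wait for the table rows
    const rows = await page.$$eval(selector, (trs) =>      // 4. scrape the cell text
      trs.map((tr) =>
        Array.from(tr.querySelectorAll('th, td'), (cell) => cell.textContent.trim())
      )
    );
    const records = rowsToRecords(rows);
    fs.writeFileSync(outFile, JSON.stringify(records, null, 2));
    fs.appendFileSync(                                     // 5. record what was captured
      logFile,
      `${new Date().toISOString()} scraped ${records.length} rows from ${url}\n`
    );
    return records;
  } finally {
    await browser.close();                                 // 6. always close the browser
  }
}
```

A call such as `scrapeTable('https://example.com/table')` (a placeholder URL) would write the records to `table.json` and append one line to `scrape.log`. Closing the browser in a `finally` block is a design choice that keeps a failed navigation from leaving a Chromium process behind.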

Implementation process

Suppose we need to scrape the table data from a web page. The following code shows one way to do it:

const puppeteer = require('puppeteer');
const http = require('http');

const proxyHost = "www.16yun.cn";
const proxyPort = "5445";
const proxyUser = "16QMSOML";
const proxyPass = "280651";

// Create an HTTP proxy server
const proxy = http.createServer((req, res) =