Technology Sharing

AIGC crawler code example: Scrapy and OpenAI API to crawl and generate content

2024-07-12


Having worked in the crawler industry for many years, I know that writing all the required code by hand is laborious, detail-heavy work. With the rise of AI, I began wondering whether crawling and generating the desired text content could be automated end to end. My plan is to do this by combining crawler technology (such as Scrapy) with a generative AI model (such as GPT-4).

The following are my thoughts on an AIGC crawler class, showing how to build an AIGC crawler application.


1. Install necessary dependencies

First, make sure you have Scrapy and OpenAI’s API client library installed.

pip install scrapy openai

2. Configure OpenAI API

You will need to have an OpenAI API key and configure environment variables or use it directly in your code.
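As a minimal sketch of the environment-variable approach (`OPENAI_API_KEY` is the conventional variable name; the fallback value below is only a demo placeholder):

```python
import os

# Read the API key from the environment instead of hard-coding it in the spider.
# Set it in your shell first:  export OPENAI_API_KEY="sk-..."
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder-for-demo")  # demo fallback only
api_key = os.environ["OPENAI_API_KEY"]
```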

3. Create a Scrapy crawler

The following is a basic Scrapy spider example that crawls content and generates new content.

my_spider.py

import scrapy
import openai

class AIGCSpider(scrapy.Spider):
    name = 'aigc_spider'
    start_urls = ['http://example.com']

    def __init__(self, *args, **kwargs):
        super(AIGCSpider, self).__init__(*args, **kwargs)
        openai.api_key = 'your-openai-api-key'  # Replace with your OpenAI API key

    def parse(self, response):
        # Extract the page text
        content = response.xpath('//body//text()').getall()
        content = ' '.join(content).strip()

        # Generate new content with OpenAI
        generated_content = self.generate_content(content)

        # Handle the generated content, e.g. append it to a file
        with open('generated_content.txt', 'a') as f:
            f.write(generated_content + '\n')

        self.log(f"Generated content for {response.url}")

    def generate_content(self, prompt):
        try:
            response = openai.Completion.create(
                engine="davinci-codex",  # deprecated engine name; substitute a model your account supports
                prompt=prompt,
                max_tokens=150
            )
            generated_text = response.choices[0].text.strip()
            return generated_text
        except Exception as e:
            self.log(f"Error generating content: {e}")
            return ""
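One practical wrinkle: `//body//text()` pulls in navigation text and long runs of whitespace, and the joined string can easily exceed the model's context window. A small helper (hypothetical, not part of the spider above) to normalize and truncate the prompt before calling the API might look like:

```python
import re

def clean_prompt(text: str, max_chars: int = 4000) -> str:
    """Collapse runs of whitespace and truncate to a rough size budget.

    max_chars is a crude stand-in for a real token count; adjust it
    to the limits of whichever model you actually call.
    """
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed[:max_chars]

print(clean_prompt("  Hello \n\n world  "))  # → "Hello world"
```

In `parse`, you would pass `clean_prompt(content)` to `generate_content` instead of the raw joined text.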

4. Configure the Scrapy project

Make sure `settings.py` configures appropriate settings such as `USER_AGENT` and the download delay:

settings.py
BOT_NAME = 'aigc_bot'

SPIDER_MODULES = ['aigc_bot.spiders']
NEWSPIDER_MODULE = 'aigc_bot.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# User agent
USER_AGENT = 'aigc_bot (+http://www.yourdomain.com)'

# Download delay
DOWNLOAD_DELAY = 1
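If the target site is rate-sensitive, Scrapy's built-in AutoThrottle extension can adapt the delay per response instead of using a fixed `DOWNLOAD_DELAY`; a sketch of extra `settings.py` entries (the values are illustrative):

```python
# Enable AutoThrottle so Scrapy adjusts the delay based on server load
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0          # ceiling for the adaptive delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
```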

5. Run the crawler

Run the Scrapy crawler from the command line:

scrapy crawl aigc_spider

6. Extended functionality

Handling multiple pages

Revise the `parse` method so that it can handle multiple pages and crawl to depth.

def parse(self, response):
    # Extract the page text
    content = response.xpath('//body//text()').getall()
    content = ' '.join(content).strip()

    # Generate new content with OpenAI
    generated_content = self.generate_content(content)

    # Handle the generated content, e.g. append it to a file
    with open('generated_content.txt', 'a') as f:
        f.write(f"URL: {response.url}\n")
        f.write(generated_content + '\n\n')

    self.log(f"Generated content for {response.url}")

    # Follow every link on the page
    for href in response.css('a::attr(href)').getall():
        yield response.follow(href, self.parse)
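Following every `href` will quickly wander off the start domain. A small filter (a hypothetical helper, not part of the spider above) that resolves relative links and keeps only same-domain ones could look like:

```python
from urllib.parse import urljoin, urlparse

def same_domain_links(base_url: str, hrefs: list) -> list:
    """Resolve relative hrefs against base_url and keep only same-domain results."""
    base_host = urlparse(base_url).netloc
    out = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_host:
            out.append(absolute)
    return out
```

In `parse` you would then iterate over `same_domain_links(response.url, hrefs)` instead of the raw hrefs; Scrapy's `allowed_domains` class attribute offers similar built-in filtering.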

Add more build settings

Adjust the generation parameters, for example raising `temperature` and `top_p`, to produce more diverse content.

def generate_content(self, prompt):
    try:
        response = openai.Completion.create(
            engine="davinci-codex",  # deprecated engine name; substitute a model your account supports
            prompt=prompt,
            max_tokens=150,
            temperature=0.7,
            top_p=0.9
        )
        generated_text = response.choices[0].text.strip()
        return generated_text
    except Exception as e:
        self.log(f"Error generating content: {e}")
        return ""

The above is how I combined Scrapy and the OpenAI API to build an AIGC crawler application that automatically crawls website content and generates new content from it. This approach suits scenarios that need content generation at scale, such as content creation and data augmentation. In real applications, the crawling and generation logic will likely need finer-grained control and optimization to meet the needs of different kinds of crawlers.