Technology Sharing

Python crawler and output

2024-07-12


1. Python crawler and output examples

Here is an example of a simple web crawler written in Python. It fetches a web page (we use https://example.com as a stand-in; in actual use, replace it with the URL of a real website that you are allowed to crawl) and prints its title. Since directly accessing and crawling real websites may raise copyright and legal issues, we only provide a conceptual example here.

To accomplish this task, we will use Python's requests library to send HTTP requests and the BeautifulSoup library to parse the HTML content. If these libraries are not installed yet, we can install them via pip:

pip install requests beautifulsoup4

Here is the complete code example:

# Import the necessary libraries
import requests
from bs4 import BeautifulSoup

def fetch_website_title(url):
    """
    Fetch the title of the specified web page and return it.

    Parameters:
        url (str): The URL of the page to fetch.

    Returns:
        str: The page title, or an error message if the fetch fails.
    """
    try:
        # Send an HTTP GET request
        response = requests.get(url)
        # Check whether the request succeeded
        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
            # Look for the page's <title> tag
            title_tag = soup.find('title')
            # If a <title> tag was found, return its text
            if title_tag:
                return title_tag.get_text(strip=True)
            else:
                return "No title found."
        else:
            return f"Failed to retrieve the webpage. Status code: {response.status_code}"
    except requests.RequestException as e:
        return f"Error fetching the webpage: {e}"

# Example URL (replace it with the URL of the page you want to crawl)
url = 'https://example.com'

# Call the function and print the result
title = fetch_website_title(url)
print(f"The title of the webpage is: {title}")

Notes

(1) Since https://example.com is only a placeholder for this example, when actually running the code we need to replace it with a valid web page URL that we are allowed to crawl.

(2) The crawler should comply with the target website's robots.txt file and respect the site's copyright and access restrictions.

(3) Some websites have anti-crawler mechanisms, such as User-Agent checks and request-frequency limits. We may need to adjust our request headers (such as User-Agent, as the second example below demonstrates) or use a proxy to get around these restrictions.

(4) For more complex web page structures or more advanced data-scraping requirements, we may need to learn more about HTML, CSS selectors, XPath, and network requests; a small CSS-selector sketch follows these notes.
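As a small illustration of note (4), the sketch below shows how BeautifulSoup's select() method accepts CSS selectors to extract more than just the title. The selectors used here ('h1' and 'a[href]') are assumptions made for illustration, not guarantees about any particular site's structure.

# A minimal CSS-selector sketch with BeautifulSoup.
# The selectors 'h1' and 'a[href]' are illustrative assumptions;
# inspect the real page's HTML before relying on any selector.
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags
for heading in soup.select('h1'):
    print('Heading:', heading.get_text(strip=True))

# Attribute selectors work too: every <a> tag that carries an href
for link in soup.select('a[href]'):
    print('Link:', link['href'])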

2. More detailed code examples

Here is a more detailed Python crawler code example. This time we will again use the requests library to send HTTP requests and the BeautifulSoup library to parse the HTML content, but against a real website (we use https://www.wikipedia.org purely as an example; actual crawling should follow the site's robots.txt rules and copyright policy).

First, make sure requests and beautifulsoup4 are installed. If not, install them using pip:

pip install requests beautifulsoup4

We can then use the following code to scrape and print the title of the Wikipedia homepage:

# Import the necessary libraries
import requests
from bs4 import BeautifulSoup

def fetch_and_parse_title(url):
    """
    Send an HTTP GET request to the given URL, parse the HTML content,
    and return the page title.

    Parameters:
        url (str): The URL of the page to fetch.

    Returns:
        str: The page title, or an error message if fetching or parsing fails.
    """
    try:
        # Set a User-Agent header to simulate a regular browser visit
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
        }
        # Send an HTTP GET request
        response = requests.get(url, headers=headers)

        # Check whether the request succeeded
        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Look for the page's <title> tag
            title_tag = soup.find('title')

            # Extract and return the title text
            if title_tag:
                return title_tag.get_text(strip=True)
            else:
                return "No title found in the webpage."
        else:
            return f"Failed to retrieve the webpage. Status code: {response.status_code}"
    except requests.RequestException as e:
        return f"Error fetching the webpage: {e}"

# Example URL (the Wikipedia homepage is used here as an example)
url = 'https://www.wikipedia.org'

# Call the function and print the result
title = fetch_and_parse_title(url)
print(f"The title of the webpage is: {title}")

This code first sets a request header (headers) containing a User-Agent field to simulate a real browser visit, because some websites check request headers to block crawlers. It then sends a GET request to the specified URL and uses BeautifulSoup to parse the returned HTML content. Next, it looks for the HTML <title> tag and extracts its text content as the title of the web page. Finally, it prints the title to the console.
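One refinement worth mentioning here (a common requests idiom, not something the example above uses): passing a timeout to requests.get and calling raise_for_status() makes the request step more robust. The sketch below is illustrative only; fetch_html is a hypothetical helper name and the 10-second timeout is an arbitrary choice.

# A hedged variation on the request step: a timeout prevents the script
# from hanging indefinitely, and raise_for_status() turns HTTP error
# codes into exceptions caught by the same except block.
# fetch_html is a hypothetical helper; 10 seconds is an illustrative value.
import requests

def fetch_html(url, headers=None):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching the webpage: {e}")
        return None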

Please note that although this example uses Wikipedia, in real projects we should always adhere to the target website's robots.txt file and copyright policies to ensure our crawling practices are legal and ethical.
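To make the robots.txt point concrete, here is a minimal sketch using Python's standard-library urllib.robotparser to check whether a URL may be fetched before requesting it. The user-agent name 'MyCrawler' is a hypothetical placeholder; a real crawler should identify itself honestly.

# A minimal robots.txt check with the standard library.
# 'MyCrawler' is a hypothetical user-agent name used for illustration.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url('https://www.wikipedia.org/robots.txt')
robots.read()  # download and parse the robots.txt file

url = 'https://www.wikipedia.org'
if robots.can_fetch('MyCrawler', url):
    print('robots.txt permits crawling this URL.')
else:
    print('robots.txt disallows crawling this URL; skip it.')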