| Software | Version |
| --- | --- |
| python | 3.12.0 |
| scrapy | 2.11.2 |
| twisted | 24.3.0 |
Installation
pip install scrapy
pip install scrapy_playwright
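scrapy-playwright drives a real browser, so the Playwright browser binaries also need to be downloaded after installing the packages (chromium shown here as an example):
playwright install chromium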
Create a project
An optional directory can be added after the project name; it is usually set to the current directory.
scrapy startproject <project_name> [project_dir]
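For example, to create the project in the current directory (the project name here is illustrative):
scrapy startproject stackoverflow .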
Create a spider
scrapy genspider <spidername> <site_domain>
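For example (spider name and domain are illustrative; note that Scrapy refuses to create a spider with the same name as the project):
scrapy genspider questions stackoverflow.com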
Define the Item
from scrapy import Field, Item

class CustomItem(Item):
    one_field = Field()
    another_field = Field()
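For the StackOverflow example used later in this post, a sketch of the item might look like this (field names match what parse_detail fills in below; the class name is an assumption):

from scrapy import Field, Item

class StackoverflowItem(Item):
    question = Field()     # question body HTML
    answer = Field()       # accepted-answer HTML, if any
    channel = Field()      # data-source tag
    create_time = Field()  # crawl timestamp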
Basic settings.py configuration
BOT_NAME  # crawler name
ROBOTSTXT_OBEY = True  # whether to obey the robots.txt protocol
USER_AGENT = ""  # the User-Agent sent while crawling
CONCURRENT_REQUESTS = 16  # 16 concurrent requests by default
DOWNLOAD_DELAY = 3  # download delay; usually worth setting so consecutive requests are not fired too quickly
COOKIES_ENABLED = False  # enabled by default; usually only needed when the site requires login
SPIDER_MIDDLEWARES  # spider middlewares
DOWNLOADER_MIDDLEWARES = {  # downloader middlewares
    "first.middlewares.FirstDownloaderMiddleware": 543,  # lower value = higher priority
}
ITEM_PIPELINES = {  # pipeline configuration: which pipeline handles each item
    "first.pipelines.FirstPipeline": 300,  # lower value = higher priority
}
The UA can be generated at random with fake_useragent:
pip install fake_useragent
Usage:
from fake_useragent import UserAgent
ua = UserAgent().random
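To give every outgoing request a fresh UA, one option is a small downloader middleware (a minimal sketch; it must be registered under DOWNLOADER_MIDDLEWARES to take effect):

from fake_useragent import UserAgent

class RandomUserAgentMiddleware:
    """Set a random User-Agent on each outgoing request."""

    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua.random
        return None  # let Scrapy continue handling the request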
Integrate scrapy-playwright
Edit the settings.py file and add or modify the following:
DOWNLOAD_HANDLERS = {
    # "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,  # whether to run in headless mode
    # 'args': ['--proxy-server=http://127.0.0.1:10809', '--user-agent={}'.format(ua)],  # set a proxy server and a random UA (ua as generated above)
}
Write the spider
import datetime

import scrapy


class ZhipinSpider(scrapy.Spider):
    name = "xxx"
    allowed_domains = ["xxx.com"]
    # start_urls = ["https://xxx.com"]

    # Define the start URLs
    def start_requests(self):
        # ... some code here ...
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
            ),
            callback=self.parse,
        )

    def parse(self, response):
        # Follow each question's detail page for further parsing
        # ... some code here ...
        yield scrapy.Request(
            url,
            callback=self.parse_detail,
            meta=dict(
                playwright=True,
                item=item,
            ),
        )
        # Fetch the next page
        # ... some code here ...
        yield scrapy.Request(
            self.baseurl + next_page + "&pagesize=50",
            callback=self.parse,
            meta=dict(
                playwright=True,
            ),
        )

    def parse_detail(self, response):
        item = response.meta['item']
        # The .postcell node holds the question body
        item['question'] = response.css("div.postcell div.s-prose").get()
        # The .accepted-answer node tells whether an answer was accepted
        item['answer'] = response.css("div.accepted-answer div.s-prose").get()
        item['channel'] = 'stackoverflow'
        item['create_time'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        yield item
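With the spider in place, run it by the name defined above; -O overwrites the output file on each run, which is convenient while testing (the file name is illustrative):
scrapy crawl xxx -O items.json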
Enable pipelines
settings.py
ITEM_PIPELINES = {
"stackoverflow.pipelines.StackoverflowPipeline": 300,
}
Edit pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class StackoverflowPipeline:
    def process_item(self, item, spider):
        # Persist the data here (e.g., write it to a database)
        return item
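As a sketch of what that persistence step could look like, here is a hypothetical SQLite variant (database, table, and column names are assumptions; open_spider/close_spider keep one connection for the whole crawl):

import sqlite3

from itemadapter import ItemAdapter


class SqlitePipeline:
    def open_spider(self, spider):
        # Open the database once when the spider starts
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS qa "
            "(question TEXT, answer TEXT, channel TEXT, create_time TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        d = ItemAdapter(item).asdict()
        self.conn.execute(
            "INSERT INTO qa VALUES (?, ?, ?, ?)",
            (d.get("question"), d.get("answer"),
             d.get("channel"), d.get("create_time")),
        )
        return item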