| Software | Version |
| --- | --- |
| python | 3.12.0 |
| scrapy | 2.11.2 |
| twisted | 24.3.0 |
Installation
pip install scrapy
pip install scrapy_playwright
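scrapy-playwright drives a real browser, so the Playwright browser binaries also need to be downloaded after installing the packages (chromium shown here as an example):
playwright install chromium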
Create a project
An optional directory can be added after the project name; it is usually set to the current directory.
scrapy startproject <project_name> [project_dir]
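For example, to create the project in the current directory (the project name here is illustrative):
scrapy startproject stackoverflow .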
Create a spider
scrapy genspider <spidername> <site_domain>
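For example (spider name and domain are illustrative; note that Scrapy refuses to create a spider with the same name as the project):
scrapy genspider questions stackoverflow.com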
Define the Item
from scrapy import Field, Item

class CustomItem(Item):
    one_field = Field()
    another_field = Field()
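For the StackOverflow example used later in this post, a sketch of the item might look like this (field names match what parse_detail fills in below; the class name is an assumption):

from scrapy import Field, Item

class StackoverflowItem(Item):
    question = Field()     # question body HTML
    answer = Field()       # accepted-answer HTML, if any
    channel = Field()      # data-source tag
    create_time = Field()  # crawl timestamp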
Basic settings.py configuration
BOT_NAME  # crawler name
ROBOTSTXT_OBEY = True  # whether to obey the robots.txt protocol
USER_AGENT = ""  # the User-Agent sent while crawling
CONCURRENT_REQUESTS = 16  # 16 concurrent requests by default
DOWNLOAD_DELAY = 3  # download delay; usually worth setting so consecutive requests are not fired too quickly
COOKIES_ENABLED = False  # enabled by default; usually only needed when the site requires login
SPIDER_MIDDLEWARES  # spider middlewares
DOWNLOADER_MIDDLEWARES = {  # downloader middlewares
    "first.middlewares.FirstDownloaderMiddleware": 543,  # lower value = higher priority
}
ITEM_PIPELINES = {  # pipeline configuration: which pipeline handles each item
    "first.pipelines.FirstPipeline": 300,  # lower value = higher priority
}
The UA can be generated at random with fake_useragent:
pip install fake_useragent
Usage:
from fake_useragent import UserAgent
ua = UserAgent().random
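To give every outgoing request a fresh UA, one option is a small downloader middleware (a minimal sketch; it must be registered under DOWNLOADER_MIDDLEWARES to take effect):

from fake_useragent import UserAgent

class RandomUserAgentMiddleware:
    """Set a random User-Agent on each outgoing request."""

    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua.random
        return None  # let Scrapy continue handling the request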
Integrate scrapy-playwright
Edit the settings.py file and add or modify the following:
DOWNLOAD_HANDLERS = {
    # "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,  # whether to run in headless mode
    # 'args': ['--proxy-server=http://127.0.0.1:10809', '--user-agent={}'.format(ua)],  # set a proxy server and a random UA (ua as generated above)
}
Write the spider
import datetime

import scrapy


class ZhipinSpider(scrapy.Spider):
    name = "xxx"
    allowed_domains = ["xxx.com"]
    # start_urls = ["https://xxx.com"]

    # Define the start URLs
    def start_requests(self):
        # ... some code here ...
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
            ),
            callback=self.parse,
        )

    def parse(self, response):
        # Follow each question's detail page for further parsing
        # ... some code here ...
        yield scrapy.Request(
            url,
            callback=self.parse_detail,
            meta=dict(
                playwright=True,
                item=item,
            ),
        )
        # Fetch the next page
        # ... some code here ...
        yield scrapy.Request(
            self.baseurl + next_page + "&pagesize=50",
            callback=self.parse,
            meta=dict(
                playwright=True,
            ),
        )

    def parse_detail(self, response):
        item = response.meta['item']
        # The .postcell node holds the question body
        item['question'] = response.css("div.postcell div.s-prose").get()
        # The .accepted-answer node tells whether an answer was accepted
        item['answer'] = response.css("div.accepted-answer div.s-prose").get()
        item['channel'] = 'stackoverflow'
        item['create_time'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        yield item
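With the spider in place, run it by the name defined above; -O overwrites the output file on each run, which is convenient while testing (the file name is illustrative):
scrapy crawl xxx -O items.json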
Enable pipelines
settings.py
ITEM_PIPELINES = {
"stackoverflow.pipelines.StackoverflowPipeline": 300,
}
Edit pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class StackoverflowPipeline:
    def process_item(self, item, spider):
        # Persist the data here (e.g., write it to a database)
        return item
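As a sketch of what that persistence step could look like, here is a hypothetical SQLite variant (database, table, and column names are assumptions; open_spider/close_spider keep one connection for the whole crawl):

import sqlite3

from itemadapter import ItemAdapter


class SqlitePipeline:
    def open_spider(self, spider):
        # Open the database once when the spider starts
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS qa "
            "(question TEXT, answer TEXT, channel TEXT, create_time TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        d = ItemAdapter(item).asdict()
        self.conn.execute(
            "INSERT INTO qa VALUES (?, ?, ?, ?)",
            (d.get("question"), d.get("answer"),
             d.get("channel"), d.get("create_time")),
        )
        return item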