Introduction

https://blog.csdn.net/ck784101777/article/details/104468780

The Five Core Components of the Scrapy Framework

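The Scrapy framework has five core components:

Scheduler: a priority queue of URLs waiting to be crawled; it filters out duplicate URLs and can be customized
Downloader: downloads network resources at high speed; it is built on Twisted, an efficient asynchronous networking framework
Spider: user-defined crawling logic that extracts information from pages, typically with XPath or CSS selectors or regular expressions, and produces entities (Items)
Item Pipeline: processes the Items scraped by the spider: validates them, removes unwanted information, and persists them
Scrapy Engine: controls the Scheduler, the Downloader, and the Spiders, and coordinates the data flow between them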
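
To make the division of labour concrete, here is a toy model of the five components in plain Python. It is only a sketch under stated assumptions: it does not use Scrapy, all names (`Scheduler`, `download`, `spider_parse`, `pipeline`, `run_engine`) are hypothetical, the "download" is faked, and everything runs synchronously, whereas the real framework is asynchronous.

```python
import heapq
from itertools import count


class Scheduler:
    """Toy Scheduler: a priority queue of URLs that drops duplicates."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker: FIFO order within one priority

    def enqueue(self, url, priority=0):
        if url in self._seen:
            return False  # duplicate URL, not queued again
        self._seen.add(url)
        # heapq is a min-heap, so negate: higher priority pops first
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def download(url):
    """Stand-in for the Downloader: returns fake page text."""
    return f"<html>content of {url}</html>"


def spider_parse(url, body):
    """Stand-in for a Spider callback: turns a response into an item."""
    return {"url": url, "length": len(body)}


def pipeline(item, store):
    """Stand-in for an Item Pipeline: validates, then persists the item."""
    if item["length"] > 0:
        store.append(item)


def run_engine(start_urls):
    """Stand-in for the Engine: moves data between the other components."""
    scheduler, store = Scheduler(), []
    for url, priority in start_urls:
        scheduler.enqueue(url, priority)
    while (url := scheduler.next_url()) is not None:
        pipeline(spider_parse(url, download(url)), store)
    return store


items = run_engine([
    ("http://example.com/a", 0),
    ("http://example.com/b", 5),
    ("http://example.com/a", 0),  # duplicate, filtered by the scheduler
])
print([item["url"] for item in items])
```

The duplicate start URL is queued only once, and the URL with the higher priority is processed first.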

Architecture diagram

Installing Scrapy and Creating a Project

Install:

bash

pip install scrapy

Create a project and generate a spider:

bash

scrapy startproject ProjectName

cd ProjectName
scrapy genspider CrawlerName URL.com

Run the spider:

bash

scrapy crawl CrawlerName

Project Directory Structure

bash

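ProjectName              # project folder
│  scrapy.cfg            # basic project configuration file
└─ProjectName            # project package
    │  items.py          # item (data structure) definitions
    │  middlewares.py    # middlewares
    │  pipelines.py      # item processing
    │  settings.py       # project-wide settings
    │  __init__.py
    │
    ├─spiders
    │  │  test.py        # spider file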
    │  │  __init__.py
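
As an illustration of what could go into `pipelines.py`, here is a minimal sketch of an item pipeline that writes each item as one JSON line. The class name `JsonLinesPipeline` and the file name `items.jl` are assumptions made for this example; the method names `open_spider`, `close_spider`, and `process_item` are the hooks Scrapy calls on a pipeline. The demo at the bottom drives the class by hand, without Scrapy, only to show the call order.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline: persists every item as one JSON line."""

    def __init__(self, path="items.jl"):  # output path is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; return it so later pipelines can use it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# Manual demo of the call order (Scrapy would normally make these calls).
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Python", "link": "/python/"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Inside a project, a pipeline like this only runs once it is listed in the `ITEM_PIPELINES` setting in `settings.py`.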

Project Configuration File `settings.py`
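
As a rough sketch of what `settings.py` might contain, the fragment below sets a few commonly used Scrapy settings. The values are illustrative assumptions rather than recommendations, and the pipeline path points at a hypothetical class named `JsonLinesPipeline`.

```python
# settings.py (illustrative fragment; all values are assumptions)

BOT_NAME = "ProjectName"

# Where Scrapy looks for spiders and where genspider puts new ones.
SPIDER_MODULES = ["ProjectName.spiders"]
NEWSPIDER_MODULE = "ProjectName.spiders"

# Respect robots.txt rules of the crawled sites.
ROBOTSTXT_OBEY = True

# Throttling: maximum concurrent requests and delay (seconds) between them.
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1

# Headers sent with every request unless a request overrides them.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enabled item pipelines; lower numbers run first.
# "JsonLinesPipeline" is a hypothetical class name.
ITEM_PIPELINES = {
    "ProjectName.pipelines.JsonLinesPipeline": 300,
}
```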

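Crawling Multiple Pages with Scrapy

There are two ways to crawl multiple pages:

Extract a list of sub-page URLs from one or more index pages; parse() then requests each sub-page in the list in turn
Crawl recursively, following the links found on each crawled page

The following shows how to obtain the list of sub-page URLs: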

python

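import scrapy


class DmozItem(scrapy.Item):
    # Normally defined in items.py; declared here so the example is self-contained.
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class PageSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/"]

    def parse(self, response):
        # Collect the sub-page links from the index page.
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            # response.urljoin() takes only the (possibly relative) URL;
            # the response's own URL serves as the base.
            url = response.urljoin(href.extract())
            # The callback parse_dir_contents handles each sub-page.
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Build one item per list entry on the sub-page.
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item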
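
The second form listed above, recursive crawling, can be sketched without Scrapy to show the control flow. This is a minimal illustration under stated assumptions: the "site" is an in-memory dictionary (`PAGES`) instead of real HTTP responses, and `LinkParser` and `crawl` are hypothetical helper names. In a Scrapy spider the same pattern is usually expressed by yielding further `scrapy.Request` objects from a callback, with the framework filtering duplicate requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Fake site: URL -> HTML body (assumption, stands in for downloaded pages).
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="c">C</a>',
    "http://example.com/c": "no links here",
}


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(url, seen=None):
    """Visit url, then recurse into every link not visited before."""
    seen = set() if seen is None else seen
    if url in seen or url not in PAGES:
        return []
    seen.add(url)

    parser = LinkParser()
    parser.feed(PAGES[url])

    visited = [url]
    for href in parser.links:
        # Resolve relative links against the current page URL.
        visited += crawl(urljoin(url, href), seen)
    return visited


order = crawl("http://example.com/")
print(order)
```

The `seen` set is what keeps the recursion from looping forever when pages link back to each other.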
