Scrapy 中英双语文档 by scoful¶
- This documentation contains everything you need to know about Scrapy.
- 这个文档包含所有的Scrapy相关的资料。
Getting help | 获取帮助¶
- Having trouble? We’d like to help!
- 有问题?我们很希望能给予帮助!
- Try the FAQ – it’s got answers to some common questions.
- 尝试看看 问题集锦 – 它包含大部分大家都会遇到的问题。
- Looking for specific information? Try the 索引 or 模块索引.
- 想找更具体的信息? 试试看 索引 或是 模块索引 (索引或是模块索引)。
- Ask or search questions in StackOverflow using the scrapy tag,
- 或是在 StackOverflow using the scrapy tag 上提问或直接搜索,
- Search for information in the archives of the scrapy-users mailing list, or post a question.
- 在邮件列表里搜索 archives of the scrapy-users mailing list , 或是提问 post a question 。
- Ask a question in the #scrapy IRC channel,
- 在这个频道上提问 #scrapy IRC channel,
- Report bugs with Scrapy in our issue tracker.
- 在 issue tracker 上报bug。
First steps | 第一步¶
- Scrapy at a glance | Scrapy一览
- Understand what Scrapy is and how it can help you.
- 了解Scrapy是什么及它有什么用。
- Installation guide | 安装向导
- Get Scrapy installed on your computer.
- 安装Scrapy。
- Scrapy Tutorial | Scrapy教程
- Write your first Scrapy project.
- 开发第一个Scrapy项目。
- Examples | 例子
- Learn more by playing with a pre-made Scrapy project.
- 学习怎么玩转一个成品Scrapy项目。
Basic concepts | 基础概念¶
- Command line tool | 命令行工具
- Learn about the command-line tool used to manage your Scrapy project.
- 学习用命令行模式管理Scrapy项目。
- Spiders | 爬虫
- Write the rules to crawl your websites.
- 编写爬取网站的爬虫。
- Selectors | 选择器
- Extract the data from web pages using XPath.
- 用XPath来从网页上提取数据。
- Scrapy shell | Scrapy 命令行模式
- Test your extraction code in an interactive environment.
- 在互动环境测试爬虫代码。
- Items
- Define the data you want to scrape.
- 定义想要爬取的数据结构。
- Item Loaders
- Populate your items with the extracted data.
- 把爬取到的数据填充到数据结构里。
- Item Pipeline
- Post-process and store your scraped data.
- 爬取完数据后处理数据并保存。
- Feed exports
- Output your scraped data using different formats and storages.
- 导出爬取的数据,用不同的格式存储。
- Requests and Responses
- Understand the classes used to represent HTTP requests and responses.
- 理解用到的 HTTP requests 和 responses 。
- Link Extractors
- Convenient classes to extract links to follow from pages.
- 便捷得爬取网页上的链接。
- Settings
- Learn how to configure Scrapy and see all available settings.
- 学习如何配置Scrapy ,查看Scrapy所有的配置 available settings 。
- Exceptions
- See all available exceptions and their meaning.
- 查看所有可能出现的异常报错。
Built-in services | 内置服务¶
- Logging
- Learn how to use Python’s builtin logging on Scrapy.
- 学习如何在Scrapy上使用python内置的日志模块。
- Stats Collection
- Collect statistics about your scraping crawler.
- 收集你的爬虫的统计信息。
- Sending e-mail
- Send email notifications when certain events occur.
- 当某些事件发生时自动发邮件提醒。
- Telnet Console
- Inspect a running crawler using a built-in Python console.
- 通过python内置的控制台检查爬虫。
- Web Service
- Monitor and control a crawler using a web service.
- 通过一个网页服务来监控和控制爬虫。
Solving specific problems | 解决具体的问题¶
- Frequently Asked Questions
- Get answers to most frequently asked questions.
- 看看大部分人经常问到的问题的答案。
- Debugging Spiders
- Learn how to debug common problems of your scrapy spider.
- 学习怎么调试你的爬虫项目。
- Spiders Contracts
- Learn how to use contracts for testing your spiders.
- 学习怎么用contracts(#todo契约?合同?)来测试你的爬虫。
- Common Practices
- Get familiar with some Scrapy common practices.
- 熟悉一些Scrapy实践例子。
- Broad Crawls
- Tune Scrapy for crawling a lot domains in parallel.
- 当爬很多域名的时候,如何并行优化。
- Using Firefox for scraping
- Learn how to scrape with Firefox and some useful add-ons.
- 学习如何借助Firefox和一些有用的插件。
- Using Firebug for scraping
- Learn how to scrape efficiently using Firebug.
- 学习如何有效的利用Firebug。
- Debugging memory leaks
- Learn how to find and get rid of memory leaks in your crawler.
- 学习在你的爬虫项目里找到和解决内存溢出。
- Downloading and processing files and images
- Download files and/or images associated with your scraped items.
- 下载相关的文件和图片。
- Deploying Spiders
- Deploying your Scrapy spiders and run them in a remote server.
- 在远程服务器上部署爬虫。
- AutoThrottle extension | 负载均衡拓展
- Adjust crawl rate dynamically based on load.
- 根据负载动态调整爬虫速度。
- Benchmarking
- Check how Scrapy performs on your hardware.
- 检查爬虫是如何在硬件(#todo?)上运行的。
- Jobs: pausing and resuming crawls
- Learn how to pause and resume crawls for large spiders.
- 学习如何暂停和恢复爬虫。
Extending Scrapy | 相关拓展¶
- Architecture overview | 体系结构概述
- Understand the Scrapy architecture.
- 理解Scrapy的体系结构。
- Downloader Middleware
- Customize how pages get requested and downloaded.
- 设计一个页面如何被请求和下载。
- Spider Middleware
- Customize the input and output of your spiders.
- 设计爬虫的入参和出参。
- Extensions
- Extend Scrapy with your custom functionality
- 自定义拓展Scrapy的功能。
- Core API 核心API
- Use it on extensions and middlewares to extend Scrapy functionality
- 用中间件拓展Scrapy的功能。
- Signals
- See all available signals and how to work with them.
- 查看所有可能的信号(#todo?)及其如何运行的。
- Item Exporters
- Quickly export your scraped items to a file (XML, CSV, etc).
- 快速导出爬取的数据到本地文件里,支持XML,CSV等等。
All the rest | 其他事项¶
- Release notes
- See what has changed in recent Scrapy versions.
- 查看最新版本的Scrapy的改变。
- Contributing to Scrapy
- Learn how to contribute to the Scrapy project.
- 学习如何给Scrapy本体项目做贡献。
- Versioning and API Stability
- Understand Scrapy versioning and API stability.
- 了解Scrapy的版本管理和API稳定性。