{"id":204653,"date":"2025-05-29T14:58:24","date_gmt":"2025-05-29T06:58:24","guid":{"rendered":"https:\/\/server.hk\/cnblog\/204653\/"},"modified":"2025-05-29T14:58:24","modified_gmt":"2025-05-29T06:58:24","slug":"%e5%a6%82%e4%bd%95%e4%bf%ae%e6%94%b9crawlspider%e8%a7%a3%e6%9e%90%e5%90%8e%e7%9a%84%e9%93%be%e6%8e%a5%ef%bc%9f","status":"publish","type":"post","link":"https:\/\/server.hk\/cnblog\/204653\/","title":{"rendered":"How do you modify the links a CrawlSpider extracts?"},"content":{"rendered":"<h1>How do you modify the links a CrawlSpider extracts?<\/h1>\n<p>This article explains how to rewrite the links that a CrawlSpider rule extracts before they are crawled. If you have a better suggestion or spot a problem, please point it out in the comments.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.17golang.com\/uploads\/20241112\/1731405285673325e58a2a8.jpg\" class=\"aligncenter\"><\/p>\n<p><strong>Modifying the links a CrawlSpider rule extracts<\/strong><\/p>\n<p>When you configure a CrawlSpider, each Rule specifies which pages to crawl and how to parse them. Sometimes the links a Rule extracts need extra processing, such as a change of URL format.<\/p>\n<p><strong>How to modify the links<\/strong><\/p>\n<p>To modify the links a Rule extracts in a CrawlSpider, you can take the following approach:<\/p>\n<p>Define a process_request method in a downloader middleware. Scrapy calls this method for each request before it is sent to the target site, so the request can be changed there.<\/p>\n<p>process_request receives one request at a time, and every request passes through it, including those generated from links a Rule extracted.<\/p>\n<p>For detail-page links that need changing, use a regular expression, urllib.parse, or a similar tool to pick out the matching URLs.<\/p>\n<p>Build the modified link and return a new Request for it in place of the original. Scrapy drops the original request and schedules the returned one, so the modified link is the one that gets crawled. Mark the new request (for example with a flag in meta) so the middleware does not rewrite it again.<\/p>\n<p><strong>Code example<\/strong><\/p>\n<p>Taking the rules from the question as an example, the downloader middleware can be implemented as follows:<\/p>\n<pre>from urllib.parse import urljoin\n\n\nclass CustomDownloaderMiddleware:\n    def process_request(self, request, spider):\n        # CrawlSpider stores the matching rule's index in meta['rule'], so\n        # its absence means the request did not come from a Rule.\n        # 'url_rewritten' is an illustrative flag that prevents a rewrite loop.\n        if 'rule' not in request.meta or request.meta.get('url_rewritten'):\n            return None\n        if 'eastmoney' not in request.url:\n            return None\n        # Match the detail-page URL format and rebuild it\n        page_id = request.url.rstrip('\/').split('\/')[-1]\n        modified_url = urljoin(request.url, '\/a\/' + page_id + '.html')\n        # Returning a Request makes Scrapy schedule it instead of the original\n        meta = {**request.meta, 'url_rewritten': True}\n        return request.replace(url=modified_url, meta=meta)<\/pre>\n<p>Enable the custom middleware in settings.py:<\/p>\n<pre>DOWNLOADER_MIDDLEWARES = {\n    'project.middlewares.CustomDownloaderMiddleware': 543,\n}<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>\u5982\u4f55\u4fee\u6539CrawlSpider\u89e3&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4925],"tags":[],"class_list":["post-204653","post","type-post","status-publish","format-standard","hentry","category-4925"],"_links":{"self":[{"href":"https:\/\/server.hk\/cnblog\/wp-json\/wp\/v2\/posts\/204653","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/server.hk\/cnblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/server.hk\/cnblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/server.hk\/cnblog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/server.hk\/cnblog\/wp-json\/wp\/v2\/comments?post=204653"}],"version-history":[{"count":0,"href":"https:\/\/server.hk\/cnblog\/wp-json\/wp\/v2\/posts\/204653\/revisions"}],"wp:attachment":[{"href":"https:\/\/server.hk\/cnblog\/wp-json\/wp\/v2\/media?parent=204653"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/server.hk\/cnblog\/wp-json\/wp\/v2\/categories?post=204653"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/server.hk\/cnblog\/wp-json\/wp\/v2\/tags?post=204653"}],"curies":[{"nam
e":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}