<span class="com"># -*- coding: utf-8 -*-</span> <span class="kwd">import</span><span class="pln"> scrapy </span><span class="com"># item class included here </span> <span class="kwd">class</span> <span class="typ">DmozItem</span><span class="pun">(</span><span class="pln">scrapy</span><span class="pun">.</span><span class="typ">Item</span><span class="pun">):</span> <span class="com"># define the fields for your item here like:</span><span class="pln"> link </span><span class="pun">=</span><span class="pln"> scrapy</span><span class="pun">.</span><span class="typ">Field</span><span class="pun">()</span><span class="pln"> attr </span><span class="pun">=</span><span class="pln"> scrapy</span><span class="pun">.</span><span class="typ">Field</span><span class="pun">()</span> <span class="kwd">class</span> <span class="typ">DmozSpider</span><span class="pun">(</span><span class="pln">scrapy</span><span class="pun">.</span><span class="typ">Spider</span><span class="pun">):</span><span class="pln"> name </span><span class="pun">=</span> <span class="str">"dmoz"</span><span class="pln"> allowed_domains </span><span class="pun">=</span> <span class="pun">[</span><span class="str">"craigslist.org"</span><span class="pun">]</span><span class="pln"> start_urls </span><span class="pun">=</span> <span class="pun">[</span> <span class="str">"http://chicago.craigslist.org/search/emd?"</span> <span class="pun">]</span><span class="pln"> BASE_URL </span><span class="pun">=</span> <span class="str">'http://chicago.craigslist.org/'</span> <span class="kwd">def</span><span class="pln"> parse</span><span class="pun">(</span><span class="pln">self</span><span class="pun">,</span><span class="pln"> response</span><span class="pun">):</span><span class="pln"> links </span><span class="pun">=</span><span class="pln"> response</span><span class="pun">.</span><span class="pln">xpath</span><span class="pun">(</span><span class="str">'//a[@class="hdrlnk"]/@href'</span><span 
class="pun">).</span><span class="pln">extract</span><span class="pun">()</span> <span class="kwd">for</span><span class="pln"> link </span><span class="kwd">in</span><span class="pln"> links</span><span class="pun">:</span><span class="pln"> absolute_url </span><span class="pun">=</span><span class="pln"> self</span><span class="pun">.</span><span class="pln">BASE_URL </span><span class="pun">+</span><span class="pln"> link </span><span class="kwd">yield</span><span class="pln"> scrapy</span><span class="pun">.</span><span class="typ">Request</span><span class="pun">(</span><span class="pln">absolute_url</span><span class="pun">,</span><span class="pln"> callback</span><span class="pun">=</span><span class="pln">self</span><span class="pun">.</span><span class="pln">parse_attr</span><span class="pun">)</span> <span class="kwd">def</span><span class="pln"> parse_attr</span><span class="pun">(</span><span class="pln">self</span><span class="pun">,</span><span class="pln"> response</span><span class="pun">):</span><span class="pln"> item </span><span class="pun">=</span> <span class="typ">DmozItem</span><span class="pun">()</span><span class="pln"> item</span><span class="pun">[</span><span class="str">"link"</span><span class="pun">]</span> <span class="pun">=</span><span class="pln"> response</span><span class="pun">.</span><span class="pln">url item</span><span class="pun">[</span><span class="str">"attr"</span><span class="pun">]</span> <span class="pun">=</span> <span class="str">""</span><span class="pun">.</span><span class="pln">join</span><span class="pun">(</span><span class="pln">response</span><span class="pun">.</span><span class="pln">xpath</span><span class="pun">(</span><span class="str">"//p[@class='attrgroup']//text()"</span><span class="pun">).</span><span class="pln">extract</span><span class="pun">())</span> <span class="kwd">return</span><span class="pln"> item</span>
Use Google’s Cache to crawl sites
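One simple way to do this is to rewrite each URL so it is fetched from Google's cache rather than the origin site. This sketch assumes the `webcache.googleusercontent.com` query format shown below, which Google may change or rate-limit:

```python
def google_cache_url(url):
    # Prefix a URL so the request goes to Google's cached copy of the
    # page instead of the origin server.
    return "http://webcache.googleusercontent.com/search?q=cache:" + url

print(google_cache_url("http://chicago.craigslist.org/search/emd"))
```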
Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher); see the DOWNLOAD_DELAY setting
- if possible, use Google cache to fetch pages, instead of hitting the sites directly
- use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.
- use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera.
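The first tip, rotating user agents, can be sketched as a Scrapy downloader middleware. The class name, the pool contents, and the hookup via DOWNLOADER_MIDDLEWARES are assumptions for illustration; in practice you would use a longer, regularly refreshed list of agent strings:

```python
import random

# Small sample pool of browser user-agent strings (illustrative only).
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
]


class RotateUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent per request.

    Enable it in settings.py, e.g.:
    DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotateUserAgentMiddleware": 400}
    """

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
        return None  # let Scrapy continue processing the request normally
```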
Quora: What are the best web crawling services?
Many services seem to present an API layer I'm not interested in (keyword search/trends/related tweets/whatever). I'm mainly looking for services that provide the greatest volume of aggregated web content as a feed, upon which I can do whatever analytics I need.
APIFIER: Hosted web crawler for developers
Crawl and extract data from websites that employ AJAX, complex pagination or infinite scroll using the same tools you already use for your front-end development.