Pyronos creates a “results” folder in the current directory to store the output of every load test.
# Simple usage.
pyronos get 25 simple

# Send head request.
pyronos head 25 simple

# Dump logs.
pyronos get 25 simple -d

# Send requests sequentially.
pyronos get 25 simple -s

# Print progress of sequential requests.
pyronos get 25 simple -s -p

$ pyronos -h
usage: pyronos [-h] [-f {simple,stem,step}] [-o {csv,json,yml}] [-s] [-p] [-d] [-v]
               url {get,head,options,delete,post,put} num_of_reqs

Simple and sweet load testing module.

positional arguments:
  url                   url of website
  {get,head,options,delete,post,put}
                        http method
  num_of_reqs           number of requests

optional arguments:
  -h, --help            show this help message and exit
  -f {simple,stem,step}, --figure {simple,stem,step}
                        type of figure
  -o {csv,json,yml}, --output {csv,json,yml}
                        type of output
  -s, --sequential      sequential requests
  -p, --print-progress  print progress
  -d, --dump-logs       dump logs
  -v, --version         show program's version number and exit
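To make the argument order in the help text easier to read, here is a minimal argparse sketch that mirrors it. It is not pyronos's actual source: the function name `build_parser`, the destination name `method`, and the integer type for `num_of_reqs` are assumptions, and defaults are left unset because the help text does not state them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the interface printed by `pyronos -h`."""
    parser = argparse.ArgumentParser(
        prog="pyronos",
        description="Simple and sweet load testing module.",
    )
    # Positional arguments, in the order shown in the usage line.
    parser.add_argument("url", help="url of website")
    parser.add_argument(
        "method",  # assumed destination name; the help shows only the choices
        choices=["get", "head", "options", "delete", "post", "put"],
        help="http method",
    )
    parser.add_argument("num_of_reqs", type=int, help="number of requests")  # int is assumed
    # Optional arguments.
    parser.add_argument("-f", "--figure", choices=["simple", "stem", "step"],
                        help="type of figure")
    parser.add_argument("-o", "--output", choices=["csv", "json", "yml"],
                        help="type of output")
    parser.add_argument("-s", "--sequential", action="store_true",
                        help="sequential requests")
    parser.add_argument("-p", "--print-progress", action="store_true",
                        help="print progress")
    parser.add_argument("-d", "--dump-logs", action="store_true",
                        help="dump logs")
    return parser


if __name__ == "__main__":
    # http://example.com is a placeholder URL.
    demo = ["http://example.com", "get", "25", "-f", "simple", "-s", "-p"]
    print(build_parser().parse_args(demo))
```

Read this way, the usage line expects the URL first, then the HTTP method, then the request count, with the figure type passed through `-f`.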
Python Envelopes: mailing for human beings.
Envelopes is a wrapper for Python’s email and smtplib modules. It aims to make working with outgoing e-mail in Python simple and fun.
from envelopes import Envelope, GMailSMTP

envelope = Envelope(
    from_addr=(u'from@example.com', u'From Example'),
    to_addr=(u'to@example.com', u'To Example'),
    subject=u'Envelopes demo',
    text_body=u"I'm a helicopter!"
)
envelope.add_attachment('/Users/bilbo/Pictures/helicopter.jpg')

# Send the envelope using an ad-hoc connection...
envelope.send('smtp.googlemail.com', login='from@example.com',
              password='password', tls=True)

# Or send the envelope using a shared GMail connection...
gmail = GMailSMTP('from@example.com', 'password')
gmail.send(envelope)
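For comparison, this is roughly what the same message looks like when built directly with the standard library's email and smtplib modules, which Envelopes wraps. It is only a sketch: the helper names `build_message` and `send_message` are hypothetical, port 587 is an assumption, and the attachment step is left out.

```python
import smtplib
from email.message import EmailMessage
from email.utils import formataddr


def build_message() -> EmailMessage:
    """Build the demo message using only the standard library."""
    msg = EmailMessage()
    msg["From"] = formataddr(("From Example", "from@example.com"))
    msg["To"] = formataddr(("To Example", "to@example.com"))
    msg["Subject"] = "Envelopes demo"
    msg.set_content("I'm a helicopter!")
    return msg


def send_message(msg: EmailMessage, host: str, login: str, password: str,
                 port: int = 587) -> None:
    """Send over SMTP with STARTTLS. Port 587 is an assumed default."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(login, password)
        smtp.send_message(msg)


if __name__ == "__main__":
    # Only builds and prints the message; nothing is sent.
    print(build_message())
```

Calling `send_message(build_message(), 'smtp.googlemail.com', 'from@example.com', 'password')` would correspond to the ad-hoc `envelope.send(...)` call, assuming the server accepts STARTTLS on that port.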
Make Scrapy follow links and collect data
# -*- coding: utf-8 -*-
import scrapy


# item class included here
class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://chicago.craigslist.org/search/emd?"
    ]

    BASE_URL = 'http://chicago.craigslist.org/'

    def parse(self, response):
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_attr)

    def parse_attr(self, response):
        item = DmozItem()
        item["link"] = response.url
        item["attr"] = "".join(response.xpath("//p[@class='attrgroup']//text()").extract())
        return item
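The spider builds absolute URLs by concatenating BASE_URL with each extracted href. That works when every href has the expected shape, but it produces a double slash for root-relative links and a broken URL for links that are already absolute. The sketch below shows the difference with the standard library's `urljoin`; the hrefs and the helper name `absolutize` are made up for illustration. Scrapy responses also expose a `urljoin` method, so `response.urljoin(link)` could be used inside `parse` in the same way.

```python
from urllib.parse import urljoin

BASE_URL = "http://chicago.craigslist.org/"


def absolutize(base: str, href: str) -> str:
    """Resolve href against base, whether href is relative or already absolute."""
    return urljoin(base, href)


if __name__ == "__main__":
    # Hypothetical hrefs covering the three common shapes.
    samples = [
        "/emd/123.html",                              # root-relative
        "emd/123.html",                               # relative
        "http://chicago.craigslist.org/emd/123.html", # already absolute
    ]
    for href in samples:
        print("concat :", BASE_URL + href)
        print("urljoin:", absolutize(BASE_URL, href))
```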
Use Google’s Cache to crawl sites
Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher). See DOWNLOAD_DELAY setting.
- if possible, use Google cache to fetch pages, instead of hitting the sites directly
- use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.
- use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera.
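The first three tips can be sketched in a few lines. COOKIES_ENABLED, DOWNLOAD_DELAY and DOWNLOADER_MIDDLEWARES are real Scrapy settings; everything else here is an assumption for illustration: the class name `RotateUserAgentMiddleware`, the module path `myproject.middlewares`, the order value 400, and the placeholder user-agent strings, which should be replaced with real browser strings.

```python
import random

# Placeholder pool: swap in real, current browser user-agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/3.0",
]


class RotateUserAgentMiddleware:
    """Hypothetical downloader middleware that picks a random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # returning None lets Scrapy keep processing the request


# Settings fragment: could go in settings.py or a spider's custom_settings.
CUSTOM_SETTINGS = {
    "COOKIES_ENABLED": False,  # tip: disable cookies
    "DOWNLOAD_DELAY": 2,       # tip: delay of 2 seconds or higher
    "DOWNLOADER_MIDDLEWARES": {
        # Assumed module path and order value.
        "myproject.middlewares.RotateUserAgentMiddleware": 400,
    },
}
```

The middleware deliberately avoids importing Scrapy, since a downloader middleware only needs to provide the `process_request` method; proxy rotation would follow the same pattern but is left out here.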