Falcon: it’s easy to use, very fast (for a Python framework, that is), and has 100% unit test coverage.
Requests per second:

    CPython 3.4.3    19.5
    PyPy 2.5.1      256.4
GitHub: Mass Downloader using Python map
How to write a simple multi-threaded mass downloader using workerpool
Solution 3: Multi-threaded, using map
WorkerPool implements a `map` method which is similar to Python’s native `map` method. This is a convenient shortcut for when writing a custom `Job` class is more work than it’s worth.

```python
# download3.py - Download many URLs using multiple threads, with the ``map`` method.
import os
import urllib
import workerpool

def download(url):
    url = url.strip()
    save_to = os.path.basename(url)
    urllib.urlretrieve(url, save_to)
    print "Downloaded %s" % url

# Initialize a pool, 5 threads in this case
pool = workerpool.WorkerPool(size=5)

# The ``download`` method will be called with a line from the second
# parameter for each job.
pool.map(download, open("urls.txt").readlines())

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
```
pycurl CurlMulti example
```python
import pycurl
from cStringIO import StringIO

urls = [...]  # list of urls

# reqs: List of individual requests.
# Each list element will be a 3-tuple of url (string), response string buffer
# (cStringIO.StringIO), and request handle (pycurl.Curl object).
reqs = []

# Build multi-request object.
m = pycurl.CurlMulti()
for url in urls:
    response = StringIO()
    handle = pycurl.Curl()
    handle.setopt(pycurl.URL, url)
    handle.setopt(pycurl.WRITEFUNCTION, response.write)
    req = (url, response, handle)
    # Note that the handle must be added to the multi object
    # by reference to the req tuple (threading?).
    m.add_handle(req[2])
    reqs.append(req)

# Perform multi-request.
# This code copied from pycurl docs, modified to explicitly
# set num_handles before the outer while loop.
SELECT_TIMEOUT = 1.0
num_handles = len(reqs)
while num_handles:
    ret = m.select(SELECT_TIMEOUT)
    if ret == -1:
        continue
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

for req in reqs:
    # req[1].getvalue() contains the response content
    req[2].close()
```
Load spikes and excessive memory usage in mod_python
Specifically, the number defined for the maximum clients should be dropped commensurate with the amount of memory available to run it and how big the application gets...

The solution here is not to create only a minimal number of servers when Apache starts, but to create closer to what would be the maximum number of processes you would expect to require to handle the load. That way the processes always exist and are ready to handle requests, and you will not end up in a situation where Apache needs to suddenly create a huge number of processes...

First off, don’t run PHP on the same web server. That way you can run the worker MPM instead of the prefork MPM. This immediately means you drastically reduce the number of processes you require, because each process will then be multithreaded rather than single threaded and can handle many concurrent requests...

The important thing to note here is that although the maximum number of clients is still 150, each process has 25 threads. Thus, the maximum number of processes that could be created is 6. For that 30MB process, that means you only need 180MB in the worst-case scenario, rather than the 4GB required with the default MPM settings for prefork...

Keep that in mind, and one has to question how wise the advice in the Django documentation is that states “you should use Apache’s prefork MPM, as opposed to the worker MPM” when using mod_python... With Django 1.0 now believed to be multithread safe, which was in part why prefork was recommended previously, that advice should perhaps be revisited, or it should be made obvious that one would need to consider tuning your Apache MPM settings if you intend to use the prefork MPM.
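The process-count arithmetic above can be illustrated with a worker MPM stanza along these lines (a sketch for an Apache 2.2-era `mpm_worker` setup; the 150 and 25 are the article’s example figures, the spare-thread values are typical defaults, not from the text):

```apache
# With MaxClients 150 and ThreadsPerChild 25, Apache can create at most
# 150 / 25 = 6 worker processes - the bound discussed above.
<IfModule mpm_worker_module>
    StartServers          2
    MaxClients          150
    MinSpareThreads      25
    MaxSpareThreads      75
    ThreadsPerChild      25
    MaxRequestsPerChild   0
</IfModule>
```

Under prefork, by contrast, `MaxClients 150` means up to 150 single-threaded processes, which is where the 4GB worst case for a 30MB process comes from.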