GitHub: Mass Downloader using Python map

How to write a simple multi-threaded mass downloader using workerpool

Solution 3: Multi-threaded, using map

WorkerPool implements a map method similar to Python's built-in map. It is a convenient shortcut for cases where writing a custom Job class is more work than it's worth.

# download3.py - Download many URLs using multiple threads, with the ``map`` method.
import os
import urllib
import workerpool

def download(url):
    url = url.strip()
    save_to = os.path.basename(url)
    urllib.urlretrieve(url, save_to)
    print "Downloaded %s" % url

# Initialize a pool, 5 threads in this case
pool = workerpool.WorkerPool(size=5)

# The ``download`` function is called once per job, with one line from the
# iterable passed as the second argument (here, the lines of urls.txt).
pool.map(download, open("urls.txt").readlines())

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
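
For comparison, roughly the same map-style pattern can be written with only the standard library's concurrent.futures module under Python 3. This is an illustrative sketch, not part of the original workerpool example; the pool size and filenames simply mirror the code above.

# download3_futures.py - Same idea using concurrent.futures (Python 3 sketch).
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

def download(url):
    url = url.strip()
    save_to = os.path.basename(url)
    urlretrieve(url, save_to)
    print("Downloaded %s" % url)

with open("urls.txt") as f:
    urls = f.readlines()

# map() behaves like the built-in map, but each call runs on one of 5 worker
# threads; leaving the ``with`` block waits for all downloads to finish.
with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(download, urls))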

pycurl CurlMulti example

import pycurl
from cStringIO import StringIO

urls = [...] # list of urls
# reqs: List of individual requests.
# Each list element will be a 3-tuple of url (string), response string buffer
# (cStringIO.StringIO), and request handle (pycurl.Curl object).
reqs = [] 

# Build multi-request object.
m = pycurl.CurlMulti()
for url in urls: 
    response = StringIO()
    handle = pycurl.Curl()
    handle.setopt(pycurl.URL, url)
    handle.setopt(pycurl.WRITEFUNCTION, response.write)
    req = (url, response, handle)
    # Add the handle to the multi object; keep a reference to it (via the
    # req tuple) so it is not garbage-collected while the transfer runs.
    m.add_handle(req[2])
    reqs.append(req)

# Perform the multi-request.
# This code is adapted from the pycurl docs, modified to set
# num_handles explicitly before the outer while loop.
SELECT_TIMEOUT = 1.0
num_handles = len(reqs)
while num_handles:
    ret = m.select(SELECT_TIMEOUT)
    if ret == -1:
        continue
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM: 
            break

# Print results; req[1].getvalue() contains the response content.
for req in reqs:
    print "%s: %d bytes" % (req[0], len(req[1].getvalue()))
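
Once the transfers are complete, the handles can be detached from the multi object and closed; info_read() reports which transfers succeeded or failed. A minimal cleanup sketch in the same Python 2 style as above (the error reporting shown is only illustrative):

# info_read() returns the number of messages still queued, a list of handles
# that completed successfully, and a list of (handle, errno, errmsg) failures.
num_queued, ok_list, err_list = m.info_read()
for handle, errno, errmsg in err_list:
    print "Transfer failed: %s (errno %s)" % (errmsg, errno)

# Detach and close each easy handle, then close the multi object.
for url, response, handle in reqs:
    m.remove_handle(handle)
    handle.close()
m.close()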

Load spikes and excessive memory usage in mod_python

Specifically, the setting for the maximum number of clients should be reduced commensurate with the amount of memory available and with how large the application grows.
.. The solution here is not to create only a minimal number of servers when Apache starts, but to create closer to the maximum number of processes you expect to need to handle the load. That way the processes always exist and are ready to handle requests, and you never end up in a situation where Apache suddenly has to create a huge number of processes.
.. First off, don't run PHP on the same web server. That way you can run the worker MPM instead of the prefork MPM. This immediately means you drastically reduce the number of processes you require, because each process is then multithreaded rather than single-threaded and can handle many concurrent requests.
.. The important thing to note here is that although the maximum number of clients is still 150, each process has 25 threads, so at most 6 processes will ever be created. For a 30MB process that means only 180MB in the worst case, rather than the 4GB required with the default prefork MPM settings (an illustrative worker MPM configuration follows these notes).
.. Keep that in mind, and one has to question how wise the advice in the Django documentation is when it states that “you should use Apache’s prefork MPM, as opposed to the worker MPM” when using mod_python.
.. With Django 1.0 now believed to be multithread safe (the lack of which was part of why prefork was recommended previously), that advice should perhaps be revisited, or it should at least be made clear that you need to consider tuning your Apache MPM settings if you intend to use the prefork MPM.
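
As an illustration of the arithmetic in the notes above, a worker MPM configuration along the following lines caps the process count at MaxClients / ThreadsPerChild = 150 / 25 = 6. The values shown are only example settings (roughly the Apache 2.2 defaults), not a recommendation for any particular application:

# httpd.conf (worker MPM) - illustrative values only; tune to your memory budget.
<IfModule mpm_worker_module>
    StartServers          2
    MaxClients          150
    MinSpareThreads      25
    MaxSpareThreads      75
    ThreadsPerChild      25
    MaxRequestsPerChild   0
</IfModule>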