multiprocessing, gevent, requests and friends

Hi all, I’ve been working on a project that does large-scale scraping: around 2–3k URLs (eventually around 100k), all under the same host.

I take the list of URLs, split it by the number of processes, and hand each chunk to a new process running a gevent pool. The results are good, but I want better.

I’m using multiprocessing, requests.Session(), and a gevent pool.

Code structure:

The parser is lxml, which I found to be the fastest. requests.Session() handles keep-alive requests to the same host, and multiprocessing + gevent.pool do the parallel async work.
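The structure described above can be sketched roughly like this (the URL list, chunk sizes, and the title-extraction step are placeholders, not from the original post; gevent's socket/ssl monkey patching is needed so requests' blocking sockets cooperate with the greenlet pool):

```python
from gevent import monkey
monkey.patch_socket()  # make requests' blocking sockets cooperative
monkey.patch_ssl()     # patch ssl too, before it is first used

import gevent.pool
import requests
from lxml import html


def chunk(urls, n):
    """Split urls into n round-robin chunks, one chunk per process."""
    return [urls[i::n] for i in range(n)]


def scrape_chunk(urls, concurrency=50):
    """Runs inside one process: one Session, one gevent pool."""
    session = requests.Session()  # reuses TCP/TLS connections to the host
    pool = gevent.pool.Pool(concurrency)

    def fetch(url):
        resp = session.get(url, timeout=10)
        # lxml parse; grabbing the <title> here is just an example
        return url, html.fromstring(resp.content).findtext('.//title')

    return pool.map(fetch, urls)


# Usage sketch (hypothetical URL list):
# import multiprocessing
# urls = [...]
# with multiprocessing.Pool(4) as mp:
#     results = mp.map(scrape_chunk, chunk(urls, 4))
```

Keeping the multiprocessing start in the parent and all gevent work inside the child avoids mixing monkey patching with fork in surprising ways.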

  • I believe the SSL handshake slows things down; maybe there is a good way to speed up the handshake, or to avoid doing it more than once.
  • I’m open to any solution that improves performance.
  • I thought about keeping a fixed number of sockets open and feeding each socket its own queue of URLs.
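On the handshake and keep-sockets-open points: requests.Session already pools connections via urllib3, so each pooled connection does the TLS handshake once and is then reused. A sketch of tuning that pool (the pool sizes here are illustrative; pool_maxsize should roughly match the gevent pool's concurrency):

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# One host, so one pool; keep up to 50 persistent connections in it.
# Each connection handshakes once and is reused for later requests,
# which amortizes the TCP + TLS setup cost across the whole crawl.
adapter = HTTPAdapter(pool_connections=1, pool_maxsize=50, pool_block=True)
session.mount('https://', adapter)

# Every request after the first on a given pooled connection skips the
# handshake entirely (urllib3 keep-alive):
# resp = session.get('https://example.com/page/1')
```

pool_block=True makes greenlets wait for a free connection instead of opening (and handshaking) extra ones beyond pool_maxsize, which is effectively the "N open sockets, queue of URLs per socket" idea from the last bullet.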

Cheers