Hi guys, I've been working on a project for large-scale, high-performance scraping. I have around 2-3k URLs (eventually around 100k), all under the same host.
I split the URL list by the number of processes, and each part goes to a new process that runs a gevent pool. The results are good, but I want better.
I'm using multiprocessing, requests.Session(), and a gevent pool.
code structure: http://pastebin.com/Xu7Xy41i
The parser is lxml, which I found to be the fastest. requests.Session() reuses connections for requests to the same host, and multiprocessing + gevent.pool gives me async work spread across several processes.
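For reference, the split-per-process layout described above could look roughly like this. It is only a sketch, not the code from the pastebin: split_evenly, scrape_chunk and run are made-up names, and scrape_chunk is a placeholder where the real worker would create a requests.Session() and fetch its chunk through a gevent pool.

```python
from multiprocessing import Pool


def split_evenly(urls, n_parts):
    """Split urls into n_parts lists whose lengths differ by at most one."""
    # Striding (i, i + n_parts, i + 2 * n_parts, ...) keeps the parts balanced
    # even when len(urls) is not a multiple of n_parts.
    return [urls[i::n_parts] for i in range(n_parts)]


def scrape_chunk(chunk):
    # Placeholder worker (hypothetical). A real worker would fetch and parse
    # every URL in the chunk; here it only reports how many URLs it received.
    return len(chunk)


def run(urls, n_procs=4):
    """Hand one chunk of URLs to each worker process and collect the results."""
    with Pool(n_procs) as pool:
        return pool.map(scrape_chunk, split_evenly(urls, n_procs))
```

Call run(urls) from under an `if __name__ == "__main__":` guard so that worker processes do not re-run the dispatch code on platforms that spawn rather than fork.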
- I believe the SSL handshake slows things down. Maybe there is a good way to speed up the handshake, or to avoid repeated handshakes.
- I'm open to any solution that gives better performance.
- I thought about keeping a fixed number of sockets open and feeding each socket from a queue of URLs.
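One way to try the last idea is a fixed set of workers that each hold one persistent connection and pull URLs from a queue, so the TLS handshake is paid once per connection rather than once per URL. The sketch below uses the standard library's http.client and threads instead of requests and gevent, purely to stay self-contained, and it uses one shared queue rather than one queue per socket. crawl_host, n_conns and make_conn are hypothetical names, and it assumes every URL is on the same host and that the server allows keep-alive.

```python
import http.client
import queue
import threading
from urllib.parse import urlsplit


def crawl_host(urls, n_conns=4, make_conn=None):
    """Fetch urls (all on one host) over n_conns persistent connections.

    Returns {url: (status, body)}; on failure the value is (None, exception).
    """
    if not urls:
        return {}
    if make_conn is None:
        first = urlsplit(urls[0])
        conn_cls = (http.client.HTTPSConnection if first.scheme == "https"
                    else http.client.HTTPConnection)
        make_conn = lambda: conn_cls(first.netloc, timeout=30)

    todo = queue.Queue()
    for url in urls:
        todo.put(url)
    results = {}

    def worker():
        conn = make_conn()  # connects lazily; one handshake, then reused
        try:
            while True:
                try:
                    url = todo.get_nowait()
                except queue.Empty:
                    return  # queue drained, this worker is done
                parts = urlsplit(url)
                path = parts.path or "/"
                if parts.query:
                    path += "?" + parts.query
                try:
                    conn.request("GET", path)
                    resp = conn.getresponse()
                    # The body must be read fully before the connection
                    # can carry the next request.
                    results[url] = (resp.status, resp.read())
                except (http.client.HTTPException, OSError) as exc:
                    results[url] = (None, exc)
                    conn.close()  # http.client reconnects on the next request
        finally:
            conn.close()

    threads = [threading.Thread(target=worker) for _ in range(n_conns)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With requests, a similar effect should come from creating one requests.Session() per worker and reusing it for every URL, since a Session keeps connections to the same host alive through urllib3's connection pool.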
Google runs millions of lines of Python code. The front-end server that drives youtube.com and YouTube’s APIs is primarily written in Python, and it serves millions of requests per second! YouTube’s front-end runs on CPython 2.7,
.. but we always run up against the same issue: it’s very difficult to make concurrent workloads perform well on CPython.
.. Grumpy is an experimental Python runtime for Go. It translates Python code into Go programs, and those transpiled programs run seamlessly within the Go runtime.
.. The goal is for Grumpy to be a drop-in replacement runtime for any pure-Python project.
.. In particular, Grumpy has no global interpreter lock, and it leverages Go’s garbage collection for object lifetime management instead of counting references. We think Grumpy has the potential to scale more gracefully than CPython for many real world workloads.
.. Grumpy programs can import Go packages just like Python modules! For example, the Python snippet below uses Go’s standard net/http package to start a simple server
By using yield from on another coroutine, we declare that the coroutine may give control back to the event loop. In this case, sleep will yield and the event loop will switch contexts to the next task scheduled for execution: bar. Similarly, the bar function yields from sleep, which allows the event loop to pass control back to foo at the point where it yielded, as happens with all generators.
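The hand-off between foo and bar can be seen in a minimal sketch like the one below. It is written with async/await, the current spelling of the same idea, because generator-based coroutines using yield from were removed in Python 3.11. The bodies of foo and bar and the log list are assumptions made for illustration, and asyncio.sleep(0) is used so that each coroutine yields to the event loop exactly once.

```python
import asyncio

log = []


async def foo():
    log.append("foo: start")
    await asyncio.sleep(0)  # give control back to the event loop
    log.append("foo: resumed")


async def bar():
    log.append("bar: start")
    await asyncio.sleep(0)  # yields too, so the loop can return to foo
    log.append("bar: resumed")


async def main():
    # Schedule both coroutines as tasks on the same event loop.
    await asyncio.gather(foo(), bar())


asyncio.run(main())
print(log)
# ['foo: start', 'bar: start', 'foo: resumed', 'bar: resumed']
```

Everything here runs in a single thread: the interleaving comes only from the points where each coroutine awaits.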