Advanced Python — Asynchronous programming

Alibaba Tech
7 min read · Mar 18, 2021

By Sagi Medina, Alibaba DAMO Academy Machine Intelligence Israel Lab

This article covers the critical concepts behind asynchronous programming in Python.

By the end of it, you will know:

  • How parallel programming can be done in Python
  • How to scale your Python code with parallelism and concurrency
  • How concurrency is implemented in CPython, the “official” implementation of Python

The concept of concurrent computing is frequently confusing, so let’s do a quick overview of three different types:

Concurrent programming is a form of computing in which several computations are executed during overlapping time periods instead of sequentially. Two or more processes start, run in an interleaved fashion through context switching, and complete in an overlapping time period by managing access to shared resources, e.g., on a single CPU core.

Parallel programming is a form of programming that offers the ability to perform multiple operations simultaneously. It entails spreading tasks over numerous resources, e.g., multiple CPU cores, to solve a problem.

Asynchronous programming is a means of achieving concurrency. A work unit runs separately from the main application thread and notifies the calling thread of its completion, failure, or progress.

Python offers many approaches to parallelism and concurrency.

Your choice depends on several factors, and you may find that for a particular problem there are two or more concurrency implementations to choose from.

Concurrency can make a big difference in performance for two types of applications — CPU-bound and I/O-bound.

This article covers asynchronous programming, which can be achieved in Python with threading and asyncio, and how it improves I/O-bound programs.

1. Base Version

This article’s base version is a non-concurrent task: downloading multiple files from Alibaba Cloud Object Storage Service (an encrypted, secure, cost-effective, and easy-to-use object storage service) using the “oss2” Python package.

Common Code
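A minimal sketch of what that shared setup might look like; the credentials, endpoint, bucket name, and object keys below are hypothetical placeholders:

```python
import time

import oss2

# Hypothetical credentials, endpoint, and bucket name.
AUTH = oss2.Auth("<access-key-id>", "<access-key-secret>")
BUCKET = oss2.Bucket(AUTH, "https://oss-cn-hangzhou.aliyuncs.com", "my-bucket")

# Fifty hypothetical object keys to download.
IMAGE_KEYS = [f"images/{i}.jpg" for i in range(50)]
```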

Base Version Code
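A sketch of the sequential version, reusing the setup above:

```python
def download_image(key: str) -> bytes:
    # get_object returns a file-like stream; read() pulls the whole body.
    return BUCKET.get_object(key).read()


def download_all_sequential() -> None:
    start = time.time()
    for key in IMAGE_KEYS:
        download_image(key)
    print(f"Downloaded {len(IMAGE_KEYS)} Images in {time.time() - start} seconds")


if __name__ == "__main__":
    download_all_sequential()
```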

Base Version Results

Downloaded 50 Images in 467.5406451225281 seconds

2. Threading Version

CPython provides a high-level and a low-level API for creating, spawning, and controlling threads.

You might think that threading means more than one processor is running parts of your script, with each processor doing an independent task simultaneously. That is only partly true: the threads may be scheduled on different processors, but they will only run one at a time.

In CPython, the threads are based on the operating system’s C thread APIs, but they are real Python threads: every Python thread needs to execute Python bytecode through the evaluation loop.

Because multiple threads can read and write to the same memory space, collisions can occur. The solution is thread safety: making sure that a single thread locks a memory space before accessing it.
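As a small illustration (not from the original snippets), a threading.Lock is enough to make a read-modify-write on shared state safe:

```python
import threading

counter = 0
counter_lock = threading.Lock()


def increment() -> None:
    global counter
    # Without the lock, two threads could read the same value of
    # `counter`, both add one, and one of the updates would be lost.
    with counter_lock:
        counter += 1
```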

The Python evaluation loop is not thread-safe. Many parts of the interpreter state, such as the garbage collector, are shared.

To get around this, the CPython developers implemented a mega-lock called the Global Interpreter Lock (GIL). Before an opcode is executed, the thread acquires the GIL; once the opcode has been executed, the GIL is released. This means that only one thread can execute Python bytecode at any given time. It also means that multithreading in Python is very safe and ideal for running I/O-bound tasks concurrently.

By default, Python 2 switches threads every 100 interpreter instructions (adjustable with `sys.setcheckinterval`, documented here: https://docs.python.org/2/library/sys.html#sys.setcheckinterval).

Since Python 3.2, CPython has taken a different approach to GIL switching: by default, it releases the GIL after five milliseconds (5,000 microseconds) so that other threads can have a chance to acquire it: https://docs.python.org/3.9/library/sys.html#sys.setswitchinterval.
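The switch interval can be inspected and tuned at runtime:

```python
import sys

print(sys.getswitchinterval())  # 0.005 seconds (5 ms) by default
sys.setswitchinterval(0.001)    # ask CPython to consider switching more often
```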

Threading Version Code
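A minimal sketch using ThreadPoolExecutor from the standard library, reusing download_image and the setup from the base version; the pool size here is a hypothetical choice:

```python
from concurrent.futures import ThreadPoolExecutor


def download_all_threaded() -> None:
    start = time.time()
    # 16 workers is an arbitrary value; the best size depends on network
    # latency and how many concurrent connections the endpoint tolerates.
    with ThreadPoolExecutor(max_workers=16) as executor:
        # map() dispatches one download per key and blocks until all finish.
        list(executor.map(download_image, IMAGE_KEYS))
    print(f"Downloaded {len(IMAGE_KEYS)} Images in {time.time() - start} seconds")
```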

Threading Version Results

Downloaded 50 Images in 79.86200094223022 seconds

ThreadPoolExecutor creates a pool of threads, each of which can run concurrently.

3. Asyncio Version

There are great articles about asyncio elsewhere; here, I will focus on what happens in practice.

You might have an arbitrarily large number of sockets open, and threads incur significant expense at that scale: they take up memory and time to spawn, and there is overhead in context switching between them.

You also have the option of using non-blocking I/O: you could have just one thread and 50 sockets, then use the select() system call to find one that has data ready for you.
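Schematically (the host and request below are placeholders), such a single-threaded select() loop looks like this:

```python
import select
import socket

# Open a few connections, send a request on each, then mark them non-blocking.
sockets = []
for _ in range(3):
    sock = socket.create_connection(("example.com", 80))
    sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    sock.setblocking(False)
    sockets.append(sock)

while sockets:
    # Block until at least one socket has data ready to read.
    readable, _, _ = select.select(sockets, [], [])
    for sock in readable:
        data = sock.recv(4096)
        if not data:  # an empty read means the peer closed the connection
            sockets.remove(sock)
            sock.close()
```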

Asyncio builds on non-blocking I/O primitives; it abstracts away the need to loop over non-blocking sockets with select() and instead provides an event-based system that the program can be structured around. The select() calls are replaced with events that are fired on demand.

The event loop is what makes it all possible. It runs tasks one after the other (at any given time, only one task is running). When the active task makes a blocking I/O call and cannot make further progress, it gives control back to the event loop, recognizing that some other task could make better use of the event loop’s time. It also tells the event loop exactly what it is blocked on, so that when the network response comes, the event loop can consider giving it time to run again.

That being said, a minor mistake in the code can cause a task to run off and hold the processor for a long time, starving the other tasks that need to run.

In short, asyncio uses the generator machinery for pausing and resuming functions: it takes the long waiting periods in which functions would otherwise block and lets other functions run during that downtime. It does this with only one thread on one CPU core.
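A minimal demonstration of that hand-off:

```python
import asyncio


async def worker(name: str, delay: float) -> None:
    # At this await, the coroutine suspends and hands control back to
    # the event loop, which is free to run other tasks in the meantime.
    await asyncio.sleep(delay)
    print(f"{name} finished after {delay}s")


async def main() -> None:
    # Both workers wait through each other's downtime on a single thread,
    # so this takes about one second rather than two.
    await asyncio.gather(worker("a", 1.0), worker("b", 1.0))


asyncio.run(main())
```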

Asyncio Version Code
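A sketch of the fan-out, reusing the common setup and assuming an async_oss2-style coroutine such as the hypothetical async_get_object outlined a few paragraphs below:

```python
import asyncio


async def download_image_async(key: str) -> bytes:
    # Hypothetical coroutine standing in for the patched async_oss2 call;
    # one possible implementation is sketched below.
    return await async_get_object(BUCKET, key)


async def download_all_async() -> None:
    start = time.time()
    # gather() schedules all 50 downloads on the event loop at once.
    await asyncio.gather(*(download_image_async(key) for key in IMAGE_KEYS))
    print(f"Downloaded {len(IMAGE_KEYS)} Images in {time.time() - start} seconds")


if __name__ == "__main__":
    asyncio.run(download_all_async())
```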

Asyncio Version Results

Downloaded 50 Images in 40.22274684906006 seconds

You probably noticed the “async_oss2” part.

Asyncio does not magically make things non-blocking; you need special async versions of the library to gain the full advantage of asyncio.

Here are the changes I made to the oss2 library to get an async get_object:
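One illustrative way to build such a coroutine (not necessarily the exact change made to oss2) is to presign the request with oss2’s sign_url and perform the HTTP call with aiohttp:

```python
import aiohttp
import oss2


async def async_get_object(bucket: oss2.Bucket, key: str) -> bytes:
    # Presigning is cheap and synchronous; only the network I/O is awaited.
    url = bucket.sign_url("GET", key, 60)  # URL valid for 60 seconds
    # For many downloads, a single shared ClientSession would be cheaper.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            response.raise_for_status()
            return await response.read()
```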

Final Results

As you can see, we managed to reduce our run time by more than 10X.

It is important to remember that the GIL is still there. If you are using asyncio on CPython, you use only one thread anyway. If you are using multithreading, you utilize one core at a time. The differences between multithreading and asyncio are in the approach to IO operations and the efficiency in context switching between different IO-bound tasks — the GIL does not impact those in any case. Multiprocessing, or using an alternative Python interpreter without a GIL, remains the only way for now to overcome the GIL’s CPU-bound limitations.
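For completeness, a sketch of sidestepping the GIL for CPU-bound work with the standard library’s process pool (the workload below is a made-up example):

```python
from concurrent.futures import ProcessPoolExecutor


def cpu_bound(n: int) -> int:
    # A made-up CPU-heavy task; each call runs in its own process,
    # with its own interpreter and its own GIL.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(cpu_bound, [10_000_000] * 4))
        print(results)
```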

Here you can find a useful utils package that will help you utilize the OSS SDK and gain a performance boost with minimal effort:

Looking Forward

Work is being done to improve CPython and enable true multi-core parallelism via sub-interpreters.

At DAMO, we are really excited to be working on AI. We strive to deliver great solutions with ultimate performance and accuracy, and we use Python and GoLang to do it. Get in touch if you share our passion for AI and great code.

Alibaba Tech

First-hand and in-depth information about Alibaba’s latest technology → Facebook: “Alibaba Tech”. Twitter: “AlibabaTech”.
