Discover how Python's Global Interpreter Lock (GIL) impacts real-world backend performance. This post breaks down the difference between CPU-bound and I/O-bound tasks, benchmarks threading vs multiprocessing, and shows you when Python threads help.
If you’ve ever tried to speed up your Python application by adding threads, only to see... nothing change, welcome to the world of the Global Interpreter Lock (GIL). In this post, we’re going to break down how the GIL affects concurrency performance, especially in real-life backend workloads.
The Global Interpreter Lock is a mutex that prevents multiple native threads from executing Python bytecode at once in the CPython interpreter. It’s like a nightclub bouncer: only one thread can enter the Python bytecode dancefloor at a time, even if there’s space for more. This means that:

- Only one thread executes Python code at any given moment, no matter how many CPU cores you have.
- Adding threads does not add parallelism for pure Python computation.
- Threads can still overlap their waiting, because the GIL is released while a thread blocks on I/O.
Let’s define these first:

- CPU-bound: the task spends its time computing (number crunching, hashing, parsing, image processing). The processor is the bottleneck.
- I/O-bound: the task spends its time waiting on something external (network calls, database queries, disk reads). The waiting is the bottleneck.
Why does this matter? Because the GIL hurts you most when your code is CPU-bound. That’s when threads fight over the lock. But for I/O-bound code, threads take turns nicely, because the GIL is released during blocking I/O.
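A quick aside you can run in a REPL: CPython exposes the interval at which a running thread is asked to give up the GIL. This doesn't change the one-thread-at-a-time rule, it only tunes how often the handoff happens.

import sys

# How often (in seconds) the interpreter asks the running thread to
# release the GIL so another thread can take over. The default is 0.005.
print(sys.getswitchinterval())

# You can change it with sys.setswitchinterval(0.01), but that only trades
# responsiveness against switching overhead; it never lets two threads
# execute Python bytecode at the same time.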
Let’s simulate a CPU-heavy workload using threads:
import threading
import time
COUNT = 50_000_000
def cpu_heavy():
    x = 0
    for _ in range(COUNT):
        x += 1
start = time.time()
thread1 = threading.Thread(target=cpu_heavy)
thread2 = threading.Thread(target=cpu_heavy)
thread1.start()
thread2.start()
thread1.join()
thread2.join()
print(f"Threads (CPU-bound): {time.time() - start:.2f}s")
What you’ll see: the two threads don’t make it faster. It may even take longer than running them one after the other.
Why? They’re both fighting for the GIL. Only one can do actual Python work at a time.
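For a concrete baseline, you can time the same work run back to back on the main thread; on a typical machine the threaded version above lands at roughly the same number, sometimes a little worse:

import time

COUNT = 50_000_000

def cpu_heavy():
    x = 0
    for _ in range(COUNT):
        x += 1

# Sequential baseline: same total work, no threads involved.
start = time.time()
cpu_heavy()
cpu_heavy()
print(f"Sequential (CPU-bound): {time.time() - start:.2f}s")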
Now let’s try the same workload using multiprocessing:
from multiprocessing import Process
import time
COUNT = 50_000_000
def cpu_heavy():
    x = 0
    for _ in range(COUNT):
        x += 1

if __name__ == "__main__":  # needed on platforms that spawn processes (Windows, macOS)
    start = time.time()
    p1 = Process(target=cpu_heavy)
    p2 = Process(target=cpu_heavy)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print(f"Processes (CPU-bound): {time.time() - start:.2f}s")
Now it’s much faster! Each process gets its own GIL and memory space — so they can run in parallel across CPU cores.
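If you prefer a higher-level API, the standard library's concurrent.futures module gives the same effect with less boilerplate. A minimal sketch of the benchmark above using ProcessPoolExecutor, with the same cpu_heavy function and COUNT as before:

import time
from concurrent.futures import ProcessPoolExecutor

COUNT = 50_000_000

def cpu_heavy():
    x = 0
    for _ in range(COUNT):
        x += 1

if __name__ == "__main__":
    start = time.time()
    # Each worker is a separate process with its own interpreter and GIL.
    with ProcessPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(cpu_heavy) for _ in range(2)]
        for f in futures:
            f.result()  # wait for both workers to finish
    print(f"ProcessPoolExecutor (CPU-bound): {time.time() - start:.2f}s")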
Let’s simulate a real backend workload — calling a slow API:
import threading
import time
import requests
URL = "<https://httpbin.org/delay/2>" # waits 2 seconds before replying
def fetch():
response = requests.get(URL)
print(response.status_code)
start = time.time()
threads = [threading.Thread(target=fetch) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Threads (I/O-bound): {time.time() - start:.2f}s")
Even though each request takes 2 seconds, the whole program finishes in ~2 seconds, not 10.
Why? Because requests.get() blocks on I/O and releases the GIL, allowing other threads to run.
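The same fan-out reads a little more cleanly with a thread pool from concurrent.futures; here is a sketch of the request example above using ThreadPoolExecutor, still hitting the httpbin.org test endpoint:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://httpbin.org/delay/2"

def fetch(_):
    return requests.get(URL).status_code

start = time.time()
# Five worker threads; each one releases the GIL while it waits on the network.
with ThreadPoolExecutor(max_workers=5) as pool:
    for status in pool.map(fetch, range(5)):
        print(status)
print(f"ThreadPoolExecutor (I/O-bound): {time.time() - start:.2f}s")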
In backend APIs, you often:

- wait on database queries,
- call other services over HTTP,
- read from or write to files, caches, and message queues.

These are all I/O-bound, so threading can actually help even with the GIL. For example, a simple FastAPI endpoint like:
@app.get("/status")
def get_status():
requests.get("<https://a-service.com/ping>")
return {"status": "ok"}
can scale better under concurrent users when served by an ASGI server such as Uvicorn: FastAPI runs plain def endpoints in a thread pool, and the GIL is released while each request waits on the outbound call.
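If you'd rather not rely on the thread pool at all, the asyncio route works just as well for this endpoint. Here is a sketch of the same idea as an async def, using the third-party httpx library in place of requests (an extra dependency, so treat this as an assumption about your stack):

import httpx
from fastapi import FastAPI

app = FastAPI()

@app.get("/status")
async def get_status():
    # The event loop keeps serving other requests while this one awaits the call.
    async with httpx.AsyncClient() as client:
        await client.get("https://a-service.com/ping")
    return {"status": "ok"}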
| Task Type | Best Tool | Why |
| --- | --- | --- |
| CPU-bound | multiprocessing | Avoids the GIL by using processes |
| I/O-bound | threading / asyncio | GIL released during I/O |
| Mixed | Split workloads | Use threads for I/O, processes for CPU |
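For the Mixed row, one common pattern is to keep the two pools side by side: threads for the waiting, processes for the crunching. A rough sketch, with fetch_report and crunch_numbers as made-up stand-ins for your own functions:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import requests

def fetch_report(url):
    # I/O-bound: the GIL is released while waiting on the network.
    return requests.get(url).text

def crunch_numbers(text):
    # CPU-bound: runs in a separate process with its own GIL.
    return sum(ord(c) for c in text)

if __name__ == "__main__":
    urls = ["https://httpbin.org/delay/1"] * 4

    with ThreadPoolExecutor(max_workers=4) as io_pool:
        pages = list(io_pool.map(fetch_report, urls))

    with ProcessPoolExecutor(max_workers=2) as cpu_pool:
        results = list(cpu_pool.map(crunch_numbers, pages))

    print(results)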
The takeaway: for CPU-bound work, use multiprocessing, or call out to C extensions that release the GIL. For I/O-bound work, threads (or asyncio) are great.