Java Virtual Threads: A Deep Dive

2025-01-19
Intro
- Hello everyone. I recently came across a very interesting Netflix blog post about migrating a service from traditional Java threads to virtual threads, and how that migration caused an outage.
- That got me interested in learning more about virtual threads.
- So, we will learn how exactly virtual threads work, and then see how they caused the outage.
The issue faced by Netflix.
After migrating to Java 21, Netflix was excited about the new virtual threads feature and decided to migrate one of their services from traditional threads to virtual threads.
Problem
- After the migration, they began to see issues on the migrated instances.
- These instances intermittently ran into timeouts and hung. On investigating, they found that the affected instances had an increasing number of sockets in the closeWait state.
- The closeWait state means that the remote peer has closed the socket, but the affected instance has not yet closed its own end.

Analysis
- To debug this issue, we need to understand what is happening on the impacted instances.
- Hence, they decided to capture a thread dump and analyse it.
- On analysing the thread dump, they found a very large number of virtual threads without any stack trace.
- The number of these virtual threads closely matched the number of sockets in the closeWait state, which suggests that the issue involves virtual threads.
- But hold on a second, what exactly are virtual threads?
What are Virtual threads?
Let's understand the problem first.

- Let's take an example of a restaurant with a head chef and several line cooks.
- The head chef's responsibility is to make sure the line cooks are working properly and to help a line cook whenever a complex task comes up.
- Each line cook can be assigned a different set of tasks like chopping, garnishing, baking, marinating, etc.
- Now, let's say that each line cook needs the help/assistance of the head chef for every task.
- Hence, only one line cook can work at any given time, as the head chef can assist only one line cook at a time.
- This leads to a lot of delays: customers wait longer, and each line cook sits idle while waiting for the head chef to become available.
- Hence, this approach is very inefficient.
- Now, how would you solve this problem?
The solution to the restaurant problem

- To solve this, the restaurant manager decided to move to a newer model, where line cooks work independently and the head chef steps in only when necessary.
- This reduces the wait time for customers, which keeps them happy, and it is also a lot more efficient.
Relate the thread problem with the restaurant problem
- So, in the Java concurrency world, we face the same inefficiency as the restaurant.
- If we map the Java world to the restaurant, it would be:
- Head chef <> OS-level (platform) thread
- Line cook task <> A task assigned to the thread.
- So, in the world before virtual threads, whenever a program starts a thread, the task is scheduled on an OS-level thread.
- This OS-level thread stays tied up until the assigned task is completed, even while the task is only waiting. This is similar to the earlier version of the restaurant, where the head chef looked into every task.
- Now, since OS-level threads (head chefs) are limited in number, they are a scarce resource, and this creates an inefficiency in the system.
How does a Virtual thread solve this problem?
- With virtual threads in the picture, we assign the task to a virtual thread (VT).
- This VT works independently and brings an OS thread into the picture only when it is required.
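To make this concrete, here is a minimal sketch of handing a task to a virtual thread, assuming Java 21 or later. The class name, method name and thread name are made up for illustration; Thread.ofVirtual() and Thread.isVirtual() are part of the standard Thread API.

```java
public class VirtualThreadHello {

    // Runs a tiny task on a new virtual thread and reports whether the
    // thread that executed it was a virtual thread.
    static boolean ranOnVirtualThread() {
        boolean[] wasVirtual = new boolean[1];
        Thread vt = Thread.ofVirtual()
                .name("demo-vt") // hypothetical name, only for readability
                .start(() -> wasVirtual[0] = Thread.currentThread().isVirtual());
        try {
            vt.join(); // wait for the task; join() also makes its write visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return wasVirtual[0];
    }

    public static void main(String[] args) {
        System.out.println("ran on a virtual thread: " + ranOnVirtualThread());
    }
}
```

Note that the code never picks an OS thread. The JVM decides which carrier thread runs the virtual thread, and when.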
Deep dive
Responsibilities
- Let's first categorise what a task assigned to a thread looks like. There can be 2 types of operations:
- CPU-intensive operations, i.e. running some logic over the data, etc.
- Blocking I/O operations, i.e. fetching data from somewhere, maybe a database, a service or the disk.
- An OS thread does useful work only during CPU-intensive operations; during a blocking I/O operation it just waits.
- For this reason, a virtual thread asks for an OS thread only while it has code to run on the CPU. During a blocking I/O operation it waits on its own and does not block the OS thread.
- The OS-backed threads in Java are called platform threads, and a platform thread that is currently running a virtual thread is called its carrier thread. We will use these terms going forward.
Let's understand with an example.
- Let's assume that we want to fetch some data from the database, run some business logic on it and publish an event regarding the outcome of the processing.
- Our program would start a virtual thread to perform this task. The VT is mounted on a carrier thread to make the call to the database. As soon as the call is made, i.e. a blocking I/O operation begins, the VT is suspended until the response is received, and the carrier thread is released to do other work.
- Once the response is received, the VT is scheduled on a carrier thread again (not necessarily the same one) to run the business logic on the response.
- After running the logic, the VT makes a call to publish the event. Since this is a blocking call, the VT is suspended again and the carrier is released.
- Once the response is received from the publisher, the VT is mounted on a carrier thread once more to finish the operation, and the VT completes.
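The flow above can be sketched roughly as follows, assuming Java 21 or later. The database and the publisher are hypothetical stand-ins: fetchFromDatabase, process and publish are names invented for this example, and Thread.sleep only simulates a blocking I/O call.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class FetchProcessPublish {

    // Hypothetical database read; the sleep simulates blocking I/O.
    // While the virtual thread sleeps, it is unmounted and its carrier is free.
    static String fetchFromDatabase(int id) throws InterruptedException {
        Thread.sleep(Duration.ofMillis(50));
        return "row-" + id;
    }

    // CPU-only step: the virtual thread needs a carrier thread while this runs.
    static String process(String row) {
        return "processed(" + row + ")";
    }

    // Hypothetical event publish; again a simulated blocking call.
    static void publish(String event, List<String> sink) throws InterruptedException {
        Thread.sleep(Duration.ofMillis(50));
        sink.add(event);
    }

    // Runs fetch -> process -> publish on one virtual thread and returns what was published.
    static List<String> handle(int id) {
        List<String> published = new CopyOnWriteArrayList<>();
        Thread vt = Thread.startVirtualThread(() -> {
            try {
                String row = fetchFromDatabase(id);
                String result = process(row);
                publish(result, published);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            vt.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return published;
    }

    public static void main(String[] args) {
        System.out.println(handle(7));
    }
}
```

The code reads like ordinary blocking code. The suspending and resuming at each blocking call is handled by the JVM, not by the program.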

Use case and where to use Virtual thread.
- Based on what we have learnt, it is important to remember that virtual threads won't make an individual task run faster; they let you handle more concurrent tasks, i.e. higher throughput at scale.
- It makes the most sense to use virtual threads where there is high concurrency and threads spend most of their time waiting for I/O.
- A typical example is a web server that receives a large number of client requests, where traditional threads would spend most of their time blocked, e.g. while fetching some resource.
Virtual thread vs traditional thread pools
- It is important to understand that traditional threads are a scarce resource, hence they need to be managed, which is what thread pools do.
- Virtual threads, however, are not scarce; there can be millions of active virtual threads in a single process. Hence, it doesn't make sense to have a thread pool for virtual threads.
- Pooling virtual threads takes away the benefit they offer; create a new virtual thread for each task instead.
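As a rough illustration of one virtual thread per task instead of a pool, the sketch below uses Executors.newVirtualThreadPerTaskExecutor() from the JDK (Java 21 or later). The class name, the task count and the 100 ms sleep are assumptions chosen only for the example; the sleep stands in for a blocking request.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadPerTask {

    // Submits taskCount blocking tasks, each on its own new virtual thread,
    // and returns how many of them completed.
    static int runBlockingTasks(int taskCount) {
        AtomicInteger completed = new AtomicInteger();
        // No pool size to tune: the executor starts a fresh virtual thread per task.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < taskCount; i++) {
                executor.submit(() -> {
                    Thread.sleep(Duration.ofMillis(100)); // simulated blocking I/O
                    completed.incrementAndGet();
                    return null;
                });
            }
        } // close() waits for all submitted tasks to finish
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed: " + runBlockingTasks(10_000));
    }
}
```

With a fixed pool of platform threads, the same workload would be limited by the pool size. Here every task gets its own thread, and the threads simply wait without holding on to a carrier.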
Limit concurrency with Virtual threads
- Still, there might be a case where we want to limit concurrency.
- An example would be that the service/database being called by our virtual threads supports only a limited number of concurrent requests at any moment.
- For such use cases, one should use a semaphore to limit the number of concurrent requests.
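Here is a minimal sketch of the semaphore approach, assuming Java 21 or later. The downstream call is simulated with a short sleep, and the class name, method name and numbers are invented for the example. The method also records the highest number of calls that were in flight at once, so the limit can be checked.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedCalls {

    // Runs taskCount simulated downstream calls on virtual threads, allowing at most
    // `permits` of them at a time, and returns the highest concurrency observed.
    static int maxObservedConcurrency(int taskCount, int permits) {
        Semaphore limiter = new Semaphore(permits);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxInFlight = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < taskCount; i++) {
                executor.submit(() -> {
                    limiter.acquire(); // a waiting virtual thread parks here and frees its carrier
                    try {
                        int now = inFlight.incrementAndGet();
                        maxInFlight.accumulateAndGet(now, Math::max);
                        Thread.sleep(Duration.ofMillis(20)); // simulated call to the limited service
                    } finally {
                        inFlight.decrementAndGet();
                        limiter.release();
                    }
                    return null;
                });
            }
        }
        return maxInFlight.get();
    }

    public static void main(String[] args) {
        System.out.println("max concurrent calls: " + maxObservedConcurrency(200, 5));
    }
}
```

The number of virtual threads stays unbounded; only the guarded section is limited. That is the difference from capping concurrency through a thread pool size.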
Things to keep in mind while using Virtual Thread
- In Java 21, a virtual thread cannot be unmounted from its carrier thread while it is blocked inside a synchronized block or method. The virtual thread is then said to be pinned to the carrier.
- Let's see how the above point caused the issue at Netflix.
Now coming back to our analysis…..
Analysis
- As we saw earlier, they found a very large number of virtual threads without any stack trace.
- This indicates that these virtual threads had been created but had not started running. Also, their number was almost equal to the number of sockets in the closeWait state.
- Further examination found 4 virtual threads that were not able to unmount from their carrier threads.
- There were exactly 4 virtual threads in this state because the instance was deployed on a machine with 4 vCPUs. This indicates that all the carrier threads of this instance were occupied by these 4 virtual threads, which were stuck for some reason.
- Hence, new virtual threads kept getting created for new requests, but the JVM was not able to give them a carrier thread, which explains the timeouts and hung instances.
- But why were these 4 virtual threads stuck? All of them were running blocking code inside a Java synchronized block: each was trying to acquire a lock, could not get it, and so stayed mounted on (pinned to) its carrier thread.
- So, the question is, who has the lock?
- On analysing the heap dump further, they found 2 more interesting threads:
- One was a virtual thread that was also attempting to acquire the lock, but not from within a synchronized block. Since it could not acquire the lock and had nothing else to do, this virtual thread was unmounted and was waiting for its turn to acquire the lock.
- The other was a normal OS thread, which had held the lock earlier but released it due to a timeout. This thread was now waiting to reacquire the lock.
- So, again who has the lock? The answer is no one.
- To understand this behaviour, we need to understand the internals of release. Whenever a thread releases a lock, it frees the lock and signals the next thread in the waiting list to acquire it.
- Here, next on the waiting list was a virtual thread that was not mounted on any carrier thread, because all the carrier threads were occupied by the other waiting virtual threads. Hence, it could not run to acquire the lock, even though the lock was available.
- This is similar to a deadlock. The whole system got stuck, and a very large number of sockets piled up in the closeWait state.

Let's try to understand this using an example.
- Let's say a normal OS thread (T1) initially acquired the lock.
- A virtual thread (VT5) then tried to acquire the same lock from code that is not inside a synchronized block. Since acquiring a held lock is a blocking call, VT5 was added to the lock's queue, went into a waiting state and was unmounted from its carrier thread.
- After this, 4 virtual threads (VT1, VT2, VT3 and VT4) arrived and were mounted on the carrier threads of all 4 CPUs. These threads also tried to acquire the same lock, but from inside a synchronized block. Since a virtual thread cannot be unmounted while it is inside a synchronized block, all of them stayed pinned to their carrier threads while waiting for the lock.

- Hence, as shown in the diagram and the table, VT1, VT2, VT3 and VT4 are pinned to the OS threads and added to the queue to acquire the lock.
- After this, T1 released the lock due to a timeout, and the system now signals the next thread in the queue, i.e. VT5, to acquire the lock.
- For VT5 to acquire the lock and continue, the system first has to mount it on a carrier (OS) thread.
- But since all the OS threads are occupied by the pinned virtual threads, VT5 cannot be mounted and keeps waiting for a free carrier.
- Hence, the system gets stuck in this in-between state, which is effectively a deadlock.
- The point to note is that this issue happened only because of monitor-based code, i.e. code using synchronized blocks. Had they used a ReentrantLock instead of the synchronized block, the waiting virtual threads could have unmounted and the system would have worked smoothly.
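As a hedged sketch of that last point (not Netflix's actual code), the example below guards a blocking section with java.util.concurrent.locks.ReentrantLock instead of a synchronized block, assuming Java 21 or later. The class, its methods and the 1 ms sleep are invented for illustration. A virtual thread waiting in lock() parks and can unmount, so it does not hold on to a carrier thread.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

public class GuardedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private int value;

    // Monitor-based version, for comparison:
    //   synchronized void slowIncrement() throws InterruptedException { ... }
    void slowIncrement() throws InterruptedException {
        lock.lock(); // waiting virtual threads park here without pinning a carrier
        try {
            Thread.sleep(Duration.ofMillis(1)); // simulated blocking call made under the lock
            value++;
        } finally {
            lock.unlock();
        }
    }

    int value() {
        lock.lock();
        try {
            return value;
        } finally {
            lock.unlock();
        }
    }

    // Starts taskCount virtual threads that all contend for the same lock
    // and returns the final counter value.
    static int runUpdates(int taskCount) {
        GuardedCounter counter = new GuardedCounter();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < taskCount; i++) {
                executor.submit(() -> {
                    counter.slowIncrement();
                    return null;
                });
            }
        }
        return counter.value();
    }

    public static void main(String[] args) {
        System.out.println("final value: " + runUpdates(100));
    }
}
```

Even when the number of contending virtual threads is far larger than the number of CPUs, the waiters do not occupy carrier threads, so the thread that is next in line for the lock can always be mounted.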
Conclusion
- In conclusion, virtual threads are very promising and can give you very high throughput.
- They should be used in systems with very high concurrency, where traditional threads would spend most of their time blocked on I/O.
- Virtual threads are not a scarce resource; there can literally be millions of them in a single process, and hence they should never be pooled.
- When using virtual threads on Java 21, prefer monitor-free code, i.e. avoid blocking inside synchronized blocks. Otherwise the system might end up in an unexpected state, and debugging it can become tricky.
Outro
- If you have any questions, do let me know in the comments, and I will try to answer them.
- To learn more about virtual threads and their implementation and debugging tips, check the links in the description.
- To learn more about the problem faced by the Netflix folks, check the Netflix tech blog linked in the description.
- If you want me to cover any other topic, do mention it in the comments.
Reference
- https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
- https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-68216B85-7B43-423E-91BA-11489B1ACA61
- https://openjdk.org/jeps/444