Managed Thread Ids - Unique Ids that aren't Unique
A customer asked us why one of our thread names was showing up on some of their log messages. We looked into the problem and were a bit baffled. We name all of the threads we create inside the Agent so that what they do can be separated from the client application. The name in question is used by a thread that the Gibraltar Agent creates and then destroys relatively early in the process. This thread isn’t taken from the thread pool or returned to one. We confirmed that it gets created and released, so there seemed to be no way their code could be running on our thread.
We checked the data up and down and were confident that it wasn’t a data corruption problem - the only assumption made by the code was that Managed Thread Ids are unique. This seemed pretty reasonable: the documentation for the ManagedThreadId property reads:
Thread.ManagedThreadId: Gets a unique identifier for the current managed thread.
But we kept digging and found another scenario, in a long-running ASP.NET application, where a similar event occurred - a thread that was created and destroyed relatively early in the application was apparently now in the thread pool and handling events. Researching more, we found this gem in the documentation. It isn’t in the MSDN documentation for ManagedThreadId but rather in the documentation for Thread.GetHashCode:
The hash code is not guaranteed to be unique. Use the ManagedThreadId property if you need a unique identifier for a managed thread.
OK, that still tells us ManagedThreadId is the right choice for our use. But then there’s this note on the Thread class itself:
GetHashCode provides identification for managed threads. For the lifetime of your thread, it will not collide with the value from any other thread, regardless of the application domain from which you obtain the value.
This started to raise some concern. The bit of weasel room in the second sentence is troubling: “For the lifetime of your thread”… Was .NET reusing thread Ids after a thread exits? The wiggle room in that statement made it sound possible, even though there’s no necessary reason for the hash code and the thread Id to be related. My first read was that the qualification applied to the second part of the sentence - uniqueness across application domains (which we never assumed).
So we created a few brutal tests - creating and destroying threads then ramping up the thread pool’s activity. Sure enough, the same Managed Thread Ids showed up in the thread pool. These weren’t the same threads - the thread static variables we were using for tests had been reset - but they had the same Managed Thread Id.
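A test along these lines can be sketched as follows. This is a minimal illustration rather than the test we actually ran: the class name ThreadIdReuseProbe and the method CountReusedIds are made up for the example, and it only creates and joins dedicated threads without exercising the thread pool. The thread static flag proves each thread is brand new, so any repeated Managed Thread Id it counts is a reused Id rather than a reused thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical probe, for illustration only.
static class ThreadIdReuseProbe
{
    // Per-thread marker: a brand-new thread always starts with the default (false).
    [ThreadStatic] private static bool _seen;

    // Runs threadCount short-lived threads one after another and counts how many
    // of them reported a ManagedThreadId that an earlier thread had already used.
    public static int CountReusedIds(int threadCount)
    {
        var idsSeen = new HashSet<int>();
        var gate = new object();
        int reused = 0;

        for (int i = 0; i < threadCount; i++)
        {
            var t = new Thread(() =>
            {
                bool freshThread = !_seen;   // true if this thread never ran our code
                _seen = true;
                int id = Thread.CurrentThread.ManagedThreadId;

                lock (gate)
                {
                    // Add returns false when the id was already recorded.
                    if (!idsSeen.Add(id) && freshThread)
                        reused++;
                }
            });
            t.Start();
            t.Join();   // the thread has exited before the next one is created
        }
        return reused;
    }
}
```

Calling ThreadIdReuseProbe.CountReusedIds(10000) returns the number of collisions observed. Whether that number is greater than zero depends on the runtime version and on timing, so a zero result on one machine doesn't prove Ids are never reused.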
Go Team
The fix for us is to stop relying on Managed Thread Id for correlating events to threads. Instead, we’re using an internal thread static variable to track the relationship and identify each thread with our own unique identifier. Because we record the responsible thread for log messages and many other things, we had to represent this in the smallest amount of data feasible and remain backwards and forwards compatible with existing data.
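The general idea can be sketched like this. It is an illustration of the technique, not the Agent’s actual code: the ThreadIdentity class, its field names, and the choice of a long counter are all assumptions made for the example.

```csharp
using System;
using System.Threading;

// Hypothetical helper, for illustration only.
static class ThreadIdentity
{
    // Process-wide counter; every new thread takes the next value.
    private static long _nextId;

    // Per-thread slot; 0 means "no identifier assigned yet".
    [ThreadStatic] private static long _myId;

    // Identifier for the calling thread, assigned on first use and never
    // handed out again, even after the thread has exited.
    public static long Current
    {
        get
        {
            if (_myId == 0)
                _myId = Interlocked.Increment(ref _nextId);
            return _myId;
        }
    }
}
```

A log record would then carry ThreadIdentity.Current for correlation and keep Thread.CurrentThread.ManagedThreadId purely for display. A new thread that inherits a recycled Managed Thread Id starts with an empty thread static slot, so it still gets a fresh identifier.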
We’ve updated the display to automatically generate unique display names that separate threads sharing the same Id, and made a range of other adjustments to ensure we treat the Managed Thread Id as nothing more unique than a display name. That way you can be sure that if two events are ascribed to something called “Thread 14”, they really are from the same thread. All of the changes for this are included in Gibraltar 2.1.1, which will ship within the next few days (this was the last issue we needed to resolve before shipping).
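One way to produce such names is sketched below. The naming scheme (appending “#2”, “#3” and so on) and the ThreadDisplayNamer class are invented for the example and aren’t necessarily what the product displays. The sketch assumes each thread already has its own unique identifier, and it isn’t thread safe, so a real caller would need to add locking.

```csharp
using System.Collections.Generic;

// Hypothetical helper, for illustration only. Not thread safe.
sealed class ThreadDisplayNamer
{
    // Display name already chosen for each unique thread identifier.
    private readonly Dictionary<long, string> _names = new Dictionary<long, string>();

    // How many distinct threads have been seen for each Managed Thread Id.
    private readonly Dictionary<int, int> _countPerManagedId = new Dictionary<int, int>();

    public string GetName(long uniqueId, int managedThreadId)
    {
        string name;
        if (_names.TryGetValue(uniqueId, out name))
            return name;                      // same thread, same name as before

        int count;
        _countPerManagedId.TryGetValue(managedThreadId, out count);  // 0 if unseen
        count++;
        _countPerManagedId[managedThreadId] = count;

        // The first thread keeps the plain name; later ones get a suffix.
        name = count == 1
            ? "Thread " + managedThreadId
            : "Thread " + managedThreadId + " #" + count;
        _names[uniqueId] = name;
        return name;
    }
}
```

With this sketch, GetName(1, 14) returns “Thread 14”, GetName(2, 14) returns “Thread 14 #2”, and a repeated GetName(1, 14) returns “Thread 14” again.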
Incomplete is worse than Missing
The frustrating part is that if the documentation had never made any claim about the uniqueness of the thread Id, we’d likely have gone through a round of proof and qualification testing. Like many people, when something isn’t documented we create experiments to tease out the true behavior, review source code, and then decide what risks we want to take. This is one reason we are passionate about documentation, even at the expense of extra features. We want to make sure that you never have a doubt about what something in our API does. We also know that people don’t want to read documentation if they don’t have to - so we try hard to make the API understandable just from Intellisense.
Now, I don’t want to knock Microsoft too hard here - .NET is a massive framework even if you just look at the core .NET 2.0 API. But as we all rely more and more on ever-increasing layers of abstraction over what’s really going on, it’s more important than ever for documentation to be precise about what something is and what it isn’t. Being precise matters more than being comprehensive, because it sets the right expectation about what people can rely on and what they’ll have to verify for themselves.