List.Sort internals
- Part 1 - Journey through the .NET internals - Sorting
- Part 2 - This Article
- Part 3 - Array.Sort && TrySZSort
- Part 4 - Managed vs Unmanaged code and interop
- Part 5 - Calling Conventions
Contents
If we want to discover sorting in .NET, then there is no better place to start than List<T>.Sort()
. We all know what this function. As a Junior, I have never bothered to actually check the source code behind it. Back then source was only available to insiders
, a handful of people and partners, considered worthy to look behind the curtain. I still remember times of Steve Ballmer and his firm stand against Open Source. It was actually lack of understanding from Microsoft CEO and a bit of a Goliath telling David that he is not ready yet. These days are gone, Microsoft did a massive U-turn, changed busines model and realised that open source can be a great ‘marketing tool’ to bring people onto services they sell. Nowadays both Reference Source
1 and CLR
2 code are available. For other communities like Java it is nothing special but for .NET that was a massive change. Time to look behind the scenes and analyse the code - if you are curious, follow this link.
A slice of List source code
This is just a slice of the List<T>
implementation. Couple of things to note here.
_items
- Internally List<T>
uses array
to store values. List is a wrapper that adds new functionalities to array like: dynamic resizing, sorting, add, remove and more.
_size
- integer value with current List size. List.Count
returns this one. If new item is added _size
increments by one. This makes List.Count
a bit faster as it doesn’t have to count the elements in array which would take O(n) operations. An interesting question here is - why keeping _size if array already has a property Length? This is due to the array having different size than the list most of the time. Array size is always multiplication of 2 while list can grow by +1. Matt Warren made a great blog post about it 3. I think length is encoded in the memory representation of array [^array-lengt-encoding]. Max number of items is 2.146.435.071 4.
_syncRoot
- used for thread synchronization. There is a great stack overflow answer explaining SyncRoot
pattern 5.
_emptyArray
- static array used for empty array creation - this makes it possible to reference the same array if there are two arrays of type ‘T’. There is no need to allocated new empty arrays all the time new empty one is created.
_defaultCapacity
- initialy empty array has capacity ‘0’ but when adding first items this is the initial capacity. After 4 arrays grow by the multiplication of 2 - 4, 8, 16, 32, 64. This is done automatically 6.
_version
- used to check if List state has changed - we will discuss this one later
List<T>.Sort()
Sorting list is simple. All you need to do is to call Sort
function.
If you want you can sort only a slice of the list.
Analysing source code we can see that Sort
calls Array.Sort<T>
7. It is not a surprise as we already know that internally List is an array.
I removed some code (throw instructions) to make it more readable. It is interesting that there is a lack of consistency for if
statements. First, two are using {}
and the third one is not.
Contract.EndContractBlock()
is part of Code Contracts
8. It is an assertion like a framework used by static analyzers to verify the correctness of code. For C++ developers it is not new as assertions are used extensively in C++. Contracts are more powerful and can be used to generate runtime checks, static checks and also tests by using test Generation tools.
Simple contract check can look like this:
List.Sort
uses a different mechanism to generate contracts using Contract.EndContractBlock()
. It is a special construct that wraps preceding if
statements and code into the code contract semantics. It is useful with legacy code and pre-exisitng defensive code as most of it can be reused without changes.
Array.Sort is called with these parameters:
_items
->T[]
internal representation of the listindex
->0
- start from the beginning of the listCount
->list
count -> sort whole listcomparer
->null
-> if it is null use default comparer
Apart from calling Array.Sort
there is also operation _version++
.
_version++
_version
is an integer value incremented every time state of the list is changes. State here means the items of the list. Operations ending with _version++
:
Add
,Remove
,Clear
,Insert
,[]
using an index to setSort
,Reverse
- changing the order of items is considered a state change
Why would this be useful? You are probably familiar with code like this:
If you ever wondered how does .NET knows when to throw that exception here is your answer. It uses _version
number to do that. But first, we need to go to basics and not use the List
.
If there is a state change in the array inside foreach there is no exception. So it is not foreach responsible for that.
Is it the List? It is easy to check that. If you use List inside for loop and change its state there is also no exception.
You need both foreach
and List
for this exception to happen. So there is something special happening when you combine both foreach
and List
. As foreach
is just a syntactic sugar 9, a code transformed by the compiler to a different form. We can use a tool like sharplab.io to check how it is resolved. Code to this example.
In the end foreach using array is a for loop. Ignore OpCode not supported: LdMemberToken
10.
Now when it comes to using both List
and foreach
there is a big change - code example.
foreach
using List
is not resolved to loop. It uses while
loop with the Enumerator
. I will skip intro
to enumerators and enumeration. You can check this great article 9 if you want to learn more.
In a nutshell, Enumerator
is a wrapper on top of enumeration
- a process of stepping through the items. In a simple for loop, this process is not complicated, take one element and do something with it. You can expand it by adding more logic: validation, checks etc, but there is one limitation. Each next value doesn’t know anything about the previous one. There is no shared context or state.Enumerator
adds this shared context by being a stateful intermediary
betwen the collection and the code. This enables us to add new logic whenever there is a movement to a next item or current item is obtained. The beauty of it is that foreach
understands Enumerator
.
One example which shows the beauty of Enumerator
is list state protection during enumeration. It uses _version
value to do that.
If we simplify the previous example we get a simple code that gets enumerator, uses while
loop to call MoveNext
function and then uses Current
to obtain the element.
If we look inside list.GetEnumerator()
. We can see that all it does is creates a new Enumerator
using itself (the list) as a input parameter.
We need to go deeper to the constructor of Enumerator
to see how it is used.
First, it takes the reference to the list itself. Then sets up an index to 0
- which means that it starts with the first element and current value is set to the default one. But the most important piece here is saving the current _version
value. This value is private
and does not change in the whole lifetime of the Enumerator
. How is it used?
We discussed already how Enumerator
wraps operation to move to next element and get the current element. If we look at MoveNext
source code we can clearly see how _version
is used.
And here you go in the MoveNextRare()
function there is a check if the saved initial version
value is the same as the current list version
. It basically checks if a list was modified during the Enumeration
. If it was then an exception is being thrown.
Tha part with:
Is to check if iteration reached the end of the list. In that case MoveNext
returns false and while
loop ends.
Why is state protection important during enumeration?
The idea of enumeration is to read a stable collection of elements. When enumerating we want to visit every element on the collection only once and do some action on it. Allowing the collection to change during enumeration introduces problems like - possibility of visiting elements two times.
We can go back to the previous example with the Insert
inside a for loop
This code adds 10
to the beginning of the list only once. If you look a the output due to this number 1
is displayed twice and there is no 10
. When 10
is added, i
is 0 and will be 1
on next iteration. Due to 10
added at the front, the element on the 0
index moves to 1
and is printed twice when i is 0
and 1
.
Enumeration
using foreach and List
with _version
value is here to protect us from that kind of mistake. Thanks to it we can you assume that when using foreach List won’t change. This frees you up from implementing more complicated application logic. It is a similar concept to using different types of collections, limiting possibilities and sending a message to a fellow developer
Hey I am returning IEnumerable because I am expecting you to only enumerate this collection
. In this example it says - Hey I am using foreach on purpose and making sure that nothing is gonna change that list during enumeration
.
Ending notes
The Journey has started. There is, of course, more things about List
that we can write about, but I leave that to you dear reader. In next part, we will move down the rabbit hole and discuss Array.Sort()
also touching TrySZSort
magic.
Btw If _version
value is int and it increases every time we change the state of the list. Then does it mean that you can do a limited amount of operations on the list? I am curious about the answers :)
-
This is limitation of
decompilers
not being able to findFieldHandle
this is a special object containingmetadata
.RuntimeHelpers.InitializerArray
is a implementation detail ofnew int[] { 1..6 }
. ↩