ETS oddity
When working with the implementation of the new scalable ordered_set we came across a strangeness with the guarantees when iterating over a table while inserting elements in parallel.
Scenario: #
> Tab = ets:new(test_table,
                [set, public, {write_concurrency, true}]).
#Ref<0.1705802953.985792516.98626>
> P1 = spawn(fun() ->
               ets:insert(Tab, {fir, 1}),
               ets:insert(Tab, {sec, 2})
             end).
> K1 = ets:first(Tab), K2 = ets:next(Tab, K1).
What are the theoretical possible values of K1 and K2? Let us first list the obvious:
- K1 = fir,- K2 = sec- both values inserted and found in term order
- K1 = sec,- K2 = fir- since this is a- set, the hash algorithm may put- secbefore- fir
- K1 = fir,- K2 = '$end_of_table'- only- firhad time to be inserted
- K1 = '$end_of_table',- K2 = badarg- no elements were inserted
However it is also possible to get:
- K1 = sec,- K2 = '$end_of_table'
This was, at first, very counter-intuitive to me. How can the ets:first/1 find the
second value inserted, but then when iterating not find the value inserted before it?
The answer can be found in the way that the write_concurrency functionality is
implemented. Imagine we have a hash table
where each bucket is protected by a mutex. When inserting a new element the mutex for
the current bucket has to be taken and when iterating over the hash table we take
each mutex in turn for the buckets we iterate through.
Initial Table: #
| Bucket # | Values | 
|---|---|
| 1 | [] | 
| 2 | [] | 
| 3 | [] | 
| 4 | [] | 
Finished Table: #
| Bucket # | Values | 
|---|---|
| 1 | [{fir,1}] | 
| 2 | [] | 
| 3 | [] | 
| 4 | [{sec,2}] | 
So, in the scenario that leads to the strange behaviour the following will happen:
- ets:first/1is called when the table is empty and iterates to Bucket #2.
| Bucket # | Values | 
|---|---|
| 1 | [] | 
| 2 (first) | [] | 
| 3 | [] | 
| 4 | [] | 
- The OS does a context switch and P1 is allowed to run.
- P1 inserts both {fir,1}and{sec,2}and then exits.
| Bucket # | Values | 
|---|---|
| 1 | [{fir,a}] | 
| 2 (first) | [] | 
| 3 | [] | 
| 4 | [{sec,b}] | 
- The ets:first/1call resumes and will only seesecand then'$end_of_table'.
When spelled out like this it becomes more logical that it is possible to get only
the element inserted as the second element. This is not normally a problem for
tables of type set which have an arbitrary iteration order that you can’t depend on anyway.
However, for ordered_set you may very well depend on the defined iteration order
and expect ets:first/1 to return a key that has at least been first in the table
at some point in time. But for the same reasons as with set, that is not guaranteed
if you need that guarantee you have to either not use write_concurrency,
find some other way to synchronize or rely on luck… these races are very rare, but in heavily
used tables they will eventually happen.
The same oddity applies to all kinds of table iterations; ets:next/1,
ets:select/1-3, ets:match/1-3 and friends. They may all miss concurrently
inserted keys and return a key that has never existed in the table ordered
directly after the previously returned key.