On the evils of yield

I absolutely love yield. Don’t we all? It simplifies writing enumerations to the point of absurdity. Simplicity, however, is a double-edged sword – I spent the better part of this day debugging a most evil bug, that resulted from over-yielding.

At Delver (now Sears) we have a file-based repository containing millions of items. We try to make things as efficient as possible, and sometime we overdo it. Our sin for today is using IEnumerable a bit too much. This repository was designed to be:

  1. Scalable (within our constraints) – should able to hold several million tasks
  2. Fast
  3. Relatively convenient to use – the user should be able to iterate on it using foreach, for one.

To accomplish 1 and 2, we avoided allocating large in-memory structures because they wouldn’t be able to hold the amount of items we’re talking about. To provide a convenient interface, we used IEnumerable.

Here is a mock-up of the code (for simplicity, it doesn’t use the disk but an in-memory serialized dictionary):

public class PeopleRepository
{
    private readonly Dictionary _serializedPeople = new Dictionary();
 
    public void Save(Person person)
    {
        // this method, as innocent as it looks, make it more difficult to discover the bug. See ahead.
        Save(new[]{person});
    }
 
    public void Save(IEnumerable<Person> people)
    {
        var serializedPeople = from p in people select new {p.ID, p.Serialized};
        foreach (var p in serializedPeople)
            _serializedPeople[p.ID] = p.Serialized;
    }
 
    public IEnumerable<Person> Read(Predicate<Person> predicate)
    {
        foreach (int id in _serializedPeople.Keys)
        {
            var person = new Person(_serializedPeople[id]);
            if (predicate(person))
                yield return person;
        }
    }
}

The bug I tracked was that updates to the repository were not taking place, but instead were simply ignored. The first thing I tried, was writing a simple test:

// Setup
var repository = new PeopleRepository();
repository.Save(new Person(1, " John")); // oops, I put an extra space here
 
// Find &amp; Fix John
var john = repository.Read(p => p.ID == 1).First();
john.Name = john.Name.Trim();
 
// Fix poor John back to the repository
repository.Save(john);
 
// Make sure john is saved properly
john = repository.Read(p => p.ID == 1).First();
if (john.Name != "John")
    throw new Exception();

Sadly, this test passed with flying colors. More debugging revealed the problem happened because both our Read and Save methods returned IEnumerables. It appears that Read() read the items and made the required updates … but … when writing the items back to the repository, it iterated on them.

Let me repeat – we read some items, iterated on them and modified some, saved and thus reiterated. Bam!

Bam

The second iteration didn’t iterate on the modified items – because the internal implementation of Read used a yield statement, there was no actual collection returned. So the second iteration just caused the repository to re-read all the items from disk, and ignore the modified items.

Conclusion: whenever you see methods that return IEnumerables, be suspicious. Odds are it should return a List or Collection. And whatever you do, watch out from feeding that IEnumerable back to the same repository.

Here is a final test that almost reproduces the problem. It crashes with a CollectionWasModified exception, while my actual test & code just silently failed (because the repository I mocked up here doesn’t save to the disk, but rather keep everything in-memory).

// Setup
var repository = new PeopleRepository();
repository.Save(new Person(1, " John")); // oops, I saved a space in front of John
 
// Let's read and fix all people starting with a space
var people = repository.Read(p => p.Name.StartsWith(" "));
foreach (var person in people)
    person.Name = person.Name.Trim();
 
// store the modified points back
repository.Save(people);

2 Comments

  1. M. A. Hanin:

    “Odds are it should return a List or Collection” – very true!

    One thing I took from “framework design guidelines” (a book by Microsoft), is to be flexible about what you accept and strict about what you return. Therefore, it is a good practice to accept an IEnumerable when taking a collection as an argument, but defining the return type as IEnumerable is bad practice (as opposed to, say, returning an array, or List, or even IList).

    An IEnumerable really should be the last choice for a return type IMHO, and we should prefer to work with more explicit interfaces or classes when actually processing the collection – or restrict ourselves to actually enumerating (“for each”).
    A trivial example is the Count method of the IEnumerable interface – this is an extension method, which counts the number of elements in the collection by… that’s right – enumerating through the entire collection, making it a O(n) operation. If we work with a List, Collection or Array, and use their respective Count/Length instance methods / properties, then we’re executing an O(1) operation.

  2. ripper234:

    Yeah, that’s a good book.

    It took me a while to remember what this post was about.
    C# as a language tends to be very magical. It’s fun most of the time … until it bites you in the ass.

    (See C# vs Java)

Leave a comment