Sorting large CSV files by column using C#

This is a follow up to my previous post about sorting (really) large files using c#.

The problem

We have the following CSV file:

FirstName,LastName,FullName,Username,Email,UniqueIndex,Id,ProfilePicture
Hector,Reichel,Hector Reichel,Hector92,h@example.org,Value 0,adf1cf72-40eb-4596-b809-b509ffdcd1b7,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/417.jpg
Lamar,Schmeler,Lamar Schmeler,Lamar76,l@example.org,Value 1,a29fd784-c554-4301-b689-e678df358cc7,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/434.jpg
Pearl,Rosenbaum,Pearl Rosenbaum,Pearl.Rosenbaum54,p@example.org,Value 2,3f77c9b3-820b-4afd-8dbc-fb72dc4f81d3,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/945.jpg
Guido,Wolf,Guido Wolf,Guido53,g@example.org,Value 3,12618f37-e4eb-4c62-9a8f-104d77bea283,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/500.jpg
Alexis,Schneider,Alexis Schneider,Alexis_Schneider40,a@example.org,Value 4,df7b9351-a049-4fa3-a4ea-b753bc6df517,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/660.jpg

In my previous post I sorted the file based on the whole line. Example:

FirstName,LastName,FullName,Username,Email,UniqueIndex,Id,ProfilePicture
Alexis,Schneider,Alexis Schneider,Alexis_Schneider40,a@example.org,Value 4,df7b9351-a049-4fa3-a4ea-b753bc6df517,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/660.jpg
Guido,Wolf,Guido Wolf,Guido53,g@example.org,Value 3,12618f37-e4eb-4c62-9a8f-104d77bea283,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/500.jpg
Hector,Reichel,Hector Reichel,Hector92,h@example.org,Value 0,adf1cf72-40eb-4596-b809-b509ffdcd1b7,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/417.jpg
Lamar,Schmeler,Lamar Schmeler,Lamar76,l@example.org,Value 1,a29fd784-c554-4301-b689-e678df358cc7,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/434.jpg
Pearl,Rosenbaum,Pearl Rosenbaum,Pearl.Rosenbaum54,p@example.org,Value 2,3f77c9b3-820b-4afd-8dbc-fb72dc4f81d3,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/945.jpg

Now, I want to be able to sort the file based on a column value instead, example:
Sorted on the LastName column

FirstName,LastName,FullName,Username,Email,UniqueIndex,Id,ProfilePicture
Hector,Reichel,Hector Reichel,Hector92,h@example.org,Value 0,adf1cf72-40eb-4596-b809-b509ffdcd1b7,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/417.jpg
Pearl,Rosenbaum,Pearl Rosenbaum,Pearl.Rosenbaum54,p@example.org,Value 2,3f77c9b3-820b-4afd-8dbc-fb72dc4f81d3,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/945.jpg
Lamar,Schmeler,Lamar Schmeler,Lamar76,l@example.org,Value 1,a29fd784-c554-4301-b689-e678df358cc7,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/434.jpg
Alexis,Schneider,Alexis Schneider,Alexis_Schneider40,a@example.org,Value 4,df7b9351-a049-4fa3-a4ea-b753bc6df517,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/660.jpg
Guido,Wolf,Guido Wolf,Guido53,g@example.org,Value 3,12618f37-e4eb-4c62-9a8f-104d77bea283,https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/500.jpg

My ExternalMergeSorter already has support for custom sorting like this by passing in a custom IComparer<string>.

The only thing we need to do is to create our own implementation of the IComparer<string> and then enable it like this:

var unsortedFile = File.OpenRead("MyUnsortedFile.csv");
var targetFile = File.OpenRead("MySortedFile.csv");
var options = new ExternalMergeSorterOptions
{
    Sort = new ExternalMergeSortSortOptions
    {
        // This is our custom comparer
        Comparer = new CsvColumnSorter()
    }
};
var sorter = new ExternalMergeSorter(options);

await sorter.Sort(unsortedFile, targetFile);

Implementations

It's easy to implement the IComparer<string> interface.

public class CsvColumnSorter : IComparer<string>
{
    public int Compare(string x, string y)
    {
        // return -1 (or less) if x < y
        // return 1 (or more) if y > x
        // return 0 if x == y
    }
}

I can think of at least 3 different ways of implementing the comparer, two "bad" and one "awesome" way of doing this. Bad as in slow and "allocaty".
Let's start with the "bad" ones.
Note: I will skip most of the error handling in my implementations, escaping of chars and what not will not be taken into consideration since my CSV files are well formated. If you need functionality like this you will need to extend my implementations or use a library like CSVHelper or something similar.

String.Split

/// <summary>
/// Don't use this code.
/// </summary>
public class CsvColumnSorter_StringSplit : IComparer<string>
{
    private readonly int _column;
    private readonly char _separator;

    public CsvColumnSorter_StringSplit(int column, char separator = ',')
    {
        _column = column;
        _separator = separator;
    }

    public int Compare(string? x, string? y)
    {
        if (x == null && y != null)
        {
            return -1;
        }

        if (y == null && x != null)
        {
            return 1;
        }

        if (x == null || y == null)
        {
            return 0;
        }

        var xColumn = GetColumnValue(x);
        var yColumn = GetColumnValue(y);

        return Comparer<string>.Default.Compare(xColumn, yColumn);
    }

    private string GetColumnValue(string value)
    {
        var columns = value.Split(_separator);
        if (columns.Length < _column)
        {
            throw new ArgumentOutOfRangeException($"Found fewer columns than expected. Actual: {columns.Length}, Expected: {_column}");
        }

        return columns[_column - 1];
    }
}

Usage

var unsortedFile = File.OpenRead("MyUnsortedFile.csv");
var targetFile = File.OpenRead("MySortedFile.csv");
var options = new ExternalMergeSorterOptions
{
    Sort = new ExternalMergeSortSortOptions
    {
        // Sort on the LastName column (second column)
        Comparer = new CsvColumnSorter_StringSplit(2) 
    }
};
var sorter = new ExternalMergeSorter(options);

await sorter.Sort(unsortedFile, targetFile);

This code works but it is slow and inefficient. In the GetColumnValue method it splits the string using the separator, thus allocating an array = not good, we can do better.

Substring

/// <summary>
/// Don't use this code
/// </summary>
public class CsvColumnSorter_Substring : IComparer<string>
{
    private readonly int _column;
    private readonly char _separator;

    public CsvColumnSorter_Substring(int column, char separator = ',')
    {
        _column = column;
        _separator = separator;
    }

    public int Compare(string? x, string? y)
    {
        if (x == null && y != null)
        {
            return -1;
        }

        if (y == null && x != null)
        {
            return 1;
        }

        if (x == null || y == null)
        {
            return 0;
        }

        var xColumn = GetColumnValue(x);
        var yColumn = GetColumnValue(y);

        return string.Compare(xColumn, yColumn, StringComparison.OrdinalIgnoreCase);
    }

    private string GetColumnValue(string value)
    {
        var columnCounter = 1;
        var columnStartIndex = 0;
        for (var i = 0; i < value.Length; i++)
        {
            if (value[i].Equals(_separator))
            {
                columnCounter++;
                continue;
            }

            if (columnCounter != _column)
            {
                continue;
            }

            columnStartIndex = i;
            break;
        }

        var columnLength = 0;
        var slice = value[columnStartIndex..];
        for (var i = 0; i < slice.Length; i++)
        {
            if (slice[i] != _separator)
            {
                columnLength++;
            }
            else
            {
                break;
            }
        }

        return value.Substring(columnStartIndex, columnLength);
    }
}

Now we are using Substring instead of String.Split. This is much better in terms of allocations (no array), but we are still allocating a new string when doing the actual .Substring call. We can do better.

Span

public class CsvColumnSorter_Span : IComparer<string>
{
    private readonly int _column;
    private readonly char _separator;

    public CsvColumnSorter_Span(int column, char separator = ',')
    {
        _column = column;
        _separator = separator;
    }

    public int Compare(string? x, string? y)
    {
        if (x == null && y != null)
        {
            return -1;
        }

        if (y == null && x != null)
        {
            return 1;
        }

        if (x == null || y == null)
        {
            return 0;
        }

        var xColumn = GetColumnValue(x);
        var yColumn = GetColumnValue(y);

        return xColumn.CompareTo(yColumn, StringComparison.OrdinalIgnoreCase);
    }

    private ReadOnlySpan<char> GetColumnValue(string value)
    {
        var span = value.AsSpan();
        var columnCounter = 1;
        var columnStartIndex = 0;
        for (var i = 0; i < span.Length; i++)
        {
            if (span[i].Equals(_separator))
            {
                columnCounter++;
                continue;
            }

            if (columnCounter != _column)
            {
                continue;
            }

            columnStartIndex = i;
            break;
        }

        var columnLength = 0;
        var slice = span[columnStartIndex..];
        for (var i = 0; i < slice.Length; i++)
        {
            if (slice[i] != _separator)
            {
                columnLength++;
            }
            else
            {
                break;
            }
        }

        return span.Slice(columnStartIndex, columnLength);
    }
}

We are now turning our string into a Span. This is key to achieve the least amount of allocations. Instead of creating a new string instance, we are returning a Slice of the string. The slice is of type ReadOnlySpan<char> and it "points to" the section of the original string that we are interested in, column number 2.

Benchmarks

BenchmarkDotNet

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22000
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK=6.0.100-preview.6.21355.2
  [Host]   : .NET 6.0.0 (6.0.21.35212), X64 RyuJIT
  .NET 6.0 : .NET 6.0.0 (6.0.21.35212), X64 RyuJIT

Job=.NET 6.0  Runtime=.NET 6.0  InvocationCount=1
UnrollFactor=1

|      Method |     Mean |     Error |    StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------ |---------:|----------:|----------:|------:|--------:|------:|------:|------:|----------:|
| StringSplit | 5.040 us | 0.0811 us | 0.1601 us |  1.00 |    0.00 |     - |     - |     - |   7,288 B |
|   Substring | 2.700 us | 0.0539 us | 0.1012 us |  0.54 |    0.02 |     - |     - |     - |   5,000 B |
|        Span | 2.220 us | 0.0717 us | 0.1902 us |  0.45 |    0.04 |     - |     - |     - |     208 B |

Here we can clearly see that the first implementation using String.Split performs the worst when it comes to speed and allocation (because of the array of strings).
The Substring runs a bit slower compared to the Span implementation but it also allocates much more. Span ftw! :)

The span implementation can be found on GitHub.

Josef Ottosson

Freelancing Developer · Microsoft MVP