Photo by Alex Block

This is a follow up to my previous post about sorting large files by column using c#.

The problem

We have the following CSV file:

FirstName,LastName,Album,Attendance,City
Marshall,Mathers,The Slim Shady LP,55000,London
Marshall,Mathers,The Slim Shady LP,45000,Stockholm
Marshall,Mathers,The Slim Shady LP,40000,New York
Marshall,Mathers,The Slim Shady LP,30000,Axvall
Marshall,Mathers,The Slim Shady LP,90000,Porto
Marshall,Mathers,The Slim Shady LP,70000,Barcelona
Eminem,,Infinite,70000,Detroit

We want to sort the file by Attendance and then by FirstName.

Marshall,Mathers,The Slim Shady LP,30000,Axvall
Marshall,Mathers,The Slim Shady LP,40000,New York
Marshall,Mathers,The Slim Shady LP,45000,Stockholm
Marshall,Mathers,The Slim Shady LP,55000,London
Eminem,,Infinite,70000,Detroit
Marshall,Mathers,The Slim Shady LP,70000,Barcelona
Marshall,Mathers,The Slim Shady LP,90000,Porto

The only thing we need to do is to implement a custom IComparer<string> and pass it to my ExternalMergeSorter.

Implementations

Let's use the same approaches as we did in the previous post, string.Split, Substring and Span.
Also, remember that the implementations, more or less, just supports a happy path when it comes to parsing the csv rows. If you have more "complex" csv files I recommend that you use CSVHelper instead of my "custom parsing".

The high-level algorithm for all implementations are something like this:

  1. Loop through the passed in columns
  2. Extract the column value
  3. Compare the two columns
  4. If not equal -> return, sorting done.
  5. If equal, repeat the same logic for the next column (if any).

String.Split

/// <summary>
/// Don't use this code
/// </summary>
public class CsvMultipleColumnSorter_StringSplit : IComparer<string>
{
    private readonly IReadOnlyCollection<int> _columns;
    private readonly char _separator;

    public CsvMultipleColumnSorter_StringSplit(IReadOnlyCollection<int> columns, char separator = ',')
    {
        _columns = columns;
        _separator = separator;
    }

    public int Compare(string? x, string? y)
    {
        if (x == null && y != null)
        {
            return -1;
        }

        if (y == null && x != null)
        {
            return 1;
        }

        if (x == null || y == null)
        {
            return 0;
        }

        var result = 0;
        var xColumns = x.Split(_separator);
        var yColumns = y.Split(_separator);
        foreach (var column in _columns)
        {
            var xColumn = GetColumnValue(xColumns, column);
            var yColumn = GetColumnValue(yColumns, column);

            result = string.Compare(xColumn, yColumn, StringComparison.OrdinalIgnoreCase);

            if (result != 0)
            {
                return result;
            }
        }

        return result;
    }

    private static string GetColumnValue(string[] columns, int column)
    {
        if (columns.Length < column)
        {
            throw new ArgumentOutOfRangeException($"Found fewer columns than expected. Actual: {columns.Length}, Expected: {column}");
        }

        return columns[column - 1];
    }
}

Usage

var unsortedFile = File.OpenRead("MyUnsortedFile.csv");
var targetFile = File.OpenRead("MySortedFile.csv");
var columns = new List<int> {4, 1};
var options = new ExternalMergeSorterOptions
{
    Sort = new ExternalMergeSortSortOptions
    {
        Comparer = new CsvMultipleColumnSorter_StringSplit(columns) 
    }
};
var sorter = new ExternalMergeSorter(options);

await sorter.Sort(unsortedFile, targetFile);

This is our first approach that relies on string.Split to extract the different columns and then sort based on the passed in columns. The code works and is quite easy to understand. It's not effiecent though (as you will see in the benchmarks), don't use this code.

Substring

public class CsvMultipleColumnSorter_Substring : IComparer<string>
{
    private readonly IReadOnlyCollection<int> _columns;
    private readonly char _separator;

    public CsvMultipleColumnSorter_Substring(
        IReadOnlyCollection<int> columns,
        char separator = ',')
    {
        _columns = columns ?? throw new ArgumentNullException(nameof(columns));
        _separator = separator;
    }

    public int Compare(string? x, string? y)
    {
        if (x == null && y != null)
        {
            return -1;
        }

        if (y == null && x != null)
        {
            return 1;
        }

        if (x == null || y == null)
        {
            return 0;
        }

        var result = 0;
        foreach (var column in _columns)
        {
            var xColumn = GetColumnValue(x, column);
            var yColumn = GetColumnValue(y, column);

            result = string.Compare(xColumn, yColumn, StringComparison.OrdinalIgnoreCase);

            if (result != 0)
            {
                return result;
            }
        }

        return result;
    }

    private string GetColumnValue(string value, int column)
    {
        var columnCounter = 1;
        var columnStartIndex = 0;
        for (var i = 0; i < value.Length; i++)
        {
            if (value[i].Equals(_separator))
            {
                columnCounter++;
                continue;
            }

            if (columnCounter != column)
            {
                continue;
            }

            columnStartIndex = i;
            break;
        }

        var columnLength = 0;
        var slice = value[columnStartIndex..];
        for (var i = 0; i < slice.Length; i++)
        {
            if (slice[i] != _separator)
            {
                columnLength++;
            }
            else
            {
                break;
            }
        }

        return value.Substring(columnStartIndex, columnLength);
    }
}

Here we're using subtring instead of string.Split to avoid doing the array allocation of all the different columns. At the end, when doing the actual Substring call, we allocate a new string. This code might be a bit harder to understand compared to the string.Split approach. Since we know that we can do even better than this code, I would not use this code either. Let's check out the span implementation.

Span

public class CsvMultipleColumnSorter_Span : IComparer<string>
{
    private readonly IReadOnlyCollection<int> _columns;
    private readonly char _separator;

    public CsvMultipleColumnSorter_Span(
        IReadOnlyCollection<int> columns,
        char separator = ',')
    {
        _columns = columns ?? throw new ArgumentNullException(nameof(columns));
        _separator = separator;
    }

    public int Compare(string? x, string? y)
    {
        if (x == null && y != null)
        {
            return -1;
        }

        if (y == null && x != null)
        {
            return 1;
        }

        if (x == null || y == null)
        {
            return 0;
        }

        var xSpan = x.AsSpan();
        var ySpan = y.AsSpan();
        var result = 0;
        foreach (var column in _columns)
        {
            var xValue = GetColumnValue(xSpan, column);
            var yValue = GetColumnValue(ySpan, column);

            result = xValue.CompareTo(yValue, StringComparison.OrdinalIgnoreCase);

            if (result != 0)
            {
                return result;
            }
        }

        return result;
    }

    /// <summary>
    /// Really naive implementation, just used as a POC.
    /// </summary>
    /// <param name="span"></param>
    /// <param name="column"></param>
    /// <returns></returns>
    private ReadOnlySpan<char> GetColumnValue(ReadOnlySpan<char> span, int column)
    {
        var columnCounter = 1;
        var columnStartIndex = 0;
        for (var i = 0; i < span.Length; i++)
        {
            if (span[i].Equals(_separator))
            {
                columnCounter++;
                continue;
            }

            if (columnCounter != column)
            {
                continue;
            }

            columnStartIndex = i;
            break;
        }

        var columnLength = 0;
        var slice = span[columnStartIndex..];
        for (var i = 0; i < slice.Length; i++)
        {
            if (slice[i] != _separator)
            {
                columnLength++;
            }
            else
            {
                break;
            }
        }

        return span.Slice(columnStartIndex, columnLength);
    }
}

Here we're calling .AsSpan() on the csv rows.
I'll just quote my previous post:

Instead of creating a new string instance, we are returning a Slice of the string. The slice is of type ReadOnlySpan<char> and it "points to" the section of the original string that we are interested in, column number 2.
This is key to achieve the least amount of allocations.

Let's have a look at the benchmarks.

Benchmarks

BenchmarkDotNet

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK=6.0.100-preview.7.21379.14
  [Host]   : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT
  .NET 6.0 : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT

Job=.NET 6.0  Runtime=.NET 6.0  InvocationCount=1  
UnrollFactor=1  

|      Method |     Mean |     Error |    StdDev |   Median | Ratio | RatioSD | Allocated |
|------------ |---------:|----------:|----------:|---------:|------:|--------:|----------:|
| StringSplit | 7.664 μs | 0.3636 μs | 1.0608 μs | 7.400 μs |  1.00 |    0.00 |  12,088 B |
|   Substring | 4.005 μs | 0.2385 μs | 0.6957 μs | 3.700 μs |  0.53 |    0.12 |   2,384 B |
|        Span | 3.384 μs | 0.1522 μs | 0.4392 μs | 3.200 μs |  0.45 |    0.08 |     816 B |

No surprises here. The string.split approach is slowest and allocates the most. The span implementation allocates the least and runs the fastest, as I said, no surprises :)

"Real world scenario" - 10 million rows

String.Split

combined-sorting-string-split

Substring

combined-sorting-substring

Span

combined-sorting-span

We can see the same pattern here. String.Split is slowest and the span implementation is fastest (by quite a bit).

All code can be found at GitHub.