I decided to pick it up a bit. Perhaps this will not be the answer, but I still want to share information.
So, I want to: generate a large array of data, write a few sorts, compare performance.
Data generation:
It's simple. There is only 1 rule - the number in the string does not begin with 0, that is, the number in the string does not have leading zeros. The generation of the data itself looks like this:
private static Random random = new Random(); public static string[] GenerateRandomStringArray() { string[] ret = new string[200000]; for(int i=0; i<ret.Length; i++) ret[i] = GenerateRandomString(); return ret; } public static string GenerateRandomString() { var sb = new StringBuilder(); var len = random.Next(1, 50); sb.Append(random.Next(1, 10)); for(var i=0; i<len; i++) sb.Append(random.Next(0, 10)); return sb.ToString(); } public static string[] Copy(string[] input) { var dest = new string[input.Length]; Array.Copy(input, dest, input.Length); return dest; }
We'll need the Copy method later.
1) Current option
Next, copy the sort from the question. More precisely, I simply implement the comparer in a similar way.
public class Comparer : IComparer<string> { public int Compare(string s1, string s2) { int x; if (int.TryParse(s1, out x) && int.TryParse(s2, out x)) { if (int.Parse(s1) > int.Parse(s2)) { return 1; } if (int.Parse(s1) < int.Parse(s2)) { return -1; } return 0; } else { if (s1.Length > s2.Length) { return 1; } else if (s1.Length < s2.Length) { return -1; } else { return string.Compare(s1, s2); } } } }
Sorting by this comparer will look something like this:
array = array.OrderBy(x => x, new Comparer()).ToArray();
2) New Comparer
As we have seen, the previous comparer is no good. Too many gestures trying to convert text to a number. Let's try to rewrite it.
public class NewComparer : IComparer<string> { public int Compare(string s1, string s2) { var lenCompare = s1.Length.CompareTo(s2.Length); if (lenCompare != 0) return lenCompare; return string.CompareOrdinal(s1, s2); } }
It turned out compact and efficient. Enjoy so
array = array.OrderBy(x => x, new Comparer()).ToArray();
3) Bitwise sorting
Bitwise sorting is good because it quickly sorts strings with a reasonable limitation of the alphabet (and also depends on the minimum and maximum string lengths). But this is just our case - we only have numbers in the alphabet, that is, the size of the alphabet is 10.
Generally, LSD is recommended for sorting numbers, but I had MSD implementation somewhere in the bins, so I didn’t invent anything and just screwed MSD - in theory, the speed should be about the same as LSD.
And one more thing. Before MSD sorting, I will sort and divide the lines by sorting by counting, as MSD does not take into account the sorting by itself (more precisely, it takes into account, but not in the way I need). In general, it will be fun :)
As for the implementation, at first I added a function, by which I will receive the character code by index from a string. I dashed for this function, since requests can go beyond the string and you must return -1 in this case.
private static int GetChar(string s, int pos) { if (pos >= 0 && pos < s.Length) return s[pos] - '0'; else return -1; }
Further, the sorting itself:
public static void MSD(string[] inp, string[] tmp, int start, int end, int d) { if (end <= start) return; int r = 10; int[] count = new int[r + 2]; for (int i = start; i <= end; i++) count[GetChar(inp[i], d) + 2]++; for (int i = 0; i <= r; i++) count[i + 1] += count[i]; for (int i = start; i <= end; i++) tmp[count[GetChar(inp[i], d) + 1]++] = inp[i]; for (int i = start; i <= end; i++) inp[i] = tmp[i - start]; for (int i = 0; i < r; i++) MSD(inp, tmp, start + count[i], start + count[i + 1] - 1, d + 1); }
Sort by counting to take into account the lengths of lines + MSD for each group of lines of the same length:
public static void CountSortWithMSD(string[] inp, string[] tmp, int start, int end) { if (end <= start) return; int r = 55; int[] count = new int[r + 2]; for (int i = start; i <= end; i++) count[inp[i].Length + 2]++; for (int i = 0; i <= r; i++) count[i + 1] += count[i]; for (int i = start; i <= end; i++) tmp[count[inp[i].Length + 1]++] = inp[i]; for (int i = start; i <= end; i++) inp[i] = tmp[i - start]; for (int i = 0; i < r; i++) MSD(inp, tmp, start + count[i], start + count[i + 1] - 1, 0); }
The main sorting function:
public static void MSD(string[] inp) { string[] tmp = new string[inp.Length]; CountSortWithMSD(inp, tmp, 0, inp.Length - 1); }
So, we have 3 implementations of sorting. How to make sure they all work the same way? Something like this (the code for the MyClass class will be lower):
var test = new MyClass(); // тут внутри я сгенерирую начальный массив строк var copy1 = test.LinqOldWay(); var copy2 = test.LinqNewWay(); var copy3 = test.MSDWay(); for (int i = 0; i < copy1.Length; i++) { if (string.CompareOrdinal(copy1[i], copy2[i]) != 0 || string.CompareOrdinal(copy1[i], copy3[i]) != 0) Console.WriteLine($"{copy1[i]} - {copy2[i]} - {copy3[i]}"); }
In fact, here with the same input data, I do 3 sortings and check. that the results are exactly the same (if not the same, I will display the difference on the screen). Let me remind you that the input data is 200k random rows of length from 1 to 50 characters.
Now, how to check which option is faster? Write a benchmark, of course. I will use the BenchmarkDotNet library
public class MyClass { public MyClass() { initialData = GenerateRandomStringArray(); } string[] initialData; [Benchmark] public string[] LinqOldWay() { var copy = Copy(initialData); copy = copy.OrderBy(x => x, new Comparer()).ToArray(); return copy; } [Benchmark] public string[] LinqNewWay() { var copy = Copy(initialData); copy = copy.OrderBy(x => x, new NewComparer()).ToArray(); return copy; } [Benchmark] public string[] MSDWay() { var copy = Copy(initialData); MSD(copy); return copy; } }
Of course, one cannot say that the test is directly very honest, since the linq variants still create new collections, while MSD uses additional memory, but at the same time sorts the array in place. But I do not think that this will somehow have a strong influence on the working time, considering that we sort the 200k elements. So, we run the tests:
BenchmarkRunner.Run<MyClass>();
I will not write all the details of the test, the main table looks like this:
Method | Mean | Error | StdDev | ----------- |------------:|----------:|----------:| LinqOldWay | 2,223.73 ms | 31.922 ms | 29.860 ms | LinqNewWay | 276.29 ms | 5.376 ms | 5.029 ms | MSDWay | 90.88 ms | 1.333 ms | 1.247 ms |
We see that a simple revision of the comparer has accelerated the sorting 8 times. And the bitwise sorting turned out to be a little bit faster - about 3 times faster than the new comparer and 25 times faster than the initial version. I’m sure that if you devote a little more to this time, and instead of MSD, use LSD after breaking the array into subarrays of the same length, then the result may be even better (although the asymptotics should be the same).
UPD On the advice of @PavelMayorov corrected the GetChar function to get rid of sorting by count. The result looks like this:
private static int GetChar2(string s, int pos, int maxLength) { int d = pos - maxLength + s.Length; if (d >= -0 && d < s.Length) return s[d] - '0'; else return 0; } public static void MSD2(string[] inp) { string[] tmp = new string[inp.Length]; MSD2(inp, tmp, 0, inp.Length - 1, 0); } public static void MSD2(string[] inp, string[] tmp, int start, int end, int d) { if (end <= start) return; if (d > 55) return; int r = 10; int[] count = new int[r + 2]; for (int i = start; i <= end; i++) count[GetChar2(inp[i], d, 55) + 2]++; for (int i = 0; i <= r; i++) count[i + 1] += count[i]; for (int i = start; i <= end; i++) tmp[count[GetChar2(inp[i], d, 55) + 1]++] = inp[i]; for (int i = start; i <= end; i++) inp[i] = tmp[i - start]; for (int i = 0; i < r; i++) MSD2(inp, tmp, start + count[i], start + count[i + 1] - 1, d + 1); }
Test
[Benchmark] public string[] MSDWay2() { var copy = Copy(initialData); MSD2(copy); return copy; }
Results compared to MSD (test on another computer, because these results are in absolute values different from the previous test)
Method | Mean | Error | StdDev | ----------- |-----------:|----------:|-----------:| LinqOldWay | 150.878 ms | 4.7384 ms | 13.7471 ms | LinqNewWay | 9.986 ms | 0.1962 ms | 0.5204 ms | MSDWay | 2.245 ms | 0.0449 ms | 0.1084 ms | MSDWay2 | 10.000 ms | 0.2809 ms | 0.7830 ms |
It is seen that the second method is about the same speed as the method with the new comparator. This is because the second method will poll short strings as well as long ones, and the GetChar2 method GetChar2 will be much more than GetChar (for 200 random rows, GetChar GetChar 1745584 times GetChar 1745584 GetChar 1745584 times).
UPD distribution of rows depending on the length of random data. The first column is the length of the string, the second is the number of rows with such a length

UPD
So, on the advice of @PavelMayorov, I changed the data generator.
public static string GenerateRandomString() { var sb = new StringBuilder(); for (var i = 0; i < 50; i++) sb.Append(random.Next(0, 10)); return sb.ToString().TrimStart('0'); }
Now the distribution is uniform, the distribution of lengths of lines:

The tests are the same, the amount of data is the same. Results:
Method | Mean | Error | StdDev | ----------- |------------:|-----------:|-----------:| LinqOldWay | 1,821.69 ms | 62.8858 ms | 58.8234 ms | LinqNewWay | 177.76 ms | 3.4509 ms | 3.5438 ms | MSDWay | 71.54 ms | 1.4094 ms | 1.3842 ms | MSDWay2 | 93.57 ms | 0.8887 ms | 0.7878 ms |
Now the updated MSD is very close to the first version - from the fact that there is no skew in the lengths of numbers.
StringComparison.Ordinalparameter or theCompareOrdinalmethod. In Linky OrderBy, respectively, you can use theStringComparer.Ordinalparameter. - Alexander Petrov