How to properly sort a long sequence of numbers in a string?

Question

There is a list of data that is sorted by the sortingField field, which is a string type with a maximum length of 50 characters. The data is sorted via LINQ OrderBy, since the string length can reach 50 characters - not all values can be converted to numeric type. Therefore, I implemented the IComparer interface method in the class of sortable objects. The method looks like this:

public int Compare(string s1, string s2) { int x; if (int.TryParse(s1, out x) && int.TryParse(s2, out x)) { if (int.Parse(s1) > int.Parse(s2)) { return 1; } if (int.Parse(s1) < int.Parse(s2)) { return -1; } return 0; } else { if (s1.Length > s2.Length) { return 1; } else if (s1.Length < s2.Length) { return -1; } else { return string.Compare(s1, s2); } } }

The first question is: is such an implementation acceptable or is there a better approach? The second question is: is it worth doing a cast to a numerical type and comparing it like this if it is possible or is it better to compare only string values, what comparison should be faster in this case?

The goal is to make a quick comparison of string data that consists only of numbers, but because of the length of the string (it is impossible to reduce), it is impossible to reduce all values to an integer type so that the sorting takes place according to the logic of sorting a number.

PS is also not clear such a moment. To call the Compare method, which I implemented, instead of writing .OrderBy(f => int.Parse(f.sortingField)) use the following record .OrderBy(f => f.sortingField, new MyClass()) . Honestly, I don’t understand why I need to call a class constructor, whose objects I sort with the implementation of my method. And how does this constructor challenge hit performance?

also, kmk, numbers take up less memory and compare faster than strings
I look at your comparator - why not first compare the lengths of the lines, because if one line is longer than the other, then it is clear that the number inside is larger, and you do not need to parse the line into a number in that case
Comparing numbers is, of course, faster than comparing strings.
But parsing lines into numbers is very long, so it eats up a possible profit.
For maximum performance, use the StringComparison.Ordinal parameter or the CompareOrdinal method.
In Linky OrderBy, respectively, you can use the StringComparer.Ordinal parameter.

Accepted Answer · 2018-09-27T00:05:37

I decided to pick it up a bit. Perhaps this will not be the answer, but I still want to share information.

So, I want to: generate a large array of data, write a few sorts, compare performance.

Data generation:

It's simple. There is only 1 rule - the number in the string does not begin with 0, that is, the number in the string does not have leading zeros. The generation of the data itself looks like this:

 private static Random random = new Random(); public static string[] GenerateRandomStringArray() { string[] ret = new string[200000]; for(int i=0; i<ret.Length; i++) ret[i] = GenerateRandomString(); return ret; } public static string GenerateRandomString() { var sb = new StringBuilder(); var len = random.Next(1, 50); sb.Append(random.Next(1, 10)); for(var i=0; i<len; i++) sb.Append(random.Next(0, 10)); return sb.ToString(); } public static string[] Copy(string[] input) { var dest = new string[input.Length]; Array.Copy(input, dest, input.Length); return dest; }

We'll need the Copy method later.

1) Current option

Next, copy the sort from the question. More precisely, I simply implement the comparer in a similar way.

 public class Comparer : IComparer<string> { public int Compare(string s1, string s2) { int x; if (int.TryParse(s1, out x) && int.TryParse(s2, out x)) { if (int.Parse(s1) > int.Parse(s2)) { return 1; } if (int.Parse(s1) < int.Parse(s2)) { return -1; } return 0; } else { if (s1.Length > s2.Length) { return 1; } else if (s1.Length < s2.Length) { return -1; } else { return string.Compare(s1, s2); } } } }

Sorting by this comparer will look something like this:

 array = array.OrderBy(x => x, new Comparer()).ToArray();

2) New Comparer

As we have seen, the previous comparer is no good. Too many gestures trying to convert text to a number. Let's try to rewrite it.

 public class NewComparer : IComparer<string> { public int Compare(string s1, string s2) { var lenCompare = s1.Length.CompareTo(s2.Length); if (lenCompare != 0) return lenCompare; return string.CompareOrdinal(s1, s2); } }

It turned out compact and efficient. Enjoy so

 array = array.OrderBy(x => x, new Comparer()).ToArray();

3) Bitwise sorting

Bitwise sorting is good because it quickly sorts strings with a reasonable limitation of the alphabet (and also depends on the minimum and maximum string lengths). But this is just our case - we only have numbers in the alphabet, that is, the size of the alphabet is 10.

Generally, LSD is recommended for sorting numbers, but I had MSD implementation somewhere in the bins, so I didn’t invent anything and just screwed MSD - in theory, the speed should be about the same as LSD.

And one more thing. Before MSD sorting, I will sort and divide the lines by sorting by counting, as MSD does not take into account the sorting by itself (more precisely, it takes into account, but not in the way I need). In general, it will be fun :)

As for the implementation, at first I added a function, by which I will receive the character code by index from a string. I dashed for this function, since requests can go beyond the string and you must return -1 in this case.

 private static int GetChar(string s, int pos) { if (pos >= 0 && pos < s.Length) return s[pos] - '0'; else return -1; }

Further, the sorting itself:

 public static void MSD(string[] inp, string[] tmp, int start, int end, int d) { if (end <= start) return; int r = 10; int[] count = new int[r + 2]; for (int i = start; i <= end; i++) count[GetChar(inp[i], d) + 2]++; for (int i = 0; i <= r; i++) count[i + 1] += count[i]; for (int i = start; i <= end; i++) tmp[count[GetChar(inp[i], d) + 1]++] = inp[i]; for (int i = start; i <= end; i++) inp[i] = tmp[i - start]; for (int i = 0; i < r; i++) MSD(inp, tmp, start + count[i], start + count[i + 1] - 1, d + 1); }

Sort by counting to take into account the lengths of lines + MSD for each group of lines of the same length:

 public static void CountSortWithMSD(string[] inp, string[] tmp, int start, int end) { if (end <= start) return; int r = 55; int[] count = new int[r + 2]; for (int i = start; i <= end; i++) count[inp[i].Length + 2]++; for (int i = 0; i <= r; i++) count[i + 1] += count[i]; for (int i = start; i <= end; i++) tmp[count[inp[i].Length + 1]++] = inp[i]; for (int i = start; i <= end; i++) inp[i] = tmp[i - start]; for (int i = 0; i < r; i++) MSD(inp, tmp, start + count[i], start + count[i + 1] - 1, 0); }

The main sorting function:

 public static void MSD(string[] inp) { string[] tmp = new string[inp.Length]; CountSortWithMSD(inp, tmp, 0, inp.Length - 1); }

So, we have 3 implementations of sorting. How to make sure they all work the same way? Something like this (the code for the MyClass class will be lower):

 var test = new MyClass(); // тут внутри я сгенерирую начальный массив строк var copy1 = test.LinqOldWay(); var copy2 = test.LinqNewWay(); var copy3 = test.MSDWay(); for (int i = 0; i < copy1.Length; i++) { if (string.CompareOrdinal(copy1[i], copy2[i]) != 0 || string.CompareOrdinal(copy1[i], copy3[i]) != 0) Console.WriteLine($"{copy1[i]} - {copy2[i]} - {copy3[i]}"); }

In fact, here with the same input data, I do 3 sortings and check. that the results are exactly the same (if not the same, I will display the difference on the screen). Let me remind you that the input data is 200k random rows of length from 1 to 50 characters.

Now, how to check which option is faster? Write a benchmark, of course. I will use the BenchmarkDotNet library

 public class MyClass { public MyClass() { initialData = GenerateRandomStringArray(); } string[] initialData; [Benchmark] public string[] LinqOldWay() { var copy = Copy(initialData); copy = copy.OrderBy(x => x, new Comparer()).ToArray(); return copy; } [Benchmark] public string[] LinqNewWay() { var copy = Copy(initialData); copy = copy.OrderBy(x => x, new NewComparer()).ToArray(); return copy; } [Benchmark] public string[] MSDWay() { var copy = Copy(initialData); MSD(copy); return copy; } }

Of course, one cannot say that the test is directly very honest, since the linq variants still create new collections, while MSD uses additional memory, but at the same time sorts the array in place. But I do not think that this will somehow have a strong influence on the working time, considering that we sort the 200k elements. So, we run the tests:

 BenchmarkRunner.Run<MyClass>();

I will not write all the details of the test, the main table looks like this:

  Method | Mean | Error | StdDev | ----------- |------------:|----------:|----------:| LinqOldWay | 2,223.73 ms | 31.922 ms | 29.860 ms | LinqNewWay | 276.29 ms | 5.376 ms | 5.029 ms | MSDWay | 90.88 ms | 1.333 ms | 1.247 ms |

We see that a simple revision of the comparer has accelerated the sorting 8 times. And the bitwise sorting turned out to be a little bit faster - about 3 times faster than the new comparer and 25 times faster than the initial version. I’m sure that if you devote a little more to this time, and instead of MSD, use LSD after breaking the array into subarrays of the same length, then the result may be even better (although the asymptotics should be the same).

UPD On the advice of @PavelMayorov corrected the GetChar function to get rid of sorting by count. The result looks like this:

 private static int GetChar2(string s, int pos, int maxLength) { int d = pos - maxLength + s.Length; if (d >= -0 && d < s.Length) return s[d] - '0'; else return 0; } public static void MSD2(string[] inp) { string[] tmp = new string[inp.Length]; MSD2(inp, tmp, 0, inp.Length - 1, 0); } public static void MSD2(string[] inp, string[] tmp, int start, int end, int d) { if (end <= start) return; if (d > 55) return; int r = 10; int[] count = new int[r + 2]; for (int i = start; i <= end; i++) count[GetChar2(inp[i], d, 55) + 2]++; for (int i = 0; i <= r; i++) count[i + 1] += count[i]; for (int i = start; i <= end; i++) tmp[count[GetChar2(inp[i], d, 55) + 1]++] = inp[i]; for (int i = start; i <= end; i++) inp[i] = tmp[i - start]; for (int i = 0; i < r; i++) MSD2(inp, tmp, start + count[i], start + count[i + 1] - 1, d + 1); }

Test

 [Benchmark] public string[] MSDWay2() { var copy = Copy(initialData); MSD2(copy); return copy; }

Results compared to MSD (test on another computer, because these results are in absolute values different from the previous test)

  Method | Mean | Error | StdDev | ----------- |-----------:|----------:|-----------:| LinqOldWay | 150.878 ms | 4.7384 ms | 13.7471 ms | LinqNewWay | 9.986 ms | 0.1962 ms | 0.5204 ms | MSDWay | 2.245 ms | 0.0449 ms | 0.1084 ms | MSDWay2 | 10.000 ms | 0.2809 ms | 0.7830 ms |

It is seen that the second method is about the same speed as the method with the new comparator. This is because the second method will poll short strings as well as long ones, and the GetChar2 method GetChar2 will be much more than GetChar (for 200 random rows, GetChar GetChar 1745584 times GetChar 1745584 GetChar 1745584 times).

UPD distribution of rows depending on the length of random data. The first column is the length of the string, the second is the number of rows with such a length

UPD

So, on the advice of @PavelMayorov, I changed the data generator.

 public static string GenerateRandomString() { var sb = new StringBuilder(); for (var i = 0; i < 50; i++) sb.Append(random.Next(0, 10)); return sb.ToString().TrimStart('0'); }

Now the distribution is uniform, the distribution of lengths of lines:

The tests are the same, the amount of data is the same. Results:

  Method | Mean | Error | StdDev | ----------- |------------:|-----------:|-----------:| LinqOldWay | 1,821.69 ms | 62.8858 ms | 58.8234 ms | LinqNewWay | 177.76 ms | 3.4509 ms | 3.5438 ms | MSDWay | 71.54 ms | 1.4094 ms | 1.3842 ms | MSDWay2 | 93.57 ms | 0.8887 ms | 0.7878 ms |

Now the updated MSD is very close to the first version - from the fact that there is no skew in the lengths of numbers.

I just read Cormen and only yesterday I read about sorting by counting and one by one, but it did not even occur to me that in this particular case it is applicable.
@AlexanderPetrov Well, here you have the opportunity to test the theory that LSD can do better here than MSD.
Although I thought so at night at 3 o'clock in the morning, and today in the morning I am no longer sure that there will be any strong differences :)
If your GetChar is done like this - int d = pos - maxLength + s.Length; if (d >= -0 && d < s.Length) return s[d] - '0'; else return 0;
int d = pos - maxLength + s.Length; if (d >= -0 && d < s.Length) return s[d] - '0'; else return 0;
- then preliminary splitting along the length of the line is not required (plus the leading zeros will cease to be a problem)
@PavelMayorov yes, but then 1) I will have to do more calculations (since even short strings will be interpreted as long, that is, they will be asked for indices that they do not have), that is, the prog will be slower.
2) In theory, since I break lines into groups of the same length, you can try to get rid of this function altogether, but I didn’t have enough of it yesterday, I did write and

How to properly sort a long sequence of numbers in a string?

1 answer 1

More articles: