How to speed up the process of finding a match in two lists

Question

There are two files. I read the lines of each file into a list. I want to get the unique values from both lists, but searching and comparing takes a very long time (each list has about 300k lines). How can I speed up the search and comparison?

    var lst1 = File.ReadAllLines(@"D:\test\1.csv").ToList();
    var lst2 = File.ReadAllLines(@"D:\\test\2.csv").ToList();
    var rez = lst2.Where(x => !MySequenceContains(lst1, x))
                  .Select(q => string.Join(";", q))
                  .ToList();
    }

    private bool MySequenceContains(List<string> x, string y)
    {
        bool contains = false;
        index++;
        label2.Text = index.ToString();
        foreach (var a in x)
        {
            // ToDo: tweak the string comparison as needed
            if (string.Compare(a.Split(';')[0], y.Split(';')[0], StringComparison.InvariantCultureIgnoreCase) == 0
                && string.Compare(a.Split(';')[14], y.Split(';')[14], StringComparison.InvariantCultureIgnoreCase) == 0)
            {
                contains = true;
                break;
            }
        }
        return contains;
    }

CSV is still a tabular format. What exactly do you mean by unique values?
Do you need unique values among rows, among the cells of some column, or among all cells in the table?
Well, for a start, don't call Split on every comparison: split both lines once at the start of the loop and work with the resulting arrays. That alone will speed up your algorithm several times over.
Please add an example of the input and output data; right now it is not quite clear what is happening or what the result should be.

Nicolas Chabanovsky ♦ 38.2k 54 220 437 · Answer 1 · 2016-08-24T09:10:20

First, your code returns the rows that are in the second file but not in the first. It does not return the rows that are in the first file but not in the second.

Secondly, you call Split inside the loop, i.e. you split the same lines many times.

Thirdly, inside the loop over one file you run a loop over the other (to look up the value in the other file). If both files have N lines, the algorithm is slow: O(N^2).

If you Split all input lines up front and sort both lists by your criteria, you can pick out the non-repeating elements in a single pass over the two sorted lists. The cost is then about 2N for the pass plus 2N log N for the sorting.

Here is an example, I apologize for being non-conservative:

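    // Requires: using System; using System.Collections.Generic; using System.Linq;
    private List<string> my(List<string> x, List<string> y)
    {
        var rez = new List<string>();
        // Split every line once, up front, then sort both lists by the key columns.
        var lst1 = new List<string[]>(x.Select(s => s.Split(';')));
        var lst2 = new List<string[]>(y.Select(s => s.Split(';')));
        lst1.Sort(MyComarer.Comparer);
        lst2.Sort(MyComarer.Comparer);

        // True while the key of lst2[j] has already been matched by a row of lst1.
        var isPresentInX = false;
        int j = 0;
        for (int i = 0; i < lst1.Count; i++)
        {
            if (j >= lst2.Count)
            {
                // lst2 is exhausted: every remaining row of lst1 is unmatched.
                rez.Add(string.Join(";", lst1[i]));
                continue;
            }
            var comp = MyComarer.Comparer.Compare(lst1[i], lst2[j]);
            // Emit the unmatched rows of lst2 that sort before lst1[i].
            while (comp > 0 && j < lst2.Count)
            {
                if (!isPresentInX) rez.Add(string.Join(";", lst2[j]));
                j++;
                if (j < lst2.Count)
                {
                    if (MyComarer.Comparer.Compare(lst2[j], lst2[j - 1]) != 0) isPresentInX = false;
                    comp = MyComarer.Comparer.Compare(lst1[i], lst2[j]);
                }
            }
            if (comp != 0) rez.Add(string.Join(";", lst1[i]));
            else isPresentInX = true;
        }
        // Fix: the rows of lst2 left over once lst1 is exhausted were never emitted.
        while (j < lst2.Count)
        {
            if (!isPresentInX) rez.Add(string.Join(";", lst2[j]));
            j++;
            if (j < lst2.Count && MyComarer.Comparer.Compare(lst2[j], lst2[j - 1]) != 0) isPresentInX = false;
        }
        return rez;
    }

    // Orders rows by column 0, then by column 14, ignoring case.
    private class MyComarer : IComparer<string[]>
    {
        public static MyComarer Comparer { get; } = new MyComarer();

        public int Compare(string[] x, string[] y)
        {
            var res = string.Compare(x[0], y[0], StringComparison.InvariantCultureIgnoreCase);
            if (res == 0) res = string.Compare(x[14], y[14], StringComparison.InvariantCultureIgnoreCase);
            return res;
        }
    }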

Answer 2 · 2016-08-24T07:12:50

    lst1.Concat(lst2).ToList()
        .GroupBy(x => x.Split(';')[0] + x.Split(';')[14])
        .Select(g => g.First())

Leonid Malyshev

Can I get the full code? - Rajab
This is the complete code. You can save the result to a new list: var newlst = lst1.Concat(lst2).ToList().GroupBy(x => x.Split(';')[0] + x.Split(';')[14]).Select(g => g.First()); - Leonid Malyshev
One more small question: how do I add try/catch around Split? - Rajab
.... .GroupBy(x => { try { return x.Split(';')[0] + x.Split(';')[14]; } catch { return ""; } }).Select(g => g.First()); - Leonid Malyshev
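
As a hedged alternative to the try/catch in the comment above, the key selector can check the column count instead of catching the exception. This is only a sketch: `RowKeys`, `KeyOf` and `DistinctByKey` are hypothetical names, and it assumes (like the catch branch above) that rows with fewer than 15 columns should all share one empty key. It also splits each line once and uses a tuple key, so that keys such as "ab" + "c" and "a" + "bc" are not merged by accident.

```csharp
using System.Collections.Generic;
using System.Linq;

static class RowKeys
{
    // Hypothetical helper: returns columns 0 and 14 as the key,
    // or an empty key when the row has fewer than 15 columns.
    public static (string, string) KeyOf(string line)
    {
        var cells = line.Split(';'); // split once per line
        return cells.Length > 14 ? (cells[0], cells[14]) : ("", "");
    }

    // Same idea as the GroupBy answer: keep the first row seen for each key.
    public static List<string> DistinctByKey(IEnumerable<string> lst1, IEnumerable<string> lst2)
    {
        return lst1.Concat(lst2)
                   .GroupBy(line => KeyOf(line))
                   .Select(g => g.First())
                   .ToList();
    }
}
```

Like the original GroupBy, this compares keys case-sensitively and keeps one row per key from the combined lists; it does not drop rows that occur in both files.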

Grundy grundy 58.6k 7 49 93 · Answer 3 · 2016-08-24T08:50:12

Alternative: use HashSet<T> along with its ExceptWith method.

When creating a HashSet<T> object, you can pass it an IEqualityComparer<T> that will be used to compare items.

It might look like this:

    class MySequenceEqualityComparer : IEqualityComparer<string>
    {
        public bool Equals(string x, string y)
        {
            var xSplit = x.Split(';');
            var ySplit = y.Split(';'); // fixed: the original split x a second time
            return string.Compare(xSplit[0], ySplit[0], StringComparison.InvariantCultureIgnoreCase) == 0
                && string.Compare(xSplit[14], ySplit[14], StringComparison.InvariantCultureIgnoreCase) == 0;
        }

        // A constant hash code puts every item in the same bucket, so each lookup
        // falls back to calling Equals against the stored items one by one.
        public int GetHashCode(string obj)
        {
            return -1;
        }
    }

The main code will be reduced to the following:

    var lst1 = File.ReadAllLines(@"D:\test\1.csv");
    var lst2 = new HashSet<string>(File.ReadAllLines(@"D:\\test\2.csv"), new MySequenceEqualityComparer());
    lst2.ExceptWith(lst1);

The result (the rows of the second file that have no match in the first) will be in the variable lst2.
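
Because the comparer above returns a constant hash code, the set still has to call Equals against many stored rows on each lookup. A possible variation, sketched here under assumptions rather than taken from the answer, splits each line once and hashes the two key columns so that lookups stay close to constant time. `KeyColumnsComparer`, `RowDiff` and its two methods are hypothetical names, and the sketch assumes every row has at least 15 columns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: rows are equal when columns 0 and 14 match, ignoring case.
sealed class KeyColumnsComparer : IEqualityComparer<string[]>
{
    private static readonly StringComparer Cmp = StringComparer.InvariantCultureIgnoreCase;

    public bool Equals(string[] x, string[] y)
    {
        return Cmp.Equals(x[0], y[0]) && Cmp.Equals(x[14], y[14]);
    }

    // Hash built from the same two columns and the same comparer as Equals.
    public int GetHashCode(string[] row)
    {
        unchecked
        {
            return (Cmp.GetHashCode(row[0]) * 397) ^ Cmp.GetHashCode(row[14]);
        }
    }
}

static class RowDiff
{
    private static HashSet<string[]> ToSet(IEnumerable<string> lines)
    {
        // Each line is split exactly once here.
        return new HashSet<string[]>(lines.Select(l => l.Split(';')), new KeyColumnsComparer());
    }

    // Rows of the second file whose key does not occur in the first file.
    public static List<string> RowsOnlyInSecond(IEnumerable<string> first, IEnumerable<string> second)
    {
        var set = ToSet(second);
        set.ExceptWith(ToSet(first));
        return set.Select(r => string.Join(";", r)).ToList();
    }

    // Rows whose key occurs in exactly one of the two files.
    public static List<string> RowsInExactlyOneFile(IEnumerable<string> first, IEnumerable<string> second)
    {
        var set = ToSet(first);
        set.SymmetricExceptWith(ToSet(second));
        return set.Select(r => string.Join(";", r)).ToList();
    }
}
```

With the file contents read via File.ReadAllLines as above, RowDiff.RowsOnlyInSecond(lst1, lst2) would play the role of the ExceptWith call, while RowsInExactlyOneFile covers the case where unmatched rows from both files are wanted. Note that building the sets also collapses rows that share a key within one file.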

@iluxa1810, yes, it will, every time Equals is called. - Grundy
