There are two files. I read the values from both of them into lists. I want to get the unique values from both lists, but searching and comparing takes a very long time (each list has about 300k lines). How can I speed up the search and comparison?

    var lst1 = File.ReadAllLines(@"D:\test\1.csv").ToList();
    var lst2 = File.ReadAllLines(@"D:\\test\2.csv").ToList();
    var rez = lst2.Where(x => !MySequenceContains(lst1, x))
                  .Select(q => string.Join(";", q))
                  .ToList();
    }

    private bool MySequenceContains(List<string> x, string y)
    {
        bool contains = false;
        index++;
        label2.Text = index.ToString();
        foreach (var a in x)
        {
            // ToDo: tweak the string comparison as needed
            if (string.Compare(a.Split(';')[0], y.Split(';')[0], StringComparison.InvariantCultureIgnoreCase) == 0
                && string.Compare(a.Split(';')[14], y.Split(';')[14], StringComparison.InvariantCultureIgnoreCase) == 0)
            {
                contains = true;
                break;
            }
        }
        return contains;
    }
  • CSV is still a tabular format. What exactly do you mean by unique values? Do you need unique values among rows, among cells in some column, or among all cells in the table? - rdorn
  • Well, for a start, don't call Split every time. At the beginning of the loop, get the two arrays with one Split each and work with those arrays from then on; this will speed up your algorithm several times over. - rdorn
  • Can you be more detailed? - Radzhab
  • Another option is to use HashSet with its own comparer. Also add an example of the input and output data; right now it is not quite clear what is happening or what the result should be. - Grundy

3 answers

First, your code returns the lines that are in the second file but not in the first. It does not return the lines that are in the first file but not in the second.
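To make the difference between the two results concrete, here is a minimal sketch that compares whole strings (no column keys yet). The sample values and the `SetDemo` class are made up for illustration; `HashSet<T>.ExceptWith` and `HashSet<T>.SymmetricExceptWith` are the standard .NET methods.

```csharp
using System;
using System.Collections.Generic;

static class SetDemo
{
    // Lines present in `second` but missing from `first`
    // (the one-sided result that the code in the question computes).
    public static HashSet<string> OnlyInSecond(IEnumerable<string> first, IEnumerable<string> second)
    {
        var result = new HashSet<string>(second, StringComparer.InvariantCultureIgnoreCase);
        result.ExceptWith(first);
        return result;
    }

    // Lines present in exactly one of the two inputs (the two-sided result).
    public static HashSet<string> InExactlyOne(IEnumerable<string> first, IEnumerable<string> second)
    {
        var result = new HashSet<string>(first, StringComparer.InvariantCultureIgnoreCase);
        result.SymmetricExceptWith(second);
        return result;
    }

    static void Main()
    {
        var first = new[] { "a", "b", "c" };   // illustrative data
        var second = new[] { "b", "c", "d" };
        Console.WriteLine(string.Join(",", OnlyInSecond(first, second)));
        Console.WriteLine(string.Join(",", InExactlyOne(first, second)));
    }
}
```

If "unique values from both lists" means the two-sided result, the second method is the one that matches; the first reproduces what the original code returns.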

Second, you call Split inside the loop, i.e. you split the same lines many times.

Third, inside the loop over one file you run a loop over the other (searching for the value in the other file). If both files have about N lines, the algorithm needs on the order of N^2 comparisons, which is slow.

If you Split all the input lines up front and sort both lists by your criteria, you can pick out the non-repeating elements in a single pass over the two sorted lists at once. The complexity then comes to about 2N + 2N log N (the sorting).

Here is an example; I apologize for the rough style:

    private List<string> my(List<string> x, List<string> y)
    {
        var rez = new List<string>();
        // Split every line once, up front.
        var lst1 = new List<string[]>(x.Select(s => s.Split(';')));
        var lst2 = new List<string[]>(y.Select(s => s.Split(';')));
        lst1.Sort(MyComarer.Comparer);
        lst2.Sort(MyComarer.Comparer);
        var isPresentInX = false; // true while lst2[j] has a match in lst1
        int j = 0;
        for (int i = 0; i < lst1.Count; i++)
        {
            if (j >= lst2.Count)
            {
                rez.Add(string.Join(";", lst1[i]));
                continue;
            }
            var comp = MyComarer.Comparer.Compare(lst1[i], lst2[j]);
            while (comp > 0 && j < lst2.Count)
            {
                if (!isPresentInX)
                    rez.Add(string.Join(";", lst2[j]));
                j++;
                if (j < lst2.Count)
                {
                    if (MyComarer.Comparer.Compare(lst2[j], lst2[j - 1]) != 0)
                        isPresentInX = false;
                    comp = MyComarer.Comparer.Compare(lst1[i], lst2[j]);
                }
            }
            if (comp != 0)
                rez.Add(string.Join(";", lst1[i]));
            else
                isPresentInX = true;
        }
        // Emit the rows still left in lst2 once lst1 is exhausted;
        // without this step they would be silently dropped.
        while (j < lst2.Count)
        {
            if (!isPresentInX)
                rez.Add(string.Join(";", lst2[j]));
            j++;
            if (j < lst2.Count && MyComarer.Comparer.Compare(lst2[j], lst2[j - 1]) != 0)
                isPresentInX = false;
        }
        return rez;
    }

    private class MyComarer : IComparer<string[]>
    {
        public static MyComarer Comparer { get; } = new MyComarer();

        public int Compare(string[] x, string[] y)
        {
            var res = string.Compare(x[0], y[0], StringComparison.InvariantCultureIgnoreCase);
            if (res == 0)
                res = string.Compare(x[14], y[14], StringComparison.InvariantCultureIgnoreCase);
            return res;
        }
    }

     lst1.Concat(lst2).ToList()
         .GroupBy(x => x.Split(';')[0] + x.Split(';')[14])
         .Select(g => g.First())
    • Can I get the full code? - Rajab
    • This is the complete code. You can save the result to a new list: var newlst = lst1.Concat(lst2).ToList().GroupBy(x => x.Split(';')[0] + x.Split(';')[14]).Select(g => g.First()); - Leonid Malyshev
    • One more small question: how do I add try/catch around Split? - Rajab
    • ... .GroupBy(x => { try { return x.Split(';')[0] + x.Split(';')[14]; } catch { return ""; } }).Select(g => g.First()); - Leonid Malyshev
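As an alternative to wrapping Split in try/catch, the row length can be checked before indexing. The sketch below is only an illustration under stated assumptions: `KeyDemo`, `TryGetKey` and `DistinctByKey` are hypothetical names, malformed rows (fewer than 15 columns) are simply skipped, and keys are compared case-sensitively, as in the GroupBy version above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class KeyDemo
{
    // Builds the key from columns 0 and 14, splitting the line only once.
    // Returns false for rows that are too short instead of throwing.
    public static bool TryGetKey(string line, out (string Col0, string Col14) key)
    {
        var cells = line.Split(';');
        if (cells.Length > 14)
        {
            // A tuple keeps the two columns separate, so "ab" + "c" and "a" + "bc" do not collide.
            key = (cells[0], cells[14]);
            return true;
        }
        key = default;
        return false;
    }

    // Keeps the first line seen for each distinct key.
    // Assumption: rows without a valid key are dropped.
    public static List<string> DistinctByKey(IEnumerable<string> lines)
    {
        var seen = new HashSet<(string, string)>();
        var result = new List<string>();
        foreach (var line in lines)
        {
            if (TryGetKey(line, out var key) && seen.Add(key))
                result.Add(line);
        }
        return result;
    }

    static void Main()
    {
        var lines = new[] { "too;short", string.Join(";", Enumerable.Repeat("x", 15)) };
        Console.WriteLine(DistinctByKey(lines).Count);
    }
}
```

With the catch-and-return-"" variant, all malformed rows share one key and only the first of them survives; here they are filtered out explicitly, which may or may not be what is wanted.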

    Alternative: use HashSet<T> together with its ExceptWith method.

    When creating a HashSet<T> object, you can pass it an IEqualityComparer<T> that will be used to compare items.

    It might look like this:

     class MySequenceEqualityComparer : IEqualityComparer<string>
     {
         public bool Equals(string x, string y)
         {
             var xSplit = x.Split(';');
             var ySplit = y.Split(';');
             return string.Compare(xSplit[0], ySplit[0], StringComparison.InvariantCultureIgnoreCase) == 0
                 && string.Compare(xSplit[14], ySplit[14], StringComparison.InvariantCultureIgnoreCase) == 0;
         }

         public int GetHashCode(string obj)
         {
             // A constant hash is valid, but it puts every item in one bucket,
             // so each set lookup degrades to a linear scan.
             return -1;
         }
     }

    The main code will be reduced to the following:

     var lst1 = File.ReadAllLines(@"D:\test\1.csv");
     var lst2 = new HashSet<string>(File.ReadAllLines(@"D:\\test\2.csv"), new MySequenceEqualityComparer());
     lst2.ExceptWith(lst1);

    The result will be in the variable lst2

    • Won't it split the same string several times? - iluxa1810
    • @iluxa1810, it will, every time Equals is called. - Grundy
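The repeated splitting raised in these comments can be avoided by parsing each line once and comparing the cached columns. The sketch below is one possible way to do that, not the answer author's code: `Row`, `RowKeyComparer` and `RowDemo` are hypothetical names. It also derives the hash code from the two key columns, because a comparer whose GetHashCode returns a constant makes every HashSet lookup a linear scan.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: one CSV line, split exactly once.
sealed class Row
{
    public string Line { get; }
    public string Key0 { get; }
    public string Key14 { get; }

    public Row(string line)
    {
        var cells = line.Split(';');
        Line = line;
        Key0 = cells[0];
        Key14 = cells[14]; // throws for rows with fewer than 15 columns
    }
}

// Compares rows by columns 0 and 14, ignoring case (invariant culture), as in the question.
sealed class RowKeyComparer : IEqualityComparer<Row>
{
    private static readonly StringComparer Cmp = StringComparer.InvariantCultureIgnoreCase;

    public bool Equals(Row x, Row y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return Cmp.Equals(x.Key0, y.Key0) && Cmp.Equals(x.Key14, y.Key14);
    }

    // Uses the same comparer as Equals, so equal rows always get equal hashes.
    public int GetHashCode(Row row)
    {
        unchecked
        {
            return (Cmp.GetHashCode(row.Key0) * 397) ^ Cmp.GetHashCode(row.Key14);
        }
    }
}

static class RowDemo
{
    // Lines of `second` whose key does not occur in `first`.
    public static List<string> OnlyInSecond(IEnumerable<string> first, IEnumerable<string> second)
    {
        var set = new HashSet<Row>(second.Select(l => new Row(l)), new RowKeyComparer());
        set.ExceptWith(first.Select(l => new Row(l)));
        return set.Select(r => r.Line).ToList();
    }

    static void Main()
    {
        string tail = string.Join(";", Enumerable.Repeat("x", 13)); // filler columns 1..13
        var first = new[] { "A;" + tail + ";k" };
        var second = new[] { "a;" + tail + ";K", "b;" + tail + ";k" };
        foreach (var line in OnlyInSecond(first, second))
            Console.WriteLine(line);
    }
}
```

Each line is split once when its `Row` is created, and with a well-distributed hash the whole comparison takes roughly linear expected time instead of N^2.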