In defense of LINQ, I'll say that you're simply using it wrong.
LINQ has to be applied deliberately, with an understanding of what you're doing and why; then you end up with clear and fast code.
For your case, as mentioned in similar discussions, you should use the Batch function from MoreLinq.
We get the following simple and tidy code:

```csharp
var data = File.ReadLines(@"test.txt");
var result = data.Batch(10, ConvertBatchToKeyGroup).ToList();
```
with the additional function ConvertBatchToKeyGroup that generates one KeyGroup from ten lines:
```csharp
static KeyGroup ConvertBatchToKeyGroup(IEnumerable<string> batch)
{
    var keygroup = new KeyGroup() { used = false };
    var first = true;
    foreach (var s in batch)
    {
        var parts = s.Split('\t');
        if (first)
        {
            keygroup.key = parts[1];
            first = false;
        }
        keygroup.url.Add(parts[0]);
    }
    return keygroup;
}
```
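For completeness: the function above assumes a `KeyGroup` class roughly like the following. This exact shape is my guess, inferred from the fields used above (`key`, `url`, `used`); your actual class may differ.

```csharp
// Hypothetical sketch of KeyGroup, inferred from usage; not the original definition.
class KeyGroup
{
    public string key;
    // Must be initialized here (or in a constructor), otherwise
    // keygroup.url.Add(...) in ConvertBatchToKeyGroup throws NullReferenceException.
    public List<string> url = new List<string>();
    public bool used;
}
```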
I even went so far as to run a benchmark (the first version contained an error, now corrected). I generated a file 10 times the size of yours with the following code:
```csharp
static Random r = new Random();

static string GetRandomString()
{
    var l = r.Next(1, 10);
    var sb = new StringBuilder(l);
    for (int i = 0; i < l; i++)
        sb.Append(GetRandomChar());
    return sb.ToString();
}

static string validChars = "abcdefghijklmnopqrstuvwxyz1234567890";

static char GetRandomChar()
{
    var c = validChars[r.Next(validChars.Length)];
    if (r.Next(2) == 1)
        return char.ToUpper(c);
    else
        return c;
}

static void Generate()
{
    using (var f = File.CreateText(@"test.txt"))
    {
        for (int i = 0; i < 650000; i++)
        {
            for (int j = 0; j < 10; j++)
                f.WriteLine(GetRandomString() + "\t" + GetRandomString());
        }
    }
}
```
On that file, the LINQ solution over five runs (outside the IDE, VS 2015, Release) gave the following results:
```
Test took 00:00:03.0970144
Test took 00:00:03.0980258
Test took 00:00:03.1139645
Test took 00:00:03.0844650
Test took 00:00:03.0531891
```
Test code:
```csharp
static void Main(string[] args)
{
    var sw = Stopwatch.StartNew();
    M3();
    sw.Stop();
    Console.WriteLine($"Test took {sw.Elapsed}");
}

static void M3()
{
    Console.WriteLine("Method #3");
    var data = File.ReadLines(@"test.txt");
    var result = data.Batch(10, ConvertBatchToKeyGroup).ToList();
    Console.WriteLine(result.Count);
}
```
On the same data, under the same conditions, the method from the alternative, non-LINQ solution showed the following results:
```
Test took 00:00:02.9599368
Test took 00:00:02.9154132
Test took 00:00:02.9102364
Test took 00:00:02.8727285
Test took 00:00:02.9275071
```
— that is, comparable to the LINQ version.
So where are your mistakes? There are quite a few:

- `File.ReadAllLines` reads the entire file into memory at once. That's needlessly expensive: it allocates memory for all the lines, most of which you don't actually need at the same time. Better: `File.ReadLines`, which reads the file lazily and doesn't load everything into memory.
- The loop condition calls `data.Count()`, recomputing the element count on every iteration. That's unnecessary. (In your case it happens to be fast only because you already materialized the whole array in memory.)
- `data.Take(i + 10).Skip(i).ToList().Select(x => x.Split('\t')[0]).ToList()` — this is truly awful; it could hardly be worse. `data.Take(i + 10).Skip(i)` walks the whole sequence from 0 to i + 10 just to reach the items you want, so overall you get quadratic running time! Then, for some reason, you materialize a list (an extra allocation, even if only 10 elements) only to immediately enumerate it again. None of that is needed.
As Grundy noted in the comments: the `ToList` inside that line is only the second problem. There also seems to be a bug in `data.Take(i + 10).Skip(i)` — `Skip` should probably be called first, although that's not certain. The main problem is that every iteration rescans the collection from the beginning.
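To make that last point concrete, here is a sketch of a single-pass fix that needs neither MoreLINQ nor indexing. The method name `ReadKeyGroups` is my own illustration, not code from the question; it reuses `ConvertBatchToKeyGroup` from above.

```csharp
// Hypothetical single-pass alternative: instead of data.Take(i + 10).Skip(i),
// which rescans the sequence from the start on every iteration (quadratic),
// consume the enumerator once and cut off groups of 10 lines as we go (linear).
static List<KeyGroup> ReadKeyGroups(string path)
{
    var result = new List<KeyGroup>();
    var batch = new List<string>(10);
    foreach (var line in File.ReadLines(path)) // lazy, single pass over the file
    {
        batch.Add(line);
        if (batch.Count == 10)
        {
            result.Add(ConvertBatchToKeyGroup(batch));
            batch.Clear();
        }
    }
    if (batch.Count > 0) // trailing partial group, if the line count isn't a multiple of 10
        result.Add(ConvertBatchToKeyGroup(batch));
    return result;
}
```

This is essentially what `Batch(10, ...)` does internally, which is why the LINQ and non-LINQ timings above come out so close.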