The input file has roughly 650 thousand lines. Every ten lines I fill a KeyGroup object and add it to the lst list. It works, but it is painfully slow. I assumed that once the data is loaded into RAM it would run fast. Which part of this code can be optimized?

    var data = File.ReadAllLines("test.txt");
    var lst = new List<KeyGroup>();
    for (int i = 0; i < data.Count(); i += 10)
    {
        KeyGroup key = new KeyGroup();
        key.key = data[i].Split('\t')[1];
        key.url = data.Take(i + 10).Skip(i).ToList().Select(x => x.Split('\t')[0]).ToList();
        key.used = false;
        lst.Add(key);
    }

    public class KeyGroup
    {
        public string key { get; set; }
        public bool used { get; set; }
        public List<string> url { get; set; }

        public KeyGroup()
        {
            url = new List<string>();
        }
    }
  • Inside the loop you can drop the ToList that sits in the middle of the chain. The second problem is data.Take(i + 10).Skip(i): Skip should probably come first, though I'm not certain; either way the chain walks the collection from the beginning on every iteration, and that is the main problem - Grundy
  • Why didn't you use the solution from this answer? - Grundy
  • And what exactly do you need to do? - VladD
  • Can you post some sample data from the source file? - Sergey Rufanov
  • How slow is "very slow"? What is the file size? Is it on a local drive or on a network share? If local, is it an HDD or an SSD? - yeputons

2 answers

My advice to you: unless you have bosses standing over you shouting "LINQ is what the cool kids use, it must be everywhere!", do not use LINQ in this code:

    const string FILE_NAME = @"test.txt";
    const char SEPARATOR = '\t';

    var result = new List<KeyGroup>();
    string[] line;

    using (var reader = new StreamReader(FILE_NAME))
    {
        while (!reader.EndOfStream)
        {
            var group = new KeyGroup();
            line = reader.ReadLine().Split(SEPARATOR);
            group.key = line[1];
            group.used = false;
            group.url = new List<string>();
            group.url.Add(line[0]);
            for (int i = 0; i < 9; i++)
            {
                line = reader.ReadLine().Split(SEPARATOR);
                group.url.Add(line[0]);
            }
            result.Add(group);
        }
    }

On a test dataset (1 million lines, i.e. 100 thousand groups, generated following your sample; a text file of about 80 megabytes) this code finishes in less than 0.1 seconds, whereas your version took about 4 minutes.

Most of your time is spent on the ToList() call in this line:

 key.url = data.Take(i + 10).Skip(i).ToList().Select(x => x.Split('\t')[0]).ToList(); 

But I have neither the time nor the desire to dig through the internals of .NET to find out exactly why it runs for so long. Your code as written is very hard to maintain, and I see absolutely no point in trying to use LINQ here.
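For what it's worth, here is a minimal sketch of that line with the intermediate ToList dropped and Skip applied before Take (my assumption of the intended slice; it does not remove the repeated re-enumeration, so the loop is still slow overall):

    // Sketch only: Skip before Take, no intermediate ToList.
    // Still expensive per group, because Skip(i) enumerates the array
    // from the beginning on every pass through the loop.
    key.url = data.Skip(i).Take(10)
                  .Select(x => x.Split('\t')[0])
                  .ToList();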

LINQ has no built-in way to group elements by their index - that is the main difficulty. You can, of course, attach an index artificially and group by it:

    var data = File.ReadAllLines(FILE_NAME)
        .Select((line, index) => new { line, index })
        .GroupBy(element => element.index / 10)
        .Select(group => group.Select(element => element.line))
        .Select(group => new KeyGroup()
        {
            key = group.First().Split(SEPARATOR)[1],
            url = new List<string>(group.Select(element => element.Split(SEPARATOR)[0])),
            used = false
        });

But you must agree: even though its speed is comparable, this is far harder to read and maintain than the plain version proposed above. LINQ only creates unnecessary difficulties here without giving any advantage.

  • The whole problem is Take(i + 10).Skip(i) - it walks the sequence from the beginning every time - Grundy
  • @Grundy, yes, but all the same - without LINQ the solution to this problem is simpler and more obvious) - Sergey Rufanov

In defense of LINQ, I will say that you are simply cooking it wrong.

LINQ has to be applied deliberately, with an understanding of what you are doing and why; then you end up with clear and fast code.

For your case, as mentioned in similar discussions, you should use the Batch function from MoreLinq.

We get the following simple and nice code:

    var data = File.ReadLines(@"test.txt");
    var result = data.Batch(10, ConvertBatchToKeyGroup).ToList();

with the additional function ConvertBatchToKeyGroup that generates one KeyGroup from ten lines:

    static KeyGroup ConvertBatchToKeyGroup(IEnumerable<string> batch)
    {
        var keygroup = new KeyGroup() { used = false };
        var first = true;
        foreach (var s in batch)
        {
            var parts = s.Split('\t');
            if (first)
            {
                keygroup.key = parts[1];
                first = false;
            }
            keygroup.url.Add(parts[0]);
        }
        return keygroup;
    }
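One assumption behind this code: the Batch overload that takes a result selector is an extension method from the MoreLINQ library, so the project needs that package referenced and the namespace imported:

    // Assumes the MoreLINQ NuGet package ("morelinq") is installed in the project.
    using MoreLinq;   // brings in the Batch(size, resultSelector) extension used above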

I even took the trouble to run a benchmark (the first version of this answer had an error, now corrected). I generated a file ten times the size of yours with the following code:

    static Random r = new Random();

    static string GetRandomString()
    {
        var l = r.Next(1, 10);
        var sb = new StringBuilder(l);
        for (int i = 0; i < l; i++)
            sb.Append(GetRandomChar());
        return sb.ToString();
    }

    static string validChars = "abcdefghijklmnopqrstuvwxyz1234567890";

    static char GetRandomChar()
    {
        var c = validChars[r.Next(validChars.Length)];
        if (r.Next(2) == 1)
            return char.ToUpper(c);
        else
            return c;
    }

    static void Generate()
    {
        using (var f = File.CreateText(@"test.txt"))
        {
            for (int i = 0; i < 650000; i++)
            {
                for (int j = 0; j < 10; j++)
                    f.WriteLine(GetRandomString() + "\t" + GetRandomString());
            }
        }
    }

Five runs of the LINQ solution on this file (outside the IDE, VS 2015, Release) gave the following results:

    Test took 00:00:03.0970144
    Test took 00:00:03.0980258
    Test took 00:00:03.1139645
    Test took 00:00:03.0844650
    Test took 00:00:03.0531891

Test code:

    static void Main(string[] args)
    {
        var sw = Stopwatch.StartNew();
        M3();
        sw.Stop();
        Console.WriteLine($"Test took {sw.Elapsed}");
    }

    static void M3()
    {
        Console.WriteLine("Method #3");
        var data = File.ReadLines(@"test.txt");
        var result = data.Batch(10, ConvertBatchToKeyGroup).ToList();
        Console.WriteLine(result.Count);
    }

On the same data, under the same conditions, the method from the alternative (non-LINQ) answer showed the following results:

    Test took 00:00:02.9599368
    Test took 00:00:02.9154132
    Test took 00:00:02.9102364
    Test took 00:00:02.8727285
    Test took 00:00:02.9275071

- that is, comparable to LINQ.


So where are your mistakes? There are plenty of them.

  1. File.ReadAllLines reads the entire file into memory. That is needlessly expensive: it allocates memory for every line at once, most of which you do not actually need to hold. Better: File.ReadLines, which reads the file lazily and does not keep it all in memory.
  2. Looping up to data.Count() - recomputing the count on every iteration is unnecessary. (In fairness, it is fast here, since you already read the whole array into memory anyway.)
  3. data.Take(i + 10).Skip(i).ToList().Select(x => x.Split('\t')[0]).ToList(); - an absolute horror, it cannot get worse. data.Take(i + 10).Skip(i) walks the whole list from 0 to i + 10 just to find the right elements, so the overall processing time is quadratic! Then for some reason you materialize the list (an extra allocation, even if only 10 elements) only to enumerate it again immediately. None of this is needed; a sketch of the fix follows this list.
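As a rough sketch of how the original loop might look with the quadratic Take/Skip chain removed (keeping the question's eager ReadAllLines for the sake of the comparison; the streaming versions above avoid even that allocation):

    // Sketch under the question's assumptions: index straight into the
    // already-loaded array instead of re-enumerating it with Take/Skip.
    var data = File.ReadAllLines("test.txt");
    var lst = new List<KeyGroup>(data.Length / 10);

    for (int i = 0; i < data.Length; i += 10)
    {
        var key = new KeyGroup { key = data[i].Split('\t')[1], used = false };

        // take up to 10 consecutive lines by index - no Skip/Take, no extra lists
        for (int j = i; j < i + 10 && j < data.Length; j++)
            key.url.Add(data[j].Split('\t')[0]);

        lst.Add(key);
    }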