The strings are of different lengths and built from different characters, so a counting-sort approach is out. I implemented an external merge sort: I split the file into 100 MB parts, sort each part, and then merge them. The whole thing takes me about 10 minutes, while someone else's tool sorts the same volume and removes duplicates from it in 6 minutes.

I attach my code.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class Sorting
{
    public static void ExternalMergeSort(string originalFile, string newFile)
    {
        // Split the file into ~100 MB parts
        string dir = SplitFile(originalFile);

        // Sort each part
        foreach (string file in Directory.GetFiles(dir, "*.txt", SearchOption.AllDirectories))
            InternalSort(file);

        // Multithreaded variant - not enough memory
        //List<Task> tasks = new List<Task>();
        //foreach (string file in Directory.GetFiles(dir, "*.txt", SearchOption.AllDirectories))
        //{
        //    Task t = new Task(() => InternalSort(file));
        //    tasks.Add(t);
        //    t.Start();
        //}
        //Task.WaitAll(tasks.ToArray());

        // Merge the parts (the result is written to <dir>/Result.txt; newFile is not used yet)
        MergeFilesInDirectory(dir);
    }

    /// <summary>
    /// Splits a file into parts of the specified size
    /// </summary>
    /// <param name="originalFile">File to split</param>
    /// <param name="maxFileSize">Maximum size of a part. Default = 100 MB</param>
    /// <returns>Path to the directory containing the parts</returns>
    private static string SplitFile(string originalFile, double maxFileSize = 1e+8)
    {
        TimeWatcher.Start("SplitFile");

        var lines = File.ReadLines(originalFile);
        string dir = Path.GetDirectoryName(originalFile);
        string extDir = dir + "/" + Path.GetFileNameWithoutExtension(originalFile);
        if (!Directory.Exists(extDir))
            Directory.CreateDirectory(extDir);

        string partPath = extDir + "/" + Guid.NewGuid().ToString() + Path.GetExtension(originalFile);
        var outputFile = new StreamWriter(File.OpenWrite(partPath));
        foreach (string line in lines)
        {
            outputFile.WriteLine(line);
            if (outputFile.BaseStream.Position >= maxFileSize)
            {
                outputFile.Close();
                partPath = extDir + "/" + Guid.NewGuid().ToString() + Path.GetExtension(originalFile);
                outputFile = new StreamWriter(File.OpenWrite(partPath));
            }
        }
        // Close the last, partially filled part as well
        outputFile.Close();

        TimeWatcher.Show("SplitFile", true);
        return extDir;
    }

    /// <summary>
    /// In-memory sort of a single file
    /// </summary>
    /// <param name="originalFile">File to sort</param>
    public static void InternalSort(string originalFile)
    {
        TimeWatcher.Start("InternalSort");
        List<string> list = File.ReadAllLines(originalFile).ToList();
        list.Sort();
        File.WriteAllLines(originalFile, list);
        TimeWatcher.Show("InternalSort", true);
    }

    /// <summary>
    /// Merges the files in the specified directory
    /// </summary>
    /// <param name="dir">Directory containing the files</param>
    private static void MergeFilesInDirectory(string dir)
    {
        TimeWatcher.Start("MergeFilesInDirectory");

        // Open all files at once and build the "layer" of current head lines
        List<StreamReader> readers = new List<StreamReader>();
        List<string> layer = new List<string>();
        foreach (string file in Directory.GetFiles(dir, "*.txt", SearchOption.AllDirectories))
        {
            var reader = new StreamReader(File.OpenRead(file));
            readers.Add(reader);
            layer.Add(reader.ReadLine());
        }

        // Create the result file
        var writer = new StreamWriter(File.OpenWrite(dir + "/Result.txt"));
        while (layer.Any(x => x != null))
        {
            // Take the minimum only among the lines that are still available;
            // exhausted readers contribute null and must be skipped
            string min = layer.Where(x => x != null).Min();
            int id = layer.IndexOf(min);
            layer[id] = readers[id].ReadLine();
            writer.WriteLine(min);
        }
        writer.Close();

        foreach (var reader in readers)
            reader.Close();

        foreach (string file in Directory.GetFiles(dir, "*.txt", SearchOption.AllDirectories))
        {
            if (Path.GetFileNameWithoutExtension(file) != "Result")
                File.Delete(file);
        }

        TimeWatcher.Show("MergeFilesInDirectory", true);
    }
}
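
For reference, a minimal call with hypothetical paths looks like this (in the version above the merged output ends up in Result.txt inside the parts directory, and the newFile argument is not used yet):

    // hypothetical paths, for illustration only
    Sorting.ExternalMergeSort(@"C:\data\strings.txt", @"C:\data\strings_sorted.txt");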

I found a way to raise the maximum object size (the .NET limit is 2 GB per object). Thanks to this, the entire file can safely be read into memory and cut into pieces there. Reading 2 GB (from an SSD into DDR3 RAM) took 0.7 seconds.

    <configuration>
      <runtime>
        <gcAllowVeryLargeObjects enabled="true" />
      </runtime>
      ..

Details here: https://social.msdn.microsoft.com/Forums/en-RU/8d5880fe-108e-47d2-bbd7-4669e0aec1ec/-2-?forum=programminglanguageru
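
As an illustration of the "read the whole file at once, then tear off the pieces in memory" idea, here is a minimal sketch only. It assumes the file and the resulting array fit in available RAM (on an x64 build, with the setting above if the arrays get very large) and, for brevity, splits by a fixed number of lines rather than by byte size; the names are made up.

    using System;
    using System.IO;

    static class WholeFileSplit
    {
        // Sketch: read all lines in one sequential pass, then slice the
        // in-memory array into parts (here by line count, for brevity).
        public static void Run(string originalFile, string partsDir, int linesPerPart)
        {
            Directory.CreateDirectory(partsDir);

            string[] allLines = File.ReadAllLines(originalFile);    // one big read from disk

            for (int offset = 0; offset < allLines.Length; offset += linesPerPart)
            {
                int count = Math.Min(linesPerPart, allLines.Length - offset);
                string[] part = new string[count];
                Array.Copy(allLines, offset, part, 0, count);       // "tear off" a piece in memory

                File.WriteAllLines(Path.Combine(partsDir, Guid.NewGuid() + ".txt"), part);
            }
        }
    }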

  • 1) Read with a buffer 2) Write with a buffer 3) Sort in O(n log n) - tym32167
  • Could you elaborate?) - Alexander Lee
  • Well, at a minimum, I see that you read lines one at a time and also write them one at a time. To read from and write to a hard disk efficiently, you need to reduce the number of read/write operations, and you can do that with an in-memory buffer. I already wrote about it: one, two, three, or see my question here, where I generated 16 GB and sorted it in 25 minutes - tym32167
  • @tym32167, done. Splitting the file got 3 times faster (from 30 seconds down to 10). Thanks! Most of the time is still spent on the sorting, though. I read your question but could not figure out where to dig. Could you point me in the right direction?) - Alexander Lee
  • In the SplitFile method you read 100 MB and write it to a new file; then in InternalSort you re-read the same data. Change the algorithm: read 100 MB from the large file into a list, sort it right away, and write it out. After that, all that remains is the merge. - Alexander Petrov (a sketch of this idea follows the comments)
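
Here is a rough sketch of the suggestion in the last comment, combined with buffered writes: each ~100 MB portion is sorted while it is still in memory and written out in a single pass, so the parts never need to be re-read before sorting. This is not code from the question; the names SplitAndSort and FlushSortedPart and the size estimate are assumptions.

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class SplitAndSort
    {
        // Read ~maxPartSize bytes worth of lines, sort them in memory,
        // and write the already-sorted part in one buffered pass.
        public static string Run(string originalFile, long maxPartSize = 100_000_000)
        {
            string extDir = Path.Combine(Path.GetDirectoryName(originalFile),
                                         Path.GetFileNameWithoutExtension(originalFile));
            Directory.CreateDirectory(extDir);

            var part = new List<string>();
            long partBytes = 0;

            foreach (string line in File.ReadLines(originalFile))
            {
                part.Add(line);
                partBytes += line.Length + 2;              // rough estimate: chars + line break
                if (partBytes >= maxPartSize)
                {
                    FlushSortedPart(extDir, part);
                    partBytes = 0;
                }
            }
            if (part.Count > 0)
                FlushSortedPart(extDir, part);

            return extDir;
        }

        private static void FlushSortedPart(string extDir, List<string> part)
        {
            part.Sort();                                   // sort while the part is still in memory
            string path = Path.Combine(extDir, Guid.NewGuid() + ".txt");
            File.WriteAllLines(path, part);                // single buffered write of the whole part
            part.Clear();
        }
    }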

1 answer

Read the small file into memory as a single byte array. Split the array into pieces on the 0x0A separator (keep the separator inside each piece). Sort the resulting array of pieces (you will need to write your own Comparer). Then open a BufferedStream to read the large file and another one to write the output file; I use a StreamReader here for convenience. Read a line, get the line's bytes (in Unicode), and do a binary search for that byte sequence in the sorted array of pieces from the small file. If it is not found, write the line to the output file. You can also save the sorted array back to the file it came from and somehow remember that the file is already sorted, so that you do not sort it again the next time you use it.

Important: the build must target x64, and the file that is read in its entirety must not be larger than 2 GB, otherwise File.ReadAllBytes will throw an exception.
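
Below is a minimal sketch of this approach rather than the actual implementation. It assumes UTF-8 text with '\n' (0x0A) line endings and a small file that ends with 0x0A; the names ByteArrayComparer and RemoveKnownLines are made up for illustration.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    // Lexicographic comparison of raw byte arrays, used for both sorting and binary search
    class ByteArrayComparer : IComparer<byte[]>
    {
        public int Compare(byte[] a, byte[] b)
        {
            int len = Math.Min(a.Length, b.Length);
            for (int i = 0; i < len; i++)
                if (a[i] != b[i]) return a[i].CompareTo(b[i]);
            return a.Length.CompareTo(b.Length);
        }
    }

    static class Dedup
    {
        // Writes to outFile only those lines of bigFile that are NOT present in smallFile.
        // Assumes UTF-8, '\n' line endings, and a smallFile under 2 GB that ends with 0x0A.
        public static void RemoveKnownLines(string smallFile, string bigFile, string outFile)
        {
            // 1. Read the small file as one byte array and cut it into pieces
            //    on the 0x0A separator, keeping the separator inside each piece.
            byte[] all = File.ReadAllBytes(smallFile);
            var pieces = new List<byte[]>();
            int start = 0;
            for (int i = 0; i < all.Length; i++)
            {
                if (all[i] == 0x0A)
                {
                    byte[] piece = new byte[i - start + 1];
                    Array.Copy(all, start, piece, 0, piece.Length);
                    pieces.Add(piece);
                    start = i + 1;
                }
            }

            // 2. Sort the pieces with the custom comparer so that BinarySearch works.
            var comparer = new ByteArrayComparer();
            byte[][] sorted = pieces.ToArray();
            Array.Sort(sorted, comparer);

            // 3. Stream the big file and keep only the lines that are not found.
            using (var reader = new StreamReader(new BufferedStream(File.OpenRead(bigFile))))
            using (var writer = new StreamWriter(new BufferedStream(File.Create(outFile))))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    byte[] key = Encoding.UTF8.GetBytes(line + "\n");   // re-attach the separator
                    if (Array.BinarySearch(sorted, key, comparer) < 0)
                        writer.WriteLine(line);
                }
            }
        }
    }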