In a read-process-save pipeline I found a bottleneck in the processing step: converting HTML to PDF takes the most time.

So far I have managed to split off the processing-and-saving part into separate tasks:

 public class HtmlProcessor
 {
     private readonly string _htmlFolder;
     private readonly string _pdfFolder;

     public HtmlProcessor(string htmlFolder)
     {
         _htmlFolder = htmlFolder;
         // Create the folder that will store the pdf files
         _pdfFolder = Path.Combine(_htmlFolder, "pdfs");
         if (!Directory.Exists(_pdfFolder))
         {
             Directory.CreateDirectory(_pdfFolder);
         }
     }

     public void Process()
     {
         var htmlFileNames = Directory.GetFiles(_htmlFolder);
         foreach (var htmlFileName in htmlFileNames)
         {
             var htmlFileContent = File.ReadAllText(htmlFileName);
             Task.Run(() =>
             {
                 var htmlToPdf = new HtmlToPdf();
                 var pdfDocument = htmlToPdf.ConvertHtmlString(htmlFileContent);
                 pdfDocument.Save($@"{_pdfFolder}\{Path.GetFileNameWithoutExtension(htmlFileName)}.pdf");
             });
         }
     }

     static void Main(string[] args)
     {
         var htmlProcessor = new HtmlProcessor(@"c:\htmlFilesFolder");
         htmlProcessor.Process();
     }
 }

QUESTION

Now I want to separate out the processing so that only that part runs in parallel. I don't know how the logic for this should look; I suspect it will end up being something like a thread inside a thread.

So the question is: what does multithreading look like when you want to parallelize a task in the middle of a larger process?

  • TPL Dataflow? - PetSerAl
  • @PetSerAl I didn't know about TPL Dataflow. Reading up on it now... - Adam
  • At its core it looks like a queue: read -> put into a processing queue, from which items are taken and processed in parallel; the results go into a save queue. A queue can be represented by a BlockingCollection, for example. You could also describe the whole chain with PLINQ, or use Dataflow... in general, there are many options - vitidev
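The queue-based pipeline vitidev describes can be sketched with a BlockingCollection. This is a minimal sketch, not the asker's actual code: it uses an in-memory array of strings in place of reading files, and `ToUpperInvariant` as a trivial stand-in for the HtmlToPdf conversion and save, so that the structure (single producer, several consumers) is visible without the third-party library:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class Pipeline
{
    public static int Run(string[] documents)
    {
        // Processing queue: the reader fills it, the workers drain it.
        var toProcess = new BlockingCollection<string>(boundedCapacity: 100);
        var results = new ConcurrentBag<string>();

        // Producer: a single "reader" task (stands in for reading files from disk).
        var reader = Task.Run(() =>
        {
            foreach (var doc in documents)
                toProcess.Add(doc);
            toProcess.CompleteAdding(); // signal the workers that no more items are coming
        });

        // Consumers: one worker per core, processing queue items in parallel.
        var workers = Enumerable.Range(0, Environment.ProcessorCount)
            .Select(_ => Task.Run(() =>
            {
                foreach (var doc in toProcess.GetConsumingEnumerable())
                    results.Add(doc.ToUpperInvariant()); // stand-in for convert-to-pdf + save
            }))
            .ToArray();

        reader.Wait();
        Task.WaitAll(workers);
        return results.Count;
    }

    static void Main()
    {
        Console.WriteLine(Pipeline.Run(new[] { "<p>a</p>", "<p>b</p>", "<p>c</p>" }));
    }
}
```

`GetConsumingEnumerable` blocks until an item is available and ends cleanly once `CompleteAdding` has been called and the queue is empty, which is what makes the worker loops terminate.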

1 answer

First, try PLINQ:

 var htmlFileNames = Directory.GetFiles(_htmlFolder); // or Directory.EnumerateFiles
 htmlFileNames.AsParallel().ForAll(htmlFileName =>
 {
     var htmlFileContent = File.ReadAllText(htmlFileName);
     var htmlToPdf = new HtmlToPdf();
     var pdfDocument = htmlToPdf.ConvertHtmlString(htmlFileContent);
     pdfDocument.Save($@"{_pdfFolder}\{Path.GetFileNameWithoutExtension(htmlFileName)}.pdf");
 });

PLINQ will handle the parallelization itself; the number of concurrently running tasks will depend on the number of cores in the system.
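If the default degree of parallelism does not suit you, PLINQ lets you cap it explicitly with `WithDegreeOfParallelism`. A minimal sketch, with a trivial arithmetic workload standing in for the HTML-to-PDF conversion:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var inputs = Enumerable.Range(1, 8).ToArray();

        // Cap PLINQ at up to 2 concurrent tasks instead of the default,
        // which is derived from the number of processor cores.
        var squares = inputs
            .AsParallel()
            .WithDegreeOfParallelism(2)
            .Select(x => x * x) // stand-in for the per-file conversion work
            .ToArray();

        Console.WriteLine(squares.Sum()); // execution order varies; the sum does not
    }
}
```

Prints 204 regardless of how the work is scheduled across the two tasks.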


Initializing a thread or task, even one taken from a pool, is relatively expensive. That cost can offset all the benefit of parallelization if processing each individual file is quick.

In that case, you should give each task more work by having it process a whole range of files.

 using System.Collections.Concurrent;

 var htmlFileNames = Directory.GetFiles(_htmlFolder);
 Partitioner.Create(0, htmlFileNames.Length).AsParallel().ForAll(range =>
 {
     for (int i = range.Item1; i < range.Item2; i++)
     {
         string htmlFileName = htmlFileNames[i];
         var htmlFileContent = File.ReadAllText(htmlFileName);
         var htmlToPdf = new HtmlToPdf();
         var pdfDocument = htmlToPdf.ConvertHtmlString(htmlFileContent);
         pdfDocument.Save($@"{_pdfFolder}\{Path.GetFileNameWithoutExtension(htmlFileName)}.pdf");
     }
 });

The Partitioner.Create method creates the ranges of indices that each task processes.

Note that Directory.EnumerateFiles is not applicable here: you need to know the size of the collection up front to create the partitions.


One more note. I don't know what the HtmlToPdf class is. If creating it is resource-intensive and the ConvertHtmlString method can safely be called multiple times on the same instance, it makes sense to move its creation out of the loop.

 Partitioner.Create(0, htmlFileNames.Length).AsParallel().ForAll(range =>
 {
     var htmlToPdf = new HtmlToPdf(); // one converter per partition, reused across files
     for (int i = range.Item1; i < range.Item2; i++)
     {
         string htmlFileName = htmlFileNames[i];
         var htmlFileContent = File.ReadAllText(htmlFileName);
         var pdfDocument = htmlToPdf.ConvertHtmlString(htmlFileContent);
         pdfDocument.Save($@"{_pdfFolder}\{Path.GetFileNameWithoutExtension(htmlFileName)}.pdf");
     }
 });