The task: the input is one or more Excel files. I convert every cell to text and write everything into a new text file, which I also create programmatically.

There is a single-threaded version of this application: a console application that uses the NPOI library to work with Excel files. The problem is that when the program processes a large number of files, it runs for a very long time. So I want to speed it up by making a multithreaded version. What is the best way to implement multithreading here?

I have never worked with multithreading before. While searching for possible solutions, I found PLINQ and Parallel.ForEach. Which one is better to use?

  • Why don't you try both options and tell us which is better? - VladD

2 answers

If you try to read or write many files from different threads, you won't gain much, because everything will ultimately be limited by the disk. Therefore, I would recommend the following scheme:

  • Read and write files using asynchronous operations (look at asynchronous methods such as ReadAsync() and at async/await). If the library for reading Excel files does not support asynchronous operations, read from multiple threads, but experiment to find the number of threads that gives the best speed.
  • Converting cells into text can be done in multiple threads (Parallel.ForEach or PLINQ).
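
As a rough illustration of the "experiment with the number of threads" advice, here is a minimal sketch. It assumes one output text file per input file, and the `convertToText` delegate is a hypothetical stand-in for the existing NPOI code that opens a workbook and returns its cells as text. The names `BatchConverter` and `ConvertAll` are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class BatchConverter
{
    // convertToText is a hypothetical stand-in for the NPOI-based code that
    // opens one workbook and returns all of its cells as text.
    public static void ConvertAll(
        IEnumerable<string> inputFiles,
        string outputDirectory,
        Func<string, string> convertToText,
        int maxThreads)
    {
        Directory.CreateDirectory(outputDirectory);

        // Cap the number of files processed at once; since the disk is the
        // likely bottleneck, the best value has to be found by measurement.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        Parallel.ForEach(inputFiles, options, inputFile =>
        {
            string text = convertToText(inputFile);

            // Assumption: every input gets its own output file, so no two
            // threads ever write to the same file.
            string outputFile = Path.Combine(
                outputDirectory,
                Path.GetFileNameWithoutExtension(inputFile) + ".txt");
            File.WriteAllText(outputFile, text);
        });
    }
}
```

Running this with `maxThreads` set to 1, 2, 4 and so on, and timing each run on real data, is one way to find the point where extra threads stop helping. Input files that share the same base name would overwrite each other's output in this sketch.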

Parallel.ForEach and PLINQ are likely to give roughly the same results (though nothing prevents you from measuring both options). Which one to choose depends primarily on the functionality you need:

  • if the elements are processed independently of each other and the processing order does not matter, choose Parallel.ForEach
  • if the order of the elements matters, choose PLINQ (see AsOrdered())
  • if you need streaming (lazy) processing, choose PLINQ
  • if you want to process two collections together, choose PLINQ
  • if you need thread-local state, choose Parallel.ForEach (it has built-in support)
  • if you need to stop processing early, choose Parallel.ForEach (see ParallelLoopState.Stop() and ParallelLoopState.Break())
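
Two of the cases above can be sketched as follows. This is only an illustration: `ChoiceDemo`, `ConvertInOrder` and `TotalLength` are made-up names, and trimming and upper-casing a string stands in for whatever the real cell-to-text conversion does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ChoiceDemo
{
    // PLINQ: transform in parallel, but return results in the source order.
    public static List<string> ConvertInOrder(IEnumerable<string> cells)
    {
        return cells.AsParallel()
                    .AsOrdered()
                    .Select(cell => cell.Trim().ToUpperInvariant())
                    .ToList();
    }

    // Parallel.ForEach: each worker keeps its own running total
    // (thread-local state), which is merged into the shared total
    // once per worker instead of once per element.
    public static long TotalLength(IEnumerable<string> cells)
    {
        long total = 0;
        Parallel.ForEach(
            cells,
            () => 0L,                                         // initial local value
            (cell, loopState, local) => local + cell.Length,  // per-element work
            local => Interlocked.Add(ref total, local));      // merge local totals
        return total;
    }
}
```

Without AsOrdered(), PLINQ is free to return the converted cells in any order, which matters if the output text has to follow the order of the cells in the sheet.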

Read more in the document "When Should I Use Parallel.ForEach? When Should I Use PLINQ?" .


Also, with the "read-transform-write" scheme you may need the producer/consumer pattern; you will have two such pairs (read-transform and transform-write). One of the advantages is that if the stages of the pipeline run at different speeds, you can tune the performance of each stage separately, thereby optimizing the performance (both speed and memory consumption) of the whole pipeline.
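
A minimal sketch of such a pipeline, assuming BlockingCollection<T> as the queue between stages (TPL Dataflow or System.Threading.Channels would be alternatives). The `read`, `transform` and `write` delegates are hypothetical placeholders for the real file reading, cell conversion and text writing; `Pipeline.Run` is a made-up name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Pipeline
{
    public static void Run(
        IEnumerable<string> inputs,
        Func<string, string> read,
        Func<string, string> transform,
        Action<string> write,
        int queueCapacity)
    {
        // Bounded queues: a fast stage blocks when its output queue is full,
        // which keeps memory consumption under control.
        using var rawItems = new BlockingCollection<string>(queueCapacity);
        using var textItems = new BlockingCollection<string>(queueCapacity);

        // Stage 1: producer for the read-transform pair.
        Task reader = Task.Run(() =>
        {
            try { foreach (string input in inputs) rawItems.Add(read(input)); }
            finally { rawItems.CompleteAdding(); }
        });

        // Stage 2: consumer of stage 1 and producer for the transform-write pair.
        Task transformer = Task.Run(() =>
        {
            try
            {
                foreach (string raw in rawItems.GetConsumingEnumerable())
                    textItems.Add(transform(raw));
            }
            finally { textItems.CompleteAdding(); }
        });

        // Stage 3: consumer for the transform-write pair.
        Task writer = Task.Run(() =>
        {
            foreach (string text in textItems.GetConsumingEnumerable())
                write(text);
        });

        Task.WaitAll(reader, transformer, writer);
    }
}
```

With one task per stage, items reach the writer in input order. A slow stage could be scaled by starting several tasks for it, at the cost of that ordering. Error handling here is deliberately minimal: a real version would also need cancellation so that a failure in one stage unblocks the others.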

    Multithreading means splitting one big task into several small ones that run in parallel. PLINQ will not help you here. You can make your program multithreaded by creating a thread pool and parsing each Excel file in a separate thread; in the end you will get a speedup proportional to the number of processors.

    Instead of NPOI, consider the standard Open XML SDK for Excel files; it will be faster than NPOI.