If you have the capacity — that is, the typical number of simultaneously connected clients is significantly smaller than the number of cores — then you can, for example, do the following.
The thread that handles the socket reads the data and puts it into a task queue. Several other threads pick tasks from this queue, process the data, and put the results into a results queue. The first thread drains the results queue and sends everything back to the client. All of this must be implemented carefully, or you will run into problems with locks, data races, and so on. Use semaphores to signal that new data has arrived. If they turn out to be a bottleneck, they can be replaced with a manual implementation using shared variables (difficult and tedious, but I have heard of a success story like that from a colleague).
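The two-queue design above can be sketched roughly as follows. This is a minimal illustration, not a production implementation: the worker count, the sentinel-based shutdown, and the `payload.upper()` stand-in for real processing are all assumptions made for the example, and Python's `queue.Queue` handles the locking and signaling that the answer warns about.

```python
import threading
import queue

NUM_WORKERS = 4          # assumption: pool size chosen for illustration
SENTINEL = object()      # tells workers there is no more work

def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Pull tasks off the shared queue, process them, push results."""
    while True:
        item = tasks.get()
        if item is SENTINEL:
            tasks.put(SENTINEL)   # re-post so the other workers see it too
            break
        client_id, payload = item
        # stand-in for the real per-client processing step
        results.put((client_id, payload.upper()))

def run(payloads):
    tasks: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    workers = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    # the socket thread's role: enqueue everything that was read
    for client_id, payload in payloads:
        tasks.put((client_id, payload))
    tasks.put(SENTINEL)
    for w in workers:
        w.join()
    # drain the results queue and "send" each result back to its client
    out = []
    while not results.empty():
        out.append(results.get())
    return out
```

For instance, `run([(1, "ping"), (2, "pong")])` returns both processed results, though the order across workers is not guaranteed.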
Or, as a more fanciful option, you can hand out tasks to the handler threads in round-robin order. Each thread that works with a socket knows its 4 handler threads (the 4 follows from 10,000/3000) and assigns tasks to them in turn.
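A rough sketch of that round-robin variant, under the assumption of 4 handler threads each owning a private queue: the socket thread deals tasks out in turn, so the handlers never contend on a single shared queue. The class and method names here are invented for illustration.

```python
import itertools
import queue

NUM_HANDLERS = 4  # assumption: 4 handlers, per the 10,000/3000 estimate

class RoundRobinDispatcher:
    """Socket-thread side: one private queue per handler thread."""

    def __init__(self, n: int = NUM_HANDLERS):
        self.queues = [queue.Queue() for _ in range(n)]
        self._next = itertools.cycle(range(n))  # 0, 1, 2, 3, 0, 1, ...

    def dispatch(self, task) -> int:
        """Put the task on the next handler's queue; return that index."""
        i = next(self._next)
        self.queues[i].put(task)
        return i
```

Each handler thread would then block on `queues[i].get()` for its own queue only. The trade-off versus the single shared queue is less contention at the cost of possible load imbalance if task sizes vary.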
But to begin with, I would recommend thinking about optimizing the processing itself. It would be much more pleasant to keep all the processing of one client's data on a single core.