I have a Java application whose main job is automated work on the Internet from many accounts at the same time (NOT spam or any other malicious activity). It uses Apache HTTP Client 4.5.2 for networking. The application goes online through different HTTP proxies (two accounts per proxy) with login/password authorization, and it also uses cookies.

Initially, because of my very poor knowledge of Java, of the development environment and, most importantly, of English, the multithreaded part of the application was built as follows:

I select an account from the CSV file (a primitive substitute for a database) and create a separate thread for it. In that thread I create an object of class Task, which implements the main functionality of the program (parsing the received pages, making decisions based on the data, and updating the CSV file). That object, in turn, creates an object of the Network class, which does all the work with the network (the data it obtains is used in Task).

Thus, an instance of the HTTP Client is created for each thread. When I did this, I already roughly knew that HttpClient has some peculiarities in a multithreaded environment, but it seemed to me that creating a separate instance for each thread was the best solution to this problem.

And this code even worked tolerably well while there were not many threads. Either by some miracle, or I simply did not notice the errors because there were few of them.

But when there were many threads, the code stopped working correctly, which is quite natural. Specifically, all the threads start when the program launches, but most of them quickly end up waiting on a lock created by HttpClient and do no work (roughly speaking, the program hangs).

Naturally, I turned to Google and the documentation, and almost immediately realized that you should not create a separate instance of the HTTP Client for each thread. Instead, you create one client for the entire application and use PoolingHttpClientConnectionManager, while each thread gets its own HttpContext (if I am not mistaken, it stores the cookies and, in my case, the proxy authorization; I use preemptive authorization).

You can read more about this in the HTTP Client documentation. There is also an example there, which I recommend looking at, since it makes what I ask next easier to follow.

In general, everything is clear. But how exactly do I do this in my application, given the architecture described above? It was designed that way on purpose, so that it would be at least somewhat correct from the OOP point of view, with the idea of reusing separate classes in other applications.

Of course, you can put everything into one class: choosing an account (and creating a thread if it fits), the main logic of the application, and the work with the network, creating the HTTP Client once. But the result would be a huge class that is very inconvenient to read and edit and almost impossible to reuse in other applications. It seems to me that this is categorically wrong from the OOP point of view and should not be done.

How should this problem be solved correctly, then?

While writing my question, I got the idea that I can create the HTTP Client once, in the class where the account is selected, and then simply pass it like this:

 public GetThread(CloseableHttpClient httpClient, HttpGet httpget) 

to each thread, and from there pass it on in the same way to the Network class, which then works with it.

Then, in theory, the structure of the program should be preserved, and there will be no problems with multithreading in the HTTP Client.
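
Roughly, I picture the wiring like this. It is only a sketch under assumptions: `SharedClient` and `AccountContext` are hypothetical stand-ins (they are not HttpClient classes) for the single shared `CloseableHttpClient` and the per-thread `HttpContext`, so that the example runs without any library. `Launcher`, the fake page text and the `example.invalid` URL are made up as well. The point is only that the client is created once and handed down through the constructors, while every account keeps its own context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the one shared, thread-safe HTTP client.
interface SharedClient {
    String execute(String url, AccountContext context);
}

// Hypothetical stand-in for a per-thread context: cookies and proxy login of ONE account.
final class AccountContext {
    final String proxyLogin;
    final Map<String, String> cookies = new ConcurrentHashMap<>();

    AccountContext(String proxyLogin) {
        this.proxyLogin = proxyLogin;
    }
}

// All network work; receives the shared client instead of creating its own.
final class Network {
    final SharedClient client;
    final AccountContext context;

    Network(SharedClient client, String proxyLogin) {
        this.client = client;
        this.context = new AccountContext(proxyLogin);
    }

    String get(String url) {
        return client.execute(url, context);
    }
}

// Main logic for one account; simply passes the client on to Network.
final class Task implements Runnable {
    final Network network;
    volatile String lastPage;

    Task(SharedClient client, String proxyLogin) {
        this.network = new Network(client, proxyLogin);
    }

    @Override
    public void run() {
        lastPage = network.get("http://example.invalid/");
    }
}

final class Launcher {
    // Creates the client ONCE and gives the same instance to every thread.
    static List<Task> runAll(List<String> proxyLogins) {
        SharedClient client = (url, ctx) -> {
            ctx.cookies.put("session", ctx.proxyLogin); // fake cookie, for the demo only
            return "page for " + ctx.proxyLogin;
        };
        List<Task> tasks = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (String login : proxyLogins) {
            Task task = new Task(client, login);
            Thread thread = new Thread(task);
            tasks.add(task);
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join(); // wait until every account has been processed
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        for (Task task : runAll(Arrays.asList("acc1", "acc2", "acc3"))) {
            System.out.println(task.lastPage + " cookies=" + task.network.context.cookies);
        }
    }
}
```

Task and Network stay separate classes and can still be reused elsewhere; the only change is one extra constructor parameter.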

Is my reasoning correct? If not, what would be the right way?

PS This is my first serious Java application, so please do not judge too harshly; everything comes with experience. Thank you in advance for your answers and help.

    1 answer

    What exactly is the performance bottleneck? As I understand it, you create many threads at the same time. Creating a large number of threads simultaneously, as you have already found out, is costly.

    In a good multi-threaded program, there is a fixed pool of threads that serves the system. For example:

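     import java.util.concurrent.ExecutorService;
     import java.util.concurrent.Executors;

     public class Task implements Runnable {
         @Override
         public void run() {
             // do our work here
         }

         public static void main(String[] args) {
             ExecutorService executor = Executors.newFixedThreadPool(4);
             for (int i = 0; i < 10; i++) {
                 Runnable task = new Task();
                 executor.execute(task);
             }
             executor.shutdown(); // lets the JVM exit once the queued tasks finish
         }
     }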

    In this example, no more than 4 tasks will be performed simultaneously.
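
    To make that limit visible, here is a small sketch that counts how many tasks are inside `run` at the same moment. `PoolLimitDemo` and its counters are hypothetical names of mine, and the `Thread.sleep(50)` merely stands in for real work.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolLimitDemo {
    // Runs `tasks` short jobs on a pool of `poolSize` threads and returns
    // { highest number of jobs seen running at once, number of jobs completed }.
    static int[] run(int poolSize, int tasks) {
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger done = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            executor.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max); // remember the highest value seen
                try {
                    Thread.sleep(50); // stands in for real work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                    done.incrementAndGet();
                }
            });
        }
        executor.shutdown(); // no new tasks are accepted; queued ones still run
        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] { peak.get(), done.get() };
    }

    public static void main(String[] args) {
        int[] result = run(4, 10);
        System.out.println("peak=" + result[0] + " done=" + result[1]);
    }
}
```

    With a pool of 4 and 10 tasks, the peak can never exceed 4, while all 10 tasks still complete; the extra ones simply wait in the executor's queue.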

    Problem number two. Suppose you remove the manual creation of a huge number of threads and use a pool of 4 threads. But since you are working with the network, you have to wait for the server's response. At that point the thread blocks and goes into a waiting state:

     public void run() {
         Response response = client.execute(...); // the thread blocks here until the server replies
     }

    Obviously, if the server takes a long time to respond, throughput will be poor while the processor stays idle, because the threads are just waiting.

    Non-blocking I/O was created to solve exactly this kind of problem. There are non-blocking HTTP client libraries as well, for example async-http-client:

     public void run() {
         AsyncHttpClient client = new AsyncHttpClient();
         client.get("https://www.google.com", new AsyncHttpResponseHandler() {
             @Override
             public void onSuccess(int statusCode, Header[] headers, byte[] response) {
                 // called when response HTTP status is "200 OK"
             }
         });
     }

    After the call to get, the current thread is not blocked and we safely exit the run method; the response arrives later in onSuccess. Internally, such libraries have a thread pool that serves the requests, and while a server response is pending, a thread is not blocked but goes on to serve another request. Something like that :)
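
    The same idea can be imitated with the JDK alone. This is a simulation, not a real HTTP call and not the API of any HTTP library: `fakeGet` is a hypothetical helper that delivers a "reply" after a delay through a `CompletableFuture`, and the `thenAccept` callback plays the role of onSuccess. One I/O thread serves all the requests because it never sleeps while a reply is pending.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class NonBlockingDemo {
    // Hypothetical "request": the reply shows up after delayMs, and no thread waits for it.
    static CompletableFuture<String> fakeGet(ScheduledExecutorService io, String url, long delayMs) {
        CompletableFuture<String> reply = new CompletableFuture<>();
        io.schedule(() -> reply.complete("200 OK from " + url), delayMs, TimeUnit.MILLISECONDS);
        return reply; // returned immediately, before the reply exists
    }

    // Starts `requests` simulated requests on ONE I/O thread and returns how many callbacks ran.
    static int run(int requests) {
        ScheduledExecutorService io = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger succeeded = new AtomicInteger();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            pending.add(fakeGet(io, "http://example.invalid/" + i, 100)
                    .thenAccept(body -> succeeded.incrementAndGet())); // plays the role of onSuccess
        }
        // Only the demo waits here, so that it can report the total; a real program would carry on.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        io.shutdown();
        return succeeded.get();
    }

    public static void main(String[] args) {
        System.out.println("callbacks run: " + run(50));
    }
}
```

    All the simulated requests are in flight together, so the whole batch should take roughly one delay rather than one delay per request, even though only a single thread handles the replies.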

    The Apache HTTP client also has a non-blocking API, with an example. That said, I think that if you go the NIO route, it is easier to use some popular library.