The right approach to parsing

Question

Hello, I have the following task: parse the email addresses of users of the social network My World who have been active recently.

I see the solution as follows:

1) Collect the pages we are interested in.
2) Open a page.
3) In the HTML, check that the value of <span class="profile__user-status"> matches what we need; for example, it should say "онлайн" (online), "секунд/минут/часов назад" (seconds/minutes/hours ago), or a month such as "января/февраля" (January/February).
4) If step 3 passes, take the text of the form "Страница пользователя xxx@mail.ru социальной сети Мой Мир." ("Page of user xxx@mail.ru of the social network My World.") from the HTML.
5) Extract the email address.
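Steps 3 to 5 could be sketched roughly like this in Python, using only the standard library. This is a minimal sketch, not a tested scraper for the real site: the function name `extract_recent_email`, the marker list, and the email pattern are my assumptions, built only from the examples in the question. The real markup may differ, and a proper HTML parser would be more robust than regular expressions.

```python
import re
from typing import Optional

# Assumed markers of recent activity, based on the examples in the question.
# "час" is a stem chosen here so that "час", "часа" and "часов" all match.
RECENT_MARKERS = ("онлайн", "секунд", "минут", "час", "января", "февраля")

# Text inside <span class="profile__user-status">...</span> (step 3).
STATUS_RE = re.compile(
    r'<span[^>]*class="[^"]*profile__user-status[^"]*"[^>]*>(.*?)</span>',
    re.DOTALL | re.IGNORECASE,
)

# Email inside the phrase "Страница пользователя xxx@mail.ru ..." (steps 4-5).
EMAIL_RE = re.compile(
    r"Страница пользователя\s+([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
)


def extract_recent_email(html: str) -> Optional[str]:
    """Return the profile email if the status looks recent, else None."""
    status = STATUS_RE.search(html)
    if status is None:
        return None
    status_text = status.group(1).lower()
    if not any(marker in status_text for marker in RECENT_MARKERS):
        return None
    email = EMAIL_RE.search(html)
    return email.group(1) if email else None
```

Calling `extract_recent_email(page_html)` on each downloaded page gives either an address or `None`, so the result can be filtered and written out directly.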

That's it.

The target speed is at least 1,000 addresses per minute. More is better, of course.

This is my first time facing a task like this. Is Python an adequate choice for implementing it? Or would it be easier or quicker to do in another programming language (maybe JS)? If another one, which and why?

Maybe there are already partially ready solutions?

If not, how can I generally solve my problem (which approach should I use)?

Thanks for any answers.

You can definitely do it in Python, but there is a chance that you will be banned for too much activity.

Alex Titov · Accepted Answer · 2018-01-11T15:37:37

I think Python would be the best option. There are good libraries for parsing HTML pages, as well as Scrapy, a complete framework for smart downloading.

The speed will be determined 99% by how many threads you can run at the same time without getting banned. Scrapy usage examples show how you can try to do this.
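To illustrate the trade-off between many threads and not getting banned, here is a minimal sketch using only the standard library rather than Scrapy. The names `RateLimiter` and `fetch_all`, the `fetch` callback, and the defaults (32 workers, 20 requests per second, roughly 1,200 pages per minute) are my assumptions for illustration; what rate the site actually tolerates is unknown.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


class RateLimiter:
    """Space calls at least `min_interval` seconds apart across all threads."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    workers: int = 32,
    max_per_second: float = 20.0,
) -> List[str]:
    """Download all URLs concurrently, capped at `max_per_second` requests."""
    limiter = RateLimiter(1.0 / max_per_second)

    def task(url: str) -> str:
        limiter.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(task, urls))
```

Here `fetch` is whatever function downloads one page and returns its HTML, for example a thin wrapper around `urllib.request.urlopen`. Keeping the limiter separate from the thread count lets you raise `workers` to hide network latency while lowering `max_per_second` if the site starts refusing requests.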
