I need to parse all the articles of the Russian Wikipedia, both for all time and just for the last year. What tools can be used to do this most conveniently in Python?


I've figured out how to parse the entire Wikipedia. The difficulty is collecting the articles from the last year.

Closed because the question is too general, by Streletz, dirkgntly, aleksandr barakin, Nick Volynkin Aug 7 '16 at 5:46

Please edit the question so that it describes a specific problem with enough detail to determine an appropriate answer. Do not ask several questions at once. See “How do I ask a good question?” for guidance. If the question can be reformulated according to the rules set out in the help center, edit it.

  • I think this will be interesting: scrapy.org - Alexander
  • grab is not bad. Very easy to use and effective. - yarkov_aleksei
  • Are you authorized by Wikipedia to do this? The rules say: “[...] by the Wikimedia community” - VladD

2 answers

Downloading everything by hand or through the API is a dead end: there are a lot of articles, and Wikipedia rate-limits requests per second, so even a thousand threads saturating your entire download channel won't get you there. There are dumps with all the articles in any language. At its core, a dump is an archive containing one giant XML file with all the articles and some metadata. The information is not perfectly up-to-date, but it should be enough for most tasks.

There is one catch, though: the articles in the dump are written in wiki markup. However, there are parsers for these templates. Imperfect, but you can live with them. If you don't feel like writing your own super-cool solution from scratch, you can take a ready-made one: gensim (a pretty cool library, by the way) has a simple built-in parser. And examples, of course.
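To illustrate the gensim route, here is a minimal sketch of streaming articles out of a dump with its built-in WikiCorpus parser. The file name ruwiki-latest-pages-articles.xml.bz2 is the standard name on dumps.wikimedia.org, but the exact path and constructor details vary by gensim version, so treat this as an assumption to check against your installed docs:

```python
from gensim.corpora.wikicorpus import WikiCorpus

# Path to the downloaded dump (assumed name; substitute your own file).
DUMP_PATH = 'ruwiki-latest-pages-articles.xml.bz2'

# Passing dictionary={} skips building a gensim Dictionary,
# which is unnecessary if we only want to stream article texts.
wiki = WikiCorpus(DUMP_PATH, dictionary={})

# get_texts() yields each article as a list of tokens,
# with the wiki markup already stripped by gensim's parser.
for i, tokens in enumerate(wiki.get_texts()):
    print(' '.join(tokens[:20]))  # first 20 tokens of the article
    if i >= 4:                    # demo: stop after five articles
        break
```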

    Wikipedia seems to have its own API, and there is a Python module for accessing it.

    • Yes, I looked at it. I found how to limit queries to a time period, but not how to pull out all the articles. - Tolkachev Ivan
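On the question raised in the comment, the MediaWiki API can enumerate every article title via list=allpages with standard continuation handling; a minimal sketch with requests is below. As the first answer notes, this is slow for a full crawl, and the User-Agent value here is a placeholder to replace with your own contact info:

```python
import requests

API_URL = 'https://ru.wikipedia.org/w/api.php'
# Wikimedia's API etiquette asks for a descriptive User-Agent;
# the value below is a placeholder.
HEADERS = {'User-Agent': 'my-research-script/0.1 (contact: you@example.com)'}

def iter_all_titles():
    """Yield titles of all main-namespace articles, page by page."""
    params = {
        'action': 'query',
        'list': 'allpages',
        'apnamespace': 0,   # main (article) namespace only
        'aplimit': 500,     # max page size for anonymous clients
        'format': 'json',
    }
    while True:
        data = requests.get(API_URL, params=params, headers=HEADERS).json()
        for page in data['query']['allpages']:
            yield page['title']
        if 'continue' not in data:       # no more result pages
            break
        params.update(data['continue'])  # carry the apcontinue token forward

for n, title in enumerate(iter_all_titles()):
    print(title)
    if n >= 9:  # demo: stop after the first ten titles
        break
```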