I need to parse all the articles of the Russian Wikipedia, both for all time and just for the last year. What tools can be used to do this most conveniently in Python?


I've figured out how to parse the entire Wikipedia. The difficulty is collecting the articles from the last year.

Closed because the question is too general, by Streletz, dirkgntly, aleksandr barakin, Nick Volynkin Aug 7 '16 at 5:46

Please edit the question so that it describes a specific problem with enough detail to determine an appropriate answer. Do not ask several questions at once. See “How do I ask a good question?” for guidance. If the question can be reformulated according to the rules set out in the help center, edit it.

  • I think this will be interesting: scrapy.org - Alexander
  • grab is not bad. Very easy to use and effective. - yarkov_aleksei
  • Are you authorized by Wikipedia to do this? The rules say: “[...] by the Wikimedia community” - VladD

2 answers

Downloading everything by hand or through the API is a dead end: there are a lot of articles, and Wikipedia rate-limits requests per second, so even a thousand threads saturating your entire download channel won't get you there. There are dumps with all the articles in any language. At its core, a dump is an archive containing one giant XML file with all the articles and some metadata. The information is not perfectly up-to-date, but it should be enough for most tasks.

There is one catch, though: the articles in the dump are written in wiki markup. However, there are parsers for these templates. Imperfect, but you can live with them. If you don't feel like writing your own super-cool solution from scratch, you can take a ready-made one: gensim (a pretty cool library, by the way) has a simple built-in parser. And examples, of course.
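To illustrate the gensim route, here is a minimal sketch of streaming articles out of a dump with its built-in WikiCorpus parser. The file name ruwiki-latest-pages-articles.xml.bz2 is the standard name on dumps.wikimedia.org, but the exact path and constructor details vary by gensim version, so treat this as an assumption to check against your installed docs:

```python
from gensim.corpora.wikicorpus import WikiCorpus

# Path to the downloaded dump (assumed name; substitute your own file).
DUMP_PATH = 'ruwiki-latest-pages-articles.xml.bz2'

# Passing dictionary={} skips building a gensim Dictionary,
# which is unnecessary if we only want to stream article texts.
wiki = WikiCorpus(DUMP_PATH, dictionary={})

# get_texts() yields each article as a list of tokens,
# with the wiki markup already stripped by gensim's parser.
for i, tokens in enumerate(wiki.get_texts()):
    print(' '.join(tokens[:20]))  # first 20 tokens of the article
    if i >= 4:                    # demo: stop after five articles
        break
```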

    Wikipedia seems to have its own API, and there is a Python module for accessing it.

    • Yes, I looked at it. I found how to limit queries to a time period, but not how to pull out all the articles. - Tolkachev Ivan
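On the question raised in the comment, the MediaWiki API can enumerate every article title via list=allpages with standard continuation handling; a minimal sketch with requests is below. As the first answer notes, this is slow for a full crawl, and the User-Agent value here is a placeholder to replace with your own contact info:

```python
import requests

API_URL = 'https://ru.wikipedia.org/w/api.php'
# Wikimedia's API etiquette asks for a descriptive User-Agent;
# the value below is a placeholder.
HEADERS = {'User-Agent': 'my-research-script/0.1 (contact: you@example.com)'}

def iter_all_titles():
    """Yield titles of all main-namespace articles, page by page."""
    params = {
        'action': 'query',
        'list': 'allpages',
        'apnamespace': 0,   # main (article) namespace only
        'aplimit': 500,     # max page size for anonymous clients
        'format': 'json',
    }
    while True:
        data = requests.get(API_URL, params=params, headers=HEADERS).json()
        for page in data['query']['allpages']:
            yield page['title']
        if 'continue' not in data:       # no more result pages
            break
        params.update(data['continue'])  # carry the apcontinue token forward

for n, title in enumerate(iter_all_titles()):
    print(title)
    if n >= 9:  # demo: stop after the first ten titles
        break
```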