Downloading everything by hand or through the API is a dead end: there are a lot of articles, and Wikipedia rate-limits requests per second, so spinning up a thousand threads to saturate your download channel won't work. Instead, there are dumps with all the articles in any language. At the core is an archive containing one giant XML file with all the articles and some metadata. The information isn't perfectly up to date, but it should be enough for most tasks. There is one catch, though: the articles in the dump are written in wiki markup. Fortunately, parsers for it exist. Imperfect, but livable. There's no need to write your own super-cool solution from scratch when a ready-made one is available: gensim (a pretty cool library, by the way) ships with a simple built-in parser. With examples, of course.
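To give an idea of what that looks like, here is a minimal sketch (assuming a reasonably recent gensim; the dump file name is just a placeholder for whichever language dump you grabbed): `WikiCorpus` streams articles straight out of the compressed XML dump and strips the wiki markup for you.

```python
from gensim.corpora.wikicorpus import WikiCorpus

# Path to the downloaded dump; placeholder name, adjust to your language/date.
DUMP_PATH = 'ruwiki-latest-pages-articles.xml.bz2'

# Passing an empty dict as the dictionary skips the vocabulary-building pass,
# so we only pay for parsing, not for a full preliminary scan of the dump.
wiki = WikiCorpus(DUMP_PATH, dictionary={})

# get_texts() lazily yields one article at a time as a list of tokens,
# with the wiki markup already removed.
for i, tokens in enumerate(wiki.get_texts()):
    print(len(tokens), tokens[:10])  # peek at the first few articles
    if i >= 2:
        break
```

Because everything is streamed, you can iterate over the whole dump without loading it into memory; the trade-off is that parsing the full archive still takes a while on a single machine.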