I need to pick a combination of book and installation to get started - that is, to have my own cluster to play with.

As my machine I have a Windows 7 32-bit laptop, running Debian 7 in VirtualBox.

Questions:

How big is the difference between Hadoop releases? Is 1.x very different from 2.x? Which should I download?
There is a distribution available for download on the Apache website, but I've heard there are serious problems with installing it. Is that true?
There are ready-made virtual machine images such as Cloudera QuickStart, but as I understand it, they require a 64-bit host machine, which I don't have.
Right now I'm downloading the Hortonworks Sandbox. Is that the right choice?
Is there anything else worth considering?

Besides that, I need a book. Hadoop: The Definitive Guide dates from 2012 - is it still relevant? I couldn't tell from the preface which version it covers.

    2 answers

    A 32-bit machine means you have little RAM, and that, in my view, guarantees pain and suffering with a virtual machine running a lot of Java-based Hadoop inside.

    The Cloudera QuickStart VM, for instance, comes with 4 GB allocated to the guest OS by default, and to enable the administration interface (Cloudera Manager) the authors strongly recommend raising that to 8 GB.

    As for the rest (with the exception of Cloudera Manager, which is considered more advanced than Ambari in HDP), the choice between Cloudera and Hortonworks is largely political. You can start with either one (the main components are the same), and then try the competing distribution for the components they don't share (such as Impala, which ships only in CDH).

    Regarding "download from apache.org": I haven't tried it myself, but common sense tells me that assembling a distribution from individual components will be no easier than assembling a Linux distribution the same way.

    Hadoop 1.x differs significantly from 2.x. The book mentioned above was recently published in a fourth edition and now covers only 2.x, which makes it easier to read. The 3rd edition jumbled the two versions together.

    My opinion: no matter which book you read, you won't get by with just one. For my taste, all the books dive too deeply into the details (what this or that map-reduce class does); I haven't seen a sensible high-level description of the big picture (and it changes very quickly), so I ended up doing a lot of googling myself. But to each their own.

      1. Yes, the difference between versions 1.x and 2.x is significant, but between 2.x and 3.x not so much, so at the moment you can pick from either of the latter.
      2. There are no big problems installing the "vanilla" distribution downloaded from the site. The tutorial will help: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html .
      3. I wouldn't advise downloading the Hortonworks or Cloudera bundle right away; it's better to learn each component separately, starting with Hadoop itself.
      4. The fact that you have a 32-bit machine won't cause problems at the initial stage. It's quite enough to run pseudo-distributed mode, and even if you want to try configuring a cluster of a couple of virtual machines, 512 MB each will be enough to start with, provided you configure everything carefully.
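As a rough sketch of what the pseudo-distributed setup from that tutorial involves: you edit two config files in the unpacked Hadoop directory so that HDFS runs on localhost with a single replica (property names are from the Hadoop 2.x documentation; the port and paths here are the tutorial's defaults, not requirements):

```xml
<!-- etc/hadoop/core-site.xml: point the filesystem at a local HDFS NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml: a single node, so one replica per block is enough -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

After that, the tutorial has you format the NameNode once with `bin/hdfs namenode -format` and bring HDFS up with `sbin/start-dfs.sh`; neither step needs more memory than a modest VM provides.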