I wanted to know what it takes to install Spark on a Hadoop cluster. I found this step-by-step guide. Even using AWS EC2 instances, I ain’t got time for that. I just want to play with Spark on top of Hadoop. If only someone had created a managed platform I could use to kick the tires on Spark.
That’s AWS EMR. EMR is one of those funny-named services from AWS. The name stands for Elastic MapReduce, which doesn’t tell you much if you came for Spark. What it does is take S3 storage and EC2 compute and make Hadoop and Spark cluster creation simple. One of the advantages of Hadoop is also one of its challenges: compute and storage scale together, because every node brings both. Analytics running too slowly? Add nodes to your cluster.
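However, if you run out of space, you also have to add nodes to the cluster, even if you don’t need more compute. AWS solves that inefficiency on its end: EMR can keep data in S3 instead of on the nodes’ local disks, so storage grows without adding nodes, and you benefit from the simplicity.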
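To put rough numbers on that coupling, here is a minimal sketch in Python. The per-node figures (4 TB of disk and 16 vCPUs per node) and the workload (100 TB of data, 80 vCPUs of work) are made-up assumptions for illustration, not real instance specs or anything measured.

```python
import math

def nodes_needed(storage_tb, vcpus, tb_per_node, vcpus_per_node):
    """Return (for_storage, for_compute) node counts for a workload.

    Illustrative helper: all inputs are hypothetical sizing figures.
    """
    for_storage = math.ceil(storage_tb / tb_per_node)
    for_compute = math.ceil(vcpus / vcpus_per_node)
    return for_storage, for_compute

# Hypothetical workload: lots of data, modest compute.
for_storage, for_compute = nodes_needed(
    storage_tb=100, vcpus=80, tb_per_node=4, vcpus_per_node=16
)

# Classic Hadoop: each node supplies disk AND compute, so the cluster
# must be sized for whichever requirement is larger.
coupled_nodes = max(for_storage, for_compute)

# Data kept in S3: the cluster only has to cover the compute requirement.
decoupled_nodes = for_compute

print(f"coupled: {coupled_nodes} nodes, decoupled: {decoupled_nodes} nodes")
print(f"nodes bought only for their disks: {coupled_nodes - decoupled_nodes}")
```

With these invented numbers, the storage requirement alone dictates the cluster size in the coupled case, and most of the nodes are there only to hold data. Swap in your own figures to see which side drives your cluster.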