I’m working on a system which has a backend database that could in principle need unlimited data storage. This project motivated me to take a look at the Apache Spark framework. Spark is an open source computing framework intended for big data, and is well-suited for machine learning systems.
I was skeptical at the outset. As a developer I don’t enjoy the data layer very much; I just want some data so I can write algorithms. I had also looked at Hadoop before and didn’t like it very much.
Anyway, I was pleasantly surprised by Spark. I like it. From a developer’s point of view, the key data abstraction is called a resilient distributed dataset (RDD). An RDD is like a Collection object that is possibly stored in the RAM of several machines.
If you scan through my demo session, you can get an idea of how Spark works. There is a lot going on in my demo. I start with a text file containing some lines of text from a well-known book. The demo reads the file into a string RDD (named f), breaks each line into individual words and stores them in a second RDD (fm), assigns a value of 1 to each word in a so-called pair RDD (m), adds up the values for each distinct word (cts), sorts the RDD from largest count to smallest (sorted), and displays the result. This technique is called map-reduce.
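To make those steps concrete, here is a minimal sketch in plain Python rather than Scala and Spark. It is not the demo code: ordinary lists stand in for RDDs, the sample lines are made up, and the function name word_counts is hypothetical. The variable names mirror the ones in the demo, except that the last one is called sorted_cts so it doesn't shadow Python's built-in sorted.

```python
from collections import defaultdict


def word_counts(lines):
    """Count words in plain Python; lists stand in for Spark RDDs."""
    # f: the lines of text (plays the role of the string RDD)
    f = list(lines)
    # fm: break every line into words and flatten into one list
    fm = [word for line in f for word in line.split()]
    # m: pair each word with the value 1
    m = [(word, 1) for word in fm]
    # cts: add up the values for each distinct word
    totals = defaultdict(int)
    for word, one in m:
        totals[word] += one
    cts = list(totals.items())
    # sorted_cts: largest count first; ties are broken alphabetically
    # (an assumption of this sketch, to keep the output deterministic)
    sorted_cts = sorted(cts, key=lambda pair: (-pair[1], pair[0]))
    return sorted_cts


# Made-up sample input, not the book text used in the demo
sample = ["the cat sat", "the cat ran", "the end"]
for word, count in word_counts(sample):
    print(word, count)
```

In Spark each of these steps would be a transformation on a distributed RDD instead of a loop over a local list, but the shape of the computation is the same.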
The map-reduce paradigm is key to working with Big Data. I tend to think of map-reduce as “prepare-process”. The map part prepares the data, which could be scattered across hundreds of machines, by transforming each piece independently into a common form. The reduce part then processes the prepared data in some way, typically by combining it into a result.
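A rough illustration of that split, in plain Python rather than Spark: three small lists play the role of data sitting on three different machines (a made-up setup), the map step runs on each one without looking at the others, and the reduce step merges the partial results. The names map_partition and merge are hypothetical helpers for this sketch.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    # Map step: runs on one partition only and needs nothing from the others.
    return Counter(word for line in lines for word in line.split())


def merge(left, right):
    # Reduce step: combine two partial results into one.
    return left + right


# Hypothetical partitions; in a real cluster each would live on its own machine.
partitions = [["a b a"], ["b c"], ["a"]]

partials = [map_partition(p) for p in partitions]
total = reduce(merge, partials, Counter())
print(total.most_common())
```

Because merge can be applied to the partial results in any order, the reduce step can also be spread over several machines, which is what makes the pattern work at scale.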
Spark can be accessed in several ways. My demo uses the Scala language, which is the most common approach. Scala runs on the Java virtual machine and feels like Java with functional language characteristics, and it comes with an interactive shell.