- June 28, 2017
- Posted by: Matej Artač Guest Post
- Category: Big Data
Any technology becomes exciting if one can experience it hands-on. This is a great way to learn to use it, gain experience, and gradually build useful solutions. In this post, we show how to get started with Apache Spark quickly and easily, without losing days trying to install it. Read on if you are interested in Big Data and have an Amazon AWS account or access to an OpenStack cloud.
In the DICE H2020 project, we are bringing a UML-model-driven development approach into the DevOps world. Automation of continuous application deployment is an important element, and here Cloudify has been a great enabler. With release 0.3.4 of the DICE components, we are several steps closer to two of our goals: making Continuous Integration and application experimentation easy, and providing building blocks for deploying Big Data applications. In this post we demonstrate the latter, using Spark as the example.
Let us start with some prerequisites. The recipe requires an instance of Cloudify Manager 3.4.2 running in Amazon AWS. We also recommend that you set up the DICE Deployment Service, a wrapper service that makes cloud deployments for DevOps easier. You can follow our instructions; collecting all the configurations and settings will take about 1 hour. Bootstrapping the services is then completely automated and takes about 45 minutes. With that in place, anyone with an Amazon AWS account can try this out in EC2.
For the rest of the post, we will walk you through the features of a blueprint that we find particularly noteworthy. The blueprint deploys a Spark cluster and runs a π calculation on top of it. This is the topology as shown in the Cloudify GUI:
We start with the header part of the blueprint:
Because the DICE TOSCA library is compatible with Cloudify Manager 3.4.x, the chosen TOSCA dialect is cloudify_dsl_1_3. In the imports section, a single plug-in is then enough to gain multicloud support for the Big Data services.
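A header matching this description could look like the minimal sketch below. The import URL is a placeholder, not the real location of the DICE TOSCA library definitions, which is not given here.

```yaml
# TOSCA dialect understood by Cloudify Manager 3.4.x
tosca_definitions_version: cloudify_dsl_1_3

imports:
  # Placeholder (assumption): replace with the actual URL or path of the
  # DICE TOSCA library definitions for the release you are using.
  - https://example.com/path/to/dice-tosca-library.yaml
```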
The node_templates section then contains all the entities needed to make a Spark cluster work.
We start with the master node. In this example, we make the master node available on a public internet address, so we add the node template master_ip. We also need to declare a firewall node, here named master_fw, to protect the master host from unsolicited network traffic:
master_ip is of type dice.VirtualIP, which is the DICE TOSCA library's abstraction of what the OpenStack world refers to as a floating IP and what AWS calls an Elastic IP. Similarly, master_fw is of type dice.firewall_rules.spark.Master. This type abstracts the notion of security groups in OpenStack or AWS and internally encodes all the rules that are relevant for this particular service type. The designer of the blueprint neither needs to know what the rules are nor has to enter them manually into the blueprint.
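Using only the names and types mentioned above, the two declarations can be sketched as follows. Neither node needs properties in this sketch, since the types are described as carrying the platform-specific details themselves.

```yaml
node_templates:

  # Public address for the master (floating IP on OpenStack, Elastic IP on AWS)
  master_ip:
    type: dice.VirtualIP

  # Security group with the rules a Spark master needs, supplied by the type
  master_fw:
    type: dice.firewall_rules.spark.Master
```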
To provide a virtual machine that hosts the Spark master service, we declare the master_vm node template:
The type used here is dice.hosts.ubuntu.Medium, which represents a virtual host running the Ubuntu operating system. The host’s flavour (i.e., size) is medium. Using the relationships of master_vm, we connect the virtual machine with the virtual IP and the firewall node templates.
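A sketch of such a host declaration follows. The two relationship type names are assumptions made for illustration, because the text does not name them; check the DICE TOSCA library documentation for the exact identifiers.

```yaml
node_templates:

  master_vm:
    type: dice.hosts.ubuntu.Medium
    relationships:
      # Assumed relationship type name: attach the public address
      - type: dice.relationships.IPAvailableFrom
        target: master_ip
      # Assumed relationship type name: apply the firewall rules
      - type: dice.relationships.ProtectedBy
        target: master_fw
```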
Note that all three of these node templates map to specific concepts handled in the target platform (e.g., OpenStack or Amazon EC2). The plug-in makes sure that the proper API calls get invoked during the orchestration to manipulate the live instances of these constructs.
We finish defining the Spark master node with the following node template declaration:
dice.components.spark.Master unlocks the magic of our TOSCA technology library by running the needed installation and configuration processes on the master_vm instance, installing the Spark master. Again, the author of the blueprint does not have to know what is involved in this installation process. For the curious, however, the Chef cookbook provides the details of this implementation.
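In blueprint form, this could be sketched as below. The node name master and its type come from the text; the name of the relationship that places the service on master_vm is an assumption.

```yaml
node_templates:

  master:
    type: dice.components.spark.Master
    relationships:
      # Assumed relationship type name: install the service on the master host
      - type: dice.relationships.ContainedIn
        target: master_vm
```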
A Spark cluster also requires a number of Spark worker nodes. We define them in a similar fashion to the master node. These nodes don’t need any external access, so we skip the virtual IP:
Note that the number assigned to the instances.deploy property can be as high as the workload requires, but for the π calculation, 2 is more than enough.
The only difference from the master definition, other than a different type, is that worker has a relationship with the master node template. This ensures that the orchestration configures the worker with the proper address and port of the master node.
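The worker side might be sketched as follows. The node names worker_fw and worker_vm, the worker firewall and component type names (chosen by analogy with the master), and all relationship type names are assumptions; only worker, master, and instances.deploy are taken from the text.

```yaml
node_templates:

  worker_fw:
    # Assumed type name, by analogy with dice.firewall_rules.spark.Master
    type: dice.firewall_rules.spark.Worker

  worker_vm:
    type: dice.hosts.ubuntu.Medium
    instances:
      deploy: 2            # number of worker hosts to create
    relationships:
      # Assumed relationship type name
      - type: dice.relationships.ProtectedBy
        target: worker_fw

  worker:
    # Assumed type name, by analogy with dice.components.spark.Master
    type: dice.components.spark.Worker
    relationships:
      # Assumed relationship type name: install on each worker host
      - type: dice.relationships.ContainedIn
        target: worker_vm
      # Assumed relationship type name: passes the master's address and port
      - type: dice.relationships.spark.ConnectedToMaster
        target: master
```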
The last node template to define is the Spark application. The type to use is dice.components.spark.Application, which defines a submitter of the user’s custom application. Here, a number of properties are available, and the blueprint author has to populate them with the values that are specific to the user’s application:
As the example shows, the properties of the node fully describe the user’s application. The jar property points to the application packaged as a .jar file, which is either available online (the orchestrator will fetch it before submitting it to the master) or bundled with the blueprint and referred to locally. The class property defines the class to execute, and the name property sets the name under which the application appears once submitted. Finally, args is a free-form list of command line arguments required by the particular application. Here, we supply only one parameter, with the value 10.
The two relationships are important. dice.relationships.spark.SubmittedBy declares the master that will receive and run the application. dice.relationships.Needs, on the other hand, expresses a dependency of the client application on the worker nodes. If it were not present, the orchestrator could submit the job to the master before the workers were ready, causing a delay or possibly even a failure of the job.
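Putting the properties and the two relationships together, an application node could be sketched like this. The node name spark_pi, the .jar location, and the class name are illustrative assumptions; the class shown is the π example that ships with Apache Spark, used here only because the text describes a π calculation.

```yaml
node_templates:

  spark_pi:                      # hypothetical node name
    type: dice.components.spark.Application
    properties:
      # Placeholder (assumption): URL or blueprint-relative path of the .jar
      jar: https://example.com/path/to/spark-examples.jar
      # Assumed class: the SparkPi example distributed with Apache Spark
      class: org.apache.spark.examples.SparkPi
      name: SparkPi
      args: [ 10 ]               # the single command line argument
    relationships:
      # The master that receives and runs the application
      - type: dice.relationships.spark.SubmittedBy
        target: master
      # Wait for the workers before submitting the job
      - type: dice.relationships.Needs
        target: worker
```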
The concluding part of the blueprint represents the outputs of the deployment:
In this case, we use the intrinsic function get_attribute to extract the runtime attribute that holds the floating address of the master service, and concat to compose the final URL.
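As a hedged sketch of how the two functions combine: the output name, the attribute name address, and the port are assumptions (8080 is the default web UI port of a standalone Spark master), so adjust them to what the library actually exposes.

```yaml
outputs:
  spark_master_url:              # hypothetical output name
    description: Address of the Spark master
    value:
      concat:
        - "http://"
        # Assumed attribute name holding the public address of master_ip
        - get_attribute: [ master_ip, address ]
        - ":8080"
```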
Obtaining a working Spark deployment is then a matter of submitting the blueprint to an instance of the DICE Deployment Service. About 10-20 minutes later, the cluster will be up and will compute an approximation of the number π. The only disadvantage of this simple example is that the result remains buried in the cluster, so the user has to log into one of the nodes to read it.
Now we have a blueprint that we can keep versioned in Git and deploy automatically and frequently using Continuous Integration. This is perfect for a DevOps approach to developing and validating Big Data applications. In the DICE Deployment Service’s GUI, a successfully finished deployment will appear like this: