A recent benchmark by Petersenna comparing various virtualization solutions, such as Xen, VMware, and Hyper-V, shows that on average the overhead associated with virtualization can lead to 2.4 times higher disk latency and 25% slower network I/O.
Based on these benchmarks, I think it is fair to assume that running I/O intensive workloads such as Big Data on a virtualized infrastructure would require roughly 3x more resources than its bare metal equivalent.
As we are talking about Big Data infrastructure, 3x means a lot of wasted resources. The operational cost, coupled with the complexity of running a significantly bigger system, adds up to a fairly substantial overhead. In addition, the network and disk overhead is not deterministic and can vary quite substantially as utilization gets higher. This means not just a performance overhead, but a non-deterministic one. In Big Data terms, a query for a particular piece of data can take 10 msec one time and 30 msec another time. Quite often, Big Data analysis requires a sequence of those operations, so if we chain this overhead across a sequence of 10 queries, the end-to-end response time can vary quite substantially.
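To make that variance concrete, here is a minimal sketch with purely illustrative numbers (not taken from the benchmark above): each query in a chain of 10 lands somewhere between 10 msec and 30 msec, and we look at the spread of the end-to-end response time.

```python
import random

# Illustrative only: per-query latency somewhere between 10 ms and 30 ms,
# chained over 10 dependent queries, as in the example above.
QUERIES = 10

def chained_latency_ms():
    return sum(random.uniform(10, 30) for _ in range(QUERIES))

samples = [chained_latency_ms() for _ in range(1000)]
print("best run : %.1f ms" % min(samples))
print("worst run: %.1f ms" % max(samples))
# The best runs tend toward ~100 ms and the worst toward ~300 ms -- the same
# 3x spread, but now on the response time of the whole analysis sequence.
```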
While running Big Data on the cloud holds a lot of promise, the performance and non-deterministic behavior limit that choice to more of a niche scenario in which the overhead becomes less significant than the elasticity benefit of the cloud. For example, for sporadic workloads, where we run our analysis for a certain period of time and then release the resources, on-demand infrastructure is still the better choice, as running 3x the amount of resources for an hour is significantly cheaper than having a third of that infrastructure allocated 24/7.
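A quick back-of-the-envelope check (with a purely hypothetical cluster size) illustrates the point, using machine-hours as a rough proxy for cost:

```python
# Hypothetical sizing: N bare metal machines allocated 24/7 versus
# 3N on-demand virtualized machines running one hour per day.
N = 10                        # bare metal cluster size (assumed)
bare_metal_hours = N * 24     # allocated around the clock
on_demand_hours = 3 * N * 1   # 3x the resources, one hour per day

print(bare_metal_hours, on_demand_hours)  # 240 vs. 30 machine-hours/day
# Even after paying the 3x virtualization penalty, the sporadic on-demand
# run consumes an order of magnitude fewer machine-hours per day.
```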
The analysis above shows that for I/O intensive workloads, virtualized infrastructure isn’t such a good fit.
Cloud is often viewed as an infrastructure on top of virtualization. Therefore, by definition, a cloud-based infrastructure inherits the benefits and limitations of virtualization.
If we think of a cloud as an infrastructure for getting compute, storage and network resources on demand, then there is nothing that necessitates coupling the cloud with virtualization. The two became paired mainly because of the complexity involved in provisioning non-virtualized resources and the limitations in partitioning a given bare metal machine.
As I noted in a previous post, Bare Metal Cloud/PaaS, there are ways today to provision a bare metal machine from an image and partition it just as we would a hypervisor-based VM. There are already cloud providers that offer bare metal machines as part of their cloud infrastructure.
A new bare metal project in the OpenStack Grizzly release takes this a step further. It allows us to use the same compute API (Nova) to allocate a bare metal machine just as we would a virtualized one.
All we need to do to make the switch is change the image type of our target machine, and the cloud infrastructure will know to map that request to a bare metal machine instead of a virtualized instance.
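As a rough illustration of how small that switch is, here is a minimal python-novaclient sketch. The credentials, image name and flavor name (`my-baremetal-image`, `my-baremetal-flavor`) are placeholders, and it assumes the cloud operator has the Grizzly bare metal driver configured behind that flavor:

```python
from novaclient.v1_1 import client

# Placeholder credentials -- replace with your own environment values.
nova = client.Client("myuser", "mypassword", "myproject",
                     "http://openstack.example.com:5000/v2.0")

# The only real change from a regular boot request is which image/flavor
# we point at: one the operator has mapped to bare metal nodes.
image = nova.images.find(name="my-baremetal-image")      # assumed name
flavor = nova.flavors.find(name="my-baremetal-flavor")   # assumed name

server = nova.servers.create("bigdata-node-1", image, flavor)
print(server.id, server.status)
```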
With this option we can now run our entire Big Data workload on the cloud and not worry about switching environments depending on whether the workload happens to be sporadic or I/O intensive.
During the last OpenStack Summit, one of the talks summed up what is at stake in making this kind of move:

“We took this single image, picked it up from public cloud into a Rackspace-powered private cloud and saw a 4X increased efficiency running that workload.”
Moving from Existing Data Centers to the Cloud
Many existing Big Data and BI systems run in traditional data center environments. Moving those systems into an OpenStack-based environment isn’t going to be a walk in the park. This is where automation frameworks, such as Cloudify in combination with Chef and Puppet, make the transition smoother.
In this approach, we automate the deployment of our existing Big Data/BI systems in a way that is abstracted from the underlying infrastructure. We can then use this abstraction to run Big Data systems in our existing data center and, when ready, use the same deployment framework on an OpenStack-based environment without redoing any of that investment.
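Conceptually, the idea looks something like the following hedged Python sketch. This is not Cloudify’s actual recipe DSL, just an illustration of the abstraction: the deployment logic is written once against a driver interface, and the choice of data center versus OpenStack becomes a configuration detail.

```python
# Hedged sketch of the abstraction idea, not Cloudify's actual API:
# the Big Data deployment is written once, and the infrastructure driver
# (existing data center vs. OpenStack) is swapped underneath it.

class DataCenterDriver:
    def allocate(self, count):
        # e.g. pick machines from a static inventory
        return ["dc-node-%d" % i for i in range(count)]

class OpenStackDriver:
    def allocate(self, count):
        # e.g. call the Nova API to boot instances (bare metal or virtual)
        return ["os-node-%d" % i for i in range(count)]

def deploy_bigdata_cluster(driver, nodes=4):
    hosts = driver.allocate(nodes)
    for host in hosts:
        # the same Chef/Puppet recipes would run here regardless of driver
        print("bootstrapping Big Data stack on", host)

deploy_bigdata_cluster(DataCenterDriver())   # today, in the data center
deploy_bigdata_cluster(OpenStackDriver())    # later, on OpenStack, unchanged
```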
Learn more through hands-on experience – a Real-Life Example on HP OpenStack
As often happens in this sort of conceptual discussion, the points and arguments in this post may sound artificial and hard to grasp. To make things more down to earth, we’ve made NoSQL datastores such as Couchbase, MongoDB, Cassandra, and ElasticSearch, along with other Big Data applications, available on-demand on HP OpenStack cloud services. You can use this reference to launch any of those applications and then use the management console to browse through the recipes and customize them to your environment as needed.
It is also worth pointing out that the recipes behind this project are available on GitHub and can be deployed on your cloud, in your data center, or even on your desktop in pretty much the same way. If you are interested in more details, please post a comment here or in the Cloudify forum.
For more on Big Data on OpenStack, come check out my presentation at Cloud Expo East on Tuesday, June 11th at 8:15am in the “Cloud Computing and Big Data” Track.