Big Data on OpenStack – Moving to the Cloud

Big Data systems, by their very nature, tend to be… big: big in the amount of data and in the number and size of the infrastructure components behind it. Cloud-based infrastructure can be a cost-effective fit for running those Big Data systems. While this may sound obvious, many of the Big Data deployments that I’ve encountered run outside of a cloud environment. The main reason is the performance overhead that is often associated with running in a virtualized cloud environment. Let me explain.
The Performance Overhead of Running Big Data on the Cloud
A recent benchmark by Petersenna across various virtualization solutions, such as Xen, VMware and Hyper-V, shows that on average the overhead associated with virtualization can lead to disk latency that is 2.4 times worse and network I/O that is 25% slower.
Figure: Bare metal vs. virtualization benchmark
How does the performance overhead translate into cost?
Based on the various benchmarks, I think it is fair to assume that running I/O-intensive workloads such as Big Data on a virtualized infrastructure would require roughly 3x the resources of the bare metal equivalent.
As we are talking about Big Data infrastructure, 3x means a lot of wasted resources. The operational costs, coupled with the complexity of running a significantly bigger system, add up to a substantial overhead. In addition, the network and disk overhead is not deterministic and can vary quite substantially as utilization gets higher. The result is not just performance overhead, but unpredictable overhead. In Big Data terms, that means that a query for particular data can take 10 ms one time and 30 ms another time. Quite often, running Big Data analysis requires a sequence of those operations. Therefore, if we pile up this overhead over a sequence of 10 queries, the total response time can vary quite substantially from one run to the next.
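To make the sequence effect concrete, here is a minimal sketch that takes the 10 ms and 30 ms figures from the example above as best-case and worst-case latency for a single query. The function name is made up for illustration, and the model simply assumes the queries run strictly one after another with nothing else contributing to the total.

```python
# Illustrative only: the per-query bounds come from the 10 ms / 30 ms example.
BEST_MS = 10   # best-case latency of a single query
WORST_MS = 30  # worst-case latency of a single query


def sequence_latency_bounds(num_queries, best_ms=BEST_MS, worst_ms=WORST_MS):
    """Return (best, worst) total latency in ms for queries run back to back."""
    return num_queries * best_ms, num_queries * worst_ms


best_total, worst_total = sequence_latency_bounds(10)
spread = worst_total - best_total
print(f"10 sequential queries: {best_total} to {worst_total} ms (spread {spread} ms)")
```

A 20 ms spread on one query grows into a 200 ms spread over ten, which is why the unpredictability matters more for multi-step analysis than for a single lookup.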
While running Big Data on the cloud holds a lot of promise, the performance overhead and non-deterministic behavior limit that choice to niche scenarios in which the overhead is less significant than the elasticity benefit of the cloud. For example, for sporadic workloads in which we run our analysis for a certain period of time and then release the resources, on-demand infrastructure is still the better choice, as running 3x the amount of resources for an hour is significantly cheaper than having a third of that infrastructure allocated 24/7.
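The trade-off can be sketched with a little arithmetic. The numbers below are assumptions for illustration only: a hypothetical bare metal footprint of 10 machines, the 3x overhead factor used in this post, a one-hour daily burst, and the simplification that a machine-hour costs the same on demand as it does when dedicated (in practice the prices usually differ).

```python
def machine_hours(machines, hours_per_day):
    """Machine-hours consumed per day by a fleet of the given size."""
    return machines * hours_per_day


def break_even_hours(overhead_factor, hours_in_day=24):
    """Daily usage at which on-demand virtualized capacity costs the same
    as a dedicated bare metal fleet, assuming equal price per machine-hour."""
    return hours_in_day / overhead_factor


BAREMETAL_MACHINES = 10   # hypothetical size of the bare metal equivalent
OVERHEAD_FACTOR = 3       # the 3x virtualization overhead assumed in this post
BURST_HOURS_PER_DAY = 1   # sporadic workload: one hour of analysis per day

on_demand = machine_hours(BAREMETAL_MACHINES * OVERHEAD_FACTOR, BURST_HOURS_PER_DAY)
dedicated = machine_hours(BAREMETAL_MACHINES, 24)

print(f"on-demand: {on_demand} machine-hours/day")
print(f"dedicated: {dedicated} machine-hours/day")
print(f"break-even at {break_even_hours(OVERHEAD_FACTOR)} hours of use per day")
```

Under these assumptions the sporadic workload uses 30 machine-hours a day on demand versus 240 on dedicated hardware, and the on-demand option stays cheaper until the workload runs about 8 hours a day.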
OpenStack Bare Metal Cloud
The analysis above shows that for I/O intensive workloads, virtualized infrastructure isn’t such a good fit.
Cloud is often viewed as an infrastructure on top of virtualization. Therefore, by definition, a cloud-based infrastructure inherits the benefits and limitations of virtualization.
Is the coupling between cloud and virtualization really mandatory?
If we think of a cloud as an infrastructure for getting compute, storage and network resources on demand, then nothing necessitates coupling the cloud with virtualization. It became common to pair the two mainly because of the complexity involved in provisioning non-virtualized resources and the limitations on partitioning a given bare metal machine.
As I noted in one of my previous posts, Bare Metal Cloud/PaaS, there are ways today to provision a bare metal machine from an image and partition it just as we would with a hypervisor-based VM. There are already cloud providers that offer a choice of bare metal machines as part of their cloud infrastructure.
A new bare metal project in the OpenStack Grizzly release takes this a step further. It allows us to use the same compute API (Nova) to allocate a bare metal machine just as we would a virtual machine.
All we need to do to make the switch is to change the image type of our target machine, and the cloud infrastructure will know to map that request and allocate a bare metal machine instead of a virtualized instance.
With this option we can now run our entire Big Data workload on the cloud and not worry about switching environments depending on whether the workload happens to be sporadic or I/O intensive.
HubSpot OpenStack – Bare Metal Case Study
During the last OpenStack Summit, Jim O’Neill, CIO at HubSpot, showed how using OpenStack with a combination of a public virtualized cloud and a private bare metal cloud enabled a 4X increase in their infrastructure efficiency:
“We took this single image, picked it up from public cloud into a Rackspace-powered private cloud and saw a 4X increased efficiency running that workload.”

Moving from Existing Data Centers to the Cloud
Many existing Big Data and BI systems run in traditional data center environments. Moving those systems into an OpenStack-based environment isn’t going to be a walk in the park. This is where automation frameworks, such as Cloudify in combination with Chef and Puppet, make the transition smoother.
In this approach, we can automate the deployment of our existing Big Data/BI systems in a way that will be abstracted from the underlying infrastructure. We can later use this abstraction to run Big Data systems in our existing data center, and when ready, we can use the same deployment framework on an OpenStack-based environment without re-doing any of that investment.

Learn more through hands-on experience – a Real Life Experience on HP OpenStack
As often happens in this sort of conceptual discussion, the points and arguments in this post may sound abstract and hard to grasp. To make them more down to earth, we’ve made NoSQL datastores such as Couchbase, MongoDB, Cassandra and ElasticSearch, along with Big Data applications, available on demand on HP OpenStack cloud services. You can use this reference to launch any of those applications and then use the management console to browse through the recipes and customize them to your environment as needed.
It is also worth pointing out that the recipes behind this project are available on GitHub and can easily be deployed on your cloud, in your data center or even on your desktop in pretty much the same way. If you are interested in more details, please post a comment on this post or in the Cloudify forum.
For more on Big Data on OpenStack, come check out my presentation at Cloud Expo East on Tuesday, June 11th at 8:15am in the “Cloud Computing and Big Data” Track.