In this special edition of the Cloudify Tech Talk Podcast, we delve into a unique case study with special guest Haim Yadid, Director of Platform Engineering at Next Insurance. We cover all things DevOps, from Terraform and Kubernetes to Elastic Beanstalk, and how they relate to a very specific and successful use case.
Guys, welcome to a special edition of our Cloudify Tech Talk Podcast. We have very special guests with us. We have our usual superstars at Cloudify here. We’re going to be talking about a lot of things today, including Terraform, Kubernetes, Elastic Beanstalk, and how they relate to a very specific use case. So to talk more about it, I’m going to pass you over to our CTO, Nati Shalom, who’s going to expand on this.
Hey, thanks, Jonny. So with me is a friend from, I would say, a long time ago, Haim Yadid, who now manages platform engineering at Next Insurance, which is an interesting company in itself: a start-up that is changing the insurance landscape. I had a call with Haim a couple of weeks, or maybe even a month, ago. During that discussion I discovered a lot of interesting things about the choice of platform, the choice of language, and the evolution of Java. Haim comes with a lot of background in Java, and I thought the choices they made at Next Insurance along the way might be interesting to more people in the industry. That’s why we invited him to be a guest speaker on this podcast. So Haim, why don’t you introduce yourself and we’ll take it from there.
Thank you, Nati, for inviting me to this podcast. So I’m Haim and I’m currently leading the platform group at Next Insurance. I will talk a little bit about the company because it’s very relevant for the intro. Next Insurance was founded at the beginning of 2016 by Guy Goldstein, Nissim Tapiro, and Alon Huri. This is their second venture. The first one, called Pageonce, was sold to Intuit for, I think, something like $350 or $360 million a few years earlier. At the beginning of 2016, they decided to found Next Insurance. They got initial funding of $30 million and the offices opened at the beginning of May. I joined Next Insurance at the beginning of its operation.
Basically I was the first backend developer and I started by leading all the backend operations at Next Insurance. The front-end group was led by another developer, and both of us reported to Nissim Tapiro, who is the VP of engineering at Next Insurance. Next Insurance is disrupting the small business insurance field, which is an area of over $100 billion of premium per year in the US alone. So basically it’s a huge market, very fragmented, with technology that is very outdated. The premise and the story of Next Insurance is to make a huge change in this industry and make purchasing, managing, and working with insurance for small businesses much more fluent, much more transparent, and much more tailored, so that it really becomes a good and comforting experience, as it should be, and not the frightening experience that it is now or was in the past.
Who are the current players, the legacy players that are in the market right now?
I think that all of the insurance companies have something in this area. The market is very fragmented. There is no real brand or leader in this area, and this is where Next Insurance comes in. We really hope to be the brand leader of insurance for small businesses in the future.
So it’s part of the new trend within insurance, where digital transformation opened the door for new startups to disrupt, as you said, the market of those, I would say, well-established insurance companies that were very legacy oriented. Coming with a more digital type of experience is, I’m assuming, the main disruption that you bring to the table. Now, in our discussion you mentioned that there are challenges that are unique to that business, and that it’s not necessarily scaling; it’s the flexibility and the permutations that you have to go through. Maybe you could expand a little bit on that.
Yeah. So I think that the insurance field for small businesses is quite unique because you need to give answers for a lot of kinds of businesses, which are very different from one another. For example, a business owner can be a clown. You can be a contractor. You can be a doctor or you can be a lawyer. Each one of these segments is very different, with very different needs and very different terminology, and being able to support these different segments in an efficient way is a real challenge. On top of that, you need different kinds of insurance. For example, if you make a mistake or hurt someone physically, that is liability insurance. You need to have insurance for your car. You need to have insurance for your workers. You need to have insurance for your equipment. You need to have insurance for your property. You need a lot of very different kinds of insurance, and each one of them is a whole world.
In addition, there is another complexity in the matrix: insurance is a highly regulated field, and every state in the United States has different rules and different regulations. So the matrix is actually three-dimensional. Products vary along these three dimensions, and we need to be able to tailor them efficiently with our software and with our product.
So in essence, basically, you’re saying that if you look at the clown, as you mentioned, or a contractor or whatever, each one of them has different permutations of the things that they would be using from the system. And therefore there’s a lot of difference between one user and another, potentially one tenant and another. As opposed to, if you like, regular SaaS, which is kind of a cookie-cutter, one platform for many users, in this case it’s one platform for many users, but behind that there’s a lot of variation between each user. That’s the complexity that I think you mentioned between this clown and the contractor and the others. Sorry for picking up on the clown example.
Yeah, it’s good to pick up on clowns because it’s funny. So the large amount of variation means that we need a system that is very flexible and able to support a lot of variations, and we need to be able to define each of the variations, hopefully, without changes in the code. So as we go forward, we are trying to move from a system that is in some cases configured by code to one that is configured by data. You need to be able to specify some kind of language that enables the subject matter experts, the people who create the insurance products, to express the needs of an insurance product without having to master software engineering and programming.
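As an illustration of "configured by data" over the three dimensions described above (business segment, kind of insurance, state), here is a minimal Kotlin sketch. Every type, name, and value in it is hypothetical and chosen only for illustration; it is not a description of how Next Insurance models its products.

```kotlin
// Hypothetical sketch: all names and values are illustrative assumptions.
enum class Segment { CLOWN, CONTRACTOR, LAWYER }
enum class Coverage { GENERAL_LIABILITY, WORKERS_COMP, COMMERCIAL_AUTO }

// One cell of the three-dimensional matrix: segment x coverage x state.
data class ProductKey(val segment: Segment, val coverage: Coverage, val state: String)

// What a subject matter expert would fill in for a cell: plain data, no code.
data class ProductRule(val available: Boolean, val minLimitUsd: Int)

class ProductCatalog(private val rules: Map<ProductKey, ProductRule>) {
    // Returns null when nobody has defined a rule for this cell.
    fun lookup(segment: Segment, coverage: Coverage, state: String): ProductRule? =
        rules[ProductKey(segment, coverage, state)]

    // Cells without a rule are treated as "not offered" rather than as an error.
    fun isOffered(segment: Segment, coverage: Coverage, state: String): Boolean =
        lookup(segment, coverage, state)?.available ?: false
}

// In a real system this map would be loaded from a database or a file,
// so that adding a cell does not require a code change.
val sampleCatalog = ProductCatalog(
    mapOf(
        ProductKey(Segment.CLOWN, Coverage.GENERAL_LIABILITY, "CA") to ProductRule(true, 300_000),
        ProductKey(Segment.CONTRACTOR, Coverage.WORKERS_COMP, "TX") to ProductRule(false, 0)
    )
)
```

The point of the sketch is that supporting a new segment, coverage, or state means adding rows to the rule data, while `ProductCatalog` itself stays unchanged.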
And you also mentioned, if I recall correctly, on the call that there was a business impact to every feature. Again, that’s slightly different from, I would say, an infrastructure product like the one that we’re dealing with. The agility, how fast you can deliver a feature, has a more direct impact on the business than it might in other systems. Can you elaborate or expand on that?
So I think that the ability to iterate fast is very important to every software company. I think that’s the main difference between what happened 10 or 15 years ago and today. Today, we really want to move faster, and the ability to make small changes fast is very important. It’s one of the biggest differentiators of the approach that we’re taking, and we’re trying to improve all the time. The ability to tailor and modify the product fast is what is going to bring the competitive edge to Next Insurance and basically to every company that wants to be competitive. We need the ability to make changes fast. We need to roll them out safely to production. We need to be able to measure their effect, sometimes without a lot of data because of the fragmentation into different segments of the product. And this is something that is really important to us.
Yeah, so I think for many, as you said, moving fast is important. I’m trying to touch on the business impact in your case. In some cases it’s indirect; I think in your case the impact of moving fast is more direct, because when you talk about features, it’s basically features like a new insurance product, a new regulation, or potentially a new region that you can support, things along those lines. Again, if you could say a few words about that, then we’ll move to the more technical part of the discussion from here.
So building an insurance product is a complex task, and it takes some time. You need to define it correctly. You need regulatory approval. You need to implement it. You need to make sure that the implementation is according to the plan and complies with the regulation. It takes time. It’s not a feature that you develop in a day. However, the ability to make small changes after you define the basic product is extremely important. Being able to make small changes, release them to production on the same day, measure them, and roll them back if there is a problem: these are things that are very important. And the fact that we are doing continuous integration and continuous deployment, pushing to production several times a day, not only from the infrastructure perspective but also in our insurance infrastructure, is very important for being competitive in this field.
Okay. So let’s move on to the technical choices in the architecture behind it. One thing that you mentioned is that it’s less of a scaling challenge and more about the ability to iterate faster, and I think we discussed that. Let’s start with the choice of language. You chose to build on Java. I saw that in one of the discussions on Facebook, some people looked at that and said, kind of as a joke, oh, a new company with a legacy language, and you explained that it’s not really Java, it’s Kotlin. So maybe you could start with that. I know that you come with a lot of Java background, so why did you choose Java for a new startup? And then the choice of Kotlin as part of that.
Okay. So when someone says Java, it actually means three things. There is the language itself, there is the platform that runs the compiled bytecode, and there is the ecosystem. The Java language has been a bit stalled for a period of time from a feature perspective, and its verbosity is a bit problematic for a lot of developers. However, I really think that the JVM, the platform that runs Java bytecode, is one of the most advanced platforms that we have today in terms of performance, tooling, and the ability to create a very robust product that relies on something you can really count on. I think that’s a very important feature. The third thing is the ecosystem itself, the third-party products and the open source, which is very advanced and very capable in the Java world.
So the two latter aspects that I mentioned are, I think, very well covered in Java; the JVM platform and the ecosystem are very good. The language is lagging a little bit behind, and for that reason we decided to choose Kotlin, which is a relatively new programming language. It made some very nicely balanced design choices that put it in a very good position as a language of choice, and it solves a lot of the problems that bother developers on the Java platform. So when we start…
Can we discuss…yeah. Just one point here, and maybe Alex, you can join me on these questions. I know that there’s been an attempt to modernize Java with Scala and the kind of languages that try to take the same approach that you mentioned, which is leveraging the JVM as a platform because of its robustness and all the things that you mentioned, and addressing the developer experience by creating a different language on top of the JVM. We’re kind of talking about two camps here: the languages that provide a very good developer experience but lack the maturity that the JVM provides, and Java, which is very robust but lacks the developer experience, especially the ability to iterate fast with the language and do all the things that other languages like Python and Go have been very good at. Scala was an attempt at that, but I think it didn’t provide a good developer experience either. They tried to capture the maturity of the platform but didn’t really deliver the developer experience that was expected.
Yeah. So I have some experience with Scala and have also worked in organizations that used Scala. I think that there are two major problems with this language. The first one is its complexity. Scala is a very powerful language, it was designed to be powerful, and powerful languages require very capable developers, who are sometimes hard to get. What you get is that even good developers struggle with the complexity of the language’s features and take it in the wrong directions. I think that Scala has been a bit too complex to be used in the mainstream and by the average developer, and as a result it led to a situation where people write Scala code that is unreadable and very hard to understand.
For those who are less familiar: Scala was really born to take advantage of parallel programming, which I think was becoming more common with multi-core architectures, with copy-on-write and some other features that came with the language. But as you said, it was written almost as a scientific language and resulted in unreadable code. And maybe I’ll ask Alex to jump in here. Alex, did you have any experience with Scala?
No, unfortunately, that is one of the languages I haven’t touched. One of the reasons is, as Haim mentioned, that when I started to investigate whether I wanted to get on board, I looked at the capabilities and the pros and cons. The pro, of course, is that it’s a very powerful language, but the con was that the code might very quickly become very complicated and unmaintainable, and you need to be very skillful, or have a very skillful team, in order to work and proceed properly at a good cadence.
I think Spark gave Scala a little bit of, I would say, a run, because Spark was becoming popular. But other than Spark, I’m not familiar with other projects that have been widely adopted and are still using Scala today. Haim, maybe you could correct me on that, and then let’s move to the choice of Kotlin afterward.
Can you repeat the question?
Yeah. I’m asking whether you’re familiar with other projects that use Scala besides Spark, other projects within the Java world. And second, why do you think Kotlin wouldn’t repeat the experience that we had with Scala, which was, you know, that it got a lot of hype and then almost vanished? Why do you think Kotlin would be a different experience, from your perspective?
So I think there are a lot of projects that use Scala and a lot of companies that use Scala. As a hype, I think that Scala is on a decline right now. I want to mention another problem that I think is worth mentioning, which is Scala’s compatibility with the Java ecosystem, and the approach to backward compatibility that the Scala engineers and designers took. Basically, there are binary compatibility issues between Scala versions, and you need to recompile code for a newer version of Scala (at least that has happened a few times in the past), which is very unnatural for Java developers, who are used to compiling once and running the result a few years later. The other compatibility issue is that all the data structures and collections of Scala are very different from the collections of Java and were mostly redesigned.
I think they are essentially better, and they are more immutability-friendly, in order to support all this parallelism and functional programming. But eventually, when you try to work with third-party libraries from the Java ecosystem, it’s very hard to combine them with Scala because you need to do translations all the time. So that’s the second aspect that is problematic with Scala. Kotlin is different because when the designers of Kotlin wrote this language, they were looking at Java and at Scala and they were trying not to repeat the same problems. So in some sense, Kotlin is less purist. For example, behind the scenes, all the collections are Java collections, with the limitations that they suffer from. And there are some features that they decided not to include in the language just to make sure that interop with Java is maintained properly.
And if features are too powerful or too complex, they are not in Kotlin. So there are some features that took inspiration from Scala and other features that were left out. For example, operator overloading in Kotlin is very limited, but I think this is great. I’m really frightened when I see what operator overloading in Scala has become, with all the spaceship operators and such, which leads to very unreadable code. So the design principles for Kotlin were to make a language that is practical, take the features that are well proven, and not try to create experimental features that have not been tried in other languages. By the way, I’m not sure that this will not change in future releases of Kotlin, but at least at the beginning, Kotlin was not very much…
Disruptive to the Java language. So I could use, for example, a library that was written in Java and call it from Kotlin, and vice versa, I’m assuming. From a data structure perspective, they kept the compatibility so that I can actually do that, versus, I would say, Scala, which, other than the fact that it was running on the JVM, was very different from a data structure perspective, and therefore calling a Java class from within Scala, and vice versa, wasn’t that trivial.
Yes. It involved some friction.
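To make the collections point concrete, here is a small Kotlin sketch of the interop discussed in this exchange. It uses only `java.util.Collections` and `java.util.ArrayList` from the JDK; the function names are made up for the example.

```kotlin
import java.util.Collections

// A Kotlin List is a java.util.List on the JVM, so a Java API such as
// Collections.max accepts it directly, with no conversion step.
fun largest(values: List<Int>): Int = Collections.max(values)

// A collection created with the Java class works with Kotlin's own
// collection functions such as map.
fun doubled(javaList: java.util.ArrayList<Int>): List<Int> = javaList.map { it * 2 }

// Mixing both directions: copy into a Java ArrayList, sort it with the Java
// utility, and return it as an ordinary Kotlin List.
fun sortedWithJava(values: List<String>): List<String> {
    val copy = java.util.ArrayList(values)
    Collections.sort(copy)
    return copy
}
```

None of these calls needs a wrapper or a translation layer, which is the kind of friction described for Scala’s redesigned collections.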
Yep. Give me a cool feature. You know, as a developer, I’m looking for cool languages, something that will be easy to work with. What is very different in Kotlin (and maybe Alex, you could comment on that as well), something that will attract developers to the language, do you think?
There is no single such feature. No one thing is really cool; there are things that are very nice, and overall they make it cool. For example, I can give you data classes: a class where you only need to specify the members. You don’t need to write the getters and setters. You don’t need to write hashCode and equals, and you get copy functions with it. A class with a lot of boilerplate in Java becomes a single line in Kotlin. But this is not new; the same thing exists in Scala and in other languages. So this is very cool if you come from Java. Another thing is null-pointer validation, the same as in Swift, if you’re familiar with this feature.
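A short Kotlin sketch of the data class feature described here. The `Policy` type and its fields are hypothetical and exist only to show what the compiler generates.

```kotlin
// Hypothetical domain type, used only to illustrate the feature.
// The single declaration below gets equals, hashCode, toString, copy,
// and property accessors generated by the compiler.
data class Policy(val holder: String, val state: String, val premiumUsd: Int)

// copy() creates a new instance and changes only the named properties.
fun movedTo(policy: Policy, newState: String): Policy = policy.copy(state = newState)
```

For example, `Policy("Bozo", "CA", 500) == Policy("Bozo", "CA", 500)` is true because equality is structural, and `movedTo(Policy("Bozo", "CA", 500), "TX")` yields a policy that differs only in `state`.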
Alex, anything that you wanted to ask or add to that comment before we move to the next topic?
Yeah. I think that the main beneficiaries of the new or extra nice features will be developers who are coming from Java. What Kotlin did, as was mentioned, is borrow some features from different languages. For example, I do like extension functions. Say you have classes that you are not managing, for example strings (or lists; a list is a better example), and you want to add some function on top of them. In Java, you have to create a wrapper class and add the extra functions there, whereas in Kotlin you can just extend the same class.
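A minimal Kotlin sketch of an extension function on a list, following the example above. The helper name `secondOrNull` is invented for illustration and is not part of the Kotlin standard library.

```kotlin
// Adds a function to every List<T> without wrapping or subclassing it.
// `secondOrNull` is a made-up helper, not a standard library function.
fun <T> List<T>.secondOrNull(): T? = if (size >= 2) this[1] else null
```

At the call site it reads like a member of the class: `listOf(1, 2, 3).secondOrNull()` returns 2, and a one-element list returns null. Extension functions are resolved statically, so they add a calling convention rather than modifying the class itself.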
I also love the option in Kotlin to create your own DSL. You can take it in a direction where your DSL reads more like English, and what you write is interpreted into functions that do the job. There are small things that were borrowed from different languages, and when you go through each one you can say, okay, this one was borrowed from C#, and this one was borrowed from Scala or Swift. In general, what I personally love about Kotlin is that onboarding is easy, and for someone who is going over someone else’s code, it’s very readable and easy to understand what’s going on. I think they’re going in the right direction. And I’ll add just one more point: if you look at the Java project and the upcoming versions, you’ll see that they are more and more adopting, or at least it looks like they are trying to take, the features that are in Kotlin and make them native in Java. So it will be more powerful, with the extra tooling as well.
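The following Kotlin sketch shows one common way to build such a DSL, using a lambda with receiver and an infix function. The `quote`, `cover`, and `upTo` names and the whole quote-building domain are hypothetical, chosen only to show how the code can read close to English.

```kotlin
// Hypothetical mini-DSL for describing a quote; all names are illustrative.
class QuoteBuilder {
    var business: String = ""
    private val coverages = mutableListOf<String>()

    // Starts a coverage line; the infix function below completes it.
    fun cover(name: String): CoverageLine = CoverageLine(name)

    inner class CoverageLine(private val name: String) {
        // `infix` allows the call to be written as: cover("x") upTo 100
        infix fun upTo(limitUsd: Int) {
            coverages += "$name up to $limitUsd"
        }
    }

    fun build(): String = "$business: " + coverages.joinToString("; ")
}

// A lambda with receiver: inside the block, `this` is the QuoteBuilder,
// so its members can be used without qualification.
fun quote(block: QuoteBuilder.() -> Unit): String = QuoteBuilder().apply(block).build()

val sampleQuote = quote {
    business = "clown"
    cover("liability") upTo 1_000_000
    cover("equipment") upTo 20_000
}
```

Here `sampleQuote` evaluates to the string `clown: liability up to 1000000; equipment up to 20000`. The block passed to `quote` is ordinary Kotlin, so it is checked by the compiler like any other code.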
Excellent. So I think we can talk about languages for hours. That’s a (…) topic. There are a lot of nuances and, for me at least, talking about languages is like talking about literature or music. But we wanted to cover some other aspects on this call, so let’s move to the next step, which is your cloud infrastructure, your choices around it, and how those choices changed over time. Like many startups, you were born in the cloud, and that’s probably the disruption that enabled you to move fast. You mentioned that you started with Elastic Beanstalk on Amazon. Maybe you could say a few words about that choice and your experience there, and we can take it from there.
So just last sentence on Kotlin if I’m…
It’s not only theoretical: we have almost 60 developers writing in Kotlin and they’re really enjoying it. The migration from Java to Kotlin has been very smooth for almost all of them. So that’s my last sentence on Kotlin; it works, at least for us. Now moving to the cloud. As I mentioned, we started developing at the beginning of May 2016, and I think that the first release of our product to production was one and a half months later. So this is quite amazing: you’re able to release something that produces money, a very small amount of money, but at least something, and that integrates with other systems, in a one-and-a-half-month timeframe with two developers. And I think the only reason that we were able to do so is that we were using the cloud.
And I think this is an amazing revolution for all of us, that we’re able to bootstrap a company so fast. We were using Amazon from the beginning; we chose to go with AWS. We knew that we wanted to have a CI/CD system from the beginning, and we developed the infrastructure in such a way that it would support that. We used Jenkins as our server for running the build and deploy pipeline, and we were looking for a simple system in Amazon that would help us deploy services easily. I have to admit that we chose a microservices architecture from day one. We didn’t start with a monolith and then move to microservices later on. So it might be considered premature optimization, but I think it turned out to work well. I think that Elastic Beanstalk was the simplest system to start with; we’re talking about 2016. Back then, at least for me, it was not yet clear that Kubernetes was going to be the leader in the market. We had other options like Nomad and Mesos. And Nati, do you remember any others?
I think you touched on many of them, but there was also the one from (…) try to remember. And obviously OpenShift came afterward. There were a few generations of those platforms from Pivotal and (…) later became Kubernetes. They were not Kubernetes before, but they were, and obviously there was the (…) via Salesforce. So that whole concept of platform as a service was really new and, I think, popular at that point in time, because the alternative was very complex: you needed to go directly to the infrastructure, and infrastructure at the time was almost bare metal, or a little bit more than that. So the alternatives weren’t there, I think, as they are right now.
Yeah. So Elastic Beanstalk is a deployment system which gives us rolling deployment and also blue-green deployment. We used rolling deployment. It also enables you to manage more than one environment of a service, which basically share the same version history and codebase. And I think it’s relatively simple, relatively easy to start with. And for the time I think it was a good choice because…
Just to touch on that for those who are maybe younger and less familiar with that stage in the cloud evolution: Elastic Beanstalk falls into the category of platform as a service. For those who are familiar with application servers, a platform as a service was kind of a modern application server that takes away a lot of the operational aspects of an application server, meaning that you don’t have to deal with (…) scalability, how you expose it with a public IP and DNS, how you connect it to a database, and all the other operational aspects of the application server. But it was very opinionated and you needed to write things in a certain way; obviously it needed to fit well with a web application. If you have something that is more backend (…) the right choice.
It was really an evolution of those application server concepts that were very popular, especially in Java and J2EE in the early 2000s, and it became the cloud version of that. It was obviously a much simpler version, but it was very opinionated and in that sense fairly limited. And in the cloud we always deal with those trade-offs. Just as with the choice of language, where we were dealing with the trade-off between robustness and developer experience, in the platform choice we’re dealing with flexibility versus simplicity. How do we, on the one hand, give a very simple experience to developers and hide a lot of the infrastructure complexity, while at some point we do need to have access to the infrastructure? And how do you balance those two?
And at that point in time, I think the problem was that you had to choose either-or: either you go for flexibility at the expense of simplicity, and that had a high cost actually, or you go for simplicity but pay a heavy toll on the flexibility side. That, I think, was the world that was there for you in 2016, so the choice of Beanstalk made a lot of sense. And then the question is, where did you actually hit the wall with Beanstalk?
I think there are some things that bother us about Elastic Beanstalk. The first problem is that automation with Elastic Beanstalk is hard. For example, if we are using a Kubernetes cluster and we want to relaunch it, or relaunch a service on a different cluster, it’s very easy. Once you have all the infrastructure in place, it’s very easy and it’s feasible to develop a mechanism that makes sense in this area. For Elastic Beanstalk, it turned out to be very hard to develop infrastructure as code. Probably also because we really wanted to move away from Elastic Beanstalk, we didn’t want to put all our efforts into it, but it’s much more complex than we anticipated. In the beginning, I must shamefully admit, we did most of the infrastructure building with ClickOps and not infrastructure as code. And when we later tried to do it as infrastructure as code, it was very cumbersome and complex.
So just to summarize that, because I think it’s an important point: the point at which you hit the wall, if you like, is when you needed to start automating your entire development environment. It was good for running an application and deploying code into, let’s say, a sandbox environment; that was easy. But part of the DevOps cycle is that you also need to create an entire environment from scratch, for development, for production, and customize the infrastructure around it.
Haim: And for the (…) for example.
Nati: Right. Yeah, because the concept of platform as a service was that you only need to push code. The whole concept was really to hide the infrastructure, and therefore it makes sense that it wasn’t really built for automating the infrastructure, because it assumed ownership of that. But I think the evolution is that when you’re doing continuous deployment, continuous (…) it’s never just code pushes. It’s more than that. And you mentioned here, and you mentioned some other use cases, in which you need to create those entire environments. And that’s where those concepts start to break. Alex, maybe before we move to the Kubernetes side, anything that you wanted to add to that?
No, I think you’ve covered everything pretty well.
Okay, cool. So that’s kind of the point where you said, okay, we need something better, and you started the journey with Kubernetes. When was that?
So our journey to Kubernetes has been quite long, a bit longer than I expected, and I think that we are in the last miles of it. I will try to describe our journey. At the beginning we said, okay, let’s move to Kubernetes. We started working on that, and it turned out to be much more complex than we had thought. Then we stopped and thought, is this exactly what we want to do? Is this what we want to do at this time? I think it was almost two years ago that we said, do we really want to move our Elastic Beanstalk systems to Kubernetes? Is that the most painful thing we want to address? And the answer was no. The most important thing that we wanted to achieve was the ability to (…) environments, mostly for testing. So we changed our focus and decided to launch Kubernetes environments for dev purposes. That was our first step with Kubernetes.
So at that point, just to make it clear, production was still on Elastic Beanstalk, but development was done in Kubernetes?
Yes. So basically what we want to achieve is to be able to run our suite of tests, which is very comprehensive, before we merge our code to master, which initiates the CI/CD pipeline that goes to production. So a developer is developing a feature and wants to be able to test it before it enters the main codebase. What we have is the ability to launch an entire system on a branch: we pull a branch, we launch all the services, and we run all the tests. So basically every developer can launch a system at will, and the system is created, runs all the tests, and terminates when it is done.
That’s kind of the GitOps paradigm in a sense, right?
And in that context, basically what you’re saying is that the developer creates a branch, and with that branch a new environment is created for them for testing that branch, with everything that is part of that environment, I’m assuming the database and other things, right?
Yes. So the database and the services, both backend and frontend services, and also the Jenkins slaves which run the tests, all run inside a namespace in Kubernetes, and they are created when needed and destroyed immediately after.
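The create, test, and destroy lifecycle described here can be sketched in Kotlin as follows. This is an assumption-laden illustration: the `Cluster` interface, its methods, and the `test-` naming scheme are hypothetical and are not real Kubernetes client calls or the actual Next Insurance tooling.

```kotlin
// Hypothetical abstraction over the cluster; NOT a real Kubernetes client API.
interface Cluster {
    fun createNamespace(name: String)
    fun deployAll(namespace: String, branch: String)
    fun runTests(namespace: String): Boolean
    fun deleteNamespace(name: String)
}

// Kubernetes namespace names are DNS labels: lowercase alphanumerics or '-',
// at most 63 characters, ending with an alphanumeric character.
// The "test-" prefix is an assumed naming convention.
fun namespaceFor(branch: String): String =
    ("test-" + branch.lowercase().replace(Regex("[^a-z0-9]+"), "-")).take(63).trimEnd('-')

// Creates an environment for the branch, runs the tests, and always tears
// the environment down, whether the tests pass, fail, or throw.
fun testBranch(cluster: Cluster, branch: String): Boolean {
    val namespace = namespaceFor(branch)
    cluster.createNamespace(namespace)
    try {
        cluster.deployAll(namespace, branch)
        return cluster.runTests(namespace)
    } finally {
        cluster.deleteNamespace(namespace)
    }
}

// In-memory stand-in that records calls, so the flow can be tried without a cluster.
class RecordingCluster(private val testsPass: Boolean) : Cluster {
    val log = mutableListOf<String>()
    override fun createNamespace(name: String) { log += "create $name" }
    override fun deployAll(namespace: String, branch: String) { log += "deploy $branch" }
    override fun runTests(namespace: String): Boolean { log += "test"; return testsPass }
    override fun deleteNamespace(name: String) { log += "delete $name" }
}
```

The `try`/`finally` is the important design choice: the namespace is deleted even when the tests fail, which matches the "destroyed immediately after" behavior described above.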
And how did you ensure compatibility with Beanstalk in terms of the code and the rest of the things?
So, as I mentioned before, our main systems are still on Elastic Beanstalk. We have a staging environment and a production environment. We run all the tests on the staging environment before we deploy to production. But this is really changing right now, as we are moving our systems and pipelines in different directions. Which, if you want, I can elaborate…
…what is happening now?
Yeah, go ahead.
So one of the limitations that we have now, which is really bothering us, is something that we knew would come someday, but we decided to go with simplicity at the beginning. Basically, and it relates to the previous podcast that you had, all our backend services are in a (…) code. It has a lot of benefits, already discussed in your previous podcast. But basically it meant that most of our backend codebase was in a single repo, and whenever someone made a change, especially at the beginning of our journey, we didn’t try to restrict changes to a certain service. It means that if you want to develop a feature, you make all the changes in all of the services and you just push a single commit. In that case, it didn’t make much sense to deploy the services independently. So we had a system that was microservices in theory, but the deployment was very monolithic. It worked quite well for a time, and then it started to make our life hard.
So, let’s stop that for a second here, just to elaborate that when you’re saying it was microservices, you mean the code was structured in a modular way, but the deployment was monolithic. Can you elaborate on that?
The code was structured in a modular way, and the different services were running on different machines in Elastic Beanstalk, as different services. But when we wanted to deploy the code, we deployed all the services at once.
Okay. So the monolithic side was the packaging side, meaning the operational aspect of how you run it and manage it was monolithic, even though the code was modular and even the components of the code at runtime were running as separate services. But you had to treat it as one big thing for deployment, updates, and, I’m assuming, things of that kind.
And it’s not very hard to change such a behavior; it’s only a matter of changing the deployment pipeline. But the point is that we had another problem, and we still have this problem: we had a very good framework for developing integration tests. And what happens when you have a very good framework for developing integration tests?
So you can see that Haim was also a trainer; he’s still checking that we’re still running and we’re still listening.
Yeah. So the problem with a very good integration test framework is that you start writing mostly integration tests, and for running integration tests you need the entire system. So most of the tests that were written over a very long period, until we shifted our approach, were very oriented toward integration tests and required the entire system to run. And that became the major limiting factor in breaking the pipeline into several separate pipelines, each of which can be tested separately.
Okay. So you’re basically saying that the integration tests were actually forcing that type of monolithic behavior, because you couldn’t really run individual tests on an individual component: each component still depended on the entire system running to actually do the validation and testing.
Got it. Interesting.
So what we are trying to do now, and it’s not an easy process, is to shift our tests to be more component tests that test the service in isolation, test the contract, and test the internals, without using this framework, which we’re trying to limit to very specific scenarios. Because eventually it’s very good to have some degree of integration testing, right? Each component can play very nicely separately, but in the end you don’t have anything working. But for our CI/CD pipeline, we are going in a direction where each service will require only the component tests in order to qualify for deployment to production. This is a process that is currently in progress. It’s a bit frightening, because we used to rely on these integration tests very heavily, and it’s shifting in, you know…
Where did that idea come from, because it does sound scary?
I think that in the long run, we won’t be able to treat the entire system as monolithic, and we cannot separate the monolithic deployment without taking the leap of faith that we’re able to test part of it. Because our engineering team is getting bigger and bigger, we cannot be in a situation where we need to test the entire system in order to deploy to production. The probability that it will pass every time we run it goes down as the number of developers increases.
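The point about the pass probability dropping as the team grows can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each change in flight independently leaves the full suite green with the same probability; real failures are neither independent nor uniform, and the 0.99 figure is made up. The function name `full_suite_pass_probability` is hypothetical.

```python
def full_suite_pass_probability(per_change_pass_rate: float, changes_in_flight: int) -> float:
    """Chance that a whole-system test run is green.

    Simplifying assumption: every change independently keeps the suite
    green with probability `per_change_pass_rate`, so the run passes only
    if all of them do, i.e. p ** n.
    """
    if not 0.0 <= per_change_pass_rate <= 1.0:
        raise ValueError("per_change_pass_rate must be between 0 and 1")
    if changes_in_flight < 0:
        raise ValueError("changes_in_flight must be non-negative")
    return per_change_pass_rate ** changes_in_flight


if __name__ == "__main__":
    # Illustrative numbers only: a 99% per-change pass rate.
    for n in (1, 10, 50, 100):
        print(n, round(full_suite_pass_probability(0.99, n), 3))
```

Under these assumptions, even a 99% per-change pass rate leaves a whole-system run green only a little over a third of the time once a hundred changes are involved, which is the pressure toward per-service pipelines described here.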
Got it. And what is the process of getting from the integration test model to this component-based testing? What did you have to change, and where are the hard points? Is it the interfaces between the components? Is it the ability to simulate the data or, you know, maybe scenarios?
Basically, to have the ability to simulate other services in the system. To be able to run, alongside the service, some kind of mocks that replicate the other services, so you will be able to test your service in isolation. So basically a component test, or what we define as a component test, is your service, the database of your service, and the schema of your service. We have, of course, different schemas for different services; services do not access other services’ data. So you have the service running, you have the database, and that’s all. And we try to shift as many tests as possible to this format. There are some services where it’s much easier and we are progressing, whereas there are others where it is harder…
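A component test of the kind described here (the service, its own database and schema, and a mock standing in for a neighbouring service) could be sketched in Python roughly as follows. Everything in it is hypothetical and chosen only for illustration: `QuoteService`, the stubbed pricing endpoint, the `quotes` table, and the canned price are not taken from Next Insurance’s codebase, and an in-memory SQLite database stands in for the service’s real database.

```python
import json
import sqlite3
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubPricingHandler(BaseHTTPRequestHandler):
    """Mock of a neighbouring service: always answers with a canned price."""

    def do_GET(self):
        body = json.dumps({"price": 100}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


class QuoteService:
    """Hypothetical service under test: owns its schema, calls pricing over HTTP."""

    def __init__(self, db, pricing_url):
        self.db = db
        self.pricing_url = pricing_url
        # Ignore any proxy settings so the call always reaches the local stub.
        self.http = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        self.db.execute("CREATE TABLE IF NOT EXISTS quotes (customer TEXT, price INTEGER)")

    def create_quote(self, customer):
        with self.http.open(self.pricing_url) as response:
            price = json.load(response)["price"]
        self.db.execute("INSERT INTO quotes VALUES (?, ?)", (customer, price))
        return price


def run_component_test():
    """Start the mock, exercise the service against it, return what was observed."""
    server = HTTPServer(("127.0.0.1", 0), StubPricingHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        db = sqlite3.connect(":memory:")  # the service's own private database
        service = QuoteService(db, f"http://127.0.0.1:{server.server_port}/price")
        price = service.create_quote("acme")
        rows = db.execute("SELECT customer, price FROM quotes").fetchall()
        return price, rows
    finally:
        server.shutdown()
        server.server_close()
```

The trade-off is that the test never needs the rest of the system, but the stub encodes an assumption about the other service’s interface, so it has to be kept in step with the real service as that interface changes.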
Okay. So, again, just to summarize that, and maybe Alex, you can comment on your experience with that. As you’re growing from a development perspective, what you find out is that the code is componentized. Maybe the code itself, even in its packaging, is modular, but once you start to create the testing environment for integration testing, you have to have everything in place. That means that every change needs to integrate into an entire environment. And because there are multiple moving parts in that context, the chance that the changes won’t deliver a stable environment for those individual tests, and that one component or another will break the test, is much higher. That slows down the agility, which I think we discussed earlier in the discussion. And that was the point at which you started to think about these componentized tests, so that each test can be done in isolation without breaking because another component breaks, and you kind of isolate each of the components individually. To do that, you created some sort of mocks, so that the interfaces with other components could still be exercised in isolation without bringing those components in. And that’s kind of the surgery that you’re going through right now to get to that type of environment. Is that a good summary?
Yeah. Yeah. I think that you’ve pretty much summed it up.
Alex, did you want to say anything about that?
I think it’s a very interesting challenge, because one of the main purposes of integration tests, and I think that’s where the initial solution came from, is to test the end-to-end solution: to see that if you’re changing something in one local service, everything is actually still working properly. And for this, for sure, you need to have the entire system up and running. It’s a very interesting solution, because once you’re going to the mock situation, each time you change any microservice it does require you to change all the mocks and all the interfaces in the microservices that are using it. So it’s a very interesting and very creative solution, I would say. It’s really interesting how it goes together. I believe they are in a microservice experience. With Kubernetes, it’s very easy to spin up an entire system very fast, but again, once we are talking about a certain number of microservices, for example 100 or 200 microservices, I believe it becomes very tricky to spin up an environment that big each time. So I personally would be interested to follow up, to see how [unclear 55:35] of this same challenge, because it’s so interesting.
And maybe one more point to touch on what you just described. So the development experience for you is running on EKS, is that right?
So in my previous experience, we were running on a, say, self-managed Kubernetes service: you manage it by yourself on AWS, but with kops. But we’re talking about something like 20 microservices. So there wasn’t an issue, for example, to create a clean environment on demand with just the specific service that was changed. I guess that’s not the same case or scenario as what Haim is talking about, because their system is much more vast, with the complexity he described, so I think it just wouldn’t serve it. It won’t fit his solution, and that’s why it’s so interesting. Because once we are talking about a microservices architecture at scale, there are always pros and cons between productivity and velocity. And those are the interesting topics: how you can creatively solve them.
Just to summarize on the environment. So right now, what you’re saying is you’re running on Amazon, I’m assuming EKS?
We started with kops and lately we started examining EKS. So some of the clusters are running over kops. Some of them are running over EKS, I guess.
By the way, what’s your experience between the two?
So it’s not my personal experience, but I know that our DevOps team tried EKS, I think, two years ago, ran into a lot of issues, and decided to go with kops. Later on, we got feedback that most of the issues might have been solved, and we tried spinning up EKS clusters as well. Currently, the majority of the workload is on kops, but we do have some EKS clusters, which work fine. So if it works for us, we will move to EKS, because we don’t want to manage services when we don’t need to.
Got it. And just to clarify, you’re still using Jenkins for pipeline management, right?
So where are you now in the total journey? I know that we have a lot of additional topics to discuss here, and I think we need to wrap up towards the end of the session; we might do another call to go deeper into some of those aspects, because it is very interesting. So where are you now in that journey? You’re moving from Beanstalk to Kubernetes. You’ve done it in stages, which is a pragmatic approach: running Beanstalk alongside Kubernetes and using Kubernetes for development. And now you’re moving gradually from that into an all-Kubernetes type of environment. And you also mentioned transitioning from kops to EKS. So it sounds like within the few years that you’ve been running the startup, you’ve changed a lot of the development infrastructure, which is in itself an interesting experience. Can you elaborate on how you make those decisions, how fast you can, you know, even replace environments, and kind of move into that? And what’s the impact on engineering when you’re making those choices? I know it’s probably a long answer, so try to summarize it.
Switching infrastructure is not easy especially when you have production up and running, you have engineering up and running, which you don’t want to disrupt. And it’s much harder switching infrastructure than starting with the right infrastructure at the beginning. And it’s also very hard to predict what will be the best infrastructure to start with.
Is there anything that you would do differently, like in retrospect…when you look at things in retrospect, anything that you would…?
Yes, in retrospect, I think that I should have started with Kubernetes at the beginning. We had a very small operation, and even if it was not working very, you know, swiftly, we could have solved the problems in time. The evolution of the system would have been much, much easier than it is today. That’s my thought on it; maybe I’m hallucinating or something.
Yeah, I think there are two camps that I’m seeing, you know, like those who would say Kubernetes is overkill for certain problems and maybe good only for very large-scale systems, and on that point they would still be right. But even the answer to that is changing over time, because, you know, what you would have considered complex a couple of years ago would be less complex today. So we’re talking about an environment in which the answer to the same question could be vastly different in a year’s time. And I think you’re a great example of that: if you’re looking at the decision that you had to take at the time that you moved to Beanstalk, it wasn’t that clear. Like, if I would go to (…) managers at that point in time in 2016 and ask them for their choice, a lot of them would say Kubernetes for sure.
In retrospect it’s easy, but back then, I think it wasn’t that clear a choice, especially as the complexity was very, very explicit. And I think the main lesson that I’m taking from this is that every engineering team in the cloud space needs to really adopt the mindset that, you know, it is painful, but it’s also inevitable, because technology is moving much faster in the cloud and you have to be able to transform yourself even in a back-end type of system. And especially, you know, when you’re not selling infrastructure, you’re selling the product, a lot of the changes that you’re talking about are not really visible to the customer, who doesn’t really care about them. So it really makes those choices even harder, because you’re now investing a lot of engineering work that doesn’t necessarily map into a direct ROI from a business perspective; it is a long-term ROI, which is the agility and the velocity, and which is going to be a hidden cost for many users.
So, how did you get the buy-in from the business to actually make those choices? What was that process?
I think that the business is not very, I would say, interested in this decision. I think the buy-in was harder inside engineering, among the different functions within engineering. So basically I think that the business trusts engineering to make the right decisions and allocates resources to support the infrastructure and the technologies that we use. So we do have people for whom this is their job. I think that it was not easy to convince people inside engineering that moving to Kubernetes is the right step. But when it comes to the business, once the operation is up and running, it delegates these decisions to engineering.
Right. So I think in your case, the way you kind of carved it out was running with the two systems: putting Kubernetes in dev, then letting the rest of the team feel comfortable with that, and gradually moving to a full Kubernetes cluster, right?
It was a bit more complex than that, because we also had to move to a different region. We made a mistake at the beginning: we started with the region US West 1, which is California. It seemed like a very good idea at the beginning, because it was close to the office in Palo Alto, but California is apparently one of the regions that are a bit problematic in the US, because new features get rolled out there mostly last. I think that the first one that gets the updates is Northern Virginia, then there are other regions, and California is left mostly outdated in lots of the features that we needed. So we decided to move to Oregon. At the beginning, we thought that we would do the move to Oregon on top of Kubernetes, but then we decided to take a smaller step: first move to Oregon, and then move to Kubernetes.
So it was a long journey, and now we are starting to move services… We’ll probably soon be in a position where we have the first service running on Kubernetes while the others are running on Elastic Beanstalk. And then we will move them one by one until we finish. It is going to be soon.
Excellent. So, I think we need to kind of wrap up. It was a very interesting discussion. For me, I didn’t know that this would be the summary, by the way, but it came out of this discussion that it’s another great example that the only constant is change. That’s kind of the mantra that I keep on repeating. I’m always kind of surprised to see that, you know, people are surprised by the speed at which they need to make those choices on changing between infrastructure, languages, and platforms. Because at any point in time, you look at the choices and you say, okay, this is the choice, it looks very well adopted in the industry and everyone is using it, so it’s probably not going to change anytime soon. And then a year or three years later, the landscape looks very different and you have to make a choice, and it becomes very painful.
So our mantra, at least in part, is really that the only constant is change, and you need to build your infrastructure and make your choices such that they will be adaptable to those changes, so that those decisions won’t be that painful and won’t take such a long time, if you put the right (…) in the right places. Again, this is not the topic of this discussion, so I don’t want to steer it into that other topic; I just wanted to repeat that mantra, which, at least for me, was reinforced by this discussion.
And Haim, first of all, thank you very much for going through the details here. I think it will be very interesting for other engineering teams and other startups to listen to this, because there’s a lot of interesting stuff here, both on the language, you know, thinking of Java not only as a legacy language but as something that can serve a very innovative startup, and also on how to make a change in an organization.
I think the pragmatic approach that you’ve presented here was an interesting way to move from Beanstalk to Kubernetes. You mentioned the mono-repo and how integration tests create a monolithic experience, even though the code is not necessarily monolithic, and how you get around that, which I think was a very interesting point of discussion. And we also discussed how you do the internal buy-in and transformation. We didn’t dive too much into that, but I thought that was also interesting, and I have a lot of appetite for taking a deeper dive into (…) specifically. I found that the experience of the journey that you’ve taken here is something that is close to my heart, so I would probably want to follow up on that. So thank you very much for sharing all this information and for coming on; I know it’s your first podcast. It was excellent in my view. Any closing remarks on your end?
I’m very honored to have been invited, and it was a very good experience for me as well. So thank you.
Excellent. We’ll probably talk more. Jonny?
Thanks, everyone, for participating in such a really, really great session, really full of awesome insights. Thanks to our very special guest, Haim Yadid, and also, of course, to Nati and Alex, part of the Cloudify team. Our next session, I believe, will be quite exciting. Nati, do you want to expand on that a little bit?
Yeah. So speaking of languages, we’ll go through the different automation languages. There are different choices if you compare HCL, which is the HashiCorp language, TOSCA, which is what Cloudify was founded on, and Ansible. Each one has different choices, a different selection, and, I would say, different decisions on how the automation modeling was built in. And we’ll also discuss, let’s say, a toy that Alex is playing with: how to make the experience with YAML files safer by adding the ability to do code completion in a nice way on the YAML files. So it will be a very automation-language-rich type of discussion.
Very cool. Okay, so stay tuned for that. I’m sure that’ll be coming our way very, very soon. As usual, any supporting material for this podcast will be available on www.cloudify.co/podcast. So thanks to everyone for participating, thanks to everyone for tuning in and stay safe and we’ll catch you next time.