In this unique edition of the Cloudify Tech Talk Podcast, we take a dive into machine learning and AI as they relate to DevOps – all through the eyes of our special guest Joshua Odmark, CTO and Founder of Pandio. In this discussion Josh walks us through his experience with distributed messaging systems such as Apache Kafka and SQS, and explains why he chose Apache Pulsar over many other machine learning and messaging systems.
Nati: Excellent. Thanks so much, Jonny, and Josh, welcome to the podcast. I think we spoke a couple of weeks ago about the topics and I was overwhelmed by the background and history that you bring. As Jonny mentioned, this discussion is going to focus on AI and machine learning. It’s a pretty hot topic these days, but your background and history – I think you started the entrepreneurship journey very early on, when you were young, and not necessarily with AI. So I thought a quick introduction about yourself and your background would set the stage for the discussion before we jump into everything AI and everything machine learning. So just…
Josh: All right. Thank you very much, I appreciate that. My name is Joshua Odmark, I’m the CTO and founder of Pandio. I started my career very early. I started a company back in the early two-thousands, as a senior in high school, with somebody I met online, which was not a very common thing back then. That company took off and made me think being an entrepreneur was easy – we made $16,000 in the first week of running the business and ended the first month with $50,000 in revenue. So, it was a wild ride for me. We ran websites that were very popular; they received between 500,000 and a million unique visits, which in the early two-thousands made for quite a popular website. So that kicked it off and gave me a really good introduction to high-volume data and what that looks like.
I’ve been a serial entrepreneur ever since. I raised a million dollars to start a classified ad company, and also worked on a lead generation company out of South Florida. In the classified ad company, the ads were hyper-local. You could think of it like a Craigslist competitor with the added component that when you listed a classified ad on the website, it would place it on third-party websites to give you additional exposure. That was the whole pitch. The lead generation company was very interesting because this was in the late two-thousands, around 2009, and our whole angle was real-time bidding on leads, which was a relatively new idea. So, all of those experiences put me in the deep end with a whole lot of data and real-time data. Fast forward to recently, I was working at a hot startup out of Santa Barbara, California called Carpe Data, and they help do machine learning and AI for major insurance carriers. And after that…
Nati: Maybe just before we jump into Carpe Data, to summarize your entrepreneurship experience, which as I said is pretty interesting, especially given the age at which you started – in high school. For those who are less familiar with Craigslist, think of it as simplified yellow pages, very popular in the US and probably elsewhere. Not everyone realizes what it means, but I thought about the intersection… I’ve read through your experiences as not just dealing with the technology side of things – again, dealing with massive data – but also how to make a business out of it, which I think is pretty interlinked with your history. And at that time that was page ranking. You mentioned the ads and other things, and for those familiar with Google’s history, that’s pretty much how you made money out of data at the time. I think that’s the way to look at it. So, that’s an interesting intersection between entrepreneurship, technology, and finding a way to make money out of data, which was very early for the time. So that’s a quick summary of the introduction section. Let’s jump again into 2017. By that time, what age are you?
Josh: Oh, man, make me do some math here. So, in 2017 I was 34 years old by that time. So yeah…
Nati: Still pretty young.
Josh: Yeah, I suppose. It’s like when I hit 30, though, I felt like I was no longer young anymore. Your body tells you what you can and can’t do anymore. But yeah, I was 34 when I joined Carpe Data.
Nati: Yeah. Actually, I can tell you out of my age, which is a little bit further than yours, that at least for me, the 30s is where I felt the oldest in my life and I’m always saying it because you get the mortgage, and you get kids, and you get the family, and you get new startups, and careers and all those things pile up and all of a sudden, you’re coming almost out of university with a pile of duties on your head and it just buries you. So that’s really the time in which I felt the oldest in my life. And ever since I started feeling younger every year. At least that’s what I like to think about.
Josh: Well, I love to hear that because I just had my firstborn four weeks ago. Yeah.
Nati: That’s my mantra. So, I think I will wait a couple of years. I think Elan here can share some of these experiences, especially when we talk about the time of Corona and working from home. Elan is doing the heavy lifting here in terms of trying to maneuver kids over Zoom sessions, so he can sympathize with you. Good luck with the newborn, by the way. So, I think that’s a great introduction to Carpe Data. So why don’t you take us through your thirties?
Josh: So, I joined Carpe Data as employee number six; they were a very small company. The beauty of their position was that they had a lot of very strong contacts in the insurance space. And they were interested in me because at that time I had a lot of experience with growing startups when they either have an MVP or are working towards an MVP and they’re skyrocketing. So, on your traditional hockey-stick graph of growth, I was very good over the last 10 years or so at helping companies make the transition from when you’ve got a couple of customers to when you can’t service the customers fast enough. That was my specialty, and it was one of the reasons why they were very interested in me. So, they hired me to run their engineering department.
So, I built out your traditional engineering team. This is in the insurance space, so we had actuaries and data analysts and things like that. Their whole angle was to help insurance carriers use other types of data in their machine learning and AI for the purpose of predicting risk. That was a very fascinating position. That’s things like using social media data – Facebook, Twitter, et cetera – pretty much anything publicly available on the internet. We would help corral it, then clean it, and then help them learn things from it. I grew that team to about 30 individuals on the technical side before I left to found Pandio. And that particular project was very interesting because we were getting into the space of massive amounts of data.
These insurance carriers have so much data, it’s absolutely ridiculous. For example, one of those large carriers had satellite imagery of the entire United States just sitting on a cluster of servers. That was hundreds of petabytes of data, and that’s just one example. They had data about nearly everything you could imagine, because these big insurance carriers are in every industry, insuring things and doing things like that. So, they were opening up the kimono to us. It was a wild position to be in. They would pay us large amounts of money and say, here’s a huge amount of data, try to find some patterns in it. It felt amazing, because they’ve asked you to do something that’s not very specific, but it’s fun because you now have this huge amount of data and you’ve got to figure out what to do with it.
So, that’s when I started to dive into the DevOps nature of dealing with a huge amount of data and doing inference against it. At that particular time, I wasn’t too familiar with data science, but I was very familiar with how to deal with huge amounts of data. It was then that we hired a Ph.D. in data science who taught at the University of Santa Barbara. That was an incredible experience for me, because this is a person who understands all the intricacies of data science – the algorithms, how they work, what it’s all about – and of course a fair amount goes over your head, but I was tasked with bringing it all together. I had to coordinate with the product team to figure out what was going to be built. I had to coordinate with the tech team on the actual tech choices, more like an architect would. And I had to work with the data science team, which was all new to me: their processes, how they need the data, how they need to do training, and what it looks like to allocate servers to them to do that.
So, it was quite a crazy ride. What we found there was, we tried using all off-the-shelf solutions to move data. What was very interesting about working with the insurance carriers is sometimes you could use the cloud, and sometimes you couldn’t. So that created very interesting situations. Sometimes you could move the data, sometimes you couldn’t. Sometimes you had unlimited CPU resources and sometimes they would only allocate you a certain number of virtual machines. Sometimes you would have GPU instances and sometimes you wouldn’t. So, you were trying to solve these problems with sometimes significant constraints and sometimes not. What we found is, we tried to use off-the-shelf solutions like SQS, Kinesis, and Kafka for the actual logistics of data, and found that to be very difficult. When you’re… imagine hundreds of petabytes of data…
Nati: Again, just to make sure that everyone can follow what we’re discussing right now. When you’re describing SQS, that’s Amazon’s messaging service, and Kafka, obviously, is an open-source Apache project for messaging. And when you say that you tried to use SQS and ran into some limits, can you elaborate on that a little bit?
Josh: Yeah. So, when it comes to the logistics, you need some ability to…
Nati: Well, I’ll stop you for a second again, just for the audience, to make sure that they follow what we describe. When you say logistics of data, that usually involves how you get data from the data source and put it into a queryable service where you can actually do something with that data, right? So, can you say a few words about that logistics?
Josh: Yeah, exactly. So, logistics is just the movement, storage, and connectivity of data. It’s like the plumbing of a house – it’s just the pipes for moving things around. So, we had to figure out how to move large data sets like that. And specifically with SQS, the thing you don’t really realize until you use things like this at major scale is that the latency is very high. And then you start to run into phantom limitations. I say phantom because you only find them if you dig deep enough, or somebody talks about how they hit a limit and had to reach out to AWS support. I would say out of everybody, AWS is the best, but you’ll get things like connection limits, limitations on the number of messages, and things like that.
But by far, the biggest issue with SQS was connections – the actual physical client connection you open to use SQS – and then the latency, which makes total sense. If you’re Amazon and you’ve got the SQS service, you’re serving it up to a huge number of people. You’re building it for the middle of the road, the averages. So, you look at how your entire customer base uses it, and then you make it perform for that upper average. But for a heavy user like us, that’s not really what SQS expects you to do with it. Now, again, since AWS was a first mover, they’re a lot better than somebody like Google, whose Google Cloud is notorious for having terrible support.
And to me, it’s been so bad that we just have not been able to use Google for certain things, because when you hit a limit, they are not very helpful; there’s nobody you can call. But AWS has really great support. Another couple of issues we ran into using AWS: we were using their… I forget, it’s like their Postgres database, and so we were using that almost…
Nati: I can think of it like.
Josh: What was that? Yeah.
Nati: RDS you mean?
Josh: Yeah, they have that new service… well, it’s not really relevant, but RDS has limitations too, and you can get into a situation where – as anybody who has dealt with DevOps knows – when you use things a certain way, your indexes start acting weird, and you need the ability to get to the underlying system to figure out what’s going on.
Was it corrupted? What happened? Am I using it wrong? But when you use hosted solutions, you’re limited in doing things like that. So, with these large systems, you inevitably need to get closer to the bare metal to troubleshoot them and figure them out. But by far, the biggest issue with cloud services is the latency. As an example, with a leading competitor to SQS you can expect 10-millisecond latency, while with SQS you typically see between 200 and 500 milliseconds. And when you’re talking massive scale – you’re trying to churn through a hundred petabytes of data as fast as you can – that latency really becomes a problem, because you’re paying per hour of usage. A 10x slowdown is significant at those scales.
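To make the latency point concrete, here is a rough back-of-the-envelope sketch in Python. The backlog size, connection count, and latency figures are invented for illustration, not measured benchmarks:

```python
# Rough cost-of-latency sketch: how long does it take to work through a
# backlog of messages if each round trip costs a fixed latency?
# All figures are illustrative, not measured benchmarks.

def hours_to_drain(messages: int, latency_ms: float, connections: int) -> float:
    """Wall-clock hours to send `messages` serially over
    `connections` parallel clients, at `latency_ms` per round trip."""
    per_connection = messages / connections
    return per_connection * (latency_ms / 1000) / 3600

BACKLOG = 100_000_000          # 100M messages
CONNECTIONS = 500              # parallel client connections

fast = hours_to_drain(BACKLOG, 10, CONNECTIONS)    # ~10 ms competitor
slow = hours_to_drain(BACKLOG, 300, CONNECTIONS)   # mid-range of 200-500 ms

slowdown = slow / fast         # the latency ratio carries straight through
```

Because you pay per hour of cluster usage, the latency ratio flows almost directly into the bill at this scale.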
So that’s our experience with the cloud. Then we moved into more homegrown solutions, like Kafka and things like that. The issue around Kafka that I think most people are familiar with is that it doesn’t scale well horizontally, because topics are tied to brokers. So you can really only scale vertically to a certain point, and a topic is effectively tied to the biggest hard drive you’ve got, because Kafka doesn’t separate compute and storage. So, there are just technical limitations there, and due to the nature of its architecture, you can’t use a whole lot of topics. We found Kafka to be fantastic up to a certain point.
So, it starts to break at the seams based on your volume of traffic. That might be a million messages a second through a topic. Most people are doing things that are analytical in nature, and there’s really no huge need from a throughput standpoint. When you think of an analytical pipeline, it’s linear in nature: something comes in from the left, moves into the system, maybe it gets transformed, you do some calculation with something like Apache Flink, and then you store the result in a database, which powers a dashboard or a BI tool that a human looks at to get a better understanding of your business. But when you get into ML and AI – which is the way we see things trending right now – it’s not as linear in nature; those are all automated tasks, and the result looks like a circular feedback loop. Instead of something getting put into a database for a visualization or a human to look at, it goes into a model, the model spits out an inference or a prediction value, that feeds into another model, and the output of those models eventually gets put back into the beginning of the linear pipeline. So, it looks like a loop.
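The contrast Josh draws between a linear analytics pipeline and a circular ML feedback loop can be sketched with a toy example (the transform and model steps here are invented stand-ins):

```python
from collections import deque

# Toy contrast: a linear analytics pipeline ends in a store a human reads,
# while an ML loop feeds model output back into the head of the pipeline.

def linear_pipeline(events):
    """Event -> transform -> store: one pass, done."""
    store = []
    for e in events:
        store.append(e * 2)          # stand-in for a Flink-style transform
    return store                     # a human reads this via a BI tool

def feedback_loop(seed, rounds):
    """Model output re-enters the pipeline, forming a loop."""
    q = deque([seed])
    history = []
    for _ in range(rounds):
        x = q.popleft()
        prediction = x + 1           # stand-in for model inference
        history.append(prediction)
        q.append(prediction)         # feedback: output becomes new input
    return history

stored = linear_pipeline([1, 2, 3])
loop = feedback_loop(0, 3)
```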
Nati: Yeah. So again, let’s pause for a second just to make sure that I captured all the things you mentioned. We started with the logistics of the data – logistics meaning shipping data from the source into the actual data system. You mentioned trying to use the cloud services; at the time that was SQS, and the issues were latency and a lack of predictable, deterministic behavior, especially when it comes to scale. By and large, you mentioned that the services on Amazon are pretty advanced and the support is great, as opposed to Google. Again, I’m assuming that things have probably changed since then, but we’re talking about 2017, I think, right?
Josh: Yeah. That bridges into late 2018. But yeah.
Nati: Okay. So that brought you to use a homegrown solution based on Kafka. For people who are less familiar with it, Apache Kafka is an open-source message queue system – I don’t remember the exact time it was launched. And then you mentioned a couple of things that I wanted you to elaborate on, if I captured them correctly: the scaling of Kafka, the limit on topics, and the limited ability to scale horizontally beyond topics. For those who are less familiar with these topics, I’ll just explain what Josh was saying. Basically, messaging systems are structured in a way in which you have queues, and queues need to be ordered, and they need to be ordered in a timely fashion – like a FIFO model.
They have to live on some server; usually that’s how these data systems are structured. That’s the easiest way to order messages in a consistent way, but it also limits the scale. They scale well between topics, because each topic can live on a different server. But when you have a massive amount of data on a certain topic, that’s where you hit the scaling limit, and therefore your option is to scale up – meaning increasing the server capacity – and not to scale out by adding more servers, which I would say is the better way to scale in the cloud space. That’s one limit of Kafka. And then you mentioned that you also had an issue with the number of topics that you could scale to. Can you elaborate on that? I’m not sure I captured that point well.
Josh: So, just the way Kafka’s architecture is designed – and here we’re running up against the limits of my Kafka expertise, because from my perspective, I’m always the CTO figure; I’ve got to look at the whole tech team holistically and understand limitations across the whole thing – but just due to the nature of Kafka’s architecture, it doesn’t support using hundreds of thousands of topics. It has a limit of tens of thousands of topics. Somehow, the way it’s designed and the way topics are set up inside of Kafka, it doesn’t play nice when you have lots of topics, and even tens of thousands stresses it out. What we see with most people is that they stick to less than a thousand topics. That’s a loose general guideline for Kafka, depending on how you architect it.
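The broker-bound scaling ceiling described above can be sketched as a toy placement exercise: whole topics are placed on brokers, so many small topics spread out fine, while one oversized topic hits the vertical limit of a single machine. This deliberately ignores Kafka's partitioning details and uses made-up capacities:

```python
# Toy sketch of a "topic lives on a broker" design: it scales out across
# topics but not within one hot topic. Broker capacity is invented.

BROKER_CAPACITY_GB = 1000        # biggest disk you can attach: vertical limit

def place_topics(topic_sizes_gb, brokers):
    """Round-robin whole topics onto brokers; a topic cannot be split."""
    placement = {b: [] for b in range(brokers)}
    overflow = []
    for i, size in enumerate(topic_sizes_gb):
        b = i % brokers
        if size <= BROKER_CAPACITY_GB:
            placement[b].append((i, size))
        else:
            overflow.append(i)   # no single broker can hold it
    return placement, overflow

# Many small topics spread fine; one 5 TB topic has nowhere to go.
placement, overflow = place_topics([200, 300, 5000, 100], brokers=2)
```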
Nati: Yeah, and that resulted in some cost issues, right? Because you had to over-provision things, if I recall correctly, and several things related to that.
Josh: Exactly. That right there is the biggest negative of Kafka. Due to the nature of it, when you scale up and then want to scale down, it has to rebalance data across the topics. Traditionally, when your data is smaller, that’s easier, because it’s just a computation at the end of the day – it’s just reorganizing data. Imagine you were to move data from one queue to another queue; it’s similar in nature to that. So, it’s not overly complex, but it takes time. And if you have huge amounts of data – millions and millions of messages – that just takes time.
So, what you typically see in the wild is people over-provisioning for Kafka. They look at their usage graph and they look at the highest point of that graph – the most it needs to support at any one time – and then they provision 10% above that. When you get into machine learning and AI, that usage graph looks way crazier than traditional event streaming, pub-sub, or queues. What it looks like is: you might have a team of data scientists that reads in three years of data and then needs to process it and use it for something. Then you might have another team that does the exact same thing, and that’s in addition to your normal usage. And pulling three years of data… I don’t think I need to stress how massive that could be.
So that creates a huge amount of stress on your logistics or messaging platform. What we ended up doing, since we were doing machine learning and using it as pub-sub – like event streaming, so it was almost just coordinating messages versus a queue scenario – was create a massive Kafka cluster. And since we were training typically every couple of days, it was a huge pain to scale it back down, so we just left it up. So, it’s this huge, expensive cluster that doesn’t make sense to scale down because it’s too painful.
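A back-of-the-envelope illustration of why "provision 10% above peak" breaks down for ML workloads: a single team replaying three years of history in a day dwarfs the steady-state load. All numbers here are invented:

```python
# Back-of-the-envelope for the "provision 10% above peak" pattern versus
# what a training-data replay does to it. All figures are illustrative.

baseline_msgs_per_sec = 1_000_000          # steady event-streaming load
provisioned = baseline_msgs_per_sec * 1.10 # peak + 10% headroom

# A data-science team replays 3 years of data within a 24-hour window.
years_of_data = 3 * 365 * 24 * 3600 * baseline_msgs_per_sec  # total messages
replay_rate = years_of_data / (24 * 3600)                    # msgs/sec needed

overload = replay_rate / provisioned   # how far past capacity one replay goes
```

Even spread over a full day, the replay demands roughly three orders of magnitude more throughput than the provisioned headroom, which is why clusters get sized for the spike and left running.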
Nati: You touched on training; can you elaborate on what training means, again for those who are new to the topic?
Josh: Yeah, and that also is a good transition into the way people are doing it differently today. So, in traditional machine learning, you do what’s called batch training. You have a finite data set and you need to grab that data set and have it in hand and process that to train an algorithm. So training is basically… it ends up looking like a big matrix at the end of the day. So, if you’ve got a million records or 10 million records, imagine those as columns of a matrix and then your training is performing a mathematical calculation on each cell of the matrix. And it’s putting that calculation into the second row of the matrix. And then it’s doing another calculation on that second row. So, it’s doing it a million times or 10 million times, and then it keeps going down the matrix. Those are typically the layers of your training.
So that’s what machine learning looks like at the end of the day. It’s just turning letters into numbers and doing some calculations. Where it gets really interesting is that once you’ve got that matrix, the machine learning has the ability to do backward propagation. If it’s on row five, column 50, it can go backward or it can go forward. That’s why it’s important to have a finite data set in hand – so that you can go backward or forward through it. You can just imagine a cursor going forward and backward through a dataset; pretty straightforward. Now, to do that, imagine huge amounts of data. Typically, the more data you have, the more accurate your machine learning model is after you’ve trained it – its ability to predict the future is more accurate the more data you have. And all those math calculations – something has to coordinate that. Something has to say, okay, I need to do these 10 calculations, I’m going to send them over to this microservice or this process. So, all of that gets very heavy on the logistics side of computing. Does that clarify what that is for you?
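The matrix picture Josh paints can be sketched loosely with NumPy; the layer shapes and the single gradient step below are illustrative caricatures, not a full training loop:

```python
import numpy as np

# Loose caricature of batch training as "math over a matrix": each layer
# applies a calculation across every record, and the full dataset is in
# hand so a cursor can move forward (and, in backprop, backward) through it.

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8))      # 1,000 records, 8 features each

W1 = rng.normal(size=(8, 4))         # layer 1 weights
W2 = rng.normal(size=(4, 1))         # layer 2 weights

h = np.maximum(X @ W1, 0.0)          # forward pass, layer 1 (ReLU)
out = h @ W2                         # forward pass, layer 2

# Backward propagation revisits the same data in reverse layer order,
# which is why the finite dataset must be addressable in both directions.
grad_out = np.ones_like(out) / len(X)
grad_W2 = h.T @ grad_out             # gradient w.r.t. layer 2 weights
```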
Nati: Yeah, definitely. And this is where people are hearing continuously on GPU and Nvidia and those things. Why is that even important when you are talking about those topics?
Josh: What’s interesting is GPUs were built, obviously, to render graphics to a screen, and the way they do that is basically math over matrices, when you actually look into how graphics are done. So, GPUs were never built for machine learning; it’s one of those things where, just due to the nature of how they work, they’re a shoo-in for it. They just perform much better – and not by a little, it’s orders of magnitude better. A traditional CPU has limited cores and works through a queue of things to do, so your computer is constantly switching between things to do over, like, four to eight cores for a traditional desktop computer. Whereas an Nvidia GPU traditionally has 2,000 cores – but the cores aren’t the same.
It’s not like a GPU is broadly way more powerful than a CPU. They’re different cores: they’re smaller, they’re functionally different, and they’re more parallel in nature. That’s just why GPUs can do the matrix math more efficiently than a CPU. A CPU walks through it serially, in series, and a GPU can do it more in parallel. So, you can train a model on a CPU and on a GPU, and the GPU will just finish sometimes two orders of magnitude faster in actual time to get to the end of the calculations.
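The serial-versus-parallel contrast can be sketched like this; note that NumPy here still runs on the CPU, so this only illustrates the two execution models, not actual GPU speed:

```python
import numpy as np

# Stand-in for the serial-vs-parallel contrast: an explicit Python loop
# walks the matrix one cell at a time (CPU-queue style), while a single
# vectorized matmul dispatches the whole calculation at once (GPU style).

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)

def serial_matmul(a, b):
    """Multiply matrices one cell at a time, strictly in series."""
    out = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(a.shape[1]):
                out[i, j] += a[i, k] * b[k, j]
    return out

parallel = A @ B                 # one dispatch; hardware parallelizes inside
same = bool(np.allclose(serial_matmul(A, B), parallel))
```

Both paths produce the same numbers; the difference is that one walks the cells in a queue while the other hands the whole matrix to hardware built to work on many cells at once.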
Nati: Okay. So again, let me pause here for a second, just to make sure the audience is following what we’re discussing. I think what Josh mentioned is the reason why GPUs became popular. The GPU was originally designed for graphics processing, and that’s where Nvidia came in. The nature of graphics is that it takes feeds of graphics data and needs to present them on a screen relatively fast, and the architecture they built is highly parallel, so that they can process those signals in a very timely fashion.
Whereas a CPU was really built to do more transactional work, which, as you mentioned, is an order of commands that executes mostly serially. The parallelism was actually introduced relatively recently, in the last couple of years, with multi-core architectures, but the way it’s built still lends itself less to a highly parallel type of architecture and more to sequential processing, where parallelization is between processes and threads. That’s less native to the architecture, and the result is that if you take a bunch of data and you need to process it, and you could parallelize it by 2,000 threads versus 20, 30, or even a hundred in some cases, obviously the 2,000 would scale much better. So, it’s really those things that make GPU-based architectures lend themselves better to that. Yeah, go ahead.
[foreign language 31:55]
Nati: Elan, can you mute? Sorry about that.
Josh: Yeah. So, something that’s interesting too, since I’ve been knee-deep in the machine learning and AI space, is what it ends up looking like. It looks very similar to traditional software. What’s very hot right now is microservices and things like that, and that’s what boosted the popularity of these event streaming platforms and pub-sub: they needed something to coordinate all these microservices. And a lot of models are being deployed as microservices, so it looks very similar now from a DevOps perspective. This is where ML Ops becomes very important. From my perspective, ML Ops is basically the transition of traditional DevOps, but focused on ML. And that’s why I’m bringing up the parallels, because it looks very similar to traditional DevOps.
But where it gets complicated is that most applications are linear in nature: it’s charging toward an end state, and then it gets there, and then it’s done, and then it starts over again. But with ML and AI, number one, those microservices are your model deployments. After you’ve trained that model, you end up with, sometimes, a 4-to-20-gigabyte file. That’s what you get at the end of the day – a single file, and it’s massive. Then, when you deploy your microservice, it has to load that file, because that’s its intelligence, and doing that is very difficult. Your microservice might need eight gigs of memory just to start up. So, you can’t create a pod in Kubernetes from a Docker image and allocate it 256 megabytes of memory; your service will never start up.
Where this gets complicated is that once it has processed that file and the microservice doing the machine learning has quote-unquote started up, it may only need 756 megabytes of memory. So, you get into a very interesting situation where you have to allocate more for startup and then pull it back. It ends up looking the same; there are just slightly different ways you need to do it. Also, when you get into what major companies are doing these days, they have hundreds of those. Whereas most large companies have hundreds of microservices, AI and ML look very similar, but those microservices are all models. And where it gets very interesting is that you end up building a mesh network, where whatever comes out of model A gets fed into models B, C, and D, and then what comes out of models B, C, and D goes into other models. So, it ends up looking like a mesh network that needs almost peer-to-peer communication. That’s why the logistics in ML and AI are much more difficult than what exists today, which is typically built for things that are analytical in nature.
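The mesh of models Josh describes might be sketched as a small routing table; the model names and the feedback edge back to the pipeline head are invented:

```python
# Toy sketch of a "mesh of models": model A fans out to B, C and D, whose
# outputs feed further models, and results eventually loop back to the
# head of the pipeline. All model names here are invented.

MESH = {
    "A": ["B", "C", "D"],
    "B": ["E"],
    "C": ["E"],
    "D": ["pipeline_head"],       # feedback edge back to the start
    "E": ["pipeline_head"],
}

def fan_out(model, value, visited=None):
    """Walk the mesh depth-first, recording every hop a value makes."""
    visited = visited if visited is not None else []
    for nxt in MESH.get(model, []):
        visited.append((model, nxt))
        if nxt != "pipeline_head":           # stop at the feedback point
            fan_out(nxt, value + 1, visited)
    return visited

hops = fan_out("A", 0)
```

Even this five-node toy produces eight hops; with hundreds of models the coordination load on the messaging layer grows much faster than in a linear pipeline.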
Nati: Let’s stop here again, just to make sure that we keep everyone apace. So, ML Ops is machine learning operations – basically DevOps practices for machine learning. The reason why you need something a little bit special for machine learning operations is that the nature of machine learning is much more distributed, and it has all those aspects of shipping data as part of the DevOps processes. A non-machine-learning, typical traditional application – even microservices – is built in a relatively more stateless fashion, where you process many applications or services and the data itself doesn’t necessarily move with the application, so you deal less with the logistics of the data, to use your words. The processes of moving data as part of DevOps include this whole pipeline of feeding one thing into another, and into another, and that really requires a lot of the data-shipping aspect to be covered. So, I think that’s what makes machine learning operations an interesting challenge and something that requires more focus. So, go ahead.
Josh: Yeah, and it gets even crazier and more complex, because with machine learning you’re basically relying on something to automate some part of your business. It’s giving you predictions and you’re doing something with them, so you need a whole separate track of sanity checks to go along with that. These come in all sorts of flavors. It’s almost like when you build traditional software: you put in unit tests and functional tests and a CI/CD pipeline – all of these things to make sure the code you’re developing does what you expect in production. You put in all these safety measures. It’s a very similar concept in machine learning. One big thing that’s very dangerous is what’s called model drift. That’s where your predictions can be slightly off, and then that data is fed back into the system.
Then they’re slightly more off, and slightly more off, and over time they become less and less accurate. If you’re not tracking that, you can get into a very dangerous situation where those predictions are hurting you – they’re completely wrong and you’re driving your business off of them. That whole idea amps up the exponential nature of machine learning as well, because what it looks like is: you’re constantly comparing the predictions to make sure they’re accurate, all the while also developing new models to try to make them more accurate or maintain the accuracy.
So that looks like: you may want to pull six months’ worth of data, run it through your new algorithm – your new model – and then compare that to what actually happened in your business, just to make sure you’ve got the most accurate model, or to see if your new models are more accurate. It ends up looking like the most complex sort of canary deployment of software, where you’re slowly rolling things out to make sure things still act right. From a traditional software standpoint, it’s very…
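A minimal sketch of the drift sanity check described above: compare recent predictions against what actually happened and flag the model once rolling accuracy decays. The window size, threshold, and data are invented:

```python
# Minimal model-drift sanity check: track rolling accuracy of predictions
# against actual outcomes and flag the model when it falls below a
# threshold. Window size, threshold, and data are illustrative.

WINDOW = 4
THRESHOLD = 0.75

def rolling_accuracy(predictions, actuals, window=WINDOW):
    """Accuracy over the last `window` prediction/actual pairs."""
    recent = list(zip(predictions, actuals))[-window:]
    hits = sum(1 for p, a in recent if p == a)
    return hits / len(recent)

# Early on the model is fine; as drift compounds, accuracy decays.
preds   = [1, 0, 1, 1, 0, 0, 1, 1]
actuals = [1, 0, 1, 1, 1, 1, 0, 0]

acc_early = rolling_accuracy(preds[:4], actuals[:4])   # first window
acc_late  = rolling_accuracy(preds, actuals)           # latest window
drifting = acc_late < THRESHOLD
```

In production this comparison runs continuously alongside the canary-style rollout of candidate models, so a decaying model is caught before its predictions feed back into the pipeline.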
Nati: If you take the Biden versus Trump predictions, I’m not sure that those models are very accurate. Still, we have room to improve those models.
Josh: Exactly. And the thing that’s interesting… This is depressing from my perspective, but due to the position I’m in these days with Pandio, we work with typically Fortune 2000 companies, and the vast majority of…
Nati: Let me again pause you here. We jumped into Pandio; why don’t you introduce that period of time, why you chose to start Pandio, and what Pandio is and…
Josh: Yeah, so, in my experience over the last five years, I redeveloped and reinvented the wheel of data logistics. And when I was at Carpe Data, I saw a very clear trend with the major companies we were working with: that the existing logistics software, like Kafka, Kinesis and SQS, was nowhere near adequate for what the future was going to demand. And so, I started Pandio with a business partner of mine. He’s your traditional MBA; we’ve worked together numerous times, very good friends for the last 10 years, just to address that. We saw something that was alarming, and that was that 83% of business executives want AI, but the AI traction, or who’s actually using it, is next to nothing. So, we wanted to help usher in that promised future where AI is making everybody’s life better.
And I identified that common denominator of logistics being extremely difficult as a big hurdle to that. And so, having been running Pandio for a few years now, it’s exactly what we see in the world. These large companies, without naming them, have a huge problem with connecting data, getting it in a way that they can do something with it, and moving on to machine learning and analytics. If you imagine what most companies are doing today from an ML and AI perspective, the majority of it is analytical in nature. Obviously, there are edge cases of people doing amazing things, like Tesla’s Autopilot and things like that. That’s very much way past just analytics, but traditionally, most companies are doing just analytics. So, it’s linear regressions, which are the most basic form of machine learning.
I view it as very far away from real AI. AI is meant to be so much more than just linear regression. And so, the Biden polls and Trump polls and things like that, those are traditionally just linear regressions. They’re just doing simple things against the data and making predictions on it. But when you move into… neural networks and things of that nature, that’s like a massive leap forward in capabilities. And the thing that people are struggling with right now is that the solutions out there like Kafka are bursting at the seams just for analytics, and to do proper ML and AI, which is where I was knee-deep, it’s orders of magnitude more of everything. More storage, more compute, more bandwidth, everything, and these systems are already bursting at the seams. So that’s where I identified the issue: we’re going to have a major problem. If someone doesn’t come up with logistics software that can handle that scale, well, we’re going to be in trouble. So that’s Pandio in a nutshell.
Nati: Excellent. So, what’s the actual technology behind Pandio? What are you using in the stack there and how do you actually solve the logistical challenges that you mentioned?
Josh: One of the saving graces of my experience in the past is that I came across Apache Pulsar. So, Apache Pulsar is basically an event streaming platform that can do pub-sub and queues. It has all the messaging patterns built into it, so it was really great. So, I used that to solve a lot of the issues I was having with Kafka and Kinesis and things like that. So, we used Apache Pulsar as our foundation for Pandio. We took Pulsar and made it better. So, it’s got stream processing and SQL against the stream, which is pretty interesting, to write SQL against an infinite stream of data. But our real value prop came from all those experiences going back 10 years: DevOps was very difficult, so you had to throw a lot of bodies at it.
And these aren’t low-level employees. Your traditional highly skilled DevOps person is very expensive. Some are a quarter of a million or more in salary every year here in the US. And we see companies in the Fortune 2000 that have hundreds of them just turning dials and knobs to keep systems running. So, I saw an opportunity to use AI itself to improve that. So, the big benefit of Pandio is that it has that traditional messaging that can work for any use case, but it uses a neural network that actually runs our messaging. So, it will horizontally scale through Kubernetes, it’ll reconfigure Pulsar and change configuration settings… hot-swapping them, changing caches and things like that, redistributing things. It completely runs it to the point where you don’t really need DevOps.
So, one of our biggest customers is a large public media company in the US. For this particular project, they had 15 DevOps people working on it. They switched it over to Pandio, and now there’s a quarter of an FTE doing the same thing that those 15 people were doing. So, it’s really just an offset of DevOps, because there’s scarce labor there. DevOps… it’s very hard to find good people there, and ML Ops is even rarer, because it’s traditionally the cream of the crop of DevOps who are moving to ML Ops these days. So, that scarce labor is a very difficult thing that major corporations are dealing with right now. Our neural networks help offset that. And the more data you send through your messaging, the more value that neural network provides. It learns as it goes. So, it’s pretty fascinating.
Nati: Let me maybe try to summarize what we just described. So, we’re talking about Pandio and… Pandio, right? How do you spell it?
Josh: P A N D I O dot com.
Nati: Right. So, it sounds like you’re really focusing on the message processing part of machine learning, the logistics, as I think you mentioned a couple of times. And in that regard, what you’ve done is really provide message processing that has a machine learning engine behind it, so that all the setup of how you configure it and tune it over time is automatic and wouldn’t require any intervention from an engineer or a DevOps guy. And it will be a self-configuring, self-managing kind of thing. Self-tuning, even. And it’s based on the Apache project, I forgot its name right now. Can you remind me again? Pulsar?
Josh: Pulsar, P U L S A R.
Nati: Okay, and why, by the way, did you choose Pulsar? What brought you to use it?
Josh: There’s a lot of different reasons. One of the biggest reasons, well, two of the biggest reasons, was the unified messaging. So, with Kafka, you can’t do traditional queues. So, like the FIFO queue that you mentioned earlier, Kafka can’t do that. So, if you wanted to do the same things that Pulsar can do, you would need Kafka and something like RabbitMQ or SQS. So, you’d need two pieces of software, which just increases your tech complexity. Pulsar has all three of those messaging patterns: event streaming, pub-sub, and queuing. And secondly, the biggest thing is it’s cloud-native. So, it separates compute and storage. So, if you want to scale up compute independently of storage, you can do that. So, it scales much more easily horizontally. You can use more commodity hardware to scale out horizontally, versus having to add more CPUs and bigger hard drives, not more hard drives, bigger hard drives, as compared to Kafka, for example.
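The unified-messaging point can be illustrated with a toy broker: in pub-sub, every subscriber sees every message, while a FIFO queue delivers each message to exactly one consumer, in order. This is a minimal pure-Python sketch of the two delivery semantics, not Pulsar’s actual client API.

```python
from collections import defaultdict, deque

class MiniBroker:
    """Toy illustration of two patterns a unified broker offers:
    pub-sub (fan-out to all subscribers) and FIFO queueing
    (each message consumed exactly once, in order)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> subscriber inboxes
        self.queues = defaultdict(deque)      # queue name -> pending messages

    # --- pub-sub: every subscriber receives every message ---
    def subscribe(self, topic):
        inbox = []
        self.subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, msg):
        for inbox in self.subscribers[topic]:
            inbox.append(msg)

    # --- queue: each message delivered to exactly one consumer, FIFO ---
    def enqueue(self, queue, msg):
        self.queues[queue].append(msg)

    def consume(self, queue):
        return self.queues[queue].popleft() if self.queues[queue] else None

broker = MiniBroker()
a = broker.subscribe("prices")
b = broker.subscribe("prices")
broker.publish("prices", 101)      # both a and b receive it

broker.enqueue("jobs", "job-1")
broker.enqueue("jobs", "job-2")
first = broker.consume("jobs")     # "job-1": delivered once, in order
```

With only an event-streaming system you get the first pattern; the second has to be bolted on with separate software, which is the extra complexity Josh is describing.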
So, it’s cloud-native and built for a container world. And that gives you a huge amount of flexibility to scale storage up independently, or scale it down. And that gives you a lot of ancillary benefits. Like I was mentioning earlier, it’s very hard to get to a hundred percent resource utilization with a Kafka cluster; with Pulsar it’s significantly easier to get closer to a hundred percent server utilization. And there’s just a laundry list; I would definitely invite people to check out their website. There’s a huge amount of information there on all the additional benefits, like zero message loss. So, it’s designed to be more durable, which is great for the financial industry. We talked to a huge bank, the third biggest bank in the world, and they were using Kafka, but they were noticing message loss. And so, they had to build a whole other system to make sure that message loss wasn’t happening. And they just had to do that because there was no other solution that they were aware of.
So, the financial industry has an especially hard time using Kafka. If you can lose 0.001% of messages and it’s not a big deal, then great. But when you’re dealing with ledgers and transactions, that’s a big deal. But yeah, those are the two biggest ones, but there’s a huge amount. And we took that as a very good foundation and built on top of it, so…
Nati: Right, and the thing that you added is really the ability to manage it, or scale it, or configure it based on different patterns, as opposed to having someone in operations, DevOps guys, who will continuously manage the system, right?
Josh: Yeah. And we view it as a connectivity layer as well. So, another thing we added is connectivity into almost any ML or AI software out there, like Spark and things like that. So, if you’re doing an ML or AI initiative, it just saves time at the end of the day. And the other interesting thing we didn’t get to talk too much about is online machine learning. I think that’s a really hot topic that a lot of people aren’t aware of and should be. If we have time, I’d love to just touch on that.
Nati: Let’s do that.
Josh: So, we were talking about traditional training of models, and that’s where you have to have a finite data set, have it in your hands, and then you do your training, quote-unquote, offline. So that’s like you give a data scientist a really big box with really powerful GPUs and they do their thing. Online machine learning flips all that around. So, as things are flowing through a stream, you’ve got that infinite data. There’s no end to the stream. You’re just getting data as it comes through. There’s a library out there called scikit-multiflow. They basically took scikit-learn, which is a very popular machine learning library, and adapted it for online machine learning. And that’s why it’s called scikit-multiflow.
The idea is you’ve got an infinite stream of data, so as record one comes in, then record two, then record three, it’s doing the training and the inference at the same time. Typically, those are not done at the same time. You do a whole bunch of training, and then you do your inference after that and you stop training. And then you do more training a month later, and so you’ve got a new model every month. That means that with every day that passes in the old way of doing it, your models are a day older, and then a day older. Which means the accuracy could be moving in the wrong direction. So, with machine learning that’s online in nature against an infinite stream, there’s no end; it’s training with each data point that comes in.
That means it’s as accurate as it’s going to be at all times. It’s never out of date. This is a lot easier in many respects, because you don’t have the whole ML Ops overhead of doing the training and deploying the model into production. Your models are in production and getting smarter every single time data flows through them. So that’s a very interesting, different way of looking at it. And since we’re so heavily in that space, we’re doing a very interesting partnership with Nvidia, with the RAPIDS framework, which is very much in this space. Nvidia is very interested in stream processing in general, and this falls under that umbrella. So, it’s stream processing, but it’s doing the model training and the inference on the stream of data at the same time.
It’s a fascinating concept because it makes everything easier. Your model is always up to date. You don’t have the whole expensive nature of pulling in three years’ worth of data and training. Now, obviously, it’s not a replacement for batch training, because like I was describing earlier, when you’ve got a matrix and you need to be able to go forward in it and backward, well, as you can imagine, in a stream of data you can’t go forward. There’s no such thing as forward in a stream of data. You’re always at the tail end of it. And you’re limited in how much you can go backward, just based on how much has been sent through. So, there are limitations, and some models require full forward propagation and full backward propagation, which means they require a finite dataset. So, there are algorithms you can’t use in this methodology, but there are algorithms that work in it. Things like decision trees and things of that nature. So, it’s a very hard space.
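The train-on-each-record loop can be sketched without any library: predict first (that is what you would return to the caller), then take one small gradient step on that single record. scikit-multiflow exposes this pattern through partial_fit-style estimators; the model, learning rate, and stream below are illustrative, a dependency-free sketch of the idea rather than that library’s API.

```python
class OnlineLinearModel:
    """Online learning sketch: one weight and one bias, updated by a
    single SGD step per incoming record. Inference and training happen
    on every record; there is never a separate offline training phase."""

    def __init__(self, lr=0.01):
        self.w, self.b, self.lr = 0.0, 0.0, lr

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        y_hat = self.predict(x)        # inference first (returned to caller)
        err = y_hat - y                # ...then one gradient step on the
        self.w -= self.lr * err * x    # squared error for this record only
        self.b -= self.lr * err
        return y_hat

# The "stream": records arrive one at a time; the model gradually
# recovers y = 2x + 1 without ever holding a finite dataset.
model = OnlineLinearModel(lr=0.01)
for x in list(range(10)) * 200:        # 2000 records flowing past
    model.learn_one(x, 2 * x + 1)
print(model.w, model.b)                # converges toward w=2, b=1
```

Josh’s latency point maps to the structure of `learn_one`: return `y_hat` to the requester immediately, and do the weight update afterwards, when nobody is waiting.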
Nati: Right. So, okay. This is indeed fascinating. I just wanted to make sure it’s clear when this online machine learning that we’re discussing right now is applicable, versus the traditional training of models. For example, if you’re doing face recognition, you still need all the historical pictures of the person to actually do the recognition, right? I’m not sure that would fit into this, but, say, a decision tree might. So, if you can elaborate on when it’s applicable, because it sounds almost too good to be true, in the sense that it makes so much sense. Why do we even have the other options of doing batch-oriented stuff, or all the training processing that we’ve done before?
Josh: That is… so the negatives are definitely those algorithm limitations. scikit-multiflow is a good starting point: they list the algorithms that you can do this with, and they also have a lot of examples that make it easier. The other thing is it takes a lot more compute to do online training. So, for example, when you send that event through and it gets fed into the training algorithm, which is called partial training, its compute is about an order of magnitude more than if you were at that same point in an offline setting. So that’s an apples-to-apples comparison of online training versus offline training. The online training can be an order of magnitude slower, which is not necessarily a huge bad thing, because we’re still talking microseconds here. But if you design it properly, where you do your inference, then you return back to the requester, and then you do the training when nobody’s waiting around, you can offset that.
But that is something to keep in mind: it is slower, apples to apples, but you can design around that. And then of course there are the limitations of the algorithms. I can’t talk too much to those exact differences; that’s definitely where my data science expertise hits a wall. I can’t tell you the exact algorithm and the exact reason, but it is almost always around backward and forward propagation, the need to look at more data than what’s available in a stream. So, imagine a window of data in a stream: you can look at the last hour’s worth of data by having a stateful function that just keeps track of the last hour’s worth of data, and then when the hour elapses you do some calculation. But in a stream, you’re not going to do that with a week or a month’s worth of data. It doesn’t really make sense, and surely not for years’ worth of data; that’s an anti-pattern for streaming. So anytime you need that for your machine learning, that’s where the limitations are.
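The stateful one-hour window Josh describes can be sketched with a deque that evicts anything older than an hour behind the newest event; the class name, timestamps, and aggregate here are illustrative, not any particular stream processor’s API.

```python
from collections import deque

class HourlyWindow:
    """Stateful stream function: keep only the last hour of events
    and compute an aggregate over that window. Feasible in a stream,
    unlike holding months or years of history."""

    WINDOW_SECONDS = 3600

    def __init__(self):
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        # evict everything more than one hour behind the newest event
        cutoff = timestamp - self.WINDOW_SECONDS
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def window_sum(self):
        return sum(v for _, v in self.events)

window = HourlyWindow()
window.add(0, 10)
window.add(1800, 20)        # 30 minutes later
window.add(4000, 30)        # the t=0 event now falls outside the hour
print(window.window_sum())  # 50: only the last hour's events remain
```

State here grows with the window length, which is exactly why a week- or month-long window stops being practical in a stream: you would be holding that entire history in the operator’s state.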
Nati: Got it. So that’s indeed fascinating. And again, I feel that we could spend hours talking about that. By the way, my background at GigaSpaces was really around solving large-scale cluster data problems, in this case kind of similar to what this is doing today. GigaSpaces was one of the first in-memory distributed data processing platforms. And with Cloudify I’m doing automation of the processes of how you manage those things, so it’s touching both of those aspects. I think you’ve mentioned them: one of them is the data logistics, as you called it, and the other one is the importance of the operation of the actual logistics. So, I think that fits nicely into my background on those two topics.
One of the things, maybe, if we want to summarize and close the discussion here: I think there’s a strong correlation, when we talk about machine learning, and I think you touched on a couple of aspects of that, between automation and machine learning. So, we tend to think of machine learning more from a data science perspective, but we tend to ignore the logistical aspect. I think you put a lot of focus in this call around that, which fits well with the topics that we’re covering right now. But you also opened up another interesting idea: that the logistics can itself use machine learning. I think that’s what you’re doing in your company today, which is, to solve the logistical problem, you can actually auto-configure, auto-scale, or heal your own operation by collecting the data, and remove or reduce a lot of the DevOps work that people need to do today. So, looking at the logistics from both of those aspects, I think, is interesting. And again, if I’m to put almost final words around what we’re discussing here, based on a [inaudible 1:01:15] machine on a… you cannot really not think about those things, because it’s inevitable.
Josh: Yeah. And you know what, you’re kind of touching on something that really pulls at my heartstrings. I’m in a very fortunate position, right? I get to talk to the heads of data science teams at major corporations every week. And the promise of the future, where it’s almost like… imagine the singularity, or Skynet from the Terminator movies. Where we’re at today is so much in its infancy, it’s not even close to what that future looks like, where it’s more AI automation powering entire parts of the business: an entire manufacturing line that’s just completely automated. Today it’s more about making a human better at their job. Not that that won’t always be true in general, but in the future AI builds AI. Imagine AI writing software that does something, and then AI manages that software; then there’s no human in sight, it just does what it does, from start to finish. To me that’s…
Nati: Like a self-driving car?
Josh: Exactly, yeah. So, if you extrapolate that, and imagine if full self-driving were here today, it’s one of those things that’s almost unbelievable. Unless you have a Tesla and you see it do some things, then you can see the light, you can see, oh my gosh, we’re getting there. It’s not here today, but we’re getting there. Whereas the vast majority of companies are just on analytics these days. And a lot of the very big companies are quickly figuring out that what they have today for analytics is nowhere near what’s needed for what ML and AI are going to be in three to five years. Entirely new classes of software are going to have to be built.
And so that’s what our positioning is. What’s out there is great for analytics. Our software works great for analytics, but that’s not our focus. We’re trying to help companies almost build the singularity, in a sense; that’s what we built for. We didn’t build a better mousetrap for analytics. What gets me excited is imagining a future where entire groups of AIs completely solve some issue, and it’s no longer even a conscious thought for humans. AI just takes care of it. We can think about something else and spend our time somewhere else. That’s what we want to enable as a company, and that’s the future I’m excited about.
Nati: Great! So, Josh, thanks very much for taking us through this journey, from your early entrepreneurship days back in high school till today, which I think covers a very nice spectrum, starting with running high-traffic websites (you also mentioned a Craigslist competitor as an example), to the point at which we’re discussing this online machine learning, where we’re getting closer to that self-driving car experience. And as you mentioned, we’re now at the stage of mostly analytics and less of that self-driving car, but we’re moving closer to that self-driving car experience, where it’s going to be all closed-loop automation. And that discussion took us through the journey both from a personal perspective and from a technology perspective. So hopefully the audience appreciated and enjoyed the discussion. I definitely enjoyed it, and we’ll probably talk again. Until next time, thank you very much for coming onto this podcast, Josh.