AWS AI & Machine Learning Podcast

Episode 4

January 05, 2020 Julien Simon Season 1 Episode 4

In this episode, I have a chat with Pavlos Mitsoulis-Ntompos, a Data Scientist for the Expedia Group and an AWS Machine Learning Hero. We talk about real life ML, and what it takes for ML projects to be successful.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future episodes ⭐️⭐️⭐️

Check out Sagify, an excellent CLI tool for Amazon SageMaker written by Pavlos:
* End to end demo:

This podcast is also available in video:

For more content, follow me at and at

speaker 0:   0:00
Hi, everybody. This is Julien from AWS. Welcome to Episode 4 of my podcast. Don't forget to subscribe to be notified of future episodes. In this episode, I'm talking to one of our Machine Learning Heroes, Pavlos, who works for the Expedia Group. This is an interview that I did a few weeks ago for the AWS Innovate online conference, but I thought it was so good that I should use it for the podcast as well. So without further ado, let's listen to what Pavlos has to say on machine learning in the wild.
So Pavlos will introduce himself, but just to give you a little background: I started following him on Twitter a while ago because he was sharing all kinds of really cool stuff, including AWS and SageMaker content. I thought, okay, I've got to talk to this guy. Then, completely by chance, we ended up speaking at the same meetup in Athens, the Big Data meetup. So hi to everybody from Greece if you're watching; I'm looking forward to being back there. He said, "OK, we know each other, right?" Then we talked more, and it was a very easy decision for us that Pavlos should be a Machine Learning Hero. If you're curious about Machine Learning Heroes: you probably know about AWS Heroes, members of the AWS community who do a lot to help other developers with tools, blogs, projects and whatnot. And now we have Machine Learning Heroes. So Pavlos, you're one of those.
I'm really excited to be an AWS ML Hero. So, yeah, I'm a data scientist working for Expedia Group and an ML Hero. I'm so excited about machine learning and really privileged to live in this era. I started working in ML maybe seven or eight years ago, when it wasn't so fancy.
Did you actually study it, or did you fake it until you made it?
Well, I studied operations research and then computer science.
I took many ML courses, for example a course about optimization, but back in the day there wasn't any course that specialized in machine learning. So I had some formal training, but that's all. Back then scikit-learn wasn't a thing. What I remember was Weka; it was a good tool for small data. And MapReduce was a big thing. I published a paper about MapReduce. People were starting to talk about big data: how to store it and how to process it. So it was the beginning, the first steps of machine learning.
That's a good point. I have a slightly similar story, although I didn't take any formal courses. But big data, 2010, 2011, right? It was all about Hadoop and big data and piling up web logs and trying to figure out what to do with them. And machine learning kind of followed on, right? We have data, we have computing power, and we have to do something with them. We can't just do basic aggregation or whatever; let's be smart.
So we had all the data necessary, you know, terabytes of data, and then the question was, okay, what are we doing now? Then we had to use all these MapReduce and Hadoop processing tools to prepare the data for the machine learning models. It was a painful era where we didn't have a lot of good ML tools to process big data.
Yes, and I keep saying it: machine learning is not something on the side. Machine learning is just one step in your project, and you need data and big data tools. We had a session today on which big data services on AWS can help you with that: Glue, Athena and EMR. Data engineering seems to be the buzzword now. So data engineering is really, really important, and ML just follows on, right?
Exactly. It's a big process that involves many people from different disciplines: data engineers, software engineers, data scientists, machine learning engineers, even product managers.
So it's really challenging to make all these people work together. Now we have all the tools, so it's more of an organizational problem.
So you're part of the Expedia Group. Tell us what kind of machine learning projects you work on, and how you organize them. Because I absolutely agree: tech is on the table. We can pick what we need, assemble things and build. But actually running the project, getting from the business question to the actual prediction that helps improve the business KPI, that's the big story, right? So tell us a little bit about what you do on a daily basis.
Well, I think 90% of my job is to prepare and mine data for the 10% which is actually machine learning.
Ah, that's the number that I keep hearing. 80% at least.
Yeah, exactly. So I'm really happy, because I think for most others the fun part is 1 or 2%, so, relatively speaking, 10% is good.
Yeah, it's the absolute minimum, right?
But I'm really privileged to be part of Expedia because it's a big company. They have an ML platform that essentially helps data scientists unlock their talent. I think one key thing to make a machine learning project successful, in a big company or a small company, it doesn't matter, is to have a really good machine learning platform. That's the key thing. That's why, for example, I believe that SageMaker can help make machine learning power many different features at the companies out there.
Yeah, I hear that from a lot of customers. Of course, when we do those small demos like we've done today, we look at toy problems and very small datasets, and we try to solve one thing.
But I guess when you're working on ML at Expedia or any other company, you're maybe looking at 50 problems and 200 datasets and thousands of models, because if you're trying to build a classifier, there's no one way to do it; there are so many ways. And then what about hyperparameter tuning, which we talked about? More models, more models, more models. And it's an issue if you have a larger team. One data scientist is already going to be building lots of things. If you have 10 or 50 or 100, you're looking at potentially thousands of training jobs every day, plus the data cleaning process and so on, right?
So that's the main issue right now. And you need to make sure that you're not duplicating any work, because probably someone else in another office has built, say, a similar classifier. So discovering ML models in a company is another big problem. And here SageMaker can help you keep track of the different training jobs: what were the parameters, what was the input data, how often the model is retrained, whether it is deployed, all that stuff.
So data wrangling and data engineering are important; model discovery, model versioning and dataset preparation are important. What about the actual training and deployment process? That's a question I ask everybody: how much automation do you have today? Do you still have a human in the loop for model validation, model QA or whatever? What's the automated part, what's the manual part, and how far are you willing to go?
I see.
And don't lie, right? A lot of people are watching.
I think that, at the moment, the deployment process is not fully automated. We have a CLI tool that deploys a model, for example, at Expedia. But I have to remind myself that, okay, this model needs to be retrained on new data.
Retraining is important.
Exactly.
So I think that, universally speaking, there is no continuous delivery for machine learning out there where you can essentially have a file, like a Jenkinsfile, that will tell your CD tool: okay, go and retrain the model every week using this data, then deploy it as a RESTful endpoint with three instances. Oh, and before you deploy, make sure that, for example, the precision and recall are higher than a specific score. So I don't think we have something like that.
Do you think it's possible? I mean, do you think we'll get to one-click automation, just like we've done for web apps and containers and everything else? Or do we always need humans at some point, looking at the model and deciding: okay, this is a good one, this is a bad one, and then, okay, click, go and push it, which is automated? What's your gut feeling here?
I think it's possible, but it will take time. I think the whole machine learning community needs to agree on the best practices. There are many different ways to do it. I think that, for example, SageMaker enforces the best practices, in a good way. And a company like Amazon, which has been doing machine learning for maybe decades, already knows all the pain points. Essentially, they know what needs to be done and what needs not to be done. That's very important. So the community can learn from that and converge on the best practices out there. I think it's possible. Now we have SageMaker, which is essentially the ML infrastructure. But let's take a step back. An ML platform consists of two things: the ML infrastructure, like SageMaker, and an interface between the data scientists and the ML infrastructure. That interface needs to be a user interface and a CLI tool at the same time. So Sagify is the CLI tool that I built.
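
The deployment gate Pavlos describes above, where a retrained model is only deployed if its precision and recall clear a chosen score, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Expedia's pipeline and not a SageMaker, Sagify or Jenkins API: the names `precision_recall`, `deployment_gate` and `GateResult`, the 0.8 thresholds and the sample labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    """Outcome of the pre-deployment check (hypothetical structure)."""
    precision: float
    recall: float
    deploy: bool


def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Define the metric as 0.0 when its denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def deployment_gate(y_true, y_pred, min_precision=0.8, min_recall=0.8):
    """Allow deployment only if both metrics reach their thresholds.

    The 0.8 defaults are assumptions chosen for this example.
    """
    precision, recall = precision_recall(y_true, y_pred)
    deploy = precision >= min_precision and recall >= min_recall
    return GateResult(precision, recall, deploy)


if __name__ == "__main__":
    # Made-up validation labels and predictions from a freshly retrained model.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
    result = deployment_gate(y_true, y_pred)
    print(result)
    if result.deploy:
        print("Gate passed: hand over to the deployment step.")
    else:
        print("Gate failed: keep the current model in production.")
```

In a real pipeline, a scheduled job would run a check like this after each weekly retraining and trigger the endpoint deployment only when the gate passes.
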
Yeah, we're going to get a demo of that later in the session. It's a very cool tool. So, yeah, I guess it's DevOps, right? That's why we wanted to have automation, DevOps and data wrangling sessions today. I don't know if you see that as well, but a lot of customer discussions around ML tend to drift to neural networks and SGD and the crazy stuff within five minutes, and I always say, well, okay, the next person who says SGD just leaves the room. What's the business problem, right?
Exactly.
So I guess my question is, without disclosing anything, of course: how fancy are real-life models? Because we keep reading Hacker News and arXiv papers about the new crazy GAN or the new crazy NLP models. BERT and all the variations around BERT are really, really exciting. Now, if we come back to reality, isn't just about everyone using linear regression and XGBoost in, I think, 90% of the cases?
Yeah, it's really funny, because linear regression is maybe 100 years old, something like that. Someone uses linear regression and says, "Okay, I'm doing AI," and then statisticians get a little bit angry, and they're like, "Oh, I was doing AI 25 years ago." But it's a good thing; we need baselines. Every time you start a new ML project, you need a baseline. That's usually linear regression or logistic regression, or maybe a random classifier.
So would you actually recommend that? Because we have a lot of people who are new to machine learning also watching us. Is that something you work on every day? Like, let's try the simple things first, let's try the simple or basic algos, and I mean that in a positive sense. Simple is good. We love simple. I love simple algos because I understand them.
So would you run those first before going crazy with neural networks and everything else?
Definitely. You know, there is Occam's razor, the idea that the simplest is usually the best. So, definitely, because it will make your whole process simpler. You will deploy a model in production in much less time. It will make everyone in the business happy. This way you have something in production, and you will quickly get feedback on whether the model works well in production. You have simple code that is understandable by most engineers and product managers. And then, if everyone is happy and they want to move on and make it more accurate, then yeah, you can move on to deep learning or fancier tools and algorithms.
So that's great advice to everybody out there, especially the people who are kind of new to ML. Again, that's why we wanted to have this session on the introduction to ML with scikit-learn and go through linear regression, logistic regression, trees and PCA. A lot of people ask me, "Oh, I want to learn about deep learning. How do I do that?" My answer is always the same: how much do you know about statistical ML? You have to walk before you run. So spend some time learning those because, like Pavlos just said, a lot of companies just use that. XGBoost is still winning on Kaggle all the time, right? It's winning competitions, it's still very good, and a lot of people still use it. It's probably the most popular algo around today, right?
Exactly. And if you're starting out, it's really good to find a good mentor, even an informal one: someone who has been doing machine learning for many years. Try to learn from that person what worked and what didn't work in the workplace, what open source tools they used, and how they collaborate with other members of the company. I think that's the key thing before you start applying this stuff in an industrial setting.
Yeah, that's good advice.
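
To make the "start with a baseline" advice concrete, here is a minimal sketch that compares a trivial mean predictor with a one-feature linear regression, using only the Python standard library. It is an illustration under stated assumptions, not anyone's production workflow: the helper names `fit_mean_baseline`, `fit_linear_regression` and `mean_squared_error` and the toy data are made up for this example, and in practice a library such as scikit-learn would do the fitting.

```python
from statistics import mean


def fit_mean_baseline(ys):
    """Trivial baseline: always predict the mean of the training targets."""
    mu = mean(ys)
    return lambda x: mu


def fit_linear_regression(xs, ys):
    """Ordinary least squares for one feature: y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and targets."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


if __name__ == "__main__":
    # Toy data, invented for this example.
    train_x, train_y = [1, 2, 3, 4, 5], [3.1, 4.9, 7.2, 9.0, 10.8]
    test_x, test_y = [6, 7], [13.1, 14.8]

    baseline = fit_mean_baseline(train_y)
    linear = fit_linear_regression(train_x, train_y)

    print("baseline MSE:", mean_squared_error(baseline, test_x, test_y))
    print("linear MSE:  ", mean_squared_error(linear, test_x, test_y))
```

Any fancier model then has to beat both numbers on held-out data to justify its extra complexity.
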
So I'm going to be more brutal about it: don't believe the hype, I guess. Just like everyone else, I read the blogs, the arXiv papers and the technical blogs from cool companies, etcetera. But real life is usually more boring, and that's great, because boring is nice. Boring technology is simple to understand and simple to explain. So if you solve your problem with random forests or linear regression, then fine, right? You are doing machine learning.
Okay, that's the end of the episode. I hope you enjoyed the conversation. And don't forget to subscribe to my channel. I'll be back next week with more AWS news and demos and God knows what else. Until then, keep rocking.