AWS AI & Machine Learning Podcast

Episode 3: Machine Learning from A to Z

December 30, 2019 Julien Simon
Chapters
0:54 - Accuracy
1:31 - Backpropagation
2:44 - Convolution
3:49 - Dataset
4:45 - Epoch
5:35 - Feature
7:10 - Gradient
8:32 - Hyperparameter
9:57 - Iteration
11:05 - Just do it!
11:58 - Keras
12:50 - Loss
14:30 - Model
15:09 - Neuron
17:35 - Optimizer
18:21 - Python
19:10 - Quantile
21:05 - Regularization
22:35 - SGD
23:26 - Training
24:48 - Underfitting
26:11 - Validation
27:48 - Weights
29:31 - eXpectations
30:43 - whY?
31:48 - Zero

In this episode, I explain in plain English 26 Machine Learning and Deep Learning terms that you need to know. If you're beginning with the field, this should save you quite a bit of frustration!

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future episodes ⭐️⭐️⭐️

Audio samples have been generated with Amazon Polly :)

The podcast is hosted at https://julsimon.buzzsprout.com, and is also available on YouTube at https://bit.ly/368Iz7g

For more content, follow me at https://medium.com/@julsimon and at https://twitter.com/julsimon.


speaker 0:   0:00
Hi everyone, this is Julien from AWS. This episode is called "Machine Learning from A to Z", not because I'm going to explain all of it in 15 minutes, but because I'm going to go from A to Z and explain some machine learning terminology that is frequently used, and that is often confusing and intimidating, especially if you're just starting with machine learning. Hopefully this short episode will give you a better understanding of those important words, and as usual, I will try to explain all of it with minimal jargon and minimal theory. Don't forget to subscribe to my channel or my podcast to get all the future episodes. Let's get started.

Accuracy. Accuracy is one of the key metrics you're going to use to evaluate how well your model does. The simple definition of accuracy is: run a number of predictions, count how many correct predictions you made, divide that number by the total number of predictions, and you have accuracy. Simple as that, and easy to understand, especially for non-technical users, so that's probably the number one metric people will want to hear about.

Backpropagation. Backpropagation is really central to deep learning: it's the algorithm that lets you update the weights in a neural network. Forward propagation means placing input data on the input layer of the neural network and letting the network compute activation values layer by layer, and when you get to the end, you have some kind of prediction. Of course, in the early stages of the training process, it's going to be wrong. So backpropagation goes back from the output layer to the input layer and updates weights as it goes. There's a bit of magic here, because it has to update the weights in the right direction: weights are numerical values, and they need to be updated just right so that every update actually reduces the prediction error. Backpropagation is the algorithm that does that.

Convolution. Convolution is a mathematical function, a way to combine two math functions. In the context of neural networks, convolution is the breakthrough technology that made computer vision efficient with neural networks, thanks to Yann LeCun and other researchers. In deep learning, convolution is basically how you extract patterns from images, and you do this using convolution filters, also called kernels. These are small two- or three-dimensional arrays of numbers that you slide across the image, or the batch of images, that you want to extract patterns from. That's what convolution is, and once again, this is what made all those crazy computer vision applications possible.

Dataset. Well, that's an obvious one: you can't really do machine learning without data, and building a dataset is really the main task. Once your data is ready, cleaned, etc., you could say the hardest part of the job is done, because if you have a poorly maintained dataset, you're never going to get good results. As I say all the time: garbage in, garbage out. Even with the best algorithm possible, it's not going to be efficient. So caring for your data, cleaning it, filling in missing values, adding new data, and so on, is really, really central. And that's what data scientists spend a lifetime doing: curating the datasets, because again, that's the starting point for the whole machine learning story.
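To make that accuracy definition concrete, here is a minimal sketch in Python; the prediction and label arrays are made-up values, just for illustration:

```python
import numpy as np

# Hypothetical predictions and ground-truth labels for 8 samples
predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0])
labels      = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy = number of correct predictions / total number of predictions
accuracy = np.mean(predictions == labels)
print(f"Accuracy: {accuracy:.2f}")  # 6 correct out of 8 -> 0.75
```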
Epoch. An epoch is just a complicated word for an iteration, really. In the context of deep learning specifically, an epoch means pushing the dataset through the neural network once. You go through the dataset batch by batch, or sample by sample if you want, and once you've reached the end of the dataset, that's called an epoch. Typically you train for maybe hundreds of epochs if you have really large problems to work on; computer vision models, for example, are typically very, very slow to train. So that's what an epoch is: going through the dataset once.

Feature. After datasets, I think features are maybe the next most important thing. Features are high-level variables that the algorithm will use to train the model. You could say: well, I have a well-defined dataset with rows and columns, these are my features, right? Well, maybe, maybe not. Maybe some of the columns in your dataset are good enough, expressive enough, for the algorithm to learn from. Or maybe you need to transform them, or build new features from them, to help the model learn. The example I take all the time is: imagine you have a street address in your dataset. It's going to be a bunch of strings, and that's not really helpful for a machine learning algorithm; they want numbers. If you transform the street address into GPS coordinates, then it starts making much more sense, because it's a numerical representation, and the algorithm can work much better with that than it would with text strings, even if you encoded them in some way. So feature engineering is the set of techniques that you apply in order to build features from the raw dataset, and again, that's one of the key skills for data scientists. Lots of cooking recipes, and black magic sometimes.

Gradient. Gradient, again, is a complicated word to say something simple: a gradient is basically a tiny update that you apply to a machine learning parameter. The word is mostly used in deep learning, where during backpropagation we run an optimization algorithm to decide how to adjust weights, increase them a bit or decrease them a bit. We do that for each individual weight, and the update that we apply is called the gradient. This comes from math, as you would expect. You're probably familiar with derivatives: when you compute the derivative of a simple function, it's called a derivative, and when you compute the derivative along all dimensions of a function with multiple variables, you build a vector with all the individual derivatives, and that's called a gradient. So that's where the word comes from, but again, it's a complicated word for a simple thing: tiny updates that are iteratively applied to machine learning parameters.
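Here is a minimal sketch, in plain Python with NumPy, of how epochs, batches, and gradient updates fit together for a tiny linear model; the dataset, learning rate, and number of epochs are arbitrary values chosen only for illustration:

```python
import numpy as np

# Hypothetical toy dataset: y is roughly 3 * x plus a little noise
X = np.random.rand(100, 1)
y = 3 * X[:, 0] + 0.1 * np.random.randn(100)

w, b = 0.0, 0.0           # model parameters (a single weight and a bias)
learning_rate = 0.1
batch_size = 20

for epoch in range(50):                           # one epoch = one full pass over the dataset
    for start in range(0, len(X), batch_size):    # one iteration = one batch
        xb = X[start:start + batch_size, 0]
        yb = y[start:start + batch_size]
        error = (w * xb + b) - yb
        # Gradients of the mean squared error with respect to w and b
        grad_w = 2 * np.mean(error * xb)
        grad_b = 2 * np.mean(error)
        # Tiny updates applied to the parameters: this is gradient descent
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"Learned w={w:.2f}, b={b:.2f}")  # w should end up close to 3
```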
Hyperparameter. Hyperparameters are training parameters. When we say "parameter" in machine learning, we really mean model parameter, so a parameter that is learned and updated automatically during the training process. Hyperparameters, on the other hand, are parameters that you, the user, the machine learning engineer, set for the training process. For example: how many epochs do you want to train for? What batch size do you want to use? And so on. Those are really generic ones, but every machine learning algorithm is also going to have specific hyperparameters; for a neural network, you could say how many layers you have and how wide they are, etc. Finding the correct set, the optimal set of hyperparameters, the one group of values that gives you the best accuracy, is a hard problem, especially if you do it manually. That's why a lot of practitioners use a technique called hyperparameter optimization, which uses machine learning to find those optimal hyperparameters. Machine learning to improve machine learning, that's pretty cool. Hyperparameters, that's what they are.

Iteration. Iterations are, again, central to machine learning. The training process will slice the dataset into batches that are fed to the algorithm, and then we run some kind of optimization process, and once we're done with an epoch, we do it again. So that's another level of iteration, over epochs. And even from a development perspective, machine learning is a highly iterative process. You're going to try lots of different algorithms, lots of different parameters, lots of different tweaks. When you get started with machine learning, you usually think "Oh wow, I trained this model, that's really cool, it works", you're happy with it and you think you're done. But no: in the course of a single project, machine learning engineers typically train hundreds, if not thousands, of different models. So you have to have the right mindset: try all kinds of things, use your intuition, and keep looking for the best possible combination. Your work is never really done when you're working with machine learning.

Just do it! Yeah, I couldn't come up with a word starting with J, so here's my motivational speech: just do it. If you're listening to this, you're probably new to machine learning, or quite new. So just go for it. Don't let the mumbo jumbo, and don't let the elite mentality that you sometimes have to deal with in this community, bring you down. You can all do it: machine learning is mostly code. There's a little bit of theory, but it's mostly code, and you can do it. We need a lot of machine learning engineers, we need many, many more. So just go and do it, and don't let anything stand in your way.
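As a small illustration of hyperparameter tuning, here is a sketch of the most basic approach, a manual grid search over two hyperparameters; train_and_evaluate is a hypothetical placeholder that you would replace with your own training and validation code:

```python
# Hypothetical placeholder: train a model with the given hyperparameters
# and return its validation accuracy.
def train_and_evaluate(learning_rate, batch_size):
    ...  # your own training and validation code goes here
    return 0.0

best_accuracy, best_config = -1.0, None
for learning_rate in [0.1, 0.01, 0.001]:
    for batch_size in [16, 32, 64]:
        accuracy = train_and_evaluate(learning_rate, batch_size)
        if accuracy > best_accuracy:
            best_accuracy, best_config = accuracy, (learning_rate, batch_size)

print("Best hyperparameters:", best_config, "with validation accuracy:", best_accuracy)
```

Dedicated hyperparameter optimization tools use smarter search strategies than this exhaustive grid, but the idea is the same: try configurations, measure accuracy, keep the best.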
Keras. Keras is an open-source library for machine learning and deep learning, and I mention it because it's my favorite, by far. I think it's the easiest one to get started with: it has really good documentation, a really good blog, tons of tutorials, and it's really beginner-friendly. And yet you can go really, really deep with it; it lets you build extremely advanced models, especially now that it's tightly integrated in the new TensorFlow version. Keras started as a high-level API on top of TensorFlow, and now it's really integrated, and you can go from super high-level to super custom. So again, I would recommend that you start there.

Loss. Loss, again, is another complicated word meaning error. When you hear about prediction loss, that means prediction error. This is really central to a lot of machine learning algorithms, and especially to deep learning, where we measure the difference between the predictions and reality, what machine learning people call ground truth. This image is a dog, this image is a cat, this image is an elephant: go on, predict all three, and then you measure the distance, quote unquote, between the predictions and reality. That's the role of the function called the loss function: to measure that distance. A lot of those predictions are vectors, so you need some way to measure the difference, or the distance, between those vectors. Loss functions are part of the package when you use libraries like Keras, TensorFlow and others, so you have a whole range of loss functions to choose from. Of course, you can implement your own if you really know what you're doing: if you're working on a specific problem and you want a different way of measuring the error between the predictions and the truth, then of course you can write your own. Again, this is really central to the learning process, because how well you measure that error determines how you will update your parameters during backpropagation. So you have to get your loss function right, for sure.

Model. A model is what we're trying to get to. A model starts from an algorithm that you apply to a dataset, and by exploring, looking at the data, the algorithm will update its parameters. When the training process is done, you have a model. So the model is really the combination of an algorithm, the hyperparameters for that specific training job, and a dataset to learn from. A model is what you use to predict.

Neuron. Well, I only have a few letters left, hopefully. In the context of deep learning, a neuron is just a simple mathematical construct with inputs, which are floating-point values. Each input is assigned a weight, which again is a floating-point value. The operation that a neuron computes is called multiply-and-accumulate, and it's very simple: it takes each input, multiplies it by its associated weight, and then it sums everything. So if you have three inputs, you have three multiplications, weight multiplied by input, and then you add up those three products. Multiply and accumulate, that's what a neuron does. The basic idea, of course, is to mimic the biological neuron, which has inputs and, based on how much electrical current is flowing in, fires or not. So the neuron is really that multiply-and-accumulate operation, and it's always, or extremely often, associated with an activation function, which is another small math function that introduces non-linear behavior. Because, just like I said, a neuron sometimes fires and sometimes doesn't, so there's a threshold there, and that's the purpose of the activation function: to say, hey, this neuron should fire, or it shouldn't. The popular function used today is called ReLU, and it's a very simple function: if the input, so the multiply-and-accumulate value, is negative, then ReLU outputs zero, and the neuron does not fire. If the multiply-and-accumulate value, also called the activation value, is positive, then ReLU outputs that same value. That's what introduces the non-linear behavior: to the left of zero, nothing happens; to the right of zero, you just output whatever you received, and that could be a really large value too. So that's very simple math, that's how neurons and activation functions work, and of course, when you put all those neurons together, magic happens.
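Here is a minimal sketch of that multiply-and-accumulate operation followed by a ReLU activation, written in plain NumPy with made-up input and weight values:

```python
import numpy as np

# Hypothetical inputs and weights for a single neuron with three inputs
inputs  = np.array([0.5, -1.2, 0.3])
weights = np.array([0.8,  0.4, -0.6])
bias = 0.1

# Multiply-and-accumulate: multiply each input by its weight, then sum everything
activation_value = np.dot(inputs, weights) + bias

# ReLU: output zero if the value is negative (the neuron does not fire),
# otherwise output the value itself
output = max(0.0, activation_value)
print(output)
```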
Optimizer. The optimizer is the function, the actual function, that updates the weights during backpropagation. Remember, backpropagation starts from the output layer, looks at the loss (now we know what loss is), and goes layer by layer from the back to the front, updating the weights. The optimizer is how you do that: the optimizer decides how weights are actually updated. There are a whole bunch of functions to do that, and we'll see one called SGD in a few minutes.

Python. Well, I don't want to get religious, but Python is really the number one language you should learn if you want to get into machine learning. R is another popular choice, and there are libraries for Java and whatnot, but Python is still the dominant language. Libraries like TensorFlow, Keras, PyTorch, MXNet and so on have a Python API, and the Python ecosystem is super rich: libraries like NumPy, pandas, scikit-learn, etc. are really mandatory tools to work with. So Python is the one to start from, if you have to start from something.

Quantile. Quantile, that's a little more obscure. Sometimes, I would say most of the time, when you build your model, you're going to predict and output a single prediction: a numerical value predicting the price of houses in your district, or how many kilometers of traffic jams Paris will have tomorrow (well, the answer is "more than yesterday"), and that's just a single number. But for some problems, like forecasting, outputting a single value is not really helpful, because you don't want to hear that there will be exactly 556 kilometers of traffic jams. Why 556? Why not 557, or 553? You want probabilities instead. It would be more useful to say: well, there's an 80% chance that we'll have between 550 and 600 kilometers, and there's a 95% probability that we'll get less than 680 kilometers. So you want ranges of values, with probabilities, and these are called quantiles. They're referred to as P90, P50, etc. For example, P90 means that 90% of predictions will be lower than this value, and only 10% will be higher. So if you compare P10 to P90, you have 80% of predictions sitting between those two values. When you want probabilistic predictions, quantiles are what you need.
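As a small illustration of quantiles, here is a sketch using NumPy's percentile function on a hypothetical set of forecast samples; the distribution parameters are invented purely for the traffic-jam example above:

```python
import numpy as np

# Hypothetical distribution of forecasted traffic-jam lengths (in km) for tomorrow
forecast_samples = np.random.normal(loc=560, scale=40, size=10_000)

# P10, P50 and P90: 10%, 50% and 90% of the predictions fall below these values
p10, p50, p90 = np.percentile(forecast_samples, [10, 50, 90])
print(f"P10={p10:.0f} km, P50={p50:.0f} km, P90={p90:.0f} km")
# 80% of the predicted values sit between P10 and P90
```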
Regularization. Regularization is a technique that helps your model learn better. The way it works is that you penalize the model: you make it a little harder for the model to update its weights. Why would we want to do this? Well, because sometimes the model learns too well. Neural networks in particular are extremely good at learning literally anything, and sometimes they learn really too well; there is such a thing as learning too well, unfortunately. The problem is, if they learn the training set too well, then they won't do a good job on anything else: they won't do a good job on other data, the real-life data that you're going to send to the model. So to make it a little harder for the model to learn the training set, you use regularization techniques, and there are plenty. If you see your model learning really, really well, doing great on the training set and doing poorly on real-life data, then maybe you need to apply regularization to make it just a little harder, to make the model work harder at learning the training set, hoping that it will also do better on real-life data.

SGD. SGD means stochastic gradient descent, and it's the granddaddy of all optimizers. It's actually a very, very old technique; I think it was invented in the 1950s, way before I was even born, and it's still heavily used today. It's very well understood and predictable, you could say, and I guess it's the first optimizer you should start with before you try the more advanced ones. Just run a baseline using SGD and see what kind of accuracy you get, and then you can go and try the fancier, more complex optimizers. But SGD is always a good place to start.

Training. So we've pretty much covered the training process already; here's a quick recap, focusing on deep learning. You start from a dataset, slice it into batches, push each batch through the neural network, computing multiply-and-accumulates and so on, and get to the output layer. Then you use the loss function to measure the difference between the truth and your predictions, and then you run backpropagation from the output layer all the way back to the front, using an optimizer to update the model parameters. For a traditional machine learning algorithm the technique is going to be a little different, but it's the same big picture: start from a dataset, let the algorithm learn iteratively from the data and update its parameters in whatever way it's designed to, and at the end you get a model. And you do this again and again, measuring accuracy, until you get good results.
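To tie several of these terms together (Keras, loss, optimizer, SGD, regularization, epochs, batch size, validation), here is a minimal Keras sketch; the data is randomly generated, and the layer sizes, regularization strength, and learning rate are arbitrary values chosen only for illustration:

```python
import numpy as np
import tensorflow as tf

# Hypothetical data: 1000 samples with 20 features each, and binary labels
x_train = np.random.rand(1000, 20).astype("float32")
y_train = (np.random.rand(1000) > 0.5).astype("float32")

# A small fully-connected network, with L2 regularization on the hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# SGD optimizer, binary cross-entropy loss, accuracy as the metric
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Train for 10 epochs with batches of 32 samples, keeping 20% of the data for validation
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
```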
Underfitting. I talked about regularization just a minute ago, and regularization is trying to fight a problem called overfitting. Overfitting means learning the training data too well. Underfitting is the opposite: underfitting means you have a hard time learning, even on the training set. The accuracy that you get on the training set is pretty low, and it looks like you're not learning well. There are a million reasons why this could happen. Maybe your data is all messed up and it's hard to extract patterns from it. Maybe you don't have enough data. Maybe you set your hyperparameters wrong. In the case of deep learning, maybe your neural network is too shallow, maybe you need more neurons, more layers. Plenty of problems could be the root cause here. So these are the two conditions you want to fight: underfitting, because it means you're really not learning anything, and overfitting, which means you're learning too well and that's hurting the generalization of the model to new data. Regularization solves that second problem. Let's go on.

Validation. Validation is how you measure whether the model is doing a good job on data that it hasn't seen. We just talked about training, and you can easily measure how well a model is doing on the training set, but you also want to see if it's able to correctly predict data that it hasn't seen before. That's the purpose of the validation set. At the very start of the training process, you split your dataset in two: one part is actually used for training, to learn and update parameters, and a smaller part is set aside for validation. At the end of each epoch, or each round if you're not doing deep learning, the validation set is used to measure the model's accuracy. That gives you a sense of: OK, I can see I'm doing well on the training set, but how am I doing on data that I haven't seen before? That validation step is extremely important. If you have very good training accuracy and very low validation accuracy, that's not great, and you need to fix it; maybe you're overfitting, maybe it's something else. You really want validation accuracy to be on par with, or really close to, training accuracy, because if it isn't, you're not going to be able to use that model on real-life data. So getting the validation step right is critical.
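Here is a minimal sketch of setting aside a validation set before training, using scikit-learn's train_test_split on hypothetical data; the 80/20 split ratio is an arbitrary but common choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1000 samples, 10 features, binary labels
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Keep 80% of the data for training and set 20% aside for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), "training samples,", len(X_val), "validation samples")

# Train on (X_train, y_train), then measure accuracy on (X_val, y_val):
# a large gap between training and validation accuracy usually means overfitting.
```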
Weights. Well, weights, I guess we've covered them. Weights are the parameters inside a neural network. You have weights, which are the actual parameters assigned to neuron connections, and you have biases, which are additional parameters tied to individual neurons. These are used to compute the multiply-and-accumulates and other operations, and they're updated during the training process. Ideally, everything is fine and you never have to look at those values. But if something goes wrong in your training process... Imagine a whole bunch of weights go to zero: that means those connections are dead. Maybe that's a normal thing, because maybe those connections are just not useful for solving your problem, but if a lot of connections go to zero, then your model is probably not predicting very well. Weights can also go to very large values, increasingly large values, potentially exceeding the maximum that can be stored in a floating-point value, and that can create all kinds of problems. Those problems are called exploding gradients, vanishing gradients, and so on, and they're really, really difficult to debug. So sometimes, yes, you have to go and inspect weights, and inspect gradients, and understand what's happening in that black box that is a neural network. That's fun stuff.

eXpectations. Yeah, I couldn't come up with an X word, but expectations are important, especially in the early stages of a project, if you're dealing with people who have little or no understanding of machine learning. There's so much bullshit, excuse my French, lying around and flying around, and if you read magazine articles, people seem to think machine learning is magical: just throw data at it and you're done. Well, that's not the case. It's a proper engineering domain, so things are not simple; they never work on the first try. So you have to set expectations, explain how you're going to tackle the problem, what kind of metric it's reasonable to expect in the end, and what kind of improvements can be delivered in the future. You have to tell those stakeholders and business owners that no, it's not going to work in five minutes, and that it's going to be an iterative process. Expectations need to be set very early on.

whY? And yeah, why? Why do you even do machine learning in the first place? That's the first question I ask customers when I meet them: what's the business problem you're trying to solve? Why do you want to use machine learning? Why do you think machine learning is a good technique? Do you just want to try it, or is there something else? This is central. So many people embark on ML projects having no clue, just for the sake of it, or because they think it's the way to go, because it's trendy. It's not; it's a recipe for disaster. So make sure you understand the business problem you want to solve, challenge the business owners on that, and then see whether the problem can be solved with machine learning or not. Not all problems can be solved with machine learning, so sometimes you have to use something else. The why is really critical, and don't start on the project until you have understood it.

Zero. And the last word is zero. We would like to have zero prediction errors, and that's never going to happen. You will always have prediction errors: even if you have a high-performing model, even if you have 99-point-whatever percent accuracy, your model will still make mistakes. Getting to zero errors is just not possible, and again, that's one of those expectations you need to set. Having said that, you do need to look at prediction errors; they are really interesting. They may point you at specific samples, data samples that are not predicted well, and maybe you can come up with solutions. Maybe you need to add more of those samples to your dataset to help the model learn them better. Maybe they're just bad samples, maybe they're bad pictures in your dataset, and even a human looking at them would not get it right. Should you leave them in, should you drop them? You can have that discussion. But yes, looking at prediction mistakes is a great way to improve your model.
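As a small illustration of looking at prediction mistakes, here is a sketch that finds the indices of misclassified samples so you can go and inspect them; the prediction and label arrays are made up:

```python
import numpy as np

# Hypothetical predictions and ground-truth labels
predictions = np.array([1, 0, 1, 1, 0, 0, 1, 0])
labels      = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Indices of the samples the model got wrong: these are the ones worth inspecting.
# Maybe they are genuinely hard, maybe they are bad samples, maybe you need more like them.
errors = np.where(predictions != labels)[0]
print("Misclassified samples:", errors)
```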
Well, that's the end of this episode. If you want to learn more, if you want to get started with machine learning and you're looking for learning resources, I would recommend our own machine learning classes: just go to aws.training/machinelearning, in one word, and you'll find a collection of machine learning classes, from beginner-level content to pretty advanced content. So keep an eye on that: aws.training/machinelearning. And of course, you can always read my blog and follow me on Twitter, and you'll get more content. Again, don't forget to subscribe to my channel and my podcast so that you won't miss future episodes. If you have questions or comments, I'm happy to read them. Please get in touch, and I'll see you around.
