Hello, thanks for the great presentation, but I was wondering: with this model building, will we be able to predict what amount the model suggests lending to a customer? Nothing has been said about that.
Good presentation. However, is this not just traditional statistics? I thought a key principle of machine learning is that the algorithm adjusts, refines, and becomes more predictive (it learns) based on new real-time data. Or perhaps ML is unsupervised learning, where the algorithm finds patterns in the data without a predetermined structure or y-variable target?
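The "learns from new real-time data" idea the comment describes is usually called online (incremental) learning. A minimal sketch, assuming scikit-learn is available and using synthetic streaming batches as a stand-in for real-time data (the batch sizes, feature rule, and model choice are all illustrative assumptions, not from the course):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# SGDClassifier supports partial_fit, so the model can be updated
# batch by batch as new data arrives, instead of refit from scratch.
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared up front for partial_fit

for batch in range(5):
    # Pretend each batch arrives later in time (streaming data);
    # the true rule here is a simple linear boundary, chosen for illustration.
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)  # incremental update

# After several batches, the model reflects everything seen so far.
X_new = rng.normal(size=(100, 4))
y_new = (X_new[:, 0] + 0.5 * X_new[:, 1] > 0).astype(int)
acc = clf.score(X_new, y_new)
print("accuracy on fresh data:", acc)
```

This is only one flavor of ML; batch-trained models (like the one in the presentation) are still machine learning even though they do not update continuously.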
To answer why the model is not built on the entire data and then split into training and validation: it is not rote memorization :( . The actual statistical reasons are four-fold.
1. If you have a big data set, randomly split it into 70% training and 30% validation; otherwise you can use the entire data to build the model and take a jackknife or bootstrapping approach for validation.
2. Chebyshev's inequality in statistics says that as the sample size increases, the robustness of the estimated parameters increases and their standard deviations stabilize. If your objective in building a model is only classification, this does not matter. But from a statistical standpoint, if you have to address the question "How accurate is the classification/decision based on this model?", then you have to perform inference: the parameter estimates, their means, and their standard deviations define the statistical distributions from which you arrive at the inference.
3. Stability of the model: for example, if you have 10 explanatory variables in the model, the distributions of these 10 variables in the training data set might differ from those in the validation data set and might not be exactly the same. Does the model still hold true in that scenario? That is what we are trying to address in the validation step.
4. If you do a stratified-sampling split (on the target variable) into 70% training and 30% validation, and the model deteriorates on the validation set after the stratification, that means your model is unstable: it is picking up noise as a predictive signal and actually misclassifying, so if you use this model, you end up with wrong business results.
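The stratified 70/30 split and the stability check in point 4 can be sketched with scikit-learn. This is a minimal illustration on a synthetic stand-in dataset (the dataset, model choice, and sizes are assumptions for demonstration, not the course's actual data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real data set (e.g. loan applications)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Stratified 70/30 split: class proportions are preserved in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

acc_tr = accuracy_score(y_tr, model.predict(X_tr))
acc_va = accuracy_score(y_va, model.predict(X_va))

# A large gap between the two accuracies suggests the model is unstable,
# i.e. picking up noise as a predictive signal (point 4 above).
print(f"train accuracy: {acc_tr:.3f}, validation accuracy: {acc_va:.3f}")
```

If the data set is too small for a split, `sklearn.model_selection.LeaveOneOut` and `sklearn.utils.resample` cover the jackknife and bootstrap alternatives mentioned in point 1.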
I think this is pretty much an analogy, and I quite liked it. You are jotting down technical definitions intended for practitioners. It is true: if you train the model on the entire dataset and then draw the test set as a fraction of that same dataset, the model is just replaying what it has learned from, not something that tests the model's learning, its flexibility, and whether that learning is worth extending to unseen instances.