The code and data for this video can be found as part of my Petrophysics & Python Series on Github: github.com/andymcdgeo/Petrophysics-Python-Series Direct Notebook Link: github.com/andymcdgeo/Petrophysics-Python-Series/blob/master/33%20-%20Auto%20Outlier%20Detection%20-%20Isolation%20Forest.ipynb Data Folder: github.com/andymcdgeo/Petrophysics-Python-Series/tree/master/Data
I haven't found a single video that explains what lines 8, 9 and 10 actually do. Some videos talk about trees but are too generic and don't give real examples in the nodes. Videos like this one show the code but don't explain how any of it relates to an actual tree or set of logic. How the heck are we getting there? Also, I don't think you showed an example row of data. Is all of the data numeric?
Removing outliers needs to be done with due consideration. The reason the points are outliers needs to be properly understood first, and then the appropriate course of action can be taken. I discuss multiple methods of dealing with outliers in my Medium article here: towardsdatascience.com/well-log-data-outlier-detection-with-machine-learning-a19cafc5ea37
Got a question: I have created a model using IF and fitted it with my training dataset, and now I want to apply this model to my test dataset. I don't really understand how I should picture this process of "fitting the IF model". When I set contamination to, let's say, 5%, my model calculates the anomaly scores of all values in the training dataset and assigns the value -1 to the 5% "most anomaly-like" data points, describing them as anomalies, right? After that, when I pass my test dataset to the model, does the model just reuse the structure of the IF trained on the training dataset to calculate the anomaly scores of the test data points, and then check whether any test data point's anomaly score exceeds the lowest anomaly score among those 5% "most anomaly-like" training data points? And if a test data point exceeds that score, is it then described as an anomaly?
Yes, that's correct! When you fit an Isolation Forest (IF) model to your training data, the model builds a number of randomly split isolation trees and uses them to calculate an anomaly score for each data point in the training set. With contamination set to 5%, the 5% of training points with the highest anomaly scores are treated as the "most anomaly-like" and are given a label of -1, and the score at that cut-off is stored as the threshold. When you apply the model to your test data, it reuses the same trees and the same calculation to score each test point, and any test point whose anomaly score is beyond that stored threshold is also given a label of -1. Because the threshold is fixed at fit time, the share of test points flagged does not have to be 5%: it can be higher or lower depending on how the test data compares with the training data, so the model can flag unusual points in the test set that look nothing like the anomalies in the training set. One thing to watch if you are using scikit-learn: its score_samples method returns the opposite of the anomaly score, so there a lower value means more anomalous. I hope this helps to clarify the process of fitting and applying an IF model to your data! Let me know if you have any other questions.
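For anyone who wants to see the fit-then-predict flow in code, here is a minimal sketch. It assumes scikit-learn's IsolationForest and uses made-up synthetic data (not the well-log data from the video); the variable names and the 5% contamination value are just for illustration. It shows that the threshold is stored at fit time (as offset_) and that predict on new data simply compares each point's score against it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data for illustration only (an assumption, not the video's dataset).
rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(1000, 2))
X_test = np.vstack([
    rng.normal(0, 1, size=(200, 2)),   # points similar to the training data
    [[8.0, 8.0], [-9.0, 7.0]],         # two points placed far from the rest
])

# Fitting builds the trees and stores the score threshold in offset_.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(X_train)

# Roughly 5% of the training points fall below the stored threshold.
train_labels = model.predict(X_train)
train_fraction_flagged = np.mean(train_labels == -1)

# On new data the same trees are reused. In scikit-learn, lower scores
# mean "more anomalous", and decision_function = score_samples - offset_.
test_scores = model.score_samples(X_test)
test_decision = model.decision_function(X_test)
test_labels = model.predict(X_test)

# predict() is equivalent to checking the sign of decision_function.
manual_labels = np.where(test_decision < 0, -1, 1)

print("Stored threshold (offset_):", model.offset_)
print("Fraction of training points flagged:", train_fraction_flagged)
print("Fraction of test points flagged:", np.mean(test_labels == -1))
print("Manual labels match predict():", np.array_equal(manual_labels, test_labels))
```

Note that the fraction flagged in the test set is whatever falls below the training threshold, so it will usually not be exactly 5%.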