I like that as the founder of Dataquest, you yourself are providing the tutorial (as opposed to hiring someone). Also, thanks for offering the free access to educators and students.
Does this model completely ignore who the opponent is?! From what I see, the features used are: a) general match features (time of the game, home/away) and b) rolling averages for one team. As a result, the program tries to predict the outcome of the game while completely ignoring who the opponent is. It will come up with a prediction based purely on general match factors and the past performance of one team, ignoring the specific opponent's features. I.e., for an Arsenal game it will give me the same result retrospectively whether Arsenal plays the first or the last team in the table. Do I get it right? If so, how can it make sense?
I also see it this way. I believe it would be predicting based on the form that both teams are in, looking at their last 3 games. So if a team like Bournemouth had a three-game stretch against Burnley, Luton, and Sheffield and won all three, and City played against Arsenal, Liverpool, and Villa and won 2 of 3, I think the model would predict Bournemouth to beat City. I could be wrong, though, and the random forest model might accumulate the strengths of teams across the multiple branches in its decision trees. I am not the most familiar with this model. I would love for someone to correct me if I'm wrong.
This is a very cool project! I ran it across 7 leagues and it is interesting how the same set of predictors get very different results. In England and France, it does pretty well but in Brazil and Japan, not so much.
I have a question. In the predictors you have opp_code for the opponent but no code for the actual team (it could be called team_code, for example). Why is this not necessary?
I really love this work. I will try with 10 seasons and make my train set 70% of the dataset and my test set 30%. But I want to ask: after all is done, how do I predict specific upcoming matches? I plan on adding the upcoming games I want to predict to the test part and then predicting from there.
I'm trying to predict who will win the NHL championship, their divisions, and the rest of the regular season. I will be using machine learning, and I'm using Colab. I need help with this project. Any takers? Any and all help would be appreciated!
Inspiring, well done! Can you use the gf and ga columns directly in your predictors without using the rolling_averages function? Now imagine you get a very good prediction algorithm: after you save the model, how do you use it to predict games in the next season? Can you give me a clue? For example, how would you predict one game in the 2022-2023 season? Thank you.
You have to use rolling averages because when you try to predict the outcome of a match (before it has started) you won't know gf and ga yet. But we do know the average gf and ga of the last 3 games the team has played. The model can be used for new seasons, but the problem is data. You will have to gather data about games played after this video. That is the tricky part, but he also made a video before this one about web scraping (getting new data directly from the web). Or maybe you can find an updated dataset online (maybe on Kaggle). From my experience, the datasets you find online won't have the more detailed match statistics, so it would be best to scrape the data yourself.
Thanks Vic, I tried to run the rolling average function but it gives me this error: ValueError: closed only implemented for datetimelike and offset based windows
I'm generally opposed to the idea of using an AI/ML model for the EPL or any other sport, but the concept can definitely be reused in multiple business cases. Great job mate!
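A possible workaround for the error above, offered as a sketch rather than the video's own code: as far as I know, older pandas releases only accept `closed` on datetime-like windows, so upgrading pandas may be enough. If upgrading is not an option, the same "previous three matches only" average can be written with `shift(1)`. The function name `rolling_averages_shifted` and the toy numbers are made up for illustration; the column names follow the video.

```python
import pandas as pd

def rolling_averages_shifted(group, cols, new_cols):
    """Average of the previous 3 matches, excluding the current one."""
    group = group.sort_values("date").copy()
    # shift(1) moves every stat down one row, so the window that ends on a
    # match only ever contains matches played before it.
    shifted = group[cols].shift(1).rolling(3).mean()
    for col, new_col in zip(cols, new_cols):
        group[new_col] = shifted[col]
    # The first three matches have fewer than 3 earlier games, so they are NaN.
    return group.dropna(subset=new_cols)

# Toy data for a single team (hypothetical numbers).
team = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-08", "2022-01-15",
                            "2022-01-22", "2022-01-29"]),
    "gf": [1, 2, 3, 4, 5],
    "ga": [0, 0, 3, 1, 2],
})
result = rolling_averages_shifted(team, ["gf", "ga"], ["gf_rolling", "ga_rolling"])
print(result)
```

Only the last two matches survive here, because each of the first three has fewer than three earlier games to average.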
Great video thanks. But I was wondering how do you get the model to predict the upcoming football matches. Let's say Manchester United vs Liverpool etc.
Why do you use RandomForestClassifier for this? Is it superior in some way for this application compared to other machine learning models, e.g. KNN, ANN, etc.?
This is a very great video, but I don't understand exactly how to predict individual matches. What parameters should I pass to rf.predict(), and how, if I want the outcome of a single match?
Nice video Vic, I've learned a lot from your videos recently. My only criticism is that some viewers may feel they can generate positive returns just because the model is right more than 50 or 60 percent of the time. It would be better to predict the probability of winning, because the betting reward is based on probability. Say we predict that a team wins with 70 percent probability: if the odds pay out less than that probability justifies (decimal odds below 1/0.7 ≈ 1.43), we are going to lose on average, even though our model was right. The reason the model is able to predict with better than 50 percent accuracy is that some teams are better than others, and the betting odds already reflect that. One can scrape the odds as well and do the analysis, but I believe the betting companies already use AI to set the initial odds. There will be opportunities when the odds differ substantially from a good predictive model.
Yeah, it's basically a massive nothing burger. You'll still lose money, and if by some miracle you can model it well, then your bookie will back you off before you make any money!
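To make the betting arithmetic in the comment above concrete, here is a small worked example. It assumes decimal odds (a winning bet returns stake × odds) and that the model's 70 percent estimate is accurate. The function names are made up for illustration.

```python
def expected_profit(p_win, decimal_odds, stake=1.0):
    """Average profit per bet: gain (odds - 1) * stake when the team wins,
    lose the whole stake otherwise."""
    return p_win * (decimal_odds - 1.0) * stake - (1.0 - p_win) * stake

def break_even_odds(p_win):
    """Decimal odds at which the expected profit is exactly zero."""
    return 1.0 / p_win

p = 0.70
print(break_even_odds(p))        # roughly 1.43
print(expected_profit(p, 1.30))  # negative: the favourite wins often, but pays too little
print(expected_profit(p, 1.60))  # positive: the odds underrate the team
```

So a model that picks winners 70 percent of the time still loses money whenever the offered odds are below the break-even value.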
Hi, I have enjoyed watching your demonstration of predicting the EPL game results. However, the predicted results don't reflect the actual results. So my question is, how can I predict more accurate results, and how can I train the dataset. Looking forward to hearing your reply.
Hi, thanks for the awesome video. I had one doubt (it might be stupid). The aim of the model is to predict the winner of a match between two teams (suppose team A vs team B). But when training the model on a single match result, we are only giving it the stats for the home team (A). Wouldn't it make more sense to add the stats for team B in the same row as well, and then ask it to make the prediction?
Hi, thanks for the great video. Why didn't you involve "team" as a predictor in each model as you've used opponent team information? Doesn't this miss the relationship between team A vs team B and so on?
Hi Yigit - great question. You are welcome to try it with team and measure error. The reason I didn't use it is because using a column like that can have a tendency to overfit. Some teams have performed really well in the last few seasons, but that doesn't necessarily mean they'll perform well in the future.
This is by far the best and most practical video on football predictions I've seen online, very well explained and actually leaves you with something useful afterward. Great work!
When creating the new columns using rolling_averages, we lost the first few games of the season when we dropped na rows. We also carried rolling averages into other seasons. How do we fix this?
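One possible approach to the question above, as a sketch only (it is not from the video): group by both team and season so that averages never cross a season boundary, and pass `min_periods=1` so early-season matches keep an average over however many earlier games exist. The `season` column and the toy numbers are assumptions.

```python
import pandas as pd

matches = pd.DataFrame({
    "team":   ["Arsenal"] * 6,
    "season": [2021, 2021, 2021, 2022, 2022, 2022],
    "date":   pd.to_datetime(["2021-05-01", "2021-05-08", "2021-05-15",
                              "2021-08-14", "2021-08-21", "2021-08-28"]),
    "gf":     [3, 3, 3, 0, 1, 2],
})
matches = matches.sort_values(["team", "season", "date"]).reset_index(drop=True)

# Within each (team, season): shift so a match never sees its own result,
# then average up to 3 earlier matches from the same season.
matches["gf_rolling"] = (
    matches.groupby(["team", "season"])["gf"]
    .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
print(matches)
```

The first match of each season still has no earlier games and stays NaN, so it would need to be dropped or filled some other way (for example with a value carried over from the previous season).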
I just started learning Python and machine learning. I started learning from your tutorials, and they are making me better at data science day by day. Keep it up. You are the best online teacher.
Hi! What if I have all the data in a .txt file, in one column, with separate rows? How do I turn that into a DataFrame? Example: FT Greece 3 - 0 Italy Sunday 12/04/2008 FT France 1 - 2 England ....and so on
Hey @Dataquest, amazing content. I created the algorithm to predict games using your tutorial. Now I'm asking what I have to do to make it predict future games, since I noticed that, of course, it predicted past games. Could you tell me? Thanks!
I was looking to extend this, but there would be a problem with extending the data. The one problem with these types of predictive models is that there are financial takeovers, financial problems, key players coming and going, player injuries, etc. For example, the massive spending on the Chelsea squad, followed by them actually doing worse, is something that an AI most likely would not be able to predict.
Why do I get this error, `TypeError: list indices must be integers or slices, not list`, after I write this code: `rf.fit(train[predictors],train['target'])`? Thanks
Hi Jamshid - `train` should be a DataFrame, but it looks like you might have it stored as a list. The full code is here if you want to compare - github.com/dataquestio/project-walkthroughs/blob/master/football_matches/prediction.ipynb .
@xsquirrel7091 Hi, thank you very much. I have already defined "predictors" as a variable to choose the column names, like this: `predictors = ['venue_code','opp_code','hour','day_code']`.
Yes, you are right. I passed `train` and `test` as lists, not DataFrames: `train = [matches[matches["date"] < '2022-01-01']]` and `test = [matches[matches["date"] > '2022-01-01']]`. It should be like this: `train = matches[matches["date"] < '2022-01-01']` and `test = matches[matches["date"] > '2022-01-01']`.
Hi Vikas. Very nice tutorial. I was able to code along, and it was my first ML project. It seems awesome how the computer predicts stuff like this. I have a question: we have our training and testing datasets, right? How can we ask the algorithm to predict an event that is not in the training data? For example, let's say I have a CSV of next weekend's matches. How can I ask the algorithm to try to predict the winner? Sorry if it seems a silly question, but I couldn't find a clearer way to ask. Thanks and well done once again!
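The difference described above can be reproduced on a toy frame. This is a minimal sketch with made-up data; only the column names and the cutoff date come from the thread.

```python
import pandas as pd

matches = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-28", "2022-01-15"]),
    "venue_code": [0, 1],
    "target": [1, 0],
})
predictors = ["venue_code"]

wrapped = [matches[matches["date"] < "2022-01-01"]]  # a list holding one DataFrame
train = matches[matches["date"] < "2022-01-01"]      # a DataFrame

try:
    wrapped[predictors]          # indexing a list with a list of column names
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

print(train[predictors])         # works: column selection on a DataFrame
```

The extra square brackets are the whole problem: they wrap the filtered DataFrame in a Python list, and a list cannot be indexed by column names.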
Hi Alexandre - you'd basically put the information for next weekend's matches (opponent code, venue code, rolling averages, etc) into a new testing set, and then make predictions on that set.
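A minimal sketch of what that could look like, assuming scikit-learn is installed and using made-up numbers. The predictor names and the RandomForestClassifier settings follow the video; the `upcoming` frame is hypothetical. The rolling-average columns are left out to keep it short, but they would be added to `upcoming` in the same way.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

predictors = ["venue_code", "opp_code", "hour", "day_code"]

# Matches that have already been played, with known outcomes (toy data).
played = pd.DataFrame({
    "venue_code": [1, 0, 1, 0, 1, 0],
    "opp_code":   [3, 7, 2, 5, 7, 3],
    "hour":       [15, 17, 12, 20, 15, 17],
    "day_code":   [5, 6, 5, 0, 6, 5],
    "target":     [1, 0, 1, 0, 1, 0],
})

rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)
rf.fit(played[predictors], played["target"])

# Next weekend's fixtures: the same predictor columns, but no target yet.
upcoming = pd.DataFrame({
    "venue_code": [1, 0],
    "opp_code":   [5, 2],
    "hour":       [15, 20],
    "day_code":   [5, 6],
})
upcoming["predicted_win"] = rf.predict(upcoming[predictors])
print(upcoming)
```

There is nothing to score these predictions against until the matches are played, so no precision can be computed for this set.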
Excellent video with very good information. Just one question: how would you predict the result for a given team in the next round, matchday, or match? ... thanks!
Great tutorial! Do you have any advice for future matches - what values should I add to the data in my CSV file in a situation when I want to predict the results of future matches? I mean the values that we do not know yet, such as distance, shots on target, etc. All test data in the video have these data supplemented, so I wonder what to put in these "empty" columns. Thank you.
Hi there - distance, shots on target, etc, are only looked at for prior matches. If you're trying to predict future matches, you would use the rolling average of those columns from previous matches (this is what the video shows).
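As a rough illustration of the reply above, the features for an upcoming fixture can be built from the team's last three played matches. This is a sketch with toy numbers. The `sh` and `sot` column names follow the video, and `upcoming_features` is a made-up name.

```python
import pandas as pd

# Matches already played by one team (toy data).
played = pd.DataFrame({
    "team": ["Arsenal"] * 4,
    "date": pd.to_datetime(["2022-04-01", "2022-04-08", "2022-04-15", "2022-04-22"]),
    "sh":   [10, 14, 12, 16],   # shots
    "sot":  [3, 6, 3, 6],       # shots on target
})
cols = ["sh", "sot"]

# The upcoming match has no stats of its own, so use the mean of the
# three most recent played matches as its rolling features.
last_three = played.sort_values("date").tail(3)
upcoming_features = last_three[cols].mean().rename(lambda c: f"{c}_rolling")
print(upcoming_features)
```

These values go into the `sh_rolling` and `sot_rolling` columns of the row for the future match. The raw `sh` and `sot` columns for that row stay empty and are not used as predictors.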
What you did with the rolling averages was impressive. Is there such a thing as an ML algorithm that creates such features for you? I.e., one that randomly multiplies or divides one column by another, or builds rolling averages or other random features, to create new features?
Hi, I have a question. Everything was built without taking into consideration the matches that still have to be played, so there is no real prediction of future matches, only of those already played. Is that correct?
And then behind the scenes corruption happens that causes players to matchfix/throw/lie on the ground for excessive amounts of time and all your betting money is gone.
I just finished writing this out and for the most part it works, except for this line: `combined, error = make_predictions(matches_rolling, predictors + new_cols)`, which raises `ValueError: Found array with 0 sample(s) (shape=(0, 12)) while a minimum of 1 is required`. This line is giving me trouble both in the version I wrote by hand and when I copy and paste your program. I've looked through the code and some forums, but nothing seems to be wrong. I think it could be a version issue, in that the way to write this has changed over time and this form is now outdated. I'm not sure what the issue is, so if someone could help me out that would be great. I'm planning to use this as an American football predictor to see if the program can predict which team will win. I'm doing it primarily because of my cousin and his fondness for fantasy football. It got me a little interested in the sport, but I figured I'd create a model to make things a little more fun for me.
Hi Vikas! When doing the rolling part I'm facing an issue that says: "closed only implemented for datetimelike and offset based windows" You know what can be the problem? Thank you!
I'm new to this. I was asking how one can get predictions out of the machine learning model. I'm stuck at the combined precision stage and can't find a way of extracting future predictions. Any help will be highly appreciated.
It's up to you. You could make this a 3-class classification problem, and code loss as 0, tie as 1, win as 2. You can also do what's done in the video, and code a tie as a loss.
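Both encodings from the reply above can be sketched in a few lines. This assumes a `result` column holding "W", "D", and "L", which is how I understand the video's data to be laid out. The `target_3class` and `target_binary` names are made up for illustration.

```python
import pandas as pd

matches = pd.DataFrame({"result": ["W", "D", "L", "W"]})

# Option 1: three classes (loss = 0, tie = 1, win = 2).
matches["target_3class"] = matches["result"].map({"L": 0, "D": 1, "W": 2})

# Option 2: binary target, where a tie is coded the same as a loss.
matches["target_binary"] = (matches["result"] == "W").astype("int")

print(matches)
```

With the three-class target, a metric such as precision has to be computed per class or averaged across classes, so the numbers are not directly comparable to the binary version.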
I have seen a lot of other people ask this in the comments, but there hasn't really been a solid reply... how can you apply this to predict the results of matches that haven't occurred yet? Because this is all well and good to split the data into parts that the ML algorithm sees and does not see, but it is pretty useless when applying it to life because we already know the result of that game that occurred, even if the ML doesn't. Could someone either explain to me what I am missing, or suggest the next steps for predicting matches of which there is limited data recorded already?
I got an error like this after writing the code below. Can you please explain how to resolve it? `preds = rf.predict(test[predictors])` raises `NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.`
My English is weak, so I'm using ChatGPT for translation. Can we combine data from different websites to create a CSV file and analyze it to increase our chances of winning? For example, we could gather match data and odds from Flashscore, voting results for each match from Oddsportal, and win/loss probabilities from Tablesleague. Then we could use artificial intelligence to create a prediction program. Would you be interested in this?
Big thanks for this video! Helped me a lot! Tried this method on my project with soccer data analysis and everything went fine until this function: "def make_predictions(data, predictors):". Got KeyError: "["rolling_cols"] not in index". Any advice on solving this issue? Thanks in advance!
Interesting. I was running a similar model on football matches, except that I had rolling attributes of both teams as the predictors and the class was home_win, draw, away_win. A match is included only once. However I think your approach might be better.
Did this using logistic regression with binary classification and achieved a 70% precision. Used different parameters for training the model though. Also had to put the sleep time to 10 seconds when scraping to avoid 429 HTTP response.
Hi, 1. Kindly suggest a roadmap for me to adequately comprehend this project. I have no experience in the field and no programming background. 2. How do I run this project in the meantime while I build up my skills? Awesome tutorial. Got yourself a believer.
I would recommend following the data scientist path at Dataquest - www.dataquest.io/path/data-scientist/ . This will help you learn all of the skills (including programming) to build this model.
Hi Vik! I am learning so much through this video and decided to try adapting it to NBA data too :). I am running into an issue when merging the combined dataframe with left_on = game_date, team and right_on = game_date, opponent: my new merged table is blank. My theory is that despite my data having the same 3-letter abbreviations for the teams (LAL, WAS, CHI, etc.) in both the team and opponent columns, Python is saying they aren't the same and is not joining the tables. They are both 'object' data types (if that matters...). Any recommendations on how I can make them identical? Thank you!
Hi Adrian - do you actually have data from both sides of the match? For example, if LAL played WAS, you would need a row where WAS is the team and LAL is the opponent, and a row where LAL is the team and WAS is the opponent for the same game day. If you don't have this, you would need to create those rows (by duplicating the dataframe then swapping team and opponent) before merging.
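The duplicate-and-swap step described above could look roughly like this. It is a sketch on toy NBA-style data. The `pts` and `opp_pts` columns and the `_opp_view` suffix are assumptions, and the whitespace stripping is an extra precaution that the reply does not mention.

```python
import pandas as pd

# One row per game, recorded from one side only (toy data).
games = pd.DataFrame({
    "game_date": pd.to_datetime(["2023-01-05", "2023-01-06"]),
    "team": ["LAL", "CHI"],
    "opponent": ["WAS", "LAL"],
    "pts": [110, 98],
    "opp_pts": [102, 105],
})
keys = dict(left_on=["game_date", "team"], right_on=["game_date", "opponent"])

# Merging the one-sided data with itself finds no partner rows.
blank = games.merge(games, suffixes=("", "_opp_view"), **keys)

# Mirror each row: swap team/opponent together with the matching stat columns.
mirrored = games.rename(columns={
    "team": "opponent", "opponent": "team",
    "pts": "opp_pts", "opp_pts": "pts",
})
both_sides = pd.concat([games, mirrored[games.columns]], ignore_index=True)

# Strip stray whitespace so "LAL " and "LAL" compare equal.
for col in ["team", "opponent"]:
    both_sides[col] = both_sides[col].str.strip()

merged = both_sides.merge(both_sides, suffixes=("", "_opp_view"), **keys)
print(len(blank), len(merged))
```

Each game now appears once from each team's point of view, so every row finds exactly one partner row in the merge.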
Why are we only looking at matches that have been played? I mean, I understand it for the learning part and the backtesting, but the machine hasn't actually predicted a match that hasn't been played, from the date of the video going forward. That would have been useful. Do we just have to add these upcoming matches to matches.csv? That is what I am trying to do, but it is pretty tough for a beginner like me. I will push harder and hopefully find a solution. Thank you for the video and the great explanations.
When we merge the 'matches' with 'shooting', we basically get rid of all the future matches. I should probably keep the not-played matches in the list somehow with NaN values under shooting?
If you want to predict future matches, you can just feed them into the prediction methods. The reason we remove the rows where matches haven't been played is because we can only use data for training if we know the outcome. But once we train a model, you can feed that data in to get future predictions (the same way we feed in the test set).
Hi Liam - thanks for the suggestion. What you need to do is pass in future data to the predict methods, the same way we're passing in the test set now. I can look into making a video.
@Dataquestio after asking this question, I actually gave it a go myself, but unless I add future data to my test data, I'm unsure how to do it, and the accuracy is way off for me :D
Hello, my question is: how would you deal with predicting results for newly promoted teams, especially teams that are promoted for the first time in a very long time?
This is a tricky one. You could build a separate model to predict how well a team will do in the first season after promotion based on lower league results.
Thank you greatly, this has been extremely helpful. I ran into a KeyError issue when running make_predictions telling me that all of the rolling columns were not in index (gf_rolling,..). Do you have an idea as to why this is happening? I followed the code exactly, so I'm not sure what is causing this... If I remove "+ new_cols" when calling the function it works fine. Thanks again
Hi Eric- this would happen if the new columns aren't in the matches_rolling dataframe. This is the code that adds the columns - "matches_rolling = matches.groupby("team").apply(lambda x: rolling_averages(x, cols, new_cols))"
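A quick way to check for this before calling make_predictions, as a sketch: list which of the requested columns are absent from the DataFrame. The helper name `missing_columns` and the toy frame are made up. The `cols`, `new_cols`, and `predictors` names follow the video.

```python
import pandas as pd

cols = ["gf", "ga"]
new_cols = [f"{c}_rolling" for c in cols]
predictors = ["venue_code", "opp_code"]

# Stand-in for a frame where the rolling step was skipped, or where its
# result was never assigned back to matches_rolling.
matches_rolling = pd.DataFrame({
    "venue_code": [1, 0], "opp_code": [3, 7], "gf": [2, 1], "ga": [0, 1],
})

def missing_columns(frame, wanted):
    """Return the wanted column names that the frame does not contain."""
    return [c for c in wanted if c not in frame.columns]

missing = missing_columns(matches_rolling, predictors + new_cols)
print(missing)   # the rolling columns are absent, so selecting them would raise KeyError
```

An empty list means every predictor is present. Anything else points at the step that should have created those columns.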
At line 30, on the 17:49 mark, when we run, preds = rf.predict(test[predictors]) , I get a ValueError, "ValueError: Found array with 0 sample(s) (shape=(0, 4)) while a minimum of 1 is required." Is anyone running into a similar issue?
@Dataquestio what about line 58? I get `ValueError: Found array with 0 sample(s) (shape=(0, 12)) while a minimum of 1 is required`. What can I do to fix this? I typed everything in correctly, and I even did it 5 times, and it gives the same result.
Hi, I used your scraping code to collect the model data and coded "result" as the target (`astype("category").cat.codes`). The accuracy became lower, as I assumed it would, so I used RandomizedSearchCV to see if there would be an improvement. Then I added "gf" and "ga" as predictors. With the same parameters except `criterion="entropy"`, using sklearn's classification_report I got an accuracy of 0.98, with f1_score >= 0.95 and precision >= 0.93 for each of the target values (0, 1, 2). However, I don't know much about football, so maybe I used features that are only observable after the game. Anyway, I wanted to say thank you.
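One guess at the "0 sample(s)" errors in this thread, offered as a sketch and not a confirmed diagnosis: if the scraped data covers different seasons than the video's, the hard-coded `'2022-01-01'` cutoff can leave the train or test split empty. Printing the sizes of both splits shows whether that is the case, and the cutoff can then be derived from the data. The toy dates and the 75% split point are assumptions.

```python
import pandas as pd

# Toy data scraped well after the video: every match is later than 2022-01-01.
matches = pd.DataFrame({
    "date": pd.to_datetime(["2023-08-12", "2023-09-02", "2024-02-10", "2024-03-16"]),
    "target": [1, 0, 1, 0],
})

fixed_cutoff = "2022-01-01"
train_fixed = matches[matches["date"] < fixed_cutoff]
test_fixed = matches[matches["date"] > fixed_cutoff]
print(len(train_fixed), len(test_fixed))   # one side is empty

# Alternative: put the cutoff at the date three quarters of the way through the data.
sorted_dates = matches["date"].sort_values().reset_index(drop=True)
data_cutoff = sorted_dates.iloc[int(len(sorted_dates) * 0.75)]
train = matches[matches["date"] < data_cutoff]
test = matches[matches["date"] >= data_cutoff]
print(len(train), len(test))
```

If both splits already contain rows, the cause is somewhere else. One candidate is a `dropna` call that removes every row because a column is entirely NaN.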
You don't want to use `gf` and `ga` as predictors. Because you won't know these until the match is over and you already know the winner. That's why your accuracy is so high - because the model is being fed the answer.
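A tiny demonstration of why this leaks the answer, using made-up scores: with `gf` and `ga` available, a one-line rule reproduces the win/no-win target without any model at all. The `result` values assume the W/D/L coding from the video's data.

```python
import pandas as pd

matches = pd.DataFrame({
    "gf":     [2, 0, 1, 3, 1],
    "ga":     [1, 0, 2, 0, 1],
    "result": ["W", "D", "L", "W", "D"],
})
matches["target"] = (matches["result"] == "W").astype("int")

# No learning needed: the final score already determines the result.
leaky_prediction = (matches["gf"] > matches["ga"]).astype("int")
accuracy = (leaky_prediction == matches["target"]).mean()
print(accuracy)
```

A model given these columns learns this rule rather than anything about form, which is why the score looks so good and tells you nothing about future matches.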
Hi! I don't think I understand how you can use the rolling average columns in the prediction dataset. You wouldn't have that information until after the match is finished, right? So how can those columns be used in the prediction dataset? Many thanks for your great videos and content! Well explained and very educational.
This video came at the right time. I was trying to figure out how to get rolling averages for a dataframe, especially that part with the 'left' argument. Thanks so much.