
Anti-Learning (So Bad, it's Good) - Computerphile 

Computerphile
204K views

How getting something completely wrong can actually help you out. Professor Uwe Aickelin explains anti-learning.
Could We Ban Encryption?: • Could We Ban Encryptio...
XOR and the Half Adder: • XOR & the Half Adder -...
The Singularity & Friendly AI: • The Singularity & Frie...
Machine Learning Methods: • Machine Learning Metho...
/ computerphile
/ computer_phile
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Published: 31 Jul 2024

Comments: 299
@robmckennie4203 · 8 years ago
Doesn't actually explain anti-learning. Explains how they thought of anti-learning, sure, but the closest we got to an actual explanation was "and then I flip it and it works"
@eugen9611 · 8 years ago
+Rob Mckennie If your algorithm gives you "1" as a solution, you take "0" instead. If it gives you the correct solution 3/10 times, it will now give you the wrong solution only 3/10 times.
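The flip this comment describes can be sketched in a few lines of Python (an illustrative toy, not code from the video; the "bad" classifier and the data are invented):

```python
# If a binary classifier is wrong more often than right,
# inverting its output turns its error rate into its accuracy.

def flip(predict):
    """Wrap a 0/1 classifier so every prediction is inverted."""
    return lambda x: 1 - predict(x)

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

# A toy classifier that is right only 3 times out of 10.
data = [(i, 1) for i in range(10)]     # true label is always 1
bad = lambda x: 1 if x < 3 else 0      # correct only for x = 0, 1, 2

assert accuracy(bad, data) == 0.3
assert accuracy(flip(bad), data) == 0.7
```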
@robmckennie4203 · 8 years ago
TheBloodPainter okay cool? Still should have been in the video.
@matthewjohnson3037 · 8 years ago
Agreed. I still liked the video though. I wanted to know how "flipping it" applied to their data set
@robmckennie4203 · 8 years ago
matsv201 I'm not asking someone to explain this for me, I'm criticising the video for not including it. Also, don't call me Pimply
@AndorianBlues · 8 years ago
Reverse your lack of understanding, then you'll understand it.
@ffanatic13 · 8 years ago
Reminds me of a joke from business school a professor once said. "If you find a financial advisor who consistently gets it wrong, hire him! Do exactly the opposite of what he says and you will get rich. The dangerous ones are the ones who are 50% right and 50% wrong"
@danielgmjr · 3 years ago
I always use an analogy like this to explain why 50/50 is always the worst case scenario, when it comes to investing.
@jurusco · 3 years ago
I'm so unlucky that I'll hire the bad advisor to do the opposite, and then he'll suddenly get better exactly when I'm doing something really important with his reversed advice.
@pureatheistic · 2 years ago
There are several things for which precision is valued over accuracy.
@skit555 · 8 years ago
Where's the part where he explains the method? It's like having a book made of an introduction only :/
@Nuriyasov · 8 years ago
+B Skit Shogun by James Clavell
@skit555 · 8 years ago
Timur Nuriyasov I guess a book with only introduction exists :p Thx
@ZachBora · 8 years ago
You forgot to talk about anti-learning...
@lmiddleman · 8 years ago
This is the plot to every Star Trek NG episode. "Mr. Crusher, reverse the shield polarity." "It's working!"
@amihartz · 8 years ago
+lmiddleman I still hate that kid.
@MouseGoat · 8 years ago
+Amelia Hartman that kid, or all kids?
@trucid2 · 8 years ago
+Amelia Hartman He's on reddit now.
@Kuchenrolle · 8 years ago
As many others have pointed out in the comments, this video is quite uninformative as to what "Anti-Learning" is actually supposed to be. So I've looked at a couple of the papers by Aickelin and Roadknight, and even though those are not particularly clear on what they're doing either, it seems like things are being misrepresented here.

First of all, what they call "Anti-Learning" is not the reversal of the classifier's predictions, but the phenomenon that a classifier achieves less than chance performance on the test set (or sets, as they are cross-validating). They claim that, instead of being a sign of overfitting to the training data, this situation arises because the structure of the population is such that many non-similar cases are summarized under a single label, so that the population is bound to be misrepresented in the sample - similar to an XOR situation where one of the four combinations (00, 01, 10, 11) is missing from the sample, which necessarily leads to an incorrect classification of new data points that show said combination.

They substantiate their distinction of anti-learning from overfitting by showing how, for a learnable data set, a neural network's performance on the training AND test set increases with an increase in the flexibility of the model (more hidden nodes), but only up to the point where the test performance starts decreasing again (as the model starts overfitting), while for a data set that results in anti-learning, the test performance stays below 50% throughout, despite an increase in training set performance. (And for a random data set the performance on the test set stays around 50%.)

(I don't really find this convincing. The classifier picks up on distinctions suggested by the sample that don't hold up in the general population - that's overfitting to me, and this is just a particular case where the sample is systematically unrepresentative.)
The classifiers they tried were not only linear: they used Bayesian networks, logistic regression, naive Bayes, classification trees, MLPs and SVMs (though I don't understand why this list doesn't include KNN), all of which performed poorly when trained on sets of 35, 45 or 55 features. As stages I and IV could be classified reliably, their analyses focus on distinguishing stages II and III - so it's a binary classification problem. This is relevant, because their trick of flipping the classification wouldn't work otherwise. Reversing the predictions only makes sense when the reverse is actually specified (0 instead of 1 and vice versa), which it wouldn't be were there four classes.

Lastly, I don't see where the 78% accuracy he reports comes from. From their 2015 paper, all I see is the accuracy they get when they use an ensemble of different classifiers (half of them are trained to perform well, while the other half is trained to perform terribly and gets reversed) and - and this is what he really should have mentioned if that's what he is talking about - they only get this higher accuracy in a subset of the cases, namely those where the respective ensemble agrees. So they get the highest accuracy (~90%) for those cases where all six classifiers give the same label, but that is also a very small subset of the sample (29 data points).
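The XOR situation this comment describes - a missing combination forcing below-chance generalization - can be illustrated with a toy leave-one-out experiment (my own sketch, not the authors' code):

```python
# Train a 1-nearest-neighbour classifier on three of the four XOR
# cases; it necessarily misclassifies the held-out fourth, so
# cross-validated accuracy is 0%, i.e. far below chance.

def xor_label(p):
    return p[0] ^ p[1]

def nn_predict(train, q):
    # label of the closest training point (squared Euclidean distance)
    return min(train, key=lambda p: (p[0] - q[0])**2 + (p[1] - q[1])**2)[2]

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
wrong = 0
for held_out in points:                  # leave-one-out over the 4 cases
    train = [(x, y, x ^ y) for (x, y) in points if (x, y) != held_out]
    if nn_predict(train, held_out) != xor_label(held_out):
        wrong += 1

assert wrong == 4   # every held-out case is misclassified
```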
@HDRUSKYBOOM · 8 years ago
Thanks for that!
@gcgrabodan · 8 years ago
indeed, thanks.
@poiuytrewq1553 · 2 years ago
Thanks for the information 👍
@alpw1234 · 2 years ago
@Kuchenrolle one of the best comments i have ever seen on youtube
@Qbe_Root · 8 years ago
I expected an explanation of anti-learning, and all I got was “Just put the hay in the apple and eat the candle”.
@RSP13 · 8 years ago
The argument at the ending was so Bad, it was Good.
@jasondalton6111 · 8 years ago
Ugh. Video should be called "Computer science professor rediscovers nonlinear statistics and gets disproportionately excited about it."
@alpertokcan9899 · 6 years ago
Jason Dalton lol
@atomheartother · 8 years ago
... that was such an odd video, even as someone who has a fairly good grasp on XOR and logic and such, I'm left pretty confused by the whole thing. He took the wrong answer and he just reversed it? what?
@TheSpacecraftX · 8 years ago
+Gijs Schenk Yeah, how can you reverse it when there are more than 2 possible states? He said there were 4.
@atomheartother · 8 years ago
***** Har-har.
@ThaPartyyBoy · 8 years ago
+atomheartother In case there are only two possible answers (sick or not sick), you could consider this: Previously they were getting about a 45% accuracy, which is obviously lower than choosing at random. Now if it is only 45% accurate, you could reverse the result such that there is a 55% accuracy. Since previously it was wrong 55% of the time. If they are then able to make the AI perform even worse, the final accuracy after reverting the result will be higher. Might not be this simple if there are multiple possible answers though, except if there is some clear logical structure between all the results. Note that the amount of possible states does not equal the amount of possible results, as I've seen in some comments. For example, XOR has 4 possible states yet only has two possible results (0 or 1).
@erikm8710 · 8 years ago
So I work in this field... here's the gist of anti-learning using a football analogy. That's 'American football' to our UK friends.

Imagine that you have a friend, let's call him Alan, who grew up in New York and was a huge Bills fan. Your friend never played football in school, doesn't know anything about football strategy, and growing up he only watched games that the Bills were in. Your friend (slowly) notices that the Bills have a great running back, and that when they play against a team with a weak run defense, the Bills usually win. When they play against a team with a good run defense, they usually lose. Using this simple rule, he is able to correctly predict the outcome of Bills games 80% of the time. This is the modeling equivalent of getting a high accuracy score on the 'training' data set.

Your friend Alan then moves to Massachusetts and starts watching Patriots games. He applies the same theory about running backs and the run defenses of opposing teams to try and predict the outcome of Patriots games. Because the Patriots rely heavily on the passing game, your friend's predictions are certifiably terrible, and he is only able to predict the outcomes of Patriots games correctly 30% of the time. This is the modeling equivalent of getting a low accuracy score on the 'test' (out-of-sample) data set.

Because of his incomplete picture of football dynamics, you notice that your friend has become very good at incorrectly predicting the winner of any football game in the NFL, because today the NFL is predominantly a passing league. In fact, for any given football game, your friend is wrong 80% of the time. Still, your friend Alan clings to his theory about running backs and run defense. Now it's the Superbowl, and you call up your friend and ask him his opinion of who is going to win. He gives you his opinion, and you bet on the other team. That is the gist of anti-learning.
In machine learning, anti-learning just means that the model is drawing incorrect conclusions from the data, for whatever reason (noisy and incomplete data, small sample size, over-fitting, etc.). By using anti-learning, you are trying to maximize the rate at which the model draws these wrong conclusions, so as to get the worst model possible. Then you simply invert the outcome, and voila!--you have your prediction.
@petergransee2937 · 8 years ago
+Peter Gransee - hmmm. I feel the XOR example and what happened with the medical data are substantively different. Apparently, they started with some signal using conventional methods (45% with the medical data and the straight line in the XOR example) but to further increase that by "reversing it" must mean you know when to reverse it and when not to. Not sure how that very helpful information came about without it peeking at the test corpus. The more it peeks at the test data the less you can call it a learning system.
@meinradrecheis895 · 6 years ago
thanks man, your explanation is a lot easier to understand than the professor's.
@davidelmkies6343 · 5 years ago
Great explanation, thanks. It seems weird trying to improve the algorithm because you'd be doing it in reverse.
@jeremyluna9956 · 8 years ago
My understanding is that in this unique case it is actually quite easy to make a function that plots the wrong answers. When you check data in this case, you are checking to see if it does not fit in the function.
@MrZebth · 8 years ago
Basically the machine was so bad at guessing that they had to start asking 'what don't you think it is?'
@2Cerealbox · 8 years ago
I wish he'd explained more what the solution actually entailed. And plus, I happen to understand what he means by separating categories by a line, but I don't think most viewers are really going to understand. They don't even really know how he was approaching the data. Strangely subpar content for this channel which is generally pretty high quality.
@Vulcapyro · 8 years ago
+Ryan N After this and the previous video, I think Prof Aickelin needs a bit of work on disseminating these complicated topics in simple manners. In many senses he seems to simplify too much to the point where the uninitiated won't recognize the meaning behind what he's saying due to expert knowledge, and those already knowledgeable don't find the explanation useful.
@2Cerealbox · 8 years ago
Vulcapyro Brady should be asking him better questions, too, I think. ML can be fairly technical, but you can also just start with basics, too.
@Vulcapyro · 8 years ago
Ryan N Brady doesn't do the Computerphile interviews, but sure.
@RodrigoVzq · 8 years ago
But he didn't explain it
@KennethSorling · 7 years ago
Am I the only one who shudders at the waste of paper? He's only drawing on every other sheet. And his examples are simple enough that you could fit many of them onto a single one.
@alanstanley2847 · 7 years ago
Kenneth Sörling Yes, you are the only one...
@OkSharkey · 6 years ago
not anymore!
@tomatensalat7420 · 8 years ago
I think I got the answer / what the new approach is, but not what he was doing before. The stuff with the medical data was presented with too little context for me to understand what he wanted to say there.
@ArbitraryDoom · 8 years ago
+oggi mog Yeah I found it a bit confusing too, but I think he was talking about how different medical data has different criteria for good and bad scores. Like any amount of lead is bad and gets progressively worse the more you have (which is easy for computers to deal with), while things like blood pressure are best in the middle (which is more like a xor problem). I can't think of anything that is bad in the middle, but I am not a doctor.
@DoisKoh · 8 years ago
+ArbitraryDoom All that doesn't quite matter, I think. I believe they were trying to use some kind of linear classifier, and all that talk was probably just to explain how many variables (200+ !!!) they had and how their relationships weren't all straightforward, making the problem extremely complex. The deal was that their classes were not linearly separable, and he managed to use this new "anti-learning" technique to quickly solve the problem.

If you're not familiar with the terms: you can imagine linear classification as trying to draw a line between groups of data to decide how to classify them. Example: you have some data, let's say the heights of a bunch of people. You know that they are all either basketball players or ballet dancers. You put their heights on a graph (this graph has just 1 axis because there is 1 variable/dimension) and find the point on the graph such that, when you cross it, you get the majority (or all) of one of the classes of people on that side. Let's say this point is 175cm: more than 175cm = basketball player; less than or equal to 175cm = ballet dancer. There might be some overlap, in which case your accuracy won't be 100% (e.g. you might have a ballet dancer that is 180cm tall and a basketball player that's 173cm tall, but overall you get the most accurate results by drawing the line at the 175cm mark). With machine learning, we can train computers (by giving them the values and the expected answers) to tell us where the line is.

He didn't say specifically what his problem was, but it had to do with colon cancer and 500+ patients, each with 200+ variables (imagine a graph with 200 dimensions). After applying his anti-learning technique, they got it from 45% to 70+%.
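The height example in this comment can be written out as a toy 1-D threshold classifier (the heights below are invented for illustration, not real data):

```python
# A 1-D linear classifier: one threshold splits the two classes.
# Overlapping cases (a tall dancer, a short player) cap the accuracy.

heights = [(160, "dancer"), (168, "dancer"), (173, "player"),
           (174, "dancer"), (180, "dancer"), (190, "player"),
           (198, "player"), (201, "player")]

def classify(height, threshold=175):
    return "player" if height > threshold else "dancer"

correct = sum(classify(h) == label for h, label in heights)
accuracy = correct / len(heights)
assert accuracy == 0.75   # the two overlap cases keep it below 100%
```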
@GingerJack. · 8 years ago
So kinda like how the choice between doors on numberphile where you reject the first door you pick and pick a different one.
@xybersurfer · 8 years ago
+StraightJacketRED i don't think it's related to the Monty Hall problem
@TechyBen · 8 years ago
+StraightJacketRED Perhaps. In that both are statistical and probability based. Though this one is more statistical, and less about probabilities.
@Paboty · 8 years ago
+StraightJacketRED I don't think they are that similar. On the Monty Hall problem, the "switch" is offered after one door is discarded. This updates your statistical information. That doesn't happen on this method, as far as I can tell.
@simargl2454 · 8 years ago
the amount of paper this guy wastes...
@brokenwave6125 · 8 years ago
I've written a few things in this large piece of paper. Next!
@matthewludivico1714 · 5 years ago
Kind of ironic...computer professor doesn't explain concepts with digital slideshow
@jaredmulconry · 8 years ago
This video is enough to get me very intrigued about how this solution works. I'm hoping another video is in the works that takes a deep dive into how anti-learning works and how it applies to understanding this data.
@joealias2594 · 8 years ago
I was trying to understand how the XOR example applies to what he's saying, and I don't really get it. If you can draw a line that separates the goods and the bads, then you've solved it: anything above the line is good, below is bad. But you can't do that for XOR. So, he drew a line through the middle. This solution is "wrong" in that it's not a good solution, but not in that it's always wrong. If it were always wrong, you could do what he said. It seems like what he's done, based on that visual, is found a "solution" that is right 50% of the time and wrong 50% of the time. So, reversing it is the same as not reversing it. That's how I understood his example to apply to what he was saying. Is this not correct?
@RandomNUser · 8 years ago
We need more info on this, maybe an example with more variables, or an explanation on why the machine gets the wrong answer with xor. Please do more Computerphile!
@ButzPunk · 8 years ago
I might just be being dumb, but I don't really understand what he meant by reversing the wrong answer. If the wrong answer is a horizontal line, what's the reverse of that?
@dmaust1 · 8 years ago
I would be curious how "anti-learning" works precisely, and how well it generalizes when compared to adding quadratic interaction terms or using a simple three-layer neural network.
@theslavegamer · 8 years ago
If we use that graph as an analogue for the XOR problem, can we not make the graph 3D? As in, the values that are the same are transformed into a 1 value on the z-axis, and values that have a null result transform into a -1 z-axis result, and you can simply draw a "line" (or plane, I suppose) between the two values? I'm not really sure what I am talking about, but I am just thinking from a practical standpoint.
@AltainiaInfinity · 8 years ago
If you swap one of the inputs on the XOR problem, it becomes easy to draw a line through them
@dbsirius · 8 years ago
This is simple. Data doesn't always fit in a particular model (categorization), so by process of elimination, you find out which model/relationship remains that best fits your data. Anti-learning is essentially knowing you'll get an answer wrong, but using that answer as part of an overall strategy towards finding the solution.
@Smittel · 8 years ago
To separate those 2 you could use the geometrical distance to (0|0). (1|1) is ~1.41 units away from (0|0), so you could say everyone whose distance from (0|0) is between 0.8 and 1.2 is healthy... This should work no matter how many dimensions you have
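The distance idea above does work for the 2-D XOR corners (though, contrary to the comment's last claim, a fixed distance band does not extend to parity in higher dimensions, where e.g. (1,1,1) is an "odd" point at distance ~1.73). A minimal check:

```python
# Classify XOR points by Euclidean distance from the origin:
# the "good" cases (0,1) and (1,0) sit at distance 1, while
# (0,0) and (1,1) sit at distances 0 and ~1.41.
import math

def healthy(point):
    d = math.dist(point, (0,) * len(point))
    return 0.8 < d < 1.2

assert healthy((0, 1)) and healthy((1, 0))
assert not healthy((0, 0)) and not healthy((1, 1))
```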
@thomasrichardson5425 · 8 years ago
Is 500 a good set size for machine learning? I thought they needed a bit larger, but I could be wrong
@ezadviper · 8 years ago
Well, I didn't understand anything, but I still watched it
@davidalearmonth · 8 years ago
Have you tried PCA and PLS? (though it is still tricky once you've made something a mixed integer problem. better with the raw data.)
@perlindholm4129 · 8 years ago
I'm just wondering. Can a computer do something like "similar problem recognition"?
@BigDmitry · 8 years ago
As far as I know, a neural net with one hidden layer can solve the xor problem, and you can build a deeper (or wider?) net to generalize it to many dimensions. Did they try neural nets?
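Yes, one hidden layer suffices. A minimal sketch (weights chosen by hand rather than trained, to keep it short):

```python
# A fixed-weight network with one hidden layer of two units
# computes XOR exactly: (a OR b) AND NOT (a AND b).

def step(x):
    return 1 if x > 0 else 0

def xor_net(a, b):
    h1 = step(a + b - 0.5)      # fires like OR
    h2 = step(a + b - 1.5)      # fires like AND
    return step(h1 - h2 - 0.5)  # OR and not-AND = XOR

assert [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```

Training would find equivalent weights by gradient descent; the point is just that the hidden layer gives the network the non-linear decision boundary a single line cannot provide.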
@miri64 · 8 years ago
How do they "reverse" it? The REVERSAL of XOR is XOR.
@Tomwithnonumbers · 8 years ago
+Martine Lenders It'd be fun to learn more about that, but it must be complicated. If it's discrete data, they're trying to find a function that has the fewest points sharing a group with each other that shared a group in the original function, and the most points sharing groups with each other that didn't share a group in the original function. If the learning for the XOR function was that (0,1) and (0,0) were in one group and (1,0) and (1,1) in the other, then the groupings (0,0),(1,1) and (1,0),(0,1) satisfy those criteria. But it must be crazy hard to do that in a real example; even in the simple one you could just as easily choose (0,0),(1,0) and (0,1),(1,1)
@silberwolfSR71 · 8 years ago
+Martine Lenders The way I understood it was that they didn't reverse the XOR operator. They just let XOR do its thing and, knowing it would mess up most of the time, just reversed the answer. So if XOR said something like "patient is healthy", the (usually) correct answer would be the opposite, so "patient is ill". Note: as was mentioned in the video, this is an oversimplification of the problem. But the basic idea is: the algorithm that they used would get it wrong more often than not, so they just adapted to this. "The algorithm says it's this thing (and we know it's usually wrong), so the right answer is probably the other thing." This method only worked so well because they were dividing patients into 2 groups. If they had 7 groups instead, it wouldn't help much to know that group 4 isn't the correct answer.
@UnashamedlyHentai · 8 years ago
+silberwolfSR71 If it was as simple as switching the answer, then their 45% hit rate would have switched to a 55% hit rate, but it flipped to 75-80%. There's more to it than simply flipping the answer.
@rich1051414 · 8 years ago
+UnashamedlyHentai No, the additional 5% gained from inverting the answer from the faulty algorithm contributed to them increasing the overall predictability to 75-80%. Any algorithm that boosts the accuracy by 5% would be amazing for them, I am sure, so the fact that they used a broken one to achieve such an accuracy boost by exploiting how poor it was makes for an interesting story :)
@jmmckk · 8 years ago
+Martine Lenders He is not reversing "XOR." The data set reflects XOR characteristics. The Machine Learning has a hard time learning data sets of XOR characteristics - especially clustering with KNN. "Hard time" refers to predictions results of
@ammobake · 7 years ago
If you included a third dimension of time (for example, from the medical data mentioned in the video) you might be able to program a parabolic arc to include the functions you are looking to compute?
@smileyball · 8 years ago
Would pushing the parameter space into a higher dimension have helped? Either rbf-svm or a sufficiently large neural net? Or does the existence of a 200-dimension XOR problem seriously mess things up?
@maxsnts · 8 years ago
Wouldn't it be easier to renumber the sets? Whenever the best values are in the middle, number them like 1,3,4,2? And when the good values are in the extremes, number them 3,1,2,4? What am I missing in the problem?
@AdamDitchfield · 8 years ago
"reverse it"? What? The inverse of the function is also unsuitable, this doesn't really explain anything.
@DoronKliger · 8 years ago
have you tried to square the explanatory variables?
@nenharma82 · 8 years ago
Isn't that what kernel functions would be used for in support vector machines, to separate the data in a higher dimensional space? When do you actually have linearly separable data in real world examples? Or am I missing something here..?
@nenharma82 · 8 years ago
Transform your data where (0,0) and (1,1) are bad and (0,1) and (1,0) are good into the 3rd dimension, e.g. by (x1,x2) -> (x1^2,sqrt(x1*x2),x2^2) and you'll find the data to be much better separable. Although it doesn't work particularly well with this example to be honest ^^ Nevertheless, it's an intriguing technique!
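For the four XOR corners, this feature map actually works better than the comment suggests: in the lifted space the plane z1 - 2*z2 + z3 = 0.5 (my own choice of separating plane, not from the comment) cleanly splits the two classes:

```python
# Lift (x1, x2) -> (x1^2, sqrt(x1*x2), x2^2), then apply a linear
# decision rule in the 3-D space. The XOR corners map to
# (0,0)->(0,0,0), (0,1)->(0,0,1), (1,0)->(1,0,0), (1,1)->(1,1,1),
# and z1 - 2*z2 + z3 is 1 for the "good" points, 0 for the "bad" ones.
import math

def lift(x1, x2):
    return (x1**2, math.sqrt(x1 * x2), x2**2)

def classify(x1, x2):
    z1, z2, z3 = lift(x1, x2)
    return "good" if z1 - 2 * z2 + z3 > 0.5 else "bad"

assert classify(0, 1) == "good" and classify(1, 0) == "good"
assert classify(0, 0) == "bad" and classify(1, 1) == "bad"
```

This is the idea behind kernel methods: make the data linearly separable by mapping it into a higher-dimensional space.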
@G.Aaron.Fisher · 8 years ago
There's seven and a half minutes of video explaining a problem, leading into a 10-second overview of how it was solved. It is still completely unclear to me what this video is trying to explain. Was there more to this interview that was edited out for some reason?
@mrhappy192 · 8 years ago
Ok, here's my understanding: Say xY and Xy mean healthy patients. (Capitalization represents value) Machine learning will tend to group similar x and similar y values together, so we'll get groups like these: (xy, xY), (Xy, XY) This grouping doesn't mean anything, but if we "reverse" it, the grouping becomes: (xy, XY), (Xy, xY) which is meaningful. Another way I understood is, if you got a function with binary output that has a 30% success rate, reversing the output will produce a 70% success rate.
@mrhappy192 · 8 years ago
It's really unclear though. You guys should make another video on this I think.
@rich1051414 · 8 years ago
+MrHappy We need to understand exactly what these 4 hypothetical groups are and why they are sorted into them to fully understand. Too much missing information. And I do not understand why it's so clever; we have XNOR to describe the inverse of XOR.
@TechyBen · 8 years ago
+Richard Smith Not really. It's a statistical problem. The context is part of the problem, but the weighting is the other part. The computer incorrectly sees lots of groups it has no idea how to separate - the "can you draw a line to divide this" example. Once we tell the computer it needs to do the opposite, it needs to find the samples where it cannot divide them up and group those, and then it can get the correct results.
@adamcrume · 8 years ago
XOR-looking data is one reason why I used neural nets instead of decision trees for my research problem. Decision trees and XOR data do not mix.
@yandodov · 7 years ago
I guess what he meant to say by "reversing" is that they used machine learning, but for the wrong answers since they are easier to find. This way, with enough of them, you will eventually find the right answers.
@silkwesir1444 · 6 years ago
THAT would make a lot of sense, thank you. But I'm still not completely convinced that that's what he meant.
@7Tijntje · 8 years ago
100 - 45 = 78. Thanks professor Uwe Aickelin
@MrHatoi · 4 years ago
I think they intentionally trained it to be wrong even more often; 45% was the failure rate when it was trained to get it right, then it went down to 22% when they retrained it to get it wrong.
@HeavyDwarf · 8 years ago
Best accent!!! He should make an audiobook!! :D
@NoshNosher · 8 years ago
+Ars Philo not a fan tbh
@pcfreak1992 · 8 years ago
+Ars Philo made in germany hahaha :D
@NoshNosher · 8 years ago
Celrador It definitely sounds like a mix of an English accent and a German one, and I generally don't like German accents (I'm German).
@HeavyDwarf · 8 years ago
Me too ;) ... sounds more like Swedish or Danish
@Celrador · 8 years ago
Ars Philo Google his name. You'll find a Facebook post of his where he rants about Frankfurt airport. I'd say he's German. (And because it makes stalking like this so easy, among various other reasons, Facebook is just awful. :P)
@jmmckk · 8 years ago
How would one apply anti-learning to a data set with more than two labels? In other words, in cases where the prediction is not simply "true" or "false". Let's say a data set has four labels (1,2,3,4), and you supply test data which yields the following prediction in ranked order: "2, 1, 3, 4" (2 being the closest match, 4 being the furthest match). The learning approach would yield "2" as the respective prediction. When applying the reverse method to this same prediction ranking (2, 1, 3, 4), would you conclude that "4" would be the respective anti-learning counterpart, since it's the prediction deemed to be furthest away from the test data, or is this claim flawed? From the binary anti-learning example we can assume that proximity doesn't necessarily mean similarity. Surely "1", "3", or "4" are closer to the actual test label than "2", given that the training data set reflects XOR characteristics.
@jmmckk · 8 years ago
+Jonathan McKenzie Perhaps we can contrive an additional data set from the 1,2,3,4 data set with XOR characteristics, learning the proximity that yields the correct predictions - since clustering implies closer proximity equates to higher similarity. I think all of these fixes to an originally weak-predicting machine learning algorithm are "boosting", whereas "anti-learning" is a subset of "boosting".
@TechyBen · 8 years ago
+Jonathan McKenzie I read a paper where they used random data, and checked random predictions against it, to give a "null" hypothesis or known "wrong answer". Then they tested the theories and predictions against the random data, and the real observed data. This let them know which theories were "as bad as random garbage" and which ones were not "accidental correlation". Could we do the same with the learning AI?
@jmmckk · 8 years ago
+TechyBen Perhaps a set of data is truly random where there are no correlations/patterns or behavior that can be learned. This would yield random predictions, the system has learned to emulate randomness of the domain provided by the training set. On another hand, maybe the use of various machine learning algorithms will tell you what there is to be learned about the data set - if there is anything to learn at all; this is probably more applicable to Unsupervised Learning. I believe accidental correlation will rule itself out so long as you have a sufficient amount of data (or other testing techniques, use different subsets of the training data as test data) Using the difference between a random prediction generator and the results of your chosen learning algorithms is indeed a legitimate benchmark. Just like they test students on multiple choice exams, if in fact the student performs no greater than randomly bubbling in the answers, one can conclude the student learned nothing at all.
@Vulcapyro · 8 years ago
+TechyBen The error in randomly assigning classes is trivially just 1 - 1/C for C classes. You generally don't need any special method to see how good/bad a classifier is. Comparing against random guessing is also usually a pretty awful benchmark to begin with.
@Koseiku · 8 years ago
Ah, we had a similar problem in our remote sensing class. Basically, a computer can't categorize data the way we do. For example, in pictures a computer can use different algorithms to sample pixels and put them into categories to say that area x resembles an apple. To do that, the computer looks at the RGB values of pixels, the number of pixels, their positions etc. But for complex problems you can't use such simple methods for making said categories. It is easy, though, to let a computer find out what doesn't work as a sampling method.
@Nguroa · 8 years ago
Nice, I'm happy I've found this channel. One thing that did surprise me was that there is still dot-matrix printer paper in the world. ;)
@adamkowalczyk6103 · 5 years ago
I like the video, well done. Looking at some comments below: it is not easy to explain in words concepts that are naturally captured by a few algebraic equations. The idea of anti-learning is clearly stated in Uwe's presentation. However, it interferes with attempts by listeners, perhaps subconscious, to make a simple analytical model of it. Even the simplest algebraic equation is a mess in words, until you have a proper model in your mind. Personally, I would like to see in addition yet another, 2-dimensional version of XOR which I used in my explanations of anti-learning. Maybe it is time for another YouTube video. In more general terms, I see here a need for formal maths which we cannot escape. At this moment I appreciate the legacy of René Descartes, who introduced co-ordinates to geometry, so we can now perform very complicated derivations on space and time, the universe, etc. Before that, geometry was really hard to deal with and reduced in its scope.
@xbreak64 · 8 years ago
Haven't Support Vector Machines solved the issue of linearly separating such a data set?
@michaelvarney. · 8 years ago
Aren't back propagation neural networks pretty robust for an XOR/XNOR type classification problem?
@hrmt_anon · 8 years ago
So what did he actually do there?
@jaredhouston4223 · 3 years ago
You learned how to apply derivatives to xor?
@davidwilkie9551 · 8 years ago
(?) a transistor is stop or go, and a neuron can do that or sort into categories up to a full reflection. So if you set up a circuit to sort data like a set of partial mirrors, would that do this job? ...Inspiration from organic process that cycles absorb/exclude over and over.
@ricanteja · 8 years ago
This is awesome!
@TechyBen · 8 years ago
I've read of this being done in other experiments and to test predictions. Example: We do not know the right answer, or how to get there. But we do know it's not "random" data. So we generate a set of random data and random predictions. Now we try the scientists or the AI/learning code predictions against our random data. We know if anyone matches the random data, they are wrong. So those who do not, are possibly right. We still need to test the surviving theories and predictions against real data and real answers later, but it helps find a place to begin when we are almost completely clueless. :P
@ilkero1067 · 2 years ago
Please tell us more
@joshuawade1495 · 8 years ago
I could see how SVM might fail in the example he gave because it tries to linearly separate the data, but wouldn't a simple decision tree be able to learn data grouped in that way? Of course I realize the 2D example is illustrative and that his actual dataset is higher dimensional.
@johnvonhorn2942 · 8 years ago
Sir James Goldsmith would use this technique back in the 70's. He'd phone up a few stock brokers and ask them their opinion. If they all said, "Sell, sell! Get out!" he'd buy, and if they all shouted, "Buy, buy! It's going through the roof!" he'd sell.
@NoriMori1992 · 8 years ago
That's hilarious.
@robinvik1 · 8 years ago
What?
@DeusExAstra · 8 years ago
Yeah this video needs a second part
@mrosskne · 1 year ago
so what is it?
@TheOmildlyOinformed · 2 years ago
Couldn't regression analysis solve this problem? You would need to create additional composite variables of some form, but it doesn't seem like a 2-year problem.
@michaelwiesinger3684 · 8 years ago
why didn't they use an artificial neural network? isn't solving this XOR problem, without the problem of the "curse of dimensionality", a classic task for a NN?
@unvergebeneid · 8 years ago
+Michael Wiesinger What little I gather from this video is that they did use a learning algorithm that can separate these things (a kernel SVM would do just as well) but they didn't have enough data to learn these complex relationships. And then they made their classifier extra bad and just inverted the classification. How they did this or why this was easier to learn ... I have no fucking clue. But it's not like he explained it in the video so I personally don't feel that bad about it.
@judgeomega · 8 years ago
+Michael Wiesinger NNs are exactly what he was talking about. Under the hood, NNs use lines to categorize points into groups. We can get more fancy by using Lagrangians to try to examine the data from a higher dimension, but even then it's still just lines.
@beegieb4210 · 8 years ago
+judgeomega Not really. NNs learn arbitrarily complex functions. It's been proven that neural networks can approximate any function.
@beegieb4210 · 8 years ago
+Michael Wiesinger If I were to take a guess, it would be from the fact that the dataset is very small. 500 training examples for a neural network will likely lead to some serious overfitting. Neural Networks (and by extension Deep Learning), are best applied to large datasets with many examples.
@freefood89 · 8 years ago
+BeegieB Doesn't cross validation minimize overfitting? I hope they had more than 500 data points if their feature vectors had 200 dimensions. Also, why are they limited to classifying into 2 outcomes? Perhaps there are latent variables? This really is an unsatisfying video
@PullerzCoD · 8 years ago
Very interesting indeed.
@tilago · 8 years ago
Interesting, sort of reminds me of finding the values that the denominators cannot be in algebra.
@Yeraus · 8 years ago
Interesting, I'd like to see an application of this though, how can one just reverse a false answer to get the right answer instead of another false one?
@murphy54000 · 8 years ago
+Yeraus By asking a question with only two answers.
@Yeraus · 8 years ago
***** Yeah, but that applies to that easy example only. The initial problem has way more possible answers.
@murphy54000 · 8 years ago
Yeraus It had many more variables, not many more answers. They ask, "Is this patient healthy?" and then reverse the answer the machine gives them, because it's more likely to be correct. They're not asking "How many milliliters of what drug will cure this patient?" and that's the key difference.
@stefanlippeck6230 · 8 years ago
Not too sure what a "learning function" is, but about one fourth into the video my little inner programmer started screaming "fuzzy logic" so loud I couldn't quite understand the rest of the problem.
@juliusfucik4011 · 4 years ago
This is exactly the stuff neural nets were made to solve. I think even a support vector machine would do it. Or even simpler, a decision tree.
@andremelo5465 · 7 years ago
He could have used the blank side of the paper for that simple graphic... or divided a single side by 4, that works too...
@bepstein111 · 4 years ago
"If it's wrong, just reverse it!. There can only be two possible options!"
@lucidmoses · 8 years ago
If you add rolled and folded versions of the input data as inputs, wouldn't standard machine learning deal with the XOR problem?
@karatsurba4791 · 3 years ago
I can see its use in binary classification problems. Hard to see how it could be useful in a non-binary classification problem. Any thoughts?
@AP07H30515 · 8 years ago
My understanding is that when you know you are wrong, the opposite is more likely to be true.
@andrewwew · 8 years ago
amazing!
@PongoXBongo · 7 years ago
Wouldn't it be more efficient to use pattern recognition on data sets? Split the cases into A-B groups, "healthy" and "not healthy", have humans categorize them, then have the machine look for commonalities. Then split each group again, re-run the algorithms, split again, re-run, etc. That way you'd end up with a machine that can not only do the initial A-B grouping, but also potentially tease out new patterns that the humans didn't notice before. Sounds like they were trying to have the machine do all the work unaided.
@JimFortune · 8 years ago
Maybe I'm oversimplifying, but why not just NOT one of the inputs?
@michaeldemertzi5973 · 8 years ago
+Jim Fortune you can, but I think the argument is there are around a hundred input features so you would need to add (N choose 2) 'NOT' features. Plus the XOR relationships aren't clean and binary like the toy example but much messier
@sator666666 · 8 years ago
Great! So we should take abs(x-y) as input also.
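That engineered feature does the trick on the toy binary version: with z = |x - y|, a single threshold separates XOR (a quick editorial sketch, not from the video; the real 200-dimensional data would of course be messier):

```python
# The four XOR cases as ((x1, x2), label) pairs.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def predict(x1, x2):
    z = abs(x1 - x2)            # engineered feature suggested above
    return 1 if z > 0.5 else 0  # one linear threshold on z now suffices

acc = sum(predict(x1, x2) == y for (x1, x2), y in XOR) / len(XOR)
print(acc)  # 1.0
```

The abs-difference feature folds the two "same inputs" corners onto z = 0 and the two "different inputs" corners onto z = 1, so the classes become linearly separable in the new coordinate.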
@EamonBurke · 8 years ago
The computer is trying to sort into a binary group: healthy or not. The little bar graphs had a y axis that represented "good health" so to speak, which is actually what they need the computer to learn to determine. Given that there is a desire for a "happy medium" or a "happy extreme" on some values, applying a binary decision will be correct on binary problems, and slightly more often than not, WRONG ON QUESTIONS WHERE 1 OUT OF THREE POSSIBILITIES IS CORRECT. Basically, since the computer is drawing a binary line by educated guess through a tertiary problem, it will slightly more often be wrong. So if you simply reverse the RESULT (not the process), the odds of getting the right answer go up, and you even benefit from an educated guess. So, +Computerphile, correct me if I'm wrong, but this is a bit like the Monty Hall Problem.
@Lvvcassss · 8 years ago
Pat, The NES Punk 20 years later :D
@kyoung21b · 8 years ago
I've got to go with the commenters who feel that something critical was left out. Look at the XOR problem in a little more detail to see that (e.g. he's recreating the old proof that a single layer neural network is limited in what it can learn, i.e. given that it can only generate linear boundaries it can't learn XOR). Label the edges of a square A,B,C,D. To get XOR right, the right grouping would be 2 groups containing the opposite edges, i.e. (A,C), (B,D). But there are 2 other ways to group the edges corresponding to separating the edges with a linear boundary. So the linear boundary algorithm will learn one of the "wrong" groupings (i.e. either (A,B),(C,D) or (A,D),(B,C)). Given that, it wouldn't appear that it's obvious that the proper alternative is (A,C),(B,D) rather than the other of the "wrong" linear boundary groupings. So it's not clear at all from the video what "just reversing it" means. Perhaps there's some way to exhaust the wrong choices and pick an alternative grouping but that would seem to lead to a combinatorial explosion for reasonable amounts of data so that doesn't seem likely to be the answer.
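The linear-boundary limitation described above can be checked by brute force: this toy Python sketch (an editorial addition, restricted to an integer-weight grid for simplicity) searches for a line that classifies all four XOR points and never finds one:

```python
# The four XOR cases as ((x1, x2), label) pairs.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def accuracy(w1, w2, b):
    """Fraction of XOR points the linear rule w1*x1 + w2*x2 + b > 0 gets right."""
    correct = sum((1 if w1 * x1 + w2 * x2 + b > 0 else 0) == y
                  for (x1, x2), y in XOR)
    return correct / len(XOR)

# Grid-search integer-weight lines: the best any line manages is 3 of 4.
best = max(accuracy(w1, w2, b)
           for w1 in range(-5, 6)
           for w2 in range(-5, 6)
           for b in range(-5, 6))
print(best)  # 0.75
```

The grid only samples lines, but the 3-of-4 ceiling holds for any line: requiring the (0,0) and (1,1) corners below the boundary and (0,1), (1,0) above it forces contradictory inequalities.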
@onlainari · 8 years ago
He's so happy about all this it's funny.
@ddjanji · 8 years ago
that was definitely a lot of paper
@HairyPixels · 8 years ago
Well, I've failed to anti-learn.
@BlazeCyndaquil · 8 years ago
Wait, why wouldn't you just use something like k-nearest neighbor for this data? I can almost guarantee that k-nearest neighbor will do better than 45% on this kind of data. And the way that the med student seemed to describe it, I would think a decision tree would work quite well also. Are we just limited to SVMs for some reason?
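For what it's worth, a from-scratch k-nearest-neighbour sketch does handle a noisy toy XOR cloud well (an editorial illustration with made-up synthetic data, stdlib only; nothing like the real 200-dimensional set):

```python
import random

def make_xor_data(n, noise=0.1, seed=0):
    """Points scattered around the four XOR corners, labelled by a XOR b."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a, b = rng.randrange(2), rng.randrange(2)
        x = (a + rng.gauss(0, noise), b + rng.gauss(0, noise))
        data.append((x, a ^ b))
    return data

def knn_predict(train, point, k=3):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(train, key=lambda item:
                     (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2)
    votes = [label for _, label in nearest[:k]]
    return max(set(votes), key=votes.count)

train = make_xor_data(200, seed=0)
test = make_xor_data(50, seed=1)
acc = sum(knn_predict(train, x) == y for x, y in test) / len(test)
print(acc)  # well above chance on this toy cloud
```

Because k-NN only looks at local neighbourhoods, it never needs a single global separating line, which is exactly why it sidesteps the XOR issue on a toy set like this.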
@rich1051414 · 8 years ago
+BlazeCyndaquil the point is 45% is actually 55%, if you invert the answers.
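The inversion arithmetic is easy to demonstrate: for a binary classifier, flipped accuracy is exactly 1 minus the original accuracy. A toy Python simulation with a made-up 45%-correct classifier (an editorial sketch, not the paper's method):

```python
import random

rng = random.Random(42)
truth = [rng.randrange(2) for _ in range(10_000)]
# A hypothetical classifier that is right only 45% of the time:
preds = [t if rng.random() < 0.45 else 1 - t for t in truth]

acc = sum(p == t for p, t in zip(preds, truth)) / len(truth)
flipped_acc = sum((1 - p) == t for p, t in zip(preds, truth)) / len(truth)
print(round(acc, 3), round(flipped_acc, 3))  # roughly 0.45 vs 0.55
```

This only explains getting from 45% to 55%, though; how the researchers pushed the un-flipped error far enough below chance to reach ~80% after flipping is the part the video leaves out.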
@BlazeCyndaquil · 8 years ago
Richard Smith That's not what he said at all. That explanation would make sense, but he said they were getting around 80% from that. I just don't see getting that good of performance on an SVM without some kind of special kernel. I assume he's using an SVM, (or maybe logistic regression) because he's talking exclusively about drawing a single line between classes. Anyways, my point is that they would probably be able to get really good performance using k-nearest neighbor or decision trees, since the way he described their approach doesn't seem like the most natural way to try to separate the data.
@unvergebeneid · 8 years ago
+Richard Smith But that's not exactly the rocket science part. Apparently the rocket science part is how they got the classification rate way below 45% but it's not explained how they did that.
@SCPlayerTwo · 8 years ago
He's so excited :D
@robertochen8672 · 8 years ago
It would be awesome if the video had subtitles
@PKMartin · 8 years ago
This video seems to spend too much time on the initial cardinality bit (sometimes high X is good, sometimes low X is good, sometimes medium/extreme X is good) which doesn't seem to connect to the eventual solution, and then just gets started on the explanation of why certain classes of problem are hard for machine learning (because clustering doesn't work for XOR) and then.... stops. It really needs more time devoted to explanation of what "it didn't work so we flipped it and now it works" means.
@ceputza · 8 years ago
I did not understand anything. Was there a point?
@SuperJimmyChanga · 8 years ago
Explanation: "I've learned a simple way of choosing an answer to a complicated problem involving 2 choices. I know that I'm usually wrong more often than right. So, I'll choose the opposite answer to the one I would normally pick. Now I'm right more often than not."
@morphman86 · 8 years ago
So basically, it's like looking for your wallet. You know where to look to not find it, so you look in the places where it could be instead. When you've excluded all places where it isn't, the only remaining place must be where it is.
@Wouldntyouliketoknow2 · 8 years ago
I'm not a data scientist... it sounded like it was harder for them to weave a line between all 200 dimensions of data points when they were looking to group patients into 2 groups. I think the solution they went with was to instead just concentrate on trying to produce the worst result... and then just flip that. It's basically like saying: we want to flip a biased coin and produce heads. If we know the coin is biased to produce tails 80% of the time, then all we need to do is toss the coin, then flip it once = 80% chance of heads. The thing I don't understand is why it is easier to classify the data wrongly, i.e. why it is easier to wrongly conclude a sick patient is healthy, or vice versa, when looking at those 200 data points (and get that wrong answer 80% of the time) than finding the right answer 80% of the time without the "flipping". Perhaps that's where I need to learn maths ;)
@itscomingoutofbothends8385 · 8 years ago
I'd like to see a HowToBasic version of this video.
@bqfilms · 8 years ago
lmao when he briefly explained it I laughed like you laugh at a bad joke, it's so bad it's good!
@boumbh · 8 years ago
I didn’t get the resolution. "Reversed it"? You had an algorithm that was so bad it was worse than random, so you decided to do the opposite of what it was telling you? If the algorithm was 45% right, you could only get to 55% with this method... not 78%... Wait a minute, are you using the same methodology for computing your stats? Anyway, after hearing this, I wouldn’t feel very secure if I were one of your patients ^^ .
@klaxoncow · 8 years ago
Hmm, isn't "anti-learning" just another way of saying "learning from your mistakes"? I did this. It was completely wrong. So, in future, I'll aim to do the opposite of all my previous mistakes and, lo and behold, I'm achieving greater success than when I was getting it all wrong. All my former errors have "anti-learnt" me what I should actually be doing instead!
@9600bauds · 7 years ago
"It's so bad, it's bad, let's not do that"
@MaceOjala · 8 years ago
Great story about how research works in real life. Thanks, I appreciate it. I'm reading Rojas' "Neural Networks: A Systematic Introduction", and it discusses the XOR problem in the context of NNs, McCulloch-Pitts units and Perceptrons, and conceptualizes it with Minsky's research contributions, et cetera. It's a good book, check it out people!
@Xylankant · 8 years ago
wait, you're a professor in data science/machine learning, and you look at this data for 2 years and never even think about checking whether it looks like it isn't linearly separable? wth?
@jasondoe2596 · 8 years ago
My thoughts exactly!
@Vulcapyro · 8 years ago
+Philip John Gorinski The problem isn't whether it looks linearly separable as-is, the problem was (likely) that they couldn't find a sufficient method to transform the data representation space into something workable with data of such high dimensionality and complexity.
@Xylankant · 8 years ago
Vulcapyro possible, but from what he says in the video, once they knew it was inseparable, they knew they could do 'anti-learning'. Which, to be honest, I am not really sure what he means; I would have to read the paper. I think it's most likely due to the video being so short, this is way too superficial for such a complex topic. Another thing that is completely unclear to me: they have a dataset with 500 instances, and 200(!) variables with 4 outcomes. If that's really the case, I am not surprised they can't learn to classify too well...
@Vulcapyro · 8 years ago
Philip John Gorinski Yeah I'm just thinking that with how oddly oversimplified Aickelin's explanations are here, his description of his own research problem is also cursed to be incomprehensible, so I'm not even going to try taking it at face value in any way.
@wjrasmussen666 · 2 years ago
@@Vulcapyro It is overly simplified so any of us can understand it.