
#69 DR. THOMAS LUX - Interpolation of Sparse High-Dimensional Data [UNPLUGGED] 

Machine Learning Street Talk
136K subscribers
15K views

Today we are speaking with Dr. Thomas Lux, a research scientist at Meta in Silicon Valley.
In some sense, all of supervised machine learning can be framed through the lens of geometry. All training data exists as points in Euclidean space, and we want to predict the value of a function at new points. Neural networks appear to be the modus operandi these days for many domains of prediction. In that light, we might ask ourselves: what makes neural networks better than classical techniques like k-nearest neighbours from a geometric perspective? Our guest today has done research on exactly that problem, trying to define error bounds for approximations in terms of directions, distances, and derivatives.
The insights from Thomas's work point to why neural networks are so good at problems where everything else fails, like image recognition. The key is their ability to ignore parts of the input space, do nonlinear dimension reduction, and concentrate their approximation power on the important parts of the function.
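To make that geometric picture concrete, here is a minimal sketch (in JAX, written for this summary rather than taken from the talk or the paper; the function name knn_predict and the toy data are invented for illustration) of k-nearest-neighbour prediction as pure geometry: training data are points in Euclidean space, and the prediction at a query point is the average over its nearest neighbours. When the target depends on only two of a hundred input coordinates, the Euclidean distances are dominated by irrelevant directions, which is exactly the regime where a model that can ignore parts of the input space has the advantage described above.

```python
# Minimal sketch, assuming JAX; not code from the talk or the paper.
import jax
import jax.numpy as jnp

def knn_predict(x_query, x_train, y_train, k=5):
    """Predict f(x_query) as the mean of the k nearest training values."""
    dists = jnp.linalg.norm(x_train - x_query, axis=1)  # Euclidean distances to all training points
    nearest = jnp.argsort(dists)[:k]                    # indices of the k closest points
    return jnp.mean(y_train[nearest])

key = jax.random.PRNGKey(0)
d, n = 100, 2000                                        # 100 ambient dimensions, 2000 samples
x_train = jax.random.normal(key, (n, d))
y_train = jnp.sin(x_train[:, 0]) + x_train[:, 1] ** 2   # target depends on only 2 of the 100 dims

x_query = jax.random.normal(jax.random.PRNGKey(1), (d,))
print("k-NN estimate:", knn_predict(x_query, x_train, y_train))
print("true value:   ", jnp.sin(x_query[0]) + x_query[1] ** 2)
```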
Pod: anchor.fm/machinelearningstre...
Patreon: / mlst
Discord: / discord
[00:00:00] Intro to Show
[00:04:11] Intro to Thomas (Main show kick off)
[00:04:56] Interpolation of Sparse High-Dimensional Data
[00:12:19] Where does one place the basis functions to partition the space, the perennial question
[00:16:20] The sampling phenomenon -- where did all those dimensions come from?
[00:17:40] The placement of the MLP basis functions, they are not where you think they are
[00:23:15] NNs only extrapolate when given explicit priors to do so, CNNs in the translation domain
[00:25:31] Transformers extrapolate in the permutation domain
[00:28:26] NN priors work by creating space junk everywhere
[00:36:44] Are vector spaces the way to go? On discrete problems
[00:40:23] Activation functions
[00:45:57] What can we prove about NNs? Gradients without backprop
Deep learning on sets [Fabian Fuchs]
fabianfuchsml.github.io/learn...
Interpolation of Sparse High-Dimensional Data [Lux]
tchlux.github.io/papers/tchlu...
A Spline Theory of Deep Learning [Balestriero]
proceedings.mlr.press/v80/bal...
Gradients without Backpropagation ‘22
arxiv.org/pdf/2202.08587.pdf
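As a pointer for the last reference and the [00:45:57] chapter, here is a minimal sketch of the forward-gradient estimator from "Gradients without Backpropagation" (the helper name forward_gradient and the toy objective are invented here; JAX is assumed only for its forward-mode jvp). The idea: sample a random tangent direction v, compute the directional derivative ∇f(θ)·v with a single forward-mode pass, and scale v by it; in expectation this equals the true gradient, with no backward pass required.

```python
# Minimal sketch, assuming JAX; illustrative only.
import jax
import jax.numpy as jnp

def forward_gradient(f, theta, key):
    """Unbiased gradient estimate from one forward-mode pass (no backprop)."""
    v = jax.random.normal(key, theta.shape)       # random perturbation direction
    _, dir_deriv = jax.jvp(f, (theta,), (v,))     # directional derivative ∇f(theta)·v
    return dir_deriv * v                          # E[dir_deriv * v] = ∇f(theta)

f = lambda t: jnp.sum(jnp.sin(t) ** 2)            # toy objective
theta = jnp.array([0.3, -1.2, 2.0])

print("forward gradient:", forward_gradient(f, theta, jax.random.PRNGKey(0)))
print("exact gradient:  ", jax.grad(f)(theta))    # reference value (this one does use backprop)
```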

Published: 20 Jul 2024

Comments: 25
@priyamdey3298 2 years ago
I really wanted it to go on for another solid hour! When are we having round 2?!
@_tnk_ 2 years ago
Really enjoyed this episode! Interesting topics, and lots of insights
@paxdriver 2 years ago
Sounds like automatic differentiation is going on the list of talk subjects lol. Great episode MLST, I really got a lot of geometric insights into ML out of this one.
@tchlux 2 years ago
Nice. 😆 Happy to answer questions or talk details here in the comments everyone! Also I wrote a reddit post about topics related to this talk just now, feel free to interact there too. www.reddit.com/r/MachineLearning/comments/tcott9/d_is_it_possible_for_us_to_make_fixedsize/
@nathanwycoff4627 2 years ago
Thanks for taking the time! Browsing through the reddit thread, I love that the geniuses over there decided they knew more about the subject than you hahaha... Unfortunately I have seen this happen a couple times, so don't take it to heart :)
@tchlux 2 years ago
@@nathanwycoff4627 oh I'm not discouraged at all! I don't blame people for not immediately being convinced by my random reddit post (that might contradict their own expectations about ML). I actually think of that as good feedback, in the end it's all about trying to communicate as clearly and effectively as possible for me. 😊 Thanks for the support.
@Self-Duality 2 years ago
Extremely intriguing! Thanks for sharing!
@mobiusinversion 2 years ago
Another thing I’d like to share here is related to the idea of “truly making progress in AI”, and it dovetails with the discussion of current neural networks being “slightly conscious”. We’re missing two major paradigms altogether that are present in every known conscious being, notwithstanding panpsychism. Namely, continuous closed-loop autonomy of learning and predicting. We exist in a constant state of simultaneous perception, action, and learning with a purpose, whereas today’s function estimators are in a read-only state when fed, and must be fed manually by a user rather than on their own accord to pursue a natural objective. A closed loop is important for sentience, and today, feed-forward passes of any model are certainly not that. The second factor is plasticity upon stimulus. A major difference between human and artificial neurons is that human neurons alter connection weights when stimulated, not just when corrected or updated by targeting error. So, really we’re missing two entire paradigms!
@tchlux 2 years ago
Yeah definitely. My priorities are:
- solve general purpose prediction (with proofs), one algorithm for all modalities of input
- use that to build a good RL agent (context with predetermined objectives)
- research into how "objective setting" can be posed as a prediction problem (this is actually one of the harder problems that I think evolution has solved in the mammalian brain) and how to encourage it towards manifesting language-like communication patterns
@oncedidactic 2 years ago
This is what I mean when I say keep the technical depth. Good 👏 stuff 👏 I feel like y’all have been collapsing noisy dimensions down to key understanding over the last year or so, and we’re about to plateau on a shiny polished grokking that’ll be hard like diamond to crack further. Have to go looking for new gemstones after that 😅 Really neat hearing from Dr. Lux (megaman much?), this no nonsense mathematical approach is kicking butt.
@sabawalid 1 year ago
Very nice episode, as usual. I have a question for Thomas: I AGREE with his thesis that the continuous subsumes the discrete, and so in theory these models “contain” the solution and can, again in theory, get at the almost discrete solution. But who said this is easier than formulating the discrete solution in the first place? What I mean is this: carving out the discrete solution from the MASSIVE continuous hyperplane might be, LITERALLY, like looking for a needle in a haystack, where the haystack is the continuous manifold and the discrete solution is the needle (literally). In other words, the “search” for the discrete portion might be effectively non-computable, especially since for many discrete solutions we cannot accept any "fat" - the function is strict in the sense that it will not accept any "additional" approximations. So, the fact that the continuous “includes” the discrete should not give us any comfort. I hope I can get an answer on that.
@vinca43 2 years ago
I really enjoyed this, particularly because there was collegial debate throughout, rather than flat acceptance. Dr. Duggar specifically gave appropriate pushback. It reminds me of my grad school days and that was appreciated. A few thoughts:
1) If you want to learn from the best on Delaunay triangulations and combinatorial geometry more generally, please pick up Herbert Edelsbrunner's "Geometry and topology for mesh generation" or, if you have more time and mental bandwidth, "Algorithms in combinatorial geometry" (the latter crushed my mind on many occasions).
2) I realize Euclidean space is a good approximation of reality in many instances, but I think we should avoid framing problems in terms of any specific metric space for as long as possible, and should certainly not start with the assumption that we're in E^n. (Unfortunately, the Delaunay guarantees are only in Euclidean space, so I suppose if using this triangulation approach, it forces the Euclidean constraint.)
3) Dr. Lux made a comment (16:54) about how the real world is 3+1 dimensions, so the 1,000,000 dimensions of our function was not a result of the inherent dimensionality of our image space but a necessary tool for sampling theory. I disagree in part. The coordinate system of the world + time may be 4 dimensions, but that does not account for anything in the world, the many possible states, or the actions we can take in it. The data we're analyzing is based on sensing states of things in the world; it's not based on an empty coordinate system. My linear algebra professor once posed "How many dimensions is a tank?" I said "three", to which he replied, "Does that account for the angle or height of the barrel, whether the barrel is loaded or not, whether it's moving forward or backward, etc." That's just one thing in the world. 'Dimension' and '|basis| of our coordinate system' are not equivalent. I'll concede that our function's parameter space has natural redundancy and that plays into sampling theory in part, but the world, together with the stuff in it, in their many states, taking many possible actions, is way, way more than 4 dimensions.
4) Dr. Lux's comment about the order between positional and apositional learning was fascinating, and I'm glad Keith pushed him on this point, because I would have dismissed his seemingly arbitrary choice of architecture had he not explained further.
5) All of this made me think Stephane Mallat, who continues to dedicate time and research to understanding DNNs, would be a great guest.
6) The discussion (14:26) on data density in relation to the varying surface made me think of UMAP, which uses some clever Riemannian geometry to stitch together localized metrics. Perhaps Leland McInnes would be another great guest.
Really enjoying the UNPLUGGED format. Thanks!
@tchlux 2 years ago
Thanks for the feedback Chris! I'm excited to look at those Delaunay related references you provided. Some thoughts of mine in response to your points, hopefully explained more clearly:
2) The issue with avoiding a specific metric is that it makes the problem difficult to define. In contrast, I think that we should pick a metric that we think can work, and start defining things (data & their approximations) from there. In my mind picking R^n and the 2-norm should be good enough, let's start building robust theory off of that. I'm happy to change my approach, but we need to start somewhere. 🤷‍♂️
3) I also totally agree that there is more information than just a few dimensions. But at the same time, do we really think it's *millions*? Even *thousands*? The real trick is finding out how to reduce the dimension while maintaining "important" information. Defining "important" is hard problem number one (predicting outcomes?), then doing the dimension reduction is hard problem number two (neural networks?).
@vinca43 2 years ago
@@tchlux thanks for the thoughtful response.
2) My 2nd thought was actually tied to a comment Tim made in the introduction (0:36), where he said, "all training data exists as points in a Euclidean space..." I disagree with this as a starting assumption. We can model and approximate our data space this way, but it's not and should not be a given. With your work and the Delaunay guarantees residing in E^n, I agree that getting hands dirty is the right approach.
3) I will now have to spend the day trying to estimate the number of dimensions arising from objects and their actions in the real world. :) I suspect the number is larger than you think, and smaller than I think. Regardless, I agree that getting to what is "important" through dimension reduction is key (at least in the context of this discussion), and a good latent space is probably much closer to 4 dimensions than 10s, much less 1000s, given strong regularity properties of the function we're learning. Of course, dimension reduction always leaves stuff behind, and while those patterns we lose (and justify by maintaining some separation guarantee) may be lower order effects/insights, they speak to input data space dimensions that have been projected/contracted away. Hope that clarifies. Really looking forward to reading your paper!
@alexijohansen 2 years ago
Great episode!
@mobiusinversion 2 years ago
Additionally, data augmentation sometimes creates new invariances. For example, masking tricks represent exogenous intentions to induce invariances not contained in the original data in the first place.
@benjones1452 2 years ago
Thanks!
@nias2631 1 year ago
At 34:00, maybe I am mistaken but that sounds like a convolutional conditional neural processes model.
@ThomasCzerniawski 2 years ago
Bookmark for self 14:30. Optimal sampling strategy.
@youndukn 2 years ago
Please do the episode on forward gradient. It will be very interesting.
@MachineLearningStreetTalk 2 years ago
Good idea!
@mfpears 2 years ago
25:00 I think a transformer is a performance optimization on top of an RNN. Just my noob opinion.
30:00 LOL, well yes, every single node in a network is an abstraction. Technically. Abstraction is the essence of intelligence.
@machinelearningdojowithtim2898 2 years ago
First!
@stalinsampras 2 years ago
Come on now, I got here super fast 😅😂
@___Truth___ 2 years ago
3rd