
Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 2 - Neural Classifiers 

Stanford Online
648K subscribers
164K views

Published: 14 Oct 2024

Comments: 38
@AramR-m2w • 10 months ago
🎯 Key takeaways for quick navigation:
00:29 Today's lecture focuses on word vectors, touching on word senses, and introduces neural network classifiers, aiming to deepen understanding of word embedding papers like word2vec or GloVe.
01:52 The word2vec model, using a simple algorithm, learns word vectors by predicting surrounding words based on dot products between word vectors, achieving word similarity in a high-dimensional space.
03:15 Word2vec is a "bag of words" model, ignoring word order, but it still captures significant properties of words. Probabilities are often low (e.g., 0.01), and word similarity is achieved by placing similar words close together in a high-dimensional vector space.
06:31 Learning good word vectors involves gradient descent, updating parameters based on the gradient of the loss function. Stochastic gradient descent is preferred for its efficiency, especially on large corpora.
10:18 Stochastic gradient descent in word2vec estimates gradients from small batches of center words, enabling faster learning. The sparsity of the gradient information is addressed, and word vectors are often represented as row vectors.
15:21 Word2vec encompasses the skip-gram and continuous bag of words (CBOW) models. Negative sampling is introduced as a more efficient training method, using logistic regression to predict context words and reducing the computational load of the softmax.
20:57 Negative sampling creates noise pairs to train binary logistic regression models efficiently. The unigram distribution raised to the 3/4 power is used to sample words, mitigating the gap between common and rare words.
23:40 Co-occurrence matrices, an alternative to word2vec, represent word relationships based on word counts in context windows. The matrix can serve as a word vector representation, capturing word similarity and usage patterns.
28:23 When sampling negative words, using 10-15 negative words gives more stable results than just one. This helps capture different parts of the space and improves learning.
30:46 Co-occurrence matrices can be built using a window around the word (similar to word2vec) or by considering entire documents. However, these matrices are large and sparse, leading to noisier results, so low-dimensional vectors (25-1,000 dimensions) are preferred.
32:42 Singular Value Decomposition (SVD) is used to reduce the dimensionality of count co-occurrence vectors. By deleting the smaller singular values, lower-dimensional representations of words are obtained, capturing the important information efficiently.
35:54 Scaling the counts in the cells of the co-occurrence matrix addresses issues with extremely frequent words. Techniques like taking the log of counts or capping the maximum count improve word vectors obtained through SVD.
37:52 The GloVe algorithm, developed in 2014, unifies linear-algebra-based methods (like LSA and COALS) with neural models (like skip-gram and CBOW). GloVe uses a log-bilinear model to approximate the log of co-occurrence probabilities, aiming for efficient training and meaningful word vectors.
43:29 GloVe introduces an explicit loss function, ensuring the dot product of word vectors approximates the log of co-occurrence probabilities. The model keeps very common words from dominating and trains efficiently, scaling to large corpora.
51:50 Intrinsic evaluation of word vectors, such as word analogies, demonstrates the effectiveness of models. GloVe's linear component property aids in solving analogies, and its performance benefits from diverse data sources, like Wikipedia.
56:34 Another intrinsic evaluation measures how well models match human judgments of word similarity. GloVe, trained on diverse data, outperforms plain SVD but shows similar performance to word2vec on word similarity tasks.
58:00 The objective function aims for the dot product to represent the log probability of co-occurrence, leading to the log-bilinear model with w_i, w_j, and bias terms.
59:24 In model building, a bias term is added for each word to account for general word probabilities, enhancing the representation.
01:00:23 Weighting by co-occurrence frequency adjusts for common words, giving more importance to pairs with higher co-occurrence counts.
01:02:40 Word vectors can be applied to end-user tasks like named entity recognition, significantly improving performance by capturing word meanings.
01:06:23 For word senses, separate vectors for each meaning have been experimented with, but the majority practice is a single vector per word type.
01:11:05 The word vector for a word type can be seen as a superposition of sense vectors, a weighted average where the weights correspond to sense frequencies.
Made with HARPA AI
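Since the takeaways above lean heavily on negative sampling (15:21 and 20:57), here is a minimal NumPy sketch of skip-gram with negative sampling for anyone who wants to see it concretely. The vocabulary, counts, and hyperparameters are invented for illustration; this is not code from the lecture.

```python
import numpy as np

# Toy sketch of skip-gram with negative sampling (SGNS).
# Vocabulary, counts, and hyperparameters are made up for illustration.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
V, d = len(vocab), 8
word2id = {w: i for i, w in enumerate(vocab)}

v_emb = 0.01 * rng.standard_normal((V, d))   # center-word vectors
u_emb = 0.01 * rng.standard_normal((V, d))   # outside-word vectors

# Unigram counts raised to the 3/4 power give the noise distribution.
counts = np.array([50.0, 10.0, 5.0, 40.0, 5.0, 8.0])
p_neg = counts ** 0.75
p_neg /= p_neg.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, k=5, lr=0.05):
    """One SGD step on the negative-sampling loss for a (center, outside) pair."""
    c, o = word2id[center], word2id[context]
    negs = rng.choice(V, size=k, p=p_neg)     # k noise words
    v_c = v_emb[c]

    g_pos = sigmoid(u_emb[o] @ v_c) - 1.0     # push the real pair toward 1
    g_neg = sigmoid(u_emb[negs] @ v_c)        # push noise pairs toward 0

    grad_v = g_pos * u_emb[o] + g_neg @ u_emb[negs]
    u_emb[o] -= lr * g_pos * v_c              # only a handful of rows
    u_emb[negs] -= lr * np.outer(g_neg, v_c)  # of U and V are touched
    v_emb[c] -= lr * grad_v

sgns_step("cat", "sat")
```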
@hackerzhome_org • 8 months ago
how about the prompt?
@Chrisoloni • 2 years ago
Thank you so much for this great course!
@stanfordonline • 2 years ago
Hi Chrisoloni! Thanks for your comment, we're glad to hear you're enjoying the content - happy learning!
@tseringjorgais2811 • 2 years ago
@stanfordonline Can I get the lecture slides somewhere?
@Xufana • 1 year ago
I guess the second question section ends at 45:55; you might want to add a timestamp there.
@Xufana • 1 year ago
I would add these:
45:55 Word vector evaluation
48:30 Intrinsic evaluation
57:42 Question
1:01:45 Extrinsic evaluation
1:03:25 Word sense & ambiguity
@sumekenov • 10 months ago
Bless you @Xufana
@jded1346 • 4 months ago
Wonderful course! Clarification: @11:29: The sparseness of affected/updated J(θ) elements depends only on the window size, not whether Simple Grad Descent or Stochastic Grad descent is used, right? Since within a window, the computation doesn't change across the two methods.
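For reference (standard word2vec math, not a quote from the lecture), the naive-softmax loss for one window with center word $c$ and observed outside word $o$ has the gradients

$$J = -\log\frac{\exp(u_o^\top v_c)}{\sum_{w\in V}\exp(u_w^\top v_c)},\qquad \frac{\partial J}{\partial v_c} = -u_o + \sum_{w\in V} P(w\mid c)\,u_w,\qquad \frac{\partial J}{\partial u_w} = \big(P(w\mid c)-\mathbf{1}[w=o]\big)\,v_c,$$

so the center-vector gradients are nonzero only for words that actually occur in the sampled windows (that sparsity depends on the window/batch contents), while under the full softmax every outside vector $u_w$ receives a small update; it is negative sampling that makes that side sparse as well.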
@whatsupLoading • 3 days ago
At 37:00, marry → bride might be more appropriate than marry → priest.
@AdityaAVG • 4 months ago
Can we get the lecture slides somewhere?
@goanshubansal8035 • 1 year ago
Once I've understood the first two videos, I'll be on rung two of the ladder.
@ryancodrai487 • 1 year ago
At 2:45 I think what you said about the word2vec model being a bag-of-words model is not strictly correct. Word2vec does gain some understanding of local word ordering. If I am incorrect, could you please explain?
@mrfli24 • 11 months ago
If you look at the probability formula, it only contains dot products and doesn't have any specific position information.
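To make that reply concrete, a tiny NumPy check (toy vectors, not from the lecture) shows that the window loss built from this softmax is unchanged when the context words are reordered:

```python
import numpy as np

# P(o | c) = softmax(U @ v_c)[o] depends only on dot products, so the
# loss for a window is the same under any permutation of its context words.
rng = np.random.default_rng(1)
V, d = 6, 4
U = rng.standard_normal((V, d))   # outside-word vectors
Vc = rng.standard_normal((V, d))  # center-word vectors

def window_loss(center, context):
    logits = U @ Vc[center]
    log_p = logits - np.log(np.exp(logits).sum())
    return -sum(log_p[o] for o in context)

print(window_loss(2, [0, 1, 3, 4]))
print(window_loss(2, [4, 3, 1, 0]))  # identical: word order is ignored
```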
@jongsong5370 • 2 years ago
I think... marry should be matched to bride and pray to priest on page 21.
@jeromeeusebius • 2 years ago
Good point. It is not clear whether the lecturer drew the vectors or took them as-is from the paper; the mismatch may indicate that the system is not perfect.
@jakanader • 1 year ago
@jeromeeusebius It looks like the lecturer drew the vectors, as the endpoints are varying distances from the words.
@carlloseduardo2917 • 1 year ago
Could be that the corpus the embedding model was trained on had more sentences with marry and priest in the same context.
@kiran.pradeep • 1 year ago
@carlloseduardo2917 Can you explain how the log-bilinear model "with vector differences" formula comes out of that? Which property of conditional probability was used? Any useful links? Timestamp 43:03
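Not an official answer, but the usual reading of that slide (following the GloVe paper's notation) is: if training makes $w_i^\top \tilde w_k \approx \log P(k \mid i)$ (the log-bilinear model, with the bias terms absorbing the marginal counts), then

$$(w_a - w_b)^\top \tilde w_x \;\approx\; \log P(x\mid a) - \log P(x\mid b) \;=\; \log\frac{P(x\mid a)}{P(x\mid b)},$$

so differences of word vectors encode log ratios of co-occurrence probabilities, which is the "linear meaning component" property used for analogies. The only probability fact used is $\log(p/q) = \log p - \log q$, applied to the ratio of conditional co-occurrence probabilities.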
@vohiepthanh9692 • 1 year ago
I agree with you.
@ronitmndl • 1 year ago
22:36 word2vec ends
@nanunsaram • 1 year ago
Great again!
@RomilVikramSonigra • 1 year ago
When using stochastic gradient descent, if we choose a batch of 32 center words, how do we make updates to the outside (context) words that surround them? These words show up when we compute the likelihood, and if our batch doesn't include them, how does their probability of occurring get updated? Thanks!
@AshishBangwal • 1 year ago
I think your query is answered at 11:36: we only calculate gradients for the words that actually appear in those 32 windows, hence we get a sparse gradient update.
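For anyone unsure what a sparse gradient update looks like in code, here is an illustrative NumPy fragment (shapes and indices are made up): only the embedding rows for words that occurred in the minibatch windows are modified.

```python
import numpy as np

# Illustrative sparse update: touch only the rows of the embedding matrix
# for words that appeared in the sampled windows of this minibatch.
V, d, lr = 10_000, 100, 0.05
rng = np.random.default_rng(0)
emb = 0.01 * rng.standard_normal((V, d))

touched_ids = np.array([17, 256, 4031, 9999])        # words seen in the batch
grads = rng.standard_normal((len(touched_ids), d))   # their accumulated grads

np.add.at(emb, touched_ids, -lr * grads)  # all other ~10k rows stay untouched
```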
@raghavkansal9701 • 6 days ago
I feel this course is giving me a tough time with the mathematics. Sad :(
@goanshubansal8035 • 1 year ago
this lecture is about neural classifiers
@goanshubansal8035 • 1 year ago
What are neural classifiers?
@goanshubansal8035 • 1 year ago
Have you understood the deep learning standards yet?
@darkmember727 • 4 months ago
Just found out he wrote the GloVe paper.
@yukisuki5380 • 1 year ago
3:40 reasonably hahahahah
@annawilson3824 • 8 months ago
1:10:37
@annawilson3824 • 8 months ago
38:00
@shawnyang2851 • 1 month ago
some parts are damn confusing
@amitabhachakraborty497 • 10 months ago
The lectures are not so good by Stanford standards; it's just recitation.
@葛浩宇 • 11 months ago
Any Chinese students here?
@happylife4775 • 1 year ago
Great material, bad explanation.
@vohiepthanh9692 • 1 year ago
I think you should read the papers "Efficient Estimation of Word Representations in Vector Space", "Distributed Representations of Words and Phrases and their Compositionality", and "GloVe: Global Vectors for Word Representation" to better understand this lecture. I don't think he can cover all the concepts in detail in just 1 hour and 15 minutes.