Thanks! That's high praise. Chris and Dan know much more than I do, but I like to think that my ignorance sometimes helps me explain things better, because I know from experience what confuses people.
For negative sampling, the negative examples are word pairs that keep the same focus word but pair it with a number of randomly sampled noisy context words. But here it seems to be done the other way around. Please let me know whether the two ways are equivalent or whether there is a mistake here.
At 11:00, what do "Features" and "Evidence" refer to? How is that formula similar to logistic regression? (I was expecting some e^x / (1 + e^x) on the RHS.) In the same formula, what does c' refer to? Is it all the words that are NOT in the context of a particular word w? How did this formula become the 6 sigmoids at 12:00?
1) The sigmoid function encodes the exponential you're looking for: sigma(x) = e^x / (1 + e^x). 2) The features and evidence are the word and context vectors. 3) c' are the negative samples. 4) c is akin to the positive examples in logistic regression, while c' is like the negative examples.
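To make 1)–4) concrete, here is a minimal sketch of the skip-gram negative-sampling objective (the function names and array shapes are my own, not from the video). Each sigmoid term is one logistic-regression-style classifier: the true context c plays the positive example, and each sampled c' plays a negative one.

```python
import numpy as np

def sigmoid(x):
    # The logistic function: sigma(x) = 1 / (1 + e^(-x)) = e^x / (1 + e^x),
    # which is where the expected exponential is hiding.
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_w, v_c, v_neg):
    # Negative-sampling loss for one (focus, context) pair:
    #   -[ log sigma(v_c . v_w) + sum_{c'} log sigma(-v_c' . v_w) ]
    # v_w: focus vector (d,), v_c: true context vector (d,),
    # v_neg: matrix (k, d) of k randomly sampled negative context vectors c'.
    pos = np.log(sigmoid(v_c @ v_w))              # positive example (y = 1)
    neg = np.log(sigmoid(-(v_neg @ v_w))).sum()   # negative examples (y = 0)
    return -(pos + neg)
```

With one true context and five sampled negatives you get six sigmoid terms, which is plausibly what the slide at 12:00 is showing.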
@JordanBoydGraber Regarding 3): aren't the negative samples the focus word, as shown at 12:30? I'm confused because sometimes the negative sample is a context word and sometimes the focus word. Does this depend on whether CBOW or skip-gram is used (i.e., negative-sampling CBOW corrupts the focus word, while negative-sampling skip-gram corrupts the context words)?
I'm still confused about the n-gram model versus the skip-gram model. Did he make a mistake, or am I just confused? Basically, an n-gram model uses the previous n-1 words to predict the nth word, so it is using context words to predict a target word. But in this video he says skip-gram uses the target (focus) word to predict the context words. The two seem to contradict each other! (A toy sketch of the two directions I mean is below.) Any expert opinion on this would be highly appreciated.
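To illustrate the two prediction directions (my own toy example, not from the video):

```python
sentence = "the quick brown fox jumps".split()

# n-gram (here trigram) language model: the previous n-1 words predict the nth.
trigrams = [((sentence[i], sentence[i + 1]), sentence[i + 2])
            for i in range(len(sentence) - 2)]
# [(('the', 'quick'), 'brown'), (('quick', 'brown'), 'fox'), ...]

# skip-gram (word2vec): the focus word predicts each context word in a window.
window = 2
pairs = [(sentence[i], sentence[j])
         for i in range(len(sentence))
         for j in range(max(0, i - window), min(len(sentence), i + window + 1))
         if j != i]
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]
```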
Hi, my name is Ari and I am from Indonesia. Could you help explain the sent2vec model (Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features), as you did in your video about word2vec?
Ignoring the negative samples, why do we need to optimize dot products by gradient descent rather than merely counting, for each occurrence of each focus word in the training data, the occurrences of its context words (and then normalizing)?
That's a great question! What you're proposing is essentially PMI (pointwise mutual information), which word2vec approximates, projected into a lower dimension. word2vec throws some information away through this projection, but that seems to help.
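For concreteness, a minimal sketch of that counting route (my own code): build a co-occurrence matrix and normalize it into PMI. Levy and Goldberg (2014) showed that skip-gram with negative sampling implicitly factorizes a shifted version of exactly this matrix.

```python
import numpy as np

def pmi(cooc):
    # cooc[w, c] = how often context word c occurs near focus word w.
    total = cooc.sum()
    p_wc = cooc / total                              # joint p(w, c)
    p_w = cooc.sum(axis=1, keepdims=True) / total    # marginal p(w)
    p_c = cooc.sum(axis=0, keepdims=True) / total    # marginal p(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.log(p_wc / (p_w * p_c))             # log p(w,c)/(p(w) p(c))
    out[~np.isfinite(out)] = 0.0                     # unseen pairs get PMI 0
    return out
```

Counting gives each word a |V|-dimensional row of this matrix; word2vec instead learns d-dimensional vectors whose dot products approximate (a shifted version of) those entries.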
@@JordanBoydGraber I see, it's a lower dimension because you simply initialize random vectors (of arbitrary, lower length) and consider dot products, rather than having a (# of words)-long vector for each word. Thanks a ton!
It's the length of the embedding. It really doesn't mean much other than the size of the representation you're using, i.e., how complicated your model is going to be.
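Concretely (a sketch with made-up sizes, not numbers from the video):

```python
import numpy as np

vocab_size, dim = 10_000, 300   # dim is "the length of the embedding"

# word2vec keeps two tables, one row per vocabulary word: focus ("input")
# vectors and context ("output") vectors. Only dim controls the capacity
# of the representation; it is chosen by hand, independent of vocab_size.
W_focus = 0.01 * np.random.randn(vocab_size, dim)
W_context = 0.01 * np.random.randn(vocab_size, dim)

score = W_context[42] @ W_focus[7]   # the dot product fed into the sigmoids
```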
At 10:20, in the probability function you use exp(v_c · v_w). But didn't you say that the context and focus words have different vectors? Then why are we drawing both the context vector and the focus vector from the same v?
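For reference, here is the softmax in question, rewritten with the two embedding tables given different letters (my notation, not the slide's):

```latex
P(c \mid w) = \frac{\exp(\mathbf{u}_c \cdot \mathbf{v}_w)}
                   {\sum_{c' \in V} \exp(\mathbf{u}_{c'} \cdot \mathbf{v}_w)}
```

As I understand it, the focus vectors v and context vectors u are supposed to come from two separate tables, so writing both with the letter v is exactly what confuses me.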
Apart from some errors (the theta parameter never occurs on the right-hand side of your equations, and that is even incorrect, since the "probability" given by exp(...)/sum(exp(...)) basically IS the theta parameter), what's worse is that it looks like you copied most of the math from the Stanford NLP lecture without giving credit. BTW, the theta parameter is explained in that lecture...
I did draw on Yoav Goldberg's lectures (and credited him). I suspect the Stanford folks did the same, but the equations themselves come from the original word2vec paper. Using theta as a general catch-all for a model's parameters is quite common in ML.