How do transformers like ChatGPT learn and represent words?
Transformers are a type of neural network architecture used in natural language processing tasks such as language translation, language modelling, and text classification. They are effective at converting words into numerical values, which is necessary for an AI system to process language. There are three key concepts to consider when encoding words numerically: semantics (meaning), position (relative and absolute), and relationships and attention (grammar). Transformers excel at capturing relationships and attention, that is, the way words relate to and pay attention to each other in a sentence. They do this using an attention mechanism, which allows the model to selectively focus on certain parts of the input while processing it. In the next video, we will look at the attention mechanism in more detail and how it works.
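The simplest starting point for giving words numerical values is one-hot encoding (covered in the one-hot encoding video linked below). A minimal sketch, assuming a toy four-word vocabulary rather than anything from the video:

```python
import numpy as np

# One-hot encoding: each word becomes a vector of zeros with a single 1
# at the index assigned to that word (toy vocabulary for illustration).
vocab = ["cat", "dog", "mat", "sat"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

print(one_hot["dog"])   # -> [0. 1. 0. 0.]
```

One-hot vectors carry no information about meaning or position on their own; the steps below show how semantics and attention are layered on top of this basic numerical representation.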
We can encode word semantics by using a neural network to predict a target word from the words that surround it in a corpus of text. The network is trained using backpropagation, adjusting the weights and biases of the input and hidden layers until the updates become negligible and the network is said to be "trained". The weights connecting the input neurons to the hidden layer then contain an encoding of each word, with similar words having similar encodings. This allows for more efficient processing and a better understanding of the meaning and context of words in the language model.
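A minimal sketch of this idea (illustrative only; the tiny corpus, layer sizes, and learning rate are assumptions, not values from the video): train a small network to predict a target word from its surrounding context words, then read the word encodings off the input-to-hidden weights.

```python
import numpy as np

# Predict a target word from its surrounding context words; the learned
# input->hidden weights become the word embeddings.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8          # vocabulary size, hidden (embedding) size
window = 2                    # context words on each side of the target

rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, D))    # input -> hidden weights: the embeddings
W_out = rng.normal(0, 0.1, (D, V))   # hidden -> output weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.05
for epoch in range(200):
    for pos in range(window, len(corpus) - window):
        context = [word_to_idx[corpus[pos + o]]
                   for o in range(-window, window + 1) if o != 0]
        target = word_to_idx[corpus[pos]]

        h = W_in[context].mean(axis=0)    # average the context embeddings
        probs = softmax(h @ W_out)        # predicted distribution over the vocab

        # Backpropagation: cross-entropy gradients for both weight matrices
        d_out = probs.copy()
        d_out[target] -= 1.0
        d_h = W_out @ d_out               # gradient flowing back to the hidden layer
        W_out -= lr * np.outer(h, d_out)
        np.subtract.at(W_in, context, lr * d_h / len(context))

# After training, each row of W_in encodes one word; words used in similar
# contexts (e.g. "cat" and "dog" here) end up with similar vectors.
print(vocab)
print(np.round(W_in[word_to_idx["cat"]], 2))
```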
Video links:
On www.lucidate.co.uk:
- One-hot vector Encoding - www.lucidate.c...
- Neural Networks Primer - www.lucidate.c...
On RU-vid:
- One-hot vector Encoding - • EDA 2 - Categorical Data
- Neural Networks Primer - • Neural Network Primer
=========================================================================
Link to introductory series on Neural networks:
Lucidate website: www.lucidate.c....
RU-vid: www.youtube.co....
Link to intro video on 'Backpropagation':
Lucidate website: www.lucidate.c....
RU-vid: • How neural networks le...
'Attention is all you need' paper - arxiv.org/pdf/...
=========================================================================
Transformers are a type of artificial intelligence (AI) used for natural language processing (NLP) tasks, such as translation and summarisation. They were introduced in 2017 by Google researchers, who sought to address the limitations of recurrent neural networks (RNNs), which had traditionally been used for NLP tasks. RNNs are difficult to parallelise and tend to suffer from the vanishing/exploding gradient problem, making them hard to train on long input sequences.
Transformers address these limitations by using self-attention, a mechanism that allows the model to selectively choose which parts of the input to pay attention to. This makes the model much easier to parallelise and largely avoids the vanishing/exploding gradient problem.
Self-attention works by weighting the importance of different parts of the input, allowing the AI to focus on the most relevant information and better handle input sequences of varying lengths. This is accomplished through three matrices: Query (Q), Key (K) and Value (V). The Query matrix can be interpreted as representing the word for which attention is being calculated, while the Key matrix can be interpreted as representing the words to which attention is paid. Taking the dot product of the Query and Key matrices, scaling it, and applying a softmax gives the attention scores, which are then used to weight the Value matrix.
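A minimal sketch of scaled dot-product self-attention in NumPy (illustrative; the sequence length, dimensions, and random weight matrices are assumptions rather than anything from the video or the paper's code):

```python
import numpy as np

# Scaled dot-product self-attention: each row of x is one word's embedding.
def self_attention(x, W_q, W_k, W_v):
    Q = x @ W_q                          # queries: the words asking for attention
    K = x @ W_k                          # keys: the words being attended to
    V = x @ W_v                          # values: the information carried forward
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how much each word attends to each other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                   # attention-weighted mixture of the values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                  # e.g. a 4-word sentence, 8-dim embeddings
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

print(self_attention(x, W_q, W_k, W_v).shape)   # -> (4, 8)
```

Because every word attends to every other word in a single matrix product, the whole sequence can be processed in parallel, which is the key practical advantage over RNNs described above.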
=========================================================================
#ai #artificialintelligence #deeplearning #machinelearning #chatgpt #gpt3 #neuralnetworks #attention #attentionisallyouneed
7 Sep 2024