
The future of MultiModal Models: Small, Powerful, and Efficient. [TinyGPTV Paper Reading] 

Kye Gomez

PAPER LINK: arxiv.org/pdf/...
Join Agora the open source AI research lab: / discord
My implementation of this paper in Zeta:
github.com/kye...
Follow me on GitHub: github.com/kye...
My Notes
Linear Projection Layer to Embed Visual Features: TinyGPT-V uses a linear projection layer to embed visual features, mapping the visual input into a high-dimensional space aligned with the language model so the model can learn better representations.
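
A minimal PyTorch sketch of this step; the feature dimensions below (1408 for the ViT output, 2560 for the language model) are illustrative assumptions, not values taken from the paper:

import torch
import torch.nn as nn

vit_dim, llm_dim = 1408, 2560                 # assumed dims: ViT output, LLM hidden size
visual_proj = nn.Linear(vit_dim, llm_dim)     # linear projection into the LLM embedding space

image_tokens = torch.randn(1, 257, vit_dim)   # dummy ViT output: [batch, patches, vit_dim]
llm_inputs = visual_proj(image_tokens)        # projected visual tokens for the language model
print(llm_inputs.shape)                       # torch.Size([1, 257, 2560])
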
Q-Former as a Projection Layer: TinyGPT-V also uses a Q-Former as a projection layer. The Q-Former, inspired by the transformer architecture, captures the interdependencies and relationships among visual features, improving the model's overall understanding of the input.
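
A rough sketch of the Q-Former idea, assuming it is a set of learned query tokens that cross-attend to the ViT features and compress them into a short sequence; module names and sizes here are hypothetical, not the actual BLIP-2/TinyGPT-V implementation:

import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))   # learned query tokens
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats):                       # image_feats: [B, N, dim]
        q = self.queries.expand(image_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, image_feats, image_feats)   # queries attend to image features
        return attended + self.ffn(attended)              # compressed visual tokens: [B, num_queries, dim]
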
Initializing Linear Projections with a Gaussian Distribution: The linear projection layers in TinyGPT-V are initialized from a Gaussian distribution. This introduces diversity and randomness into the projection weights, helping the model learn effectively from the start.
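
A small sketch of Gaussian initialization for a projection layer; the mean and standard deviation values are illustrative assumptions:

import torch.nn as nn

proj = nn.Linear(768, 2560)
nn.init.normal_(proj.weight, mean=0.0, std=0.02)   # Gaussian-initialized projection weights
nn.init.zeros_(proj.bias)                          # zero-initialized bias
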
First Stage: Image → ViT → Q-Former → Linear Projection → Transformer → Normalization: The first stage processes the input image through a Vision Transformer (ViT) to extract visual features. These features pass through the Q-Former projection layer and then a linear projection; the projected features are fed into the transformer and normalized for more stable performance.
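
A toy end-to-end sketch of that first-stage flow, with simple linear layers standing in for the real ViT and Q-Former; the dimensions and module choices are assumptions for illustration only:

import torch
import torch.nn as nn

vit     = nn.Linear(3 * 14 * 14, 1408)   # stand-in for a frozen ViT patch encoder
qformer = nn.Linear(1408, 768)           # stand-in for the Q-Former projection
proj    = nn.Linear(768, 2560)           # linear projection into the LLM space
norm    = nn.LayerNorm(2560)             # normalization before the language model

patches = torch.randn(1, 257, 3 * 14 * 14)   # dummy image patches
x = vit(patches)                             # ViT features
x = qformer(x)                               # Q-Former projection
x = proj(x)                                  # linear projection
x = norm(x)                                  # normalized visual tokens, fed into the LLM transformer
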
Incorporating Better or Bigger Vision Embedding Models: To further improve TinyGPT-V's performance and capabilities, it is worth exploring better or larger vision embedding models, for example more advanced state-of-the-art architectures such as ViT-Large or ViT-Huge.
LLMs are Databases: In the context of TinyGPT-V, LLMs (large language models) behave like databases: the knowledge and context absorbed during training is stored in their weights, and the model draws on that stored knowledge for decision-making and multimodal understanding.
Treating Weights as Databases: The model's weights are treated as databases, in the sense that they store the information that drives the model's performance. Managing and using these weights properly is therefore critical to getting good results with TinyGPT-V.
Normalizing q, k, v, and the Overall Data: To keep training stable and effective, TinyGPT-V applies normalization across the architecture. The query (q), key (k), and value (v) vectors are each normalized, which improves information flow and attention computation, and the data is also normalized at the different processing stages to prevent instabilities.
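
A sketch of per-head q/k/v normalization inside scaled dot-product attention, assuming LayerNorm over the head dimension; this illustrates the idea rather than the exact TinyGPT-V layers:

import torch
import torch.nn as nn
import torch.nn.functional as F

dim, heads = 2560, 32
head_dim = dim // heads
q_norm, k_norm, v_norm = nn.LayerNorm(head_dim), nn.LayerNorm(head_dim), nn.LayerNorm(head_dim)

def attention(q, k, v):                               # q, k, v: [B, heads, T, head_dim]
    q, k, v = q_norm(q), k_norm(k), v_norm(v)         # normalize queries, keys, and values
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    return F.softmax(scores, dim=-1) @ v              # normalized attention output
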
Addressing Issues with Smaller Models: Smaller models are prone to NaN (not a number) or inf (infinity) values during multimodal computation because of their limited capacity; the loss can then become NaN and the initial batches fail to propagate. In addition, models with few trainable parameters suffer from vanishing gradients, making it hard to learn complex representations.
Post-Norm and Input Norm Mechanism: To mitigate these training instabilities and improve learning, TinyGPT-V uses a post-norm and input-norm mechanism: RMSNorm (root mean square normalization) is applied after the multi-head attention (MHA) operation to normalize the data, keeping the model's representations stable and consistent.
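
A minimal RMSNorm and a post-norm attention block matching the description above; this is a sketch of the mechanism, not the exact TinyGPT-V implementation:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # x / sqrt(mean(x^2) + eps), scaled by a learned weight
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class PostNormBlock(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.input_norm = nn.LayerNorm(dim)                             # input norm before attention
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.post_norm = RMSNorm(dim)                                   # RMSNorm applied after MHA

    def forward(self, x):
        h = self.input_norm(x)
        attn_out, _ = self.mha(h, h, h)
        return self.post_norm(x + attn_out)                             # post-norm over the residual sum
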

Published: 25 Aug 2024
