[MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices][Paper Reading 📑] 

Kye Gomez
249 views

Vicuna and LLaVA are the prior systems this work builds on: LLaVA connects a vision encoder to the Vicuna LLM through a projection layer
Vision Transformers are used as the visual backbone for multi-modal LLMs; here it is a CLIP ViT-L/14 encoder
The system aims to fit the tight compute and energy budgets of mobile devices
LDP (Lightweight Downsample Projector) is the layer that aligns the visual and textual feature spaces and consists of depthwise convolutions, pointwise convolutions, and layer norms
An input image is represented as an H x W x C tensor
The visual encoder dissects an image into patches, linearly embeds each patch, adds positional encodings, and passes the resulting token sequence through a transformer encoder to obtain a classification token (sketched below)
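A minimal PyTorch sketch of that front-end, assuming CLIP-like dimensions (336x336 input, 14x14 patches, width 1024); the depth and head counts here are placeholders, not the paper's values:

```python
import torch
import torch.nn as nn

class PatchEmbedViT(nn.Module):
    """Patchify, linearly embed, add positions, prepend a CLS token, encode."""
    def __init__(self, img_size=336, patch_size=14, in_ch=3, dim=1024, depth=2, heads=16):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2            # 24 * 24 = 576
        # A conv with stride == kernel == patch size is patch slicing + a linear map.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                      # x: (B, 3, 336, 336)
        x = self.proj(x).flatten(2).transpose(1, 2)            # (B, 576, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed        # positions over CLS + patches
        x = self.encoder(x)
        return x[:, 0], x[:, 1:]                               # CLS token, patch tokens
```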
Various techniques and components are used in the system, including the MobileLLaMA tokenizer, pre-normalization with RMSNorm, the SwiGLU activation, FlashAttention v2, etc. (RMSNorm and SwiGLU are sketched below)
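Hedged sketches of two of those components, RMSNorm and a SwiGLU feed-forward, in the LLaMA style the notes point to (dimensions and layer shapes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization via RMSNorm: rescale by the root-mean-square, no mean-centering."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: silu(x W1) gated elementwise by x W3, projected by W2."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```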
Q-Former is considered an inferior projector due to its slow convergence
An MLP projector retains all the visual information but pads the sequence with many uninformative tokens
Convolutions are utilized to enhance positional information and encourage local interaction
Convolution with a stride of 2 is used to reduce the number of visual tokens by 75% while keeping spatial information
LDP has significantly fewer parameters compared to the visual encoder and runs much faster
Layer Normalization is used instead of batch normalization for stable training
LDP takes visual embeddings as input and outputs aligned visual tokens (see the sketch below)
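A rough sketch of such a projector, assuming 576 input tokens on a 24x24 grid and illustrative widths (vis_dim=1024, llm_dim=2048); the paper's LDP uses a more specific arrangement of blocks and skip connections, so treat this as the ingredients, not the recipe:

```python
import torch
import torch.nn as nn

class LDPSketch(nn.Module):
    """Sketch of a Lightweight Downsample Projector: pointwise projections to the
    LLM width, a stride-2 depthwise conv that cuts tokens by 75% (576 -> 144),
    and a LayerNorm. Block ordering is illustrative, not the paper's exact recipe."""
    def __init__(self, vis_dim=1024, llm_dim=2048, grid=24):
        super().__init__()
        self.grid = grid
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        # Depthwise conv (groups == channels); stride 2 halves each spatial side.
        self.dw = nn.Conv2d(llm_dim, llm_dim, 3, stride=2, padding=1, groups=llm_dim)
        self.pw = nn.Conv2d(llm_dim, llm_dim, 1)   # pointwise conv mixes channels
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, f):                            # f: (B, 576, vis_dim)
        x = self.mlp(f)                              # (B, 576, llm_dim)
        b, n, d = x.shape
        x = x.transpose(1, 2).reshape(b, d, self.grid, self.grid)  # back to a 24x24 grid
        x = self.pw(self.dw(x))                      # (B, d, 12, 12)
        x = x.flatten(2).transpose(1, 2)             # (B, 144, llm_dim)
        return self.norm(x)                          # aligned visual tokens for the LLM
```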
The multi-modal training strategy spans three components: the vision encoder, the projector, and the LLM
Training parameters include the AdamW optimizer, a DeepSpeed backend with ZeRO stage 1, and 8 A100 GPUs (a config sketch follows below)
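An illustrative DeepSpeed setup matching that description; the learning rate, batch size, and precision flag are placeholder assumptions, and model stands in for the projector-plus-LLM stack:

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)   # placeholder for the projector + LLM stack

# AdamW with ZeRO stage 1, as the notes describe; values are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5, "weight_decay": 0.0}},
    "zero_optimization": {"stage": 1},
    "bf16": {"enabled": True},
}

# Launched with the deepspeed launcher across the 8 A100s.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```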
Data is randomly shuffled with a fixed seed to break up its sequential order
A data shuffler randomizes example positions in a Hugging Face dataset (see the snippet below)
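A minimal example of that kind of deterministic shuffle with the Hugging Face datasets library; the dataset id is illustrative, not necessarily the one used in the video:

```python
from datasets import load_dataset

# A fixed seed breaks the sequential order of examples while keeping runs reproducible.
ds = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")
ds = ds.shuffle(seed=42)   # permutes example positions deterministically
print(ds[0])
```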
During VLM training, both the projector and the LLM are fine-tuned to enhance visual understanding
The gain provided by the visual model becomes saturated after a certain amount of training data
The number of visual tokens impacts performance, and LDP reduces the count from 576 to 144 (a 336x336 input with 14x14 patches gives 24 x 24 = 576 tokens; stride-2 downsampling yields 12 x 12 = 144, a 75% reduction)
Performance improves as pretraining costs increase
Image-level alignment has greater performance potential than object-level alignment
Swin Transformers outperform ViT backbones in the reported comparison
LLaVA uses a simple two-layer MLP as its projector (sketched below)
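A sketch of that projector with assumed CLIP-to-LLM widths (1024 in, 4096 out); note it keeps all 576 tokens rather than downsampling like LDP:

```python
import torch.nn as nn

# Two-layer MLP projector in the LLaVA style: maps each visual token to the
# LLM width without reducing the token count. Widths are illustrative.
mlp_projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
```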
The work explores different projector designs and benchmarks the best-performing one
The number of visual tokens affects inference speed
Reducing the input resolution is also explored as a way to cut tokens, but LDP's downsampling is found to work better
The pretraining dataset consists of portions from ArXiv, Book, C4, Common Crawl, GitHub, StackExchange, and Wikipedia

Published: 25 Aug 2024

Comments: 1
@MindStarsSoulStops · 12 days ago
I look forward to more of the content. But as an audio engineer... I cannot stress how important good audio is on RU-vid. You will lose so many people in seconds just because you didn't do EQ.