
Reformer: The Efficient Transformer 

Yannic Kilcher
262K subscribers
20K views

The Transformer for the masses! Reformer solves the biggest problem with the famous Transformer model: its huge resource requirements. By cleverly combining Locality Sensitive Hashing and ideas from Reversible Networks, the classically huge memory footprint of the Transformer is drastically reduced. Not only does that mean the model uses less memory, but it can also process much longer input sequences, up to 16K tokens with just 16 GB of memory!
arxiv.org/abs/...
ai.googleblog....
Abstract:
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L²) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
Authors: Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
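The locality-sensitive hashing the abstract refers to is the angular variant: project each vector onto a few random directions and bucket it by the argmax over the projections and their negations. A minimal NumPy sketch of that hashing step (the bucket count, dimensions, and example vectors here are made up for illustration, not the paper's configuration):

```python
import numpy as np

def angular_lsh(vectors, n_buckets, seed=0):
    """Bucket vectors by argmax over [xR; -xR] for a random projection R.
    Nearby (high cosine similarity) vectors tend to share a bucket."""
    rng = np.random.default_rng(seed)
    d = vectors.shape[-1]
    R = rng.normal(size=(d, n_buckets // 2))   # random projection directions
    rotated = vectors @ R                      # shape [n, n_buckets // 2]
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

# toy check: the first two vectors are close and will usually collide,
# the third points the other way and usually lands elsewhere
x = np.array([[1.0, 0.2], [0.95, 0.25], [-1.0, -0.2]])
print(angular_lsh(x, n_buckets=8))
```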
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.c...
Minds: www.minds.com/...

Published: 13 Sep 2024
Comments: 25
@autripat 4 years ago
Loved this presentation. Essential watching for the rest of us. Amazing references as well.
@michaelcarlon1831 4 years ago
My man! Great, super useful video! A true contribution to the community.
@allenhung4390 3 years ago
You explained the concepts pretty clearly. Keep up the good work!!
@tsupeichen693 4 years ago
Great tutorial with a great voice! Thank you!
@rramjee1 3 years ago
Hi Yannic. Thanks for this wonderful explanation. Can you please share any practical implementation of the Reformer architecture? Thanks again.
@xaviergastaldi3214 4 years ago
Nice one! I think that, for the chunks illustration, the arrows only go to the left because the authors used a decoder Reformer, i.e. attention only looks at past words and not at future ones.
@YannicKilcher 4 years ago
That makes sense. In the text they say every chunk can look at itself and its neighbors; I guess in the encoder that would be to the left and right.
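For intuition, here is a toy version of the chunked attention pattern being discussed: queries in each chunk attend to their own chunk and the previous one, with a causal mask for the decoder case. This is only a sketch of the pattern, not the paper's implementation (which also sorts positions by hash bucket first and shares queries and keys):

```python
import numpy as np

def chunked_causal_attention(q, k, v, chunk_size):
    """Each chunk of queries attends to its own chunk and the chunk
    before it; the mask enforces 'look left only' (decoder case)."""
    n, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        ctx = max(0, start - chunk_size)               # previous chunk + own chunk
        scores = q[start:end] @ k[ctx:end].T / np.sqrt(d)
        q_pos = np.arange(start, end)[:, None]
        k_pos = np.arange(ctx, end)[None, :]
        scores = np.where(k_pos <= q_pos, scores, -np.inf)   # causal mask
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:end] = weights @ v[ctx:end]
    return out

q = k = v = np.random.randn(16, 4)
print(chunked_causal_attention(q, k, v, chunk_size=4).shape)   # (16, 4)
```

In the encoder variant mentioned above, the context window would presumably extend one chunk to the right as well, and the causal mask would be dropped.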
@matthewtang1489 3 years ago
That makes so much more sense! thanks!
@avihudekel4709 3 years ago
At 11:00 I thought of an alt-J song. Great video!
@mohammadkhan5430 4 years ago
Great Explanation
@manos_v163 3 years ago
It's still not clear how the n log n factor comes out of LSH attention; it's quite vague how they manage this complexity. My intuition is that, through chunking, each vector is involved in a constant number of computations, which would be asymptotically smaller than a log n factor.
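For what it's worth, a hedged sketch of where the log factor is usually attributed in the argsort-style implementation: sorting positions by bucket id is the O(L log L) step, while the per-chunk attention after sorting is linear in L for a fixed chunk size, consistent with the intuition above. The numbers below are illustrative, not the paper's settings:

```python
import numpy as np

L, chunk = 16384, 64
buckets = np.random.randint(0, 128, size=L)      # hypothetical bucket ids from LSH
order = np.argsort(buckets, kind="stable")       # the O(L log L) part: sort by bucket
# after sorting, each query only attends within its chunk and a neighbor chunk,
# so the attention itself is O(L * chunk): linear in L for fixed chunk size
sorted_positions = order.reshape(L // chunk, chunk)
print(sorted_positions.shape)                    # (256, 64)
```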
@xiquandong1183 4 years ago
Great video. If possible, can you please explain ALBERT paper next? Thanks.
@RaviTeja-zk4lb 4 years ago
Both the Longformer and this Reformer try to solve the problem of long sequences, but they use different ideas: here it is LSH, there it is the sliding-window and global-attention concept. Which works better?
@petroschristodoulou7987 4 years ago
Thanks a lot for your video.
@tamasionut2279 4 years ago
I was wondering if you can do a presentation on the RevNets article as well.
@lucasnestler6933 4 years ago
memcnn explains it rather well and also provides a PyTorch wrapper for it.
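For readers who want the gist without pulling in a library: the reversible trick is an additive coupling, y1 = x1 + F(x2), y2 = x2 + G(y1), which can be inverted exactly, so activations can be recomputed during the backward pass instead of stored. A minimal PyTorch sketch of that coupling (not memcnn's actual API, and without the custom backward pass that makes the memory saving real):

```python
import torch

class ReversibleBlock(torch.nn.Module):
    """Additive coupling used by RevNets and the Reformer:
    y1 = x1 + F(x2), y2 = x2 + G(y1), invertible exactly."""
    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# sanity check: the inverse recovers the inputs up to float error
block = ReversibleBlock(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
r1, r2 = block.inverse(*block(x1, x2))
print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))
```

In the Reformer, F is the attention layer and G is the feed-forward layer.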
@tae898 3 years ago
You are so good!
@mathmagic9333 3 years ago
Hi Yannic, nice video and presentation! I had a question about 18:42 -- how do we know that a specific "bucket" does not spill over into multiple chunks (in the diagram, max bucket size
@YannicKilcher 3 years ago
true, I guess rare cases are just left out for now :)
@pratikkorat790 3 years ago
How can I contact you? Please answer!
@simleek6766 4 years ago
Huh. This reminds me of error correction a bit. I think we should be looking into error correction more for transformers. I tried using an error correction algorithm and a bool array for the place/number encoding for my transformer. Multiple numbers could be represented as overlayed on top of each other, and space can be reduced by removing these binary neurons by layer, or just at random. I wonder if that backprop fix could apply to that system...
@YannicKilcher 4 years ago
That sounds like it could work, but I haven't thought about it deeply. Maybe worth a try
@YannicKilcher 4 years ago
The main problem with drawing parallels to things like security or cryptography is the following: machine learning thinks in terms of distances, ratios, proximity, high and low numbers, whereas those fields think about data in a much stricter sense. Either two things are exactly equal or not at all; either two hashes match or not. They don't care whether the two are similar, so the fundamental goals of the fields are opposite.
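A small illustration of that contrast: a cryptographic hash treats nearly identical inputs as unrelated, while an angular LSH bucket (the kind used here) is designed so that nearby vectors usually collide. SHA-256 merely stands in for the "exact match" style of hashing; the vectors and projection are made up:

```python
import hashlib
import numpy as np

a = np.array([1.00, 0.20])
b = np.array([1.01, 0.21])               # almost the same vector

# cryptographic hashing: a tiny change gives a completely different digest
print(hashlib.sha256(a.tobytes()).hexdigest()[:12])
print(hashlib.sha256(b.tobytes()).hexdigest()[:12])

# locality-sensitive hashing: nearby vectors usually land in the same bucket
rng = np.random.default_rng(0)
R = rng.normal(size=(2, 4))              # random projection -> 8 buckets

def bucket(v):
    return int(np.argmax(np.concatenate([v @ R, -(v @ R)])))

print(bucket(a), bucket(b))              # very likely equal
```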
@jony7779 2 years ago
I'm convinced this hasn't caught on because it's so complicated to implement 🤨
@prakharthapak4229 3 years ago
5:30 Yannic be like: let's skip time complexity :p