LayoutLM: Pre-training of Text and Layout for Document Image Understanding (Paper Summary)


#ai #documentparsing #languagemodel #transformers
LayoutLM v1/v2 proposes a pre-training objective for understanding documents better by incorporating layout, text, and image snippets of the actual text. It fits well in use cases such as resume parsing, bill parsing, and table parsing.
⏩ Abstract: Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42).
⏩ OUTLINE:
0:00 - Background and Abstract
03:58 - LayoutLM pre-training mechanism, architecture and intuition
⏩ Paper Title: LayoutLM: Pre-training of Text and Layout for Document Image Understanding
⏩ Paper: arxiv.org/abs/1912.13318
⏩ Authors: Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou
⏩ Organisation: Harbin Institute of Technology, Beihang University, Microsoft Research Asia
⏩ Code: github.com/microsoft/unilm/tr...
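The abstract's idea of jointly modelling text and layout can be illustrated with a minimal sketch. It assumes each OCR word comes with a pixel bounding box, that boxes are rescaled to a 0-1000 integer grid, and that the input vector for a token is the sum of a word embedding and one embedding per box coordinate. The table sizes, the hidden size, the example page dimensions and the helper names (normalize_box, make_table, embed_token) are hypothetical and chosen only for illustration; this is not the authors' code.

```python
import random

GRID = 1000  # assumed coordinate grid: boxes are rescaled to integers in [0, 1000]
HIDDEN = 8   # tiny hidden size, for illustration only


def normalize_box(box, page_width, page_height):
    """Rescale a pixel box (x0, y0, x1, y1) to the 0..GRID integer grid."""
    x0, y0, x1, y1 = box
    return (
        int(GRID * x0 / page_width),
        int(GRID * y0 / page_height),
        int(GRID * x1 / page_width),
        int(GRID * y1 / page_height),
    )


def make_table(rows, rng):
    """Randomly initialised embedding table standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(rows)]


def embed_token(token_id, box, tables):
    """Sum the word embedding with one embedding per box coordinate."""
    x0, y0, x1, y1 = box
    parts = [
        tables["word"][token_id],
        tables["x"][x0],  # left edge
        tables["y"][y0],  # top edge
        tables["x"][x1],  # right edge
        tables["y"][y1],  # bottom edge
    ]
    return [sum(values) for values in zip(*parts)]


rng = random.Random(0)
tables = {
    "word": make_table(100, rng),    # hypothetical 100-token vocabulary
    "x": make_table(GRID + 1, rng),  # shared by x0 and x1 (assumption)
    "y": make_table(GRID + 1, rng),  # shared by y0 and y1 (assumption)
}

# A word found by OCR at pixels (120, 40)-(260, 70) on a 1240x1754 page scan.
box = normalize_box((120, 40, 260, 70), page_width=1240, page_height=1754)
vector = embed_token(token_id=42, box=box, tables=tables)
print(box, len(vector))
```

Because the layout signal enters as extra embeddings added to the text embedding, two identical words at different positions on the page get different input vectors, which is what lets a downstream Transformer use position on the page as a cue.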
Enjoy reading articles? Then consider subscribing to Medium membership; it is just $5 a month for unlimited access to all free/paid content. Subscribe now - / membership
*********************************************
If you want to support me financially, which is totally optional and voluntary :) ❤️
You can consider buying me chai (because I don't drink coffee :) ) at www.buymeacoffee.com/TechvizC...
*********************************************
⏩ IMPORTANT LINKS
Research Paper Summaries: • Simple Unsupervised Ke...
*********************************************
⏩ RU-vid - / @techvizthedatascienceguy
⏩ LinkedIn - / prakhar21
⏩ Medium - / prakhar.mishra
⏩ GitHub - github.com/prakhar21
*********************************************
⏩ Please feel free to share the content and subscribe to my channel - / @techvizthedatascienceguy
Tools I use for making videos :)
⏩ iPad - tinyurl.com/y39p6pwc
⏩ Apple Pencil - tinyurl.com/y5rk8txn
⏩ GoodNotes - tinyurl.com/y627cfsa
#techviz #datascienceguy #documentAI #naturallanguageprocessing #resumeparsing #transformers
About Me:
I am Prakhar Mishra and this channel is my passion project. I am currently pursuing my MS (by research) in Data Science. I have 3+ years of industry experience in Data Science and Machine Learning, with a particular focus on Natural Language Processing (NLP).

Published: Aug 4, 2024

Comments: 17

@TechVizTheDataScienceGuy 2 years ago

Watch more paper summaries at ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-ykClwtoLER8.html

@sudhirpol1895 1 year ago

The content is really good, but one thing: in the Hugging Face implementation they have not used OCR output for the fine-tuning task. During pre-training it is not a multimodal model, but during fine-tuning it should be called a multimodal model, right?

@TheMarComplex 1 year ago

This was pretty interesting, love to know about the V1 architecture as well!

@marinamaher8211 1 year ago

Great, thanks for this clear explanation. If you do V2 & V3, it will be awesome.

@TheMarComplex 1 year ago

Thanks!

@yosefasefaw4207 1 year ago

thanks a lot! you are amazing

@TechVizTheDataScienceGuy 1 year ago

You’re welcome ☺️

@mariussame9357 1 year ago

Hi! Thanks for the video! I want to ask you a question. I'm working on different use cases, and most of the time the goal is to extract information, so I found this model really interesting. The problem is that I'm French, so the texts I want to extract information from are in French, and I assume this model was pre-trained on English documents. Do you think I can still fine-tune the model on my French documents, or do you have any recommendation?