
Fine-tune LiLT model for Information extraction from Image and PDF documents | UBIAI | Train LiLT

Karndeep Singh
21K views

Structured document understanding has attracted considerable attention and made significant progress recently, owing to its crucial role in intelligent document processing. However, most existing related models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremely limited. To address this issue, we propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding. LiLT can be pre-trained on the structured documents of a single language and then directly fine-tuned on other languages with the corresponding off-the-shelf monolingual/multilingual pre-trained textual models. Experimental results on eight languages have shown that LiLT can achieve competitive or even superior performance on diverse widely-used downstream benchmarks, which enables language-independent benefit from the pre-training of document layout structure.
The video explains fine-tuning the LiLT model to extract information from documents such as invoices, receipts, financial documents, tables, etc.
✅ UBIAI Annotation Tool Detail Video: • Annotate Text, PDF & I...
✅ Signup for UBIAI Annotation Tool: ubiai.tools/Signup?...
1. Notebook: github.com/karndeepsingh/Extr...
2. LiLT Paper: arxiv.org/pdf/2202.13669.pdf
3. FUNSD Dataset: guillaumejaume.github.io/FUNSD/
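
For orientation before opening the notebook linked above, here is a rough, self-contained sketch of LiLT fine-tuning for token classification with Hugging Face Transformers. It is not the video's exact code: the label set, checkpoint choice and the toy words/boxes below are illustrative assumptions.

import torch
from transformers import AutoTokenizer, LiltForTokenClassification

labels = ["O", "B-INVOICE_NO", "I-INVOICE_NO", "B-TOTAL", "I-TOTAL"]  # example tag set, not the video's
id2label = dict(enumerate(labels))
label2id = {l: i for i, l in id2label.items()}

checkpoint = "SCUT-DLVCLab/lilt-roberta-en-base"  # English LiLT; swap the text backbone for other languages
tokenizer = AutoTokenizer.from_pretrained(checkpoint, add_prefix_space=True)
model = LiltForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id
)

# One toy "document": words with layout boxes normalised to the 0-1000 grid LiLT expects.
words = ["Invoice", "No.", "1243553"]
boxes = [[60, 40, 150, 60], [155, 40, 190, 60], [200, 40, 290, 60]]
word_labels = [label2id["B-INVOICE_NO"], label2id["I-INVOICE_NO"], label2id["O"]]

enc = tokenizer(words, is_split_into_words=True, truncation=True, max_length=512, return_tensors="pt")
# Expand word-level boxes/labels to sub-word tokens; special tokens get a zero box and an ignored label.
word_ids = enc.word_ids()
bbox = [[0, 0, 0, 0] if i is None else boxes[i] for i in word_ids]
token_labels = [-100 if i is None else word_labels[i] for i in word_ids]

outputs = model(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    bbox=torch.tensor([bbox]),
    labels=torch.tensor([token_labels]),
)
outputs.loss.backward()  # one illustrative step; a real run loops over the annotated dataset with an optimizer

In practice the words, boxes and labels come from the annotated UBIAI export rather than being hard-coded.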
Connect with me on:
1. LinkedIn: / karndeepsingh
2. Telegram Group: telegram.me/datascienceclubac...
3. Github: www.github.com/karndeepsingh
#datascience #nlp #deeplearning #documentunderstanding

Published: Dec 10, 2022

Comments: 60
@enkiube (a year ago)
Can you elaborate on the 4th txt file you review at 12:26? Is this the ordered list of the individual tags we labelled across all files? If that's the case, I was expecting to see the same labels as in the second file (the one ending with bf853.txt). Or is the 4th file just the unique list of all labels?
@user-xs4kv1eq3b (11 months ago)
Could you suggest a way to extract key-value pairs from the predictions on the image?
@zendr0 (4 months ago)
Good video. How do we get the information from the inference result image? Do we need to use a separate OCR for this?
@phuongtrinh3031 (10 months ago)
I like your video. Can you make a video on how to use DiT?
@karthiksaran1860 (a year ago)
Hi Karndeep, this is good, but the result we get from inference is image-based. Is it possible to get the output as text, e.g. input: invoice number? output: 1243553
@atyabtosif584 (a year ago)
If I am using my own dataset and I get an IndexError: index out of range in self during inference, how can this be solved?
@ahmedouahmedmahmou5906 (5 months ago)
How can I find the files "raw_data, train_split, sample_data, eval_split, .... " and the other files? The UBIAI Annotation Tool only provides a folder which contains four text files and the annotated images.
@yakinthigalatidou1316 (11 months ago)
Thank you for your video, Karndeep, it is very interesting and helpful. I am working on a similar task and I was wondering how many documents you annotated in your project.
@karndeepsingh (11 months ago)
100-120 samples
@shaigrustamov5115 (3 months ago)
Great video 👍 If I have 5000 invoices and want to train a model for data extraction, do I have to prepare both OCR output and labels for this? Must the labels also contain bounding boxes? Is it very time-consuming? Are there any models that can be trained without bounding boxes?
@praneethp3687 (a year ago)
Hey, Thanks! Can this model be used to detect diagrams and figures in the document as well?
@karndeepsingh (a year ago)
This model is not suitable for extracting figures and diagrams. Maybe you can use LayoutParser, which is a pretrained model. Otherwise, you can train your own object detection model for this purpose.
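
For the figure/diagram case, a minimal LayoutParser sketch might look like the following. It assumes the PubLayNet Detectron2 checkpoint that ships with layoutparser (install layoutparser with its Detectron2 extra), and the image path is hypothetical.

import cv2
import layoutparser as lp

image = cv2.imread("document_page.png")[..., ::-1]  # hypothetical page image, converted BGR -> RGB

model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

layout = model.detect(image)
for block in layout:
    if block.type == "Figure":
        print(block.type, block.coordinates)  # bounding box of each detected figure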
@masoudparpanchi505 (a year ago)
Hi, thanks. What if I want to extract text data from invoices? Is there any difference in fine-tuning or training? Should I use any other module?
@karndeepsingh (a year ago)
You can try with the pretrained model, but it would be better to fine-tune it on your dataset.
@amineouguenoune5353 (a year ago)
When you obtain the dataframe containing the predicted boxes associated with their labels, apply OCR on each region of interest.
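
A minimal sketch of this suggestion, assuming a Pandas dataframe of predictions with xmin/ymin/xmax/ymax columns (the names are made up; adapt them to whatever the inference step actually returns) and Tesseract as the OCR engine:

from PIL import Image
import pandas as pd
import pytesseract

def ocr_predicted_regions(image_path: str, df_preds: pd.DataFrame) -> pd.DataFrame:
    """Run OCR on each predicted bounding box and attach the recognised text."""
    image = Image.open(image_path).convert("RGB")
    texts = []
    for _, row in df_preds.iterrows():
        # Crop the region of interest; a small margin often helps OCR quality.
        box = (max(row["xmin"] - 2, 0), max(row["ymin"] - 2, 0), row["xmax"] + 2, row["ymax"] + 2)
        crop = image.crop(box)
        texts.append(pytesseract.image_to_string(crop).strip())
    out = df_preds.copy()
    out["text"] = texts
    return out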
@enkiube (a year ago)
Also, what is the purpose of the `get_zip_dir_name()` function? Is it just to find where the zip file is extracted? The reason I'm asking is that I created the .txt files and the PNG files separately outside UBIAI and tried to run the code, but this function returns False. All my PNG files and txt files are at the root of the zip file, so when I unzip it inside /content/data/ I know where they are stored. Can I just bypass this function by returning the data folder? It would be great if I could connect with you separately.
@karndeepsingh (11 months ago)
It's just a function that takes the zip file exported from UBIAI and gets its name.
@Ani7FX (a year ago)
So is it a must to manually annotate hundreds of such documents?
@Sandeep-tn8mz (4 months ago)
Yes
@user-ky5um2nj5r (8 months ago)
Could you please suggest an alternative to UBIAI? I have used Label Studio and annotated my data.
@istudio-u6o (7 days ago)
How do I extract the text?
@sudhitpanchal4996 (a year ago)
Will this work on restaurant menus? I want relation extraction; would models like Donut work well here?
@karndeepsingh (a year ago)
Yes, it works for restaurant menus. You can also try Donut.
@prabhanshurajpoot7419 (a year ago)
It says page not found when I click the "Notebook" link.
@karndeepsingh (a year ago)
Check now
@tkarthick1025 (a year ago)
Can I get the dataset you used to train this model?
@karndeepsingh (a year ago)
The dataset I used is a real one and is confidential, so I can't share it. Maybe you can use the SROIE dataset for training the model.
@user-os8mv5iq7k (11 months ago)
Hi Karndeep, my requirement is to extract the data from the invoice in the form of key-value pairs. How can I achieve this?
@karndeepsingh (11 months ago)
You can use the same approach as discussed in the video: use the predicted BIEOU tags over the words at inference time and merge them to get the key-value pairs.
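
A rough sketch of that merging step, assuming the inference code yields parallel lists of words and BIO/BIEOU-style tags (the label names below are made up):

def merge_tags(words, tags):
    """Merge word-level tags such as B-INVOICE_NO / I-INVOICE_NO into field -> values pairs."""
    fields = {}
    current_label, current_words = None, []
    for word, tag in zip(words, tags):
        if tag == "O":
            if current_label:
                fields.setdefault(current_label, []).append(" ".join(current_words))
            current_label, current_words = None, []
            continue
        prefix, label = tag.split("-", 1)
        if prefix in ("B", "U", "S") or label != current_label:
            # A new entity starts: flush the previous one.
            if current_label:
                fields.setdefault(current_label, []).append(" ".join(current_words))
            current_label, current_words = label, [word]
        else:  # I / E / L tags continue the current entity
            current_words.append(word)
    if current_label:
        fields.setdefault(current_label, []).append(" ".join(current_words))
    return fields

# merge_tags(["Invoice", "No.", "1243553"], ["B-INVOICE_NO", "E-INVOICE_NO", "U-TOTAL"])
# -> {"INVOICE_NO": ["Invoice No."], "TOTAL": ["1243553"]}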
@vkrts9176 (a year ago)
Can the model (LiLT) handle handwritten text recognition?
@karndeepsingh (a year ago)
You may need to use a handwriting OCR model; then it may be able to understand it. The UBIAI tool uses AWS Textract, Google OCR and Microsoft OCR, so you can select any of these OCR tools and use it for handwritten text.
@brunobrandaoborges7484 (a year ago)
@@karndeepsingh Firstly, thanks for this amazing video. Can you please explain in general terms how I can correctly approach fine-tuning in order to detect handwritten text? I'm a bit lost on what you said about using an OCR model; how could I transfer the knowledge of a handwriting OCR model into LiLT? Thanks in advance :)
@jijojoseph8721 (a year ago)
If we need to extract the line items, what should we do?
@karndeepsingh (a year ago)
You can mark those line items as "items" and extract them.
@jijojoseph8721 (a year ago)
@@karndeepsingh But there won't be any order to the results returned, right? So even if we tag all the items and write custom logic, how would we identify which row each item belongs to?
@jeffrey5602 (a year ago)
@@jijojoseph8721 You have the bounding box information; that should be enough to match the right rows.
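
A sketch of that idea: group the predicted items into rows by the vertical centre of their boxes and then order each row left to right. The item structure and tolerance below are assumptions, not taken from the notebook.

def group_into_rows(items, y_tolerance=10):
    """items: list of dicts with a "box" = (x0, y0, x1, y1); returns a list of rows."""
    def y_center(item):
        _, y0, _, y1 = item["box"]
        return (y0 + y1) / 2

    rows = []
    for item in sorted(items, key=y_center):
        # Append to the last row if the item is vertically close to it, else start a new row.
        if rows and abs(y_center(item) - y_center(rows[-1][-1])) <= y_tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])
    return [sorted(row, key=lambda it: it["box"][0]) for row in rows]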
@harshavardhanachyuta2055 (a year ago)
Is the model performing better than LayoutLMv3?
@karndeepsingh (a year ago)
You can compare the performance; it is almost the same. But LayoutLMv3 can't be used for commercial purposes.
@harshavardhanachyuta2055 (a year ago)
@@karndeepsingh Yes, but ChatGPT says it can be used for commercial purposes. 😂
@karndeepsingh (a year ago)
Hahaha! That's not true! ChatGPT is trained on data only up to 2021, so it misses out on recent updates.
@khemkavinit (a year ago)
Can you suggest any alternative to UBIAI?
@karndeepsingh (a year ago)
Unfortunately, there is no freely available tool for data annotation and preparation for Document AI models.
@khemkavinit (a year ago)
@@karndeepsingh Thanks
@shloimielevitsky4477 (7 months ago)
@@khemkavinit What about Label Studio? Did you end up using UBIAI?
@khemkavinit (7 months ago)
@@shloimielevitsky4477 Nope, I didn't use it.
@louatiyossr945 (a year ago)
UBIAI is not free, right?!
@karndeepsingh (a year ago)
True
@SigmaMale-sn7rb (7 months ago)
Can I do it for the Arabic language?
@karndeepsingh (7 months ago)
Yes
@SigmaMale-sn7rb (7 months ago)
@@karndeepsingh Can we have a meeting regarding this?
@amnasherafal (a year ago)
Can we use it for information extraction on resumes?
@karndeepsingh (a year ago)
Yes
@amnasherafal (a year ago)
@@karndeepsingh Thank you for replying. Can you guide me a little? Since a resume is a PDF or DOC file, how can I annotate it? You have only shown it for images.
@chinjumurali1622 (a year ago)
Hi, is UBIAI paid?
@karndeepsingh (a year ago)
Yes
@KidNamedDenji (a year ago)
Bro, wtf, your model can't be used, and then you also don't explain the code. What BS.
@user-eb1cx3bz5u (a year ago)
This is the error I got when calling run_inference(image):

Token indices sequence length is longer than the specified maximum sequence length for this model (831 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
in ()
----> 1 run_inference(image)

9 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2208     # remove once script supports set_grad_enabled
   2209     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2211
   2212

IndexError: index out of range in self
@sumanthhabib8028 (5 months ago)
Same issue; I did not find any relevant solution 😕
@gunavaradhan4081 (4 months ago)
@@sumanthhabib8028 The model has a limitation on token length; the max limit is 512 tokens for this model. So try adding the max_length=512 and truncation=True params in the encoding, e.g. encoding = processor(image, max_length=512, return_tensors='pt', truncation=True).
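
A hedged sketch of that fix. A LayoutLMv3Processor (which runs its own OCR via Tesseract) stands in for whatever processor the notebook actually builds, and the image path is hypothetical; LiLT itself only consumes input_ids, attention_mask and bbox from the resulting encoding.

from PIL import Image
from transformers import LayoutLMv3Processor

image = Image.open("invoice.png").convert("RGB")  # hypothetical sample page

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")  # apply_ocr=True by default
encoding = processor(
    image,
    max_length=512,      # these encoders accept at most 512 tokens
    truncation=True,     # drop the overflow instead of hitting "index out of range in self"
    return_tensors="pt",
)
# For very long pages, overlapping 512-token windows are an alternative to plain truncation:
# processor(image, max_length=512, truncation=True, stride=128,
#           return_overflowing_tokens=True, return_tensors="pt")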