How to convert PDF DOCX to Structured TXT Formats for RAG! (UNSTRUCTURED Tutorial)

1littlecoder

78K subscribers

6K views

🔗 Links 🔗
Colab
drive.google.com/file/d/1U8VC...
Unstructured Github - github.com/Unstructured-IO/un...

Science

Published: 1 Aug 2024

Comments: 52

@IdPreferNot1 3 months ago

Going over libraries useful for AI dev is a great video series idea!

@1littlecoder 3 months ago

Thank you. If you have any interesting choices in mind feel free to let me know :)

@eugenmalatov5470 3 months ago

100%

@Raphy_Afk 3 months ago

This channel is completely underrated! Thanks for this video

@1littlecoder 3 months ago

Glad you think so! Thank you :)

@yusufersayyem7242 3 months ago

Honestly, we are lucky to know you..... Many thanks and appreciation to you, Mr. Abdul ❤

@1littlecoder 3 months ago

I'm glad you found it useful :)

@sharanbabu2001 3 months ago

Thanks for sharing!!

@Nick_With_A_Stick 3 months ago

I was looking for something like this to make a raw-text version of the Hugging Face documentation, since no LLMs are trained on it because it's only available in a very weird website format. This is awesome :)

@faisalIqbal_AI 3 months ago

Informative, thanks.

@jmirodg7094 3 months ago

Great tool, thanks! 🤩

@drramasubramaniam6724 3 months ago

Wow, thanks Majeed, that's something I desperately need. I was facing a lot of issues with text conversion in my RAG system. It would also be helpful if you could run a tutorial on sentence window retrieval + rerank for RAG.

@eugenmalatov5470 3 months ago

Great video!

@1littlecoder 3 months ago

Glad you enjoyed it

@adarmawan1977 3 months ago

I like this!

@maizizhamdo 3 months ago

Great video, boss. It supports multiple languages.

@captainoddessy 3 months ago

wow you are back after a week. You should take some breaks like this. AI is going crazy. You won't miss anything

@1littlecoder 3 months ago

I saw a lot of models being launched. In fact, I've been thinking of doing a weekly summary, like AI news, this time.

@captainoddessy 3 months ago

@1littlecoder Yeah, I miss your weekly AI news. You should start it again. Not all the AI stuff that happened that week, but the crazy groundbreaking inventions or papers, or whatever impresses you. That way it won't be 20-30 min long; you can make it 10-12 min. There's a YouTube channel, "the friday checkout"; you can follow his format.

@BiMoba 3 months ago

An idea for a better video structure would be to have a demo at the beginning. I had some idea of what the library does, but I had to watch until the end to understand what it can do.

@1littlecoder 3 months ago

Thanks for the tip. Do you mean like showing the final output?

@BiMoba 3 months ago

@1littlecoder Yes, something like input and output. It acts as a hook.

@1littlecoder 3 months ago

@BiMoba Thank you. I'll try to make sure!

@MrKellvalami 3 months ago

I always find out if I'm interested in a particular video by reading the transcript summary.

@1littlecoder 3 months ago

That's a clever way!

@Saranlisto 3 months ago

👏👏👏👏👏

@1littlecoder 3 months ago

Look who's here 😁

@nithishkrish3442 2 months ago

After extracting the text, how do I extract some information and write it to an Excel file?

@rounaksen1683 3 months ago

Are you also doing the Vectara advanced RAG hackathon?

@OP-yr6jb 2 months ago

Yes, I am looking at Unstructured. Have you used it? How good is it for tables?

@nirmaldesai4504 3 months ago

When this is implemented, does it run on-premise, or does it call the Unstructured API, which would use our ingested data?

@1littlecoder 3 months ago

Whatever we did in this video is on-prem, because we aren't calling any API.

@DeepakRavi93 3 months ago

PDFs will take longer to process than a text file. This creates a need to use the Unstructured commercial SaaS API. For other formats, the local library is okay to use.

@MrPierreSab 3 months ago

Do you know how this differs from pandoc?

@1littlecoder 3 months ago

AFAIK, pandoc helps you generate PDFs.

@eugenmalatov5470 3 months ago

@1littlecoder And what is the difference between the Unstructured HTML parser and the html2text library? And why are there pages in HTML documents in the first place?

@MrPierreSab 3 months ago

@1littlecoder I see, thanks. pdfminer is an alternative, as you mentioned.

@drmetroyt 1 month ago

Sir, how do I install and use this with Docker? There is no video on the internet.

@1littlecoder 1 month ago

I think LlamaIndex has its own Docker version.

@drmetroyt 1 month ago

@1littlecoder There is a Docker image of Unstructured IO, and they also give the option to install it as a Docker container, but there are no instructions on how to proceed. A video on it would be very helpful.

@Macorelppa 3 months ago

Stop shtposting please 🙏

@1littlecoder 3 months ago

Meaning?

@kalilinux8682 3 months ago

@1littlecoder He is implying this video is shit, which I disagree with, although the video could have been shorter.

@1littlecoder 3 months ago

@kalilinux8682 I actually asked the question to make sure it's not a bot.

@Macorelppa 3 months ago

@1littlecoder I am not a bot. LMAO.

@1littlecoder 3 months ago

@Macorelppa Glad to know. Dealing with a lot of bots, I'm happy to see humans.

@ilianos 3 months ago

Oh, you briefly mentioned it uses pdfminer under the hood? I hope not! From personal experience testing different Python libs, I found the results of pyPDF and PyMuPDF much better.