I was looking for something like this to get the raw text of the Hugging Face documentation, since no LLMs are trained on it because it’s only available in a very weird website format. This is awesome :)
Wow, thanks Majeed, that’s something I desperately need. I was facing a lot of issues with text conversion in my RAG system. It would also be helpful if you could do a tutorial on sentence window retrieval + reranking for RAG.
@1littlecoder Yeah, I miss your weekly AI news. You should start it again. Not all the AI stuff that happened that week, just the crazy groundbreaking inventions or papers, or whatever impresses you. That way it won't be 20-30 min long; you can make it 10-12 min. There's a YouTube channel, "the friday checkout", whose format you could follow.
An idea for a better video structure would be to have a demo at the beginning. I had some idea of what the library does, but I had to watch until the end to understand what it can do.
@1littlecoder And what's the difference between the unstructured HTML parser and the html2text library? And why are there pages in HTML documents in the first place?
@1littlecoder There is a Docker image of unstructured.io, and they also give the option to install it as a Docker container, but there are no instructions on how to proceed. A video on it would be very helpful.
Oh, you briefly mentioned it uses pdfminer under the hood? I hope not! From personal experience testing different Python libs, I found the results of pyPDF and PyMuPDF much better.