Probably one of the best applied AI channels. No fluff, only useful information. Your series on vector similarity is how I finally understood the entire topic. Thanks and cheers, mate!
Nice approach. It would be interesting to investigate how effective, fast, and efficient this is versus, say, the scene detection built into ffmpeg, or looking for large discontinuities in perceptual hashes of frames that indicate a rapid change of scene. Alternatively, inspecting where the codec has written key frames might also work. I may have a look into this in my copious (ahem!) free time!
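In case anyone wants to try the perceptual-hash route, here's a rough sketch (assuming the `opencv-python`, `Pillow`, and `imagehash` packages; the 12-bit threshold is just a guess to tune per video):

```python
# Sketch: flag frames where the perceptual hash jumps sharply,
# treating large Hamming distances between consecutive hashes as scene cuts.
import cv2
import imagehash
from PIL import Image

def phash_scene_cuts(video_path: str, threshold: int = 12) -> list[int]:
    """Return frame indices where the perceptual hash changes abruptly."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hash, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; convert to RGB before handing to PIL
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        h = imagehash.phash(Image.fromarray(rgb))
        # Subtracting two ImageHash objects gives the Hamming distance
        if prev_hash is not None and h - prev_hash > threshold:
            cuts.append(idx)
        prev_hash, idx = h, idx + 1
    cap.release()
    return cuts
```

For comparison, ffmpeg's built-in detector can be tried with something like `ffmpeg -i input.mp4 -vf "select='gt(scene,0.4)',showinfo" -f null -`, which logs frames whose scene-change score exceeds 0.4.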
Yeah, we're building an abstraction in `semantic-chunkers` that will provide info like this by default; it will come with some other useful features for video too.
*Video Summary: Semantic Chunking for Efficient Video Processing*

This video demonstrates how to efficiently process videos using "semantic chunking," a method commonly used with text but applicable to other modalities like video.

*Key Points:*
* *Why Chunk Video? (**0:25**)* Recent multi-modal models like GPT-4o and Gemini 1.5 can process videos, but feeding in every frame is inefficient and expensive.
* *Semantic Chunking (**0:00**):* Uses image embedding models to identify semantically similar frames and split the video into meaningful "chunks" wherever the content changes.
* *Implementation (**1:59**):* The video uses the `semantic-chunkers` library and explores two examples: a simple video of a bunny and a butterfly with clear scene changes, and a more complex video of a man driving and interacting with his car in various settings.
* *Model Selection (**3:28**):* A Vision Transformer (ViT) model is tried first; while effective for broad classification, it may not be ideal for fine-grained semantic understanding.
* *Alternative Model: CLIP (**5:58**):* CLIP, a model trained on semantic similarity, proves more sensitive to nuanced content changes and yields more granular chunks.
* *Benefits (**11:29**):* Semantic chunking focuses processing on relevant frames, saving time and money when feeding video data to AI models.

*In conclusion, the video advocates semantic chunking as a valuable technique for processing videos intelligently and efficiently, especially when working with expensive and time-consuming AI models.*

I summarized the transcript with Gemini 1.5 Pro.
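To make the chunking idea above concrete, here's a minimal sketch of the splitting logic. This is not the `semantic-chunkers` API; the CLIP checkpoint (`clip-ViT-B-32` via sentence-transformers) and the 0.85 threshold are my own assumptions:

```python
# Rough sketch of embedding-based video chunking: embed sampled frames
# with CLIP, then start a new chunk whenever the cosine similarity
# between consecutive frame embeddings drops below a threshold.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # CLIP checkpoint that encodes images

def chunk_frames(frames: list[Image.Image], threshold: float = 0.85) -> list[list[int]]:
    """Group frame indices into runs of semantically similar frames."""
    if not frames:
        return []
    emb = model.encode(frames, normalize_embeddings=True)  # shape (n_frames, dim)
    chunks, current = [], [0]
    for i in range(1, len(frames)):
        # Dot product of unit vectors = cosine similarity
        sim = float(np.dot(emb[i - 1], emb[i]))
        if sim < threshold:
            chunks.append(current)
            current = []
        current.append(i)
    chunks.append(current)
    return chunks
```

A lower threshold merges more frames into fewer, coarser chunks; a higher one splits more aggressively, which matches the ViT vs. CLIP granularity difference discussed in the video.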
Cool demo. Are there any JS/TS-based libraries that can achieve this kind of functionality? I've looked into transformers.js and CLIP, but they mostly offer image classification. Do you think "Xenova/clip-vit-base-patch32" could work for this?
Great video, this might be exactly what I'm looking for... Is the video semantic chunker able to detect when the content of a given slide in a presentation or course has changed (without needing to do text extraction)?
Thanks @jamesbringgs, super useful again! Any idea whether real-time AI video generation is feasible at this stage? I'm eager to animate someone talking via AI in real time. I hope we get there soon.
Really like the idea and how easy the library is to use!! 🔥 I'd love to know the algorithms behind the different types of chunkers implemented in the library.
Damn, this is awesome. I want libraries like these for TypeScript 😂 I'm so going to turn to the dark side and learn to build web services in both Python and TypeScript/Node hahaha
Thanks for the video! The trouble with the colors is caused by OpenCV and matplotlib: OpenCV uses BGR channel order while matplotlib expects RGB, so blue and red end up swapped.
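For anyone hitting this, the fix is a one-line conversion before plotting (the "frame.png" path here is just a placeholder):

```python
# Convert OpenCV's BGR output to RGB so matplotlib renders colors correctly.
import cv2
import matplotlib.pyplot as plt

frame_bgr = cv2.imread("frame.png")                      # OpenCV loads as BGR
frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)   # reorder channels for matplotlib
plt.imshow(frame_rgb)
plt.axis("off")
plt.show()
```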