Today we’re joined by Fatih Porikli, senior director of technology at Qualcomm AI Research. In our conversation, we covered several of the Qualcomm team’s 16 accepted main track and workshop papers at this year’s CVPR conference. The papers span a variety of generative AI and traditional computer vision topics, with an emphasis on increased training and inference efficiency for mobile and edge deployment. We explore efficient diffusion models for text-to-image generation, grounded reasoning in videos using language models, real-time on-device 360° image generation for video portrait relighting, a unique video-language model for situated interactions like fitness coaching, a visual reasoning model and benchmark for interpreting complex mathematical plots, and more! We also touched on several of the demos the team will be presenting at the conference, including multi-modal vision-language models (LLaVA) and parameter-efficient fine-tuning (LoRA) on mobile phones.
🎧 / 🎥 Listen or watch the full episode on our page: twimlai.com/go/688.
🔔 Subscribe to our channel for more great content just like this: ru-vid.com?sub_confi...
🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: twimlai.com/podcast/twimlai/
Follow us on Twitter: / twimlai
Follow us on LinkedIn: / twimlai
Join our Slack Community: twimlai.com/community/
Subscribe to our newsletter: twimlai.com/newsletter/
Want to get in touch? Send us a message: twimlai.com/contact/
📖 CHAPTERS
===============================
00:00 - Introduction
3:25 - Clockwork UNets for Efficient Diffusion Models
10:35 - Look, Remember and Reason: Grounded Reasoning in Videos with Language Models
18:53 - EdgeRelight360: Text-Conditioned 360-Degree HDR Image Generation for Real-Time On-Device Video Portrait Relighting
23:06 - What to Say and When to Say it: A Video-Language Model and Benchmark for Situated Interactions
33:18 - Math Search: A Benchmark for Multi-Hop, Step-by-Step Visual Reasoning over Plots
38:28 - Speculative Decoding for Multi-Modal Language Models
49:36 - Segmentation-Free Guidance for Text-to-Image Diffusion Models
56:20 - Improving Optical Flow Augmentation by Occlusion and Consistency Aware Interpolation
58:19 - SciFlow: Self-Cleaning Inversion Optical Flow with Regression Focal Loss
1:00:28 - Low-Latency Neural Stereo Streaming
1:04:41 - Demos
1:08:08 - Workshops
🔗 LINKS & RESOURCES
===============================
Clockwork Diffusion: Efficient Generation With Model-Step Distillation - arxiv.org/abs/2312.08128v2
Low-Latency Neural Stereo Streaming - arxiv.org/abs/2403.17879
ELVM: Efficient Large Vision Models (CVPR site) - sites.google.com/view/elvm/home
OmniCV (Omnidirectional Computer Vision) - sites.google.com/view/omnicv2...
On Speculative Decoding for Multimodal Large Language Models - arxiv.org/abs/2404.08856
SciFlow: Empowering Lightweight Optical Flow Models with Self-Cleaning Iterations - arxiv.org/abs/2404.08135
EdgeRelight360: Text-Conditioned 360-Degree HDR Image Generation for Real-Time On-Device Video Portrait Relighting - arxiv.org/abs/2404.09918
MMFM2: Look, Remember and Reason: Grounded Reasoning in Videos with Language Models - arxiv.org/abs/2306.17778
📸 Camera: amzn.to/3TQ3zsg
🎙️Microphone: amzn.to/3t5zXeV
🚦Lights: amzn.to/3TQlX49
🎛️ Audio Interface: amzn.to/3TVFAIq
🎚️ Stream Deck: amzn.to/3zzm7F5
7 Aug 2024