Pavlo Molchanov (@PavloMolchanov)

2025-04-04 | ❤️ 661 | 🔁 133


🔥 Vision encoder upgrade: RADIOv2.5 = DFN-CLIP + DINOv2 + SAM + SigLIP + ToMe + multi-res training + teacher loss balancing + smart augmentations. CVPR 2025.

Current foundation models have too many limitations: i) tailored to a single task, ii) inflexible on input resolution (e.g. CLIP, DINO), iii) work only at low resolution, iv) don’t generalize across tasks.

💥 RADIO is one encoder, one pass. Better features than DFN-CLIP, DINO, SAM, and SigLIP - all at once. Like a Swiss army knife for vision tasks.
📢 Appearing at CVPR 2025
🏢 C-RADIO: Commercial-friendly variant (top CVPR 2024 request)
📈 Cleaner PCA features with registers + augmentations - CLIP-like models are noisy, we fix that
🧪 Top scores via linear probing: ADE20k (linear) 54.56, ImageNet kNN 85.81, NYUd, and Pascal
🖼️ Works with any resolution or aspect ratio
🤖 Beats SigLIP & CLIP on VLM benchmarks
⚡ Efficient token feature projection for VLMs via ToMe
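For readers unfamiliar with the "ImageNet kNN" number above: a kNN score is obtained by freezing the encoder, embedding a labeled reference set, and classifying each test image by its nearest reference embeddings. The sketch below is a minimal pure-Python illustration of that idea using cosine similarity and a plain majority vote. The function names (`cosine`, `knn_predict`, `knn_accuracy`) and the toy 2-D features are hypothetical, and the actual RADIO evaluation protocol (value of k, vote weighting, feature pooling) may well differ.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_predict(query, bank_feats, bank_labels, k=3):
    """Label a query feature by majority vote over its k most similar bank features.

    Illustrative assumption: unweighted voting; real benchmarks often weight
    votes by similarity.
    """
    scored = sorted(
        zip((cosine(query, f) for f in bank_feats), bank_labels),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(queries, query_labels, bank_feats, bank_labels, k=3):
    """Fraction of queries whose kNN prediction matches the true label."""
    correct = sum(
        knn_predict(q, bank_feats, bank_labels, k) == y
        for q, y in zip(queries, query_labels)
    )
    return correct / len(queries)
```

In practice the feature vectors would come from the frozen encoder (for example a pooled summary embedding per image) rather than hand-written lists, and the similarity search would be vectorized.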

📝 Paper (v2.5): https://arxiv.org/abs/2312.06709
💻 Code: https://github.com/NVlabs/RADIO (let’s hit 1k stars)
🤖 Models: https://huggingface.co/collections/nvidia/radio

🧵1/n

🔗 Original link

Media

photo



Tags

domain-robotics domain-llm domain-vlm domain-dev-tools domain-visionos