From 9dc624807a5e5a98576b518216389ab819230f1c Mon Sep 17 00:00:00 2001
From: A7medM0sta
Date: Fri, 6 Dec 2024 15:35:16 +0200
Subject: [PATCH] add VisionZip

---
 README.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/README.md b/README.md
index 5de01b0..2cd0443 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,18 @@
 **Vision-Language Models (VLMs)** feature a multimodal architecture that processes image and text data simultaneously. They can perform tasks such as **Visual Question Answering (VQA)**, **image captioning**, and **Text-To-Image search**. VLMs utilize techniques like multimodal fusion with cross-attention, masked-language modeling, and image-text matching to relate visual semantics to textual representations. This repository contains information on well-known Vision Language Models (VLMs), including details about their architectures, training procedures, and the datasets used for training. **Click to expand for further details on each architecture**
 
 - 📙 Visit my other repo to try Vision Language Models on ComfyUI
+
+## Summary
+
+### Hugging Face Demo Space
+
+- [VisionZip](http://202.104.135.156:7860/)
+
+### Papers and Models
+
+| Model Name | Code | Paper | Venue |
+|------------|------|-------|-------|
+| VisionZip | [github-VisionZip](https://github.com/dvlab-research/VisionZip) | [paper](https://arxiv.org/abs/2412.04467) | arXiv 2024 |
 
 ## Contents
 - [Architectures](#architectures)
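
As a concrete illustration of the Visual Question Answering task mentioned in the README introduction above, the sketch below shows one way to query an off-the-shelf VLM through the Hugging Face `transformers` library. The BLIP checkpoint (`Salesforce/blip-vqa-base`) and the COCO example image are illustrative assumptions and are not models or assets covered by this patch.

```python
# Minimal VQA sketch using a public BLIP checkpoint via Hugging Face transformers.
# The checkpoint and example image are illustrative choices, not part of this patch.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Standard COCO example image (two cats on a couch), widely used in transformers docs.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

question = "How many cats are in the picture?"
inputs = processor(image, question, return_tensors="pt")

# The processor fuses the image and text inputs; the model generates a short textual answer.
output_ids = model.generate(**inputs)
print(processor.decode(output_ids[0], skip_special_tokens=True))  # e.g. "2"
```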