The InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Components (an Open-Source Alternative to GPT-4V)

[📖 English Version] [🆕 Blog] [📜 InternVL 1.0 Paper] [📜 InternVL 1.5 Technical Report] [🗨️ Chat Demo] [🤗 HuggingFace Demo]

[🚀 Quick Start] [🌐 Community-hosted API] [📖 Explanations in Chinese]


Latest News 🚀🚀🚀

  • 2024/06/04: InternVL 1.5 achieves SOTA performance in the Image MLLM category of the Video-MME benchmark, demonstrating its generalization ability in multi-image scenarios: it outperforms many dedicated video MLLMs and approaches the open-source SOTA video model LLaVA-Next-Video.
  • 2024/05/29: 🚀 We open-sourced the Mini-InternVL-Chat series, which currently includes two models: Mini-InternVL-Chat-2B-V1-5 and Mini-InternVL-Chat-4B-V1-5. These small models deliver impressive performance at a tiny size: the 2B model reaches 80% of the performance with only 8% of the model size, and the 4B model reaches 90% of the performance with 16% of the model size. See our blog for more details.
  • 2024/05/28: Thanks to the lmdeploy team for providing AWQ quantization support. The 4-bit model is released at OpenGVLab/InternVL-Chat-V1-5-AWQ.
  • 2024/05/13: 🔥 InternVL can now be used as the text encoder for diffusion models, supporting multilingual generation in over 110 languages worldwide. See MuLan for details.
  • 2024/04/28: We released the INT8 quantized version of InternVL-Chat-V1-5; see the HF link for details.
  • 2024/04/28: We achieved SOTA performance (75.74) on the Infographics VQA benchmark; see here for details.
  • 2024/04/18: InternVL-Chat-V1-5 has been released at the HF link, approaching the performance of GPT-4V and Gemini Pro on various benchmarks such as MMMU, DocVQA, ChartQA, and MathVista.
  • 2024/02/27: InternVL has been accepted by CVPR 2024! 🎉
  • 2024/02/24: The InternVL-Chat models have been integrated into VLMEvalKit.
  • 2024/02/21: InternVL-Chat-V1-2-Plus achieved SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for details.
  • 2024/02/12: InternVL-Chat-V1-2 has been released. It achieves 51.6 on the MMMU validation set and 82.3 on the MMBench test set. For more information, please refer to the blog and SFT data, or try our demo. The model has been released on HuggingFace, and the training/evaluation data and scripts are all open-sourced.
  • 2024/02/04: InternVL-Chat-V1-1 achieves 44.67 on MMVP, higher than GPT-4V!
  • 2024/01/27: We released the 448-resolution model, which achieves 76.6 on the MMBench validation set; see here for details.
  • 2024/01/24: InternVL-Chat-V1-1 has been released. It supports Chinese and has strong OCR capability; see here for details or try our demo.
  • 2024/01/16: We released customized mmcv/mmsegmentation/mmdetection code integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.

Documentation

  • Installation

    • How to set up the runtime environment? [link]
  • Training or Fine-tuning

    • How to reproduce the SFT stage of InternVL-Chat-V1-2? [link]
    • How to fine-tune InternVL-Chat-V1-2 on a custom dataset? [link]
    • How to fine-tune the Mini-InternVL-Chat series on a custom dataset? TODO
  • Benchmark Evaluation

    Due to minor implementation differences between this codebase and VLMEvalKit, performance metrics may differ slightly when testing the same model.

    • How to evaluate InternVL-Chat-V1-5? [link]
    • How to evaluate InternVL-Chat-V1-5 using VLMEvalKit? (recommended) [link]
    • How to evaluate Mini-InternVL-Chat-2B-V1-5 using VLMEvalKit? (recommended) [link]
    • How to evaluate Mini-InternVL-Chat-4B-V1-5 using VLMEvalKit? (recommended) [link]
  • Deployment

Comparison with SOTA Multimodal LLMs

[Benchmark comparison charts]

What is InternVL?

InternVL scales up the ViT to 6B parameters and aligns it with large language models.

Models

Multimodal Large Language Models

| Model | Date | Download | Note |
| --- | --- | --- | --- |
| Mini-InternVL-Chat-4B-V1-5 | 2024.05.28 | 🤗 HF link | 🚀🚀 16% of the model size, 90% of the performance |
| Mini-InternVL-Chat-2B-V1-5 | 2024.05.19 | 🤗 HF link | 🚀🚀 8% of the model size, 80% of the performance |
| InternVL-Chat-V1-5-AWQ | 2024.05.28 | 🤗 HF link | INT4 version of InternVL-Chat-V1-5 |
| InternVL-Chat-V1-5-Int8 | 2024.04.28 | 🤗 HF link | INT8 version of InternVL-Chat-V1-5 |
| InternVL-Chat-V1-5 | 2024.04.18 | 🤗 HF link | Supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on various benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
| InternVL-Chat-V1-2-Plus | 2024.02.21 | 🤗 HF link | More SFT data and stronger |
| InternVL-Chat-V1-2 | 2024.02.11 | 🤗 HF link | Scales up the LLM to 34B |
| InternVL-Chat-V1-1 | 2024.01.24 | 🤗 HF link | Supports Chinese and strong OCR capability |
| InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
| InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue model |
| InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue model |

Vision-Language Foundation Models

| Model | Date | Download | Note |
| --- | --- | --- | --- |
| InternViT-300M-448px | 2024.05.25 | 🤗 HF link | Distilled 300M small vision foundation model (🔥 new) |
| InternViT-6B-448px-V1-5 | 2024.04.20 | 🤗 HF link | Supports dynamic resolution, very strong OCR capability (🔥 new) |
| InternViT-6B-448px-V1-2 | 2024.02.11 | 🤗 HF link | 448 resolution |
| InternViT-6B-448px-V1-0 | 2024.01.30 | 🤗 HF link | 448 resolution |
| InternViT-6B-224px | 2023.12.22 | 🤗 HF link | Vision foundation model |
| InternVL-14B-224px | 2023.12.22 | 🤗 HF link | Vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval |

What can InternVL do?

Visual Perception (click to expand)
  • Linear-probe image classification [see details]

    ViT-22B uses the private JFT-3B dataset.

    | method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
    | DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
    | EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
    | MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
    | ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
    | InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
  • Semantic segmentation [see details]

    | method | decoder | #param (train/total) | crop size | mIoU |
    | --- | --- | --- | --- | --- |
    | OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
    | ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
    | InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
    | ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
    | InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
    | ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
    | InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
  • Zero-shot image classification [see details] (a minimal usage sketch appears at the end of this section)

    | method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
    | --- | --- | --- | --- | --- | --- | --- |
    | OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
    | EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
    | ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
    | InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
  • Multilingual zero-shot image classification [see details]

    EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

    | method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
    | --- | --- | --- | --- | --- | --- |
    | Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
    | WuKong-ViT-L-G | - | 57.5 | - | - | - |
    | CN-CLIP-ViT-H | - | 59.6 | - | - | - |
    | AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
    | EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
    | OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
    | InternVL-C (ours) | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
  • Zero-shot video classification [see details]

    | method | #frame | K400 | K600 | K700 |
    | --- | --- | --- | --- | --- |
    | OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
    | EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
    | InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
    | ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
    | InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
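
As a concrete illustration of the zero-shot classification setting above, here is a minimal sketch built on the same HuggingFace interface used in the quick-start section below (InternVL-14B-224px with the 'summarize:' text prefix). The class names here are illustrative placeholders; a real ImageNet-style evaluation would iterate over the full label set and prompt templates.

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

# Same checkpoint and preprocessing as in the quick-start example below.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')
tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # set pad_token_id to 0

# Illustrative class names; replace with your own label set.
class_names = ['red panda', 'giant panda', 'cat', 'dog']
texts = ['summarize:a photo of a ' + name for name in class_names]

image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# Contrastive (InternVL-C) image-text similarity; the argmax over classes is the prediction.
logits_per_image, _ = model(image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
print(class_names[probs.argmax(dim=-1).item()])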
Cross-Modal Retrieval (click to expand)
  • English zero-shot image-text retrieval [see details]

    I2T: image-to-text retrieval; T2I: text-to-image retrieval.

    | model | Flickr30K I2T R@1 | R@5 | R@10 | Flickr30K T2I R@1 | R@5 | R@10 | COCO I2T R@1 | R@5 | R@10 | COCO T2I R@1 | R@5 | R@10 | avg |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | OpenCLIP-G | 92.9 | 99.3 | 99.8 | 79.5 | 95.0 | 97.1 | 67.3 | 86.9 | 92.6 | 51.4 | 74.9 | 83.0 | 85.0 |
    | EVA-02-CLIP-E+ | 93.9 | 99.4 | 99.8 | 78.8 | 94.2 | 96.8 | 68.8 | 87.8 | 92.8 | 51.1 | 75.0 | 82.7 | 85.1 |
    | EVA-CLIP-8B | 95.6 | 99.6 | 99.9 | 80.8 | 95.5 | 97.6 | 70.3 | 89.3 | 93.9 | 53.0 | 76.0 | 83.4 | 86.2 |
    | InternVL-C (ours) | 94.7 | 99.6 | 99.9 | 81.7 | 96.0 | 98.2 | 70.6 | 89.0 | 93.5 | 54.1 | 77.3 | 84.6 | 86.6 |
    | InternVL-G (ours) | 95.7 | 99.7 | 99.9 | 85.0 | 97.0 | 98.6 | 74.9 | 91.3 | 95.2 | 58.6 | 81.3 | 88.0 | 88.8 |
  • Chinese zero-shot image-text retrieval [see details]

    I2T: image-to-text retrieval; T2I: text-to-image retrieval.

    | model | Flickr30K-CN I2T R@1 | R@5 | R@10 | Flickr30K-CN T2I R@1 | R@5 | R@10 | COCO-CN I2T R@1 | R@5 | R@10 | COCO-CN T2I R@1 | R@5 | R@10 | avg |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | CN-CLIP-ViT-H | 81.6 | 97.5 | 98.8 | 71.2 | 91.4 | 95.5 | 63.0 | 86.6 | 92.9 | 69.2 | 89.9 | 96.1 | 86.1 |
    | OpenCLIP-XLM-R-H | 86.1 | 97.5 | 99.2 | 71.0 | 90.5 | 94.9 | 70.0 | 91.5 | 97.0 | 66.1 | 90.8 | 96.0 | 87.6 |
    | InternVL-C (ours) | 90.3 | 98.8 | 99.7 | 75.1 | 92.9 | 96.4 | 68.8 | 92.0 | 96.7 | 68.9 | 91.9 | 96.5 | 89.0 |
    | InternVL-G (ours) | 92.9 | 99.4 | 99.8 | 77.7 | 94.8 | 97.3 | 71.4 | 93.9 | 97.7 | 73.8 | 94.4 | 98.1 | 90.9 |
  • Multilingual zero-shot image-text retrieval [see details]

    | method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
    | OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
    | InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
    | InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
Multimodal Dialogue (see "Comparison with SOTA Multimodal LLMs")

Quick Start with HuggingFace

Using InternViT-6B (click to expand)
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)
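# The returned `outputs` holds the ViT features. With a standard HuggingFace vision-model
# output this would be `outputs.last_hidden_state` (patch-token features) and
# `outputs.pooler_output` (pooled image embedding); check the remote-code implementation
# for the exact fields.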
Using InternVL-C(ontrastive) and InternVL-G(enerative) (click to expand)
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer


model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # set pad_token_id to 0

images = [
    Image.open('./examples/image1.jpg').convert('RGB'),
    Image.open('./examples/image2.jpg').convert('RGB'),
    Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
    prefix + 'a photo of a red panda',  # English
    prefix + '一张熊猫的照片',  # Chinese
    prefix + '二匹の猫の写真'  # Japanese
]

pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# InternVL-C
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
#         [2.2949e-02, 9.7656e-01, 5.9903e-06],
#         [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# InternVL-G
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
#         [8.6060e-03, 9.9219e-01, 2.8759e-06],
#         [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
    pixel_values=pixel_values,
    input_ids=tokenized.input_ids.cuda(),
    attention_mask=tokenized.attention_mask.cuda(),
    num_beams=5,
    min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
Using InternVL-Chat (click to expand)
from transformers import AutoTokenizer, AutoModel
import torch
import torchvision.transforms as T
from PIL import Image

from torchvision.transforms.functional import InterpolationMode


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


path = "OpenGVLab/InternVL-Chat-V1-5"
# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
# Otherwise, you need to set device_map='auto' to use multiple GPUs for inference.
# model = AutoModel.from_pretrained(
#     path,
#     torch_dtype=torch.bfloat16,
#     low_cpu_mem_usage=True,
#     trust_remote_code=True,
#     device_map='auto').eval()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)

# single-round single-image conversation
question = "请详细描述图片" # Please describe the picture in detail
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(question, response)

# multi-round single-image conversation
question = "请详细描述图片" # Please describe the picture in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)

question = "请根据图片写一首诗" # Please write a poem according to the picture
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)

# multi-round multi-image conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = "详细描述这两张图片" # Describe the two pictures in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)

question = "这两张图片的相同点和区别分别是什么" # What are the similarities and differences between these two pictures
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)

# batch inference (single image per sample)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
image_counts = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ["Describe the image in detail."] * len(image_counts)
responses = model.batch_chat(tokenizer, pixel_values,
                             image_counts=image_counts,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(question)
    print(response)

Inference Acceleration with LMDeploy

To optimize inference for the InternVL-Chat models, we recommend using LMDeploy.

In the following subsections, we use the InternVL-Chat-V1-5 model as an example to demonstrate how to use LMDeploy.

First, set up the inference environment as follows:

conda create -n internvl python=3.10 -y
conda activate internvl

pip install timm torchvision==0.17.2
pip install lmdeploy

The LMDeploy PyPI package depends on CUDA 12.x by default. For CUDA 11.x environments, please refer to the installation guide.

Offline Inference Pipeline

from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')
image = load_image('examples/image2.jpg')
response = pipe(('describe this image', image))
print(response)

For more information on using the VLM pipeline, including multi-image inference and multi-turn conversation, please check the guide; a minimal batch-inference sketch is shown below.
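
For reference, the following sketch extends the offline example above to batch inference over several images by passing a list of (prompt, image) pairs to the pipeline, as described in the LMDeploy VLM guide; the two example images are the ones shipped in this repository.

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')

# Batch inference: one (prompt, image) pair per request.
images = [load_image('examples/image1.jpg'), load_image('examples/image2.jpg')]
prompts = [('describe this image', img) for img in images]
responses = pipe(prompts)
for response in responses:
    print(response)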

Online Inference Service

LMDeploy can package a VLM model into an OpenAI-compatible service with a single command, enabling seamless integration with the OpenAI API.

The service can be launched with the following command:

lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-5

The arguments of api_server can be viewed with the command lmdeploy serve api_server -h, for example: --tp sets the tensor parallelism degree, --session-len specifies the maximum length of the context window, --cache-max-entry-count adjusts the fraction of GPU memory used for the k/v cache, and so on.

For more details, including launching the service with Docker, RESTful API information, and OpenAI integration, please check the guide; a minimal client sketch is shown below.
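
As a quick sketch of the OpenAI-style integration, the running api_server can be queried with the standard openai Python client. The base URL below assumes LMDeploy's default port 23333, and the image URL is a placeholder to replace with your own.

from openai import OpenAI

# Assumes the api_server launched above is listening on LMDeploy's default port 23333.
client = OpenAI(api_key='not-needed', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id  # name of the served model

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            # Placeholder image URL; replace with a reachable image.
            {'type': 'image_url', 'image_url': {'url': 'https://example.com/image.jpg'}},
        ],
    }])
print(response.choices[0].message.content)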

License

This project is released under the MIT License. Parts of the code and models in this project come from other sources and are subject to their respective licenses.

Citation

If you find this project useful in your research, please consider citing:

@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}

@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}

Acknowledgements

InternVL is built with reference to the code of the following projects: OpenAI CLIP, Open CLIP, CLIP Benchmark, EVA, InternImage, ViT-Adapter, MMSegmentation, Transformers, DINOv2, BLIP-2, Qwen-VL, and LLaVA-1.5. Thanks for their great work.


If you would like to join our project group, please scan the QR code below to add our assistant.

[QR code]