[📖 English Version] [🆕 Blog] [📜 InternVL 1.0 Paper] [📜 InternVL 1.5 Technical Report] [🗨️ Chat Demo] [🤗 HuggingFace Demo]
[🚀 Quick Start] [🌐 Community-hosted API] [📖 Explanation in Chinese]
- 2024/06/04: InternVL 1.5 achieves SOTA performance in the Image MLLM category of the Video-MME benchmark, demonstrating its generalization ability in multi-image scenarios; it surpasses many dedicated Video MLLMs and approaches the open-source SOTA video model LLaVA-Next-Video.
- 2024/05/29: 🚀 We open-sourced the Mini-InternVL-Chat series, which currently includes two models: Mini-InternVL-Chat-2B-V1-5 and Mini-InternVL-Chat-4B-V1-5. These small models deliver impressive performance at a tiny size: the 2B model reaches 80% of the performance with only 8% of the model size, and the 4B model reaches 90% of the performance with 16% of the model size. See our blog for more details.
- 2024/05/28: Thanks to the lmdeploy team for providing AWQ quantization support. The 4-bit model is released at OpenGVLab/InternVL-Chat-V1-5-AWQ.
- 2024/05/13: 🔥 InternVL can now be used as the text encoder for diffusion models, supporting multilingual generation in more than 110 languages worldwide. See MuLan for details.
- 2024/04/28: We released the INT8 quantized version of InternVL-Chat-V1-5; see the HF link for details.
- 2024/04/28: We achieve SOTA performance (75.74) on the Infographics VQA benchmark; see here for details.
- 2024/04/18: InternVL-Chat-V1-5 has been released at the HF link, approaching the performance of GPT-4V and Gemini Pro on various benchmarks such as MMMU, DocVQA, ChartQA, and MathVista.
- 2024/02/27: InternVL is accepted by CVPR 2024! 🎉
- 2024/02/24: The InternVL-Chat models have been integrated into VLMEvalKit.
- 2024/02/21: InternVL-Chat-V1-2-Plus achieves SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for details.
- 2024/02/12: InternVL-Chat-V1-2 has been released. It achieves 51.6 on the MMMU validation set and 82.3 on the MMBench test set. For more information, please refer to the blog and the SFT data, or try our demo. The model is released on HuggingFace, and the training and evaluation data and scripts are all open-sourced.
- 2024/02/04: InternVL-Chat-V1-1 achieves 44.67 on MMVP, higher than GPT-4V!
- 2024/01/27: We released the 448-resolution model, which achieves 76.6 on the MMBench validation set; see here for details.
- 2024/01/24: InternVL-Chat-V1-1 has been released. It supports Chinese and has strong OCR capability; see here for details or try our demo.
- 2024/01/16: We released our customized mmcv/mmsegmentation/mmdetection code, integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.
- Installation
  - How do I set up the runtime environment? [link]
- Training or Fine-tuning
- Benchmark Evaluation
  Due to minor implementation differences between this codebase and VLMEvalKit, performance metrics may differ slightly when evaluating the same model.
- Deployment
InternVL scales up the ViT to 6B parameters and aligns it with large language models.
Multimodal Large Language Models
Model | Date | Download | Note |
---|---|---|---|
Mini-InternVL-Chat-4B-V1-5 | 2024.05.28 | 🤗 HF link | 🚀🚀 16% of the model size, 90% of the performance |
Mini-InternVL-Chat-2B-V1-5 | 2024.05.19 | 🤗 HF link | 🚀🚀 8% of the model size, 80% of the performance |
InternVL-Chat-V1-5-AWQ | 2024.05.28 | 🤗 HF link | The INT4 version of InternVL-Chat-V1-5 |
InternVL-Chat-V1-5-Int8 | 2024.04.28 | 🤗 HF link | The INT8 version of InternVL-Chat-V1-5 |
InternVL-Chat-V1-5 | 2024.04.18 | 🤗 HF link | Supports 4K images; super strong OCR; performance close to GPT-4V and Gemini Pro on various benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
InternVL-Chat-V1-2-Plus | 2024.02.21 | 🤗 HF link | More SFT data and stronger |
InternVL-Chat-V1-2 | 2024.02.11 | 🤗 HF link | Scaling up the LLM to 34B |
InternVL-Chat-V1-1 | 2024.01.24 | 🤗 HF link | Supports Chinese and has strong OCR capability |
InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue model |
InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue model |
Vision-Language Foundation Models
Model | Date | Download | Note |
---|---|---|---|
InternViT-300M-448px | 2024.05.25 | 🤗 HF link | Distilled small vision foundation model with 300M parameters (🔥 new) |
InternViT-6B-448px-V1-5 | 2024.04.20 | 🤗 HF link | Supports dynamic resolution and has very strong OCR capability (🔥 new) |
InternViT-6B-448px-V1-2 | 2024.02.11 | 🤗 HF link | 448 resolution |
InternViT-6B-448px-V1-0 | 2024.01.30 | 🤗 HF link | 448 resolution |
InternViT-6B-224px | 2023.12.22 | 🤗 HF link | Vision foundation model |
InternVL-14B-224px | 2023.12.22 | 🤗 HF link | Vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval |
Visual Perception (click to expand)
- Linear-Probe Image Classification [see details]
ViT-22B uses the private JFT-3B dataset.
method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
---|---|---|---|---|---|---|---|
OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
- Semantic Segmentation [see details]
method | decoder | #param (train/total) | crop size | mIoU |
---|---|---|---|---|
OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
- Zero-Shot Image Classification [see details]
method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
---|---|---|---|---|---|---|
OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
- Multilingual Zero-Shot Image Classification [see details]
EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian
method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
---|---|---|---|---|---|
Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
WuKong-ViT-L-G | - | 57.5 | - | - | - |
CN-CLIP-ViT-H | - | 59.6 | - | - | - |
AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
InternVL-C (ours) | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
- Zero-Shot Video Classification [see details]
method | #frame | K400 | K600 | K700 |
---|---|---|---|---|
OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
Cross-Modal Retrieval (click to expand)
- English Zero-Shot Image-Text Retrieval [see details]
model | Flickr30K image-to-text (R@1 / R@5 / R@10) | Flickr30K text-to-image (R@1 / R@5 / R@10) | COCO image-to-text (R@1 / R@5 / R@10) | COCO text-to-image (R@1 / R@5 / R@10) | avg |
---|---|---|---|---|---|
OpenCLIP-G | 92.9 / 99.3 / 99.8 | 79.5 / 95.0 / 97.1 | 67.3 / 86.9 / 92.6 | 51.4 / 74.9 / 83.0 | 85.0 |
EVA-02-CLIP-E+ | 93.9 / 99.4 / 99.8 | 78.8 / 94.2 / 96.8 | 68.8 / 87.8 / 92.8 | 51.1 / 75.0 / 82.7 | 85.1 |
EVA-CLIP-8B | 95.6 / 99.6 / 99.9 | 80.8 / 95.5 / 97.6 | 70.3 / 89.3 / 93.9 | 53.0 / 76.0 / 83.4 | 86.2 |
InternVL-C (ours) | 94.7 / 99.6 / 99.9 | 81.7 / 96.0 / 98.2 | 70.6 / 89.0 / 93.5 | 54.1 / 77.3 / 84.6 | 86.6 |
InternVL-G (ours) | 95.7 / 99.7 / 99.9 | 85.0 / 97.0 / 98.6 | 74.9 / 91.3 / 95.2 | 58.6 / 81.3 / 88.0 | 88.8 |
- Chinese Zero-Shot Image-Text Retrieval [see details]
model | Flickr30K-CN image-to-text (R@1 / R@5 / R@10) | Flickr30K-CN text-to-image (R@1 / R@5 / R@10) | COCO-CN image-to-text (R@1 / R@5 / R@10) | COCO-CN text-to-image (R@1 / R@5 / R@10) | avg |
---|---|---|---|---|---|
CN-CLIP-ViT-H | 81.6 / 97.5 / 98.8 | 71.2 / 91.4 / 95.5 | 63.0 / 86.6 / 92.9 | 69.2 / 89.9 / 96.1 | 86.1 |
OpenCLIP-XLM-R-H | 86.1 / 97.5 / 99.2 | 71.0 / 90.5 / 94.9 | 70.0 / 91.5 / 97.0 | 66.1 / 90.8 / 96.0 | 87.6 |
InternVL-C (ours) | 90.3 / 98.8 / 99.7 | 75.1 / 92.9 / 96.4 | 68.8 / 92.0 / 96.7 | 68.9 / 91.9 / 96.5 | 89.0 |
InternVL-G (ours) | 92.9 / 99.4 / 99.8 | 77.7 / 94.8 / 97.3 | 71.4 / 93.9 / 97.7 | 73.8 / 94.4 / 98.1 | 90.9 |
- Multilingual Zero-Shot Image-Text Retrieval [see details]
method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
---|---|---|---|---|---|---|---|---|---|
AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
Multimodal Dialogue (see "Comparison with SOTA Multimodal LLMs")
Using InternViT-6B (click to expand)
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
model = AutoModel.from_pretrained(
'OpenGVLab/InternViT-6B-224px',
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).cuda().eval()
image = Image.open('./examples/image1.jpg').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
outputs = model(pixel_values)
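The call above only runs the forward pass. As a small, hedged follow-up that is not part of the official snippet, the returned features can be inspected as below, assuming the remote code returns a standard Hugging Face BaseModelOutputWithPooling:

# Hypothetical inspection of the outputs; the attribute names assume the usual
# Hugging Face convention: `last_hidden_state` holds per-token (CLS + patch)
# features and `pooler_output` a single global image embedding.
print(outputs.last_hidden_state.shape)
print(outputs.pooler_output.shape)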
Using InternVL-C(ontrastive) and InternVL-G(enerative) (click to expand)
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer
model = AutoModel.from_pretrained(
'OpenGVLab/InternVL-14B-224px',
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')
tokenizer = AutoTokenizer.from_pretrained(
'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0 # set pad_token_id to 0
images = [
Image.open('./examples/image1.jpg').convert('RGB'),
Image.open('./examples/image2.jpg').convert('RGB'),
Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
prefix + 'a photo of a red panda', # English
prefix + '一张熊猫的照片', # Chinese
prefix + '二匹の猫の写真' # Japanese
]
pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
truncation=True, padding='max_length').input_ids.cuda()
# InternVL-C
logits_per_image, logits_per_text = model(
image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
# [2.2949e-02, 9.7656e-01, 5.9903e-06],
# [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
# dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)
# InternVL-G
logits_per_image, logits_per_text = model(
image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
# [8.6060e-03, 9.9219e-01, 2.8759e-06],
# [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
# dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)
# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
pixel_values=pixel_values,
input_ids=tokenized.input_ids.cuda(),
attention_mask=tokenized.attention_mask.cuda(),
num_beams=5,
min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
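The contrastive mode shown above can also be used directly for zero-shot classification. The following is a minimal sketch that reuses the model, tokenizer, and image_processor already defined; the candidate labels are illustrative, and add_eos_token is switched back on because it was disabled for generation above:

# Zero-shot classification sketch (illustrative labels, not from the official example)
tokenizer.add_eos_token = True  # re-enable EOS, as used for the contrastive mode above
labels = ['a photo of a red panda', 'a photo of a cat', 'a photo of a dog']
label_ids = tokenizer([prefix + label for label in labels], return_tensors='pt',
                      max_length=80, truncation=True, padding='max_length').input_ids.cuda()
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
logits_per_image, _ = model(image=pixel_values, text=label_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
print(labels[probs.argmax(dim=-1).item()])  # expected: 'a photo of a red panda'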
Using InternVL-Chat (click to expand)
from transformers import AutoTokenizer, AutoModel
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
transform = T.Compose([
T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
T.ToTensor(),
T.Normalize(mean=MEAN, std=STD)
])
return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
best_ratio_diff = float('inf')
best_ratio = (1, 1)
area = width * height
for ratio in target_ratios:
target_aspect_ratio = ratio[0] / ratio[1]
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
if ratio_diff < best_ratio_diff:
best_ratio_diff = ratio_diff
best_ratio = ratio
elif ratio_diff == best_ratio_diff:
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
best_ratio = ratio
return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
orig_width, orig_height = image.size
aspect_ratio = orig_width / orig_height
# calculate the existing image aspect ratio
target_ratios = set(
(i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
i * j <= max_num and i * j >= min_num)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
aspect_ratio, target_ratios, orig_width, orig_height, image_size)
# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
box = (
(i % (target_width // image_size)) * image_size,
(i // (target_width // image_size)) * image_size,
((i % (target_width // image_size)) + 1) * image_size,
((i // (target_width // image_size)) + 1) * image_size
)
# split the image
split_img = resized_img.crop(box)
processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
thumbnail_img = image.resize((image_size, image_size))
processed_images.append(thumbnail_img)
return processed_images
def load_image(image_file, input_size=448, max_num=6):
image = Image.open(image_file).convert('RGB')
transform = build_transform(input_size=input_size)
images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
pixel_values = [transform(image) for image in images]
pixel_values = torch.stack(pixel_values)
return pixel_values
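# Optional sanity check of the tiling logic above (synthetic image, not part of
# the official example): a 1600x800 input has aspect ratio 2.0, so with max_num=6
# the closest allowed grid is 2x1 -> two 448x448 tiles, plus one 448x448 thumbnail
# because use_thumbnail=True, i.e. three tiles in total.
synthetic = Image.new('RGB', (1600, 800), color=(128, 128, 128))
tiles = dynamic_preprocess(synthetic, image_size=448, use_thumbnail=True, max_num=6)
print(len(tiles), [tile.size for tile in tiles])  # 3 [(448, 448), (448, 448), (448, 448)]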
path = "OpenGVLab/InternVL-Chat-V1-5"
# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).eval().cuda()
# Otherwise, you need to set device_map='auto' to use multiple GPUs for inference.
# model = AutoModel.from_pretrained(
# path,
# torch_dtype=torch.bfloat16,
# low_cpu_mem_usage=True,
# trust_remote_code=True,
# device_map='auto').eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(
num_beams=1,
max_new_tokens=512,
do_sample=False,
)
# single-round single-image conversation
question = "请详细描述图片" # Please describe the picture in detail
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(question, response)
# multi-round single-image conversation
question = "请详细描述图片" # Please describe the picture in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)
question = "请根据图片写一首诗" # Please write a poem according to the picture
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)
# multi-round multi-image conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
question = "详细描述这两张图片" # Describe the two pictures in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)
question = "这两张图片的相同点和区别分别是什么" # What are the similarities and differences between these two pictures
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)
# batch inference (single image per sample)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
image_counts = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
questions = ["Describe the image in detail."] * len(image_counts)
responses = model.batch_chat(tokenizer, pixel_values,
image_counts=image_counts,
questions=questions,
generation_config=generation_config)
for question, response in zip(questions, responses):
print(question)
print(response)
To optimize inference of the InternVL-Chat models, we recommend using LMDeploy.
In the following subsections, we take the InternVL-Chat-V1-5 model as an example to introduce the usage of LMDeploy.
First, set up the inference environment with the following steps:
conda create -n internvl python=3.10 -y
conda activate internvl
pip install timm torchvision==0.17.2
pip install lmdeploy
The LMDeploy package on PyPI depends on CUDA 12.x by default. For CUDA 11.x environments, please refer to the installation guide.
from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')
image = load_image('examples/image2.jpg')
response = pipe(('describe this image', image))
print(response)
For more information on using the VLM pipeline, including image inference and multi-turn chat, please refer to the guide.
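For instance, batched inference over several images might look like the sketch below; it follows the batch-prompt pattern described in the LMDeploy VLM documentation, and the image paths are placeholders from the examples folder:

# Hedged sketch of batched VLM inference: a list of (prompt, image) tuples is
# processed in one batch by the same pipeline.
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')
image_paths = ['examples/image1.jpg', 'examples/image2.jpg']  # placeholder inputs
prompts = [('describe this image', load_image(path)) for path in image_paths]
responses = pipe(prompts)
print([r.text for r in responses])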
LMDeploy can package a VLM model into an OpenAI-compatible service with a single command, enabling seamless integration with the OpenAI API.
The service can be started with the following command:
lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-5
The arguments of api_server can be viewed with the command `lmdeploy serve api_server -h`: for example, `--tp` sets the degree of tensor parallelism, `--session-len` specifies the maximum length of the context window, and `--cache-max-entry-count` adjusts the proportion of GPU memory used for the k/v cache.
For more details, including launching the service with Docker, RESTful API information, and OpenAI integration, please refer to the guide.
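Once the service is running, any OpenAI-compatible client can query it. The snippet below is a minimal sketch rather than official documentation: it assumes the openai Python package, the default server address http://0.0.0.0:23333/v1, and a placeholder image URL; adjust these to your own deployment.

from openai import OpenAI

client = OpenAI(api_key='not-needed', base_url='http://0.0.0.0:23333/v1')  # assumed default port
model_name = client.models.list().data[0].id  # the model served by api_server
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            # placeholder URL; replace with an image reachable from the server
            {'type': 'image_url', 'image_url': {'url': 'https://example.com/image.jpg'}},
        ],
    }],
    temperature=0.8,
)
print(response.choices[0].message.content)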
This project is released under the MIT license. Parts of the code and models in this project come from other sources and are subject to their respective licenses.
If you find this project useful in your research, please consider citing:
@article{chen2023internvl,
title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
journal={arXiv preprint arXiv:2312.14238},
year={2023}
}
@article{chen2024far,
title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
journal={arXiv preprint arXiv:2404.16821},
year={2024}
}
The code of InternVL is built with reference to the following projects: OpenAI CLIP, Open CLIP, CLIP Benchmark, EVA, InternImage, ViT-Adapter, MMSegmentation, Transformers, DINOv2, BLIP-2, Qwen-VL, and LLaVA-1.5. Thanks for their great work.
If you would like to join our project group, please scan the QR code below to add our assistant.