Commit 17b6129: vault backup: 2023-12-02 - 1 files
swyx committed Dec 2, 2023 (1 parent: 105d8b3)

Affected file: stub notes/IMAGE2TEXT.md (11 additions, 1 deletion)

@@ -33,4 +33,14 @@ pulsr io and more from this thread
https://twitter.com/tunguz/status/1616190582606467089?s=46&t=eCig8-Pc5CuJQeXulVU7qQ


Flamingo model https://arxiv.org/abs/2204.14198


## VQA

LLaVA
- [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485): [Haotian Liu](https://arxiv.org/search/cs?searchtype=author&query=Liu,+H), [Chunyuan Li](https://arxiv.org/search/cs?searchtype=author&query=Li,+C), [Qingyang Wu](https://arxiv.org/search/cs?searchtype=author&query=Wu,+Q), [Yong Jae Lee](https://arxiv.org/search/cs?searchtype=author&query=Lee,+Y+J)

> Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.

- https://llava-vl.github.io/
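
As a quick usage note: below is a minimal VQA sketch against the community llava-hf checkpoint in Hugging Face transformers. The model id, prompt template, and demo image URL come from the transformers LLaVA integration, not from this note; treat it as an assumption-laden starting point rather than the paper's reference code.

```python
# Minimal LLaVA-1.5 VQA sketch (assumes transformers >= 4.36, accelerate, and a GPU;
# llava-hf/llava-1.5-7b-hf is a community conversion of the LLaVA-1.5 weights).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this URL is the LLaVA project's demo image.
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 expects a USER/ASSISTANT turn with an <image> placeholder token.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)

output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```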
