Announcement of LLM-jp-3 VILA 14B

The Large Language Model Research and Development Center (LLMC) is developing open foundation models with strong Japanese language capabilities. We are pleased to announce the release of LLM-jp-3 VILA 14B, a multimodal foundation model that extends the LLM-jp-3 13B model developed at LLMC to accept image inputs.

The model is composed of the large language model llm-jp/llm-jp-3-13b-instruct, the image encoder google/siglip-so400m-patch14-384, and a two-layer linear projection that maps image feature vectors into the language model's embedding space. Including the image encoder and projection layers, the model has approximately 14 billion parameters. The architecture is inspired by the VILA framework (Lin et al., 2024).
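To make the composition concrete, the sketch below wires the three components together in PyTorch with Hugging Face Transformers. It is an illustrative reconstruction, not the released implementation: the class name, the plain two-linear-layer projector (the actual projector may differ in details such as an activation between the layers), and the way patch features are taken from the encoder are assumptions made for readability.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel


class VilaStyleVLM(nn.Module):
    """Illustrative composition of the three components described above
    (not the released implementation)."""

    def __init__(self):
        super().__init__()
        # Image encoder: SigLIP so400m, 384px input, 14px patches.
        self.vision = SiglipVisionModel.from_pretrained(
            "google/siglip-so400m-patch14-384"
        )
        # Language model: the instruction-tuned LLM-jp-3 13B (large download).
        self.llm = AutoModelForCausalLM.from_pretrained(
            "llm-jp/llm-jp-3-13b-instruct"
        )
        vision_dim = self.vision.config.hidden_size  # read from the SigLIP config
        llm_dim = self.llm.config.hidden_size        # read from the LLM config
        # Two-layer projection that maps image features into the LLM's
        # embedding space (details of the released projector may differ).
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.Linear(llm_dim, llm_dim),
        )

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim);
        # the projected patch features are then interleaved with text tokens.
        patch_features = self.vision(pixel_values).last_hidden_state
        return self.projector(patch_features)
```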

To train this multimodal model, LLMC built new datasets, including Japanese text-image pairs and interleaved data (text with images embedded at appropriate positions) extracted from Common Crawl archives. In addition, synthetic data generated with OpenAI GPT-4o was used to fine-tune the model for instruction following.
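As a rough illustration of what "interleaved" means in practice, the hypothetical record below keeps text segments and image references in their original document order, so the model sees each image next to the passage that refers to it. The field names and layout are invented for this example and are not the schema of the released datasets.

```python
# Hypothetical interleaved record: text and images kept in document order.
interleaved_record = {
    "source": "common-crawl",  # origin of the document (illustrative)
    "segments": [
        {"type": "text",  "value": "週末に訪れた美術館の様子です。"},
        {"type": "image", "value": "images/0001.jpg"},
        {"type": "text",  "value": "入口のすぐそばに大きな彫刻が展示されていました。"},
        {"type": "image", "value": "images/0002.jpg"},
    ],
}

# A text-image pair is the simpler case: one caption per image.
pair_record = {
    "image": "images/0003.jpg",
    "caption": "桜が満開の公園で撮影した写真。",
}
```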

LLMC evaluated the model on Japanese vision-and-language benchmarks such as HeronBench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA-500. Despite its smaller parameter count, LLM-jp-3 VILA 14B performed comparably to OpenAI GPT-4o on these benchmarks. The evaluations were conducted with llm-jp-eval-mm, an evaluation framework developed in-house at LLMC.

The model weights for LLM-jp-3 VILA 14B are distributed under the Apache License 2.0. However, because the instruction-tuning data includes outputs generated with OpenAI GPT-4o, users must also comply with OpenAI's Terms of Use. As long as both sets of terms are observed, the model can be used free of charge in applications and for further training.

For more details about the model, please refer to the following links and the accompanying paper (Sasagawa et al., 2024):

Details about the newly developed training datasets can also be found at the links provided above.

References
