LLM-jp

Blog

Mid-Training LLM-jp on OLMo2 Data: Setup, Results, and Practical Tips

Date: 2025.9.12
Author: Koshi Eguchi, Sosuke Hosokawa, Kouta Nakayama


0. Introduction

LLM-jp aims to develop large-scale language models that are open and particularly strong in Japanese. Following the release of the LLM-jp-3 series, we are now working toward publishing a new series of models.

This article examines the mid-training[1] dataset released by the Allen Institute for AI (Ai2) as part of the OLMo2 series. OLMo2 demonstrated that applying mid-training can substantially improve the performance of pre-trained models. Importantly, Ai2 has not only released model weights but also datasets, training configurations, and evaluation logs, making the entire process transparent.

We evaluated how much performance can be improved when applying Ai2’s OLMo2 mid-training datasets—adjusted in learning rate and size—to our pre-trained baseline checkpoints. The results were striking: applying the OLMo2 mid-training datasets to LLM-jp baseline models of 1.3B and 7.7B parameters increased accuracy on GSM8K from 1.17% to 30.17% for the 1.3B model, and from 14.86% to 59.06% for the 7.7B model.

1. Verification of Mid-Training for OLMo2 Datasets

1.1. Overview of OLMo2

As mentioned above, OLMo2 is the second-generation large-scale language model family developed by Ai2 in an open and transparent process. Ai2 published not only model weights but also training datasets (OLMo Mix 1124 and Dolmino Mix 1124), training code, and evaluation logs. For further details, see the official OLMo2 paper.

1.2. OLMo2 Mid-Training Datasets

Table 1 presents the composition of OLMo2’s high-quality mid-training datasets, which include mathematics, academic papers, and coding data. OLMo2 refers to this dataset collection as Dolmino Mix 1124.

Table 1. Composition of the OLMo2 Mid-Training Dataset (Dolmino Mix 1124)[2]

| Category | Data Source | Data Type | Tokens | Words | Bytes | Documents |
|---|---|---|---|---|---|---|
| Mid-Training Dolmino High Quality Subset | DCLM-Baseline, FastText top 7%, FineWeb ≥ 2 | High-quality web | 752B | 670B | 4.56T | 606M |
| | FLAN from Dolma 1.7, decontaminated | Instruction data | 17.0B | 14.4B | 98.2B | 57.3M |
| | peS2o from Dolma 1.7 | Academic papers | 58.6B | 51.1B | 413B | 38.8M |
| | Wikipedia & Wikibooks from Dolma 1.7 | Encyclopedic | 3.7B | 3.16B | 16.2B | 6.17M |
| | Stack Exchange 09/30/2024 dump, curated Q&A data | Q&A | 1.26B | 1.14B | 7.72B | 2.48M |
| | High quality total | | 832.6B | 739.8B | 5.09T | 710.8M |
| Mid-Training Dolmino Math Mix | TuluMath | Synthetic math | 230M | 222M | 1.03B | 220K |
| | Dolmino SynthMath | Synthetic math | 28.7M | 35.1M | 163M | 725K |
| | TinyGSM-MIND | Synthetic math | 6.48B | 5.68B | 25.52B | 17M |
| | MathCoder2 Synthetic, Ajibawa-2023, M-A-P Matrix | Synthetic math | 3.87B | 3.71B | 18.4B | 2.83M |
| | Metamath (OWM-filtered) | Math | 84.2M | 76.6M | 741M | 383K |
| | CodeSearchNet (OWM-filtered) | Code | 1.78M | 1.41M | 29.8M | 7.27K |
| | GSM8K train split | Math | 2.74M | 3.00M | 25.3M | 17.6K |
| | Math total | | 10.7B | 9.73B | 45.9B | 21.37M |

Ai2 created three different mid-training datasets by sampling Dolmino Mix 1124 into sizes of 50B, 100B, and 300B tokens, as summarized in Table 2.

Table 2. Sampling Strategy of Dolmino Mix 1124. Source (%) indicates the sampling ratio from each source relative to the original Dolmino Mix 1124. Mix (%) indicates the proportion each source contributes to the constructed mid-training dataset. The sum of Mix (%) values in each column equals 100%.

| Data Source | Tokens | 50B Source (%) | 50B Mix (%) | 100B Source (%) | 100B Mix (%) | 300B Source (%) | 300B Mix (%) |
|---|---|---|---|---|---|---|---|
| Filtered DCLM | 752B | 3.23 | 47.2 | 6.85 | 50.2 | 20.78 | 51.9 |
| Decontam. FLAN | 17.0B | 50.0 | 16.6 | 100 | 16.7 | 200 | 11.3 |
| Stack Exchange Q&A | 1.26B | 100 | 2.45 | 200 | 2.47 | 400 | 1.68 |
| peS2o | 58.6B | 5.15 | 5.85 | 16.7 | 9.52 | 100 | 19.4 |
| Wikipedia/Wikibooks | 3.7B | 100 | 7.11 | 100 | 3.57 | 400 | 4.86 |
| Dolmino Math | 10.7B | 100 | 20.8 | 200 | 17.5 | 400 | 10.8 |
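The relationship between the Source (%) and Mix (%) columns can be checked with a short back-of-envelope script. The figures below are copied from Tables 1 and 2; the helper function itself is our own sketch, not Ai2's tooling (ratios above 100% mean a source is repeated).

```python
tokens_b = {            # source size in billions of tokens (Table 1)
    "Filtered DCLM": 752, "Decontam. FLAN": 17.0, "Stack Exchange Q&A": 1.26,
    "peS2o": 58.6, "Wikipedia/Wikibooks": 3.7, "Dolmino Math": 10.7,
}
source_pct_50b = {      # Source (%) column for the 50B mix (Table 2)
    "Filtered DCLM": 3.23, "Decontam. FLAN": 50.0, "Stack Exchange Q&A": 100,
    "peS2o": 5.15, "Wikipedia/Wikibooks": 100, "Dolmino Math": 100,
}

def mix_percentages(tokens, source_pct):
    # tokens sampled from each source, in billions
    sampled = {k: tokens[k] * source_pct[k] / 100 for k in tokens}
    total = sum(sampled.values())
    # share of the final blend contributed by each source
    return total, {k: 100 * v / total for k, v in sampled.items()}

total_b, mix = mix_percentages(tokens_b, source_pct_50b)
# total_b lands near the nominal 50B, and mix["Filtered DCLM"] comes out
# close to the 47.2% reported in Table 2 (small gaps are rounding)
```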

The OLMo2 models were trained as follows:

  • OLMo2 7B: Mid-training was conducted three times with different data orders on the 50B dataset. Model Soup—a simple averaging of multiple trained models’ weights—was then applied to produce the final model.
  • OLMo2 13B: Mid-training was conducted three times with the 100B dataset and once with the 300B dataset. Model Soup was applied across the four checkpoints to obtain the final model.

The OLMo2 paper confirmed that Model Soup consistently improved performance compared to individual runs.
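Model Soup itself is just a uniform average of parameters across checkpoints. The sketch below illustrates the idea with flat lists of floats standing in for parameter tensors; a real soup would average full framework state dicts (e.g. PyTorch tensors), and the toy values are made up.

```python
def model_soup(checkpoints):
    """Uniformly average each named parameter across checkpoints."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }

# Three runs of the same architecture, differing only in data order:
runs = [
    {"w": [0.9, 2.1], "b": [0.3]},
    {"w": [1.1, 1.9], "b": [0.1]},
    {"w": [1.0, 2.0], "b": [0.2]},
]
soup = model_soup(runs)   # "w" averages to about [1.0, 2.0], "b" to about [0.2]
```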

We reproduced this methodology: we applied differently sized OLMo2 mid-training datasets to LLM-jp baseline models of 1.3B and 7.7B parameters and also tested the effect of Model Soup.

2. Evaluation

2.1. Experimental Setup

2.1.1. Mid-Training Datasets

For this study, we tokenized OLMo2’s mid-training data with the same tokenizer as the baseline models, creating LLM-jp Dolmino Mix 1124. We then sampled the data following the same Source (%) ratios as in Table 2, yielding three datasets (50B, 100B, and 300B tokens). Because the tokenizers differ, the Mix (%) ratios deviate slightly from those of OLMo2. Details are shown in Table 3.

Table 3. Composition of the Mid-Training Datasets Used in This Study (LLM-jp Dolmino Mix). We adjusted Source (%) to align with the original Dolmino Mix. Because of tokenizer differences, Mix (%) values differ slightly from Table 2. The dataset sizes are approximately 55.8B, 114.9B, and 337.7B tokens, but for simplicity we refer to them as 50B, 100B, and 300B tokens.

| Data Source | Tokens | 50B Source (%) | 50B Mix (%) | 100B Source (%) | 100B Mix (%) | 300B Source (%) | 300B Mix (%) |
|---|---|---|---|---|---|---|---|
| Filtered DCLM | 821B | 3.23 | 47.57 | 6.85 | 48.98 | 20.78 | 50.57 |
| Decontam. FLAN | 18.5B | 50.0 | 16.56 | 100 | 16.08 | 200 | 10.95 |
| Stack Exchange Q&A | 1.46B | 100 | 2.63 | 200 | 2.55 | 400 | 1.74 |
| peS2o | 62.9B | 5.15 | 5.80 | 16.7 | 9.13 | 100 | 18.61 |
| Wikipedia/Wikibooks | 3.9B | 100 | 6.98 | 100 | 3.39 | 400 | 4.62 |
| Dolmino Math | 11.4B | 100 | 20.46 | 200 | 19.87 | 400 | 13.52 |

2.1.2. Other Training Settings

The baseline models of 1.3B and 7.7B parameters were pre-trained on 19.5T tokens across English, Chinese, Japanese, and Korean.

We tested two learning rate schedules during mid-training:

  1. Linear decay from the pre-training end learning rate down to zero.
  2. Keeping the pre-training end learning rate fixed throughout mid-training.

All other hyperparameters matched those used during pre-training.
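The two schedules can be written as a single step-to-learning-rate function; this is our own minimal sketch, and the `base_lr` value in the comment is illustrative rather than our actual hyperparameter.

```python
def mid_training_lr(step, total_steps, base_lr, schedule="linear_decay"):
    """base_lr is the learning rate at the end of pre-training."""
    if schedule == "constant":
        return base_lr                          # strategy 2: hold the LR fixed
    # strategy 1: decay linearly from base_lr down to zero
    return base_lr * (1.0 - step / total_steps)

# e.g. with an illustrative base_lr of 3e-4:
# mid_training_lr(0, 1000, 3e-4) starts at 3e-4 and reaches 0.0 at step 1000,
# while schedule="constant" returns 3e-4 at every step.
```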

2.1.3. Evaluation Benchmarks

We followed the evaluation methodology of Swallow and tested on twelve benchmarks:

  • gsm8k
  • squad2
  • triviaqa
  • hellaswag
  • openbookqa
  • xwinograd_en
  • bbh_cot
  • mmlu
  • mmlu_social_sciences
  • mmlu_humanities
  • mmlu_stem
  • mmlu_other

2.2. Results

We first compared the effects of learning rate schedule and dataset size, then examined the impact of Model Soup.

2.2.1. Comparison of Learning Rate Schedules and Dataset Sizes

We tested six conditions: (linear decay vs fixed learning rate) × (50B, 100B, 300B tokens) for both 1.3B and 7.7B models. Figures 1 and 2 summarize the results.

Figure 1. Evaluation results of the 1.3B model after mid-training with different dataset sizes and learning rate strategies.

Figure 2. Evaluation results of the 7.7B model after mid-training with different dataset sizes and learning rate strategies.

Figures 1 and 2 present the results for the 1.3B and 7.7B models after mid-training.
The 1.3B model with the best average benchmark rank outperformed the pre-trained baseline model on 10 out of 12 benchmarks.
The 7.7B model with the best average rank outperformed the baseline on all 12 benchmarks.
Performance gains were particularly striking on GSM8K: the 1.3B model improved from 1.17% to 27.90%, while the 7.7B model improved from 14.86% to 51.25%.
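The "average benchmark rank" used throughout this comparison can be sketched as follows: rank every configuration on each benchmark (1 = best score), then average the ranks per configuration. The function is our own illustration (ties are not handled), and the scores below are toy numbers, not our actual results.

```python
def average_rank(scores):
    """scores: {config: {benchmark: accuracy}}; lower average rank is better."""
    benchmarks = next(iter(scores.values()))
    totals = {cfg: 0 for cfg in scores}
    for bench in benchmarks:
        # sort configurations from best to worst on this benchmark
        ordered = sorted(scores, key=lambda cfg: scores[cfg][bench], reverse=True)
        for rank, cfg in enumerate(ordered, start=1):
            totals[cfg] += rank
    return {cfg: totals[cfg] / len(benchmarks) for cfg in totals}

toy = {
    "baseline": {"gsm8k": 1.2,  "mmlu": 40.0},
    "mid-50B":  {"gsm8k": 28.0, "mmlu": 41.0},
    "mid-300B": {"gsm8k": 30.0, "mmlu": 39.0},
}
avg = average_rank(toy)   # here "mid-50B" gets the best (smallest) average rank
```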

◾️Analysis of the 1.3B Model

Comparison of dataset sizes.
Both with fixed and linearly decayed learning rates, increasing dataset size generally improved performance. On GSM8K, scores dropped when moving from 50B to 100B tokens under both schedules, but performance increased again at 300B tokens, surpassing the 50B results.

Comparison of learning rate schedules.
For both strategies, the average benchmark rank was highest at 300B tokens. Comparing fixed vs. linear decay at 300B tokens, the fixed schedule outperformed linear decay in 10 out of 12 benchmarks. Therefore, for the 1.3B model, the best configuration was 300B tokens with fixed learning rate.

◾️Analysis of the 7.7B Model

Comparison of dataset sizes.
For benchmarks other than GSM8K, enlarging the dataset produced only minor differences (within 2%). However, GSM8K performance deteriorated significantly as dataset size increased. Specifically, under linear decay, GSM8K dropped from 46.85% (50B) to 34.04% (300B). Under fixed LR, it dropped from 51.25% (50B) to 17.74% (300B). As a result, the simple average score across benchmarks was highest at 50B tokens, for both learning rate strategies.

Comparison of learning rate schedules.
In every dataset size, fixed learning rate achieved a better average rank than linear decay. This effect was most pronounced for GSM8K: at 50B tokens, fixed LR outperformed linear decay on 10 of 12 benchmarks, achieving the best overall rank.
Therefore, for the 7.7B model, the best configuration was 50B tokens with fixed learning rate.

2.2.2. Evaluation of Model Soup

Next, we evaluated the effectiveness of Model Soup under the best conditions identified in Section 2.2.1 (1.3B with 300B tokens, fixed LR; 7.7B with 50B tokens, fixed LR). Figures 3 and 4 show the results.

Figure 3. Model Soup results for the 1.3B model trained with a fixed learning rate on the 300B-token dataset.

Figure 4. Model Soup results for the 7.7B model trained with a fixed learning rate on the 50B-token dataset.

  • 1.3B model: The Model Soup improved upon the best single-seed model in 7 of 12 benchmarks. On GSM8K, the best seed reached 27.90%, while Model Soup raised it to 30.17% (compared to 1.17% at pre-training).
  • 7.7B model: Model Soup outperformed the best seed in 10 of 12 benchmarks. On GSM8K, the best seed achieved 53.53%, while Model Soup improved it to 59.06% (compared to 14.86% at pre-training).

For both the 1.3B and 7.7B models, Model Soup did not surpass the best single run on every benchmark, but it improved the average rank and produced especially notable gains on GSM8K. On the benchmarks that declined, the drop was under 1%, so Model Soup yielded an overall performance improvement.

(Supplement) Training Time

For reference, we report the training time required for mid-training. The experiments were conducted under the following environment:

  • Platform: ABCI 3.0
  • Framework: llm-jp/Megatron-LM
  • Nodes: 16 (each with 8 NVIDIA H200 GPUs)

Table 4 lists the training times. During the experiments, GPU power was capped at 500W[3]. With a 700W cap, training times would have been shorter.

Table 4. Training Time for Mid-Training Experiments

| Model Size | Training Corpus Size (tokens) | Training Time (hours) |
|---|---|---|
| 1.3B | 50B | 2.9 |
| 1.3B | 100B | 6.1 |
| 1.3B | 300B | 18.8 |
| 7.7B | 50B | 17.6 |
| 7.7B | 100B | 36.1 |
| 7.7B | 300B | 107.3 |
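From Table 4 one can derive a rough per-GPU throughput. The calculation below uses the 1.3B model's 50B-token run on 16 nodes × 8 H200 GPUs; it is a back-of-envelope estimate only, and slightly understates throughput because the nominal 50B dataset actually holds about 55.8B tokens.

```python
gpus = 16 * 8           # 16 nodes, 8 H200 GPUs each
tokens = 50e9           # nominal corpus size for the 1.3B / 50B run
hours = 2.9             # training time from Table 4
tokens_per_gpu_per_sec = tokens / (hours * 3600 * gpus)
# roughly 3.7e4 tokens per GPU per second for the 1.3B model
```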

Conclusion

This study examined the application of Ai2’s OLMo2 mid-training dataset, Dolmino Mix 1124, to pre-trained baseline checkpoints. We confirmed consistent performance improvements, with particularly notable gains on GSM8K.

Due to limited resources and time, we could not exhaustively test all configurations. Nevertheless, we believe further performance gains are possible by adjusting learning rate schedules, dataset ratios, and incorporating additional datasets.

LLM-jp will continue to develop large-scale language models that are open-source and strong in Japanese. We invite interested collaborators to join our efforts through the LLM-jp project page.

Citation

@article{llmjp2025midtraining,
  author = {Koshi Eguchi and Sosuke Hosokawa and Kouta Nakayama},
  title = {An Examination of OLMo2-Based Mid-Training for LLM-jp Models},
  year = {2025},
  url = {https://llm-jp.nii.ac.jp/en/blog/mid-training-llm-jp-on-olmo2-data-setup-results-and-practical-tips/}
}



  1. This training step lies between pre-training and post-training, utilizing high-quality data. 
  2. This corresponds to Table 2 in the OLMo2 paper. In that table, the GSM8K train split is reported as 17.6K documents, but this figure includes the train split and the test split. Please note that the actual train split used for training consists of 14.9K documents. 
  3. As of July 2025, it is limited to 400W.