Release of the Japanese Toxic Text Dataset “LLM-jp Toxicity Dataset v2”
We are pleased to announce the release of the LLM-jp Toxicity Dataset v2, a dataset designed to support research and development in toxic text detection. The dataset consists of Japanese texts collected from the Common Crawl corpus and manually labeled for toxicity. It was created by adding 2,000 new samples to the previously released LLM-jp Toxicity Dataset, bringing the total to 3,847 labeled texts.
In addition to toxicity labels, each text is annotated with specific types of harmful content, such as obscenity, discrimination, violence, and illegal activities. The dataset is released under the CC-BY license, allowing for commercial use. We hope this resource will be useful for your research and applications.
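As a minimal illustration of how a dataset structured this way might be inspected, the sketch below loads a line-delimited JSON file and tallies the toxicity labels and harmful-content categories. The file name, field names, and label values used here are assumptions for illustration only; the actual schema is documented in the repository README.

```python
import json
from collections import Counter

# NOTE: the file name, field names ("label", "obscene", ...), and the "yes"
# value below are illustrative assumptions, not the official schema.
# Consult the dataset README for the actual format.
DATASET_PATH = "llm-jp-toxicity-dataset-v2.jsonl"


def load_dataset(path):
    """Load one JSON record per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def summarize(records):
    """Count overall toxicity labels and per-category annotations."""
    label_counts = Counter(r.get("label") for r in records)
    category_counts = Counter()
    for r in records:
        for category in ("obscene", "discriminatory", "violent", "illegal"):
            if r.get(category) == "yes":
                category_counts[category] += 1
    return label_counts, category_counts


if __name__ == "__main__":
    records = load_dataset(DATASET_PATH)
    labels, categories = summarize(records)
    print("toxicity labels:", dict(labels))
    print("harmful-content categories:", dict(categories))
```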
For more details, please refer to the README in the repository above and the following paper:
This dataset was developed through a collaboration between the Language Information Access Technology Team at the RIKEN Center for Advanced Intelligence Project and the LLM-jp community.