WAP Tokushima NLP Resources

Natural language processing software and language resources provided by the WAP Tokushima Laboratory of AI and NLP.

Software

Language Resources

Community

We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.


Open Data on AWS

The SudachiDict, chiVe, and chiTra data are generously hosted by AWS through its Open Data Sponsorship Program.

SudachiDict

Japanese dictionaries for morphological analysis. See SudachiDict for details.

Click here for pre-built dictionaries.

chiVe

Pretrained Japanese word embeddings. See chiVe for details.

| Version | Normalized | Min Count | Vocab | Sudachi | SudachiDict | Text | gensim | Magnitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| v1.3 mc5 | yes | 5 | 2,530,791 | v0.6.8 | 20240109-core | 3.6GB (tar.gz) | 2.9GB (tar.gz) | - |
| v1.3 mc15 | yes | 15 | 1,186,019 | v0.6.8 | 20240109-core | 1.7GB (tar.gz) | 1.3GB (tar.gz) | - |
| v1.3 mc30 | yes | 30 | 759,011 | v0.6.8 | 20240109-core | 1.1GB (tar.gz) | 0.8GB (tar.gz) | - |
| v1.3 mc90 | yes | 90 | 410,533 | v0.6.8 | 20240109-core | 0.6GB (tar.gz) | 0.5GB (tar.gz) | - |
| v1.2 mc5 | yes | 5 | 3,197,456 | v0.4.3 | 20200722-core | 9.2GB (tar.gz) | 3.8GB (tar.gz) | 5.5GB (.magnitude) |
| v1.2 mc15 | yes | 15 | 1,454,280 | v0.4.3 | 20200722-core | 5.0GB (tar.gz) | 1.7GB (tar.gz) | 2.4GB (.magnitude) |
| v1.2 mc30 | yes | 30 | 912,550 | v0.4.3 | 20200722-core | 3.1GB (tar.gz) | 1.1GB (tar.gz) | 1.5GB (.magnitude) |
| v1.2 mc90 | yes | 90 | 482,223 | v0.4.3 | 20200722-core | 1.7GB (tar.gz) | 0.6GB (tar.gz) | 0.8GB (.magnitude) |
| v1.1 mc5 | yes | 5 | 3,196,481 | v0.3.0 | 20191030-core | 11GB (tar.gz) | 3.6GB (tar.gz) | 5.5GB (.magnitude) |
| v1.1 mc15 | yes | 15 | 1,452,205 | v0.3.0 | 20191030-core | 4.7GB (tar.gz) | 1.7GB (tar.gz) | 2.4GB (.magnitude) |
| v1.1 mc30 | yes | 30 | 910,424 | v0.3.0 | 20191030-core | 3.0GB (tar.gz) | 1.1GB (tar.gz) | 1.5GB (.magnitude) |
| v1.1 mc90 | yes | 90 | 480,443 | v0.3.0 | 20191030-core | 1.6GB (tar.gz) | 0.6GB (tar.gz) | 0.8GB (.magnitude) |
| v1.0 mc5 | no | 5 | 3,644,628 | v0.1.1 | 0.1.1-dictionary-full | 12GB (tar.gz) | 4.1GB (tar.gz) | 6.3GB (.magnitude) |
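
As a rough illustration of working with the Text downloads, the sketch below parses the plain word2vec text layout (a "vocab_size dimension" header line, then one word and its vector per line) using only the standard library. It assumes the Text archives unpack to that layout; the helper names read_word2vec_text and cosine_similarity and the tiny in-memory sample are hypothetical and stand in for a real chiVe file.

```python
import io
import math
from typing import Dict, List, TextIO


def read_word2vec_text(stream: TextIO) -> Dict[str, List[float]]:
    """Parse word2vec text format from an open text stream (assumed layout)."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        line = line.rstrip()
        if not line:
            continue
        # Split from the right so a word that itself contains spaces stays intact.
        parts = line.rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got line: {line!r}")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError(f"header says {vocab_size} words, found {len(vectors)}")
    return vectors


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical three-dimensional sample; real chiVe vectors are much larger.
SAMPLE = "2 3\n東京 0.1 0.2 0.3\n大阪 0.4 0.5 0.6\n"
vectors = read_word2vec_text(io.StringIO(SAMPLE))
similarity = cosine_similarity(vectors["東京"], vectors["大阪"])
print(f"{similarity:.4f}")
```

For a real file, the same function could be given a handle opened with open(path, encoding="utf-8"). Holding millions of vectors in Python lists is memory-hungry, so the gensim format is likely the more practical choice for actual use.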

“A Unit Only” Resources

| Version | Vocab | Text | gensim | Magnitude |
| --- | --- | --- | --- | --- |
| v1.1 mc5 aunit | 322,094 (10.1%) | 1.1GB (tar.gz) | 0.4GB (tar.gz) | 0.5GB (.magnitude) |
| v1.1 mc15 aunit | 276,866 (19.1%) | 1.0GB (tar.gz) | 0.3GB (tar.gz) | 0.4GB (.magnitude) |
| v1.1 mc30 aunit | 242,658 (26.7%) | 0.8GB (tar.gz) | 0.3GB (tar.gz) | 0.4GB (.magnitude) |
| v1.1 mc90 aunit | 189,775 (39.5%) | 0.7GB (tar.gz) | 0.2GB (tar.gz) | 0.3GB (.magnitude) |
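
The percentages in the Vocab column are not explained on this page. They appear to be each A-unit-only vocabulary as a share of the full v1.1 vocabulary with the same min count, and the short check below reproduces the listed figures under that assumption. The dictionary and function names are illustrative only.

```python
# Vocabulary sizes copied from the v1.1 rows of the chiVe tables on this page.
FULL_VOCAB = {"mc5": 3_196_481, "mc15": 1_452_205, "mc30": 910_424, "mc90": 480_443}
AUNIT_VOCAB = {"mc5": 322_094, "mc15": 276_866, "mc30": 242_658, "mc90": 189_775}


def aunit_share(min_count: str) -> str:
    """A-unit-only vocabulary as a percentage of the full vocabulary, one decimal."""
    share = 100 * AUNIT_VOCAB[min_count] / FULL_VOCAB[min_count]
    return f"{share:.1f}%"


for min_count in FULL_VOCAB:
    print(min_count, aunit_share(min_count))
```

The share rises with the min count, which suggests that rarer vocabulary entries are more often longer units than A units.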

chiVe Models for Continued Training

| Version | gensim (full) |
| --- | --- |
| v1.3 mc5 | 5.5GB (tar.gz) |
| v1.3 mc15 | 2.6GB (tar.gz) |
| v1.3 mc30 | 1.7GB (tar.gz) |
| v1.3 mc90 | 0.9GB (tar.gz) |
| v1.2 mc5 | 6.7GB (tar.gz) |
| v1.2 mc15 | 3.0GB (tar.gz) |
| v1.2 mc30 | 1.9GB (tar.gz) |
| v1.2 mc90 | 1.0GB (tar.gz) |

chiTra

A library for using large-scale pretrained language models with the Japanese tokenizer SudachiPy. See chiTra for details.

| Version | Normalized | SudachiTra | Sudachi | SudachiDict | Text | Pretrained Model |
| --- | --- | --- | --- | --- | --- | --- |
| v1.0 | normalized_and_surface | v0.1.7 | 0.6.2 | 20211220-core | NWJC (148GB) | 395 MB (tar.gz) |
| v1.1 | normalized_nouns | v0.1.8 | 0.6.6 | 20220729-core | NWJC with additional cleaning (79GB) | 396 MB (tar.gz) |