Natural language processing software and language resources provided by WAP Tokushima Laboratory of AI and NLP.
We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.
SudachiDict, chiVe, and chiTra data are generously hosted by AWS through its Open Data Sponsorship Program.
Japanese dictionaries for morphological analysis. Please refer to SudachiDict for details.
Click here for pre-built dictionaries.
Japanese pretrained word embeddings. Please refer to chiVe for details.
Version | Normalized | Min Count | Vocab | Sudachi | SudachiDict | Text | gensim | Magnitude |
---|---|---|---|---|---|---|---|---|
v1.3 mc5 | o | 5 | 2,530,791 | v0.6.8 | 20240109-core | 3.6GB (tar.gz) | 2.9GB (tar.gz) | - |
v1.3 mc15 | o | 15 | 1,186,019 | v0.6.8 | 20240109-core | 1.7GB (tar.gz) | 1.3GB (tar.gz) | - |
v1.3 mc30 | o | 30 | 759,011 | v0.6.8 | 20240109-core | 1.1GB (tar.gz) | 0.8GB (tar.gz) | - |
v1.3 mc90 | o | 90 | 410,533 | v0.6.8 | 20240109-core | 0.6GB (tar.gz) | 0.5GB (tar.gz) | - |
v1.2 mc5 | o | 5 | 3,197,456 | v0.4.3 | 20200722-core | 9.2GB (tar.gz) | 3.8GB (tar.gz) | 5.5GB (.magnitude) |
v1.2 mc15 | o | 15 | 1,454,280 | v0.4.3 | 20200722-core | 5.0GB (tar.gz) | 1.7GB (tar.gz) | 2.4GB (.magnitude) |
v1.2 mc30 | o | 30 | 912,550 | v0.4.3 | 20200722-core | 3.1GB (tar.gz) | 1.1GB (tar.gz) | 1.5GB (.magnitude) |
v1.2 mc90 | o | 90 | 482,223 | v0.4.3 | 20200722-core | 1.7GB (tar.gz) | 0.6GB (tar.gz) | 0.8GB (.magnitude) |
v1.1 mc5 | o | 5 | 3,196,481 | v0.3.0 | 20191030-core | 11GB (tar.gz) | 3.6GB (tar.gz) | 5.5GB (.magnitude) |
v1.1 mc15 | o | 15 | 1,452,205 | v0.3.0 | 20191030-core | 4.7GB (tar.gz) | 1.7GB (tar.gz) | 2.4GB (.magnitude) |
v1.1 mc30 | o | 30 | 910,424 | v0.3.0 | 20191030-core | 3.0GB (tar.gz) | 1.1GB (tar.gz) | 1.5GB (.magnitude) |
v1.1 mc90 | o | 90 | 480,443 | v0.3.0 | 20191030-core | 1.6GB (tar.gz) | 0.6GB (tar.gz) | 0.8GB (.magnitude) |
v1.0 mc5 | x | 5 | 3,644,628 | v0.1.1 | 0.1.1-dictionary-full | 12GB (tar.gz) | 4.1GB (tar.gz) | 6.3GB (.magnitude) |
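
As a rough illustration of how one of the gensim-format archives listed above might be used, the sketch below builds the path to an extracted model and loads it with gensim's `KeyedVectors.load`. The directory layout (`chive-<version>-mc<N>_gensim/chive-<version>-mc<N>.kv`), the `data` root, and the helper names `chive_kv_path` and `load_chive` are assumptions made for this example; check the extracted archive for the actual file names.

```python
from pathlib import Path


def chive_kv_path(root, version="1.3", min_count=90):
    """Build the path to an extracted gensim-format chiVe model.

    The naming scheme is an assumption for illustration, not something
    this page specifies; adjust it to match the extracted archive.
    """
    name = f"chive-{version}-mc{min_count}"
    return Path(root) / f"{name}_gensim" / f"{name}.kv"


def load_chive(kv_path):
    """Load chiVe vectors with gensim (requires `pip install gensim`)."""
    # Imported lazily so the path helper works without gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load(str(kv_path))


# Example usage, once an archive has been downloaded and extracted:
# vectors = load_chive(chive_kv_path("data", "1.3", 90))
# print(vectors.most_similar("徳島", topn=5))
```

Higher minimum counts give smaller vocabularies and smaller files, so the mc90 variant is a reasonable starting point when memory is limited.
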
Version | Vocab | Text | gensim | Magnitude |
---|---|---|---|---|
v1.1 mc5 aunit | 322,094 (10.1%) | 1.1GB (tar.gz) | 0.4GB (tar.gz) | 0.5GB (.magnitude) |
v1.1 mc15 aunit | 276,866 (19.1%) | 1.0GB (tar.gz) | 0.3GB (tar.gz) | 0.4GB (.magnitude) |
v1.1 mc30 aunit | 242,658 (26.7%) | 0.8GB (tar.gz) | 0.3GB (tar.gz) | 0.4GB (.magnitude) |
v1.1 mc90 aunit | 189,775 (39.5%) | 0.7GB (tar.gz) | 0.2GB (tar.gz) | 0.3GB (.magnitude) |
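
The percentages in the Vocab column appear to give each aunit vocabulary as a share of the full v1.1 vocabulary at the same minimum count. This reading is an inference from the numbers rather than something stated here; the short check below reproduces the figures from the two tables.

```python
# Vocabulary sizes copied from the chiVe v1.1 rows of the tables above,
# keyed by minimum count.
FULL_VOCAB = {5: 3_196_481, 15: 1_452_205, 30: 910_424, 90: 480_443}
AUNIT_VOCAB = {5: 322_094, 15: 276_866, 30: 242_658, 90: 189_775}


def aunit_share(min_count):
    """Return the aunit vocabulary as a percentage of the full vocabulary."""
    return round(100 * AUNIT_VOCAB[min_count] / FULL_VOCAB[min_count], 1)


shares = {mc: aunit_share(mc) for mc in FULL_VOCAB}
print(shares)  # {5: 10.1, 15: 19.1, 30: 26.7, 90: 39.5}
```
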
Version | gensim (full) |
---|---|
v1.3 mc5 | 5.5GB (tar.gz) |
v1.3 mc15 | 2.6GB (tar.gz) |
v1.3 mc30 | 1.7GB (tar.gz) |
v1.3 mc90 | 0.9GB (tar.gz) |
v1.2 mc5 | 6.7GB (tar.gz) |
v1.2 mc15 | 3.0GB (tar.gz) |
v1.2 mc30 | 1.9GB (tar.gz) |
v1.2 mc90 | 1.0GB (tar.gz) |
A library for using large-scale pre-trained language models with the Japanese tokenizer SudachiPy. Please refer to chiTra for details.
Version | Normalized | SudachiTra | Sudachi | SudachiDict | Text | Pretrained Model |
---|---|---|---|---|---|---|
v1.0 | normalized_and_surface | v0.1.7 | v0.6.2 | 20211220-core | NWJC (148GB) | 395MB (tar.gz) |
v1.1 | normalized_nouns | v0.1.8 | v0.6.6 | 20220729-core | NWJC with additional cleaning (79GB) | 396MB (tar.gz) |