
WAP Tokushima NLP Resources

Natural language processing software and language resources provided by WAP Tokushima Laboratory of AI and NLP.

Software

We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.

SudachiDict, chiVe, and chiTra data are generously hosted by AWS through its Open Data Sponsorship Program.

Japanese dictionaries for morphological analysis. Please refer to SudachiDict for details.

Click here for pre-built dictionaries.

Japanese pretrained word embeddings. Please refer to chiVe for details.

Version	Normalized	Min Count	Vocab	Sudachi	SudachiDict	Text	gensim	Magnitude
v1.3 mc5	o	5	2,530,791	v0.6.8	20240109-core	3.6GB (tar.gz)	2.9GB (tar.gz)	-
v1.3 mc15	o	15	1,186,019	v0.6.8	20240109-core	1.7GB (tar.gz)	1.3GB (tar.gz)	-
v1.3 mc30	o	30	759,011	v0.6.8	20240109-core	1.1GB (tar.gz)	0.8GB (tar.gz)	-
v1.3 mc90	o	90	410,533	v0.6.8	20240109-core	0.6GB (tar.gz)	0.5GB (tar.gz)	-

v1.2 mc5	o	5	3,197,456	v0.4.3	20200722-core	9.2GB (tar.gz)	3.8GB (tar.gz)	5.5GB (.magnitude)
v1.2 mc15	o	15	1,454,280	v0.4.3	20200722-core	5.0GB (tar.gz)	1.7GB (tar.gz)	2.4GB (.magnitude)
v1.2 mc30	o	30	912,550	v0.4.3	20200722-core	3.1GB (tar.gz)	1.1GB (tar.gz)	1.5GB (.magnitude)
v1.2 mc90	o	90	482,223	v0.4.3	20200722-core	1.7GB (tar.gz)	0.6GB (tar.gz)	0.8GB (.magnitude)

v1.1 mc5	o	5	3,196,481	v0.3.0	20191030-core	11GB (tar.gz)	3.6GB (tar.gz)	5.5GB (.magnitude)
v1.1 mc15	o	15	1,452,205	v0.3.0	20191030-core	4.7GB (tar.gz)	1.7GB (tar.gz)	2.4GB (.magnitude)
v1.1 mc30	o	30	910,424	v0.3.0	20191030-core	3.0GB (tar.gz)	1.1GB (tar.gz)	1.5GB (.magnitude)
v1.1 mc90	o	90	480,443	v0.3.0	20191030-core	1.6GB (tar.gz)	0.6GB (tar.gz)	0.8GB (.magnitude)
v1.0 mc5	x	5	3,644,628	v0.1.1	0.1.1-dictionary-full	12GB (tar.gz)	4.1GB (tar.gz)	6.3GB (.magnitude)
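
As a rough illustration of working with the Text distribution, the sketch below reads vectors from a word2vec-style plain-text stream and compares two words by cosine similarity. It assumes the Text files use the word2vec text layout (a header line with vocabulary size and dimension, then one word and its values per line); that layout is an assumption here, so check the chiVe documentation before relying on it. The helper names `read_word2vec_text` and `cosine`, and the tiny three-word sample, are hypothetical and exist only for this example.

```python
import io
import math


def read_word2vec_text(stream, limit=None):
    """Read a word2vec-style text stream into a {word: [float, ...]} dict.

    Assumed layout (not confirmed by this page): the first line is
    "<vocab_size> <dim>", and each following line is a word followed by
    <dim> space-separated floats.
    """
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in stream:
        # Split from the right so a word containing spaces stays intact.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # the larger files hold millions of entries
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Tiny made-up sample standing in for a real download.
sample = io.StringIO("3 2\n犬 1.0 0.0\n猫 0.8 0.6\n車 0.0 1.0\n")
vectors = read_word2vec_text(sample)
similarity = cosine(vectors["犬"], vectors["猫"])
print(f"similarity(犬, 猫) = {similarity:.2f}")
```

For a real file, open it with `open(path, encoding="utf-8")` and pass a `limit` to avoid loading millions of entries at once. The gensim distribution is likely the more convenient choice for everyday use.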

Version	Vocab	Text	gensim	Magnitude
v1.1 mc5 aunit	322,094 (10.1%)	1.1GB (tar.gz)	0.4GB (tar.gz)	0.5GB (.magnitude)
v1.1 mc15 aunit	276,866 (19.1%)	1.0GB (tar.gz)	0.3GB (tar.gz)	0.4GB (.magnitude)
v1.1 mc30 aunit	242,658 (26.7%)	0.8GB (tar.gz)	0.3GB (tar.gz)	0.4GB (.magnitude)
v1.1 mc90 aunit	189,775 (39.5%)	0.7GB (tar.gz)	0.2GB (tar.gz)	0.3GB (.magnitude)
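
The percentages in the Vocab column appear to be each aunit vocabulary as a share of the corresponding full v1.1 vocabulary. This reading is inferred from the numbers in the two tables, not stated on this page. The short check below reproduces the figures under that assumption; the variable names are illustrative only.

```python
# Vocabulary sizes copied from the v1.1 rows of the two tables above.
full_vocab = {"mc5": 3_196_481, "mc15": 1_452_205, "mc30": 910_424, "mc90": 480_443}
aunit_vocab = {"mc5": 322_094, "mc15": 276_866, "mc30": 242_658, "mc90": 189_775}

# Share of the full vocabulary kept in each aunit variant, in percent,
# rounded to one decimal place as in the table.
shares = {
    name: round(100 * aunit_vocab[name] / full_vocab[name], 1)
    for name in full_vocab
}

for name, share in shares.items():
    print(f"v1.1 {name} aunit: {share}% of the full vocabulary")
```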

A library for using large-scale pre-trained language models with the Japanese tokenizer SudachiPy. Please refer to chiTra for details.

Version	Normalized	SudachiTra	Sudachi	SudachiDict	Training Text	Pretrained Model
v1.0	normalized_and_surface	v0.1.7	0.6.2	20211220-core	NWJC (148GB)	395 MB (tar.gz)
v1.1	normalized_nouns	v0.1.8	0.6.6	20220729-core	NWJC with additional cleaning (79GB)	396 MB (tar.gz)