sudachipy package

config.Config

class sudachipy.config.Config(system: str = None, user: list[str] = None, projection: str = 'surface', connectionCostPlugin: list = None, oovProviderPlugin: list = None, pathRewritePlugin: list = None, inputTextPlugin: list = None, characterDefinitionFile: str = None)

SudachiPy rich configuration object.

Fields passed here will override the fields in the default configuration.

as_jsons(): Convert this Config object to a JSON string.

projection: str = 'surface'

Output the selected field as the result of sudachipy.Morpheme.surface() instead of the surface itself. This option works for pre-tokenizers created for a given dictionary as well. The original value is available as sudachipy.Morpheme.raw_surface().

This option was created for chiTra integration.

Available options:

surface
normalized
reading
dictionary
dictionary_and_surface
normalized_and_surface
normalized_nouns

system: str = None

Path to a dictionary file, or one of three strings: ‘small’, ‘core’, ‘full’. If the value is not one of these three special values and no file exists at the specified path, an error is raised. To use a dictionary file whose name matches one of the predefined names, use a relative path, e.g. ‘./core’ instead of ‘core’.

If the value is one of three special values and there does not exist a file with the same name, we try to load the dictionary from the SudachiDict_{system} installed package. For example, for “core” we will try to load the dictionary from the installed SudachiDict_core package.
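The resolution rule described above can be modelled in a few lines of plain Python. This is only a sketch of the documented behaviour, not the library's implementation: `resolve_system_dict`, its `is_file` parameter, and the choice of `FileNotFoundError` are all hypothetical.

```python
import os

# The three names the documentation treats as special values.
PREDEFINED = ("small", "core", "full")


def resolve_system_dict(value, is_file=os.path.isfile):
    """Hypothetical helper: classify a `system` value as a file or a package."""
    if is_file(value):
        # An existing file is used as the dictionary.
        return ("file", value)
    if value in PREDEFINED:
        # No such file: fall back to the installed dictionary package.
        return ("package", "SudachiDict_" + value)
    # Neither a file nor a special value (assumed error type).
    raise FileNotFoundError("no dictionary file or predefined name: %r" % value)
```

The `is_file` parameter exists only so the rule can be tried out without touching the file system.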

user: list[str] = None: Paths to user dictionaries. At most 14 user dictionaries can be specified.

Dictionary

Dictionary does not provide access to the grammar or the lexicon.

class sudachipy.Dictionary(config_path=None, resource_dir=None, dict=None, dict_type=None, *, config=None)

A sudachi dictionary.

If neither config.systemDict nor dict is given, sudachidict_core is used. If both config.systemDict and dict are given, dict is used. If dict is an absolute path to a file, that file is used as the dictionary.

Parameters:

config_path (Config | pathlib.Path | str | None) – path to the configuration JSON file, the configuration JSON as a string, or a sudachipy.Config object.
config (Config | pathlib.Path | str | None) – alias for config_path; only one of the two can be specified at a time.
resource_dir (pathlib.Path | str | None) – path to the resource directory.
dict (pathlib.Path | str | None) – type of pre-packaged dictionary, referring to sudachidict_<dict> packages on PyPI: https://pypi.org/search/?q=sudachidict. Can also be an absolute path to a compiled dictionary file.
dict_type (pathlib.Path | str | None) – deprecated alias for dict.

close(): Close this dictionary.

create(self, /, mode=SplitMode.C, fields=None, *, projection=None) → Tokenizer

Creates a sudachi tokenizer.

Parameters:

mode (SplitMode | str | None) – sets the analysis mode for this Tokenizer
fields (set[str] | None) – load only a subset of fields. See https://worksapplications.github.io/sudachi.rs/python/topics/subsetting.html.
projection (str | None) – Projection override for created Tokenizer. See Config.projection for values.
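As a usage sketch, the helper below creates a tokenizer from a dictionary and collects the surface of each morpheme. The names `surfaces` and `main` are hypothetical; the sketch assumes that `sudachipy` and the `sudachidict_core` package are installed and that a MorphemeList can be iterated over.

```python
def surfaces(dictionary, text, mode="C", projection=None):
    """Tokenize `text` and return the surface of every morpheme."""
    # create() accepts a SplitMode or its string form; projection=None
    # means no override is requested for the created Tokenizer.
    tokenizer = dictionary.create(mode=mode, projection=projection)
    return [m.surface() for m in tokenizer.tokenize(text)]


def main():
    # Imported here so that surfaces() can be tried without sudachipy installed.
    from sudachipy import Dictionary

    dictionary = Dictionary()  # no arguments: sudachidict_core is used
    print(surfaces(dictionary, "国家公務員", mode="A"))
    print(surfaces(dictionary, "国家公務員", mode="C"))
```

Calling `main()` prints the two segmentations; the helper itself only relies on the `create` and `tokenize` methods documented here.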

lookup(self, /, surface, out=None) → MorphemeList

Look up morphemes in the binary dictionary without performing the analysis.

All morphemes from the dictionary with the given surface string are returned, with the last user dictionary searched first and the system dictionary searched last. Within a dictionary, morphemes are returned in the order in which they are stored in the binary dictionary. Morphemes which are not indexed are not returned.

Parameters:

surface (str) – find all morphemes with the given surface
out (MorphemeList | None) – if passed, reuse the given morpheme list instead of creating a new one. See https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html for details.
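A small usage sketch for lookup: the helper below lists, for every entry found, the dictionary it came from, its word id, and its reading. `lookup_entries` and `main` are hypothetical names, and the sketch assumes `sudachipy` with `sudachidict_core` is installed.

```python
def lookup_entries(dictionary, surface):
    """Return (dictionary_id, word_id, reading_form) for each entry with `surface`."""
    return [
        (m.dictionary_id(), m.word_id(), m.reading_form())
        for m in dictionary.lookup(surface)
    ]


def main():
    # Imported here so that lookup_entries() can be tried without sudachipy installed.
    from sudachipy import Dictionary

    for entry in lookup_entries(Dictionary(), "東京"):
        print(entry)
```

Because no analysis is performed, the result reflects what is stored in the dictionaries rather than how a sentence would be segmented.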

pos_matcher(target)

Creates a POS matcher object.

If target is a function, then it must return whether a POS should match or not. If target is a list, it should contain partially specified POS. Partially specified means that POS fields may be omitted, or that None may be used as a sentinel value that matches any value.

For example, (‘名詞’,) will match any noun and (None, None, None, None, None, ‘終止形‐一般’) will match any word in 終止形‐一般 conjugation form.

Parameters:

target (Iterable[PartialPOS] | Callable[[POS], bool]) – can be either a list of POS partial tuples or a callable which maps POS to bool.
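The partial-matching rule can be illustrated in plain Python. This is a model of the semantics described above, not the library's matcher; `pos_matches` and `any_matches` are hypothetical names.

```python
from typing import Iterable, Optional, Sequence


def pos_matches(partial: Sequence[Optional[str]], pos: Sequence[str]) -> bool:
    """True if every specified field of `partial` equals the same field of `pos`."""
    # zip() stops at the shorter sequence, so omitted trailing fields match anything;
    # None is treated as a wildcard for that position.
    return all(want is None or want == have for want, have in zip(partial, pos))


def any_matches(partials: Iterable[Sequence[Optional[str]]], pos: Sequence[str]) -> bool:
    """A list of partial POS tuples matches if any one of them matches."""
    return any(pos_matches(p, pos) for p in partials)
```

With this model, a one-element tuple such as ('名詞',) constrains only the first POS level, while a tuple of five None values followed by a conjugation form constrains only the last field.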

pos_of(self, /, pos_id: int) → tuple[str, str, str, str, str, str] | None

Returns POS with the given id.

Parameters:

pos_id (int) – POS id

Returns: POS tuple with the given id, or None if the id does not exist.

pre_tokenizer(self, /, mode=None, fields=None, handler=None, *, projection=None) → tokenizers.PreTokenizer

Creates a HuggingFace Tokenizers-compatible PreTokenizer. Requires the tokenizers package to be installed.

Parameters:

mode (SplitMode | str | None) – Use this split mode (C by default)
fields (set[str] | None) – ask Sudachi to load only a subset of fields. See https://worksapplications.github.io/sudachi.rs/python/topics/subsetting.html. Only used when handler is set.
handler (Callable[[int, NormalizedString, MorphemeList], list[NormalizedString]] | None) – a custom callable that transforms a MorphemeList into a list of tokens. If None, surfaces are used as token representations. Overrides projection. It should be a function(index: int, original: NormalizedString, morphemes: MorphemeList) -> List[NormalizedString]. See https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/custom_components.py.
projection (str | None) – Projection override for created Tokenizer. See Config.projection for supported values.

SplitMode

class sudachipy.SplitMode(mode=None)

Unit to split text.

A == short mode

B == middle mode

C == long mode

Parameters:

mode (str | None) – string representation of the split mode. One of [A,B,C] in capital or lower case. If None, returns SplitMode.C.
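The parsing rule for the mode string can be modelled in plain Python. The sketch below is not the library's implementation: `ModeSketch` and `parse_mode` are hypothetical names, and raising `ValueError` for an unknown string is an assumption.

```python
from enum import Enum
from typing import Optional


class ModeSketch(Enum):
    """Stand-in for sudachipy.SplitMode, used only in this illustration."""
    A = "short"
    B = "middle"
    C = "long"


def parse_mode(mode: Optional[str] = None) -> ModeSketch:
    """Map 'A'/'B'/'C' (any case) to a mode; None falls back to C."""
    if mode is None:
        return ModeSketch.C
    try:
        return ModeSketch[mode.upper()]
    except KeyError:
        # Assumed error type for anything other than A, B, or C.
        raise ValueError("expected one of A, B, C, got %r" % mode) from None
```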

Tokenizer

class sudachipy.Tokenizer

A sudachi tokenizer.

Create one using the Dictionary.create method.

SplitMode = SplitMode.C

mode: SplitMode of the tokenizer.

tokenize(self, /, text: str, mode=None, logger=None, out=None) → MorphemeList

Break text into morphemes.

Parameters:

text (str) – text to analyze.
mode (SplitMode | str | None) – analysis mode. This parameter is deprecated. Pass the analysis mode at the Tokenizer creation time and create different tokenizers for different modes. If you need multi-level splitting, prefer using Morpheme.split() method instead.
logger – Arg for v0.5.* compatibility. Ignored.
out (MorphemeList) – if passed, tokenization results will be written into this MorphemeList; otherwise a new one will be created. See https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html for details.
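The out parameter lets one MorphemeList be reused across many calls. The sketch below counts morphemes per line that way; `morpheme_counts` and `main` are hypothetical names, and the sketch assumes that each call replaces the previous contents of the list passed as out.

```python
def morpheme_counts(tokenizer, lines):
    """Return the number of morphemes in each line, reusing one MorphemeList."""
    out = tokenizer.tokenize("")  # documented way to obtain an empty MorphemeList
    counts = []
    for line in lines:
        tokenizer.tokenize(line, out=out)  # results are written into `out`
        counts.append(out.size())
    return counts


def main():
    # Imported here so that morpheme_counts() can be tried without sudachipy installed.
    from sudachipy import Dictionary

    tokenizer = Dictionary().create()
    print(morpheme_counts(tokenizer, ["国家公務員", "東京都へ行く"]))
```

Since the list is overwritten on each call, any value needed later (here, the size) is read before the next tokenize call.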

Morpheme

Class method MorphemeList.empty() -> MorphemeList is deprecated.
- Use Tokenizer.tokenize("") if you need an empty list.

class sudachipy.MorphemeList

A list of morphemes.

An object cannot be instantiated manually. Use Tokenizer.tokenize("") to create an empty morpheme list.

classmethod empty(dict: Dictionary) → MorphemeList

Returns an empty morpheme list for the given dictionary.

Deprecated since version 0.6.0: Use Tokenizer.tokenize("") if you need an empty list.

get_internal_cost(self, /) → int

Returns the total cost of the path.

size(self, /) → int

Returns the number of morphemes in this list.

Method Morpheme.get_word_info(self) -> WordInfo is deprecated.

class sudachipy.Morpheme

A morpheme (basic semantic unit of language).

begin(self, /) → int

Returns the begin index of this morpheme in the input text.

dictionary_form(self, /) → str

Returns the dictionary form.

dictionary_id(self, /) → int

Returns the id of the dictionary to which this word belongs.

end(self, /) → int

Returns the end index of this morpheme in the input text.

get_word_info(self, /) → WordInfo

Returns the word info.

Deprecated since version 0.6.0: Users should not touch the raw WordInfo.

is_oov(self, /) → bool

Returns whether this is an out-of-vocabulary word.

normalized_form(self, /) → str

Returns the normalized form.

part_of_speech(self, /) → tuple[str, str, str, str, str, str]

Returns the part of speech as a six-element tuple. Tuple elements are four POS levels, conjugation type and conjugation form.

part_of_speech_id(self, /) → int

Returns the id of the part of speech in the dictionary.

raw_surface(self, /) → str

Returns the substring of the input text corresponding to the morpheme, regardless of the configured projection.

See Config.projection.

reading_form(self, /) → str

Returns the reading form.

split(self, /, mode, out=None, add_single=False) → MorphemeList

Returns sub-morphemes in the provided split mode.

Parameters:

mode (SplitMode | None) – mode of new split.
out (MorphemeList | None) – write results to this MorphemeList instead of creating a new one. See https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html for more information on output parameters. Returned MorphemeList will be invalidated if this MorphemeList is used as an output parameter.
add_single (bool) – return a list containing the current morpheme if the split produces no elements. When False is passed, an empty list is returned instead.
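As a usage sketch of multi-level splitting, the helper below re-splits every morpheme in a finer mode and flattens the result. `split_surfaces` and `main` are hypothetical names; the sketch assumes `sudachipy` with `sudachidict_core` is installed and that the returned MorphemeList can be iterated over.

```python
def split_surfaces(morphemes, mode):
    """Split each morpheme in `mode` and return the surfaces of all parts."""
    parts = []
    for m in morphemes:
        # add_single=True returns the morpheme itself when the split produces
        # nothing, so no morpheme is dropped from the output.
        parts.extend(sub.surface() for sub in m.split(mode, add_single=True))
    return parts


def main():
    # Imported here so that split_surfaces() can be tried without sudachipy installed.
    from sudachipy import Dictionary, SplitMode

    tokenizer = Dictionary().create(mode=SplitMode.C)
    print(split_surfaces(tokenizer.tokenize("国家公務員"), SplitMode.A))
```

With add_single=False the same loop would skip morphemes that cannot be split further, which is useful when only the newly produced sub-morphemes are of interest.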

surface(self, /) → str

Returns the substring of input text corresponding to the morpheme, or a projection if one is configured.

See Config.projection.

synonym_group_ids(self, /) → List[int]

Returns the list of synonym group ids.

word_id(self, /) → int

Returns the word id of this word in the dictionary.

WordInfo

class sudachipy.WordInfo

a_unit_split

b_unit_split

dictionary_form

dictionary_form_word_id

head_word_length

length()

normalized_form

pos_id

reading_form

surface

synonym_group_ids

word_structure