sudachipy package

config.Config

class sudachipy.config.Config(system: str = None, user: list[str] = None, projection: str = 'surface', connectionCostPlugin: list = None, oovProviderPlugin: list = None, pathRewritePlugin: list = None, inputTextPlugin: list = None, characterDefinitionFile: str = None)[source]

SudachiPy rich configuration object.

Fields passed here will override the fields in the default configuration.

as_jsons()[source]

Convert this Config object to a JSON string.

projection: str = 'surface'

Output the selected field as the result of sudachipy.Morpheme.surface() instead of the surface itself. This option also applies to pre-tokenizers created for a given dictionary. The original value remains available as sudachipy.Morpheme.raw_surface().

This option was created for chiTra integration.

Available options:

  • surface

  • normalized

  • reading

  • dictionary

  • dictionary_and_surface

  • normalized_and_surface

  • normalized_nouns

system: str = None

Path to a dictionary file, or one of three special strings: ‘small’, ‘core’, ‘full’. If the value is not one of the three special strings and no file exists at the specified path, an error is raised. To use a dictionary file whose name matches one of the predefined names, use a relative path, e.g. ‘./core’ instead of ‘core’.

If the value is one of the three special strings and no file with that name exists, the dictionary is loaded from the installed SudachiDict_{system} package. For example, for “core” the dictionary is loaded from the installed SudachiDict_core package.

user: list[str] = None

Paths to user dictionaries. At most 14 user dictionaries can be specified.
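The fields above can be combined as in the following minimal sketch, which assumes sudachipy and the SudachiDict_core package are installed. The helper names config_kwargs and make_dictionary are hypothetical and not part of the library; only the field names (system, user, projection) and the limit of 14 user dictionaries come from the reference above.

```python
MAX_USER_DICTS = 14  # limit stated in the reference above


def config_kwargs(system="core", user=None, projection="surface"):
    """Collect and validate keyword arguments for sudachipy.config.Config.

    Hypothetical helper: it only checks the user-dictionary limit up front.
    """
    user = list(user or [])
    if len(user) > MAX_USER_DICTS:
        raise ValueError(f"at most {MAX_USER_DICTS} user dictionaries are supported")
    kwargs = {"system": system, "projection": projection}
    if user:
        kwargs["user"] = user
    return kwargs


def make_dictionary(**kwargs):
    """Build a Dictionary from a Config object.

    sudachipy is imported lazily so the helper above stays usable without it.
    """
    from sudachipy import Dictionary
    from sudachipy.config import Config

    config = Config(**config_kwargs(**kwargs))
    return Dictionary(config=config)


# Example call (needs sudachipy and SudachiDict_core; the path is a placeholder):
#   dictionary = make_dictionary(system="core", user=["./my_user.dic"])
```

Checking the limit before constructing the Config is a design choice of this sketch; it reports the problem earlier than waiting for the dictionary to load.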

Dictionary

  • Dictionary does not provide access to the grammar and lexicon.

class sudachipy.Dictionary(config_path=None, resource_dir=None, dict=None, dict_type=None, *, config=None)

A sudachi dictionary.

If neither config.systemDict nor dict is given, sudachidict_core is used. If both config.systemDict and dict are given, dict takes precedence. If dict is an absolute path to a file, that file is used as the dictionary.

Parameters:
  • config_path (Config | pathlib.Path | str | None) – path to the configuration JSON file, the configuration JSON as a string, or a sudachipy.Config object.

  • config (Config | pathlib.Path | str | None) – alias for config_path; only one of the two can be specified.

  • resource_dir (pathlib.Path | str | None) – path to the resource directory.

  • dict (pathlib.Path | str | None) – type of pre-packaged dictionary, referring to the sudachidict_<dict> packages on PyPI: https://pypi.org/search/?q=sudachidict. It can also be an absolute path to a compiled dictionary file.

  • dict_type (pathlib.Path | str | None) – deprecated alias to dict.
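The precedence between config.systemDict and dict described above can be illustrated with a small plain-Python sketch. The function resolve_system_dict is hypothetical and only mirrors the documented rule; it is not the library's implementation.

```python
def resolve_system_dict(config_system_dict=None, dict_arg=None):
    """Mirror the documented choice of system dictionary.

    - dict wins when it is given,
    - otherwise the value from the configuration is used,
    - otherwise the default "core" (sudachidict_core) applies.
    """
    if dict_arg is not None:
        return dict_arg
    if config_system_dict is not None:
        return config_system_dict
    return "core"


# With the real library (needs sudachipy and a sudachidict_* package):
#   from sudachipy import Dictionary
#   tokenizer = Dictionary(dict="small").create()
```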

close()

Close this dictionary.

create(self, /, mode=SplitMode.C, fields=None, *, projection=None) -> Tokenizer

Creates a sudachi tokenizer.

Parameters:

lookup(self, /, surface, out=None) -> MorphemeList

Look up morphemes in the binary dictionary without performing the analysis.

All morphemes from the dictionary with the given surface string are returned, with the last user dictionary searched first and the system dictionary searched last. Within a dictionary, morphemes are returned in the order in which they are stored in the binary dictionary. Morphemes that are not indexed are not returned.

Parameters:

pos_matcher(target)

Creates a POS matcher object

If target is a function, it must return whether a POS should match or not. If target is a list, it should contain partially specified POS tuples. Partially specified means that trailing POS fields may be omitted, and None may be used as a sentinel value that matches any value in that position.

For example, (‘名詞’,) will match any noun and (None, None, None, None, None, ‘終止形‐一般’) will match any word in 終止形‐一般 conjugation form.

Parameters:

target (Iterable[PartialPOS] | Callable[[POS], bool]) – can be either a list of POS partial tuples or a callable which maps POS to bool.
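The partial-matching rule described above can be sketched in plain Python. The functions partial_pos_matches and matches_any are hypothetical illustrations of the rule, not the library's matcher; the commented usage at the end assumes the object returned by pos_matcher can be called on a morpheme.

```python
def partial_pos_matches(partial, pos):
    """Return True if every specified field of `partial` equals that field of `pos`.

    None in `partial` matches anything; fields omitted at the end of
    `partial` are not compared at all (zip stops at the shorter tuple).
    """
    return all(want is None or want == have for want, have in zip(partial, pos))


def matches_any(partials, pos):
    """Return True if at least one partial POS tuple matches `pos`."""
    return any(partial_pos_matches(partial, pos) for partial in partials)


# With the real library (assumption: the matcher object is callable on a Morpheme):
#   is_noun = dictionary.pos_matcher([("名詞",)])
#   nouns = [m for m in tokenizer.tokenize(text) if is_noun(m)]
```

Passing a list of tuples rather than a callable keeps the matching declarative; a callable is the more flexible option when the condition cannot be written as fixed field values.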

pos_of(self, /, pos_id: int) -> tuple[str, str, str, str, str, str] | None

Returns POS with the given id.

Parameters:

pos_id (int) – POS id

Returns:

POS tuple with the given id, or None if the id does not exist.

pre_tokenizer(self, /, mode=None, fields=None, handler=None, *, projection=None) -> tokenizers.PreTokenizer

Creates a HuggingFace Tokenizers-compatible PreTokenizer. Requires the tokenizers package to be installed.

Parameters:

SplitMode

class sudachipy.SplitMode(mode=None)

Unit to split text.

A == short mode

B == middle mode

C == long mode

Parameters:

mode (str | None) – string representation of the split mode. One of [A,B,C] in capital or lower case. If None, returns SplitMode.C.
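The accepted values for mode can be summarized by the following plain-Python sketch. The function normalize_split_mode is hypothetical and only mirrors the documented rule (case-insensitive A, B or C, with None meaning C); raising ValueError for other input is an assumption of the sketch, not documented library behaviour.

```python
VALID_MODES = ("A", "B", "C")


def normalize_split_mode(mode=None):
    """Return the canonical mode letter for a SplitMode-style argument."""
    if mode is None:
        return "C"  # documented default
    letter = str(mode).upper()
    if letter not in VALID_MODES:
        # ValueError is this sketch's choice for invalid input.
        raise ValueError(f"unknown split mode: {mode!r}")
    return letter


# With the real library (needs sudachipy):
#   from sudachipy import SplitMode
#   mode = SplitMode("a")   # the short mode, per the description above
```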

Tokenizer

class sudachipy.Tokenizer

A sudachi tokenizer

Create using Dictionary.create method.

SplitMode = SplitMode.C

mode

SplitMode of the tokenizer.

tokenize(self, /, text: str, mode=None, logger=None, out=None) -> MorphemeList

Break text into morphemes.

Parameters:
  • text (str) – text to analyze.

  • mode (SplitMode | str | None) – analysis mode. This parameter is deprecated. Pass the analysis mode at the Tokenizer creation time and create different tokenizers for different modes. If you need multi-level splitting, prefer using Morpheme.split() method instead.

  • logger – Arg for v0.5.* compatibility. Ignored.

  • out (MorphemeList) – tokenization results will be written into this MorphemeList; if it is not given, a new one will be created instead. See https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html for details.
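A minimal sketch of tokenizing several texts while reusing one MorphemeList through the out parameter, assuming sudachipy and the sudachidict_core package are installed. The helper names make_tokenizer and tokenize_surfaces are hypothetical; the split mode is fixed at creation time, as recommended above.

```python
def make_tokenizer(mode="C"):
    """Create a tokenizer for one split mode (sudachipy is imported lazily)."""
    from sudachipy import Dictionary, SplitMode

    return Dictionary().create(mode=SplitMode(mode))


def tokenize_surfaces(tokenizer, texts):
    """Return the surface strings of each text's morphemes.

    The list returned by the previous call is passed back as `out`, so the
    strings are copied out before the list is overwritten by the next call.
    """
    out = None
    results = []
    for text in texts:
        out = tokenizer.tokenize(text, out=out)
        results.append([m.surface() for m in out])
    return results


# Example call (needs sudachipy and sudachidict_core):
#   tokenize_surfaces(make_tokenizer("A"), ["国家公務員", "東京都"])
```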

Morpheme

  • Class method MorphemeList.empty() -> MorphemeList is deprecated.
    • Use Tokenizer.tokenize("") if you need an empty list.

class sudachipy.MorphemeList

A list of morphemes.

Objects of this class cannot be instantiated manually. Use Tokenizer.tokenize("") to create an empty morpheme list.

empty(dict: Dictionary) -> MorphemeList

Returns an empty morpheme list for the given dictionary.

Deprecated since version 0.6.0: Use Tokenizer.tokenize("") if you need an empty list.

get_internal_cost(self, /) -> int

Returns the total cost of the path.

size(self, /) -> int

Returns the number of morphemes in this list.

  • Method Morpheme.get_word_info(self) -> WordInfo is deprecated.

class sudachipy.Morpheme

A morpheme (basic semantic unit of language).

begin(self, /) -> int

Returns the begin index of this morpheme in the input text.

dictionary_form(self, /) -> str

Returns the dictionary form.

dictionary_id(self, /) -> int

Returns the id of the dictionary to which this word belongs.

end(self, /) -> int

Returns the end index of this morpheme in the input text.

get_word_info(self, /) -> WordInfo

Returns the word info.

Deprecated since version 0.6.0: Users should not touch the raw WordInfo.

is_oov(self, /) -> bool

Returns whether this is an out-of-vocabulary word.

normalized_form(self, /) -> str

Returns the normalized form.

part_of_speech(self, /) -> tuple[str, str, str, str, str, str]

Returns the part of speech as a six-element tuple. Tuple elements are four POS levels, conjugation type and conjugation form.

part_of_speech_id(self, /) -> int

Returns the id of the part of speech in the dictionary.

raw_surface(self, /) -> str

Returns the substring of input text corresponding to the morpheme, regardless of the configured projection.

See Config.projection.

reading_form(self, /) -> str

Returns the reading form.

split(self, /, mode, out=None, add_single=False) -> MorphemeList

Returns sub-morphemes in the provided split mode.

Parameters:
  • mode (SplitMode | None) – mode of new split.

  • out (MorphemeList | None) – write results to this MorphemeList instead of creating a new one. See https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html for more information on output parameters. The returned MorphemeList will be invalidated if this MorphemeList is used as an output parameter.

  • add_single (bool) – return lists with the current morpheme if the split hasn’t produced any elements. When False is passed, empty lists are returned instead.
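Multi-level splitting with split() might look like the following sketch. The helper split_tree is hypothetical; it works on any morpheme list returned by a tokenizer and passes add_single=True so that morphemes without sub-units still yield themselves.

```python
def split_tree(morphemes, mode):
    """Pair each morpheme's surface with the surfaces of its sub-morphemes.

    `mode` is the finer split mode to apply, e.g. sudachipy.SplitMode.A.
    With add_single=True a morpheme that cannot be split further is
    returned as a one-element list instead of an empty one.
    """
    return [
        (m.surface(), [sub.surface() for sub in m.split(mode, add_single=True)])
        for m in morphemes
    ]


# With the real library (needs sudachipy and sudachidict_core):
#   from sudachipy import Dictionary, SplitMode
#   tokenizer = Dictionary().create(mode=SplitMode.C)
#   split_tree(tokenizer.tokenize("国家公務員"), SplitMode.A)
```

Tokenizing once in the longest mode and splitting on demand avoids keeping one tokenizer per mode, which is the approach the tokenize() documentation recommends.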

surface(self, /) -> str

Returns the substring of input text corresponding to the morpheme, or a projection if one is configured.

See Config.projection.
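The difference between surface() and raw_surface() under a projection can be inspected with a sketch like the one below, assuming sudachipy and the sudachidict_core package are installed. The helper names make_projected_tokenizer and surface_pairs are hypothetical; the projection value is one of the options listed under Config.projection.

```python
def make_projected_tokenizer(projection="normalized"):
    """Create a tokenizer whose surface() returns the projected field."""
    from sudachipy import Dictionary

    return Dictionary().create(projection=projection)


def surface_pairs(tokenizer, text):
    """Return (raw_surface, surface) for every morpheme of `text`.

    raw_surface() is always the original span of the input text, while
    surface() reflects the configured projection.
    """
    return [(m.raw_surface(), m.surface()) for m in tokenizer.tokenize(text)]


# Example call (needs sudachipy and sudachidict_core):
#   surface_pairs(make_projected_tokenizer("reading"), "東京都に行く")
```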

synonym_group_ids(self, /) -> List[int]

Returns the list of synonym group ids.

word_id(self, /) -> int

Returns word id of this word in the dictionary.

WordInfo

class sudachipy.WordInfo
a_unit_split
b_unit_split
dictionary_form
dictionary_form_word_id
head_word_length
length()
normalized_form
pos_id
reading_form
surface
synonym_group_ids
word_structure