sudachipy package

config.Config

class sudachipy.config.Config(system: str = None, user: list[str] = None, projection: str = 'surface', connectionCostPlugin: list = None, oovProviderPlugin: list = None, pathRewritePlugin: list = None, inputTextPlugin: list = None, characterDefinitionFile: str = None)

SudachiPy rich configuration object.

Fields passed here will override the fields in the default configuration.

as_jsons()

Convert this Config object to a JSON string.

projection: str = 'surface'

Selects the field that sudachipy.Morpheme.surface() returns in place of the surface string. This option also applies to pre-tokenizers created for a given dictionary. The original surface remains available as sudachipy.Morpheme.raw_surface().

This option was created for chiTra integration.

Available options:

  • surface

  • normalized

  • reading

  • dictionary

  • dictionary_and_surface

  • normalized_and_surface

  • normalized_nouns

system: str = None

Path to a dictionary file, or one of three special strings: ‘small’, ‘core’, ‘notcore’. If the value is not one of the three special strings and no file exists at the specified path, an error is raised. To use a dictionary file whose name matches one of the predefined names, use a relative path, e.g. ‘./core’ instead of ‘core’.

If the value is one of the three special strings and no file with that name exists, the dictionary is loaded from the installed SudachiDict_{system} package. For example, for “core” the dictionary is loaded from the installed SudachiDict_core package.
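The lookup order described above can be sketched in plain Python. This is only a model of the documented rule, not SudachiPy's own implementation; the function name resolve_system_dictionary and its return shape are hypothetical.

```python
from pathlib import Path

# The three special dictionary names listed in the description above.
SPECIAL_NAMES = ("small", "core", "notcore")


def resolve_system_dictionary(system: str) -> tuple[str, str]:
    """Hypothetical helper that mirrors the documented rule.

    Returns ("file", path) when a file exists at `system`, or
    ("package", name) when `system` is a special name with no such file.
    """
    if Path(system).is_file():
        # An existing file is used as the dictionary.
        return ("file", system)
    if system in SPECIAL_NAMES:
        # No file with that name: fall back to the installed dictionary package.
        return ("package", f"SudachiDict_{system}")
    # Neither an existing file nor a special name.
    raise FileNotFoundError(f"no dictionary file at {system!r}")
```

Under this model, passing ‘./core’ can never resolve to a package, which is why a relative path makes the intent to load a local file explicit.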

user: list[str] = None

Paths to user dictionaries. At most 14 user dictionaries can be specified.
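As a minimal sketch of how the fields above might be combined, the following checks the documented constraints before building a Config. The helper names config_kwargs and make_dictionary are hypothetical, and it is an assumption that checking these values up front is useful; SudachiPy performs its own checks. SudachiPy is imported lazily, so the validation part runs without it.

```python
# Projection names listed under "Available options" above.
PROJECTIONS = {
    "surface", "normalized", "reading", "dictionary",
    "dictionary_and_surface", "normalized_and_surface", "normalized_nouns",
}
# Documented upper bound on the number of user dictionaries.
MAX_USER_DICTIONARIES = 14


def config_kwargs(system=None, user=None, projection="surface"):
    """Hypothetical helper: validate values and return keyword arguments for Config."""
    if projection not in PROJECTIONS:
        raise ValueError(f"unknown projection: {projection!r}")
    if user is not None and len(user) > MAX_USER_DICTIONARIES:
        raise ValueError("at most 14 user dictionaries are supported")
    return {"system": system, "user": user, "projection": projection}


def make_dictionary(**kwargs):
    """Build a Dictionary from validated values.

    Requires sudachipy and a dictionary to be installed.
    """
    # Imported lazily so config_kwargs stays usable without SudachiPy.
    from sudachipy import Dictionary
    from sudachipy.config import Config

    return Dictionary(config=Config(**config_kwargs(**kwargs)))
```

With a dictionary package installed, make_dictionary(system="core", projection="normalized") would be one way to obtain a dictionary whose morphemes report the normalized form from surface().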

Dictionary

  • Dictionary does not provide access to the grammar and lexicon.

class sudachipy.Dictionary(config_path=None, resource_dir=None, dict=None, dict_type=None, *, config=None)

A sudachi dictionary

close()

Close this dictionary

create($self, mode: sudachipy.SplitMode = sudachipy.SplitMode.C) sudachipy.Tokenizer

Creates a sudachi tokenizer.

Parameters:
lookup($self, surface, out = None) sudachipy.MorphemeList

Look up morphemes in the binary dictionary without performing the analysis. All morphemes from the dictionary with the given surface string are returned, with the last user dictionary searched first and the system dictionary searched last. Within a dictionary, morphemes are returned in the order in which they are stored in the binary dictionary. Morphemes which are not indexed are not returned.

Parameters:

out – type: sudachipy.MorphemeList
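The search order described for lookup can be modelled with plain lists. This is only an illustration of the ordering rule, not how the binary dictionary is actually searched; the function name and the (surface, label) entry shape are hypothetical.

```python
def lookup_order(system_entries, user_entries_list, surface):
    """Model of the documented result order for a surface lookup.

    Each dictionary is a list of (surface, label) pairs in stored order.
    `user_entries_list` holds the user dictionaries in the order they were given.
    """
    results = []
    # The last user dictionary is searched first.
    for entries in reversed(user_entries_list):
        results.extend(e for e in entries if e[0] == surface)
    # The system dictionary is searched last.
    results.extend(e for e in system_entries if e[0] == surface)
    return results
```

Within each dictionary the matches keep their stored order, because each list is scanned from front to back.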

pos_matcher(target)

Creates a POS matcher object

If target is a function, then it must return whether a POS should match or not. If target is a list, it should contain partially specified POS tuples. Partially specified means that POS fields may be omitted, or that None may be used as a sentinel value that matches any value in that field.

For example, (‘名詞’,) will match any noun and (None, None, None, None, None, ‘終止形‐一般’) will match any word in 終止形‐一般 conjugation form.

Parameters:

target – can be either a callable or list of POS partial tuples
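The matching rule for partially specified POS tuples can be sketched in plain Python. This is a model of the semantics described above, not the library's matcher; the function names are hypothetical, and it is an assumption that a list of partial tuples matches when any one of them matches.

```python
def pos_matches(partial, pos):
    """Return True if a partial POS tuple matches a full six-element POS tuple.

    Omitted trailing fields and None both act as wildcards.
    """
    # zip stops at the shorter tuple, so omitted fields are never compared.
    return all(want is None or want == have for want, have in zip(partial, pos))


def any_pos_matches(partials, pos):
    """Assumed list semantics: at least one partial tuple must match."""
    return any(pos_matches(p, pos) for p in partials)


# Illustrative POS tuples; the exact values depend on the dictionary in use.
noun = ("名詞", "普通名詞", "一般", "*", "*", "*")
verb = ("動詞", "一般", "*", "*", "五段‐カ行", "終止形‐一般")

print(pos_matches(("名詞",), noun))
print(pos_matches((None, None, None, None, None, "終止形‐一般"), verb))
print(any_pos_matches([("名詞",), ("形容詞",)], verb))
```

With the library itself, the same partial tuples would be passed as the target of pos_matcher, for example a list containing (‘名詞’,).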

pos_of()

Get POS Tuple by its id

pre_tokenizer($self, mode, fields, handler) tokenizers.PreTokenizer

Creates HuggingFace Tokenizers-compatible PreTokenizer. Requires package tokenizers to be installed.

Parameters:

SplitMode

class sudachipy.SplitMode

Unit to split text

A == short mode

B == middle mode

C == long mode

Tokenizer

class sudachipy.Tokenizer

Sudachi Tokenizer, Python version

SplitMode = <sudachipy.tokenizer.SplitMode object>
tokenize($self, text: str, mode: SplitMode = None, logger = None, out = None) sudachipy.MorphemeList

Break text into morphemes.

SudachiPy 0.5.* had a logger parameter; it is still accepted but ignored.
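A minimal usage sketch, assuming sudachipy and a dictionary package such as SudachiDict_core are installed: tokenize a string and collect a few of the accessors documented in the Morpheme section. The helper names describe and tokenize_and_describe are hypothetical. SudachiPy is imported lazily, so describe works with any objects that expose the same methods.

```python
def describe(morphemes):
    """Collect a few documented Morpheme accessors into plain dictionaries."""
    return [
        {
            "surface": m.surface(),
            "dictionary_form": m.dictionary_form(),
            "reading": m.reading_form(),
            "pos": m.part_of_speech(),
            "span": (m.begin(), m.end()),
            "oov": m.is_oov(),
        }
        for m in morphemes
    ]


def tokenize_and_describe(text):
    """Tokenize `text` in split mode C with the default dictionary.

    Requires sudachipy and a dictionary package to be installed.
    """
    # Lazy import keeps describe() usable without SudachiPy.
    from sudachipy import Dictionary, SplitMode

    tokenizer = Dictionary().create(mode=SplitMode.C)
    return describe(tokenizer.tokenize(text))
```

Creating the tokenizer once and reusing it across calls is likely preferable in real code; it is rebuilt here only to keep the sketch short.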

Parameters:

Morpheme

  • Class method MorphemeList.empty() -> MorphemeList is deprecated.
    • Use Tokenizer.tokenize("") if you need an empty list.

class sudachipy.MorphemeList

A list of morphemes

empty(dict: sudachipy.Dictionary) sudachipy.MorphemeList

Returns an empty morpheme list bound to the given dictionary

get_internal_cost($self) int

Returns the total cost of the path

size($self) int

Returns the number of morphemes in this list.

  • Method Morpheme.get_word_info(self) -> WordInfo is deprecated.

class sudachipy.Morpheme
begin($self) int

Returns the begin index of this morpheme in the input text

dictionary_form($self) str

Returns the dictionary form

dictionary_id($self) int

Returns the id of the dictionary to which this word belongs

end($self) int

Returns the end index of this morpheme in the input text

get_word_info($self) sudachipy.WordInfo

Returns the word info

is_oov($self) bool

Returns whether this is an out-of-vocabulary word

normalized_form($self) str

Returns the normalized form

part_of_speech()

Returns the part of speech as a six-element tuple. Tuple elements are four POS levels, conjugation type and conjugation form.
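For readability, the six-element tuple can be wrapped in a named tuple. The class PartOfSpeech and its field names are hypothetical; only the element order (four POS levels, conjugation type, conjugation form) comes from the description above.

```python
from typing import NamedTuple


class PartOfSpeech(NamedTuple):
    """Hypothetical named view over the six-element POS tuple."""

    pos1: str
    pos2: str
    pos3: str
    pos4: str
    conjugation_type: str
    conjugation_form: str


def named_pos(pos):
    """Wrap a six-element POS tuple so its fields can be read by name."""
    if len(pos) != 6:
        raise ValueError(f"expected 6 POS elements, got {len(pos)}")
    return PartOfSpeech(*pos)
```

A value returned by part_of_speech() could then be read as named_pos(m.part_of_speech()).conjugation_form instead of indexing position 5.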

part_of_speech_id($self) int

Returns the id of the part of speech in the dictionary

raw_surface($self) str

Returns the substring of input text corresponding to the morpheme, regardless of the configured projection

reading_form($self) str

Returns the reading form

split($self, mode, out = None, add_single = False) sudachipy.MorphemeList

Returns sub-morphemes in the provided split mode.

Parameters:
surface($self) str

Returns the substring of input text corresponding to the morpheme, or a projection if one is configured

synonym_group_ids($self) List[int]

Returns the list of synonym group ids

word_id($self) int

Returns the word id of this word in the dictionary

WordInfo

class sudachipy.WordInfo
a_unit_split
b_unit_split
dictionary_form
dictionary_form_word_id
head_word_length
length()
normalized_form
pos_id
reading_form
surface
synonym_group_ids
word_structure