sudachipy package
config.Config
- class sudachipy.config.Config(system: str = None, user: list[str] = None, projection: str = 'surface', connectionCostPlugin: list = None, oovProviderPlugin: list = None, pathRewritePlugin: list = None, inputTextPlugin: list = None, characterDefinitionFile: str = None)
SudachiPy rich configuration object.
Fields passed here will override the fields in the default configuration.
- projection: str = 'surface'
Output the following field as the result of sudachipy.Morpheme.surface() instead of the surface string itself. This option also applies to pre-tokenizers created for a given dictionary. The original value remains available as sudachipy.Morpheme.raw_surface(). This option was created for chiTra integration.
Available options:
surface
normalized
reading
dictionary
dictionary_and_surface
normalized_and_surface
normalized_nouns
- system: str = None
Path to a dictionary, or one of three strings: ‘small’, ‘core’, ‘notcore’. If the value is not one of the three special values and no file exists at the specified path, an error is raised. To use a dictionary file whose name is one of the predefined names, use a relative path, e.g. ‘./core’ instead of ‘core’.
If the value is one of the three special values and no file with the same name exists, the dictionary is loaded from the installed SudachiDict_{system} package. For example, for “core” the dictionary is loaded from the installed SudachiDict_core package.
- user: list[str] = None
Paths to user dictionaries. At most 14 user dictionaries can be specified.
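As a minimal sketch of how these fields fit together, the helper below checks the projection name and the user dictionary limit described above before the values are handed to Config. The name `config_kwargs` is hypothetical and not part of SudachiPy. The commented usage assumes SudachiPy and a dictionary package such as SudachiDict_core are installed.

```python
# Projection names listed in the documentation above.
PROJECTIONS = (
    "surface", "normalized", "reading", "dictionary",
    "dictionary_and_surface", "normalized_and_surface", "normalized_nouns",
)
MAX_USER_DICTIONARIES = 14  # limit stated for the `user` field


def config_kwargs(system=None, user=None, projection="surface"):
    """Hypothetical helper: validate values and return keyword arguments for Config."""
    if projection not in PROJECTIONS:
        raise ValueError(f"unknown projection: {projection!r}")
    user = list(user) if user else None
    if user is not None and len(user) > MAX_USER_DICTIONARIES:
        raise ValueError("at most 14 user dictionaries are supported")
    return {"system": system, "user": user, "projection": projection}


kwargs = config_kwargs(system="core", projection="normalized")

# Usage, assuming SudachiPy and SudachiDict_core are installed:
#   from sudachipy import Dictionary
#   from sudachipy.config import Config
#   dictionary = Dictionary(config=Config(**kwargs))
#   tokenizer = dictionary.create()
```

Validating up front is only a convenience for this sketch; Config itself may accept or reject values differently.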
Dictionary
Dictionary does not provide access to the grammar and lexicon.
- class sudachipy.Dictionary(config_path=None, resource_dir=None, dict=None, dict_type=None, *, config=None)
A sudachi dictionary
- close()
Close this dictionary
- create($self, mode = 'C') -> sudachipy.Tokenizer
Creates a sudachi tokenizer.
- Parameters:
mode – tokenizer’s default split mode (C by default).
fields – load only a subset of fields. See https://worksapplications.github.io/sudachi.rs/python/topics/subsetting.html
- lookup($self, surface, out = None) -> sudachipy.MorphemeList
Look up morphemes in the binary dictionary without performing analysis. All morphemes from the dictionary with the given surface string are returned, with the last user dictionary searched first and the system dictionary searched last. Within a dictionary, morphemes are returned in binary-dictionary order. Morphemes that are not indexed are not returned.
- Parameters:
surface (str) – find all morphemes with the given surface
out (sudachipy.MorphemeList) – if passed, reuse the given morpheme list instead of creating a new one. See https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html for details.
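A short sketch of how lookup results might be consumed. The helper name `lookup_entries` is hypothetical; it relies only on lookup() and on Morpheme accessors documented on this page.

```python
def lookup_entries(dictionary, surface):
    """Return (surface, reading, dictionary id) for every indexed entry.

    `dictionary` is expected to be a sudachipy.Dictionary. No analysis is run,
    so the result lists dictionary entries rather than a tokenization.
    """
    return [
        (m.surface(), m.reading_form(), m.dictionary_id())
        for m in dictionary.lookup(surface)
    ]

# Usage, assuming SudachiPy and a dictionary package are installed:
#   from sudachipy import Dictionary
#   for entry in lookup_entries(Dictionary(), "東京"):
#       print(entry)
```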
- pos_matcher(target)
Creates a POS matcher object
If target is a function, it must return whether a POS should match or not. If target is a list, it should contain partially specified POS tuples. Partially specified means that POS fields can be omitted, or None can be used as a sentinel value that matches anything in that position.
For example, (‘名詞’,) will match any noun and (None, None, None, None, None, ‘終止形‐一般’) will match any word in 終止形‐一般 conjugation form.
- Parameters:
target – can be either a callable or list of POS partial tuples
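The matching rule for partially specified POS tuples can be sketched in plain Python. This is an illustrative re-implementation of the rule described above, not SudachiPy's own code; the function name `partial_pos_matches` and the sample POS tuples are assumptions made for the example.

```python
def partial_pos_matches(partial, pos):
    """Return True if a partially specified POS tuple matches a full POS tuple.

    Omitted trailing fields and None both match any value.
    """
    return all(want is None or want == have for want, have in zip(partial, pos))


CONJ_FORM = "終止形-一般"  # sample conjugation form used only in this sketch
noun = ("名詞", "普通名詞", "一般", "*", "*", "*")
verb = ("動詞", "一般", "*", "*", "五段-カ行", CONJ_FORM)

print(partial_pos_matches(("名詞",), noun))  # True
print(partial_pos_matches(("名詞",), verb))  # False
print(partial_pos_matches((None, None, None, None, None, CONJ_FORM), verb))  # True

# With the real API (assuming SudachiPy and a dictionary package are installed):
#   is_noun = dictionary.pos_matcher([("名詞",)])
#   nouns = [m.surface() for m in tokenizer.tokenize(text) if is_noun(m)]
```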
- pos_of()
Get POS Tuple by its id
- pre_tokenizer($self, mode, fields, handler) -> tokenizers.PreTokenizer
Creates HuggingFace Tokenizers-compatible PreTokenizer. Requires package tokenizers to be installed.
- Parameters:
mode (sudachipy.SplitMode) – Use this split mode (C by default)
fields (Set[str]) – ask Sudachi to load only a subset of fields. See https://worksapplications.github.io/sudachi.rs/python/topics/subsetting.html
handler – a custom callable that transforms a MorphemeList into a list of tokens. It should be a function(index: int, original: NormalizedString, morphemes: MorphemeList) -> List[NormalizedString]. See https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/custom_components.py for an example. If nothing is passed, surfaces are used as token representations.
projection – projection mode for the created PreTokenizer. See the sudachipy.config.Config object documentation for supported projections.
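A handler with the signature described above might look like the following sketch. The name `surface_handler` is hypothetical, and the code assumes that begin() and end() can be used as slice offsets into `original`; check that assumption against your SudachiPy and tokenizers versions before relying on it.

```python
def surface_handler(index, original, morphemes):
    """Hypothetical handler: one token per morpheme, cut out of `original`.

    `original` is expected to be a tokenizers.NormalizedString, which supports
    slicing. Treating begin()/end() as offsets into it is an assumption.
    """
    return [original[m.begin():m.end()] for m in morphemes]

# Usage, assuming SudachiPy, a dictionary package and `tokenizers` are installed:
#   from sudachipy import Dictionary, SplitMode
#   pre_tokenizer = Dictionary().pre_tokenizer(
#       mode=SplitMode.C, handler=surface_handler
#   )
```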
SplitMode
- class sudachipy.SplitMode(mode=None)
Unit to split text
A == short mode
B == middle mode
C == long mode
Tokenizer
- class sudachipy.Tokenizer
Sudachi Tokenizer, Python version
- SplitMode = SplitMode.C
- mode
- tokenize($self, text: str, mode = None, logger = None, out = None) -> sudachipy.MorphemeList
Break text into morphemes.
SudachiPy 0.5.* had a logger parameter; it is still accepted but ignored.
- Parameters:
text (str) – text to analyze
mode (sudachipy.SplitMode) – analysis mode. This parameter is deprecated. Pass the analysis mode at Tokenizer creation time and create different tokenizers for different modes. If you need multi-level splitting, prefer the Morpheme.split() method instead.
out (sudachipy.MorphemeList) – tokenization results will be written into this MorphemeList; if it is not passed, a new one will be created instead. See https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html for details.
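The out parameter can be used to reuse one MorphemeList across many calls, roughly as sketched below. The helper name `count_morphemes` is hypothetical; it uses only tokenize() and MorphemeList.size() as documented on this page.

```python
def count_morphemes(tokenizer, lines):
    """Tokenize each line, reusing one MorphemeList through the `out` parameter.

    The count is copied out before the next call, because the list contents
    are overwritten when the list is passed as `out` again.
    """
    counts = []
    morphemes = None  # the first call creates the list
    for line in lines:
        morphemes = tokenizer.tokenize(line, out=morphemes)
        counts.append(morphemes.size())
    return counts

# Usage, assuming SudachiPy and a dictionary package are installed:
#   from sudachipy import Dictionary, SplitMode
#   tokenizer = Dictionary().create(mode=SplitMode.C)
#   print(count_morphemes(tokenizer, ["東京都へ行く", "国家公務員"]))
```

The split mode is chosen once when the tokenizer is created, in line with the deprecation note on the mode parameter.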
Morpheme
- Class method MorphemeList.empty() -> MorphemeList is deprecated. Use Tokenizer.tokenize("") if you need an empty MorphemeList.
- class sudachipy.MorphemeList
A list of morphemes
- empty(dict: sudachipy.Dictionary) -> sudachipy.MorphemeList
Returns an empty morpheme list associated with the given dictionary
- get_internal_cost($self) -> int
Returns the total cost of the path
- size($self) -> int
Returns the number of morphemes in this list.
Method Morpheme.get_word_info(self) -> WordInfo is deprecated.
- class sudachipy.Morpheme
- begin($self) -> int
Returns the begin index of this morpheme in the input text
- dictionary_form($self) -> str
Returns the dictionary form
- dictionary_id($self) -> int
Returns the id of the dictionary this word belongs to
- end($self) -> int
Returns the end index of this morpheme in the input text
- get_word_info($self) -> sudachipy.WordInfo
Returns the word info
- is_oov($self) -> bool
Returns whether this is an out-of-vocabulary word
- normalized_form($self) -> str
Returns the normalized form
- part_of_speech()
Returns the part of speech as a six-element tuple. Tuple elements are four POS levels, conjugation type and conjugation form.
- part_of_speech_id($self) -> int
Returns the id of the part of speech in the dictionary
- raw_surface($self) -> str
Returns the substring of input text corresponding to the morpheme, regardless of the configured projection
- reading_form($self) -> str
Returns the reading form
- split($self, mode, out = None, add_single = False) -> sudachipy.MorphemeList
Returns sub-morphemes in the provided split mode.
- Parameters:
mode (sudachipy.SplitMode) – mode of new split
out (Optional[sudachipy.MorphemeList]) – write results to this MorphemeList instead of creating a new one. See https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html for more information on output parameters. The returned MorphemeList will be invalidated if this MorphemeList is used as an output parameter.
add_single (bool) – return lists with the current morpheme if the split hasn’t produced any elements. When False is passed, empty lists are returned instead.
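Multi-level splitting with split() might be used as in the sketch below. The helper name `split_surfaces` is hypothetical; it calls split() with the parameters documented above.

```python
def split_surfaces(morphemes, mode):
    """For each morpheme, return the surfaces of its sub-morphemes in `mode`.

    add_single=True makes a morpheme that produces no sub-morphemes yield
    itself, so every inner list has at least one element.
    """
    return [
        [sub.surface() for sub in m.split(mode, add_single=True)]
        for m in morphemes
    ]

# Usage, assuming SudachiPy and a dictionary package are installed:
#   from sudachipy import Dictionary, SplitMode
#   tokenizer = Dictionary().create(mode=SplitMode.C)
#   print(split_surfaces(tokenizer.tokenize("国家公務員"), SplitMode.A))
```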
- surface($self) -> str
Returns the substring of input text corresponding to the morpheme, or a projection if one is configured
- synonym_group_ids($self) -> List[int]
Returns the list of synonym group ids
- word_id($self) -> int
Returns the word id of this word in the dictionary
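The accessors above can be gathered into a plain dictionary for inspection, as in this sketch. The helper name `describe` is hypothetical; every call it makes is one of the Morpheme methods documented on this page.

```python
def describe(morpheme):
    """Collect the documented accessor values of one morpheme into a dict."""
    return {
        "surface": morpheme.surface(),
        "span": (morpheme.begin(), morpheme.end()),
        "dictionary_form": morpheme.dictionary_form(),
        "normalized_form": morpheme.normalized_form(),
        "reading_form": morpheme.reading_form(),
        "part_of_speech": morpheme.part_of_speech(),
        "is_oov": morpheme.is_oov(),
    }

# Usage, assuming SudachiPy and a dictionary package are installed:
#   from sudachipy import Dictionary
#   tokenizer = Dictionary().create()
#   for m in tokenizer.tokenize("東京都へ行く"):
#       print(describe(m))
```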