sudachipy.dictionary package
Note
- Import from sudachipy.dictionary is deprecated. Use from sudachipy import Dictionary instead.
- Dictionary does not provide access to the grammar and lexicon.
Module contents
- class sudachipy.dictionary.Dictionary(config_path=None, resource_dir=None, dict=None, dict_type=None, *, config=None)
A sudachi dictionary
- close()
Close this dictionary
- create($self, mode = 'C') -> sudachipy.Tokenizer
Creates a sudachi tokenizer.
- Parameters:
mode – tokenizer’s default split mode (C by default).
fields – load only a subset of fields. See https://worksapplications.github.io/sudachi.rs/python/topics/subsetting.html
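A minimal sketch of how create might be used, assuming sudachipy and a system dictionary package such as sudachidict_core are installed. The helper names surfaces and demo are hypothetical and not part of sudachipy; the sketch relies only on Dictionary.create, Tokenizer.tokenize and Morpheme.surface.

```python
def surfaces(dictionary, text, mode=None):
    """Return the surface strings of `text`, tokenized via `dictionary`.

    `surfaces` is a hypothetical helper, not part of sudachipy.
    """
    # Omitting the mode keeps the tokenizer's default split mode (C).
    if mode is None:
        tokenizer = dictionary.create()
    else:
        tokenizer = dictionary.create(mode=mode)
    return [m.surface() for m in tokenizer.tokenize(text)]


def demo():
    # Assumption: sudachipy and a system dictionary (for example
    # sudachidict_core) are installed, so Dictionary() can load its defaults.
    from sudachipy import Dictionary, SplitMode

    dictionary = Dictionary()
    for mode in (SplitMode.A, SplitMode.C):
        print(mode, surfaces(dictionary, "国家公務員", mode))
```

Calling demo() prints the segmentation for the shortest (A) and longest (C) split modes. The import sits inside demo so that surfaces can be exercised with any object exposing the same methods.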
- lookup($self, surface, out = None) -> sudachipy.MorphemeList
Look up morphemes in the binary dictionary without performing the analysis.
All morphemes in the dictionary with the given surface string are returned. User dictionaries are searched first, starting with the last one, and the system dictionary is searched last. Within a dictionary, morphemes are returned in the order in which they appear in the binary dictionary. Morphemes that are not indexed are not returned.
- Parameters:
surface (str) – find all morphemes with the given surface
out (sudachipy.MorphemeList) – if passed, reuse the given morpheme list instead of creating a new one. See https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html for details.
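The following sketch shows one way lookup could be combined with the out parameter to avoid allocating a new list for every query. The helper name dictionary_forms is hypothetical. It assumes that lookup returns the list it filled and that the returned morphemes expose dictionary_form().

```python
def dictionary_forms(dictionary, surfaces):
    """Map each surface to the dictionary forms of its lookup results.

    `dictionary_forms` is a hypothetical helper, not part of sudachipy.
    """
    out = None  # the first call creates the list; later calls reuse it
    forms = {}
    for surface in surfaces:
        # Assumption: lookup returns the morpheme list it wrote into.
        out = dictionary.lookup(surface, out=out)
        # Copy the data into plain strings now, because the next call
        # overwrites the contents of the reused list.
        forms[surface] = [m.dictionary_form() for m in out]
    return forms
```

With a real dictionary this would be called as dictionary_forms(Dictionary(), ["東京", "京都"]). Extracting plain values before the next lookup matters because a reused list no longer holds the earlier results.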
- pos_matcher(target)
Creates a POS matcher object
If target is a function, it must return whether a POS should match or not. If target is a list, it should contain partially specified POS tuples. Partially specified means that trailing POS fields may be omitted and that None may be used as a sentinel value that matches any value in that position.
For example, ('名詞',) will match any noun and (None, None, None, None, None, '終止形‐一般') will match any word in the 終止形‐一般 conjugation form.
- Parameters:
target – can be either a callable or list of POS partial tuples
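The matching rule for partially specified POS tuples can be illustrated in pure Python. This is only a sketch of the semantics described above, not sudachipy's implementation; make_matcher is a hypothetical name and the POS tuples are illustrative values.

```python
def make_matcher(targets):
    """Build a predicate over 6-field POS tuples from partial POS tuples.

    Hypothetical illustration of the matching rule, not sudachipy code.
    """
    targets = [tuple(t) for t in targets]

    def matches(pos):
        # zip stops at the shorter tuple, so omitted trailing fields match
        # anything; None is a wildcard for a single position.
        return any(
            all(want is None or want == have for want, have in zip(target, pos))
            for target in targets
        )

    return matches


# Illustrative POS tuples (values chosen for the example).
NOUN = ("名詞", "普通名詞", "一般", "*", "*", "*")
VERB = ("動詞", "一般", "*", "*", "五段-ラ行", "終止形-一般")

is_noun = make_matcher([("名詞",)])
is_terminal = make_matcher([(None, None, None, None, None, "終止形-一般")])
```

Here is_noun(NOUN) and is_terminal(VERB) hold, while is_noun(VERB) does not. With the library itself, the list of partial tuples would be passed to pos_matcher instead.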
- pos_of()
Get POS Tuple by its id
- pre_tokenizer($self, mode, fields, handler) -> tokenizers.PreTokenizer
Creates a HuggingFace Tokenizers-compatible PreTokenizer. Requires the tokenizers package to be installed.
- Parameters:
mode (sudachipy.SplitMode) – Use this split mode (C by default)
fields (Set[str]) – ask Sudachi to load only a subset of fields. See https://worksapplications.github.io/sudachi.rs/python/topics/subsetting.html
handler – a custom callable that transforms a MorphemeList into a list of tokens. It should be a function(index: int, original: NormalizedString, morphemes: MorphemeList) -> List[NormalizedString]. See https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/custom_components.py. If no handler is passed, morpheme surfaces are used as the token representations.
projection – projection mode for the created PreTokenizer. See the sudachipy.config.Config object documentation for supported projections.
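A handler with the signature described above might look like the following sketch. It assumes that Morpheme.begin() and Morpheme.end() return character offsets into the original string and that NormalizedString supports slicing, as in the linked custom_components.py example. The names surface_handler and build_pre_tokenizer are hypothetical.

```python
def surface_handler(index, original, morphemes):
    """Split `original` at morpheme boundaries.

    Hypothetical handler: returns one slice of `original` per morpheme.
    """
    # Assumption: begin()/end() are character offsets into `original`.
    return [original[m.begin():m.end()] for m in morphemes]


def build_pre_tokenizer():
    # Assumption: sudachipy, a system dictionary and tokenizers are installed,
    # and pre_tokenizer accepts mode and handler as keyword arguments.
    from sudachipy import Dictionary, SplitMode

    dictionary = Dictionary()
    return dictionary.pre_tokenizer(mode=SplitMode.C, handler=surface_handler)
```

This handler roughly mirrors the default behaviour of using surfaces as tokens. A custom handler becomes useful when tokens should be filtered or replaced, for example by normalized forms.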