sudachipy.dictionary package

Note

Importing from sudachipy.dictionary is deprecated.
- Use from sudachipy import Dictionary instead.
Dictionary does not provide access to the grammar and lexicon.

Module contents

class sudachipy.dictionary.Dictionary(config_path=None, resource_dir=None, dict=None, dict_type=None, *, config=None)

A sudachi dictionary.

If neither config.systemDict nor dict is given, sudachidict_core is used. If both config.systemDict and dict are given, dict takes precedence. If dict is an absolute path to a file, that file is used as the dictionary.

Parameters:

config_path (Config | pathlib.Path | str | None) – path to the configuration JSON file, the configuration JSON as a string, or a sudachipy.Config object.
config (Config | pathlib.Path | str | None) – alias of config_path; only one of the two may be specified.
resource_dir (pathlib.Path | str | None) – path to the resource directory.
dict (pathlib.Path | str | None) – type of pre-packaged dictionary, referring to the sudachidict_<dict> packages on PyPI: https://pypi.org/search/?q=sudachidict. It can also be an absolute path to a compiled dictionary file.
dict_type (pathlib.Path | str | None) – deprecated alias of dict.
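
The precedence between config.systemDict and dict described above can be sketched in plain Python. This is only an illustration of the documented rule, not SudachiPy's own code: the function name resolve_system_dict, its parameter names, and the returned (kind, value) tuples are all hypothetical, and the sketch checks only that a path is absolute, not that the file exists.

```python
import os


def resolve_system_dict(config_system_dict=None, dict_arg=None):
    """Illustrate which system dictionary the documented rule would select.

    Hypothetical helper: returns a (kind, value) tuple for demonstration only.
    """
    if dict_arg is not None:
        # dict wins over config.systemDict when both are given.
        value = str(dict_arg)
        if os.path.isabs(value):
            # An absolute path is treated as a compiled dictionary file.
            return ("file", value)
        # Otherwise it names a pre-packaged sudachidict_<dict> package.
        return ("package", "sudachidict_" + value)
    if config_system_dict is not None:
        return ("config", str(config_system_dict))
    # Neither was given: fall back to sudachidict_core.
    return ("package", "sudachidict_core")
```

For instance, resolve_system_dict(dict_arg="full") selects the sudachidict_full package even when a config value is also supplied.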

close(): Close this dictionary.

create(self, /, mode=SplitMode.C, fields=None, *, projection=None) → Tokenizer

Creates a sudachi tokenizer.

Parameters:

mode (SplitMode | str | None) – sets the analysis mode for this Tokenizer
fields (set[str] | None) – load only a subset of fields. See https://worksapplications.github.io/sudachi.rs/python/topics/subsetting.html.
projection (str | None) – Projection override for created Tokenizer. See Config.projection for values.
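
A minimal usage sketch for create, assuming sudachipy and a dictionary package such as sudachidict_core are installed. The helper name surfaces and the sample text are illustrative choices, not part of the API; the import is placed inside main so that the helper can be read and reused independently.

```python
def surfaces(tokenizer, text):
    """Return the surface string of every morpheme produced for `text`."""
    return [m.surface() for m in tokenizer.tokenize(text)]


def main():
    # Imported here so the helper above does not require SudachiPy at import time.
    from sudachipy import Dictionary, SplitMode

    # With no arguments, the dictionary falls back to sudachidict_core.
    dictionary = Dictionary()
    for mode in (SplitMode.A, SplitMode.C):
        # One tokenizer per split mode; the mode is fixed at creation.
        tokenizer = dictionary.create(mode=mode)
        print(mode, surfaces(tokenizer, "国家公務員"))
```

Calling main() prints the segmentation for each mode; the exact output depends on the installed dictionary.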

lookup(self, /, surface, out=None) → MorphemeList

Look up morphemes in the binary dictionary without performing the analysis.

All morphemes in the dictionary with the given surface string are returned. The last user dictionary is searched first and the system dictionary is searched last. Within a dictionary, morphemes are returned in the order in which they appear in the binary dictionary. Morphemes that are not indexed are not returned.

Parameters:

surface (str) – find all morphemes with the given surface
out (MorphemeList | None) – if passed, reuse the given morpheme list instead of creating a new one. See https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html for details.
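
The out parameter can be used as sketched below, where one MorphemeList is reused across several lookups. The helper name count_entries is hypothetical, and the sketch assumes that the returned list supports len(); the dictionary argument is expected to be a Dictionary instance.

```python
def count_entries(dictionary, surfaces):
    """Count dictionary entries per surface, reusing one result list.

    `dictionary` is expected to behave like sudachipy.Dictionary.
    """
    counts = {}
    out = None  # the first call lets lookup create the list
    for surface in surfaces:
        # Later calls pass the previous list back so it can be reused.
        out = dictionary.lookup(surface, out=out)
        counts[surface] = len(out)
    return counts
```

With SudachiPy installed, a call such as count_entries(Dictionary(), ["東京", "京都"]) would report how many entries each surface has in the loaded dictionaries.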

pos_matcher(target)

Creates a POS matcher object.

If target is a function, it must return whether a POS should match. If target is a list, it should contain partially specified POS tuples. Partially specified means that trailing POS fields may be omitted and that None may be used as a sentinel value that matches any value in that position.

For example, ('名詞',) will match any noun and (None, None, None, None, None, '終止形‐一般') will match any word in 終止形‐一般 conjugation form.

Parameters: target (Iterable[PartialPOS] | Callable[[POS], bool]) – can be either a list of POS partial tuples or a callable which maps POS to bool.
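
The partial-matching rule can be illustrated with a small pure-Python sketch. It operates on plain POS tuples and is not SudachiPy's implementation; the names partial_matches and make_matcher, as well as the sample POS tuples, are made up for the example.

```python
def partial_matches(partial, pos):
    """True if every specified field of `partial` equals the same field of `pos`.

    None matches anything, and fields omitted at the end are not compared
    because zip stops at the shorter sequence.
    """
    return all(want is None or want == have for want, have in zip(partial, pos))


def make_matcher(target):
    """Build a predicate over POS tuples from a callable or a list of partial POS."""
    if callable(target):
        return target
    patterns = [tuple(p) for p in target]
    return lambda pos: any(partial_matches(p, pos) for p in patterns)


# Illustrative POS tuples; real values come from the loaded dictionary.
FINAL_FORM = "終止形-一般"
noun = ("名詞", "普通名詞", "一般", "*", "*", "*")
verb = ("動詞", "一般", "*", "*", "五段-カ行", FINAL_FORM)

is_noun = make_matcher([("名詞",)])
is_final_form = make_matcher([(None, None, None, None, None, FINAL_FORM)])
```

Here is_noun accepts the noun tuple because only the first field is compared, while is_final_form looks only at the sixth field.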

pos_of(self, /, pos_id: int) → tuple[str, str, str, str, str, str] | None

Returns POS with the given id.

Parameters: pos_id (int) – POS id
Returns: POS tuple with the given id, or None if the id does not exist.
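
Because pos_of returns None for an unknown id, it can be used to enumerate POS tuples, as in the sketch below. The helper name list_pos is hypothetical, and the sketch assumes that POS ids are contiguous and start at 0, which the text above does not state.

```python
def list_pos(dictionary):
    """Collect POS tuples by probing ids from 0 until pos_of returns None.

    Assumption: ids are contiguous starting at 0.
    """
    result = []
    pos_id = 0
    while True:
        pos = dictionary.pos_of(pos_id)
        if pos is None:
            break
        result.append(pos)
        pos_id += 1
    return result
```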

pre_tokenizer(self, /, mode=None, fields=None, handler=None, *, projection=None) → tokenizers.PreTokenizer

Creates a HuggingFace Tokenizers-compatible PreTokenizer. Requires the tokenizers package to be installed.

Parameters:

mode (SplitMode | str | None) – Use this split mode (C by default)
fields (set[str] | None) – ask Sudachi to load only a subset of fields. See https://worksapplications.github.io/sudachi.rs/python/topics/subsetting.html. Only used when handler is set.
handler (Callable[[int, NormalizedString, MorphemeList], list[NormalizedString]] | None) – a custom callable that transforms a MorphemeList into a list of tokens. It should be a function(index: int, original: NormalizedString, morphemes: MorphemeList) -> List[NormalizedString]. If None, the surface of each morpheme is used as the token representation. Overrides projection. See https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/custom_components.py.
projection (str | None) – Projection override for created Tokenizer. See Config.projection for supported values.
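
A custom handler could look like the following sketch, which emits the dictionary form of each morpheme instead of its surface. The names make_handler and build_pre_tokenizer are hypothetical; the sketch assumes that sudachipy, a dictionary package, and tokenizers are installed, and that NormalizedString can be constructed from a str.

```python
def make_handler(wrap):
    """Return a handler that maps each morpheme to its dictionary form.

    `wrap` converts a str into the token type; in real use this would be
    tokenizers.NormalizedString.
    """
    def handler(index, original, morphemes):
        # `index` and `original` are part of the required signature but unused here.
        return [wrap(m.dictionary_form()) for m in morphemes]
    return handler


def build_pre_tokenizer():
    # Imported here so make_handler can be used without these packages installed.
    from sudachipy import Dictionary
    from tokenizers import NormalizedString

    dictionary = Dictionary()
    # Because a handler is set, it takes precedence over any projection.
    return dictionary.pre_tokenizer(mode="A", handler=make_handler(NormalizedString))
```

Tokens built this way no longer correspond to slices of the original text, so offsets reported downstream may not line up with the input.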