sudachipy.dictionary package

Note

  • Importing from sudachipy.dictionary is deprecated.
    • Use from sudachipy import Dictionary instead.

  • Dictionary does not provide access to the grammar and lexicon.

Module contents

class sudachipy.dictionary.Dictionary(config_path=None, resource_dir=None, dict=None, dict_type=None, *, config=None)

A sudachi dictionary.

If neither config.systemDict nor dict is given, sudachidict_core is used. If both config.systemDict and dict are given, dict is used. If dict is an absolute path to a file, that file is used as the dictionary.

Parameters:
  • config_path (Config | pathlib.Path | str | None) – path to the configuration JSON file, the configuration JSON as a string, or a sudachipy.Config object.

  • config (Config | pathlib.Path | str | None) – alias for config_path; only one of the two can be specified.

  • resource_dir (pathlib.Path | str | None) – path to the resource directory.

  • dict (pathlib.Path | str | None) – type of pre-packaged dictionary, referring to sudachidict_<dict> packages on PyPI: https://pypi.org/search/?q=sudachidict. Can also be an absolute path to a compiled dictionary file.

  • dict_type (pathlib.Path | str | None) – deprecated alias for dict.
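
The precedence between config.systemDict and dict described above can be sketched in plain Python. This is only a model of the documented rule, not SudachiPy's implementation: choose_system_dict and open_dictionary are hypothetical helper names, and the behaviour when only one of the two values is given (that value is used) is an assumption.

```python
def choose_system_dict(config_system_dict=None, dict_arg=None):
    """Model of the documented precedence (hypothetical helper, not library code)."""
    if dict_arg is not None:
        # Documented: when both are given, dict is used.
        return dict_arg
    if config_system_dict is not None:
        # Assumption: when only config.systemDict is given, it is used.
        return config_system_dict
    # Documented: when neither is given, sudachidict_core is used.
    return "sudachidict_core"


def open_dictionary(dict_name="core"):
    """Open a pre-packaged dictionary; needs sudachipy and sudachidict_<dict_name>."""
    # Imported lazily so the module loads even without SudachiPy installed.
    from sudachipy import Dictionary

    return Dictionary(dict=dict_name)
```

With SudachiPy and the sudachidict_core package installed, open_dictionary() should be equivalent to calling Dictionary() with no arguments.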

close()

Close this dictionary.

create(self, /, mode=SplitMode.C, fields=None, *, projection=None) -> Tokenizer

Creates a sudachi tokenizer.
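
A minimal usage sketch, assuming SudachiPy and a sudachidict package are installed. make_tokenizer and surfaces are hypothetical helper names; the sketch relies on Tokenizer.tokenize and Morpheme.surface from the wider SudachiPy API, which are not documented in this section.

```python
def make_tokenizer():
    """Create a tokenizer in split mode A (shortest units)."""
    # Imported lazily so the module loads even without SudachiPy installed.
    from sudachipy import Dictionary, SplitMode

    return Dictionary().create(mode=SplitMode.A)


def surfaces(tokenizer, text):
    """Return the surface string of every morpheme the tokenizer produces."""
    return [morpheme.surface() for morpheme in tokenizer.tokenize(text)]
```

For example, surfaces(make_tokenizer(), "国家公務員") returns the surface forms of the morphemes found in that string; the exact split depends on the installed dictionary.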

Parameters:

lookup(self, /, surface, out=None) -> MorphemeList

Look up morphemes in the binary dictionary without performing the analysis.

All morphemes from the dictionary with the given surface string are returned, with the last user dictionary searched first and the system dictionary searched last. Within a dictionary, morphemes are returned in the order in which they appear in the binary dictionary. Morphemes which are not indexed are not returned.
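
The search order can be illustrated with a small pure-Python model. This is a sketch of the ordering described above, not the library's implementation: lookup_model is a hypothetical name, and each dictionary is modelled as a list of (surface, label) pairs in binary-dictionary order.

```python
def lookup_model(surface, system_dict, user_dicts=()):
    """Model of the documented lookup order (hypothetical, not library code).

    user_dicts is given in load order, so the last element is searched first.
    """
    results = []
    # Last user dictionary first, system dictionary last.
    for dictionary in list(reversed(user_dicts)) + [system_dict]:
        # Within one dictionary, keep the stored order.
        results.extend(entry for entry in dictionary if entry[0] == surface)
    return results
```

With SudachiPy installed, the real call is Dictionary().lookup(surface), which returns a MorphemeList rather than a plain list of pairs.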

Parameters:

pos_matcher(target)

Creates a POS matcher object.

If target is a function, it must return whether a given POS should match. If target is a list, it should contain partially specified POS tuples. Partially specified means that POS fields may be omitted, or that None may be used as a sentinel value that matches any value in that field.

For example, ('名詞',) will match any noun, and (None, None, None, None, None, '終止形‐一般') will match any word in the 終止形‐一般 conjugation form.

Parameters:

target (Iterable[PartialPOS] | Callable[[POS], bool]) – can be either a list of POS partial tuples or a callable which maps POS to bool.
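
The partial-matching rule can be modelled in a few lines of plain Python. This is a sketch of the semantics described above, not the library's matcher: matches_partial and matches_any are hypothetical names, and the sample POS tuples are made up for illustration.

```python
def matches_partial(pos, partial):
    """True if every specified field of partial equals the same field of pos.

    zip stops at the shorter tuple, so omitted trailing fields match anything;
    None matches any value in its position.
    """
    return all(want is None or want == have for want, have in zip(partial, pos))


def matches_any(pos, partials):
    """Model of a list target: a POS matches if any partial tuple matches."""
    return any(matches_partial(pos, partial) for partial in partials)


# Illustrative six-field POS tuples (values are assumptions).
noun = ("名詞", "普通名詞", "一般", "*", "*", "*")
verb = ("動詞", "一般", "*", "*", "五段-カ行", "終止形‐一般")
```

With SudachiPy installed, the real matcher is created with Dictionary().pos_matcher([('名詞',)]) or by passing a callable such as lambda pos: pos[0] == '名詞'.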

pos_of(self, /, pos_id: int) -> tuple[str, str, str, str, str, str] | None

Returns POS with the given id.

Parameters:

pos_id (int) – POS id

Returns:

POS tuple with the given id, or None if no POS with that id exists.

pre_tokenizer(self, /, mode=None, fields=None, handler=None, *, projection=None) -> tokenizers.PreTokenizer

Creates a HuggingFace Tokenizers-compatible PreTokenizer. Requires the tokenizers package to be installed.
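
A minimal sketch of wiring the pre-tokenizer into a HuggingFace tokenizer, assuming sudachipy, a sudachidict package, and tokenizers are all installed. build_hf_tokenizer is a hypothetical helper name, and the choice of a WordLevel model with an "[UNK]" token is an assumption made only for illustration.

```python
def build_hf_tokenizer(dict_name="core"):
    """Return a tokenizers.Tokenizer that pre-tokenizes text with Sudachi."""
    # Imported lazily so the module loads even when the packages are missing.
    from sudachipy import Dictionary
    from tokenizers import Tokenizer
    from tokenizers.models import WordLevel

    dictionary = Dictionary(dict=dict_name)
    # Assumption: an untrained WordLevel model, chosen only as a placeholder.
    hf_tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
    # Default arguments are used; mode, fields, handler and projection are optional.
    hf_tokenizer.pre_tokenizer = dictionary.pre_tokenizer()
    return hf_tokenizer
```

The returned tokenizer still needs a vocabulary (for example via training) before it produces useful token ids.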

Parameters: