Output Parameters and Memory Reuse

By default, SudachiPy creates a new sudachipy.MorphemeList for each tokenization run. That incurs measurable memory allocation overhead. Instead, it is possible to reuse MorphemeLists for multiple analysis runs.

The basic usage pattern is to pass a sudachipy.MorphemeList as an out parameter to sudachipy.Tokenizer.tokenize() method:

tok = dic.create(Mode.A)
morphemes = tok.tokenize("")
for line in data:
    tok.tokenize(line, out=morphemes)
    process(morphemes)

New analysis data will replace old analysis data in this case, reusing the memory.

sudachipy.Morpheme.split() also supports memory reuse. In it’s case, you should be careful because the resulting MorphemeList will refer to the data of the parent MorphemeList and will be invalidated when using the parent list as an output parameter:

ml1 = tok.tokenize("外国人参政権")
subl1 = ml1[0].split(SplitMode.A)
tok.tokenize("something", out=ml1)
subl1[0].surface() # can raise an exception!