syntok
sentence segmentation and word tokenization toolkit
To install this package, run one of the following:
syntok is the successor to segtok, an earlier and very similar tool, but it has evolved significantly, offering better segmentation and tokenization quality and higher throughput (syntok can segment documents at a rate of about 100k tokens per second). For example, segtok cannot detect a sentence terminal marker that is not followed by a spacing character, whereas syntok segments that case correctly because it tokenizes first and segments afterwards.
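To illustrate why the order of operations matters, here is a minimal, standard-library-only sketch of the "tokenize first, segment afterwards" idea. It is not syntok's implementation and does not use syntok's API; the names `tokenize`, `segment`, and `naive_split`, the regular expressions, and the "next token starts with an uppercase letter" rule are all assumptions made for this toy example.

```python
import re

# Toy tokenizer (an assumption for illustration): a token is either a run of
# word characters or a single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
TERMINALS = {".", "!", "?"}


def tokenize(text):
    """Split text into word and punctuation tokens, ignoring whitespace."""
    return TOKEN_RE.findall(text)


def segment(tokens):
    """Group tokens into sentences.

    A terminal marker closes a sentence when it is the last token or when the
    next token starts with an uppercase letter. Whitespace plays no role,
    because the decision is made on tokens rather than on raw text.
    """
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in TERMINALS and (nxt is None or nxt[0].isupper()):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def naive_split(text):
    """Whitespace-dependent baseline: split only where a terminal marker is
    followed by spacing."""
    return re.split(r"(?<=[.!?])\s+", text)


text = "This works.Even without a space! Done."
print(naive_split(text))         # misses the boundary after "works."
print(segment(tokenize(text)))   # finds it, since "." is its own token
```

The whitespace-based baseline merges the first two sentences because no spacing character follows the first period, while the token-based version separates them. A real segmenter needs far more care (abbreviations, decimals, quotes, and so on), which this sketch deliberately ignores.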
Summary
sentence segmentation and word tokenization toolkit
Last Updated
Jan 31, 2022 at 15:06
License
MIT
Supported Platforms