These notes summarize how to train a language model with KenLM.

# 1. Installing KenLM dependencies
$ sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev

# 2. Installing KenLM toolkit
$ conda create -n kenlm python=3.8
$ conda activate kenlm
$ git clone --recursive https://github.com/vchahun/kenlm.git
$ cd kenlm
$ ./bjam
$ python setup.py install

# 3. Training a Language Model
$ wget -c https://github.com/vchahun/notes/raw/data/bible/bible.en.txt.bz2
# For sanity check, do:
$ bzcat bible.en.txt.bz2 | python preprocess.py | wc

# -o sets the model order, i.e. the `n` in n-gram
$ bzcat bible.en.txt.bz2 |\
  python preprocess.py |\
  ./bin/lmplz -o 3 > bible.arpa
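
The same pipeline can be driven from Python instead of the shell. The following is only a minimal sketch, not part of the original notes: `train_arpa` and its parameters are hypothetical names, `tokenize` stands in for whatever preprocess.py does, and the default command simply mirrors `./bin/lmplz -o 3` from the shell version.

```python
import bz2
import subprocess


def train_arpa(corpus_bz2, arpa_path, tokenize,
               lmplz_cmd=("./bin/lmplz", "-o", "3")):
    """Stream a bzip2 corpus through `tokenize` into lmplz; save stdout as ARPA.

    `tokenize` is any callable mapping a line of text to a list of tokens.
    """
    with bz2.open(corpus_bz2, "rt", encoding="utf-8") as src, \
            open(arpa_path, "wb") as out:
        proc = subprocess.Popen(list(lmplz_cmd),
                                stdin=subprocess.PIPE, stdout=out)
        for line in src:
            tokens = tokenize(line.strip())
            if tokens:  # skip empty lines
                proc.stdin.write((" ".join(tokens) + "\n").encode("utf-8"))
        proc.stdin.close()
        if proc.wait() != 0:
            raise RuntimeError("lmplz exited with a non-zero status")
```

For a quick trial without SentencePiece, `train_arpa("bible.en.txt.bz2", "bible.arpa", str.split)` would train on whitespace-separated tokens.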

# 4. Binarizing the model
$ ./bin/build_binary bible.arpa bible.binary

# Alternatively, use the trie data structure, which needs less memory than the default probing hash table.
$ ./bin/build_binary trie bible.arpa bible.binary

# 5. Querying the model from Python
import kenlm
model = kenlm.LanguageModel("bible.binary")
model.score("The weather is nice today.")  # -15.03003978729248
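
Because the model is trained on preprocessed text, a query sentence should go through the same tokenization before it is scored. `score` returns a log10 probability, which can be turned into a per-token perplexity. The sketch below assumes a `kenlm.LanguageModel` and a tokenizer callable; the helper names `score_tokenized` and `perplexity_from_log10` are hypothetical.

```python
def score_tokenized(model, encode, sentence):
    """Tokenize `sentence` like the training data, then score it.

    `model` is expected to be a kenlm.LanguageModel and `encode` a callable
    returning a list of tokens (e.g. SentencePieceProcessor.encode_as_pieces).
    Returns (log10 probability, number of tokens).
    """
    tokens = encode(sentence.strip())
    log10_prob = model.score(" ".join(tokens), bos=True, eos=True)
    return log10_prob, len(tokens)


def perplexity_from_log10(log10_prob, n_tokens):
    """Per-token perplexity; the +1 accounts for the end-of-sentence token."""
    return 10.0 ** (-log10_prob / (n_tokens + 1))
```

The kenlm module also offers `model.perplexity(sentence)`, which applies the same formula to an already tokenized, space-separated string.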
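  • The corpus must be a text file with one sentence per line. It is bzip2-compressed here because the pipeline reads it with bzcat.
  • You have to write preprocess.py yourself: tokenize each line and join the tokens with spaces.
  • Training produces an .arpa file; converting it to binary reduces the file size considerably.
  • To only use a trained model, pip install https://github.com/kpu/kenlm/archive/master.zip is enough.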
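
To check the size reduction on your own model, the two files can be compared directly. This is a small sketch; `size_report` is a hypothetical helper and the file names in the usage note are the ones from the commands above.

```python
import os


def size_report(arpa_path, binary_path):
    """Return the sizes of both model files and the binary/ARPA size ratio."""
    arpa_bytes = os.path.getsize(arpa_path)
    binary_bytes = os.path.getsize(binary_path)
    if arpa_bytes == 0:
        raise ValueError("ARPA file is empty")
    return {
        "arpa_bytes": arpa_bytes,
        "binary_bytes": binary_bytes,
        "ratio": binary_bytes / arpa_bytes,
    }
```

Calling `size_report("bible.arpa", "bible.binary")` after step 4 shows how much the binarization saved; a ratio below 1 means the binary file is smaller.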
# preprocess.py

import sys
import sentencepiece as spm

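# Load a trained SentencePiece model; "spm.model" is a placeholder path.
spm_vocab = spm.SentencePieceProcessor()
spm_vocab.load("spm.model")

for line in sys.stdin:
    # Drop the trailing newline, then emit space-separated pieces.
    print(" ".join(spm_vocab.encode_as_pieces(line.strip())))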
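
preprocess.py needs a trained SentencePiece model to tokenize with. The notes do not say how that model was built, so the following is only a sketch under assumptions: `train_spm` is a hypothetical helper, the vocabulary size of 8000 is an arbitrary illustrative value, and `corpus_path` must point to an uncompressed text file with one sentence per line.

```python
def train_spm(corpus_path, model_prefix="spm", vocab_size=8000, train_fn=None):
    """Train a SentencePiece model and return the path of the .model file.

    `train_fn` defaults to sentencepiece.SentencePieceTrainer.train; it is a
    parameter only so the function can be exercised without the library.
    """
    if train_fn is None:
        import sentencepiece as spm  # imported lazily, only when needed
        train_fn = spm.SentencePieceTrainer.train
    train_fn(input=corpus_path, model_prefix=model_prefix,
             vocab_size=vocab_size)
    # SentencePiece writes <model_prefix>.model and <model_prefix>.vocab.
    return model_prefix + ".model"
```

The returned path is what `SentencePieceProcessor.load` expects.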
# test.arpa example
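
An ARPA file is plain text. It starts with a `\data\` header that lists one `ngram N=count` line per order, followed by one `\N-grams:` section per order whose lines hold a log10 probability, the n-gram, and an optional backoff weight separated by tabs, and it ends with `\end\`. As a small sketch (the helper name `read_arpa_counts` is hypothetical), the header can be read like this:

```python
import re


def read_arpa_counts(lines):
    """Return {order: count} parsed from the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header:
            match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line:
                break  # the first n-gram section ends the header
    return counts
```

Passing an open file works because the function only iterates over lines, for example `read_arpa_counts(open("bible.arpa", encoding="utf-8"))`.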

