This post summarizes how to train a language model with KenLM.

# 1. Installing KenLM dependencies
$ sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev

# 2. Installing KenLM toolkit
$ conda create -n kenlm python=3.8
$ conda activate kenlm
$ git clone --recursive https://github.com/vchahun/kenlm.git
$ cd kenlm
$ ./bjam
$ python setup.py install

# 3. Training a Language Model
$ wget -c https://github.com/vchahun/notes/raw/data/bible/bible.en.txt.bz2
# As a sanity check, run:
$ bzcat bible.en.txt.bz2 | python preprocess.py | wc

# -o sets the order of the model, i.e. the `n` in n-gram (3 = trigram)
$ bzcat bible.en.txt.bz2 |\
  python preprocess.py |\
  ./bin/lmplz -o 3 > bible.arpa

# 4. Binarizing the model
$ ./bin/build_binary bible.arpa bible.binary

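# Alternatively, binarize with the trie data structure, which uses less memory
# than the default probing hash table but is slower to query.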
$ ./bin/build_binary trie bible.arpa bible.binary

# Querying the model from Python
import kenlm

model = kenlm.LanguageModel("bible.binary")
# score() returns the log10 probability of the whole sentence.
model.score("The weather is nice today.")  # -15.03003978729248
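Because the score is a log10 probability, it can be converted into a per-token perplexity. The following is a minimal sketch, assuming the score was computed with the default sentence markers, so the token count is the number of whitespace-separated words plus one for `</s>`. The helper name `perplexity_from_score` is hypothetical and not part of the kenlm API.

```python
def perplexity_from_score(log10_score: float, sentence: str) -> float:
    """Convert a log10 sentence score into a per-token perplexity.

    Assumption: the score includes the end-of-sentence token </s>,
    so the token count is the number of words plus one.
    """
    n_tokens = len(sentence.split()) + 1
    return 10.0 ** (-log10_score / n_tokens)


# Reuse the score from the example above; no model file is needed here.
ppl = perplexity_from_score(-15.03003978729248, "The weather is nice today.")
print(ppl)
```

The kenlm Python module also exposes `model.perplexity(sentence)`, which should give the same kind of number directly from a loaded model.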
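  • The input text file is bzip2-compressed in this example (the pipeline reads it with bzcat) and must contain one sentence per line.
  • You have to write preprocess.py yourself: tokenize each line and print the tokens separated by spaces.
  • Training produces an .arpa file; converting it to a binary makes the file much smaller.
  • To use the model, pip install https://github.com/kpu/kenlm/archive/master.zip is enough.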
# preprocess.py

import sys
import sentencepiece as spm

spm_vocab = spm.SentencePieceProcessor()
spm_vocab.load("subword_5K/sp.model")

# Read raw text from stdin and print each line as space-separated subword pieces.
for line in sys.stdin:
    print(" ".join(spm_vocab.encode_as_pieces(line.strip())))
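If no trained SentencePiece model is at hand, the same contract (one sentence in, space-separated tokens out) can be met with a much simpler tokenizer. The sketch below is an assumption rather than the method used above: it swaps SentencePiece for a regular expression that splits words from punctuation, and the names `tokenize_line` and `tokenize_stream` are hypothetical.

```python
import re
from typing import Iterable, Iterator

# Match either a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_line(line: str) -> str:
    """Return the tokens of one line joined by single spaces."""
    return " ".join(TOKEN_PATTERN.findall(line))


def tokenize_stream(lines: Iterable[str]) -> Iterator[str]:
    """Yield one tokenized line per non-empty input line."""
    for line in lines:
        tokens = tokenize_line(line)
        if tokens:  # skip lines that contain no tokens
            yield tokens


# In a real preprocess.py, pass sys.stdin instead of this sample list.
sample = ["In the beginning, God created the heaven.\n", "\n"]
for out in tokenize_stream(sample):
    print(out)
```

A word-level tokenizer like this gives a much larger vocabulary than a 5K subword model, so the resulting n-gram counts will differ from those of the SentencePiece pipeline.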
# test.arpa example

\data\
ngram 1=3041
ngram 2=116611
ngram 3=255794

\1-grams:
-1.5854161      <unk>
-inf    <s>     -1.2378186
-1.9406412      </s>
-2.355018            -0.4871656
-1.9478472             -0.83428055
-4.2279882            -0.159906
-3.5629702      소리   -0.23907857
-2.6603723            -0.57303077
-3.1922667      그건   -0.28170446
-2.624419            -0.3437255
-2.6044936      그래서 -0.43840107
-2.7573256           -0.4146822
-3.4445417            -0.2560217
-2.9432197            -0.34222814
-3.1642013           -0.6444618
-3.0064254            -0.4572751
-3.4049416      올라   -0.41977677
-4.0711727      와서    -0.16058955
-2.4040775           -0.49303237
-3.591091       위에   -0.24098954
-3.1429856      운동   -0.3808123
-2.8314676      하는    -0.5375345
-2.730483            -0.58122414
-2.8738978            -0.3165717
-2.930341            -0.70681775
-3.7679205      대요    -0.31725228
-2.2082634           -0.6149223
-2.6795304            -0.43282977
-3.4550328      그걸로 -0.19055586
-2.5926714           -0.7962867
-3.0894804            -0.3030804
-3.9237733      구요    -0.25602177
-2.324094            -0.49630338
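An ARPA file is plain text: the `\data\` header lists how many n-grams of each order the model holds, and each entry line carries a log10 probability, the n-gram, and an optional backoff weight. As a quick sanity check on a trained model, the header can be read with a few lines of Python. This is a minimal sketch; the function name `read_ngram_counts` is hypothetical, and the sample lines just repeat the header shown above.

```python
import re
from typing import Dict, Iterable

# Header lines look like "ngram 2=116611".
NGRAM_LINE = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)$")


def read_ngram_counts(lines: Iterable[str]) -> Dict[int, int]:
    """Return {order: count} parsed from the \\data\\ header of an ARPA file."""
    counts: Dict[int, int] = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header or not line:
            continue  # ignore anything before the header and blank lines
        match = NGRAM_LINE.match(line)
        if match is None:
            break  # first non-header line, e.g. "\1-grams:"
        counts[int(match.group(1))] = int(match.group(2))
    return counts


sample = [
    "\\data\\",
    "ngram 1=3041",
    "ngram 2=116611",
    "ngram 3=255794",
    "",
    "\\1-grams:",
    "-1.5854161\t<unk>",
]
print(read_ngram_counts(sample))
```

For a real model, an open file handle can be passed instead of the list, for example `with open("bible.arpa", encoding="utf-8") as f: read_ngram_counts(f)`.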
