KenLM 학습 방법
KenLM 학습 방법을 정리합니다.
# 1. Installing KenLM dependencies
$ sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
# 2. Installing KenLM toolkit
$ conda create -n kenlm python=3.8
$ conda activate kenlm
$ git clone --recursive https://github.com/vchahun/kenlm.git
$ cd kenlm
$ ./bjam
$ python setup.py install
# 3. Training a Language Model
$ wget -c https://github.com/vchahun/notes/raw/data/bible/bible.en.txt.bz2
# For sanity check, do:
$ bzcat bible.en.txt.bz2 | python preprocess.py | wc
# -o means `order` which translates to the `n` in n-gram
$ bzcat bible.en.txt.bz2 |\
python preprocess.py |\
./bin/lmplz -o 3 > bible.arpa
# 4. Binarizing the model
$ ./bin/build_binary bible.arpa bible.binary
# One can also use trie when binarizing.
$ ./bin/build_binary trie bible.arpa bible.binary
import kenlm
model = kenlm.LanguageModel("bible.binary")
model.score("The weather is nice today.") # -15.03003978729248
- 텍스트파일을 bzip2로 압축해야하고,
라인바이라인
이어야한다. - preprocess.py 를 짜야되는데 토크나이징해서 토큰으로 나누고 공백으로 구분하면 된다.
- 학습하면 .arpa 파일이 나오는데 binary로 해놓으면 파일용량이 확 줄어든다.
- 사용할 때는
pip install https://github.com/kpu/kenlm/archive/master.zip
하면 된다.
# preprocess.py
import sys
import sentencepiece as spm
spm_vocab = spm.SentencePieceProcessor()
spm_vocab.load("subword_5K/sp.model")
for line in sys.stdin:
print(" ".join(spm_vocab.encode_as_pieces(line)))
# test.arpa 예시
\data\
ngram 1=3041
ngram 2=116611
ngram 3=255794
\1-grams:
-1.5854161 <unk>
-inf <s> -1.2378186
-1.9406412 </s>
-2.355018 ▁아 -0.4871656
-1.9478472 ▁ -0.83428055
-4.2279882 몬 -0.159906
-3.5629702 ▁소리 -0.23907857
-2.6603723 야 -0.57303077
-3.1922667 ▁그건 -0.28170446
-2.624419 ▁또 -0.3437255
-2.6044936 ▁그래서 -0.43840107
-2.7573256 ▁지 -0.4146822
-3.4445417 호 -0.2560217
-2.9432197 랑 -0.34222814
-3.1642013 ▁계 -0.6444618
-3.0064254 단 -0.4572751
-3.4049416 ▁올라 -0.41977677
-4.0711727 와서 -0.16058955
-2.4040775 ▁막 -0.49303237
-3.591091 ▁위에 -0.24098954
-3.1429856 ▁운동 -0.3808123
-2.8314676 하는 -0.5375345
-2.730483 ▁기 -0.58122414
-2.8738978 구 -0.3165717
-2.930341 ▁있 -0.70681775
-3.7679205 대요 -0.31725228
-2.2082634 ▁그 -0.6149223
-2.6795304 서 -0.43282977
-3.4550328 ▁그걸로 -0.19055586
-2.5926714 ▁할 -0.7962867
-3.0894804 려 -0.3030804
-3.9237733 구요 -0.25602177
-2.324094 ▁뭐 -0.49630338
Leave a comment