[do it! bert gpt] Chapter 2. Splitting Sentences into Smaller Units

shinho0902 2023. 1. 26. 00:16

 

 

 

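2-1 What Is Tokenization?

Tokenization is the process of splitting a sentence into a sequence of tokens.

  • Word-level tokenization:
    • The vocabulary grows very large, which makes the model hard to train.
  • Character-level tokenization:
    • It is free from the unknown-token (out-of-vocabulary) problem.
    • Individual character tokens rarely form meaningful units.
  • Subword-level tokenization:
    • It combines the advantages of word-level and character-level tokenization.
    • It keeps the vocabulary from growing too large while avoiding the unknown-token problem, and it keeps the resulting token sequences from becoming too long.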
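The trade-off between the first two schemes can be seen on one of the review sentences used later in this post. This is only a rough sketch: whitespace splitting stands in for word-level tokenization, and the variable names are illustrative.

```python
# One of the NSMC review sentences used later in this post.
sentence = "아 더빙.. 진짜 짜증나네요 목소리"

# Word-level (approximated by whitespace splitting): few tokens per sentence,
# but every distinct surface form needs its own vocabulary entry.
word_tokens = sentence.split()

# Character-level: no unknown tokens, but the sequence gets much longer
# and a single character carries little meaning on its own.
char_tokens = [ch for ch in sentence if ch != " "]

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens)
```

Subword tokenization sits between these two extremes: frequent pieces such as `짜증나` stay whole while rarer strings are broken into smaller units.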
 

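2-2 What Is Byte Pair Encoding (BPE)?

 

BPE is an algorithm that compresses information efficiently while limiting growth in vocabulary size.

A BPE vocabulary is built by repeatedly merging the most frequent bigram, that is, the most frequent pair of adjacent symbols.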
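The merge procedure can be sketched in a few lines. This is a toy illustration under simplifying assumptions (a hand-made word-frequency table, character-level start symbols, no byte-level handling), not the implementation used by the `tokenizers` library. The function name `learn_bpe` and the toy counts are made up for this example.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} table (toy sketch)."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical word frequencies, chosen only to make the merges easy to follow.
toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

Each merge adds exactly one entry to the vocabulary, so the number of merges directly controls the vocabulary size.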

 
[Figure 2-1]
 

2-3 Building a Vocabulary

In [ ]:
!pip install ratsnlp
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
 
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
# Naver Sentiment Movie Corpus (NSMC)
from Korpora import Korpora
nsmc = Korpora.load("nsmc", force_download=True)
 
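    Korpora only provides functionality for easily downloading and using
    corpora that others have shared for research purposes.

    We thank everyone who shared these corpora, and provide the description and license for each corpus.
    For more details about a corpus, see the description below;
    if you use a corpus for research or commercial purposes, see the license below.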

    # Description
    Author : e9t@github
    Repository : https://github.com/e9t/nsmc
    References : www.lucypark.kr/docs/2015-pyconkr/#39

    Naver sentiment movie corpus v1.0
    This is a movie review dataset in the Korean language.
    Reviews were scraped from Naver Movies.

    The dataset construction is based on the method noted in
    [Large movie review dataset][^1] from Maas et al., 2011.

    [^1]: http://ai.stanford.edu/~amaas/data/sentiment/

    # License
    CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
    Details in https://creativecommons.org/publicdomain/zero/1.0/

 
[nsmc] download ratings_train.txt: 14.6MB [00:00, 92.5MB/s]                            
[nsmc] download ratings_test.txt: 4.90MB [00:00, 17.2MB/s]                            
 

Save the movie reviews in NSMC as plain text files in a designated directory on the Colab local filesystem.

In [ ]:
import os
def write_lines(path, lines):
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(f'{line}\n')
write_lines('/content/train.txt', nsmc.train.get_all_texts())
write_lines('/content/test.txt', nsmc.test.get_all_texts())
 

Building the GPT Tokenizer - BPE

In [ ]:
# Create the output directory
import os
os.makedirs('/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/bbpe', exist_ok=True)
In [ ]:
# Build a byte-level BPE vocabulary
from tokenizers import ByteLevelBPETokenizer
bytebpe_tokenizer = ByteLevelBPETokenizer()
bytebpe_tokenizer.train(
    files=['/content/train.txt', '/content/test.txt'],  # training corpus files, passed as a list
    vocab_size=1000,  # target vocabulary size
    special_tokens=['[PAD]'],  # special tokens to add
)
bytebpe_tokenizer.save_model('/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/bbpe')
Out[ ]:
['/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/bbpe/vocab.json',
 '/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/bbpe/merges.txt']
In [ ]:
import json
with open('/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/bbpe/vocab.json', 'r') as f:
    json_data = json.load(f)
print(json.dumps(json_data) )
 
{"[PAD]": 0, "!": 1, "\"": 2, "#": 3, "$": 4, "%": 5, "&": 6, "'": 7, "(": 8, ")": 9, "*": 10, "+": 11, ",": 12, "-": 13, ".": 14, "/": 15, "0": 16, "1": 17, "2": 18, "3": 19, "4": 20, "5": 21, "6": 22, "7": 23, "8": 24, "9": 25, ":": 26, ";": 27, "<": 28, "=": 29, ">": 30, "?": 31, "@": 32, "A": 33, "B": 34, "C": 35, "D": 36, "E": 37, "F": 38, "G": 39, "H": 40, "I": 41, "J": 42, "K": 43, "L": 44, "M": 45, "N": 46, "O": 47, "P": 48, "Q": 49, "R": 50, "S": 51, "T": 52, "U": 53, "V": 54, "W": 55, "X": 56, "Y": 57, "Z": 58, "[": 59, "\\": 60, "]": 61, "^": 62, "_": 63, "`": 64, "a": 65, "b": 66, "c": 67, "d": 68, "e": 69, "f": 70, "g": 71, "h": 72, "i": 73, "j": 74, "k": 75, "l": 76, "m": 77, "n": 78, "o": 79, "p": 80, "q": 81, "r": 82, "s": 83, "t": 84, "u": 85, "v": 86, "w": 87, "x": 88, "y": 89, "z": 90, "{": 91, "|": 92, "}": 93, "~": 94, "\u00a1": 95, "\u00a2": 96, "\u00a3": 97, "\u00a4": 98, "\u00a5": 99, "\u00a6": 100, "\u00a7": 101, "\u00a8": 102, "\u00a9": 103, "\u00aa": 104, "\u00ab": 105, "\u00ac": 106, "\u00ae": 107, "\u00af": 108, "\u00b0": 109, "\u00b1": 110, "\u00b2": 111, "\u00b3": 112, "\u00b4": 113, "\u00b5": 114, "\u00b6": 115, "\u00b7": 116, "\u00b8": 117, "\u00b9": 118, "\u00ba": 119, "\u00bb": 120, "\u00bc": 121, "\u00bd": 122, "\u00be": 123, "\u00bf": 124, "\u00c0": 125, "\u00c1": 126, "\u00c2": 127, "\u00c3": 128, "\u00c4": 129, "\u00c5": 130, "\u00c6": 131, "\u00c7": 132, "\u00c8": 133, "\u00c9": 134, "\u00ca": 135, "\u00cb": 136, "\u00cc": 137, "\u00cd": 138, "\u00ce": 139, "\u00cf": 140, "\u00d0": 141, "\u00d1": 142, "\u00d2": 143, "\u00d3": 144, "\u00d4": 145, "\u00d5": 146, "\u00d6": 147, "\u00d7": 148, "\u00d8": 149, "\u00d9": 150, "\u00da": 151, "\u00db": 152, "\u00dc": 153, "\u00dd": 154, "\u00de": 155, "\u00df": 156, "\u00e0": 157, "\u00e1": 158, "\u00e2": 159, "\u00e3": 160, "\u00e4": 161, "\u00e5": 162, "\u00e6": 163, "\u00e7": 164, "\u00e8": 165, "\u00e9": 166, "\u00ea": 167, "\u00eb": 168, "\u00ec": 169, "\u00ed": 170, "\u00ee": 
171, "\u00ef": 172, "\u00f0": 173, "\u00f1": 174, "\u00f2": 175, "\u00f3": 176, "\u00f4": 177, "\u00f5": 178, "\u00f6": 179, "\u00f7": 180, "\u00f8": 181, "\u00f9": 182, "\u00fa": 183, "\u00fb": 184, "\u00fc": 185, "\u00fd": 186, "\u00fe": 187, "\u00ff": 188, "\u0100": 189, "\u0101": 190, "\u0102": 191, "\u0103": 192, "\u0104": 193, "\u0105": 194, "\u0106": 195, "\u0107": 196, "\u0108": 197, "\u0109": 198, "\u010a": 199, "\u010b": 200, "\u010c": 201, "\u010d": 202, "\u010e": 203, "\u010f": 204, "\u0110": 205, "\u0111": 206, "\u0112": 207, "\u0113": 208, "\u0114": 209, "\u0115": 210, "\u0116": 211, "\u0117": 212, "\u0118": 213, "\u0119": 214, "\u011a": 215, "\u011b": 216, "\u011c": 217, "\u011d": 218, "\u011e": 219, "\u011f": 220, "\u0120": 221, "\u0121": 222, "\u0122": 223, "\u0123": 224, "\u0124": 225, "\u0125": 226, "\u0126": 227, "\u0127": 228, "\u0128": 229, "\u0129": 230, "\u012a": 231, "\u012b": 232, "\u012c": 233, "\u012d": 234, "\u012e": 235, "\u012f": 236, "\u0130": 237, "\u0131": 238, "\u0132": 239, "\u0133": 240, "\u0134": 241, "\u0135": 242, "\u0136": 243, "\u0137": 244, "\u0138": 245, "\u0139": 246, "\u013a": 247, "\u013b": 248, "\u013c": 249, "\u013d": 250, "\u013e": 251, "\u013f": 252, "\u0140": 253, "\u0141": 254, "\u0142": 255, "\u0143": 256, "\u0120\u00ec": 257, "\u0120\u00eb": 258, "\u00ec\u013f": 259, "\u00eb\u012d": 260, "\u00ed\u0137": 261, "\u00ea\u00b0": 262, "..": 263, "\u00ec\u013f\u00b4": 264, "\u00eb\u012d\u00a4": 265, "\u00eb\u012c": 266, "\u00ec\u0139": 267, "\u00ea\u00b3": 268, "\u00ec\u00a7": 269, "\u00eb\u012c\u0136": 270, "\u00ec\u0140": 271, "\u00eb\u00a7": 272, "\u00ed\u013b": 273, "\u00ea\u00b3\u0142": 274, "\u00ec\u0142": 275, "\u00ed\u013b\u0136": 276, "\u013a\u0123": 277, "\u0120\u00ea": 278, "\u00eb\u0131": 279, "\u00ec\u0137": 280, "\u00e3\u0127": 281, "\u013a\u0123\u00ed\u013b\u0136": 282, "\u00ec\u013c": 283, "\u00ec\u00a7\u0122": 284, "\u00ed\u0137\u013a": 285, "\u00ea\u00b0\u0122": 286, "\u00eb\u0124": 287, 
"\u00ea\u00b2": 288, "\u00ec\u0126": 289, "\u0120\u00ec\u0140": 290, "\u00ac\u00eb": 291, "\u00ea\u00b8": 292, "\u0120\u00ec\u0137": 293, "\u00eb\u0131\u0126": 294, "\u0120\u00ed": 295, "\u00eb\u0135": 296, "\u00eb\u00a6": 297, "\u00ec\u0139\u0132": 298, "\u0120\u00ec\u013f": 299, "\u00ed\u0137\u013e": 300, "\u0120\u00ec\u013a\u0123\u00ed\u013b\u0136": 301, "\u012a\u00eb": 302, "\u00b3\u00b4": 303, "\u00ec\u012d": 304, "\u0138\u00b4": 305, "\u00ec\u013f\u013a": 306, "\u00ea\u00b8\u00b0": 307, "\u00e3\u0127\u012d": 308, "\u0120\u00ec\u0139": 309, "\u00ec\u013f\u0122": 310, "\u00eb\u00a1": 311, "\u00eb\u012f": 312, "\u00ec\u013f\u0126": 313, "\u013f\u00bc": 314, "\u00eb\u0124\u013a": 315, "\u00ea\u00b2\u012e": 316, "\u0120\u00ec\u013f\u00b4": 317, "\u00ec\u0126\u013e": 318, "\u0120\u00eb\u00a7": 319, "\u00ec\u013c\u0136": 320, "\u00ec\u012c": 321, "\u00ec\u0138\u00b4": 322, "\u00eb\u00a1\u013e": 323, "\u0120\u00eb\u0124": 324, "\u00eb\u00a7\u012e": 325, "\u00eb\u013f\u00bc": 326, "\u00eb\u00a6\u00ac": 327, "\u0120\u00ec\u0142": 328, "\u00b7\u00b8": 329, "\u00eb\u012d\u012a": 330, "\u00eb\u0135\u00a4": 331, "\u00eb\u00a5": 332, "\u00ea\u00b1": 333, "\u00ec\u0137\u0126": 334, "\u00eb\u0142": 335, "...": 336, "\u00eb\u00a9": 337, "\u00ac\u00b4": 338, "\u00ec\u013e": 339, "\u0120\u00ea\u00b0": 340, "\u00ec\u013f\u00b8": 341, "\u00e3\u0127\u012d\u00e3\u0127\u012d": 342, "\u00af\u00b8": 343, "\u00eb\u012f\u00b0": 344, "\u0120\u00ec\u00a7": 345, "\u00eb\u0126": 346, "\u0120\u00ec\u0137\u0126": 347, "\u012e\u0122": 348, "\u00eb\u0141": 349, "\u00ec\u013a": 350, "\u00ea\u00b5": 351, "\u00ed\u0137\u00b4": 352, "\u0120\u00eb\u00b3\u00b4": 353, "\u00eb\u00a9\u00b4": 354, "\u00ec\u0125": 355, "\u00ec\u013a\u0123\u00ed\u013b\u0136": 356, "\u0120\u00ec\u012d": 357, "\u0120\u00ea\u00b7\u00b8": 358, "\u00ea\u00b9": 359, "\u00eb\u00b0": 360, "\u0120\u00eb\u00aa": 361, "\u00ec\u0142\u0132": 362, "\u00ec\u012d\u013e": 363, "\u012a\u013a": 364, "\u0120\u00eb\u012d": 365, 
"\u00eb\u00b3\u00b4": 366, "\u00ec\u013f\u012e": 367, "\u00ec\u012c\u00a4": 368, "\u00a3\u00bc": 369, "\u00eb\u0140": 370, "\u0120\u00eb\u00b0": 371, "\u00ec\u013e\u00bc": 372, "\u00eb\u0126\u00a4": 373, "\u0120\u00ec\u0140\u00ac\u00eb": 374, "\u00eb\u00a5\u00bc": 375, "\u00eb\u00a7\u0132": 376, "\u0120\u00ec\u00a2": 377, "!!": 378, "\u00eb\u00b6": 379, "\u00ec\u00a4": 380, "\u0120\u00ec\u0124": 381, "\u00ea\u00b1\u00b0": 382, "\u00ec\u0124": 383, "\u00ec\u013b": 384, "\u00eb\u012e\u0122": 385, "\u00eb\u012d\u012a\u00eb\u012d\u00a4": 386, "\u012a\u00eb\u00ac\u00b4": 387, "\u00ec\u0140\u0132": 388, "\u00eb\u012c\u0136\u00eb\u012f\u00b0": 389, "\u00ec\u013d": 390, "\u0120\u00eb\u00b3": 391, "\u00ec\u0142\u0137": 392, "\u0120\u00eb\u0126": 393, "\u00eb\u00a7\u012a": 394, "\u00ea\u00b9\u012e": 395, "\u00ec\u00b2": 396, "\u00ec\u0139\u0128": 397, "\u0120\u00ec\u0139\u0128": 398, "\u00ec\u0139\u012a": 399, "\u0120\u00eb\u0124\u013a": 400, "\u0120\u00ed\u0137\u013a": 401, "\u00ec\u013c\u00b0": 402, "\u0120\u00eb\u00b4": 403, "\u00ec\u00b9": 404, "\u00ec\u0137\u00bc": 405, "\u0120\u00ec\u00a2\u012d": 406, "\u00ec\u00a3\u00bc": 407, "\u00ec\u00a7\u0126": 408, "\u0120\u00eb\u012d\u00a4": 409, "\u00ec\u012a\u013a": 410, "\u00ed\u0138": 411, "\u00eb\u00b3": 412, "\u00eb\u00b2": 413, "\u00ec\u0142\u0123": 414, "\u00b5\u013e": 415, "\u00ec\u0140\u00a5": 416, "\u00ec\u0140\u012a": 417, "\u00ec\u0140\u0133": 418, "\u00ec\u0142\u0126": 419, "\u00ec\u0125\u0123": 420, "\u00eb\u00aa": 421, "....": 422, "\u0120\u00ec\u0125": 423, "\u0120\u00ec\u0142\u0137": 424, "\u00ec\u00a7\u013e": 425, "\u00ec\u0128": 426, "\u00b0\u012e": 427, "\u00ec\u013e\u00bc\u00eb\u00a1\u013e": 428, "\u0120\u00ea\u00b2": 429, "\u0120\u00ec\u0140\u012a": 430, "\u00ec\u00a7\u0122\u00eb\u00a7\u012e": 431, "\u00ed\u013a": 432, "\u00ea\u00b0\u0126": 433, "\u0120\u00ec\u0139\u00b0": 434, "\u00ed\u0137\u013a\u00ea\u00b3\u0142": 435, "\u0120\u00ec\u013b": 436, "\u00ac\u00eb\u0140": 437, "\u00ea\u00b3\u00bc": 438, 
"\u0132\u00eb\u0131": 439, "\u00ec\u013a\u00a4": 440, "\u0120\u00ec\u012c": 441, "\u0120\u00eb\u0135": 442, "\u00eb\u0124\u00b4": 443, "\u0120\u00ea\u00b8": 444, "\u0131\u012b": 445, "\u00e3\u0127\u0142": 446, "\u0120\u00eb\u0126\u012a\u00eb\u00ac\u00b4": 447, "\u00eb\u0141\u00b0": 448, "\u00eb\u0127": 449, "\u0120\u00ec\u0138\u00b4": 450, "\u0120\u00ec\u013a": 451, "\u0120\u00eb\u00a7\u012e": 452, "\u00ed\u0125": 453, "\u0120\u00ec\u0140\u00ac\u00eb\u00af\u00b8": 454, "\u0120\u00ec\u00a7\u0122": 455, "\u00b9\u0126": 456, "\u00eb\u0136": 457, "\u00ea\u00b7\u00b8": 458, "\u00ec\u00b0": 459, "\u00ed\u0140": 460, "\u00eb\u0125": 461, "\u00ec\u0139\u0132\u00ec\u0126\u013e": 462, "\u0120\u00eb\u0124\u00b4": 463, "\u00eb\u0126\u00a4\u00ec\u013c\u0136": 464, "\u00ea\u00b1\u00b4": 465, "\u0132\u013a": 466, "\u0120\u00ed\u0137\u013e": 467, "\u00eb\u0135\u013e": 468, "\u0120\u00ec\u012d\u013e": 469, "\u00ed\u0128": 470, "\u0120\u00eb\u00b6": 471, "\u00ec\u0137\u013a": 472, "\u00ed\u0137\u0142": 473, "\u0120\u00ec\u0126": 474, "\u0137\u012e": 475, "\u00ec\u00a1": 476, "\u00ec\u0140\u00ac\u00eb": 477, "\u00ec\u0139\u00b0": 478, "\u0120\u00ec\u00a7\u0126": 479, "\u0120\u00eb\u00b4\u00a4": 480, "\u00eb\u00a3": 481, "\u0120\u00ea\u00b0\u0122": 482, "\u00ec\u013c\u00b4": 483, "\u0120\u00ec\u012c\u00a4": 484, "\u00ea\u00b3\u00b5": 485, "\u0120\u00ec\u00b5\u013e": 486, "\u00eb\u00b4": 487, "\u00ec\u0126\u00b1": 488, "\u00ec\u013b\u0122": 489, "\u0120\u00eb\u0131": 490, "\u00eb\u00af\u00b8": 491, "\u0120\u00ec\u013c": 492, "\u00ec\u0139\u00ac": 493, "\u00ea\u00b0\u0123": 494, "\u00ec\u012c\u00b5": 495, "\u0120\u00ec\u00b0": 496, "\u00ea\u00b2\u0125": 497, "\"\"": 498, "\u00ed\u0140\u012a": 499, "\u0120\u00eb\u012f": 500, "\u00ec\u0142\u013e": 501, "\u01201": 502, "\u00ed\u0136": 503, "\u00ec\u00b9\u013a": 504, "\u00ec\u00b6": 505, "\u00ec\u0138": 506, "\u00ec\u013c\u00a9": 507, "\u0120\u00ea\u00b8\u00b0": 508, "\u00ed\u0137\u013a\u00eb\u012c\u0136": 509, "\u0120\u00eb\u012e\u0122": 
510, "\u0120\u00ec\u0128": 511, "\u00eb\u00b6\u0122": 512, "\u00eb\u0142\u00a4": 513, "\u00ec\u013f\u00bc": 514, "\u0120\u00ec\u0139\u00b0\u00ea\u00b8\u00b0": 515, "\u00ed\u0128\u0142": 516, "\u00eb\u0140\u013a": 517, "\u0120\u00ea\u00b0\u0132\u00eb\u0131": 518, "\u00ed\u0138\u012a": 519, "\u00ed\u012e": 520, "\u0120\u00ec\u0140\u00ac\u00eb\u00b0\u012e": 521, "\u0131\u012b\u00ec\u0142\u0132": 522, "\u00eb\u00ac": 523, "\u0120\u00ec\u012a\u013a": 524, "\u0126\u00b0": 525, "\u00ea\u00b5\u00ac": 526, "\u0120\u00eb\u00aa\u00a8": 527, "\u00ec\u00a6": 528, "\u00ed\u0137\u00a8": 529, "\u00eb\u00a3\u00a8": 530, "\u00ec\u0124\u00ac": 531, "\u00ec\u0138\u00b4\u00ec\u013c\u0136": 532, "\u0120\u00ec\u0142\u0137\u00eb\u00a7\u0132": 533, "\u0120\u00ec\u0142\u0126": 534, "\u0120\u00ec\u0124\u00ac\u00eb\u0140": 535, "\u013f\u00ea\u00b0\u0123": 536, "\u00ea\u00b5\u0143": 537, "\u0120\u00ec\u013e": 538, "\u00e3\u0127\u0130": 539, "\u0120\u00ec\u00a4": 540, "\u00ec\u012d\u0142": 541, "\u00ed\u0126\u00b0": 542, "\u0120\u00ec\u0140\u013a": 543, "\u00ec\u00a2": 544, "\u00ec\u00a4\u0133": 545, "\u0120\u00ec\u0137\u012c": 546, "\u0120\u00eb\u00ac\u00b4": 547, "\u00eb\u00b6\u0126": 548, "\u00ed\u012c": 549, "\u00eb\u0141\u00ac": 550, "\u00ec\u0127": 551, "\u00e3\u0127\u012d\u00e3\u0127\u012d\u00e3\u0127\u012d\u00e3\u0127\u012d": 552, "\u00ea\u00b2\u0142": 553, "\u00ec\u012c\u00b5\u00eb\u012d\u012a\u00eb\u012d\u00a4": 554, "\u00ec\u013f\u00b4\u00eb\u012d\u00a4": 555, "\u00ed\u0130": 556, "\u0120\u00eb\u012f\u0136": 557, "\u00ec\u0126\u00b8": 558, "\u0120\u00ec\u0137\u012a": 559, "\u00ed\u0137\u013a\u00eb\u012d\u00a4": 560, "\u0120\u00eb\u012c": 561, "\u0120\u00ec\u00a1": 562, "\u00eb\u0142\u012a": 563, "\u0120\u00ea\u00b1": 564, "\u0120\u00ec\u00a3\u00bc": 565, "\u00ea\u00b0\u013b": 566, "\u00b0\u00ec\u013c\u00b0": 567, "\u00eb\u00a5\u00b4": 568, "\u0135\u00b0": 569, "\u00eb\u0137\u012e": 570, "\u0120\u00ec\u013d": 571, "\u00ec\u0128\u012e": 572, "\u00ea\u00b0\u013e": 573, "~~": 574, 
"\u0120\u00eb\u0143": 575, "\u00ec\u0140\u0126": 576, "\u00ed\u0128\u0142\u00eb\u00a6\u00ac": 577, "\u0120\u00eb\u0135\u013e": 578, "\u0120\u00ec\u0125\u013f\u00ea\u00b0\u0123": 579, "\u00ec\u0125\u013f": 580, "\u0120\u00ec\u00a7\u0126\u00ec\u00a7\u013e": 581, "\u00e3\u0127\u00a1": 582, "\u00ea\u00b0\u0132": 583, "\u0120\u00eb\u00a7\u012a": 584, "\u00eb\u0142\u00a5": 585, "\u00eb\u0135\u0142": 586, "\u00eb\u012f\u013a": 587, "\u0120\u00ec\u013f\u00b8": 588, "\u00ec\u0140\u0127": 589, "\u00ec\u012d\u00a4": 590, "\u0120\u00ea\u00b0\u013b": 591, "\u0120\u00ec\u00b5\u013e\u00ea\u00b3\u0142": 592, "\u00ed\u0123": 593, "\u0120\u00ec\u00b2": 594, "\u0120\u00eb\u00a7\u0132": 595, "\u0120\u00ea\u00b5": 596, "\u0120\u00eb\u00aa\u00bb": 597, "\u00ea\u00b7": 598, "\u00eb\u0124\u013e": 599, "\u00eb\u0135\u00af": 600, "\u00eb\u013f\u00bc\u00eb\u00a7\u012a": 601, "\u00eb\u0135\u00a4\u00ec\u013f\u00b4": 602, "\u00eb\u00ac\u00b4": 603, "\u0120\u00ea\u00b0\u013e": 604, "\u0120\u00ec\u0139\u00ac": 605, "\u00eb\u0127\u0126": 606, "\u00ec\u0137\u0127": 607, "\u00ed\u0134": 608, "\u00e3\u0127\u0142\u00e3\u0127\u0142": 609, "\u0120\u00ec\u0140\u0132": 610, "\u00eb\u0136\u0136": 611, "\u0120\u00ec\u0142\u013e": 612, "\u0120\u00eb\u012c\u0132": 613, "\u0120\u00eb\u0123": 614, "\u00ea\u00b9\u012e\u00ec\u00a7\u0122": 615, "\u00ea\u00b8\u012a": 616, "\u00eb\u00a6\u0126": 617, "\u0120\u00ec\u013b\u013e": 618, "\u00eb\u0125\u00a5": 619, "\u00ed\u0130\u00b8": 620, "\u0120\u00ea\u00b4": 621, "\u00ed\u0125\u0122": 622, "\u00ed\u0137\u013a\u00ea\u00b2\u012e": 623, "\u00eb\u00b9\u0126": 624, "\u00eb\u00b3\u00b8": 625, "\u0120\u00eb\u00b3\u00b8": 626, "\u0120\u00ec\u00b6": 627, "\u00eb\u0142\u0129": 628, "\u0133\u0132": 629, "\u00ec\u013a\u0122": 630, "\u0120\u00ec\u0138": 631, "\u0120\u00ec\u0137\u012e": 632, "\u0120\u00eb\u0124\u00a8": 633, "\u00ed\u0131": 634, "\u0120\u00ed\u013a": 635, "\u00ec\u013f\u00b4\u00eb\u013f\u00bc": 636, "\u00ed\u012c\u00b8": 637, "\u0120\u00ea\u00b2\u0125": 638, 
"\u00ec\u00a4\u0122": 639, "\u0120\u00ec\u013a\u00a4": 640, "\u0120\u00ed\u0131\u012b\u00ec\u0142\u0132": 641, "\u00eb\u00b3\u00b4\u00eb\u012d\u00a4": 642, "\u00ec\u0139\u012a\u00eb\u012d\u00a4": 643, "\u0120\u00ea\u00b7": 644, "\u00bd\u0136": 645, "\u0120\u00eb\u00b3\u00b4\u00ea\u00b3\u0142": 646, "\u00ec\u0127\u013a": 647, "\u0120\u00eb\u00b3\u00bc": 648, "\u0120\u00ec\u013f\u00b4\u00eb\u0141\u00b0": 649, "\u00ec\u013e\u0142": 650, "\u00ea\u00b1\u00b8": 651, ";;": 652, "\u00ec\u0142\u0122": 653, "\u00ec\u0142\u0137\u00eb\u00a7\u0132": 654, "\u0120\u00ec\u0128\u012e": 655, "\u00eb\u00a7\u012b": 656, "\u0120\u00eb\u00a9": 657, "\u00eb\u012f\u0136": 658, "\u00ea\u00b4": 659, "??": 660, "\u00ec\u00b2\u00b4": 661, "\u00eb\u00b9": 662, "\u00ec\u00a7\u0126\u00ec\u00a7\u013e": 663, "\u00ec\u0136": 664, "\u0120\u00eb\u00b9": 665, "\u0120\u00eb\u00b9\u0126": 666, "\u00eb\u0123": 667, "\u013d\u0126": 668, "\u00ec\u012d\u00ac": 669, "\u0120\u00ea\u00b0\u0132\u00eb\u0131\u013b": 670, "\u0120\u00ea\u00b9": 671, "\u0120\u00eb\u00a7\u0130": 672, "\u0120\u00ed\u013f": 673, "\u0120\u00ec\u0137\u0126\u00eb\u012d": 674, "\u0120\u00ec\u012c\u00a4\u00ed\u0128\u0142\u00eb\u00a6\u00ac": 675, "\u00ec\u0140\u00ac\u00eb\u00af\u00b8": 676, "\u00eb\u0142\u012a\u00ea\u00b8\u00b0": 677, "\u00ed\u0134\u012a": 678, "\u00ed\u0137\u00b4\u00ec\u0126\u013e": 679, "\u00ec\u0137\u012a": 680, "\u00ec\u013d\u0132": 681, "\u0120\u00eb\u00af\u00b8": 682, "\u0120\u00eb\u0136": 683, "\u0120\u00eb\u012a": 684, "\u0120\u00ec\u0140\u0133": 685, "\u00eb\u00b2\u0126": 686, "\u0135\u00b0\u00eb\u0142\u012a\u00ea\u00b8\u00b0": 687, "\u00ec\u012a": 688, "\u0120\u00ec\u013e\u0142": 689, "\u0120\u00ec\u0140\u00a5": 690, "\u0132\u013e": 691, "\u00a1\u013e": 692, "\u0120\u00eb\u00b2": 693, "\u00eb\u0126\u012a\u00eb\u00ac\u00b4": 694, "\u0120\u00ec\u0124\u00ac\u00eb\u0140\u012e": 695, "\u00eb\u0125\u0132": 696, "\u0120\u00eb\u00b6\u0122": 697, "\u00eb\u00a9\u00b4\u00ec\u0126\u013e": 698, 
"\u0120\u00eb\u00a7\u012e\u00eb\u0135\u00a4": 699, "\u0120\u00ec\u013f\u00bc": 700, "\u0120\u00ed\u0138": 701, "\u00eb\u00a8": 702, "\u00eb\u012d\u00a8": 703, "\u00eb\u0132\u013a": 704, "\u00ec\u00a2\u012d": 705, "\u0120\u00e3\u0127\u012d\u00e3\u0127\u012d": 706, "\u00ec\u00b5\u013e": 707, "\u00ea\u00b3\u0126": 708, "\u012a\u00eb\u012f": 709, "\u0120\u00ec\u00a7\u0122\u00eb\u00a3\u00a8": 710, "\u00eb\u00ac\u00b8": 711, "\u00eb\u00b2\u012a": 712, "\u0120\u00eb\u0135\u013e\u00eb\u013f\u00bc\u00eb\u00a7\u012a": 713, "\u0120\u00ec\u0137\u012a\u00eb": 714, "\u00ec\u0142\u0123\u00ec\u013f\u00b8": 715, "\u0120\u00ec\u0124\u00ac": 716, "\u0120\u00ec\u00a4\u0133": 717, "\u00eb\u00aa\u0127": 718, "\u00ec\u0126\u0142": 719, "\u00ed\u012d": 720, "\u0120\u00ea\u00b3\u00b5": 721, "\u00eb\u0142\u0129\u00ea\u00b2\u012e": 722, "\u00eb\u012d\u00a4\u00eb\u012c\u0136": 723, "\u0126\u00eb\u00a1\u013e": 724, "\u0120\u00eb\u0123\u013f": 725, "\u00ec\u013a\u0123": 726, "\u00e3\u0127\u013e": 727, "\u0120\u00eb\u00b0\u00b0\u00ec\u013c\u00b0": 728, "\u00ec\u00a7\u0123": 729, "\u00ec\u0138\u00b5": 730, "\u00ec\u00b6\u013e": 731, "\u00eb\u012d\u00b9": 732, "\u0120\u00eb\u0124\u00b4\u00ec\u013c\u00a9": 733, "\u00eb\u00a6\u00ac\u00ea\u00b3\u0142": 734, "\u00eb\u00a6\u00b0": 735, "\u00eb\u00a7\u013f": 736, "\u00eb\u00a6\u00ac\u00eb": 737, "\u00ec\u013d\u012e": 738, "\u00a3\u00bd": 739, "\u00eb\u0140\u0133": 740, "\u0120\u00eb\u0132\u013a": 741, "\u0120\u00ec\u00a1\u00b0": 742, "\u00ed\u013c": 743, "\u00eb\u0131\u013b": 744, "\u00eb\u00af": 745, "\u0120\u00ec\u013c\u00b0": 746, "\u0120\u00ec\u00a2\u0122": 747, "\u0120\u00ed\u0137": 748, "\u0120\u00ed\u0137\u00b4": 749, ",,": 750, "\u0126\u00ec\u0142\u0126": 751, "\u012010": 752, "\u00ea\u00b9\u013f": 753, "\u00ec\u00a1\u00b0": 754, "^^": 755, "\u0120\u00eb\u0143\u0132": 756, "\u0120\u00eb\u0128": 757, "\u00eb\u00b3\u00b4\u00ea\u00b3\u0142": 758, "\u0120\u00ec\u0137\u0142": 759, "\u00ed\u0124": 760, "\u00e3\u0127\u0130\u00e3\u0127\u0130": 761, 
"\u00eb\u00b4\u00a4": 762, "\u00ec\u0140\u00ac": 763, "\u00a1\u00ec\u0127\u013a": 764, "\u00ec\u00a7\u0122\u00eb\u00a7\u012b": 765, "\u0120\u00ea\u00b0\u0132\u00eb\u0131\u0127": 766, "\u01202": 767, "\u00ea\u00b0\u0132\u00eb\u0131": 768, "\u00eb\u00ac\u00bc": 769, "\u0120\u00ec\u0140\u00ac\u00eb\u00af\u00b8\u00ec\u0140\u012a": 770, "\u00eb\u00a5\u00b8": 771, "\u0120\u00eb\u00b4\u0132": 772, "\u0120\u00ec\u012d\u00b6": 773, "\u0120\u00ea\u00b0\u0132": 774, "\u00eb\u0128": 775, "\u00ec\u013e\u00bc\u00eb\u00a9\u00b4": 776, "\u0120\u00ec\u013a\u0123\u00ed\u013b\u0136\u00eb\u00a5\u00bc": 777, "\u00ec\u0140\u00ac\u00eb\u00b0\u012e": 778, "\u0120\u00ec\u00b9": 779, "\u0120\u00ed\u012e": 780, "\u00ec\u0124\u00ac\u00eb\u0140": 781, "\u00ea\u00b8\u00b4": 782, "\u00eb\u00aa\u00a8": 783, "\u00eb\u00a6\u00b4": 784, "\u00ec\u0137\u0142": 785, "\u0120\u00ec\u0124\u00ac\u00eb\u0140\u0133": 786, "\u0120\u00ec\u0142\u0122": 787, "\u00ed\u013a\u0126": 788, "\u00ec\u0128\u012f": 789, "\u0120\u00ed\u0139": 790, "\u00ed\u013f": 791, "\u00ed\u0131\u00ac": 792, "\u0120\u00eb\u00aa\u0127": 793, "\u0120\u00ea\u00b3\u0142": 794, "\u0120\u00eb\u013a": 795, "\u0120\u00eb\u00b0\u013a": 796, "\u0120\u00ec\u00a2\u012d\u00ec\u0137\u0126": 797, "\u012a\u00eb\u012f\u013a": 798, "\u0120\u00ec\u013d\u0125": 799, "\u00eb\u0133\u0132": 800, "\u0120\u00ea\u00b1\u00b0": 801, "\u0120\u00ec\u0140\u00ac\u00eb\u00af\u00b8\u00ec\u0139\u0128": 802, "\u0120\u00eb\u0127": 803, "\u00ec\u0139\u0143": 804, "\u0120\u00ec\u00b0\u00b8": 805, "\u00ec\u00a4\u0126": 806, "\u00eb\u00b0\u0136": 807, "\u0120\u00eb\u00b3\u00b4\u00eb\u012c\u0136": 808, "\u00ec\u00b2\u013a": 809, "\u0120\u00ea\u00b7\u00b8\u00eb\u0125\u00a5": 810, "\u0120\u00eb\u00b4\u00a4\u00eb\u012c\u0136\u00eb\u012f\u00b0": 811, "\u00ac\u00bc": 812, "\u00ed\u0137\u013a\u00ec\u00a7\u0122": 813, "\u0120\u00ec\u0139\u0143": 814, "\u00ec\u00a1\u00b1": 815, "\u00ed\u0127": 816, "\u0120\u00eb\u00b0\u0136": 817, "\u0120\u00ec\u013a\u0123": 818, 
"\u0120\u00ec\u0125\u0123": 819, "\u00eb\u0138": 820, "\u00ec\u013e\u0126": 821, "\u00eb\u0135\u00a4\u00ec\u013f\u013a": 822, "\u00ea\u00bb": 823, "\u0120\u00ec\u0135\u00b0\u00eb\u0142\u012a\u00ea\u00b8\u00b0": 824, "\u00eb\u00b0\u0137": 825, "\u0120\u00ec\u0139\u0128\u00eb\u012c\u0136": 826, "\u00ed\u0136\u0126": 827, "\u00e3\u0126": 828, "OO": 829, "\u0120\u00ec\u0140\u0133\u00ed\u0134\u012a": 830, "\u00eb\u0124\u00a8": 831, "\u0120\u00eb\u012d\u00a4\u00ec\u012d\u013e": 832, "\u00a5\u00bc": 833, "\u00eb\u012d\u013a": 834, "\u00eb\u0127\u00b8": 835, "\u00ec\u013f\u00b8\u00ea\u00b3\u00b5": 836, "\u00ec\u00a7\u0122\u00eb\u012c\u0136": 837, "\u0120\u00eb\u00a7\u00a4": 838, "\u00eb\u0141\u00bd": 839, "\u00ec\u0140\u0127\u00eb\u012d\u012a\u00eb\u012d\u00a4": 840, "\u00ed\u0139": 841, "\u00ec\u013c\u00b8": 842, "\u00ec\u00b2\u0143": 843, "\u0120\u00ea\u00b2\u00b0": 844, "\u00ec\u012d\u013f": 845, "\u0120\u00ec\u013a\u0123\u00ed\u013b\u0136\u00eb\u012c\u0136": 846, "\u00ec\u012d\u00b6": 847, "\u00ed\u0131\u012b\u00ec\u0142\u0132": 848, "\u0120\u00eb\u012c\u0132\u00eb\u0124": 849, "\u0120\u00ec\u012d\u00a4": 850, "\u00eb\u00a7\u012a\u00eb": 851, "\u00eb\u0143": 852, "\u0126\u00a4": 853, "\u00eb\u00b0\u013a": 854, "!!!": 855, "\u00ac\u00eb\u00a6": 856, "\u00ec\u0140\u00bc": 857, "\u0120\u00ed\u013b": 858, "\u00ea\u00b2\u00bd": 859, "\u0120\u00ec\u0140\u012a\u00eb\u012c\u0136": 860, "\u00ec\u00a7\u012a": 861, "\u0120\u00ec\u00a2\u012d\u00ec\u013f\u0122": 862, "\u00a9\u00eb\u012d\u012a\u00eb\u012d\u00a4": 863, "\u0120\u00ec\u00a2\u012d\u00ec\u0137\u013a": 864, "\u00ea\u00b4\u0122": 865, "\u00eb\u012d\u00a4\u00ea\u00b3\u0142": 866, "\u00ec\u012b": 867, "\u0120\u00ec\u012d\u013e\u00ea\u00b0\u0126": 868, "\u00ea\u00b8\u00b8": 869, "\u00eb\u013f\u00bc\u00ea\u00b3\u0142": 870, "\u00ec\u0139\u0136": 871, "\u0120\u00ec\u0124\u00b4": 872, "\u00ea\u00b5\u00b0": 873, "\u0120\u00ec\u012d\u0142": 874, "\u00ec\u0139\u00b0\u00ea\u00b8\u00b0": 875, "\u00ec\u0126\u00a4": 876, 
"\u00ec\u0137\u00bc\u00ea\u00b8\u00b0": 877, "\u0120\u00ec\u013a\u0123\u00ed\u013b\u0136\u00ea\u00b0\u0122": 878, "\u00eb\u012d\u00a4\u00ea\u00b0\u0122": 879, "\u00eb\u00b0\u013e": 880, "\u0120\u00ed\u0137\u013a\u00eb\u0124\u013a": 881, "\u00ec\u012c\u00a8": 882, "\u00ba\u0132": 883, "\u00ed\u013c\u012e": 884, "\u00ec\u0139\u0128\u00eb\u012c\u0136": 885, "\u0120\u00ed\u0125": 886, "\u00ea\u00b0\u013b\u00ec\u013f\u0122": 887, "\u0120\u00ec\u00b4": 888, "\u00ec\u0138\u00b4\u00ec\u0126\u013e": 889, "\u0120\u00eb\u0137\u012e": 890, "\u00ec\u00b6\u0136": 891, "\u0120\u00eb\u00a8": 892, "\u00e2\u013b": 893, "\u00eb\u0140\u0122": 894, "\u00eb\u00b4\u0132": 895, "\u0133\u013e": 896, "\u0120\u00ec\u0139\u0128\u00eb\u012d\u00a4": 897, "\u0120\u00ec\u00b2\u013a": 898, "\u00ec\u0126\u00b8\u00ec\u013c\u0136": 899, "\u0120\u00ec\u013a\u012a": 900, "\u00ec\u013f\u00b4\u00eb\u0141\u00b0": 901, "\u0120\u00ed\u0137\u0142": 902, "\u00ec\u0142\u0123\u00ec\u013f\u00b4": 903, "\u00ec\u00b5\u013e\u00ea\u00b3\u0142": 904, "\u00ec\u0123": 905, "\u0120\u00eb\u0127\u00b8": 906, "\u0120\u00ed\u0137\u013a\u00eb\u012c\u0136": 907, "\u00ec\u013f\u00b8\u00eb\u012f\u00b0": 908, "\u00eb\u0132\u013e": 909, "\u0120\u00eb\u0124\u013a\u00ec\u013a\u00a4": 910, "\u00eb\u012d\u00b5": 911, "!!!!": 912, "\u0120\u00eb\u0126\u013a": 913, "\u0120\u00ec\u0137\u0126\u00eb\u012d\u012a": 914, "\u00ec\u00a6\u012a": 915, "\u0120\u00ec\u00a3\u00bd": 916, "\u00ec\u00a6\u013f": 917, "\u0120\u00ed\u0136": 918, "\u013e\u00ec\u00b0": 919, "\u0120\u00ec\u013f\u013a": 920, "\u00ec\u0139\u0132\u00ea\u00b2\u012e": 921, "\u00ec\u013a\u012a": 922, "\u0120\u00ec\u0137\u00a1\u00ec\u0127\u013a": 923, "\u00eb\u0135\u00a4\u00ec\u013f\u0122": 924, "\u00eb\u00af\u00bc": 925, "\u00ec\u013d\u0122": 926, "\u00e3\u0127\u00a1\u00e3\u0127\u00a1": 927, "\u00ec\u00bd\u0136": 928, "\u0120\u00ea\u00bc": 929, "\u0120\u00ec\u00bd\u0136": 930, "\u00ec\u0137\u012c": 931, "\u00ea\u00b7\u00b9": 932, "\u0120\u00eb\u00aa\u00a8\u00eb\u00a5\u00b4": 933, 
"\u0120\u00ed\u0131": 934, "\u00ea\u00b5\u0132": 935, "\u0120\u00eb\u00aa\u00b0": 936, "\u0120\u00ec\u0137\u0126\u00ea\u00b9\u013f": 937, "\u0120\u00ec\u00b6\u0136": 938, "\u00ec\u0142\u00b8": 939, "\u00ec\u0142\u0123\u00ec\u013e\u00bc\u00eb\u00a1\u013e": 940, "\u00ec\u013f\u00b4\u00eb\u0124\u013a": 941, "\u00ec\u013a\u00a8": 942, "\u00ec\u00bc": 943, "\u0120\u00ed\u013d\u0126": 944, "\u01203": 945, "\u0120\u00ea\u00b8\u00b0\u00eb\u012e\u0122": 946, "\u00ec\u00b2\u013e": 947, "\u012e\u00ec\u013f\u00b4": 948, "\u00ea\u00b2\u0142\u00eb\u012d\u00a4": 949, "\u00ed\u012e\u0132": 950, "\u0120\u00ec\u00b5\u013e\u00ea\u00b3\u0142\u00ec\u013f\u013a": 951, "\u00ec\u0140\u012a\u00eb\u012c\u0136": 952, "\u00ea\u00b2\u00a8": 953, "\u0120\u00ed\u0140": 954, "\u00ed\u0137\u013b": 955, "\u00bb\u0136": 956, "\u00eb\u00b6\u0122\u00ed\u0126\u00b0": 957, "\u00ed\u0138\u012b": 958, "\u0120\u00eb\u0138": 959, "\u00eb\u00a9\u00b0": 960, "\u00ec\u0136\u00a8": 961, "\u00ec\u00b4": 962, "\u0120\u00ec\u0126\u00b1": 963, "\u013e\u00ec\u00b0\u00ae": 964, "\u00eb\u012e\u0122\u00eb\u00a1\u013e": 965, "\u0120\u00ec\u013f\u00b4\u00ed\u0137\u00b4": 966, "\u0120\u00eb\u013a\u0132": 967, "\u00ea\u00b8\u012b": 968, "\u00ed\u0137\u013e\u00eb\u012d\u00a4": 969, "\u00ec\u00b0\u00a8": 970, "\u012a\u00eb\u0126\u00a4": 971, "\u00eb\u0124\u0142": 972, "\u00eb\u00a1\u013f": 973, "\u00ed\u012d\u00b0": 974, "\u00ed\u0138\u012a\u00eb\u012d\u00a4": 975, "\u00eb\u012c\u0132": 976, "\u0120\u00ed\u013a\u0126": 977, "\u00ec\u012d\u013e\u00ea\u00b0\u0126": 978, "\u00ed\u013d\u0126": 979, "\u0120\u00ed\u012c": 980, "\u0120\u00ec\u0139\u00b0\u00ec\u00b6\u013e": 981, "\u0120\u00ed\u0138\u012a": 982, "\u0120\u00eb\u0135\u00a4": 983, "\u0120\u00eb\u00a7\u012a\u00ec\u00a7\u0122\u00eb\u00a7\u012b": 984, "\u0120\u00eb\u00b6\u012a": 985, "\u00eb\u00b0\u00b0\u00ec\u013c\u00b0": 986, "\u0120\u00ec\u013f\u00b4\u00eb\u0142\u0129\u00ea\u00b2\u012e": 987, "\u00ec\u0140\u0136": 988, "\u0120\u00eb\u00b6\u0126": 989, 
"\u0120\u00eb\u00a9\u012d": 990, "\u00ec\u00a3": 991, "\u0120\u00ec\u00b2\u013a\u00ec\u013f\u012e": 992, "\u0120\u00ea\u00b4\u0122": 993, "\u0120\u00ec\u013d\u0132": 994, "\u0120\u00ec\u00a7\u013e": 995, "\u0120\u00ec\u013f\u00b4\u00ec\u0137\u00bc\u00ea\u00b8\u00b0": 996, "\u0120\u00ea\u00b7\u00b9": 997, "\u00ec\u0142\u012a": 998, "\u00ec\u0142\u0137\u00eb\u0131\u0126": 999}
In [ ]:
with open("/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/bbpe/merges.txt", 'r') as f:
    print(f.read(100))
 
#version: 0.2 - Trained by `huggingface/tokenizers`
Ġ ì
Ġ ë
ì Ŀ
ë ĭ
í ķ
ê °
. .
ìĿ ´
ëĭ ¤
ë Ĭ
ì Ĺ
ê 
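The vocabulary and merge entries above look garbled because a byte-level BPE tokenizer works on UTF-8 bytes and displays each byte as a printable stand-in character. The sketch below reverses that display mapping, assuming the GPT-2 byte-to-character convention. The helpers `bytes_to_unicode` and `decode_token` are written here for illustration and are not part of the original notebook.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to a printable character (GPT-2 convention)."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every other byte is shifted to an unused code point starting at 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

unicode_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}

def decode_token(token):
    """Turn a displayed byte-level token back into readable text."""
    raw = bytes(unicode_to_byte[ch] for ch in token)
    # A token may hold only part of a character; show those bytes as U+FFFD.
    return raw.decode("utf-8", errors="replace")

# Entry 356 of vocab.json above.
decoded = decode_token("\u00ec\u013a\u0123\u00ed\u013b\u0136")
print(decoded)
```

The character `Ġ` (`\u0120`) that starts many entries is the stand-in for a space byte, so a token beginning with `Ġ` marks the start of a new word.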
 

Building the BERT Tokenizer - WordPiece

In [ ]:
import os
os.makedirs('/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/wordpiece', exist_ok=True)
In [ ]:
# Build a WordPiece vocabulary
from tokenizers import BertWordPieceTokenizer
wordpiece_tokenizer = BertWordPieceTokenizer(lowercase=False)
wordpiece_tokenizer.train(
    files=['/content/train.txt', '/content/test.txt'],
    vocab_size=10000,
)
wordpiece_tokenizer.save_model('/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/wordpiece')
Out[ ]:
['/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/wordpiece/vocab.txt']
In [ ]:
with open("/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/wordpiece/vocab.txt", 'r') as f:
    print(f.read(1000))
 
[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
!
"
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
A
B
C
D
E
F
G
I
K
L
M
N
O
P
R
S
T
V
X
[
]
^
_
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
r
s
t
u
v
w
x
y
z
~
★
♡
♥
ㄱ
ㄴ
ㄷ
ㄹ
ㅁ
ㅂ
ㅅ
ㅇ
ㅈ
ㅉ
ㅋ
ㅎ
ㅏ
ㅗ
ㅜ
ㅠ
ㅡ
ㅣ
가
각
간
갈
감
갑
값
갓
갔
강
갖
같
개
객
갠
갱
걍
거
건
걸
검
겁
것
겉
게
겐
겟
겠
겨
격
견
결
겹
겼
경
계
고
곡
곤
골
곱
곳
공
과
관
광
괜
괴
굉
교
구
국
군
굳
굴
굿
궁
권
귀
규
균
그
극
근
글
금
급
기
긴
길
김
깊
까
깎
깐
깔
깜
깝
깨
꺼
껄
껏
께
껴
꼈
꼬
꼭
꼴
꼽
꽃
꽝
꽤
꾸
꾼
꿀
꿈
꿔
뀌
끄
끈
끊
끌
끔
끝
끼
낀
낄
낌
나
낙
낚
난
날
남
납
낫
났
낭
낮
낳
내
낸
낼
냄
냈
냉
냐
냥
너
넌
널
넘
넣
네
넷
녀
년
념
녕
노
녹
논
놀
놈
농
높
놓
놔
놨
뇌
누
눈
뉴
느
는
늘
늙
능
늦
니
닉
닌
닐
님
닙
닝
다
닥
단
닫
달
닮
담
답
당
닿
대
댄
댓
더
덕
던
덜
덤
덩
데
덴
뎁
도
독
돈
돋
돌
동
돼
됐
되
된
될
됨
됩
됬
두
둘
둥
뒤
뒷
드
득
든
듣
들
듬
듭
듯
등
디
딘
딧
딩
따
딱
딴
딸
땅
때
땐
땜
땡
떄
떠
떡
떤
떨
떻
떼
또
똑
똥
뚝
뚱
뛰
뜨
뜩
뜬
뜻
라
락
란
랄
람
랍
랐
랑
래
랙
랜
램
랫
랬
략
량
러
럭
런
럴
럼
럽
렁
렇
레
렉
렌
려
력
련
렬
렵
렷
렸
령
례
로
록
론
롭
롯
롱
뢰
료
룡
루
룬
류
륜
률
륭
르
른
를
름
리
릭
린
릴
림
립
릿
링
마
막
만
많
말
맘
맙
맛
망
맞
맡
매
맥
맨
맹
머
먹
먼
멀
멈
멋
멍
메
멘
멜
며
면
멸
명
몇
모
목
몬
몰
몸
못
몽
묘
무
묵
문
묻
물
뭉
뭐
뭔
뭘
뮤
므
미
믹
민
믿
밀
밋
밌
밑
바
박
밖
반
받
발
밝
밤
밥
방
배
백
뱀
버
번
벌
범
법
In [ ]:
MYPATH = '/content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/'
 

2-4 Tokenizing

A hands-on walkthrough of tokenizing sentences and turning the resulting tokens into model inputs.

 

Creating GPT Model Inputs

In [ ]:
# Declare the GPT tokenizer
from transformers import GPT2Tokenizer
tokenizer_gpt = GPT2Tokenizer.from_pretrained(MYPATH + 'nlpbook/bbpe')
tokenizer_gpt.pad_token = "[PAD]"
 
file /content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/bbpe/config.json not found
In [ ]:
sentences = [
    "아 더빙.. 진짜 짜증나네요 목소리",
    "흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나",
    "별루 였다..",
]
In [ ]:
# Inspect the tokenization result
tokenized_sentences = [tokenizer_gpt.tokenize(sentence) for sentence in sentences]
tokenized_sentences
Out[ ]:
[['ìķĦ',
  'ĠëįĶ',
  'ë¹',
  'Ļ',
  '..',
  'Ġì§Ħì§ľ',
  'Ġì§ľ',
  'ì¦Ŀ',
  'ëĤĺ',
  'ëĦ¤ìļĶ',
  'Ġëª',
  '©',
  'ìĨĮ',
  '리'],
 ['íĿ',
  'ł',
  '...',
  'íı¬',
  'ìĬ¤',
  'íĦ°',
  'ë³´ê³ł',
  'Ġì´',
  'Īë',
  'Ķ',
  '©',
  'ìĺģíĻĶ',
  'ì¤Ħ',
  '....',
  'ìĺ¤',
  'ë²Ħ',
  'ìĹ°ê¸°',
  'ì¡°',
  'ì°¨',
  'Ġê°Ģ',
  'ë³',
  'į',
  'ì§Ģ',
  'ĠìķĬ',
  '구',
  'ëĤĺ'],
 ['ë³', 'Ħ', '루', 'Ġìĺ', 'Ģ', 'ëĭ¤', '..']]
In [ ]:
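# Build the GPT model inputs (the actual values fed to the model)
batch_inputs = tokenizer_gpt(
    sentences,
    padding="max_length",  # pad every sentence up to max_length
    max_length=12,  # maximum sentence length, counted in tokens
    truncation=True,  # cut off sentences longer than max_length
)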
In [ ]:
batch_inputs.keys()
Out[ ]:
dict_keys(['input_ids', 'attention_mask'])
In [ ]:
# Each token from the tokenization result, converted to its vocabulary index
batch_inputs['input_ids']
Out[ ]:
[[334, 557, 662, 248, 263, 581, 995, 917, 315, 464, 361, 103],
 [791, 255, 336, 792, 368, 542, 758, 888, 302, 243, 103, 356],
 [412, 227, 530, 451, 223, 265, 263, 0, 0, 0, 0, 0]]
In [ ]:
# 1 for regular tokens, 0 for padding tokens
batch_inputs['attention_mask']
Out[ ]:
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]
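The effect of `padding="max_length"` and `truncation=True` can be reproduced by hand. This is a minimal sketch, assuming the pad token has index 0 as in the vocabulary built above. The helper `pad_and_truncate` is a made-up name for illustration, not a `transformers` function.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut a token-id list to max_length, then right-pad it with pad_id."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # Real tokens are marked 1, padding positions 0.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# Token ids of the third sentence ("별루 였다.."), taken from the output above.
ids, mask = pad_and_truncate([412, 227, 530, 451, 223, 265, 263], max_length=12)
print(ids)
print(mask)
```

The first two sentences produce more than 12 tokens, so they are truncated and their attention masks contain only 1s.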
 

Creating BERT Model Inputs

In [ ]:
# Declare the BERT tokenizer
from transformers import BertTokenizer
tokenizer_bert = BertTokenizer.from_pretrained(
    MYPATH + 'nlpbook/wordpiece',
    do_lower_case=False,
)
 
file /content/drive/MyDrive/Colab Notebooks/BERT와 GPT로 배우는 자연어처리/nlpbook/wordpiece/config.json not found
In [ ]:
sentences = [
    "아 더빙.. 진짜 짜증나네요 목소리",
    "흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나",
    "별루 였다..",
]
In [ ]:
# Inspect the tokenization result
tokenized_sentences = [tokenizer_bert.tokenize(sentence) for sentence in sentences]
tokenized_sentences
Out[ ]:
[['아', '더빙', '.', '.', '진짜', '짜증나', '##네요', '목소리'],
 ['흠',
  '.',
  '.',
  '.',
  '포스터',
  '##보고',
  '초딩',
  '##영화',
  '##줄',
  '.',
  '.',
  '.',
  '.',
  '오버',
  '##연기',
  '##조차',
  '가볍',
  '##지',
  '않',
  '##구나'],
 ['별루', '였다', '.', '.']]
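The `##` prefix in the output above marks a piece that continues a word rather than starting one. WordPiece tokenization of a single word is commonly described as greedy longest-match-first, which can be sketched as below. The tiny vocabulary is hand-picked from the output above, and the function is an illustration under that assumption, not the `BertTokenizer` implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest match."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical mini-vocabulary built from tokens seen in the output above.
toy_vocab = {"짜증나", "##네요", "포스터", "##보고", "가볍", "##지"}
print(wordpiece_tokenize("짜증나네요", toy_vocab))
print(wordpiece_tokenize("포스터보고", toy_vocab))
```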
In [ ]:
# Build the BERT model inputs (the actual values fed to the model)
batch_inputs = tokenizer_bert(
    sentences,
    padding="max_length",
    max_length=12,
    truncation=True,
)
In [ ]:
batch_inputs.keys()
Out[ ]:
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
In [ ]:
batch_inputs['input_ids']
Out[ ]:
[[2, 620, 2631, 16, 16, 1993, 3678, 1990, 3323, 3, 0, 0],
 [2, 997, 16, 16, 16, 2609, 2045, 2796, 1981, 1033, 16, 3],
 [2, 3274, 9508, 16, 16, 3, 0, 0, 0, 0, 0, 0]]
 

BERT adds two special tokens to every sentence, one at the start and one at the end:

2: [CLS]

3: [SEP]
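How these two tokens interact with padding and truncation can be sketched by hand. The sketch assumes the indices shown in this post ([PAD] = 0, [CLS] = 2, [SEP] = 3). The helper name `build_bert_input` is illustrative and not part of `transformers`.

```python
def build_bert_input(token_ids, max_length, cls_id=2, sep_id=3, pad_id=0):
    """Wrap token ids as [CLS] ... [SEP], then pad to max_length."""
    # Reserve two positions for [CLS] and [SEP] before truncating.
    body = token_ids[: max_length - 2]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return input_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

# Token ids of "별루 였다.." without special tokens, read off the output above.
short_ids, short_mask = build_bert_input([3274, 9508, 16, 16], max_length=12)
print(short_ids)
print(short_mask)
```

Because [CLS] and [SEP] take two of the 12 positions, a truncated sentence keeps only its first 10 tokens, which matches the second row of `input_ids` above.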

In [ ]:
batch_inputs['attention_mask']
Out[ ]:
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]]
 

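Segments

Tokens in the first segment (a document or a sentence) are marked 0, and tokens in the second segment are marked 1.

Each input here is a single sentence, so every value is 0.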

In [ ]:
batch_inputs['token_type_ids']
Out[ ]:
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
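For comparison, here is a sketch of what the segment ids would look like for a two-segment input, following the usual BERT layout `[CLS] A [SEP] B [SEP]`. The helper `build_pair_inputs` and the choice of example ids are illustrative assumptions. The ids themselves are borrowed from the outputs above.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=2, sep_id=3):
    """Join two tokenized segments as [CLS] A [SEP] B [SEP]."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # [CLS], segment A, and the first [SEP] belong to segment 0;
    # segment B and the final [SEP] belong to segment 1.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

pair_ids, pair_types = build_pair_inputs([3274, 9508], [620, 2631])
print(pair_ids)
print(pair_types)
```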