Day 38. Preparing Natural Language Data - NLP & Integer Encoding & Word2Vec

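Day 38 retrospective.

I was excited to start a new unit. Even though I rested well yesterday, I was still tired and struggled through class. I finished today's SQLD study at the academy, so I should go to bed early again tonight. The exam is less than two days away, so I will push a little longer and eat something good this weekend. I also registered for the Big Data Analysis Engineer (빅데이터분석기사) exam, so I need to make a plan for it and start writing my cover letter bit by bit.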

1. NLP

1-1. NLP(Natural Language Processing)

The NLP Process

NLU(Natural Language Understanding)
- Understanding natural language
- Techniques for understanding sentences written in natural language
NLG(Natural Language Generation)
- Generating natural language
- Techniques for generating natural-language sentences

Applications of NLP

Email Filtering
Language Translation
Smart Assistants
Document Analysis
Online Searches
Predictive Text
Automatic Summarization
Sentiment Analysis
Chatbots
Social Media Monitoring

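1-2. Text Preprocessing

Cleaning

Uppercase vs lowercase
- Same meaning: convert uppercase to lowercase.
- Different meaning: leave the case unchanged (e.g. "US" the country vs "us" the pronoun).
Removing rare words
- Remove words that appear only a few times.
  - ex. animals vs faunas (the rarer word)
- Do not remove a word that is rare but important.
Removing noise according to the purpose of the data
- Remove tokens that carry no meaning for the task.
  - ex. articles, pronouns, etc.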
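The cleaning rules above can be sketched in a few lines of plain Python. This is only an illustration: the function name `clean_tokens`, the `min_count` threshold, and the `keep` set are assumptions, and the sketch lowercases every token unconditionally, which ignores the "different meaning" case.

```python
from collections import Counter

def clean_tokens(tokens, min_count=2, keep=frozenset()):
    """Lowercase tokens, then drop those seen fewer than min_count times.

    `keep` lists rare-but-important words that must survive the filter.
    """
    lowered = [t.lower() for t in tokens]
    counts = Counter(lowered)
    return [t for t in lowered if counts[t] >= min_count or t in keep]

tokens = "The animals and the faunas saw the Animals".split()
print(clean_tokens(tokens))
# ['the', 'animals', 'the', 'the', 'animals']
print(clean_tokens(tokens, keep={"faunas"}))
# ['the', 'animals', 'the', 'faunas', 'the', 'animals']
```

A real pipeline would usually count over the whole corpus rather than a single sentence before deciding which words are rare.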

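Stemming and Lemmatization

Stemming
- Stem
  - The core of a word that carries its meaning
    - ex. playing -> play
  - The extracted stem does not preserve part-of-speech information and may not be a real word.
    - ex. having -> hav
- Affix
  - Adds grammatical function to a word.
    - ex. playing -> ing
  - Carries the grammatical (part-of-speech) information that stemming strips away.
Lemmatization
- Lemma
  - The dictionary (base) form of a word, extracted from how it is used in the sentence.
  - ex. is, are -> be
  - Unlike stemming, lemmatization preserves part-of-speech information.
    - ex. having -> have
- Words whose meaning changes with their part of speech should be lemmatized.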
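To make the difference concrete, here is a toy sketch in plain Python. It is not a real stemmer or lemmatizer: the suffix list and the lemma lookup table are hypothetical and tiny, and they are chosen only to reproduce the examples from the notes.

```python
# Hypothetical, far-from-complete resources used only for illustration.
SUFFIXES = ("ing", "ed", "s")
LEMMAS = {"is": "be", "are": "be", "having": "have", "has": "have"}

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in a dictionary of base forms; fall back to the word."""
    return LEMMAS.get(word, word)

for w in ["playing", "having", "are"]:
    print(w, "->", toy_stem(w), "|", toy_lemma(w))
# playing -> play | playing
# having -> hav | have
# are -> are | be
```

The rule-based stemmer is cheap but produces "hav", while the lookup-based lemmatizer returns a real word at the cost of needing a dictionary.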

Stopwords

Removing word tokens that carry little meaning
- Remove words that appear often but contribute little to the analysis.
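Stopword removal is essentially a set-membership filter. The sketch below assumes a small hand-written stopword set (`STOPWORDS` is hypothetical); libraries such as NLTK ship much longer lists.

```python
# Hypothetical stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and"}

def remove_stopwords(tokens):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("Life is full of ups and downs".split()))
# ['Life', 'full', 'ups', 'downs']
```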

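Tokenization

Tokenization
- Find the units in a sentence to which meaning can be assigned.
- Performed with a morphological analyzer.
  - Choose an analyzer that suits the data being used.
Token
- An atomic piece of text
- Character Level Representation
  - The number of distinct tokens stays the same even when new words appear.
  - The meaning of the word is lost.
  - ex. book -> b, o, o, k
- Word Level Representation
  - The number of distinct tokens grows whenever a new word appears.
  - Each word (token) keeps its meaning.
  - ex. book -> book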
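The trade-off between the two representations shows up when counting distinct tokens. A minimal sketch, with a made-up word list and a hypothetical helper `vocab_sizes`:

```python
def vocab_sizes(words):
    """Return (number of distinct characters, number of distinct words)."""
    char_vocab = {ch for w in words for ch in w}
    word_vocab = set(words)
    return len(char_vocab), len(word_vocab)

words = ["book", "cook", "look"]
print(vocab_sizes(words))             # (5, 3)
# "lock" is a new word, but all of its characters are already known.
print(vocab_sizes(words + ["lock"]))  # (5, 4)
```

The character-level vocabulary stays at five symbols, while the word-level vocabulary grows with every unseen word.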

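Morphological Analysis
- Morpheme
  - The smallest unit of language that carries meaning
- Classified by meaning / function
  - Lexical morpheme (실질형태소)
    - A morpheme with lexical meaning
    - A morpheme that refers to an object, state, or action
    - Nouns, verbs, adjectives, adverbs
  - Grammatical morpheme (형식형태소)
    - A morpheme with grammatical meaning
    - A morpheme that expresses a relationship between words
    - Particles (조사), endings (어미)
- Classified by dependency
  - Free morpheme (자립형태소)
    - A morpheme that can form a word unit (어절) on its own, without other morphemes
    - Nouns, pronouns, numerals, determiners, adverbs, interjections, etc.
  - Bound morpheme (의존형태소)
    - A morpheme that must combine with another morpheme to form a word unit (어절)
    - Particles, endings, stems of predicates (verbs, adjectives)
Vocabulary
- A set of deduplicated tokens, each mapped to an index
- Text can be converted to numbers through the vocabulary.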

Embedding / Sorting

Embedding
- Represent text as numbers.
Sorting
- Assign smaller numbers to frequently used or important tokens.
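The vocabulary and sorting ideas can be combined into one small sketch: count the tokens, give the most frequent token the smallest index, and map each sentence to integers. The helper names and the choice to reserve index 0 for unknown words are assumptions made for this example.

```python
from collections import Counter

def build_vocab(tokenized_sentences):
    """Map each token to an index; more frequent tokens get smaller indices."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # most_common() sorts by count (descending); ties keep first-seen order.
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common(), start=1)}

def encode(tokens, vocab, unk=0):
    """Replace tokens with their indices; unknown tokens map to `unk`."""
    return [vocab.get(t, unk) for t in tokens]

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = build_vocab(sentences)
print(vocab)   # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'ran': 5}
print(encode(["the", "bird", "ran"], vocab))  # [1, 0, 5]
```

"bird" is not in the vocabulary, so it can only be represented by the catch-all index 0.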
Encoding
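- Only words contained in the vocabulary can be processed.
- Integer Encoding
  - Converts text into numbers for natural language processing.
  - Lets the computer read and process the text.
  - BOW(Bag of Words)
    - Ignores context and word order, and extracts feature values by assigning a frequency to every word in the document.
    - Count Encoding
      - Builds word tokens from the document collection and counts each word to form the BOW vector.
    - TF-IDF Encoding
      - Instead of using raw counts, it down-weights words that appear in most documents, since they do little to distinguish one document from another.
      - $\text{tf-idf}(d, t) = tf(d, t) \times idf(t)$

        $tf(d, t)$: frequency of term $t$ in document $d$

        $idf(t)$: inversely related to the number of documents that contain term $t$
      - $idf(t) = \log \frac{n}{df(t)}$

        $n$: total number of documents

        $df(t)$: number of documents that contain term $t$
  - It is hard to capture relationships between words.
- One-Hot Encoding
  - Only the index of the word being represented is 1; every other index is 0.
  - Sparse Representation
- Word2Vec Encoding
  - Distributional Hypothesis
    - Words that appear in similar contexts have similar meanings.
    - Words that appear in similar contexts have high similarity to each other.
  - Polysemy problem
    - A word can have several meanings, but each word gets only one vector.
  - The model must be retrained when new words are added.
- Word Embedding with Neural Network
  - Represents words as real-valued vectors.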
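As a small illustration of why one-hot vectors are called sparse and why they carry no similarity information, consider the sketch below. The three-word vocabulary is made up for the example.

```python
def one_hot(index, size):
    """Return a vector of zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

vocab = {"book": 0, "cook": 1, "look": 2}
book = one_hot(vocab["book"], len(vocab))
cook = one_hot(vocab["cook"], len(vocab))

# The dot product of two different one-hot vectors is always 0,
# so the encoding says nothing about how related two words are.
overlap = sum(a * b for a, b in zip(book, cook))
print(book, cook, overlap)  # [1, 0, 0] [0, 1, 0] 0
```

With a realistic vocabulary each vector would have tens of thousands of entries and still only a single 1, which is the motivation for dense embeddings such as Word2Vec.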

Padding

To allow parallel (batched) computation, sentences of different lengths are forced to the same length.
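A minimal sketch of padding, assuming the sentences are already integer-encoded, that 0 is reserved as the padding value, and that padding is added at the end (all three are assumptions; frameworks offer options for each). The helper name `pad_to_same_length` is hypothetical.

```python
def pad_to_same_length(seqs, pad_value=0, max_len=None):
    """Right-pad (or truncate) every sequence to max_len items."""
    if max_len is None:
        max_len = max(len(s) for s in seqs)
    return [list(s[:max_len]) + [pad_value] * (max_len - len(s)) for s in seqs]

batch = [[1, 2, 3], [4], [5, 6]]
print(pad_to_same_length(batch))             # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
print(pad_to_same_length(batch, max_len=2))  # [[1, 2], [4, 0], [5, 6]]
```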

1-3. Similarity

Euclidean Similarity

Based on the Euclidean (L2) distance between two vectors; the smaller the distance, the more similar the vectors.

$L_2 = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum^{n}_{i=1} (p_i - q_i)^2}$

Cosine Similarity

Similarity between two vectors, measured by the cosine of the angle between them.
$similarity = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum^{n}_{i=1} A_i \times B_i}{\sqrt{\sum^n_{i=1} (A_i)^2} \times \sqrt{\sum^n_{i=1} (B_i)^2}}$
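Both formulas translate directly into plain Python. The sketch below uses only the standard library; the function names are chosen for this example.

```python
import math

def euclidean_distance(p, q):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(euclidean_distance([1, 2], [4, 6]))  # 5.0
print(cosine_similarity([1, 0], [0, 1]))   # 0.0 (orthogonal vectors)
print(cosine_similarity([1, 2], [2, 4]))   # close to 1.0 (same direction)
```

The last pair shows the main difference: [1, 2] and [2, 4] are far apart by Euclidean distance but point in the same direction, so their cosine similarity is maximal.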

2. Integer Encoding

2-1. DictVectorizer

from sklearn.feature_extraction import DictVectorizer

dict_vec = DictVectorizer(sparse=False)

"""
Data
- Life is full of ups and downs.
- Life is not all beer and skittles.
"""
data = [
    {'Life': 1, 'is': 1, 'full': 1, 'of': 1, 'ups': 1, 'and': 1, 'downs': 1},
    {'Life': 1, 'is': 1, 'not': 1, 'all': 1, 'beer': 1, 'and': 1, 'skittles': 1}
]

features_encoding = dict_vec.fit_transform(data)

features_encoding
# array([[1., 0., 1., 0., 1., 1., 1., 0., 1., 0., 1.],
#        [1., 1., 1., 1., 0., 0., 1., 1., 0., 1., 0.]])

dict_vec.feature_names_
# ['Life',
#  'all',
#  'and',
#  'beer',
#  'downs',
#  'full',
#  'is',
#  'not',
#  'of',
#  'skittles',
#  'ups']
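The input dictionaries above were written by hand. As a rough sketch, they could also be built from the raw sentences with `collections.Counter`, and the same matrix can be reproduced without scikit-learn by sorting the feature names. The whitespace split and period stripping are simplifying assumptions.

```python
from collections import Counter

sentences = [
    "Life is full of ups and downs.",
    "Life is not all beer and skittles.",
]

def to_count_dict(sentence):
    """Split on whitespace after dropping the final period, then count words."""
    return dict(Counter(sentence.rstrip(".").split()))

data = [to_count_dict(s) for s in sentences]

# Sorted feature names: uppercase 'Life' sorts before the lowercase words.
feature_names = sorted({word for d in data for word in d})
matrix = [[float(d.get(name, 0)) for name in feature_names] for d in data]

print(feature_names)
print(matrix[0])  # [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```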

2-2. CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

"""
Data
- This is the first document.
- This is the second second documents.
- And the third one.
- Is this the first document?
- The last document?
"""

corpus = [
    'This is the first document.',
    'This is the second second documents.',
    'And the third one.',
    'Is this the first document?',
    'The last document?'
]

count_vec = CountVectorizer()
corpus_embedding = count_vec.fit_transform(corpus)
corpus_embedding.toarray()
# array([[0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
#        [0, 0, 1, 0, 1, 0, 0, 2, 1, 0, 1],
#        [1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
#        [0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
#        [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]])

count_vec.vocabulary_
# {'this': 10,
#  'is': 4,
#  'the': 8,
#  'first': 3,
#  'document': 1,
#  'second': 7,
#  'documents': 2,
#  'and': 0,
#  'third': 9,
#  'one': 6,
#  'last': 5}

count_vec = CountVectorizer(stop_words=['and', 'is', 'the'])
count_vec.fit(corpus)
count_vec.vocabulary_
# {'this': 7,
#  'first': 2,
#  'document': 0,
#  'second': 5,
#  'documents': 1,
#  'third': 6,
#  'one': 4,
#  'last': 3}

count_vec = CountVectorizer(ngram_range=(2, 3))
count_vec.fit(corpus)
count_vec.vocabulary_
# {'this is': 21,
#  'is the': 3,
#  'the first': 12,
#  'first document': 2,
#  'this is the': 22,
#  'is the first': 4,
#  'the first document': 13,
#  'the second': 16,
#  'second second': 10,
#  'second documents': 9,
#  'is the second': 5,
#  'the second second': 17,
#  'second second documents': 11,
#  'and the': 0,
#  'the third': 18,
#  'third one': 20,
#  'and the third': 1,
#  'the third one': 19,
#  'is this': 6,
#  'this the': 23,
#  'is this the': 7,
#  'this the first': 24,
#  'the last': 14,
#  'last document': 8,
#  'the last document': 15}
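To see where the 25 entries come from, here is a rough pure-Python imitation of what `ngram_range=(2, 3)` does. It assumes CountVectorizer's default behaviour of lowercasing and keeping tokens of two or more word characters; `word_ngrams` is a hypothetical helper, not part of scikit-learn.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def word_ngrams(text, ngram_range=(2, 3)):
    """Return all word n-grams of the text, shorter n-grams first."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

corpus = [
    "This is the first document.",
    "This is the second second documents.",
    "And the third one.",
    "Is this the first document?",
    "The last document?",
]

print(word_ngrams(corpus[0]))
# ['this is', 'is the', 'the first', 'first document',
#  'this is the', 'is the first', 'the first document']

vocabulary = sorted({g for doc in corpus for g in word_ngrams(doc)})
print(len(vocabulary))  # 25
```

Sorting the unique n-grams alphabetically gives the index order seen in `vocabulary_`, for example 'and the' at 0 and 'this the first' at 24.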

2-3. TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer().fit(corpus)
tfidf.transform(corpus).toarray()
# array([[0.        , 0.44912566, 0.        , 0.54105637, 0.44912566,
#         0.        , 0.        , 0.        , 0.31955661, 0.        ,
#         0.44912566],
#        [0.        , 0.        , 0.40409121, 0.        , 0.27062459,
#         0.        , 0.        , 0.80818242, 0.19255163, 0.        ,
#         0.27062459],
#        [0.55666851, 0.        , 0.        , 0.        , 0.        ,
#         0.        , 0.55666851, 0.        , 0.26525553, 0.55666851,
#         0.        ],
#        [0.        , 0.44912566, 0.        , 0.54105637, 0.44912566,
#         0.        , 0.        , 0.        , 0.31955661, 0.        ,
#         0.44912566],
#        [0.        , 0.51737618, 0.        , 0.        , 0.        ,
#         0.77253573, 0.        , 0.        , 0.36811741, 0.        ,
#         0.        ]])

import pandas as pd

pd.DataFrame(
    tfidf.transform(corpus).toarray(),
    columns = tfidf.get_feature_names_out()
)
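The numbers above do not follow the plain $\log \frac{n}{df(t)}$ formula from the notes. With its default settings, TfidfVectorizer uses a smoothed idf, $\ln \frac{1 + n}{1 + df(t)} + 1$, and then scales each row to unit L2 length. The sketch below recomputes the third row ("And the third one.") by hand under those assumptions; the helper names are made up for this example.

```python
import math
import re

corpus = [
    "This is the first document.",
    "This is the second second documents.",
    "And the third one.",
    "Is this the first document?",
    "The last document?",
]

docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
n_docs = len(docs)

def smoothed_idf(term):
    """ln((1 + n) / (1 + df)) + 1, the smoothed idf variant."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times idf for each term, then L2-normalise the row."""
    weights = {t: doc.count(t) * smoothed_idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(docs[2])
print(round(row["and"], 4), round(row["the"], 4))  # 0.5567 0.2653
```

"the" appears in all five documents, so it gets the lowest weight in the row, while "and", "third", and "one" each appear in one document and share the highest weight.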

2-4. WordCloud

!pip install wordcloud
from wordcloud import WordCloud

vec = CountVectorizer()
tdm = vec.fit_transform(corpus)
tdm.toarray()
# array([[0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
#        [0, 0, 1, 0, 1, 0, 0, 2, 1, 0, 1],
#        [1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
#        [0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
#        [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]])

df_tdm = pd.DataFrame(
    tdm.toarray(),
    columns = vec.get_feature_names_out()
)
df_tdm.sum().to_dict()
# {'and': 1,
#  'document': 3,
#  'documents': 1,
#  'first': 2,
#  'is': 3,
#  'last': 1,
#  'one': 1,
#  'second': 2,
#  'the': 5,
#  'third': 1,
#  'this': 3}

wc = WordCloud(background_color='white', width=500, height=500)
cloud = wc.generate_from_frequencies(df_tdm.sum().to_dict())
cloud.to_image()

3. Word2Vec

3-1. Word2Vec

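Word2Vec Training Process

Build the vocabulary through tokenizing
Sliding Window
- Pair a center word with each of the words that appear before and after it to form input/output pairs.
- The window size controls how far from the center word a word can be and still be learned as related.
Capture semantic similarity with matrix operations
Softmax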
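The last two steps can be illustrated with a toy calculation: score every candidate word by the dot product between the center word's vector and the candidate's output vector, then turn the scores into probabilities with softmax. The two-dimensional vectors below are invented for the example and are not learned values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Exponentiate (shifted by the max for stability) and normalise to sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vectors, for illustration only.
center = [1.0, 0.0]
output_vectors = {"great": [2.0, 0.0], "fine": [1.0, 1.0], "bad": [-1.0, 0.5]}

scores = [dot(center, vec) for vec in output_vectors.values()]  # [2.0, 1.0, -1.0]
probs = dict(zip(output_vectors, softmax(scores)))
best = max(probs, key=probs.get)
print(best)  # great
```

Training adjusts the vectors so that words which really occur near the center word receive high probability.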

Word2Vec Training Methods

CBOW
- Predicts a word from the surrounding words within a given context.
Skip-gram
- Predicts the surrounding words within a given context from a word.
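The difference between the two methods is easiest to see in the training pairs that the sliding window produces. A minimal sketch, with hypothetical helper names and a window size of 1:

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the center."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: the center word predicts each neighbour."""
    return [(c, ctx) for i, c in enumerate(tokens) for ctx in context_of(tokens, i, window)]

def cbow_pairs(tokens, window=1):
    """(context list, center) pairs: the neighbours jointly predict the center."""
    return [(context_of(tokens, i, window), c) for i, c in enumerate(tokens)]

tokens = ["the", "cat", "sat"]
print(skipgram_pairs(tokens))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_pairs(tokens))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
```

In gensim the choice between the two is made with the `sg` parameter (0 for CBOW, 1 for Skip-gram), and the window size with `window`.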

3-2. Training a Word2Vec Model

Load Data

from google.colab import drive
drive.mount('/content/data')

import numpy as np
import pandas as pd

data_path = ''
df = pd.read_csv(data_path + '/IMDB Dataset.csv')
df.shape
# (50000, 2)

Text Preprocessing

import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')

from tqdm import tqdm

tokenized_data = []
for review in tqdm(df['review']):
    review = review.lower()
    review = review.replace('<br />', '')
    tokenized_review = nltk.word_tokenize(review)
    tokenized_data.append(tokenized_review)

len(tokenized_data)
# 50000

import matplotlib.pyplot as plt

plt.hist(
    [len(review) for review in tokenized_data],
    bins=50
)
plt.xlabel('length of reviews')
plt.ylabel('number of reviews')
plt.show()

from gensim.models import Word2Vec

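model = Word2Vec(
    sentences=tokenized_data,
    vector_size=300,
    window=5,
    min_count=5,
    workers=4,  # must be positive; gensim does not treat -1 as "use all cores"
    sg=0        # 0 = CBOW, 1 = Skip-gram
)
model.wv.vectors.shape
# (43746, 300)

# Note: the output below was recorded with workers=-1. gensim starts no worker
# threads for a non-positive value, so the vectors were likely left at their
# random initialization; the small corpus (50,000 reviews) is probably not the
# main cause of these near-random neighbours.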
model.wv.most_similar('wonderful')
# [('lilliput', 0.22179092466831207),
#  ('intellectually', 0.21859486401081085),
#  ('hordes', 0.2111765593290329),
#  ('gypsies', 0.20655690133571625),
#  ('goldeneye', 0.2050001621246338),
#  ('injured', 0.19880861043930054),
#  ('fortier', 0.19827546179294586),
#  ('ingmar', 0.19663165509700775),
#  ('empress', 0.19342432916164398),
#  ('.sadly', 0.1930973082780838)]

import gensim
from gensim.test.utils import datapath

DATA_PATH = ''

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(
    DATA_PATH + 'GoogleNews-vectors-negative300.bin.gz', binary=True
)
word2vec_model.vectors.shape
# (3000000, 300)

word2vec_model.most_similar('wonderful')
# [('marvelous', 0.8188856244087219),
#  ('fantastic', 0.8047919869422913),
#  ('great', 0.7647867798805237),
#  ('fabulous', 0.7614760994911194),
#  ('terrific', 0.7420831918716431),
#  ('lovely', 0.7320096492767334),
#  ('amazing', 0.7263179421424866),
#  ('beautiful', 0.6854085922241211),
#  ('magnificent', 0.6633867025375366),
#  ('delightful', 0.6574996709823608)]


이네의 개발 노트
