A Practical Implementation and Analysis of an Attention-Based Translation Model

qordnswnd123 2025. 6. 1. 20:23

1. Preparing the Training Data

import torch
import torch.nn as nn
import torch.nn.functional as F
import random
import numpy as np

vocab_en = {
    "<pad>": 0, "<sos>": 1, "<eos>": 2,
    "i": 3, "am": 4, "he": 5, "she": 6, "is": 7,
    "you": 8, "are": 9, "we": 10, "they": 11,
    "book": 12, "books": 13, "reading": 14, "writing": 15, "eating": 16,
    "a": 17, "the": 18, "in": 19, "on": 20, "at": 21,
    "home": 22, "school": 23, "library": 24
}

# 한국어 어휘 사전
vocab_ko = {
    "<pad>": 0, "<sos>": 1, "<eos>": 2,
    "나는": 3, "너는": 4, "그는": 5, "그녀는": 6, "우리는": 7, "그들은": 8,
    "책을": 9, "읽고": 10, "쓰고": 11, "먹고": 12,
    "있다": 13, "있습니다": 14,
    "집에서": 15, "학교에서": 16, "도서관에서": 17
}

# 학습용 예제 문장 쌍 : 스테이지 4에서 사용한 문장쌍과 새로 추가한 문장쌍

train_pairs = [
    ("she is reading a book", "그녀는 책을 읽고 있다"),
    ("i am reading a book", "나는 책을 읽고 있다"),
    ("they are reading books", "그들은 책을 읽고 있다"),
    ("he is writing at school", "그는 학교에서 쓰고 있다"),
    ("we are eating at home", "우리는 집에서 먹고 있다"),
    
    # newly added sentence pairs (6-7 English words, 5 Korean words)
    ("i am reading books at school", "나는 학교에서 책을 읽고 있습니다"),
    ("he is writing books at library", "그는 도서관에서 책을 쓰고 있습니다"),
    ("they are reading books at home", "그들은 집에서 책을 읽고 있습니다"),
    ("she is writing a book at library", "그녀는 도서관에서 책을 쓰고 있습니다"),
    ("we are reading the books at school", "우리는 학교에서 책을 읽고 있습니다")
]

print("=== 어휘 사전 및 학습 데이터 준비 완료 ===")
print(f"영어 어휘 크기: {len(vocab_en)}")
print(f"한국어 어휘 크기: {len(vocab_ko)}")

=== 어휘 사전 및 학습 데이터 준비 완료 ===
영어 어휘 크기: 25
한국어 어휘 크기: 18

2. Dataset and Tokenization

This step tokenizes the training data and builds a dataset with PyTorch's Dataset and DataLoader.
Each sentence is converted to a sequence of indices, and the TranslationDataset class manages the resulting pairs.

from torch.utils.data import Dataset, DataLoader

# 랜덤 시드 고정
SEED = 42
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # multi-GPU 사용 시
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(SEED)
random.seed(SEED)

def sentence_to_indices(sentence: str, vocab: dict) -> torch.Tensor:
    """문장을 인덱스 시퀀스로 변환 (시작과 종료 토큰 포함)"""
    words = sentence.lower().split()
    indices = [vocab["<sos>"]] + [vocab[word] for word in words] + [vocab["<eos>"]] 
    return torch.tensor(indices).unsqueeze(0)


class TranslationDataset(Dataset):
    def __init__(self, pairs, vocab_src, vocab_tgt):
        super(TranslationDataset, self).__init__()
        self.pairs = pairs
        self.vocab_src = vocab_src
        self.vocab_tgt = vocab_tgt

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src_sentence, tgt_sentence = self.pairs[idx]
        # convert the source (English) and target (Korean) sentences with their own vocabularies
        src_indices = sentence_to_indices(src_sentence, self.vocab_src)
        tgt_indices = sentence_to_indices(tgt_sentence, self.vocab_tgt)
        return src_indices.squeeze(0), tgt_indices.squeeze(0)  # squeeze로 차원 맞춤
    
# 데이터셋 및 데이터로더 생성
dataset = TranslationDataset(train_pairs, vocab_en, vocab_ko)

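TranslationDataset subclasses PyTorch's Dataset.
- __init__ stores the sentence pairs and the source and target vocabularies.
- __len__ returns the number of sentence pairs.
- __getitem__ converts the source and target sentences at the given index into index sequences and returns them.
sentence_to_indices maps each word to its index, wraps the sentence in <sos> and <eos>, and returns a tensor of shape [1, seq_len]; squeeze(0) then drops the leading dimension.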
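
sentence_to_indices looks every word up with vocab[word], so a word that is missing from the dictionary raises a KeyError. The sketch below shows one common way around that. It assumes a hypothetical <unk> entry, which the vocabularies in this post do not contain; the names encode and toy_vocab are illustrative only.

```python
from typing import Dict, List


def encode(sentence: str, vocab: Dict[str, int], unk_token: str = "<unk>") -> List[int]:
    """Map a sentence to indices, wrapped in <sos>/<eos>.

    Unknown words fall back to unk_token. This is an assumption: the
    vocabularies in this post have no such entry, so the toy vocabulary
    below adds one.
    """
    unk_idx = vocab[unk_token]
    body = [vocab.get(word, unk_idx) for word in sentence.lower().split()]
    return [vocab["<sos>"]] + body + [vocab["<eos>"]]


# Toy vocabulary (hypothetical, not the vocab_en used in the post).
toy_vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3, "she": 4, "is": 5, "reading": 6}

print(encode("She is reading", toy_vocab))    # [1, 4, 5, 6, 2]
print(encode("she is sleeping", toy_vocab))   # [1, 4, 5, 3, 2]
```

With only ten training pairs an <unk> embedding would barely be trained, so this is mainly a guard against crashes at inference time.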

❗ Check the structure of the dataset and of the first sentence pair.
This verifies that TranslationDataset is implemented correctly
by printing the source and target indices of the first pair.

# 데이터셋 정보 출력
print("=== 데이터셋 정보 ===")
print(f"데이터셋 크기: {len(dataset)} 문장쌍")

# 첫 번째 데이터 샘플 확인
src_indices, tgt_indices = dataset[0]
print("\n첫 번째 문장쌍 정보:")
print(f"첫 번째 문장쌍 타입: {type(dataset[0])}")
print(f"원문 인덱스: {src_indices}")
print(f"번역문 인덱스: {tgt_indices}")

=== 데이터셋 정보 ===
데이터셋 크기: 10 문장쌍

첫 번째 문장쌍 정보:
첫 번째 문장쌍 타입: <class 'tuple'>
원문 인덱스: tensor([ 1,  6,  7, 14, 17, 12,  2])
번역문 인덱스: tensor([ 1,  6,  9, 10, 13,  2])

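3. Batching and Padding

This step adds padding so that batches can be processed efficiently.
Sentences of different lengths are brought to a common length with the <pad> token,
and batching is handled by passing a collate_fn to the DataLoader.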

def collate_fn(batch):
    src_batch, tgt_batch = zip(*batch)
    src_lengths = [len(s) for s in src_batch]
    tgt_lengths = [len(t) for t in tgt_batch]

    src_padded = nn.utils.rnn.pad_sequence(src_batch, padding_value=vocab_en["<pad>"], batch_first=True)
    tgt_padded = nn.utils.rnn.pad_sequence(tgt_batch, padding_value=vocab_ko["<pad>"], batch_first=True)

    return src_padded, tgt_padded, src_lengths, tgt_lengths


dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

for src_batch, tgt_batch, src_lengths, tgt_lengths in dataloader:
    print("src_batch:", src_batch)
    print("src_batch shape:", src_batch.shape)
    print("tgt_batch:", tgt_batch)
    print("tgt_batch shape:", tgt_batch.shape)

    break  # 첫 번째 배치만 확인

nn.utils.rnn.pad_sequence pads the source and target sequences.
- padding_value is the <pad> index of the corresponding vocabulary.
- batch_first=True returns tensors of shape [batch_size, seq_len].
The DataLoader receives this function through its collate_fn parameter.
collate_fn also returns the unpadded source and target lengths (src_lengths, tgt_lengths) for each batch.

src_batch: tensor([[ 1,  6,  7, 15, 17, 12, 21, 24,  2],
        [ 1,  6,  7, 14, 17, 12,  2,  0,  0]])
src_batch shape: torch.Size([2, 9])
tgt_batch: tensor([[ 1,  6, 17,  9, 11, 14,  2],
        [ 1,  6,  9, 10, 13,  2,  0]])
tgt_batch shape: torch.Size([2, 7])
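
The padding behavior seen above can be reproduced on a toy batch. This is a minimal sketch with made-up token ids, not the post's data; it only shows what pad_sequence does with batch_first=True and why the original lengths are kept alongside the padded tensor.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two index sequences of different lengths (token ids are arbitrary).
seqs = [torch.tensor([1, 6, 7, 2]), torch.tensor([1, 5, 2])]
lengths = [len(s) for s in seqs]

# batch_first=True gives [batch_size, max_len]; shorter rows are padded on the right.
padded = pad_sequence(seqs, batch_first=True, padding_value=0)

print(padded.tolist())   # [[1, 6, 7, 2], [1, 5, 2, 0]]
print(lengths)           # [4, 3]
```

The lengths list is what later lets pack_padded_sequence skip the padded steps in the encoder.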

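❗ Check how padding is applied when the data is processed in batches:

How sentences of different lengths within a batch are brought to the same length
Where in the sentence the padding token (<pad>) is added
How the actual sentence length differs from the padded length

Focus on these three points to understand how padding behaves.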

# 데이터로더에서 배치 추출하여 확인
for src_batch, tgt_batch, src_lengths, tgt_lengths in dataloader:

    # 패딩된 배치 데이터 상세 확인
    print("\n=== 패딩 구현 확인 ===")
    
    # 인덱스를 단어로 변환하기 위한 사전 생성
    idx2word_en = {v: k for k, v in vocab_en.items()}
    idx2word_ko = {v: k for k, v in vocab_ko.items()}
    
    # 배치의 각 문장 확인
    for i in range(len(src_lengths)):
        
        print(f"\n[문장 쌍 {i+1}]")
        
        # 소스 문장 출력 (패딩 토큰 표시)
        src_tokens = [idx2word_en[idx.item()] for idx in src_batch[i]]
        print("소스 문장:", ' '.join([f"({token})" if token == "<pad>" else token for token in src_tokens]))
        print(f"실제 길이: {src_lengths[i]}, 패딩 포함 길이: {len(src_tokens)}")
        
        # 타겟 문장 출력 (패딩 토큰 표시)
        tgt_tokens = [idx2word_ko[idx.item()] for idx in tgt_batch[i]]
        print("타겟 문장:", ' '.join([f"({token})" if token == "<pad>" else token for token in tgt_tokens]))
        print(f"실제 길이: {tgt_lengths[i]}, 패딩 포함 길이: {len(tgt_tokens)}")
        
    break  # 첫 번째 배치만 확인



=== 패딩 구현 확인 ===

[문장 쌍 1]
소스 문장: <sos> we are eating at home <eos> (<pad>)
실제 길이: 7, 패딩 포함 길이: 8
타겟 문장: <sos> 우리는 집에서 먹고 있다 <eos> (<pad>)
실제 길이: 6, 패딩 포함 길이: 7

[문장 쌍 2]
소스 문장: <sos> he is writing books at library <eos>
실제 길이: 8, 패딩 포함 길이: 8
타겟 문장: <sos> 그는 도서관에서 책을 쓰고 있습니다 <eos>
실제 길이: 7, 패딩 포함 길이: 7

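4. Embedding Layers

Setting the embedding dimension for the encoder and decoder
Handling the padding token with padding_idx
Building embedding layers sized to the English and Korean vocabularies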

# 임베딩 차원 및 히든 차원 설정
embedding_dim = 16
hidden_dim = 16

# 인코더 임베딩 레이어
encoder_embedding = nn.Embedding(len(vocab_en), embedding_dim, padding_idx=vocab_en["<pad>"])
# 디코더 임베딩 레이어
decoder_embedding = nn.Embedding(len(vocab_ko), embedding_dim, padding_idx=vocab_ko["<pad>"])

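❗ Check the embedding of each token:

The embedding vector of the padding token (<pad>) has norm 0
The embedding vectors of ordinary tokens ('the', 'book') have non-zero norm
The same token (here <pad>) always maps to the same embedding

This confirms that the padding_idx parameter gives the padding token special treatment.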

# 테스트용 입력 생성
test_tokens = torch.tensor([
    vocab_en["<pad>"],  # 패딩 토큰 (인덱스 0)
    vocab_en["the"],    # 일반 토큰
    vocab_en["book"],   # 일반 토큰
    vocab_en["<pad>"]   # 패딩 토큰 (인덱스 0)
])

# 임베딩 수행
embedded = encoder_embedding(test_tokens)

print("=== 토큰별 임베딩 비교 ===")
print(f"입력 토큰:", test_tokens)

print("\n[임베딩 결과]")
for i, token_idx in enumerate(test_tokens):
    token_name = [k for k, v in vocab_en.items() if v == token_idx.item()][0]
    emb_norm = torch.norm(embedded[i]).item()
    is_zero = torch.all(embedded[i] == 0).item()
    print(f"\n토큰: {token_name} (인덱스: {token_idx})")
    print(f"임베딩 벡터의 크기: {emb_norm:.4f}")

=== 토큰별 임베딩 비교 ===
입력 토큰: tensor([ 0, 18, 12,  0])

[임베딩 결과]

토큰: <pad> (인덱스: 0)
임베딩 벡터의 크기: 0.0000

토큰: the (인덱스: 18)
임베딩 벡터의 크기: 2.6680

토큰: book (인덱스: 12)
임베딩 벡터의 크기: 3.2518

토큰: <pad> (인덱스: 0)
임베딩 벡터의 크기: 0.0000

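※ What padding_idx does
With padding_idx set, the embedding vector at that index is initialized to zeros.
Its gradient is always zero, so the vector is not updated during training.
Benefit: the model does not learn an embedding for the padding token, so padded positions enter the network as zero vectors.
Note that this does not save computation: padded positions are still processed like any other. Skipping them is the job of pack_padded_sequence in the encoder and of the attention mask. Also, re-initializing the embedding weights later (for example with nn.init.xavier_uniform_) overwrites the zero row.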
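
The three properties of padding_idx can be checked directly. The sketch below uses a small throwaway embedding (sizes chosen arbitrarily) and stores each check in a boolean. The last part illustrates that re-initializing the whole weight matrix replaces the zero row, so it has to be zeroed again by hand if that behavior is wanted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(5, 4, padding_idx=0)

# 1) The padding row starts at zero.
row_is_zero_at_init = bool(torch.all(emb.weight[0] == 0))

# 2) The padding row receives a zero gradient, so the optimizer leaves it alone.
emb(torch.tensor([0, 1, 2, 0])).sum().backward()
pad_grad_is_zero = bool(torch.all(emb.weight.grad[0] == 0))
other_grad_is_nonzero = bool(torch.any(emb.weight.grad[1] != 0))

# 3) Re-initializing the full weight matrix overwrites the padding row ...
nn.init.xavier_uniform_(emb.weight)
row_nonzero_after_reinit = bool(torch.any(emb.weight[0] != 0))

# ... so it must be reset explicitly.
with torch.no_grad():
    emb.weight[emb.padding_idx].zero_()
row_is_zero_again = bool(torch.all(emb.weight[0] == 0))

print(row_is_zero_at_init, pad_grad_is_zero, other_grad_is_nonzero)
print(row_nonzero_after_reinit, row_is_zero_again)
```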

5. Encoder Class

This step implements the encoder.
nn.Embedding embeds the input sequence and nn.LSTM produces the encoder outputs.
pack_padded_sequence and pad_packed_sequence are used to handle variable-length sequences.

class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, padding_idx):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, src, src_lengths):
        # 패딩 토큰을 무시하기 위한 pack_padded_sequence 사용
        embedded = self.embedding(src)
        packed = nn.utils.rnn.pack_padded_sequence(embedded, src_lengths, batch_first=True, enforce_sorted=False)
        outputs, (hidden, cell) = self.lstm(packed)
        # padding_value fills the padded output vectors (floats); it is not a token index
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True, padding_value=0.0)
        return outputs, hidden, cell

Setting padding_idx in nn.Embedding keeps the <pad> embedding from being trained.
pack_padded_sequence converts the padded batch into a packed variable-length representation before it is passed to the LSTM.
- batch_first=True means the first dimension of the input tensor is the batch size.
- enforce_sorted=False allows sequences that are not sorted by length.
pad_packed_sequence converts the LSTM output back into a padded tensor.
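
The effect of packing can be seen on a toy LSTM. In this sketch (arbitrary sizes, random inputs) the second sequence has two real steps and two padded ones. With packing, the padded outputs come back as zeros and the final hidden state is taken from the last real step, not from the last padded step.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

# Batch of two sequences; the second has only 2 real steps out of 4.
x = torch.randn(2, 4, 4)
lengths = [4, 2]

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, (hidden, cell) = lstm(packed)
outputs, out_lengths = pad_packed_sequence(packed_out, batch_first=True, padding_value=0.0)

# Padded time steps of the short sequence are filled with padding_value.
padded_steps_are_zero = bool(torch.all(outputs[1, 2:] == 0))

# The final hidden state of the short sequence equals its output at the last real step.
last_real = outputs[1, lengths[1] - 1]
hidden_matches_last_real_step = torch.allclose(hidden[-1, 1], last_real)

print(outputs.shape, hidden.shape)
print(padded_steps_are_zero, hidden_matches_last_real_step)
```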

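❗ Check the encoder's behavior:

How the shape changes as the input sequence is embedded and passed through the LSTM
How pack_padded_sequence handles the padding and pad_packed_sequence restores it
Whether the outputs at padded positions are zero in the final output
Whether the hidden state and cell state have the expected shape [1, batch_size, hidden_dim]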

# 테스트 코드
encoder = Encoder(len(vocab_en), embedding_dim, hidden_dim, padding_idx=vocab_en["<pad>"])

# 데이터로더에서 첫 번째 배치 가져오기
for src_batch, tgt_batch, src_lengths, tgt_lengths in dataloader:
    print("=== 인코더 동작 확인 ===")
    print("\n[입력 정보]")
    print(f"입력 배치 shape: {src_batch.shape}")
    print(f"문장 길이: {src_lengths}")
    
    # 인코더 실행
    outputs, hidden, cell = encoder(src_batch, src_lengths)
    
    print("\n[출력 정보]")
    print(f"인코더 출력 shape: {outputs.shape}")
    print(f"히든 스테이트 shape: {hidden.shape}")
    print(f"셀 스테이트 shape: {cell.shape}")
    
    # 패딩이 잘 처리되었는지 확인
    pad_positions = (src_batch == vocab_en["<pad>"])
    if pad_positions.any():
        print("\n[패딩 처리 확인]")
        print("패딩 위치의 출력값 평균:", outputs[pad_positions].abs().mean().item())
    break

=== 인코더 동작 확인 ===

[입력 정보]
입력 배치 shape: torch.Size([2, 8])
문장 길이: [8, 7]

[출력 정보]
인코더 출력 shape: torch.Size([2, 8, 16])
히든 스테이트 shape: torch.Size([1, 2, 16])
셀 스테이트 shape: torch.Size([1, 2, 16])

[패딩 처리 확인]
패딩 위치의 출력값 평균: 0.0

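6. Bahdanau Attention Class

This step implements the Bahdanau attention mechanism.
The decoder hidden state is combined with the encoder outputs to compute attention weights, which are then used to build the context vector.
A sequence mask removes the influence of padding tokens.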

class BahdanauAttention(nn.Module):
    def __init__(self, hidden_dim):
        super(BahdanauAttention, self).__init__()
        # 어텐션 레이어 정의
        self.attn_projection_layer = nn.Linear(hidden_dim * 2, hidden_dim)
        self.attention_v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs, mask):
        
        # decoder_hidden 확장
        decoder_hidden_expanded = decoder_hidden.unsqueeze(1).repeat(1, encoder_outputs.size(1), 1)
        
        # 디코더 히든과 인코더 출력 결합
        combined_states = torch.cat([decoder_hidden_expanded, encoder_outputs], dim=2)
        
        # 어텐션 스코어 계산 (stage4와 동일한 방식)
        attention_scores_prep = torch.tanh(self.attn_projection_layer(combined_states))
        attention_scores = self.attention_v(attention_scores_prep).squeeze(2)
        
        # 마스킹 적용
        attention_scores = attention_scores.masked_fill(~mask, -1e10)
        
        # 어텐션 가중치 정규화
        attention_weights = F.softmax(attention_scores, dim=1)
        
        # context vector 계산 
        context_vector = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs).squeeze(1)
        
        return context_vector, attention_weights

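The attention layers are defined as follows:
- nn.Linear(hidden_dim * 2, hidden_dim) projects the concatenated decoder hidden state and encoder output.
- nn.Linear(hidden_dim, 1, bias=False) reduces the projection to a scalar attention score.
The decoder hidden state is expanded with unsqueeze(1) and repeated along the source length, then concatenated with the encoder outputs to form combined_states.
The attention scores are computed by applying self.attn_projection_layer and torch.tanh, followed by self.attention_v.
masked_fill assigns a very low score (-1e10) to padded positions, so their weight after softmax is 0.
torch.bmm multiplies the attention weights by the encoder outputs to produce the context vector.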
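
The masking step is easiest to see in isolation. The sketch below uses hand-written scores instead of the learned projection, so it shows only the masked_fill plus softmax part: padded positions end up with weight exactly 0, and the remaining weights are the ordinary softmax over the real tokens.

```python
import torch
import torch.nn.functional as F

# Raw attention scores for 2 source sentences of padded length 4 (made-up values).
scores = torch.tensor([[0.5, 1.0, -0.3, 2.0],
                       [0.1, 0.7,  0.2, 0.9]])
# True marks real tokens; the second sentence has two padding positions.
mask = torch.tensor([[True, True, True, True],
                     [True, True, False, False]])

masked_scores = scores.masked_fill(~mask, -1e10)
weights = F.softmax(masked_scores, dim=1)

print(weights.sum(dim=1))   # each row sums to 1
print(weights[~mask])       # weights at padded positions are 0

# The surviving weights equal a softmax over the real tokens only.
real_only = F.softmax(scores[1, :2], dim=0)
print(torch.allclose(weights[1, :2], real_only))
```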

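❗ Check the attention mechanism's behavior:

Whether the input tensor shapes match expectations
Whether the attention weights are normalized so that they sum to 1 over all source positions
Whether the scores at padded positions are masked with -1e10, so that their attention weights become 0
Whether only real token positions receive non-zero attention weights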

# 결과 확인 코드
attention = BahdanauAttention(hidden_dim)

# 데이터로더에서 첫 번째 배치 가져오기
for src_batch, tgt_batch, src_lengths, tgt_lengths in dataloader:
    # 인코더 실행
    encoder_outputs, hidden, cell = encoder(src_batch, src_lengths)
    
    print("=== 어텐션 메커니즘 동작 확인 ===")
    
    # 마스크 생성 (패딩이 아닌 부분은 True)
    mask = (src_batch != vocab_en["<pad>"])
    print(f"마스크 shape: {mask.shape}")
    
    # 어텐션 계산 - context vector와 attention weights 모두 받아옴
    context_vector, attention_weights = attention(hidden[-1], encoder_outputs, mask)
    
    print("\n[어텐션 결과]")
    print(f"컨텍스트 벡터 shape: {context_vector.shape}")
    print(f"어텐션 가중치 shape: {attention_weights.shape}")
    print("어텐션 가중치 합:", attention_weights.sum(dim=1))  # 각 배치의 가중치 합이 1인지 확인
    
    # 패딩 위치의 어텐션 값 확인
    print("\n[패딩 처리 확인]")
    print("패딩 위치의 어텐션 가중치:", attention_weights[~mask].max().item())
    break

=== 어텐션 메커니즘 동작 확인 ===
마스크 shape: torch.Size([2, 8])

[어텐션 결과]
컨텍스트 벡터 shape: torch.Size([2, 16])
어텐션 가중치 shape: torch.Size([2, 8])
어텐션 가중치 합: tensor([1.0000, 1.0000], grad_fn=<SumBackward1>)

[패딩 처리 확인]
패딩 위치의 어텐션 가중치: 0.0

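7. Decoder Class

This step implements the decoder, the core component of the translation model.

The BahdanauAttention class implemented above is integrated into the decoder, so that it focuses on the relevant parts of the input sequence while generating the translation in the target language.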

class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, padding_idx, attention):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        self.lstm = nn.LSTM(embedding_dim + hidden_dim, hidden_dim, batch_first=True)
        self.attention = attention  # BahdanauAttention 클래스의 인스턴스
        self.output_layer = nn.Linear(hidden_dim * 2, vocab_size)
    
    def forward(self, tgt, hidden, cell, encoder_outputs, mask):
        # tgt: [batch_size, 1]
        embedded = self.embedding(tgt)  # [batch_size, 1, embedding_dim]
        
        # attention 계산 - context vector와 attention weights 모두 받아옴
        context_vector, attention_weights = self.attention(
            hidden[-1],
            encoder_outputs,
            mask
        )
        
        # 디코더 LSTM 입력
        lstm_input = torch.cat([embedded, context_vector.unsqueeze(1)], dim=2)
        output, (hidden, cell) = self.lstm(lstm_input, (hidden, cell))
        
        # 출력 계산
        output_vector = torch.cat([output.squeeze(1), context_vector], dim=1)
        output = self.output_layer(output_vector)
        
        return output, hidden, cell, attention_weights

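❗ Check the decoder's output:

Whether the decoder output has the size of the target vocabulary
Whether the hidden state is updated for the next time step
Whether the attention weights match the length of the source sentence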

# 결과 확인 코드
decoder = Decoder(len(vocab_ko), embedding_dim, hidden_dim, 
                padding_idx=vocab_ko["<pad>"], attention=attention)

# 데이터로더에서 첫 번째 배치 가져오기
for src_batch, tgt_batch, src_lengths, tgt_lengths in dataloader:
    # 인코더 출력 가져오기
    encoder_outputs, hidden, cell = encoder(src_batch, src_lengths)
    mask = (src_batch != vocab_en["<pad>"])
    
    print("=== 디코더 동작 확인 ===")
    # 첫 번째 타겟 토큰으로 디코더 실행
    tgt_input = tgt_batch[:,0:1]  # <sos> 토큰
    output, new_hidden, new_cell, attn_weights = decoder(
        tgt_input, hidden, cell, encoder_outputs, mask)
    
    print("\n[출력 shape 확인]")
    print(f"디코더 출력: {output.shape} (어휘 크기: {len(vocab_ko)})")
    print(f"새로운 히든 스테이트: {new_hidden.shape}")
    print(f"어텐션 가중치: {attn_weights.shape}")
    break

=== 디코더 동작 확인 ===

[출력 shape 확인]
디코더 출력: torch.Size([2, 18]) (어휘 크기: 18)
새로운 히든 스테이트: torch.Size([1, 2, 16])
어텐션 가중치: torch.Size([2, 8])

8. Bahdanau Attention Seq2Seq Model Class

This step implements a Seq2Seq model based on Bahdanau attention.
The model uses an encoder-decoder structure: at each decoder step it computes attention weights and uses them to generate the translation.

class BadhanauAttentionSeq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(BadhanauAttentionSeq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        self.last_attention_weights = []  # 어텐션 가중치 저장용

    def forward(self, src, src_lengths, tgt=None, teacher_forcing_ratio=0.5):
        
        self.last_attention_weights = []
        
        batch_size = src.size(0)
        max_len = tgt.size(1)-1 if tgt is not None else 20
        vocab_size = self.decoder.output_layer.out_features
        
        # 인코더 출력
        encoder_outputs, hidden, cell = self.encoder(src, src_lengths)
        
        # 첫 번째 디코더 입력 (<sos> 토큰)
        decoder_input = torch.tensor([vocab_ko["<sos>"]] * batch_size).unsqueeze(1).to(self.device)
        
        # 출력 텐서 초기화
        outputs = torch.zeros(batch_size, max_len, vocab_size).to(self.device)
        
        # 마스크 생성
        mask = (src != vocab_en["<pad>"])
        # 디코딩 단계
        for t in range(max_len): 
                    
            # 디코더 한 스텝 실행
            output, hidden, cell, attn_weights = self.decoder(decoder_input, hidden, cell, encoder_outputs, mask)
            outputs[:, t] = output

            # 어텐션 가중치 저장
            self.last_attention_weights.append(attn_weights.cpu().detach().numpy())
            
            # 다음 입력 결정 (Teacher Forcing)
            use_teacher_forcing = (random.random() < teacher_forcing_ratio)
            
            if use_teacher_forcing and tgt is not None:
                decoder_input = tgt[:, t + 1].unsqueeze(1)  # ground-truth token for the next step (the output at step t predicts tgt[:, t + 1])
            else:
                decoder_input = output.argmax(1).unsqueeze(1)  # 모델이 예측한 토큰 사용

        return outputs

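The encoder produces encoder_outputs together with the hidden and cell states that initialize the decoder.
The <sos> token is used as the decoder's first input.
The outputs tensor is initialized to hold the decoder output at every step.
A mask is built from the input that is True at real tokens and False at padding tokens.
Each decoder call computes the output for one step, which is stored in outputs.
The attention weights are saved so that one can inspect where the model attended at each step.
Teacher forcing sets the decoder's next input to either the ground-truth token or the model's own prediction.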
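
The index bookkeeping of teacher forcing is easy to get wrong, so here is a sketch of the usual alignment in plain Python. It assumes full teacher forcing and a target that starts with <sos> and ends with <eos>; teacher_forced_steps is a hypothetical helper, not part of the model. At step t the decoder is fed tgt[t] and its output is scored against tgt[t + 1], so the ground-truth token fed in at the following step is tgt[t + 1].

```python
from typing import List, Tuple


def teacher_forced_steps(tgt: List[str]) -> List[Tuple[int, str, str]]:
    """Return (step, decoder_input, expected_output) under full teacher forcing."""
    steps = []
    for t in range(len(tgt) - 1):      # same step count as max_len = tgt.size(1) - 1
        decoder_input = tgt[t]         # token fed to the decoder at step t
        expected = tgt[t + 1]          # token the output at step t is scored against
        steps.append((t, decoder_input, expected))
    return steps


for step in teacher_forced_steps(["<sos>", "나는", "책을", "읽고", "있다", "<eos>"]):
    print(step)
```

The expected_output column is the same slice that the loss uses as targets, tgt[:, 1:].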

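❗ Check the output of the BadhanauAttentionSeq2Seq model:

How an input batch is turned into a 3-dimensional tensor by the model
What each dimension of the output tensor means:
- First dimension: number of sentences in the batch
- Second dimension: length of the generated sentence
- Third dimension: unnormalized score (logit) for each target-vocabulary word at that position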

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 결과 확인 코드
model = BadhanauAttentionSeq2Seq(encoder, decoder, device).to(device)

for src_batch, tgt_batch, src_lengths, tgt_lengths in dataloader:
    src_batch, tgt_batch = src_batch.to(device), tgt_batch.to(device)
    
    print("=== Seq2Seq 모델 동작 확인 ===")
    print("\n[입력 정보]")
    print(f"소스 문장 shape: {src_batch.shape}")
    print(f"타겟 문장 shape: {tgt_batch.shape}")
    
    outputs = model(src_batch, src_lengths, tgt_batch, teacher_forcing_ratio=0.8)
    
    print("\n[출력 정보]")
    print(f"출력 shape: {outputs.shape}")  # [batch_size, max_len, vocab_size]
    print(f"- batch_size: {outputs.shape[0]}")
    print(f"- sequence_length: {outputs.shape[1]}")
    print(f"- vocabulary_size: {outputs.shape[2]}")

    break

=== Seq2Seq 모델 동작 확인 ===

[입력 정보]
소스 문장 shape: torch.Size([2, 8])
타겟 문장 shape: torch.Size([2, 7])

[출력 정보]
출력 shape: torch.Size([2, 6, 18])
- batch_size: 2
- sequence_length: 6
- vocabulary_size: 18

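9. Preparing for Training

This step prepares the Seq2Seq model for training.
It defines the optimizer and loss function and sets up the DataLoader for batching.
The model's trainable parameters are passed to the optimizer, and padding tokens are ignored when the loss is computed.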

import torch.optim as optim

def init_weights(model):
    if isinstance(model, (nn.Linear, nn.LSTM, nn.Embedding)):
        for name, param in model.named_parameters():
            if 'weight' in name:
                nn.init.xavier_uniform_(param)
            elif 'bias' in name:
                nn.init.zeros_(param)

# 하이퍼파라미터 설정
learning_rate = 0.01
n_epochs = 200
batch_size = 2

# 옵티마이저 및 손실 함수 정의
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index=vocab_ko["<pad>"])

# 데이터로더 업데이트 (배치 크기 적용)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)


# 모델 가중치 초기화 적용
model.apply(init_weights)

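optim.Adam(model.parameters(), lr=learning_rate) defines the optimizer.
nn.CrossEntropyLoss(ignore_index=vocab_ko["<pad>"]) defines the loss function.
The DataLoader is created from dataset, batch_size, and collate_fn to produce batches.
model.apply(init_weights) re-initializes the weights with Xavier uniform initialization and sets the biases to zero. Because this also covers nn.Embedding, the zero <pad> rows created by padding_idx are overwritten as well.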

BadhanauAttentionSeq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(25, 16, padding_idx=0)
    (lstm): LSTM(16, 16, batch_first=True)
  )
  (decoder): Decoder(
    (embedding): Embedding(18, 16, padding_idx=0)
    (lstm): LSTM(32, 16, batch_first=True)
    (attention): BahdanauAttention(
      (attn_projection_layer): Linear(in_features=32, out_features=16, bias=True)
      (attention_v): Linear(in_features=16, out_features=1, bias=False)
    )
    (output_layer): Linear(in_features=32, out_features=18, bias=True)
  )
)
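
Xavier uniform initialization draws each weight from [-a, a], where a depends on the layer's shape. The sketch below computes that bound for a layer with the same shape as attn_projection_layer (32 inputs, 16 outputs) and compares it with what nn.init.xavier_uniform_ actually produces. The layer here is a fresh stand-in, not the trained model's layer.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(32, 16)          # same shape as attn_projection_layer
nn.init.xavier_uniform_(layer.weight)

fan_out, fan_in = layer.weight.shape
bound = math.sqrt(6.0 / (fan_in + fan_out))   # gain defaults to 1.0

print(f"bound = {bound:.4f}")                          # bound = 0.3536
print(f"max |w| = {layer.weight.abs().max().item():.4f}")
print(f"expected std = {bound / math.sqrt(3):.4f}")    # expected std = 0.2041
```

A uniform distribution on [-a, a] has standard deviation a / sqrt(3), which is why the standard deviations differ from layer to layer instead of sharing one fixed range.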

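❗ Check that everything is ready for training:

Whether the hyperparameters are set to sensible values
Whether the DataLoader serves batches of the specified size
Whether the weights are Xavier-uniform initialized, that is, drawn uniformly from [-a, a] with a = sqrt(6 / (fan_in + fan_out)) for each weight matrix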

# 결과 확인 코드
print("=== 학습 준비 상태 확인 ===")
print("\n[하이퍼파라미터]")
print(f"학습률: {learning_rate}")
print(f"에폭 수: {n_epochs}")
print(f"배치 크기: {batch_size}")

print("\n[모델 가중치 초기화 확인]")
for name, param in model.named_parameters():
    if 'weight' in name:
        print(f"{name}: 평균={param.mean().item():.4f}, 표준편차={param.std().item():.4f}")

=== 학습 준비 상태 확인 ===

[하이퍼파라미터]
학습률: 0.01
에폭 수: 200
배치 크기: 2

[모델 가중치 초기화 확인]
encoder.embedding.weight: 평균=-0.0029, 표준편차=0.2236
encoder.lstm.weight_ih_l0: 평균=0.0060, 표준편차=0.1579
encoder.lstm.weight_hh_l0: 평균=-0.0038, 표준편차=0.1593
decoder.embedding.weight: 평균=-0.0181, 표준편차=0.2479
decoder.lstm.weight_ih_l0: 평균=-0.0030, 표준편차=0.1435
decoder.lstm.weight_hh_l0: 평균=-0.0022, 표준편차=0.1563
decoder.attention.attn_projection_layer.weight: 평균=0.0130, 표준편차=0.2048
decoder.attention.attention_v.weight: 평균=0.1043, 표준편차=0.3568
decoder.output_layer.weight: 평균=-0.0073, 표준편차=0.2038

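10. Training the Model

This step implements the training loop for the translation model.

For each batch of training data, the model runs a forward pass, the loss is computed, and backpropagation updates the model parameters.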

def train(model, dataloader, optimizer, criterion, device, epoch):
    model.train()
    epoch_loss = 0
    total_batches = len(dataloader)
    
    for src_batch, tgt_batch, src_lengths, tgt_lengths in dataloader:
        # 1. 데이터 준비
        src_batch = src_batch.to(device)
        tgt_batch = tgt_batch.to(device)
        
        # 2. 그래디언트 초기화
        optimizer.zero_grad()
        
        # 3. 모델 순전파
        outputs = model(src_batch, src_lengths, tgt_batch, teacher_forcing_ratio=0.7)
        
        # 4. 출력 텐서 처리
        output_dim = outputs.shape[-1]
        max_tgt_length = tgt_batch.shape[1] - 1
 
        outputs = outputs[:, :max_tgt_length, :].reshape(-1, output_dim)   # [2, 6, 18] -> [12, 18]
        targets = tgt_batch[:, 1:max_tgt_length+1].reshape(-1)   # [2, 7] -> [12]
       
        # 5. 역전파 및 최적화
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        
        # 6. 손실 누적
        epoch_loss += loss.item()
    
    return epoch_loss / total_batches

model.train() puts the model in training mode.
optimizer.zero_grad() resets the gradients.
- The input batch (src_batch) and target batch (tgt_batch) are passed to the model to obtain the predictions (outputs).
outputs and the ground truth (targets) are reshaped so that they can be passed to the loss function.
- outputs changes shape from [batch_size, seq_len, vocab_size] to [batch_size * seq_len, vocab_size].
- targets drops the leading <sos> and changes shape from [batch_size, seq_len] to [batch_size * seq_len].
criterion(outputs, targets) computes the loss.
loss.backward() performs backpropagation.
optimizer.step() updates the parameters.
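
The reshape and the effect of ignore_index can be checked on a toy example. In this sketch (random logits, made-up targets, PAD assumed to be index 0 as in vocab_ko) the loss with ignore_index equals the loss computed only over the non-padding positions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD = 0
batch_size, seq_len, vocab_size = 2, 3, 5

logits = torch.randn(batch_size, seq_len, vocab_size)   # stand-in for model outputs
targets = torch.tensor([[3, 4, 2],
                        [1, 2, PAD]])                   # last position is padding

flat_logits = logits.reshape(-1, vocab_size)            # [6, 5]
flat_targets = targets.reshape(-1)                      # [6]

criterion = nn.CrossEntropyLoss(ignore_index=PAD)
loss = criterion(flat_logits, flat_targets)

# Reference: the same loss computed over the non-padding positions only.
keep = flat_targets != PAD
reference = nn.CrossEntropyLoss()(flat_logits[keep], flat_targets[keep])

print(flat_logits.shape, flat_targets.shape)
print(torch.allclose(loss, reference))
```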

❗ Check the training results:

How the loss value changes
The shape of the learning curve

loss_history = []

for epoch in range(n_epochs):
    epoch_loss = train(model, dataloader, optimizer, criterion, device,epoch)
    loss_history.append(epoch_loss)
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {epoch_loss:.4f}")

# 학습 곡선 시각화
import matplotlib.pyplot as plt

plt.figure()
plt.plot(range(1, n_epochs + 1), loss_history)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Time')
plt.grid(True)
plt.show()

Epoch 10/200, Loss: 1.1914
Epoch 20/200, Loss: 0.5683
Epoch 30/200, Loss: 0.1888
Epoch 40/200, Loss: 0.0704
Epoch 50/200, Loss: 0.0472
Epoch 60/200, Loss: 0.0255
Epoch 70/200, Loss: 0.0181
Epoch 80/200, Loss: 0.0138
Epoch 90/200, Loss: 0.0109
Epoch 100/200, Loss: 0.0087
Epoch 110/200, Loss: 0.0074
Epoch 120/200, Loss: 0.0063
Epoch 130/200, Loss: 0.0053
Epoch 140/200, Loss: 0.0046
Epoch 150/200, Loss: 0.0040
Epoch 160/200, Loss: 0.0038
Epoch 170/200, Loss: 0.0032
Epoch 180/200, Loss: 0.0029
Epoch 190/200, Loss: 0.0026
Epoch 200/200, Loss: 0.0023

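11. Translation and Inference

This step uses the trained model to translate input sentences.
The model is run in inference mode, the translation is built from its predictions together with the attention weights, and consecutive duplicate words are filtered out of the output.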

def translate(model, sentence, vocab_en, vocab_ko, device, max_length=20):
    """
    학습된 모델을 사용해 입력 문장을 번역합니다.
    """
    model.eval()  # 평가 모드로 설정
    
    # 입력 문장을 인덱스 시퀀스로 변환하고 텐서로 변환
    src_indices = sentence_to_indices(sentence, vocab_en).to(device)
    src_length = [len(src_indices[0])]

    # 모델을 추론 모드로 호출
    with torch.no_grad():
        outputs = model(src_indices, src_length, tgt=None, teacher_forcing_ratio=0.0)
        predicted_indices = outputs.argmax(2).squeeze(0).cpu().numpy()
        attention_weights = model.last_attention_weights  # 어텐션 가중치 가져오기
    # 인덱스를 한국어 단어로 변환
    idx2word_ko = {v: k for k, v in vocab_ko.items()}
    translated_words = []
    prev_word = None
    
    for idx in predicted_indices:
        # <eos> ends the translation; this check must come before the skip below
        if idx == vocab_ko["<eos>"]:
            break
        # <sos> and <pad> are never part of the translation
        if idx in [vocab_ko["<sos>"], vocab_ko["<pad>"]]:
            continue

        word = idx2word_ko[idx]
        # skip a word that immediately repeats the previous one
        if word != prev_word:
            translated_words.append(word)
            prev_word = word
    
    return translated_words, attention_weights

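model.eval() puts the model in evaluation mode.
torch.no_grad() disables gradient tracking during inference.
argmax over the model output selects the highest-scoring index at each position.
The <sos> (start), <eos> (end), and <pad> (padding) tokens are excluded from the translation.
prev_word is used to drop a word that immediately repeats the previous one.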
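
The post-processing of predicted indices can be isolated into a small function. The sketch below is one way to write it, assuming the special-token indices used in vocab_ko (0, 1, 2); decode_indices is a hypothetical helper, not part of the code above. It stops at the first <eos>, skips <sos> and <pad>, and collapses immediately repeated words.

```python
from typing import Dict, List


def decode_indices(indices: List[int], idx2word: Dict[int, str],
                   sos: int = 1, eos: int = 2, pad: int = 0) -> List[str]:
    """Turn predicted indices into a list of words."""
    words: List[str] = []
    for idx in indices:
        if idx == eos:                 # end of sentence: ignore everything after it
            break
        if idx in (sos, pad):          # special tokens never appear in the output
            continue
        word = idx2word[idx]
        if not words or words[-1] != word:   # drop immediate repetitions
            words.append(word)
    return words


idx2word = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "나는", 9: "책을", 10: "읽고", 13: "있다"}
print(decode_indices([3, 9, 9, 10, 13, 2, 13, 13], idx2word))
# ['나는', '책을', '읽고', '있다']
```

Checking for <eos> before skipping special tokens matters: if the skip came first, the break would never be reached and tokens generated after <eos> would leak into the result.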

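❗ Compare the model's inference results with the training data to check that the model has learned it. The test sentences below are taken verbatim from the training data.

The test sentences are English and are translated with the trained model.
Compare each output with the reference translation in the training data.
This shows how well the attention-based model reproduces the data it was trained on.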

# 테스트 코드
print("\n=== 학습 데이터 테스트 (학습 확인) ===")
test_sentences = [
    "they are reading books",
    "he is writing at school",
    "they are reading books at home",
    "we are reading the books at school",
]

for sentence in test_sentences:
    translated_words, attention_weights = translate(model, sentence, vocab_en, vocab_ko, device)
    print(f"\nInput: {sentence}")
    print(f"Output: {' '.join(translated_words)}")



=== 학습 데이터 테스트 (학습 확인) ===

Input: they are reading books
Output: 그들은 책을 읽고 있다

Input: he is writing at school
Output: 그는 학교에서 쓰고 있다

Input: they are reading books at home
Output: 그들은 집에서 책을 읽고 있습니다

Input: we are reading the books at school
Output: 우리는 학교에서 책을 읽고 있습니다

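12. Attention Weight Visualization and Analysis

This step examines how well the trained model generalizes and visualizes the attention weights to analyze which source words each output word relies on.

Test the model's translation on new sentences.
Analyze the attention weights.
Visualize the attention weights.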

def plot_attention(input_sentence, output_words, attention_weights):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111)
    
    # one row per output word: the attention weights of the first (only) sentence in the batch
    attention_matrix = np.array([weights[0] for weights in attention_weights[:len(output_words)]])
    
    print("Attention Matrix Shape:", attention_matrix.shape)
    print("Input words:", input_sentence.split())
    print("Output words:", output_words)
    #print("\nAttention Matrix:")
    #print(attention_matrix)

    im = ax.matshow(attention_matrix, cmap='viridis')
    fig.colorbar(im)
    
    # 입력 문장에 <sos>와 <eos> 토큰 추가
    input_tokens = ['<sos>'] + input_sentence.split() + ['<eos>']
    
    ax.set_xticks(range(len(input_tokens)))
    ax.set_xticklabels(input_tokens, rotation=45)
    
    ax.set_yticks(range(len(output_words)))
    ax.set_yticklabels(output_words)
    
    # 각 셀에 가중치 값 표시 (가독성을 위해 소수점 2자리까지만)
    for i in range(len(output_words)):
        for j in range(len(input_tokens)):
            text = ax.text(j, i, f'{attention_matrix[i, j]:.2f}',
                    ha="center", va="center", color="w")
    
    plt.title('Attention 가중치 시각화')
    plt.tight_layout()
    plt.show()

❗ Visualize the attention weights for test_sentence.

# 새로운 테스트 문장들로 번역 성능 확인
print("\n=== 새로운 데이터 테스트 (일반화 능력) ===")
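new_test_sentences = [
    "he is writing books at library",    # writing books at the library (already in train_pairs)
    "i am reading books at school",      # reading books at school (already in train_pairs)
    "they are reading at home",          # reading at home (unseen)
    "she is writing a book at school",   # writing a book at school (unseen)
    "we are eating at library"           # eating at the library (unseen combination)
]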

# 모든 문장에 대해 번역 결과 확인
translation_results = []
for test_sentence in new_test_sentences:
    translated_words, attention_weights = translate(model, test_sentence, vocab_en, vocab_ko, device)
    translation_results.append((test_sentence, translated_words, attention_weights))
    print(f"\n입력: {test_sentence}")
    print(f"번역: {' '.join(translated_words)}")

# 첫 번째 문장에 대해 가중치 분석 및 시각화
test_sentence, translated_words, attention_weights = translation_results[0]
print("\n=== 첫 번째 문장에 대한 어텐션 가중치 분석 및 시각화 ===")
input_words = ["<sos>"] + test_sentence.split() + ["<eos>"]

for idx, (output_word, weights) in enumerate(zip(translated_words, attention_weights)):
    print(f"\n[출력 단어: {output_word}]")
    print("주요 참조 단어(상위 3개):")

    # 가중치와 입력 단어를 매칭하여 정렬
    word_weights = list(zip(input_words, weights[0]))
    word_weights.sort(key=lambda x: x[1], reverse=True)

    # 상위 3개 단어와 가중치 출력
    for word, weight in word_weights[:3]:
        print(f"  - {word}: {weight:.3f}")

# 어텐션 시각화
plot_attention(test_sentence, translated_words, attention_weights)

=== 새로운 데이터 테스트 (일반화 능력) ===

입력: he is writing books at library
번역: 그는 도서관에서 책을 쓰고 있습니다

입력: i am reading books at school
번역: 나는 학교에서 책을 읽고 있습니다

입력: they are reading at home
번역: 그들은 집에서 책을 읽고 있습니다

입력: she is writing a book at school
번역: 그녀는 도서관에서 책을 쓰고 있습니다

입력: we are eating at library
번역: 우리는 집에서 먹고 있다

=== 첫 번째 문장에 대한 어텐션 가중치 분석 및 시각화 ===

[출력 단어: 그는]
주요 참조 단어(상위 3개):
  - <eos>: 0.916
  - library: 0.074
  - books: 0.007

[출력 단어: 도서관에서]
주요 참조 단어(상위 3개):
  - library: 0.854
  - <eos>: 0.136
  - at: 0.009

[출력 단어: 책을]
주요 참조 단어(상위 3개):
  - books: 0.987
  - library: 0.011
  - <eos>: 0.002

[출력 단어: 쓰고]
주요 참조 단어(상위 3개):
  - library: 0.939
  - at: 0.041
  - <eos>: 0.010

[출력 단어: 있습니다]
주요 참조 단어(상위 3개):
  - library: 0.522
  - at: 0.456
  - <eos>: 0.018
Attention Matrix Shape: (5, 8)
Input words: ['he', 'is', 'writing', 'books', 'at', 'library']
Output words: ['그는', '도서관에서', '책을', '쓰고', '있습니다']

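※ Translation quality analysis
Of the five test sentences, the first two also appear in train_pairs, so only the last three measure generalization.

Input: he is writing books at library
Output: 그는 도서관에서 책을 쓰고 있습니다

This sentence is reproduced correctly: the model handles the word-order difference between English and Korean and uses appropriate particles, producing a grammatically complete Korean sentence. Because the pair is part of the training data, it shows memorization rather than generalization.
The three unseen sentences each contain an error.
Input: they are reading at home
Output: 그들은 집에서 책을 읽고 있습니다
The output inserts 책을 ('book'), which is not in the input.
Input: she is writing a book at school
Output: 그녀는 도서관에서 책을 쓰고 있습니다
'school' is translated as 도서관 ('library').
Input: we are eating at library
Output: 우리는 집에서 먹고 있다
'library' is translated as 집 ('home').
In each case the output is identical to the translation of the closest training sentence, which is the expected limitation of a model trained on only 10 sentence pairs. Production systems are trained on tens of thousands of sentence pairs or more, so errors of this kind should be greatly reduced.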
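
Before reading test results as evidence of generalization, it helps to check which test sentences the model has already seen. The sketch below does that with a plain set lookup; the source sentences are copied from train_pairs and new_test_sentences, and the variable names are illustrative.

```python
# English sides of the ten training pairs.
train_sources = {
    "she is reading a book",
    "i am reading a book",
    "they are reading books",
    "he is writing at school",
    "we are eating at home",
    "i am reading books at school",
    "he is writing books at library",
    "they are reading books at home",
    "she is writing a book at library",
    "we are reading the books at school",
}

test_sentences = [
    "he is writing books at library",
    "i am reading books at school",
    "they are reading at home",
    "she is writing a book at school",
    "we are eating at library",
]

# Split the test set into sentences seen during training and unseen ones.
seen = [s for s in test_sentences if s in train_sources]
unseen = [s for s in test_sentences if s not in train_sources]

print("seen in training:", seen)
print("unseen:", unseen)
```

Only the unseen list says anything about generalization; the seen list repeats the reproduction test from section 11.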