Transformer Model Notes - [3]

This post gathers notes from several sources. I wrote it to build an intuitive understanding of the Transformer and to keep from forgetting what I learned.

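1. Key hyperparameters of the Transformer

1) Input and output dimension [= Embedding Size]

Determines the dimension of the word representations the model learns.
A larger embedding size allows richer representations, but model complexity and memory usage grow with it.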
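To make the memory cost concrete, here is a rough back-of-the-envelope sketch. The vocabulary size of 32,000 and the float32 storage are assumptions picked for illustration, not values taken from any particular model.

```python
def embedding_table_bytes(vocab_size, embed_size, bytes_per_param=4):
    """Memory needed for a (vocab_size, embed_size) embedding table, in bytes."""
    return vocab_size * embed_size * bytes_per_param


# Assumed vocabulary size (hypothetical) with float32 parameters (4 bytes each)
VOCAB_SIZE = 32_000

for embed_size in (256, 512, 1024):
    mib = embedding_table_bytes(VOCAB_SIZE, embed_size) / 2**20
    print(f"embed_size={embed_size:4d} -> {mib:.2f} MiB")
```

Doubling the embedding size doubles the size of this table. The attention and feed-forward weights grow faster still, since they scale with the square of the embedding size.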

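2) Encoder and decoder layers [= Num of Layers]

More layers let the model learn more complex patterns, but they also make overfitting more likely.
BERT and GPT models commonly use 12, 24, or 48 layers (for example, BERT-base has 12, BERT-large has 24, and the largest GPT-2 has 48).

3) Number of attention heads [= Num of Attention Heads]

The number of heads processed in parallel inside a multi-head attention layer.
More heads let the model attend to different kinds of information at the same time, but can raise the computational cost.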
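In the common setup the embedding is divided evenly across the heads, so each head works on embed_size / num_heads dimensions. The sketch below shows that split; the helper name `split_heads` and the sizes are illustrative assumptions.

```python
import numpy as np


def split_heads(x, num_heads):
    """Reshape (seq_len, embed_size) into (num_heads, seq_len, head_dim)."""
    seq_len, embed_size = x.shape
    if embed_size % num_heads != 0:
        raise ValueError("embed_size must be divisible by num_heads")
    head_dim = embed_size // num_heads
    # (seq_len, embed_size) -> (seq_len, num_heads, head_dim) -> (num_heads, seq_len, head_dim)
    return x.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)


x = np.arange(10 * 512, dtype=np.float64).reshape(10, 512)
heads = split_heads(x, num_heads=8)
print(heads.shape)  # (8, 10, 64)
```

Head 0 receives the first 64 embedding dimensions of every token, head 1 the next 64, and so on.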

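4) Hidden layer size [= Hidden Size]

Sets the width (number of units) of the hidden layer in the feed-forward network, that is, the size of the feed-forward layer. It is not a count of layers.
It is usually set to about 4 times the embedding size (for example, 2048 for an embedding size of 512).

5) Sequence length [= Sequence Length]

The maximum number of tokens the model can process at once.
Transformers typically work with a fixed maximum input length (shorter inputs are padded and longer ones truncated), so this setting matters.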
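As a small illustration of what a fixed length means for the input, the sketch below pads or truncates a list of token ids. The function name and the padding id 0 are assumptions; real tokenizers define their own padding token.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return a list of exactly max_len token ids."""
    if len(token_ids) >= max_len:
        return list(token_ids[:max_len])  # drop tokens past the limit
    # pad on the right with pad_id (assumed to be 0 here)
    return list(token_ids) + [pad_id] * (max_len - len(token_ids))


print(pad_or_truncate([5, 8, 2], max_len=5))           # [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2, 9, 4, 7], max_len=5))  # [5, 8, 2, 9, 4]
```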

Beyond these, there are also the learning rate (lr), the batch size, and so on.
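One way to keep these settings together is a small config object. The sketch below is only an illustration: the class name `TransformerConfig` is hypothetical, and the values for `max_seq_len`, `learning_rate`, and `batch_size` are assumed rather than taken from this post.

```python
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    embed_size: int = 512        # embedding size
    num_layers: int = 6          # encoder / decoder layers
    num_heads: int = 8           # attention heads
    hidden_size: int = 2048      # feed-forward hidden size (4 x embed_size here)
    max_seq_len: int = 512       # assumed maximum sequence length
    learning_rate: float = 1e-4  # assumed value
    batch_size: int = 32         # assumed value

    def __post_init__(self):
        if self.embed_size % self.num_heads != 0:
            raise ValueError("embed_size must be divisible by num_heads")

    @property
    def head_dim(self):
        """Dimension handled by each attention head."""
        return self.embed_size // self.num_heads

    def encoder_layer_weights(self):
        """Weight count of one encoder layer, ignoring biases and layer norm."""
        attention = 4 * self.embed_size * self.embed_size      # W_Q, W_K, W_V, W_O
        feed_forward = 2 * self.embed_size * self.hidden_size  # W_1, W_2
        return attention + feed_forward


config = TransformerConfig()
print(config.head_dim)                 # 64
print(config.encoder_layer_weights())  # 3145728
```

With these values the feed-forward block holds about two thirds of the weights in each layer, which is why the hidden size has such a large effect on model size.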

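2. Implementing positional encoding

The positional encoding is computed as follows.

$PE_{pos, 2i} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)$

$PE_{pos, 2i+1} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)$

Here, pos is the position of the word in the sequence, $i$ indexes the sine/cosine pair (so $2i$ and $2i+1$ are the even and odd embedding dimensions), and $d$ is the embedding dimension.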

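import numpy as np

def positional_encoding(seq_len, d_model):

    # Initialize the positional encoding array
    pe = np.zeros((seq_len, d_model))  # (sequence length, embedding dimension)

    # Compute the value for each position and each pair of embedding dimensions
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i already steps over the even dimensions (it plays the role of 2i
            # in the formula), so the exponent is i / d_model, not 2 * i / d_model
            angle = pos / (10000 ** (i / d_model))
            pe[pos, i] = np.sin(angle)
            if i + 1 < d_model:  # guard against an odd d_model
                pe[pos, i + 1] = np.cos(angle)

    return pe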
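The same formula can also be computed without Python loops by using NumPy broadcasting. The version below is a minimal sketch; the function name `positional_encoding_vectorized` is my own, and it evaluates the formula above with $2i$ taken directly from the even dimension indices.

```python
import numpy as np


def positional_encoding_vectorized(seq_len, d_model):
    """Sinusoidal positional encoding of shape (seq_len, d_model), without loops."""
    positions = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)           # the values 2i: 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles[:, : d_model // 2])  # odd dimensions
    return pe


pe = positional_encoding_vectorized(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0: [0. 1. 0. 1.]
```

At position 0 every sine term is 0 and every cosine term is 1, which makes a quick sanity check for any implementation.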

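3. Implementing scaled dot-product attention

Scaled dot-product attention proceeds in the following order.

Compute attention scores [dot product of Query and Key] -> Scale [divide by $\sqrt{d_{k}}$ to keep the scores numerically stable; $d_{k}$ is the dimension of the key vectors] -> Apply softmax [produces a probability distribution = the attention weights] -> Compute the weighted sum [attention weights * Value]

$Attention(Q, K, V) = softmax\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$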

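import numpy as np


def scaled_dot_product_attention(Q, K, V):

    """
    Q: Query matrix of shape (..., seq_len_q, d_k)
    K: Key matrix of shape (..., seq_len_k, d_k)
    V: Value matrix of shape (..., seq_len_k, d_v)
    """

    # 1. Dot product of Query and Key, scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    # Swap only the last two axes so batched (...) inputs also work;
    # K.T would reverse every axis of a batched array
    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)

    # 2. Apply softmax over the key axis (the row max is subtracted for stability)
    attention_weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights /= np.sum(attention_weights, axis=-1, keepdims=True)

    # 3. Compute the weighted sum of the values
    output = np.matmul(attention_weights, V)

    return output, attention_weights


# Example Query, Key, Value
Q = np.array([[1.0, 0.0, 0.5]])
K = np.array([[0.5, 0.2, 0.3], [0.1, 1.0, 0.5], [0.3, 0.8, 0.7]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

output, attention_weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)             # (1, 2)
print(attention_weights.shape)  # (1, 3)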
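To see why the scaling step helps, the sketch below compares the softmax of raw dot products with the softmax of scaled ones. Random Gaussian vectors stand in for learned queries and keys, which is an assumption made purely for illustration. When the components have unit variance, the dot product has variance $d_{k}$, so the raw scores spread out as $d_{k}$ grows and the softmax collapses onto a single key.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)


rng = np.random.default_rng(0)         # fixed seed, chosen arbitrarily
d_k = 512
q = rng.standard_normal(d_k)           # one query vector
keys = rng.standard_normal((10, d_k))  # ten key vectors

raw_scores = keys @ q                  # dot products; their spread grows with d_k
unscaled = softmax(raw_scores)
scaled = softmax(raw_scores / np.sqrt(d_k))

print("largest weight without scaling:", unscaled.max())
print("largest weight with scaling:   ", scaled.max())
```

Dividing by $\sqrt{d_{k}}$ never makes the distribution more peaked, so the scaled weights stay smoother and gradients through the softmax are less likely to vanish.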

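4. Implementing multi-head attention

Multiple independent attention heads let the model learn different representations at the same time. Each head projects Query, Key, and Value with its own weights, the heads are computed in parallel, and their results are combined, which improves model performance.

$MultiHead(Q, K, V) = Concat(head_{1}, \ldots, head_{h})W^{O}$

Each attention head is computed as follows.

$head_{i} = Attention(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})$

Here, $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V}$ are the weight matrices that project Query, Key, and Value for head $i$, and $W^{O}$ is the output projection applied to the concatenated heads.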

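import numpy as np


def scaled_dot_product_attention(Q, K, V):

    """
    Scaled Dot-Product Attention (2-D inputs)
    Computes the attention scores for one head, applies softmax to get the weights,
    and returns the weighted sum of V
    """
    d_k = Q.shape[-1]
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    attention_weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights /= attention_weights.sum(axis=-1, keepdims=True)
    output = np.dot(attention_weights, V)

    return output


def multi_head_attention(Q, K, V, num_heads):

    """
    Multi-Head Attention implementation
    The projection weights are drawn at random on every call, so this only
    demonstrates the data flow; in a real model they are learned parameters
    """
    d_model = Q.shape[-1]
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads

    # Initialize the weight matrices
    # (W_K and W_V use the feature size of K and V as their input dimension,
    #  so inputs such as the example V with 2 features still project correctly)
    W_Q = np.random.rand(num_heads, d_model, d_k)
    W_K = np.random.rand(num_heads, K.shape[-1], d_k)
    W_V = np.random.rand(num_heads, V.shape[-1], d_k)
    W_O = np.random.rand(num_heads * d_k, d_model)

    # Project Query, Key, Value for each head and run attention
    heads = []
    for i in range(num_heads):
        Q_i = np.dot(Q, W_Q[i])
        K_i = np.dot(K, W_K[i])
        V_i = np.dot(V, W_V[i])
        head = scaled_dot_product_attention(Q_i, K_i, V_i)
        heads.append(head)

    # Concatenate the head outputs and apply the output projection
    concat_heads = np.concatenate(heads, axis=-1)
    output = np.dot(concat_heads, W_O)

    return output


# Example input
Q = np.array([[1.0, 0.0, 0.5]])
K = np.array([[0.5, 0.2, 0.3], [0.1, 1.0, 0.5], [0.3, 0.8, 0.7]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# d_model is 3 here, so num_heads must divide 3
output = multi_head_attention(Q, K, V, num_heads=3)
print(output.shape)  # (1, 3)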
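The loop over heads can also be written as one batched computation, which is closer to how libraries usually arrange it. The sketch below is a self-attention variant under a few assumptions of my own: the function name is hypothetical, the weights are passed in rather than created inside the function, and each projection is a single (d_model, d_model) matrix whose columns are split across the heads.

```python
import numpy as np


def multi_head_attention_batched(X, W_Q, W_K, W_V, W_O, num_heads):
    """Self-attention over X of shape (seq_len, d_model), all heads at once."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def project(W):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return (X @ W).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = project(W_Q), project(W_K), project(W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    heads = weights @ V                               # (num_heads, seq_len, d_k)
    # Concat(head_1, ..., head_h): back to (seq_len, d_model)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O, weights


rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 2, 5
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

output, weights = multi_head_attention_batched(X, W_Q, W_K, W_V, W_O, num_heads)
print(output.shape)   # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

Slicing the columns of one large projection gives the same result as keeping a separate $W_{i}^{Q}$ per head, while letting a single matrix multiplication serve all heads.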

5. Implementing the encoder

import numpy as np

class EncoderLayer:

    def __init__(self, embed_size, num_heads, hidden_size):

        self.embed_size = embed_size
        # Stored only: this simplified layer applies single-head attention below
        self.num_heads = num_heads
        self.hidden_size = hidden_size

        # Weights for self-attention (random and untrained, for demonstration)
        self.W_Q = np.random.rand(embed_size, embed_size)
        self.W_K = np.random.rand(embed_size, embed_size)
        self.W_V = np.random.rand(embed_size, embed_size)

        # Weights for the feed-forward layer
        self.W_1 = np.random.rand(embed_size, hidden_size)
        self.W_2 = np.random.rand(hidden_size, embed_size)

    def scaled_dot_product_attention(self, Q, K, V):

        d_k = Q.shape[-1]
        scores = np.dot(Q, K.T) / np.sqrt(d_k)
        attention_weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        attention_weights /= np.sum(attention_weights, axis=-1, keepdims=True)

        return np.dot(attention_weights, V)

    def layer_norm(self, x):

        # Layer normalization over the embedding axis (no learnable scale or shift)
        return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-5)

    def forward(self, x):

        # 1. Self-attention
        Q = np.dot(x, self.W_Q)
        K = np.dot(x, self.W_K)
        V = np.dot(x, self.W_V)
        attention = self.scaled_dot_product_attention(Q, K, V)

        # Residual connection and layer normalization
        x = self.layer_norm(x + attention)

        # 2. Feed-forward network
        ff_output = np.dot(x, self.W_1)
        ff_output = np.maximum(0, ff_output)  # ReLU activation
        ff_output = np.dot(ff_output, self.W_2)

        # Residual connection and layer normalization
        x = self.layer_norm(x + ff_output)

        return x


class TransformerEncoder:

    def __init__(self, num_layers, embed_size, num_heads, hidden_size):
        self.layers = [EncoderLayer(embed_size, num_heads, hidden_size) for _ in range(num_layers)]

    def forward(self, x):
        # Pass the input through the encoder layers in order
        for layer in self.layers:
            x = layer.forward(x)

        return x


# Encoder settings
num_layers = 6        # number of encoder layers
embed_size = 512      # embedding size
num_heads = 8         # number of attention heads
hidden_size = 2048    # feed-forward hidden size

# Initialize the encoder
encoder = TransformerEncoder(num_layers, embed_size, num_heads, hidden_size)

# Example input (sequence length 10, embedding size 512)
x = np.random.rand(10, embed_size)

# Encoder output
output = encoder.forward(x)
print(output.shape)  # (10, 512)
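The layer normalization in the encoder layer above is written without learnable parameters. In practice layer normalization usually carries a learnable scale and shift as well. A minimal sketch of that variant follows; the names `gamma` and `beta` are conventional but are my own additions here, and both are shown at their usual initial values instead of trained ones.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row of x over the last axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


embed_size = 512
gamma = np.ones(embed_size)   # learnable scale, at its usual initial value
beta = np.zeros(embed_size)   # learnable shift, at its usual initial value

# Input with a deliberately shifted mean and a larger spread
x = np.random.default_rng(0).standard_normal((10, embed_size)) * 3.0 + 5.0
normed = layer_norm(x, gamma, beta)

print(normed.mean(axis=-1).round(6))  # close to 0 for every position
print(normed.std(axis=-1).round(3))   # close to 1 for every position
```

Each position is normalized on its own across the embedding axis, so the result does not depend on the other tokens in the sequence or on the batch.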

code-bean's blog
