Attention-mechanism

Published on
March 11, 2026
KV Cache Optimization Deep Dive: GQA, MLA, and MHA Attention Mechanisms and Memory-Efficiency Strategies
ai-papers kv-cache attention-mechanism gqa mla transformer 2026-03 2026-03-11
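Covers the basic principles of the KV Cache in Transformer self-attention; memory analysis and comparison of the MHA, MQA, GQA (Llama 2/3), and MLA (DeepSeek-V2/V3) mechanisms; KV Cache compression techniques (quantization, eviction policies, sliding window); the PagedAttention (vLLM) implementation; PyTorch code examples; and OOM failure cases with an optimization checklist.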
Published on
March 9, 2026
FlashAttention Paper Analysis: Transforming Transformer Training and Inference Speed with IO-Aware Exact Attention
ai-papers flash-attention transformer gpu-optimization attention-mechanism 2026-03 2026-03-09
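An analysis of the core papers in the FlashAttention series (v1 to v3). Covers the tiling strategy of the IO-aware algorithm, use of the GPU SRAM/HBM memory hierarchy, recomputation in the backward pass, the parallelization improvements in FlashAttention-2, and FP8 support and asynchronous pipelining in FlashAttention-3, along with practical benchmarks.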

KV Cache Optimization Deep Dive: GQA, MLA, and MHA Attention Mechanisms and Memory-Efficiency Strategies