About Guangyu Chen
Guangyu Chen (Nathan) is a machine learning researcher at Kimi, Moonshot AI, specializing in model architecture, efficient attention mechanisms, and continual learning. Despite being a high school student from Shenzhen, China, he has made significant contributions to AI research, most notably as co-first author of the "Attention Residuals" paper, which proposes replacing the fixed residual connections in Transformers with learned, input-dependent attention over depth. The paper drew widespread attention, including praise from Elon Musk and Andrej Karpathy. Chen's journey into ML began with contributions to the Flash Linear Attention open-source project, which led to his invitation to join the Kimi team.
Education
Senior Year
Currently a high school senior at an international school in Shenzhen, Guangdong, China. His interest in machine learning began through self-directed learning, competitive programming on Codeforces, and contributing to open-source projects.
Skills
Core technical competencies in machine learning research, model architecture, and systems optimization.
Efficient Attention
Model Architecture
Linear Attention
Residual Connections
Scaling Laws
Continual Learning
Triton Kernels
CUDA Optimization
GPU Compute
PyTorch
Distributed Training
Codeforces
Algorithms
Flash Linear Attention
Open Source Development
Python
Domains
Efficient Attention Mechanisms, Transformer Architecture, Model Scaling
Continual Learning, Hardware-Aligned ML Algorithms, Interpretability
Tags
Self-taught, Open Source Advocate, Prodigious, Driven, Collaborative
Novel Architectures, Scaling Efficiency, Depth-wise Aggregation, Linear Attention
Experience
ML Researcher
Working on ML research at Kimi, focusing on model architecture and efficient attention mechanisms. Co-first author of the Attention Residuals paper, which proposes replacing fixed residual connections with softmax attention over preceding layer outputs, achieving approximately 25% compute efficiency improvement on the Kimi Linear architecture (48B total / 3B activated parameters).
Research Intern
Completed a seven-week internship at a Silicon Valley AI startup. Managed a project involving 144 H100 GPUs and engaged directly with leadership on operational matters including recruitment and financing discussions.
Research Contributor
Worked on applied interpretability research, exploring how neural networks process and represent information internally.
Notable Work
Attention Residuals — Co-first author. Proposes AttnRes, replacing fixed residual accumulation with softmax attention over preceding layer outputs. Validated on the Kimi Linear 48B architecture with 1.4T training tokens, achieving an ~25% compute efficiency improvement with <2% inference latency overhead. Praised by Elon Musk and Andrej Karpathy.
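The core idea above (letting each layer attend over the outputs of all preceding layers instead of summing them with a fixed residual) can be illustrated with a minimal PyTorch sketch. This is an illustration of the general depth-wise attention pattern, not the paper's implementation; the function and parameter names (`attn_residual`, `query_proj`) are hypothetical.

```python
import torch


def attn_residual(history, query_proj):
    """Aggregate preceding layer outputs via softmax attention over depth.

    history: list of per-layer outputs, each of shape (batch, dim).
    query_proj: a module producing a query from the latest layer output.
    Returns an input-dependent mixture of shape (batch, dim), replacing the
    fixed sum x + f(x) of a standard residual connection.
    """
    H = torch.stack(history, dim=1)                # (batch, depth, dim)
    q = query_proj(H[:, -1])                       # query from newest output
    scores = torch.einsum("bd,bld->bl", q, H) / H.size(-1) ** 0.5
    weights = torch.softmax(scores, dim=-1)        # learned weights over depth
    return torch.einsum("bl,bld->bd", weights, H)  # weighted depth aggregate
```

Because the softmax weights depend on the current hidden state, each token can emphasize different depths, whereas a fixed residual always weights every preceding layer equally.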
Core contributor to Flash Linear Attention (FLA), an open-source project (4.7K+ GitHub stars) providing efficient PyTorch and Triton implementations of state-of-the-art linear attention models.
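For context on what linear attention computes, here is a minimal non-causal sketch in plain PyTorch. It is not FLA's kernel code (FLA uses fused Triton kernels and causal/chunked variants); the feature map `elu(x) + 1` is one common choice, assumed here for illustration.

```python
import torch
import torch.nn.functional as F


def linear_attention(q, k, v):
    """Non-causal linear attention: O(seq * dim^2) instead of O(seq^2 * dim).

    q, k, v: tensors of shape (batch, seq, dim).
    A positive feature map replaces the softmax, so the key-value product
    can be summed once and reused for every query position.
    """
    phi = lambda x: F.elu(x) + 1.0                 # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bsd,bse->bde", k, v)        # sum_s phi(k_s) v_s^T
    z = k.sum(dim=1)                               # normalizer sum_s phi(k_s)
    out = torch.einsum("bsd,bde->bse", q, kv)
    denom = torch.einsum("bsd,bd->bs", q, z).unsqueeze(-1)
    return out / denom
```

The key property is that the per-sequence state `kv` has fixed size (dim x dim), which is what makes hardware-efficient Triton implementations like FLA's attractive for long contexts.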
Profile Summary
Efficient Attention, Model Architecture, Linear Attention, Continual Learning, Open Source
