
JetMoe

This model was released on 2023-06-07 and added to Hugging Face Transformers on 2024-05-14.

PyTorch FlashAttention SDPA

JetMoe-8B is an 8B-parameter Mixture-of-Experts (MoE) language model developed by Yikang Shen and MyShell. The JetMoe project aims to deliver LLaMA2-level performance with an efficient language model trained on a limited budget. To achieve this goal, JetMoe uses a sparsely activated architecture inspired by ModuleFormer. Each JetMoe block consists of two MoE layers: a Mixture of Attention Heads and a Mixture of MLP Experts. Given the input tokens, each layer activates only a subset of its experts to process them. This sparse activation scheme enables JetMoe to achieve much higher training throughput than dense models of a similar size. The training throughput of JetMoe-8B is around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy.
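Because only a subset of experts is activated per token, the model is used through the standard Transformers text-generation API like any other causal language model. The snippet below is a minimal sketch, assuming the 8B checkpoint is published on the Hub as `jetmoe/jetmoe-8b`; substitute the checkpoint name you actually use, and note that `device_map="auto"` requires the `accelerate` package.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jetmoe/jetmoe-8b"  # assumed checkpoint name; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,       # half precision to reduce memory for the 8B parameters
    attn_implementation="sdpa",       # or "flash_attention_2" if flash-attn is installed
    device_map="auto",                # spread the model across available devices (needs accelerate)
)

inputs = tokenizer("The Mixture-of-Experts architecture works by", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```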

This model was contributed by Yikang Shen.

[[autodoc]] JetMoeConfig
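As with other Transformers models, the configuration class can be instantiated on its own to build a randomly initialized model. The following is a minimal sketch using the default hyperparameters; consult the `JetMoeConfig` reference for the fields you may want to override.

```python
from transformers import JetMoeConfig, JetMoeModel

# Default hyperparameters; all weights are randomly initialized, so load a
# pretrained checkpoint instead for real use.
config = JetMoeConfig()
model = JetMoeModel(config)

print(model.config)
```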

[[autodoc]] JetMoeModel
    - forward

[[autodoc]] JetMoeForCausalLM
    - forward

[[autodoc]] JetMoeForSequenceClassification
    - forward
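The sequence-classification head reuses the pretrained backbone with a newly initialized classifier on top. The sketch below assumes the `jetmoe/jetmoe-8b` checkpoint; the head is randomly initialized, so the logits are only meaningful after fine-tuning.

```python
import torch
from transformers import AutoTokenizer, JetMoeForSequenceClassification

model_id = "jetmoe/jetmoe-8b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = JetMoeForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,                # e.g. binary classification; the head is randomly initialized
    torch_dtype=torch.bfloat16,
)

inputs = tokenizer("JetMoe trains efficiently on a small budget.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))
```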