JetMoe
This model was released on 2023-06-07 and added to Hugging Face Transformers on 2024-05-14.
Overview
JetMoe-8B is an 8B-parameter Mixture-of-Experts (MoE) language model developed by Yikang Shen and MyShell. The JetMoe project aims to deliver LLaMA2-level performance from an efficient language model trained on a limited budget. To achieve this goal, JetMoe uses a sparsely activated architecture inspired by ModuleFormer. Each JetMoe block consists of two MoE layers: a Mixture of Attention Heads and a Mixture of MLP Experts. Given the input tokens, each layer activates only a subset of its experts to process them. This sparse activation scheme enables JetMoe to achieve much higher training throughput than dense models of similar size. The training throughput of JetMoe-8B is around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy.
This model was contributed by Yikang Shen.
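The snippet below is a minimal sketch of running text generation with a JetMoe checkpoint. It assumes the `jetmoe/jetmoe-8b` checkpoint on the Hugging Face Hub and enough GPU memory for an 8B model; only the generic `pipeline` API is used, nothing JetMoe-specific.

```python
import torch
from transformers import pipeline

# Load a JetMoe checkpoint for text generation (checkpoint id assumed to be "jetmoe/jetmoe-8b").
generator = pipeline(
    "text-generation",
    model="jetmoe/jetmoe-8b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

output = generator("The key advantage of a Mixture-of-Experts model is", max_new_tokens=40)
print(output[0]["generated_text"])
```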
JetMoeConfig
[[autodoc]] JetMoeConfig
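A minimal sketch of the standard configuration-to-model pattern: instantiating a `JetMoeConfig` with its default values and building a randomly initialized model from it. The full list of configuration fields is in the reference above.

```python
from transformers import JetMoeConfig, JetMoeModel

# Initialize a JetMoe configuration with its default values.
configuration = JetMoeConfig()

# Instantiate a randomly initialized model from that configuration.
model = JetMoeModel(configuration)

# Access the configuration back from the model.
configuration = model.config
```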
JetMoeModel
[[autodoc]] JetMoeModel - forward
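A minimal sketch of a forward pass through the base model (no language-modeling head), assuming the `jetmoe/jetmoe-8b` checkpoint; the output's `last_hidden_state` holds one hidden vector per input token.

```python
from transformers import AutoTokenizer, JetMoeModel

tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
model = JetMoeModel.from_pretrained("jetmoe/jetmoe-8b")

inputs = tokenizer("Sparse experts keep inference cheap.", return_tensors="pt")
outputs = model(**inputs)

# Hidden states for every token: (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```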
JetMoeForCausalLM
[[autodoc]] JetMoeForCausalLM - forward
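A minimal sketch of generation with the causal-LM head, again assuming the `jetmoe/jetmoe-8b` checkpoint and using the standard `generate` API.

```python
import torch
from transformers import AutoTokenizer, JetMoeForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
model = JetMoeForCausalLM.from_pretrained(
    "jetmoe/jetmoe-8b", torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The JetMoe architecture activates", return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```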
JetMoeForSequenceClassification
[[autodoc]] JetMoeForSequenceClassification - forward
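A minimal sketch of sequence classification. The classification head is newly initialized when loading the pretrained checkpoint (assumed here to be `jetmoe/jetmoe-8b`), so the logits are only meaningful after fine-tuning on a labeled dataset.

```python
from transformers import AutoTokenizer, JetMoeForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
model = JetMoeForSequenceClassification.from_pretrained("jetmoe/jetmoe-8b", num_labels=2)

# Decoder-only classifiers pool from the last non-padding token, so a pad token must be defined.
model.config.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

inputs = tokenizer("JetMoe trains efficiently on a small budget.", return_tensors="pt")
logits = model(**inputs).logits
predicted_class_id = logits.argmax(dim=-1).item()
```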