OrionZheng committed (verified)
Commit 9f2c2ab · 1 Parent(s): 7a49695

Upload README.md

Files changed (1)
  1. README.md +4 -10
README.md CHANGED
@@ -15,7 +15,7 @@ Our project began in the summer of 2023. On August 22, 2023, we released the fir

  As a small student team, instead of pursuing the best model with better data, computation, and human power, we are devoted to fully sharing our training data, strategies, model architecture, weights, and everything we have with the community. We hope this project will promote research in this promising field and invite more contributors to work on open-sourced MoE projects together!

- [2024.01.12] Currently, the paper for the project and more evaluations are underway. For more information about the model, training, and evaluations, please visit our GitHub [repository](https://github.com/XueFuzhao/OpenMoE/tree/main).
+ [2024.01.12] The paper for the project and more evaluations are underway. For more information about the model, training, and evaluations, please visit our GitHub [repository](https://github.com/XueFuzhao/OpenMoE/tree/main).


  ## Model Weights
@@ -26,7 +26,7 @@ We provide all these checkpoints on Huggingface(in pytorch) and Google Cloud Sto

  | Model Name | Description | #Param |Huggingface |
  |----------------|-------------------------------------------------|----------|-------------|
- | OpenMoE-base | A small MoE model for debugging(only go through 128B tokens) |637M |[Link](https://huggingface.co/OrionZheng/openmoe-base) |
+ | OpenMoE-base | A small MoE model for debugging only |637M |[Link](https://huggingface.co/OrionZheng/openmoe-base) |
  | OpenLLaMA-base | A dense counter-part of OpenMoE-base |310M |[Link](https://huggingface.co/fuzhao/OpenLLaMA_Base) |
  | OpenMoE-8B-200B | 8B MoE with comparable FLOPs of a 1.6B LLaMA(No SFT) |8B |[Link](https://huggingface.co/OrionZheng/openmoe-8b-200B/tree/main) |
  | OpenMoE-8B-890B | 8B MoE with comparable FLOPs of a 1.6B LLaMA(No SFT) |8B |[Link](https://huggingface.co/OrionZheng/openmoe-8b-890B) |
@@ -35,7 +35,7 @@ We provide all these checkpoints on Huggingface(in pytorch) and Google Cloud Sto
  | **OpenMoE-34B/32E (200B)** | 34B MoE with comparable FLOPs of a 7B LLaMA(No SFT) |34B |[Link](https://huggingface.co/OrionZheng/openmoe-34b-200B) |


- The base models, which were trained using 128 billion tokens, served primarily for debugging purposes. After validating the effectiveness of our model architexture, we did not pursue further training. Consequently, their performance might not be very well, and the checkpoint are not suitable for practical applications.
+ The base model, which was trained on 128 billion tokens, served primarily for debugging purposes. After validating the effectiveness of our model architecture, we did not pursue further training. Consequently, its performance may not be very good, and the checkpoint is not suitable for practical applications. Better performance can be observed from our 8B or 34B versions.

  The OpenMoE-8B with 4 MoE layers and 32 experts has been trained on 1.1T tokens. The SFT version has also been released after we finetuned the OpenMoE-8B-1.1T on the [WildChat](https://huggingface.co/datasets/allenai/WildChat-nontoxic) dataset's GPT-4 subset. We also provide some intermediate checkpoints at 200B and 890B tokens for research purposes.
 
@@ -98,13 +98,7 @@ Since the models are trained on The Redpajama and The Stack dataset, please chec

  This project is currently contributed by the following authors:

- - [Fuzhao Xue](https://xuefuzhao.github.io/)
- - [Zian Zheng](https://zheng-zian-andy.com)
- - [Yao Fu](https://franxyao.github.io/)
- - [Jinjie Ni](http://jinjie.one/)
- - [Zangwei Zheng](https://zhengzangw.github.io/)
- - [Wangchunshu Zhou](https://michaelzhouwang.github.io/)
- - [Yang You](https://www.comp.nus.edu.sg/~youy/)
+ [Fuzhao Xue](https://xuefuzhao.github.io/), [Zian Zheng](https://zheng-zian-andy.com), [Yao Fu](https://franxyao.github.io/), [Jinjie Ni](http://jinjie.one/), [Zangwei Zheng](https://zhengzangw.github.io/), [Wangchunshu Zhou](https://michaelzhouwang.github.io/), [Yang You](https://www.comp.nus.edu.sg/~youy/)

  ## Acknowledgement
  The computational resources for this project were generously provided by the [Google TPU Research Cloud (TRC)](https://sites.research.google/trc/about/). We extend our heartfelt thanks to TRC for their invaluable support, which has been fundamental to the success of our work. We are also extremely grateful to the [ColossalAI Team](https://github.com/hpcaitech/ColossalAI) for their tremendous support with the PyTorch implementation, especially [Xuanlei Zhao](https://oahzxl.github.io/) and [Wenhao Chen](https://github.com/CWHer), making training and inference of OpenMoE on GPUs a reality.
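As a minimal sketch of how the checkpoints listed in the table above might be used, the snippet below loads one of them through the Hugging Face `transformers` API. The repo id is taken from the table, while `trust_remote_code=True` and the assumption that the checkpoint repo also hosts its tokenizer are not confirmed by this commit; see the GitHub repository linked above for the officially supported inference instructions.

```python
# Hedged usage sketch: load an OpenMoE checkpoint from the table above with
# Hugging Face transformers. trust_remote_code=True assumes the model repo
# ships custom modeling code; adjust per the official GitHub instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OrionZheng/openmoe-base"  # smallest checkpoint, debugging-scale

tokenizer = AutoTokenizer.from_pretrained(model_id)  # assumes tokenizer files live in the same repo
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Mixture-of-experts language models route each token to"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```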
 