steventrouble committed · Commit f2ab607 · Parent: 41d902a

Update README.md
README.md CHANGED
@@ -45,4 +45,29 @@ Note that `env_steps` can differ from `train_steps` because the model can
 continue fine-tuning using its replay buffer. In the paper, the last 20k
 epochs are done in this manner. This isn't necessary outside of benchmarks,
 and in theory better performance should be attainable by getting more samples
-from the env.
+from the env.
+
+---
+
+## Findings
+
+Our primary goal in this project was to test EfficientZero and explore its capabilities.
+We were amazed by the model overall, especially on Breakout, where it far outperformed
+the human baseline. The overall cost was only about $50 per fully trained model, compared
+to the hundreds of thousands of dollars needed to train MuZero.
+
+Though the trained models achieved impressive scores in Atari, they didn't reach the
+stellar scores demonstrated in the paper. This could be because we used different hardware
+and dependencies, or because ML research papers tend to cherry-pick models and environments
+to showcase good results.
+
+Additionally, the models tended to hit a performance wall between 75k and 100k steps. While we
+don't have enough data to know why or how often this happens, it's not surprising: the model
+was tuned specifically for data efficiency, so it hasn't been tested at larger scales. A
+model like MuZero might be more appropriate if you have a large budget.
+
+Training times seemed longer than those reported in the EfficientZero paper. The paper
+stated that a model could be trained to completion in 7 hours, while in practice we found
+that it takes an A100 with 32 cores 1 to 2 days to train a model to completion. This
+is likely because the training process uses more CPU than that of other models and does not
+perform well on the low-frequency, many-core CPUs found in GPU clusters.
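The hunk above keys on one technical detail: `env_steps` can stay fixed while `train_steps` grows, because training can continue from the replay buffer alone. As a rough illustration only, here is a minimal sketch of that pattern; `TinyModel`, `finetune_from_buffer`, and the tuple-based buffer are hypothetical stand-ins, not EfficientZero's actual API.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Placeholder network standing in for the real EfficientZero model."""
    def __init__(self, obs_dim: int = 4):
        super().__init__()
        self.value_head = nn.Linear(obs_dim, 1)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.value_head(obs)

def finetune_from_buffer(model, buffer, steps=20_000, batch_size=256, lr=1e-4):
    """Keep taking gradient steps from stored transitions only, so
    train_steps grows while env_steps stays fixed (hypothetical sketch)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    obs, targets = buffer  # (N, obs_dim) observations, (N, 1) value targets
    for _ in range(steps):
        idx = torch.randint(0, obs.shape[0], (batch_size,))  # resample stored data
        loss = nn.functional.mse_loss(model(obs[idx]), targets[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy buffer of 10k previously collected transitions; no new env interaction.
buffer = (torch.randn(10_000, 4), torch.randn(10_000, 1))
finetune_from_buffer(TinyModel(), buffer, steps=100)
```

The README notes that the paper runs the last 20k epochs this way; the sketch shows only that control flow, not the real EfficientZero loss terms.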