steventrouble committed · Commit f2ab607 · Parent: 41d902a

Update README.md
README.md CHANGED
@@ -45,4 +45,29 @@ Note that `env_steps` can differ from `train_steps` because the model can
 continue fine-tuning using its replay buffer. In the paper, the last 20k
 epochs are done in this manner. This isn't necessary outside of benchmarks,
 and in theory better performance should be attainable by getting more samples
-from the env.
+from the env.
+
+---
+
+## Findings
+
+Our primary goal in this project was to test EfficientZero and explore its capabilities.
+We were amazed by the model overall, especially on Breakout, where it far outperformed
+the human baseline. The overall cost was only about $50 per fully trained model, compared
+to the hundreds of thousands of dollars needed to train MuZero.
+
+Though the trained models achieved impressive scores in Atari, they didn't reach the
+stellar scores demonstrated in the paper. This could be because we used different hardware
+and dependencies, or because ML research papers tend to cherry-pick models and environments
+to showcase good results.
+
+Additionally, the models tended to hit a performance wall between 75k and 100k steps. While we
+don't have enough data to know why or how often this happens, it's not surprising: the model
+was tuned specifically for data efficiency, so it hasn't been tested at larger scales. A
+model like MuZero might be more appropriate if you have a large budget.
+
+Training times seemed longer than those reported in the EfficientZero paper. The paper
+stated that a model could be trained to completion in 7 hours, while in practice we found
+that it takes an A100 with 32 cores 1 to 2 days to train a model to completion. This
+is likely because the training process uses more CPU than that of other models and does not
+perform well on the low-frequency, many-core CPUs found in GPU clusters.
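The hunk above keys on one technical detail: `env_steps` can stay fixed while `train_steps` grows, because training can continue from the replay buffer alone. As a rough illustration only, here is a minimal sketch of that pattern; `TinyModel`, `finetune_from_buffer`, and the tuple-based buffer are hypothetical stand-ins, not EfficientZero's actual API.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Placeholder network standing in for the real EfficientZero model."""
    def __init__(self, obs_dim: int = 4):
        super().__init__()
        self.value_head = nn.Linear(obs_dim, 1)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.value_head(obs)

def finetune_from_buffer(model, buffer, steps=20_000, batch_size=256, lr=1e-4):
    """Keep taking gradient steps from stored transitions only, so
    train_steps grows while env_steps stays fixed (hypothetical sketch)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    obs, targets = buffer  # (N, obs_dim) observations, (N, 1) value targets
    for _ in range(steps):
        idx = torch.randint(0, obs.shape[0], (batch_size,))  # resample stored data
        loss = nn.functional.mse_loss(model(obs[idx]), targets[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy buffer of 10k previously collected transitions; no new env interaction.
buffer = (torch.randn(10_000, 4), torch.randn(10_000, 1))
finetune_from_buffer(TinyModel(), buffer, steps=100)
```

The README notes that the paper runs the last 20k epochs this way; the sketch shows only that control flow, not the real EfficientZero loss terms.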