gabehubner committed
Commit 569299e · 1 Parent(s): bdd2eba
.DS_Store ADDED
Binary file (8.2 kB).
 
README.md CHANGED
@@ -11,3 +11,47 @@ license: apache-2.0
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # Continuous Lunar Lander Environment with DDPG
+
+ Reinforcement Learning for Continuous Action Spaces and Continuous Observation Spaces
+
+ https://github.com/ghubnerr/continuous_lunar_lander/assets/91924667/961fc1da-3fc0-47b7-9fac-c073d2354cd5
+
+ ## Observations
+
+ - [x] The agent had much more angular control, but struggled to pivot to the sides; it may have been rewarded more for staying upright than for heading toward the landing zone.
+ - [x] The reward function was probably not penalizing the agent enough for landing outside of the landing zone.
+ - [x] Rather, the agent preferred to make a successful landing (touch its legs to the ground) over positioning itself correctly in the zone.
+ - [x] Continuous control (**DDPG**) gave it a better grip, and the movement was not as loose as in its [Discrete version](https://github.com/ghubnerr/lunar_lander).
+
+ ## Continuous Action Space
+
+ If `continuous=True` is passed, continuous actions (corresponding to the throttle of the engines) will be used and the action space will be `Box(-1, +1, (2,), dtype=np.float32)`. The first coordinate of an action determines the throttle of the main engine, while the second coordinate specifies the throttle of the lateral boosters. Given an action `np.array([main, lateral])`, the main engine is turned off completely if `main < 0`, and the throttle scales affinely from 50% to 100% for `0 <= main <= 1` (in particular, the main engine does not work with less than 50% power). Similarly, if `-0.5 < lateral < 0.5`, the lateral boosters will not fire at all. If `lateral < -0.5`, the left booster fires, and if `lateral > 0.5`, the right booster fires. Again, the throttle scales affinely from 50% to 100% between -1 and -0.5 (and 0.5 and 1, respectively).
+ Documentation: [Gymnasium](https://gymnasium.farama.org/environments/box2d/lunar_lander/)
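+
+ To make the throttle rule concrete, here is a minimal sketch (illustrative only, not part of this repo; `throttle_from_action` is a hypothetical helper) that reproduces the mapping described above for a sampled action:
+
+ ```python
+ import numpy as np
+ import gymnasium as gym
+
+ def throttle_from_action(action):
+     """Map a Box(-1, +1, (2,)) action to engine throttles per the rule above."""
+     main, lateral = np.clip(action, -1.0, 1.0)
+     # Main engine: off for main < 0, otherwise scales affinely from 50% to 100%.
+     main_throttle = 0.0 if main < 0 else 0.5 + 0.5 * main
+     # Lateral boosters: dead zone in (-0.5, 0.5), then 50% to 100% toward either side.
+     if lateral > 0.5:
+         booster = ("right", 0.5 + (lateral - 0.5))
+     elif lateral < -0.5:
+         booster = ("left", 0.5 + (-lateral - 0.5))
+     else:
+         booster = ("off", 0.0)
+     return main_throttle, booster
+
+ env = gym.make("LunarLander-v2", continuous=True)
+ print(throttle_from_action(env.action_space.sample()))
+ ```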
+
+ ## Usage and Packages
+
+ `pip install torch gymnasium 'gymnasium[box2d]'`
+
+ You might need to install Box2D separately. It requires SWIG, which generates the Python bindings for the C/C++ code that Box2D is written in:
+
+ `brew install swig`
+
+ `pip install box2d`
+
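+ As a quick sanity check after installing (a sketch, assuming the packages above), constructing the environment should fail with a Box2D dependency error if the install did not succeed:
+
+ ```python
+ import gymnasium as gym
+
+ # Raises a dependency error at make() time if box2d/swig are missing.
+ env = gym.make("LunarLander-v2", continuous=True)
+ obs, info = env.reset(seed=0)
+ print(obs.shape)  # (8,) -- the 8-dimensional observation vector
+ env.close()
+ ```
+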
+ ## Average Score: 164.38 (a significant improvement over the discrete action space)
+
+ For each step, the reward:
+
+ - is increased/decreased the closer/further the lander is to the landing pad.
+ - is increased/decreased the slower/faster the lander is moving.
+ - is decreased the more the lander is tilted (angle not horizontal).
+ - is increased by 10 points for each leg that is in contact with the ground.
+ - is decreased by 0.03 points each frame a side engine is firing.
+ - is decreased by 0.3 points each frame the main engine is firing.
+
+ The episode receives an additional reward of -100 or +100 points for crashing or landing safely, respectively. An episode is considered a solution if it scores at least 200 points.
+
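+ As a small illustration (a sketch; `episode_solved` is a hypothetical helper, not code from this repo), this mirrors the 200-point rule above together with the 100-game average that the training loop prints:
+
+ ```python
+ import numpy as np
+
+ def episode_solved(score: float) -> bool:
+     """An episode counts as a solution if it scores at least 200 points."""
+     return score >= 200.0
+
+ score_history = [150.2, 210.5, 164.38]
+ print(episode_solved(score_history[-1]))  # False
+ print("100 game average %.2f" % np.mean(score_history[-100:]))  # same metric train.py logs
+ ```
+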
+ ## `train()` and `load_trained()`
+
+ The `load_trained()` function loads a pre-trained model that went through 1000 episodes of training, while `train()` trains from scratch. You can choose which function runs via the command-line argument passed to `main.py`. If you set `render_mode=None`, training runs a lot faster.
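+
+ A usage sketch based on the `main.py` and `train.py` in this commit (equivalently, run `python main.py train`, `python main.py load-trained`, or `python main.py attribute`):
+
+ ```python
+ from train import TrainingLoop
+
+ loop = TrainingLoop()
+ loop.load_trained()                                    # watch the saved agent (render_mode="human")
+ # loop.train()                                         # train from scratch (render_mode=None, much faster)
+ # loop.explain_trained(option="2", num_iterations=10)  # Integrated Gradients attributions via Captum
+ ```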
__pycache__/ddpg.cpython-310.pyc ADDED
Binary file (9.45 kB).
 
__pycache__/ddpg.cpython-311.pyc ADDED
Binary file (22 kB).
 
ddpg.py ADDED
@@ -0,0 +1,279 @@
+ import os
+ import torch as T
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import torch.optim as optim
+ import numpy as np
+ from captum.attr import (IntegratedGradients, LayerConductance, NeuronAttribution)
+
+ class OUActionNoise(object):  # Ornstein-Uhlenbeck process -> temporally correlated noise
+     def __init__(self, mu, sigma=0.15, theta=0.2, dt=1e-2, x0=None):
+         self.theta = theta
+         self.mu = mu
+         self.sigma = sigma
+         self.dt = dt
+         self.x0 = x0
+         self.reset()
+
+     def __call__(self):
+         x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + self.sigma*np.sqrt(self.dt)*np.random.normal(size=self.mu.shape)
+         self.x_prev = x
+         return x
+
+     def reset(self):
+         self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)
+
+ class ReplayBuffer(object):
+     def __init__(self, max_size, input_shape, n_actions):
+         self.mem_size = max_size
+         self.mem_cntr = 0
+         self.state_memory = np.zeros((self.mem_size, *input_shape))
+         self.new_state_memory = np.zeros((self.mem_size, *input_shape))
+         self.action_memory = np.zeros((self.mem_size, n_actions))
+         self.reward_memory = np.zeros(self.mem_size)
+         self.terminal_memory = np.zeros(self.mem_size, dtype=np.float32)
+
+     def store_transition(self, state, action, reward, state_, done):
+         index = self.mem_cntr % self.mem_size  # index of the memory
+
+         self.state_memory[index] = state
+         self.action_memory[index] = action
+         self.reward_memory[index] = reward
+         self.new_state_memory[index] = state_
+         self.terminal_memory[index] = 1 - done  # 1 for non-terminal states, 0 for terminal ones
+         self.mem_cntr += 1
+
+     def sample_buffer(self, batch_size):
+         max_mem = min(self.mem_cntr, self.mem_size)  # if memory is not full, use mem_cntr
+         batch = np.random.choice(max_mem, batch_size)
+
+         states = self.state_memory[batch]
+         actions = self.action_memory[batch]
+         rewards = self.reward_memory[batch]
+         new_states = self.new_state_memory[batch]
+         terminal = self.terminal_memory[batch]
+
+         return states, actions, rewards, new_states, terminal
+
+ class CriticNetwork(nn.Module):
+     def __init__(self, beta, input_dims, fc1_dims, fc2_dims, n_actions, name, chkpt_dir="tmp/ddpg"):
+         super(CriticNetwork, self).__init__()
+         self.input_dims = input_dims
+         self.fc1_dims = fc1_dims
+         self.fc2_dims = fc2_dims
+         self.n_actions = n_actions
+         self.checkpoint_dir = chkpt_dir
+         self.checkpoint_file = os.path.join(self.checkpoint_dir, name+'_ddpg')
+         self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims)
+         f1 = 1./np.sqrt(self.fc1.weight.data.size()[0])
+         T.nn.init.uniform_(self.fc1.weight.data, -f1, f1)
+         T.nn.init.uniform_(self.fc1.bias.data, -f1, f1)
+         self.bn1 = nn.LayerNorm(self.fc1_dims)
+         self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
+         f2 = 1./np.sqrt(self.fc2.weight.data.size()[0])
+         T.nn.init.uniform_(self.fc2.weight.data, -f2, f2)
+         T.nn.init.uniform_(self.fc2.bias.data, -f2, f2)
+         self.bn2 = nn.LayerNorm(self.fc2_dims)
+
+         self.action_value = nn.Linear(self.n_actions, self.fc2_dims)
+         f3 = 0.003  # From paper
+         self.q = nn.Linear(self.fc2_dims, 1)
+         T.nn.init.uniform_(self.q.weight.data, -f3, f3)
+         T.nn.init.uniform_(self.q.bias.data, -f3, f3)
+
+         self.optimizer = optim.Adam(self.parameters(), lr=beta, weight_decay=0.01)
+         self.device = T.device("cpu")
+
+         self.to(self.device)
+
+     def forward(self, state, action):
+         state_value = self.fc1(state)
+         state_value = self.bn1(state_value)
+         state_value = F.relu(state_value)
+         state_value = self.fc2(state_value)
+         state_value = self.bn2(state_value)
+
+         action_value = F.relu(self.action_value(action))
+         state_action_value = F.relu(T.add(state_value, action_value))
+
+         state_action_value = self.q(state_action_value)
+
+         return state_action_value
+
+     def save_checkpoint(self):
+         print('... saving checkpoint ...')
+         T.save(self.state_dict(), self.checkpoint_file)
+
+     def load_checkpoint(self):
+         print('... loading checkpoint ...')
+         self.load_state_dict(T.load(self.checkpoint_file))
+
+ class ActorNetwork(nn.Module):
+     def __init__(self, alpha, input_dims, fc1_dims, fc2_dims, n_actions, name, chkpt_dir="tmp/ddpg"):
+         super(ActorNetwork, self).__init__()
+         self.input_dims = input_dims
+         self.fc1_dims = fc1_dims
+         self.fc2_dims = fc2_dims
+         self.n_actions = n_actions
+         self.checkpoint_dir = chkpt_dir
+         self.checkpoint_file = os.path.join(self.checkpoint_dir, name+'_ddpg')
+
+         self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims)
+         f1 = 1./np.sqrt(self.fc1.weight.data.size()[0])
+         T.nn.init.uniform_(self.fc1.weight.data, -f1, f1)
+         T.nn.init.uniform_(self.fc1.bias.data, -f1, f1)
+         self.bn1 = nn.LayerNorm(self.fc1_dims)
+
+         self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
+         f2 = 1./np.sqrt(self.fc2.weight.data.size()[0])
+         T.nn.init.uniform_(self.fc2.weight.data, -f2, f2)
+         T.nn.init.uniform_(self.fc2.bias.data, -f2, f2)
+         self.bn2 = nn.LayerNorm(self.fc2_dims)
+
+         f3 = 0.003  # From paper
+         self.mu = nn.Linear(self.fc2_dims, self.n_actions)
+         T.nn.init.uniform_(self.mu.weight.data, -f3, f3)
+         T.nn.init.uniform_(self.mu.bias.data, -f3, f3)
+
+         self.optimizer = optim.Adam(self.parameters(), lr=alpha)
+         self.device = T.device("cpu")
+         self.to(self.device)
+
+     def forward(self, state):
+         print(f"State in forward function: {state.shape=}")
+         x = self.fc1(state)
+         x = self.bn1(x)
+         x = F.relu(x)
+         x = self.fc2(x)
+         x = self.bn2(x)
+         x = F.relu(x)
+         x = T.tanh(self.mu(x))
+
+         return x
+
+     def save_checkpoint(self):
+         print('... saving checkpoint ...')
+         T.save(self.state_dict(), self.checkpoint_file)
+
+     def load_checkpoint(self):
+         print('... loading checkpoint ...')
+         self.load_state_dict(T.load(self.checkpoint_file))
+
+ class Agent(object):
+     def __init__(self, alpha, beta, input_dims, tau, env, gamma=0.99, n_actions=2, max_size=1000000, layer1_size=400, layer2_size=300, batch_size=64):
+         self.gamma = gamma
+         self.tau = tau
+         self.batch_size = batch_size
+         self.memory = ReplayBuffer(max_size, input_dims, n_actions)
+
+         self.actor = ActorNetwork(alpha, input_dims, layer1_size, layer2_size, n_actions=n_actions, name="actor")
+         self.critic = CriticNetwork(beta, input_dims, layer1_size, layer2_size, n_actions=n_actions, name="critic")
+
+         self.target_actor = ActorNetwork(alpha, input_dims, layer1_size, layer2_size, n_actions=n_actions, name="target_actor")
+         self.target_critic = CriticNetwork(beta, input_dims, layer1_size, layer2_size, n_actions=n_actions, name="target_critic")
+
+         self.noise = OUActionNoise(mu=np.zeros(n_actions))
+
+         self.update_network_parameters(tau=1)
+
+     def choose_action(self, observation, attribution: IntegratedGradients = None, baseline: np.ndarray = None):
+         self.actor.eval()
+         observation = T.tensor(observation, dtype=T.float).to(self.actor.device)
+         print(f"Observation: {observation.shape=}")
+         mu = self.actor(observation).to(self.actor.device)
+
+         if attribution is not None:
+             if baseline is None:
+                 baseline = T.zeros(observation.shape)
+             attributions = attribution.attribute(observation, baselines=baseline, target=0)
+             print('Attributions:', attributions)
+
+         mu_prime = mu + T.tensor(self.noise(), dtype=T.float).to(self.actor.device)
+         self.actor.train()
+         return mu_prime.cpu().detach().numpy()
+
+     def remember(self, state, action, reward, new_state, done):
+         self.memory.store_transition(state, action, reward, new_state, done)
+
+     def learn(self):
+         if self.memory.mem_cntr < self.batch_size:
+             return
+         state, action, reward, new_state, done = self.memory.sample_buffer(self.batch_size)
+         reward = T.tensor(reward, dtype=T.float).to(self.critic.device)
+         done = T.tensor(done).to(self.critic.device)
+         new_state = T.tensor(new_state, dtype=T.float).to(self.critic.device)
+         action = T.tensor(action, dtype=T.float).to(self.critic.device)
+         state = T.tensor(state, dtype=T.float).to(self.critic.device)
+
+         self.target_actor.eval()
+         self.target_critic.eval()
+         self.critic.eval()
+
+         target_actions = self.target_actor.forward(new_state)
+         critic_value_ = self.target_critic.forward(new_state, target_actions)
+         critic_value = self.critic.forward(state, action)
+
+         target = []
+         for j in range(self.batch_size):
+             target.append(reward[j] + self.gamma*critic_value_[j]*done[j])
+         target = T.tensor(target).to(self.critic.device)
+         target = target.view(self.batch_size, 1)
+
+         self.critic.train()
+         self.critic.optimizer.zero_grad()
+         critic_loss = F.mse_loss(target, critic_value)
+         critic_loss.backward()
+         self.critic.optimizer.step()
+         self.critic.eval()
+
+         self.actor.optimizer.zero_grad()
+         mu = self.actor.forward(state)
+         self.actor.train()
+         actor_loss = -self.critic.forward(state, mu)
+         actor_loss = T.mean(actor_loss)
+         actor_loss.backward()
+         self.actor.optimizer.step()
+
+         self.update_network_parameters()
+
+     def update_network_parameters(self, tau=None):
+         # Soft update: target <- tau * online + (1 - tau) * target
+         if tau is None:
+             tau = self.tau
+
+         actor_params = self.actor.named_parameters()
+         critic_params = self.critic.named_parameters()
+         target_actor_params = self.target_actor.named_parameters()
+         target_critic_params = self.target_critic.named_parameters()
+
+         critic_state_dict = dict(critic_params)
+         actor_state_dict = dict(actor_params)
+         target_critic_state_dict = dict(target_critic_params)
+         target_actor_state_dict = dict(target_actor_params)
+
+         for name in critic_state_dict:
+             critic_state_dict[name] = tau*critic_state_dict[name].clone() + (1-tau)*target_critic_state_dict[name].clone()
+
+         self.target_critic.load_state_dict(critic_state_dict)
+
+         for name in actor_state_dict:
+             actor_state_dict[name] = tau*actor_state_dict[name].clone() + (1-tau)*target_actor_state_dict[name].clone()
+
+         self.target_actor.load_state_dict(actor_state_dict)
+
+     def save_models(self):
+         self.actor.save_checkpoint()
+         self.target_actor.save_checkpoint()
+         self.critic.save_checkpoint()
+         self.target_critic.save_checkpoint()
+
+     def load_models(self):
+         self.actor.load_checkpoint()
+         self.target_actor.load_checkpoint()
+         self.critic.load_checkpoint()
+         self.target_critic.load_checkpoint()
+
main.py ADDED
@@ -0,0 +1,22 @@
+ from ddpg import Agent
+ import gymnasium as gym
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import torch
+ import argparse
+ from train import TrainingLoop
+ from captum.attr import (IntegratedGradients, LayerConductance, NeuronAttribution)
+
+ training_loop = TrainingLoop()
+
+ parser = argparse.ArgumentParser(description="Choose a function to run.")
+ parser.add_argument("function", choices=["train", "load-trained", "attribute"], help="The function to run.")
+
+ args = parser.parse_args()
+
+ if args.function == "train":
+     training_loop.train()
+ elif args.function == "load-trained":
+     training_loop.load_trained()
+ elif args.function == "attribute":
+     training_loop.explain_trained(option="2", num_iterations=10)
tmp/ddpg/actor_ddpg ADDED
Binary file (510 kB).
 
tmp/ddpg/critic_ddpg ADDED
Binary file (512 kB).
 
tmp/ddpg/target_actor_ddpg ADDED
Binary file (510 kB).
 
tmp/ddpg/target_critic_ddpg ADDED
Binary file (513 kB).
 
train.py ADDED
@@ -0,0 +1,146 @@
+ from ddpg import Agent
+ import gymnasium as gym
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import torch
+ import argparse
+ from captum.attr import (IntegratedGradients)
+
+
+ class TrainingLoop:
+     def __init__(self):
+         pass
+
+     def train(self):
+         env = gym.make(
+             "LunarLander-v2",
+             continuous=True,
+             gravity=-10.0,
+             render_mode=None
+         )
+
+         agent = Agent(alpha=0.000025, beta=0.00025, input_dims=[8], tau=0.001, env=env, batch_size=64, layer1_size=400, layer2_size=300, n_actions=4)
+         agent.load_models()
+
+         np.random.seed(0)
+         score_history = []
+
+         for i in range(1000):
+             done = False
+             score = 0
+             obs, _ = env.reset()
+             while not done:
+                 act = agent.choose_action(obs)
+                 new_state, reward, terminated, truncated, info = env.step(act)
+                 done = terminated or truncated
+                 agent.remember(obs, act, reward, new_state, int(done))
+                 agent.learn()
+                 score += reward
+                 obs = new_state
+
+             score_history.append(score)
+             print("episode", i, "score %.2f" % score, "100 game average %.2f" % np.mean(score_history[-100:]))
+             if i % 25 == 0:
+                 agent.save_models()
+
+     def load_trained(self):
+         env = gym.make(
+             "LunarLanderContinuous-v2",
+             render_mode="human"
+         )
+
+         agent = Agent(alpha=0.000025, beta=0.00025, input_dims=[8], tau=0.001, env=env, batch_size=64, layer1_size=400, layer2_size=300, n_actions=4)
+         agent.load_models()
+
+         np.random.seed(0)
+         score_history = []
+
+         for i in range(50):
+             done = False
+             score = 0
+             obs, _ = env.reset()
+
+             while not done:
+                 act = agent.choose_action(obs)
+                 new_state, reward, terminated, truncated, info = env.step(act)
+                 done = terminated or truncated
+                 score += reward
+                 obs = new_state
+
+             score_history.append(score)
+             print("episode", i, "score %.2f" % score, "100 game average %.2f" % np.mean(score_history[-100:]))
+
+     # Model Explainability
+
+     from captum.attr import (IntegratedGradients)
+
+     def _collect_running_baseline_average(self, num_iterations: int) -> torch.Tensor:
+         env = gym.make(
+             "LunarLanderContinuous-v2",
+             render_mode=None
+         )
+
+         agent = Agent(alpha=0.000025, beta=0.00025, input_dims=[8], tau=0.001, env=env, batch_size=64, layer1_size=400, layer2_size=300, n_actions=4)
+         agent.load_models()
+
+         torch.manual_seed(0)
+
+         sum_obs = torch.zeros(8)
+
+         for i in range(num_iterations):
+             done = False
+             score = 0
+             obs, _ = env.reset()
+
+             sum_obs += obs
+             print(f"Baseline on iteration #{i}: {obs}")
+
+             while not done:
+                 act = agent.choose_action(obs, attribution=None, baseline=None)
+                 new_state, reward, terminated, truncated, info = env.step(act)
+                 done = terminated or truncated
+                 score += reward
+                 obs = new_state
+
+         return sum_obs / num_iterations
+
+     def explain_trained(self, option: str, num_iterations: int = 10) -> None:
+         baseline_options = {
+             "1": torch.zeros(8),
+             "2": self._collect_running_baseline_average(num_iterations),
+         }
+
+         baseline = baseline_options[option]
+
+         env = gym.make(
+             "LunarLanderContinuous-v2",
+             render_mode="human"
+         )
+
+         agent = Agent(alpha=0.000025, beta=0.00025, input_dims=[8], tau=0.001, env=env, batch_size=64, layer1_size=400, layer2_size=300, n_actions=4)
+
+         agent.load_models()
+
+         ig = IntegratedGradients(agent.actor)
+
+         np.random.seed(0)
+         score_history = []
+
+         for i in range(50):
+             done = False
+             score = 0
+             obs, _ = env.reset()
+             while not done:
+                 act = agent.choose_action(obs, attribution=ig, baseline=baseline)
+                 new_state, reward, terminated, truncated, info = env.step(act)
+                 done = terminated or truncated
+                 score += reward
+                 obs = new_state
+
+             score_history.append(score)
+             print("episode", i, "score %.2f" % score, "100 game average %.2f" % np.mean(score_history[-100:]))