Erin O.
asked 10/18/23

Nonstationary Bandit (Python Implementation)
I have to implement the Nonstationary Bandit Problem in Python and cannot figure out what's going on with my implementation. Would really appreciate a second set of eyes and someone to walk through my code with me. The problem is specified as follows:
At a particular timestep, the ratio of optimal actions is the number of runs in which the agent selected an action that was actually optimal at that timestep, divided by the total number of runs (300). For example, if at timestep 7913 the agent selected the optimal action in 210 of the 300 runs (70%), the value for timestep 7913 should be 0.7.
Note that each averaged value should be the per-timestep statistic, not a cumulative statistic such as a moving average.
Your output file should contain 4 lines of numbers. The first and second lines contain the average rewards and the ratio of optimal actions produced by the action-value method with sample averages at each step. The third and fourth lines contain the same results for the action-value method with a constant step size. Since we test for 10,000 steps, each row should contain exactly 10,000 numbers. (Hint: use numpy.savetxt(fname, data_array) to write such results.)
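In other words, if the optimal-action flags are collected in a (runs × steps) array, the per-timestep ratio is just a mean over the run axis. A minimal NumPy sketch (the array name is illustrative):

    import numpy as np

    # was_optimal[r, t] is True if run r chose the optimal action at step t
    was_optimal = np.zeros((300, 10_000), dtype=bool)
    # ... filled in during the runs ...
    optimal_ratio = was_optimal.mean(axis=0)  # e.g. 210/300 = 0.7 at step 7913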
1 Expert Answer

Talha B. answered 10/25/23
Python Virtuoso: Uniting Mastery, Passion, and the Essence of Modern C
General Setup
Initialize k bandit arms with true values q*.
Perform N = 300 independent runs of T = 10,000 timesteps each.
Sample Averages
FOR each run
    Initialize true values q* for all arms.
    Initialize action-value estimates Q(a) to 0 for all a.
    Initialize counts N(a) to 0 for all a.
    FOR each timestep t
        IF random() < epsilon THEN
            SELECT a randomly
        ELSE
            SELECT a = argmax(Q(a))
        ENDIF
        RECEIVE reward R from bandit arm a
        UPDATE N(a) = N(a) + 1
        UPDATE Q(a) = Q(a) + (R - Q(a)) / N(a)
        IF a was optimal under the current q* THEN
            UPDATE the optimal action counter for timestep t
        ENDIF
        UPDATE true values q* with random walk
    ENDFOR
ENDFOR
Note that Q(a), N(a), and q* are re-initialized at the start of every run (runs must be independent), and the random walk is applied at the end of every timestep, not once per run.
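Translated into Python, one run of the sample-average method could look like the following. This is a minimal sketch under common assumptions for this exercise (10 arms, epsilon = 0.1, unit-variance Gaussian rewards, random-walk standard deviation 0.01); the function name and parameter values are illustrative, not part of the assignment.

    import numpy as np

    def run_sample_average(n_arms=10, n_steps=10_000, epsilon=0.1,
                           walk_std=0.01, rng=None):
        """One run of epsilon-greedy with sample-average updates on a
        nonstationary bandit. Returns per-step rewards and a per-step
        flag for whether the chosen action was optimal at that step."""
        rng = rng or np.random.default_rng()
        q_true = np.zeros(n_arms)        # true values q* (assumed to start equal)
        Q = np.zeros(n_arms)             # action-value estimates
        N = np.zeros(n_arms)             # action counts
        rewards = np.zeros(n_steps)
        was_optimal = np.zeros(n_steps, dtype=bool)

        for t in range(n_steps):
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.integers(n_arms)
            else:
                a = np.argmax(Q)
            # reward drawn around the current true value of the chosen arm
            R = rng.normal(q_true[a], 1.0)
            N[a] += 1
            Q[a] += (R - Q[a]) / N[a]    # sample-average update
            rewards[t] = R
            # "optimal" is defined by the *current* q*, which drifts over time
            was_optimal[t] = (a == np.argmax(q_true))
            # random walk: q* drifts at the end of every timestep
            q_true += rng.normal(0.0, walk_std, n_arms)

        return rewards, was_optimal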
Constant Step-size
Initialize a constant step-size parameter alpha.
FOR each run
    Initialize true values q* for all arms.
    Initialize action-value estimates Q(a) to 0 for all a.
    FOR each timestep t
        IF random() < epsilon THEN
            SELECT a randomly
        ELSE
            SELECT a = argmax(Q(a))
        ENDIF
        RECEIVE reward R from bandit arm a
        UPDATE Q(a) = Q(a) + alpha * (R - Q(a))
        IF a was optimal under the current q* THEN
            UPDATE the optimal action counter for timestep t
        ENDIF
        UPDATE true values q* with random walk
    ENDFOR
ENDFOR
The counts N(a) are not needed here: the update uses the fixed step size alpha instead of 1/N(a).
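The constant step-size method needs only one change to the loop above: the update rule. A sketch reusing the same illustrative setup (alpha = 0.1 is an assumption, not given in the assignment):

    def run_constant_step(n_arms=10, n_steps=10_000, epsilon=0.1,
                          alpha=0.1, walk_std=0.01, rng=None):
        """Same loop as run_sample_average, but Q(a) is updated with a
        constant step size alpha, so recent rewards weigh more."""
        rng = rng or np.random.default_rng()
        q_true = np.zeros(n_arms)
        Q = np.zeros(n_arms)
        rewards = np.zeros(n_steps)
        was_optimal = np.zeros(n_steps, dtype=bool)

        for t in range(n_steps):
            if rng.random() < epsilon:
                a = rng.integers(n_arms)
            else:
                a = np.argmax(Q)
            R = rng.normal(q_true[a], 1.0)
            Q[a] += alpha * (R - Q[a])   # constant step-size update
            rewards[t] = R
            was_optimal[t] = (a == np.argmax(q_true))
            q_true += rng.normal(0.0, walk_std, n_arms)

        return rewards, was_optimal

Because each update weights the newest reward by a fixed alpha, old rewards decay geometrically. That is why this variant keeps tracking q* as it drifts, while the sample-average estimates respond more and more slowly as N(a) grows.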
Guidelines
NumPy: Use NumPy arrays to store the action values and counts for performance.
Random Walk: Implement a function that updates q* (the true value of each arm) with a random walk at the end of every timestep.
Optimal Action: Record which arm is optimal under the current q* at every timestep so the ratio of optimal actions can be computed.
Output: Accumulate the average rewards and the ratio of optimal actions for each method in separate NumPy arrays, then write them out as specified.
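Putting the guidelines together, the outer loop over runs and the final output file could be assembled like this (a sketch that assumes the two run functions above; the file name is illustrative):

    import numpy as np

    n_runs, n_steps = 300, 10_000
    rng = np.random.default_rng(0)

    results = []
    for run_fn in (run_sample_average, run_constant_step):
        rewards = np.zeros((n_runs, n_steps))
        optimal = np.zeros((n_runs, n_steps))
        for r in range(n_runs):
            rewards[r], optimal[r] = run_fn(n_steps=n_steps, rng=rng)
        # per-timestep averages over the 300 runs (not cumulative)
        results.append(rewards.mean(axis=0))
        results.append(optimal.mean(axis=0))

    # 4 rows x 10,000 columns, as the assignment specifies
    np.savetxt("output.txt", np.array(results))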
Erin O.
There seems to be an issue with my optimal action calculation. Any help would be greatly appreciated!
10/18/23