Erin O.

asked • 10/18/23

Nonstationary Bandit (Python Implementation)

I have to implement the nonstationary bandit problem in Python and cannot figure out what is wrong with my implementation. I would really appreciate a second set of eyes and someone to walk through my code with me. The problem is specified as follows:

At a particular step, the ratio of optimal actions is calculated as the number of runs in which the agent took the action that was actually optimal at that timestep, divided by the total number of runs, which is 300. Therefore, if at timestep 7913 the agent selected the optimal action in 210 out of the 300 runs (70%), the value for timestep 7913 should be 0.7.

Note that each averaged value should represent the number for that single timestep rather than a cumulative statistic, such as a moving average.
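To make the per-timestep definition concrete, here is a minimal sketch of how the ratio could be computed once all runs are finished. The array names `chosen` and `optimal` and the helper `optimal_action_ratio` are hypothetical, not part of the assignment. The sketch assumes that `optimal[r, t]` was recorded from the true action values in effect at step `t` of run `r`, since in a nonstationary problem the best arm can change from step to step.

```python
import numpy as np

def optimal_action_ratio(chosen, optimal):
    """Per-timestep fraction of runs that picked the optimal action.

    chosen, optimal: integer arrays of shape (n_runs, n_steps)
    (hypothetical layout: one row per run, one column per timestep).
    """
    chosen = np.asarray(chosen)
    optimal = np.asarray(optimal)
    hits = chosen == optimal      # True where the run chose the optimal arm
    return hits.mean(axis=0)      # average over runs, NOT over time

# Tiny illustration: 4 runs, 3 timesteps.
chosen = np.array([[0, 1, 2],
                   [0, 1, 1],
                   [2, 1, 2],
                   [0, 0, 2]])
optimal = np.array([[0, 1, 2],
                    [0, 2, 2],
                    [0, 1, 2],
                    [0, 1, 2]])
ratio = optimal_action_ratio(chosen, optimal)
```

Averaging along `axis=0` collapses the runs and leaves one value per timestep, which matches the "210 out of 300 gives 0.7" example. Nothing here accumulates across timesteps, so the result is not a running average.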

Your output file should contain 4 lines of numbers. The first and second lines represent the average rewards and the ratio of optimal actions generated by the action-value method using sample averages at each step. The third and fourth lines represent the same results for the action-value method using a constant step-size. Since we test it for 10,000 steps, each row should contain exactly 10,000 numbers. (Hint: use numpy.savetxt(fname, data_array) to write such results.)
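The output format described above could be produced roughly as follows. This is a sketch only: the function name `write_results` and its four argument names are made up for illustration, and each argument is assumed to be a 1-D array with one value per timestep. `numpy.savetxt` writes each row of a 2-D array on its own line, so stacking the four series into a `(4, n_steps)` array gives four lines.

```python
import numpy as np

def write_results(fname, avg_reward_sa, opt_ratio_sa, avg_reward_cs, opt_ratio_cs):
    """Write four per-timestep series as four lines of text.

    Row order follows the assignment text:
      1. average reward, sample averages
      2. optimal-action ratio, sample averages
      3. average reward, constant step-size
      4. optimal-action ratio, constant step-size
    """
    rows = np.vstack([avg_reward_sa, opt_ratio_sa, avg_reward_cs, opt_ratio_cs])
    np.savetxt(fname, rows)   # one row per line, space-separated by default
    return rows.shape
```

With 10,000 steps, the returned shape should be `(4, 10000)`. Checking it before submitting is a cheap way to catch a transposed array, which would instead produce 10,000 lines of 4 numbers.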

Erin O.

There seems to be an issue with my optimal action calculation. Any help would be greatly appreciated!

10/18/23

1 Expert Answer

Talha B. answered • 10/25/23

