Erin O.
asked 10/18/23

Nonstationary Bandit (Python Implementation)
I have to implement the Nonstationary Bandit Problem in Python and cannot figure out what's going on with my implementation. Would really appreciate a second set of eyes and someone to walk through my code with me. The problem is specified as follows:
At a particular timestep, the ratio of optimal actions is the number of runs in which the agent selected an action that was actually optimal at that timestep, divided by the total number of runs (300). For example, if at timestep 7913 the agent selected the optimal action in 210 of the 300 runs (70%), the value for timestep 7913 should be 0.7.
Note that each averaged value should be the per-timestep statistic, not a cumulative statistic such as a moving average.
Your output file should contain 4 lines of numbers. The first and second lines contain the average rewards and the ratio of optimal actions produced by the action-value method with sample averages at each step. The third and fourth lines contain the same results for the action-value method with a constant step size. Since we test for 10,000 steps, each row should contain exactly 10,000 numbers. (Hint: use numpy.savetxt(fname, data_array) to write such results.)
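In other words, if the optimal-action flags are collected in a (runs × steps) array, the per-timestep ratio is just a mean over the run axis. A minimal NumPy sketch (the array name is illustrative):

    import numpy as np

    # was_optimal[r, t] is True if run r chose the optimal action at step t
    was_optimal = np.zeros((300, 10_000), dtype=bool)
    # ... filled in during the runs ...
    optimal_ratio = was_optimal.mean(axis=0)  # e.g. 210/300 = 0.7 at step 7913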
1 Expert Answer

Talha B. answered 10/25/23
Python Virtuoso: Uniting Mastery, Passion, and the Essence of Modern C
General Setup
Initialize k bandit arms with true values q*.
Perform N = 300 independent runs of T = 10,000 timesteps each.
Sample Averages
FOR each run
    Initialize true values q* for all arms.
    Initialize action-value estimates Q(a) to 0 for all a.
    Initialize counts N(a) to 0 for all a.
    FOR each timestep t
        IF random() < epsilon THEN
            SELECT a randomly
        ELSE
            SELECT a = argmax(Q(a))
        ENDIF
        RECEIVE reward R from bandit arm a
        UPDATE N(a) = N(a) + 1
        UPDATE Q(a) = Q(a) + (R - Q(a)) / N(a)
        IF a was optimal under the current q* THEN
            UPDATE the optimal action counter for timestep t
        ENDIF
        UPDATE true values q* with random walk
    ENDFOR
ENDFOR
Note that Q(a), N(a), and q* are re-initialized at the start of every run (runs must be independent), and the random walk is applied at the end of every timestep, not once per run.
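Translated into Python, one run of the sample-average method could look like the following. This is a minimal sketch under common assumptions for this exercise (10 arms, epsilon = 0.1, unit-variance Gaussian rewards, random-walk standard deviation 0.01); the function name and parameter values are illustrative, not part of the assignment.

    import numpy as np

    def run_sample_average(n_arms=10, n_steps=10_000, epsilon=0.1,
                           walk_std=0.01, rng=None):
        """One run of epsilon-greedy with sample-average updates on a
        nonstationary bandit. Returns per-step rewards and a per-step
        flag for whether the chosen action was optimal at that step."""
        rng = rng or np.random.default_rng()
        q_true = np.zeros(n_arms)        # true values q* (assumed to start equal)
        Q = np.zeros(n_arms)             # action-value estimates
        N = np.zeros(n_arms)             # action counts
        rewards = np.zeros(n_steps)
        was_optimal = np.zeros(n_steps, dtype=bool)

        for t in range(n_steps):
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.integers(n_arms)
            else:
                a = np.argmax(Q)
            # reward drawn around the current true value of the chosen arm
            R = rng.normal(q_true[a], 1.0)
            N[a] += 1
            Q[a] += (R - Q[a]) / N[a]    # sample-average update
            rewards[t] = R
            # "optimal" is defined by the *current* q*, which drifts over time
            was_optimal[t] = (a == np.argmax(q_true))
            # random walk: q* drifts at the end of every timestep
            q_true += rng.normal(0.0, walk_std, n_arms)

        return rewards, was_optimal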
Constant Step-size
Initialize a constant step-size parameter alpha.
FOR each run
    Initialize true values q* for all arms.
    Initialize action-value estimates Q(a) to 0 for all a.
    FOR each timestep t
        IF random() < epsilon THEN
            SELECT a randomly
        ELSE
            SELECT a = argmax(Q(a))
        ENDIF
        RECEIVE reward R from bandit arm a
        UPDATE Q(a) = Q(a) + alpha * (R - Q(a))
        IF a was optimal under the current q* THEN
            UPDATE the optimal action counter for timestep t
        ENDIF
        UPDATE true values q* with random walk
    ENDFOR
ENDFOR
The counts N(a) are not needed here: the update uses the fixed step size alpha instead of 1/N(a).
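The constant step-size method needs only one change to the loop above: the update rule. A sketch reusing the same illustrative setup (alpha = 0.1 is an assumption, not given in the assignment):

    def run_constant_step(n_arms=10, n_steps=10_000, epsilon=0.1,
                          alpha=0.1, walk_std=0.01, rng=None):
        """Same loop as run_sample_average, but Q(a) is updated with a
        constant step size alpha, so recent rewards weigh more."""
        rng = rng or np.random.default_rng()
        q_true = np.zeros(n_arms)
        Q = np.zeros(n_arms)
        rewards = np.zeros(n_steps)
        was_optimal = np.zeros(n_steps, dtype=bool)

        for t in range(n_steps):
            if rng.random() < epsilon:
                a = rng.integers(n_arms)
            else:
                a = np.argmax(Q)
            R = rng.normal(q_true[a], 1.0)
            Q[a] += alpha * (R - Q[a])   # constant step-size update
            rewards[t] = R
            was_optimal[t] = (a == np.argmax(q_true))
            q_true += rng.normal(0.0, walk_std, n_arms)

        return rewards, was_optimal

Because each update weights the newest reward by a fixed alpha, old rewards decay geometrically. That is why this variant keeps tracking q* as it drifts, while the sample-average estimates respond more and more slowly as N(a) grows.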
Guidelines
NumPy: Use NumPy arrays to store the action values and counts for performance.
Random Walk: Implement a function that updates q* (the true value of each arm) with a random walk at the end of every timestep.
Optimal Action: Record which arm is optimal under the current q* at every timestep so the ratio of optimal actions can be computed.
Output: Accumulate the average rewards and the ratio of optimal actions for each method in separate NumPy arrays, then write them out as specified.
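Putting the guidelines together, the outer loop over runs and the final output file could be assembled like this (a sketch that assumes the two run functions above; the file name is illustrative):

    import numpy as np

    n_runs, n_steps = 300, 10_000
    rng = np.random.default_rng(0)

    results = []
    for run_fn in (run_sample_average, run_constant_step):
        rewards = np.zeros((n_runs, n_steps))
        optimal = np.zeros((n_runs, n_steps))
        for r in range(n_runs):
            rewards[r], optimal[r] = run_fn(n_steps=n_steps, rng=rng)
        # per-timestep averages over the 300 runs (not cumulative)
        results.append(rewards.mean(axis=0))
        results.append(optimal.mean(axis=0))

    # 4 rows x 10,000 columns, as the assignment specifies
    np.savetxt("output.txt", np.array(results))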
Erin O.
There seems to be an issue with my optimal action calculation. Any help would be greatly appreciated!
10/18/23