Sinus environment: time-based
Six Hump Camel Back environment: time-based
CartPole environment: time-based
I've run the CartPole experiment again with more challenging properties:
- Goal: balance the pole for 1000 steps
- Payoff:
  - +1 for each step the pole stays balanced
  - -1 if the pole's angle exceeds 12 degrees from upright (terminal state)
  - Maximum achievable payoff is 1000, by the goal's definition
- Gaussian noise with standard deviation 0.3 added to both rewards and actions
- See [1] for a more detailed description of the environment properties
- Agent's evaluation time per step: 50 ms
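The noise injection described above can be sketched as a thin environment wrapper. This is a minimal illustration, not the code used in the experiments: the Gym-style `reset()`/`step(action)` interface and the `NoisyEnvWrapper` name are assumptions.

```python
import random

class NoisyEnvWrapper:
    """Adds zero-mean Gaussian noise (sigma = 0.3 by default) to both the
    action passed to the environment and the reward it returns, mirroring
    the perturbed CartPole setup described above."""

    def __init__(self, env, sigma=0.3, rng=None):
        self.env = env          # assumed to expose reset() and step(action)
        self.sigma = sigma      # standard deviation of the Gaussian noise
        self.rng = rng or random.Random(0)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # Perturb the continuous action before passing it on.
        noisy_action = action + self.rng.gauss(0.0, self.sigma)
        obs, reward, done = self.env.step(noisy_action)
        # Perturb the reward seen by the agent.
        noisy_reward = reward + self.rng.gauss(0.0, self.sigma)
        return obs, noisy_reward, done
```

With sigma = 0.3 on a +1 per-step reward, the noise is large relative to the signal, which is what makes this variant harder than the clean environment.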
| Algorithm          | Success rate | Average payoff |
|--------------------|--------------|----------------|
| IRTI + TLS         | 94%          | 971.747        |
| IRTI + HOLOP       | 85%          | 922.047        |
| HOO + TLS          | 0%           | 77.254         |
| HOO + HOLOP        | 3%           | 273.525        |
| Transposition Tree | 65%          | 808.480        |
| MC                 | 9%           | 389.590        |
[1] H. van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Proc. IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), pp. 272–279, 2007.