CartPole environment: time-based
I've run the CartPole experiment again with more challenging properties:
- Goal: balance the pole for 1000 steps
- Payoff:
- For each step that the pole is balanced: +1
- If the pole's angle is > 12 degrees from upright position: -1 (and terminal state)
- Maximum expected payoff of 1000 due to the goal's statement
- Added Gaussian noise with standard deviation of 0.3 to both rewards and actions
- See [1] for more detailed description of the environment properties
- Agent's evaluation time per step: 50 ms
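The payoff and termination rule above can be sketched as follows. This is a minimal illustration, assuming the -12/+12 degree bound and the 0.3 noise level from the list; the function name `step_payoff` is my own.

```python
import math
import random

ANGLE_LIMIT = math.radians(12.0)  # pole must stay within 12 degrees of upright
NOISE_STD = 0.3                   # Gaussian noise added to each reward

def step_payoff(pole_angle, rng=random):
    """Noisy per-step payoff: +1 while balanced, -1 (terminal) past the limit."""
    terminal = abs(pole_angle) > ANGLE_LIMIT
    reward = -1.0 if terminal else 1.0
    return reward + rng.gauss(0.0, NOISE_STD), terminal
```

With a maximum of 1000 steps and +1 per balanced step, this gives the maximum expected payoff of 1000 stated above.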
Algorithm          | Success rate | Average payoff
IRTI + TLS         | 94%          | 971.747
IRTI + HOLOP       | 85%          | 922.047
HOO + TLS          |  0%          |  77.254
HOO + HOLOP        |  3%          | 273.525
Transposition Tree | 65%          | 808.480
MC                 |  9%          | 389.590
[1] H. van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Proceedings of the IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), pp. 272–279, 2007.
Dear Colin,
Now that's getting quite interesting. In your balancing experiment, I'm surprised MTL is faster for R but slower for H. Do you have an idea why that is?
Maybe you can also compare noise levels in some simple game for the sake of clarity, or also give performance over samples for different noise levels.
What did you take from the given reference?
Good work :) Best regards, Michael
Dear Michael,
Thanks again for the feedback!
- Indeed, this also holds for the results of a couple of posts back (then with other settings and 1000 ms learning time). Here is my shot at explaining it:
HMTL (HOO + TLS) performs very badly without recall or other modifications. HOO splits on every simulation, which means that on every simulation the meta-tree nodes are dropped from the tree, so the meta tree never grows beyond depth 1. This constant creation and deletion of meta-tree nodes is of course very inefficient. If I let HOO split only at every 10th step or so, the number of simulations per step increases significantly.
- Which simple game do you suggest? For Donut World no actual planning is needed, and Helicopter is a more complex environment than CartPole.
- From the paper by Van Hasselt and Wiering [1] I took:
- mass of cart = 1.0 kg
- mass of pole = 0.1 kg
- length of pole = 1 m
- duration of one step = 0.1 s
- payoff function (see point 2 of my post)
- pole's bounds = -12 to +12 degrees
- cart's bounds = -1 to +1 m
- initial deviation of the pole = random number in interval [-0.05, 0.05]
- (Gaussian) noise level of 0.3
Two parameters I set myself:
- Action bounds (maximum force) = 10 N
- Learning time per step = 5 ms
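Putting the parameters above together, one step of the environment could look like the sketch below. The equations of motion are my assumption (the classic cart-pole dynamics of Barto et al., which use the pole's half-length, integrated with Euler steps), as is g = 9.81; the post and the paper only fix the parameters listed above.

```python
import math
import random

# Parameters as listed in the post; the dynamics equations are assumed.
CART_MASS, POLE_MASS = 1.0, 0.1   # kg
POLE_LENGTH = 1.0                 # m (the classic equations use the half-length)
DT = 0.1                          # s, duration of one step
MAX_FORCE = 10.0                  # N, action bound
NOISE_STD = 0.3                   # Gaussian noise on the applied action
GRAVITY = 9.81                    # m/s^2 (assumed value)

def cartpole_step(x, x_dot, theta, theta_dot, force, rng=random):
    """Advance the cart-pole one Euler step with a noisy, clipped force."""
    force = max(-MAX_FORCE, min(MAX_FORCE, force)) + rng.gauss(0.0, NOISE_STD)
    total_mass = CART_MASS + POLE_MASS
    half_len = POLE_LENGTH / 2.0
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + POLE_MASS * half_len * theta_dot**2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        half_len * (4.0 / 3.0 - POLE_MASS * cos_t**2 / total_mass))
    x_acc = temp - POLE_MASS * half_len * theta_acc * cos_t / total_mass
    # Euler integration with the 0.1 s time step
    x += DT * x_dot
    x_dot += DT * x_acc
    theta += DT * theta_dot
    theta_dot += DT * theta_acc
    return x, x_dot, theta, theta_dot

def initial_state(rng=random):
    """Pole starts with a small random deviation in [-0.05, 0.05]."""
    return 0.0, 0.0, rng.uniform(-0.05, 0.05), 0.0
```

An episode then terminates when the pole leaves the -12/+12 degree bounds or the cart leaves the -1/+1 m bounds (see point 2 of the post).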
Regards,
Colin