Friday, May 18, 2012

Feedback 18-05-2012

Results

Sinus environment: time-based


Six Hump Camel Back environment: time-based


CartPole environment: time-based

I've run the CartPole experiment again with more challenging properties (a small sketch of this setup follows the list):
  1. Goal: balance the pole for 1000 steps
  2. Payoff:
    1. +1 for each step that the pole is balanced
    2. -1 if the pole's angle is more than 12 degrees from the upright position (terminal state)
    3. Maximum expected payoff of 1000, following from the goal
  3. Gaussian noise with a standard deviation of 0.3 added to both rewards and actions
  4. See [1] for a more detailed description of the environment properties
  5. Agent's evaluation time per step: 50 ms
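To make the evaluation concrete, here is a minimal sketch of one episode under these rules. The `env`/`agent` interface (`reset`, `step`, `act`) and the `pole_angle` field are hypothetical placeholders, not the code I actually run; it only illustrates points 1-3 above.

```python
import numpy as np

MAX_STEPS = 1000              # goal: balance the pole for 1000 steps
MAX_ANGLE = np.radians(12.0)  # terminal bound on the pole angle
NOISE_STD = 0.3               # Gaussian noise on rewards and actions

def run_episode(env, agent, rng=np.random.default_rng()):
    """Roll out one noisy episode and return the total payoff."""
    state = env.reset()
    total_payoff = 0.0
    for _ in range(MAX_STEPS):
        action = agent.act(state)                       # planned force on the cart
        state = env.step(action + rng.normal(0.0, NOISE_STD))
        failed = abs(state.pole_angle) > MAX_ANGLE
        reward = (-1.0 if failed else 1.0) + rng.normal(0.0, NOISE_STD)
        total_payoff += reward
        if failed:
            break                                       # pole dropped: terminal state
    return total_payoff
```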


Algorithm            Success rate    Average payoff
IRTI + TLS           94%             971.747
IRTI + HOLOP         85%             922.047
HOO + TLS             0%              77.254
HOO + HOLOP           3%             273.525
Transposition Tree   65%             808.480
MC                    9%             389.590


[1] H. Van Hasselt and M. A. Wiering, “Reinforcement learning in continuous action spaces,” in IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), pp. 272–279, 2007.

2 comments:

  1. Dear Colin,

    now that's getting quite interesting. In your balancing experiment, I'm surprised MTL is faster for R but slower for H. Do you have an idea how that comes?

    Maybe you can also compare noise levels in some simple game for the sake of clarity, or also give performance over samples for different noise levels.

    What did you take from the given reference?

    Good work :) Best regards, Michael

  2. Dear Michael,

    Thanks again for the feedback!

    - Indeed, this is also true for the results of a couple of posts back (then with other settings and 1000 ms learning time). Here is my shot at explaining it:
    HMTL (HOO + TLS) is very bad without recall or other modifications: HOO splits at every simulation, which means that at every simulation the meta-tree nodes are dropped from the tree, so the meta tree never grows beyond depth 1. This constant creation and deletion of meta-tree nodes is of course very inefficient. If I let HOO split only every 10th step or so, the number of simulations per step increases significantly (see the sketch below).
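    A rough sketch of that split-interval idea; the node interface (`rollout`, `split`) is hypothetical and only illustrates delaying HOO's splits so the meta tree keeps its nodes between simulations.

```python
SPLIT_INTERVAL = 10   # let HOO split only every 10th simulation (my own choice)

def plan_step(root, env, num_simulations, rollout):
    """Run simulations from the current root, splitting only occasionally."""
    for i in range(num_simulations):
        rollout(root, env)        # ordinary simulation through the meta tree
        if (i + 1) % SPLIT_INTERVAL == 0:
            # Splitting rebuilds the children and drops meta-tree nodes,
            # so we do it far less often than once per simulation.
            root.split()
```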

    - Which simple game do you suggest? For Donut World no actual planning is needed, and Helicopter is a more complex environment than CartPole.

    - From the paper by Van Hasselt and Wiering I took (a rough sketch of how these constants fit the standard cart-pole dynamics follows below):
    - weight of cart = 1.0 kg
    - weight of pole = 0.1 kg
    - length of pole = 1 m
    - duration of one step = 0.1 s
    - payoff function (see point 2 of my post)
    - pole's bounds = -12 to +12 degrees
    - cart's bounds = -1 to +1 m
    - initial deviation of the pole = random number in interval [-0.05, 0.05]
    - (Gaussian) noise level of 0.3
    Two parameters I set myself:
    - Action bounds (maximum force) = 10 N
    - Learning time per step = 5 ms
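    To show how these constants fit together, here is a rough sketch using the classic Barto-style cart-pole equations; the exact dynamics in [1] may differ, and the half-length convention, the point where the noise is injected, and the Euler integration are my own assumptions.

```python
import numpy as np

# Constants from [1] as listed above; the 10 N force bound and the Gaussian
# action noise level are the two parameters I chose myself.
GRAVITY     = 9.81   # m/s^2
CART_MASS   = 1.0    # kg
POLE_MASS   = 0.1    # kg
POLE_LENGTH = 1.0    # m (full length; the equations below use the half-length)
DT          = 0.1    # s, duration of one step
MAX_FORCE   = 10.0   # N, action bound
NOISE_STD   = 0.3    # Gaussian noise on the applied force

def cartpole_step(x, x_dot, theta, theta_dot, force, rng=np.random.default_rng()):
    """One Euler step of the classic cart-pole dynamics (Barto-style equations)."""
    force = np.clip(force, -MAX_FORCE, MAX_FORCE) + rng.normal(0.0, NOISE_STD)
    total_mass = CART_MASS + POLE_MASS
    half_len = POLE_LENGTH / 2.0
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    temp = (force + POLE_MASS * half_len * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        half_len * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass))
    x_acc = temp - POLE_MASS * half_len * theta_acc * cos_t / total_mass
    # Euler integration with the 0.1 s step from [1]
    x, x_dot = x + DT * x_dot, x_dot + DT * x_acc
    theta, theta_dot = theta + DT * theta_dot, theta_dot + DT * theta_acc
    return x, x_dot, theta, theta_dot
```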

    Regards,
    Colin
