Friday, May 18, 2012

Feedback 18-05-2012

Results

Sinus environment: time-based


Six Hump Camel Back environment: time-based


CartPole environment: time-based

I've run the CartPole experiment again with more challenging properties (a small sketch of this setup follows the list):
  1. Goal: balance the pole for 1000 steps
  2. Payoff:
    1. +1 for each step that the pole is balanced
    2. -1 if the pole's angle is more than 12 degrees from the upright position (terminal state)
    3. Maximum expected payoff of 1000, following from the goal
  3. Gaussian noise with a standard deviation of 0.3 added to both rewards and actions
  4. See [1] for a more detailed description of the environment properties
  5. Agent's evaluation time per step: 50 ms
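To make the evaluation concrete, here is a minimal sketch of one episode under these rules. The `env`/`agent` interface (`reset`, `step`, `act`) and the `pole_angle` field are hypothetical placeholders, not the code I actually run; it only illustrates points 1-3 above.

```python
import numpy as np

MAX_STEPS = 1000              # goal: balance the pole for 1000 steps
MAX_ANGLE = np.radians(12.0)  # terminal bound on the pole angle
NOISE_STD = 0.3               # Gaussian noise on rewards and actions

def run_episode(env, agent, rng=np.random.default_rng()):
    """Roll out one noisy episode and return the total payoff."""
    state = env.reset()
    total_payoff = 0.0
    for _ in range(MAX_STEPS):
        action = agent.act(state)                       # planned force on the cart
        state = env.step(action + rng.normal(0.0, NOISE_STD))
        failed = abs(state.pole_angle) > MAX_ANGLE
        reward = (-1.0 if failed else 1.0) + rng.normal(0.0, NOISE_STD)
        total_payoff += reward
        if failed:
            break                                       # pole dropped: terminal state
    return total_payoff
```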


Algorithm            Success rate    Average payoff
IRTI + TLS           94%             971.747
IRTI + HOLOP         85%             922.047
HOO + TLS             0%              77.254
HOO + HOLOP           3%             273.525
Transposition Tree   65%             808.480
MC                    9%             389.590


[1] H. Van Hasselt and M. A. Wiering, “Reinforcement learning in continuous action spaces,” in IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), pp. 272–279, 2007.

2 comments:

  1. Dear Colin,

    now that's getting quite interesting. In your balancing experiment, I'm surprised MTL is faster for R but slower for H. Do you have an idea how that comes?

    Maybe you can also compare noise levels in some simple game for the sake of clarity, or also give performance over samples for different noise levels.

    What did you take from the given reference?

    Good work :) Best regards, Michael

  2. Dear Michael,

    Thanks again for the feedback!

    - Indeed, this is also true for the results of a couple of posts back (then with other settings and 1000 ms learning time). Here is my shot at explaining it:
    HMTL (HOO + TLS) is very bad without recall or other modifications: HOO splits at every simulation, which means that at every simulation the meta-tree nodes are dropped from the tree, so the meta tree never grows beyond depth 1. This constant creation and deletion of meta-tree nodes is of course very inefficient. If I let HOO split only every 10th step or so, the number of simulations per step increases significantly (see the sketch below).
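    A rough sketch of that split-interval idea; the node interface (`rollout`, `split`) is hypothetical and only illustrates delaying HOO's splits so the meta tree keeps its nodes between simulations.

```python
SPLIT_INTERVAL = 10   # let HOO split only every 10th simulation (my own choice)

def plan_step(root, env, num_simulations, rollout):
    """Run simulations from the current root, splitting only occasionally."""
    for i in range(num_simulations):
        rollout(root, env)        # ordinary simulation through the meta tree
        if (i + 1) % SPLIT_INTERVAL == 0:
            # Splitting rebuilds the children and drops meta-tree nodes,
            # so we do it far less often than once per simulation.
            root.split()
```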

    - Which simple game do you suggest? For Donut World no actual planning is needed, and Helicopter is a more complex environment than CartPole.

    - From the paper by Van Hasselt and Wiering I took (a rough sketch of how these constants fit the standard cart-pole dynamics follows below):
    - weight of cart = 1.0 kg
    - weight of pole = 0.1 kg
    - length of pole = 1 m
    - duration of one step = 0.1 s
    - payoff function (see point 2 of my post)
    - pole's bounds = -12 to +12 degrees
    - cart's bounds = -1 to +1 m
    - initial deviation of the pole = random number in interval [-0.05, 0.05]
    - (Gaussian) noise level of 0.3
    Two parameters I set myself:
    - Action bounds (maximum force) = 10 N
    - Learning time per step = 5 ms
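    To show how these constants fit together, here is a rough sketch using the classic Barto-style cart-pole equations; the exact dynamics in [1] may differ, and the half-length convention, the point where the noise is injected, and the Euler integration are my own assumptions.

```python
import numpy as np

# Constants from [1] as listed above; the 10 N force bound and the Gaussian
# action noise level are the two parameters I chose myself.
GRAVITY     = 9.81   # m/s^2
CART_MASS   = 1.0    # kg
POLE_MASS   = 0.1    # kg
POLE_LENGTH = 1.0    # m (full length; the equations below use the half-length)
DT          = 0.1    # s, duration of one step
MAX_FORCE   = 10.0   # N, action bound
NOISE_STD   = 0.3    # Gaussian noise on the applied force

def cartpole_step(x, x_dot, theta, theta_dot, force, rng=np.random.default_rng()):
    """One Euler step of the classic cart-pole dynamics (Barto-style equations)."""
    force = np.clip(force, -MAX_FORCE, MAX_FORCE) + rng.normal(0.0, NOISE_STD)
    total_mass = CART_MASS + POLE_MASS
    half_len = POLE_LENGTH / 2.0
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    temp = (force + POLE_MASS * half_len * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        half_len * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass))
    x_acc = temp - POLE_MASS * half_len * theta_acc * cos_t / total_mass
    # Euler integration with the 0.1 s step from [1]
    x, x_dot = x + DT * x_dot, x_dot + DT * x_acc
    theta, theta_dot = theta + DT * theta_dot, theta_dot + DT * theta_acc
    return x, x_dot, theta, theta_dot
```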

    Regards,
    Colin
