Friday, April 27, 2012

Feedback 27-04-2012


Experiment Results

I also ran experiments in the Six Hump Camel Back environment. Again, all experiments are done over 1000 samples and are averaged over 1000 runs. For every experiment the 95% confidence intervals are shown as a colored region surrounding the lines.

Incremental Regression Tree Induction (IRTI)
  • MCTS_C = (parentRangeSize / globalRewardVolume) * Config.MCTS_K
  • MCTS_K=1.0
  • IRTI_SPLIT_NR_TESTS=100
  • IRTI_SPLIT_MIN_NR_SAMPLES=75
  • IRTI_SIGNIFICANCE_LEVEL=0.001
  • IRTI_MEMORIZATION = true
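
As a reminder of how this adaptive exploration constant scales, here is a minimal Scala sketch of the first bullet. The RewardStats type and its field names are assumptions for illustration only, and globalRewardVolume is interpreted here as the size of the global reward range; this is not my actual implementation.

  // Minimal sketch of the adaptive exploration constant from the first bullet.
  // The case class and field names are illustrative assumptions, not the real code.
  object IrtiExplorationConstant {
    val MCTS_K = 1.0

    // Reward statistics a node is assumed to keep track of.
    final case class RewardStats(minReward: Double, maxReward: Double) {
      def rangeSize: Double = maxReward - minReward
    }

    // MCTS_C = (parentRangeSize / globalRewardVolume) * MCTS_K
    def mctsC(parent: RewardStats, global: RewardStats): Double =
      (parent.rangeSize / global.rangeSize) * MCTS_K

    def main(args: Array[String]): Unit = {
      val global = RewardStats(minReward = -1.0, maxReward = 5.0)
      val parent = RewardStats(minReward = 0.5, maxReward = 2.0)
      println(f"adaptive C = ${mctsC(parent, global)}%.3f") // prints 0.250 for MCTS_K = 1.0
    }
  }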







Hierarchical Optimistic Optimization (HOO)
  • MCTS_C = (parentActionSpaceVolume / globalActionSpaceVolume) * globalRewardVolume * MCTS_K
  • MCTS_K=0.5
  • HOO_V_1 = (sqrt(nrActionDimensions) / 2) ^ HOO_ALPHA
  • HOO_RHO = 2 ^ (- HOO_ALPHA / nrActionDimensions)
  • HOO_ALPHA=0.99
  • HOO_MEMORIZATION = true
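
The HOO_V_1 and HOO_RHO bullets are fully determined by the action-space dimensionality and HOO_ALPHA; a small Scala sketch of that derivation (the object and function names are mine, the formulas are the ones above):

  // HOO_V_1 and HOO_RHO as functions of the action-space dimensionality and
  // HOO_ALPHA, exactly as in the bullets above. Names are illustrative.
  object HooParameters {
    val HOO_ALPHA = 0.99

    // HOO_V_1 = (sqrt(nrActionDimensions) / 2) ^ HOO_ALPHA
    def v1(nrActionDimensions: Int): Double =
      math.pow(math.sqrt(nrActionDimensions.toDouble) / 2.0, HOO_ALPHA)

    // HOO_RHO = 2 ^ (- HOO_ALPHA / nrActionDimensions)
    def rho(nrActionDimensions: Int): Double =
      math.pow(2.0, -HOO_ALPHA / nrActionDimensions.toDouble)

    def main(args: Array[String]): Unit =
      for (d <- 1 to 3) println(f"d=$d  v1=${v1(d)}%.4f  rho=${rho(d)}%.4f")
  }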

IRTI + HOO + MC + Random + UCT (pre-discretization with 2 splits per depth)
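
For reference, my reading of "pre-discretization with 2 splits per depth" is that each node's action interval is cut into a fixed number of equal sub-intervals per depth, and UCT then treats those sub-intervals as discrete actions. A rough Scala sketch under that assumption (the actual scheme may differ):

  // Rough sketch of pre-discretization for UCT on a one-dimensional action range:
  // each node's interval is cut into `splitsPerDepth` equal sub-intervals, and the
  // interval midpoints serve as the discrete actions at that depth. This scheme is
  // an assumption for illustration; the actual discretization may differ.
  object PreDiscretization {
    final case class Interval(lo: Double, hi: Double) {
      def mid: Double = (lo + hi) / 2.0
    }

    def children(parent: Interval, splitsPerDepth: Int): Seq[Interval] = {
      val width = (parent.hi - parent.lo) / splitsPerDepth
      (0 until splitsPerDepth).map(i =>
        Interval(parent.lo + i * width, parent.lo + (i + 1) * width))
    }

    def main(args: Array[String]): Unit = {
      val root = Interval(-2.0, 2.0)              // example one-dimensional action range
      val depth1 = children(root, splitsPerDepth = 2)
      val depth2 = depth1.flatMap(children(_, 2)) // 4 intervals at depth 2
      println(depth2.map(_.mid).mkString(", "))   // the discrete actions at depth 2
    }
  }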


Planning
  1. Run experiments regarding the multi-step agents and possibly debug / optimize code.

Sunday, April 22, 2012

Feedback 22-04-2012

Experiment Results

Finally some nice results to show after some debugging and parameter tuning, both manual and automatic. All experiments are done over 1,000 samples and are averaged over 10,000 runs. For every experiment the 95% confidence intervals are shown as a colored region surrounding the lines.

Incremental Regression Tree Induction (IRTI)
  • MCTS_C = (childRewardRangeSize / globalRewardRangeSize) * MCTS_K
  • MCTS_K = 2.0
  • IRTI_SPLIT_NR_TESTS = 100
  • IRTI_SPLIT_MIN_NR_SAMPLES = 50
  • IRTI_SIGNIFICANCE_LEVEL = 0.001
  • IRTI_MEMORIZATION = true
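
To make the role of IRTI_SPLIT_NR_TESTS, IRTI_SPLIT_MIN_NR_SAMPLES and IRTI_SIGNIFICANCE_LEVEL concrete, here is a minimal sketch of how a leaf could select a split: only leaves with enough samples are considered, a fixed number of candidate split points is scored (by variance reduction in this sketch), and the best candidate is only accepted if it passes a significance test (left abstract here; the t-test settings are listed in the April 6 entry below). All names and the scoring criterion are assumptions, not my actual implementation.

  // Sketch of candidate-split selection in an IRTI leaf: try a fixed number of
  // random split points, score them by variance reduction, and keep the best one
  // only if the leaf has enough samples and the split passes a significance test.
  // Names and the scoring criterion are illustrative assumptions.
  import scala.util.Random

  object IrtiSplit {
    val IRTI_SPLIT_NR_TESTS = 100
    val IRTI_SPLIT_MIN_NR_SAMPLES = 50
    val IRTI_SIGNIFICANCE_LEVEL = 0.001

    final case class Sample(action: Double, reward: Double)

    private def variance(xs: Seq[Double]): Double = {
      val mean = xs.sum / xs.size
      xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    }

    // Variance reduction achieved by splitting the samples at `splitPoint`.
    private def score(samples: Seq[Sample], splitPoint: Double): Double = {
      val (left, right) = samples.partition(_.action < splitPoint)
      if (left.isEmpty || right.isEmpty) Double.NegativeInfinity
      else {
        val rewards = samples.map(_.reward)
        variance(rewards) -
          (left.size * variance(left.map(_.reward)) +
            right.size * variance(right.map(_.reward))) / samples.size
      }
    }

    // Stub for the statistical test (e.g. a t-test at IRTI_SIGNIFICANCE_LEVEL);
    // always accepting here keeps the sketch short.
    private def passesSignificanceTest(samples: Seq[Sample], splitPoint: Double): Boolean = true

    def bestSplit(samples: Seq[Sample], lo: Double, hi: Double, rng: Random): Option[Double] = {
      if (samples.size < IRTI_SPLIT_MIN_NR_SAMPLES) None
      else {
        val candidates = Seq.fill(IRTI_SPLIT_NR_TESTS)(lo + rng.nextDouble() * (hi - lo))
        val best = candidates.maxBy(score(samples, _))
        if (passesSignificanceTest(samples, best)) Some(best) else None
      }
    }
  }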







Hierarchical Optimistic Optimization (HOO)
  • MCTS_C = (parentActionSpaceVolume / globalActionSpaceVolume) * globalRewardVolume * MCTS_K
  • MCTS_K = 5.0
  • HOO_MEMORIZATION = true
  • HOO_V_1 = (sqrt(nrActionDimensions) / 2) ^ HOO_ALPHA 
  • HOO_RHO = 2 ^ (- HOO_ALPHA / nrActionDimensions)
  • HOO_ALPHA = 0.99
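
The MCTS_C above couples a node's share of the action space to the global reward range; a minimal Scala sketch of that computation (the Box type and all field names are assumptions for illustration):

  // Sketch of the volume-scaled exploration constant used for HOO nodes:
  // MCTS_C = (parentActionSpaceVolume / globalActionSpaceVolume) * globalRewardVolume * MCTS_K
  // The Box type and field names are illustrative assumptions.
  object HooExplorationConstant {
    val MCTS_K = 5.0

    // An axis-aligned region of the (continuous) action space.
    final case class Box(lows: Vector[Double], highs: Vector[Double]) {
      def volume: Double = lows.zip(highs).map { case (l, h) => h - l }.product
    }

    def mctsC(parentActionSpace: Box, globalActionSpace: Box, globalRewardVolume: Double): Double =
      (parentActionSpace.volume / globalActionSpace.volume) * globalRewardVolume * MCTS_K

    def main(args: Array[String]): Unit = {
      val global = Box(Vector(-3.0, -2.0), Vector(3.0, 2.0))   // example 2-D action space
      val parent = Box(Vector(-3.0, -2.0), Vector(0.0, 0.0))   // one quarter of it
      println(mctsC(parent, global, globalRewardVolume = 6.0)) // 0.25 * 6.0 * MCTS_K = 7.5
    }
  }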

IRTI + HOO + MC + Random + UCT (pre-discretization with 25 splits per depth)




Work

Implemented all four multi-step agents; they seem to be able to balance the pole in the Cartpole environment pretty well:
    1. Regression-based Meta Tree Learning agent (RMTL)
    2. Hierarchical Optimistic Meta Tree Learning agent (HOMTL)
    3. Regression-based Sequence Tree Learning agent (RSTL)
    4. Hierarchical Optimistic Sequence Tree Learning agent (HOSTL)
Planning
  1. Find a better parameter set for HOO for the Six Hump Camel Back function, as it is currently barely better than UCT with pre-discretization in that environment. I can then generate the same pictures as the ones above.
  2. Do experiments regarding the multi-step agents and possibly debug / optimize code.

Saturday, April 14, 2012

Meeting 12-04-2012

Work
  1. Fixing, debugging and improving IRTI and HOO. 
  2. Furthermore, a lot of parameter tuning for the above two algorithms.
Action Points
  1. Michael and I had a discussion about the exploration factor
    1. C should be related to the reward range
      1. constant C =  globalRangeSize * K
      2. adaptive C =  parentRangeSize * K
      3. Lukas proposed: adaptive C = (childRangeSize / globalRangeSize) * K
    2. For HOO, you should also multiply the last term (diam) by C or divide the first term (mean reward) by C (see the sketch after this list).
      1. R + C * exploration + C * diam 
      2. R / C + exploration + diam
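
To keep the options side by side, here is a small sketch of the three exploration constants and the two ways of folding C into the HOO score that we discussed; the function names are mine, the formulas are taken from the action points above.

  // The three exploration-constant variants from the discussion, plus the two
  // proposed ways of folding C into the HOO node score. Function names are mine;
  // the formulas follow the action points above.
  object ExplorationVariants {
    // 1. constant C = globalRangeSize * K
    def constantC(globalRangeSize: Double, k: Double): Double =
      globalRangeSize * k

    // 2. adaptive C = parentRangeSize * K
    def adaptiveC(parentRangeSize: Double, k: Double): Double =
      parentRangeSize * k

    // 3. Lukas's proposal: adaptive C = (childRangeSize / globalRangeSize) * K
    def ratioC(childRangeSize: Double, globalRangeSize: Double, k: Double): Double =
      (childRangeSize / globalRangeSize) * k

    // HOO score, option 1: scale both the exploration term and the diameter term by C.
    def hooScoreScaledTerms(meanReward: Double, exploration: Double, diam: Double, c: Double): Double =
      meanReward + c * exploration + c * diam

    // HOO score, option 2: divide the mean reward by C instead.
    def hooScoreScaledReward(meanReward: Double, exploration: Double, diam: Double, c: Double): Double =
      meanReward / c + exploration + diam
  }
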
Planning
  • Change the HOO formula (see action points)
  • Generate reward distributions (learning curve, greedy curve, error) of several algorithms
    • IRTI
    • HOO
    • UCT (pre-discretization)
    • Vanilla MC (random sampling, greedy returns best sample seen)
    • Random
  • I should ask Lukas how his IRTI algorithm works to find out why mine does not converge to sampling at the global maximum.

Friday, April 6, 2012

Feedback 06-04-2012

Work
  1. Looked into the sample Scala code for the parameters used in the experiments of the TLS paper. Unfortunately, I wasn't able to reproduce the same results yet. From the code, it seems the following settings are used (a sketch of the resulting split test is given at the end of this entry):
    1. Non-Adaptive C of 0.5
    2. T-test
    3. minNbSamples = 25
    4. minNbSamplesPerPopulation = 5
    5. significanceThreshold = 0.001
  2. The following four points show experiments with Regression Trees / HOO in combination with a constant / adaptive C in the Sinus environment (averaged over 1,000 tests).
  3. Regression Trees with constant C (= 0.5 * totalRewardRangeSize)




  4. Regression Trees with adaptive C (= 0.5 * parentRewardRangeSize)




  5. HOO with constant C (= 0.5 * totalRewardRangeSize)




  6. HOO with adaptive C (= 0.5 * parentRewardRangeSize)




  7. Regression Trees is not able to sample at the global maximum (as of now). It does not explore and split properly to find the best region.
  8. HOO does a much better job than Regression Trees in the Sinus environment and finds the global maximum most of the time.
  9. Although an adaptive C causes the algorithm to focus more on the promising regions, it also causes it to get stuck in local maxima sometimes. Nevertheless, over multiple tests, the error is lower for HOO with an adaptive C.
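
As a concrete reading of the settings in point 1 above, here is a sketch of a T-test-based split acceptance: the leaf needs at least minNbSamples samples, each side of a candidate split needs at least minNbSamplesPerPopulation, and the split is only accepted if a two-sample t-test on the rewards of the two sides is significant at significanceThreshold. The normal approximation of the p-value and all helper names are my own assumptions, not the TLS code.

  // Sketch of the T-test-based split acceptance suggested by the settings in
  // point 1: enough samples in the leaf, enough samples on each side of the
  // split, and a significant difference in mean reward between the two sides.
  // The normal approximation of the p-value and all helper names are assumptions.
  object TTestSplit {
    val minNbSamples = 25
    val minNbSamplesPerPopulation = 5
    val significanceThreshold = 0.001

    private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

    private def sampleVariance(xs: Seq[Double]): Double = {
      val m = mean(xs)
      xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
    }

    // Abramowitz & Stegun approximation of erf, used for the normal CDF below.
    private def erf(x: Double): Double = {
      val t = 1.0 / (1.0 + 0.3275911 * math.abs(x))
      val y = 1.0 - (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
        - 0.284496736) * t + 0.254829592) * t) * math.exp(-x * x)
      if (x >= 0) y else -y
    }

    private def normalCdf(x: Double): Double = 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

    // Welch's t statistic; the p-value uses a normal approximation, which is
    // only a rough stand-in for the t-distribution at these sample sizes.
    def acceptSplit(leftRewards: Seq[Double], rightRewards: Seq[Double]): Boolean = {
      val total = leftRewards.size + rightRewards.size
      if (total < minNbSamples ||
          leftRewards.size < minNbSamplesPerPopulation ||
          rightRewards.size < minNbSamplesPerPopulation) false
      else {
        val se = math.sqrt(sampleVariance(leftRewards) / leftRewards.size +
          sampleVariance(rightRewards) / rightRewards.size)
        val t = (mean(leftRewards) - mean(rightRewards)) / se
        val pValue = 2.0 * (1.0 - normalCdf(math.abs(t)))
        pValue < significanceThreshold
      }
    }
  }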