Friday, April 27, 2012

Feedback 27-04-2012


Experiment Results

I also ran experiments in the Six Hump Camel Back environment. Again, all experiments are done over 1000 samples and are averaged over 1000 runs. For every experiment the 95% confidence intervals are shown as a colored region surrounding the lines.

Incremental Regression Tree Induction (IRTI)
  • MCTS_C = (parentRangeSize / globalRewardVolume) * Config.MCTS_K
  • MCTS_K=1.0
  • IRTI_SPLIT_NR_TESTS=100
  • IRTI_SPLIT_MIN_NR_SAMPLES=75
  • IRTI_SIGNIFICANCE_LEVEL=0.001
  • IRTI_MEMORIZATION = true
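
As a reminder of how this adaptive exploration constant scales, here is a minimal Scala sketch of the first bullet. The RewardStats type and its field names are assumptions for illustration only, and globalRewardVolume is interpreted here as the size of the global reward range; this is not my actual implementation.

  // Minimal sketch of the adaptive exploration constant from the first bullet.
  // The case class and field names are illustrative assumptions, not the real code.
  object IrtiExplorationConstant {
    val MCTS_K = 1.0

    // Reward statistics a node is assumed to keep track of.
    final case class RewardStats(minReward: Double, maxReward: Double) {
      def rangeSize: Double = maxReward - minReward
    }

    // MCTS_C = (parentRangeSize / globalRewardVolume) * MCTS_K
    def mctsC(parent: RewardStats, global: RewardStats): Double =
      (parent.rangeSize / global.rangeSize) * MCTS_K

    def main(args: Array[String]): Unit = {
      val global = RewardStats(minReward = -1.0, maxReward = 5.0)
      val parent = RewardStats(minReward = 0.5, maxReward = 2.0)
      println(f"adaptive C = ${mctsC(parent, global)}%.3f") // prints 0.250 for MCTS_K = 1.0
    }
  }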







Hierarchical Optimistic Optimization (HOO)
  • MCTS_C = (parentActionSpaceVolume / globalActionSpaceVolume) * globalRewardVolume * MCTS_K
  • MCTS_K=0.5
  • HOO_V_1 = (sqrt(nrActionDimensions) / 2) ^ HOO_ALPHA
  • HOO_RHO = 2 ^ (- HOO_ALPHA / nrActionDimensions)
  • HOO_ALPHA=0.99
  • HOO_MEMORIZATION = true
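
The HOO_V_1 and HOO_RHO bullets are fully determined by the action-space dimensionality and HOO_ALPHA; a small Scala sketch of that derivation (the object and function names are mine, the formulas are the ones above):

  // HOO_V_1 and HOO_RHO as functions of the action-space dimensionality and
  // HOO_ALPHA, exactly as in the bullets above. Names are illustrative.
  object HooParameters {
    val HOO_ALPHA = 0.99

    // HOO_V_1 = (sqrt(nrActionDimensions) / 2) ^ HOO_ALPHA
    def v1(nrActionDimensions: Int): Double =
      math.pow(math.sqrt(nrActionDimensions.toDouble) / 2.0, HOO_ALPHA)

    // HOO_RHO = 2 ^ (- HOO_ALPHA / nrActionDimensions)
    def rho(nrActionDimensions: Int): Double =
      math.pow(2.0, -HOO_ALPHA / nrActionDimensions.toDouble)

    def main(args: Array[String]): Unit =
      for (d <- 1 to 3) println(f"d=$d  v1=${v1(d)}%.4f  rho=${rho(d)}%.4f")
  }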

IRTI + HOO + MC + Random + UCT (pre-discretization with 2 splits per depth)
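
For reference, my reading of "pre-discretization with 2 splits per depth" is that each node's action interval is cut into a fixed number of equal sub-intervals per depth, and UCT then treats those sub-intervals as discrete actions. A rough Scala sketch under that assumption (the actual scheme may differ):

  // Rough sketch of pre-discretization for UCT on a one-dimensional action range:
  // each node's interval is cut into `splitsPerDepth` equal sub-intervals, and the
  // interval midpoints serve as the discrete actions at that depth. This scheme is
  // an assumption for illustration; the actual discretization may differ.
  object PreDiscretization {
    final case class Interval(lo: Double, hi: Double) {
      def mid: Double = (lo + hi) / 2.0
    }

    def children(parent: Interval, splitsPerDepth: Int): Seq[Interval] = {
      val width = (parent.hi - parent.lo) / splitsPerDepth
      (0 until splitsPerDepth).map(i =>
        Interval(parent.lo + i * width, parent.lo + (i + 1) * width))
    }

    def main(args: Array[String]): Unit = {
      val root = Interval(-2.0, 2.0)              // example one-dimensional action range
      val depth1 = children(root, splitsPerDepth = 2)
      val depth2 = depth1.flatMap(children(_, 2)) // 4 intervals at depth 2
      println(depth2.map(_.mid).mkString(", "))   // the discrete actions at depth 2
    }
  }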


Planning
  1. Run experiments regarding the multi-step agents and possibly debug / optimize code.

Sunday, April 22, 2012

Feedback 22-04-2012

Experiment Results

Finally some nice results to show after some debugging and parameter tuning, both manual and automatic. All experiments are done over 1,000 samples and are averaged over 10,000 runs. For every experiment the 95% confidence intervals are shown as a colored region surrounding the lines.

Incremental Regression Tree Induction (IRTI)
  • MCTS_C = (childRewardRangeSize / globalRewardRangeSize) * MCTS_K
  • MCTS_K = 2.0
  • IRTI_SPLIT_NR_TESTS = 100
  • IRTI_SPLIT_MIN_NR_SAMPLES = 50
  • IRTI_SIGNIFICANCE_LEVEL = 0.001
  • IRTI_MEMORIZATION = true
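
To make the role of IRTI_SPLIT_NR_TESTS, IRTI_SPLIT_MIN_NR_SAMPLES and IRTI_SIGNIFICANCE_LEVEL concrete, here is a minimal sketch of how a leaf could select a split: only leaves with enough samples are considered, a fixed number of candidate split points is scored (by variance reduction in this sketch), and the best candidate is only accepted if it passes a significance test (left abstract here; the t-test settings are listed in the April 6 entry below). All names and the scoring criterion are assumptions, not my actual implementation.

  // Sketch of candidate-split selection in an IRTI leaf: try a fixed number of
  // random split points, score them by variance reduction, and keep the best one
  // only if the leaf has enough samples and the split passes a significance test.
  // Names and the scoring criterion are illustrative assumptions.
  import scala.util.Random

  object IrtiSplit {
    val IRTI_SPLIT_NR_TESTS = 100
    val IRTI_SPLIT_MIN_NR_SAMPLES = 50
    val IRTI_SIGNIFICANCE_LEVEL = 0.001

    final case class Sample(action: Double, reward: Double)

    private def variance(xs: Seq[Double]): Double = {
      val mean = xs.sum / xs.size
      xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    }

    // Variance reduction achieved by splitting the samples at `splitPoint`.
    private def score(samples: Seq[Sample], splitPoint: Double): Double = {
      val (left, right) = samples.partition(_.action < splitPoint)
      if (left.isEmpty || right.isEmpty) Double.NegativeInfinity
      else {
        val rewards = samples.map(_.reward)
        variance(rewards) -
          (left.size * variance(left.map(_.reward)) +
            right.size * variance(right.map(_.reward))) / samples.size
      }
    }

    // Stub for the statistical test (e.g. a t-test at IRTI_SIGNIFICANCE_LEVEL);
    // always accepting here keeps the sketch short.
    private def passesSignificanceTest(samples: Seq[Sample], splitPoint: Double): Boolean = true

    def bestSplit(samples: Seq[Sample], lo: Double, hi: Double, rng: Random): Option[Double] = {
      if (samples.size < IRTI_SPLIT_MIN_NR_SAMPLES) None
      else {
        val candidates = Seq.fill(IRTI_SPLIT_NR_TESTS)(lo + rng.nextDouble() * (hi - lo))
        val best = candidates.maxBy(score(samples, _))
        if (passesSignificanceTest(samples, best)) Some(best) else None
      }
    }
  }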







Hierarchical Optimistic Optimization (HOO)
  • MCTS_C = (parentActionSpaceVolume / globalActionSpaceVolume) * globalRewardVolume * MCTS_K
  • MCTS_K = 5.0
  • HOO_MEMORIZATION = true
  • HOO_V_1 = (sqrt(nrActionDimensions) / 2) ^ HOO_ALPHA 
  • HOO_RHO = 2 ^ (- HOO_ALPHA / nrActionDimensions)
  • HOO_ALPHA = 0.99
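
The MCTS_C above couples a node's share of the action space to the global reward range; a minimal Scala sketch of that computation (the Box type and all field names are assumptions for illustration):

  // Sketch of the volume-scaled exploration constant used for HOO nodes:
  // MCTS_C = (parentActionSpaceVolume / globalActionSpaceVolume) * globalRewardVolume * MCTS_K
  // The Box type and field names are illustrative assumptions.
  object HooExplorationConstant {
    val MCTS_K = 5.0

    // An axis-aligned region of the (continuous) action space.
    final case class Box(lows: Vector[Double], highs: Vector[Double]) {
      def volume: Double = lows.zip(highs).map { case (l, h) => h - l }.product
    }

    def mctsC(parentActionSpace: Box, globalActionSpace: Box, globalRewardVolume: Double): Double =
      (parentActionSpace.volume / globalActionSpace.volume) * globalRewardVolume * MCTS_K

    def main(args: Array[String]): Unit = {
      val global = Box(Vector(-3.0, -2.0), Vector(3.0, 2.0))   // example 2-D action space
      val parent = Box(Vector(-3.0, -2.0), Vector(0.0, 0.0))   // one quarter of it
      println(mctsC(parent, global, globalRewardVolume = 6.0)) // 0.25 * 6.0 * MCTS_K = 7.5
    }
  }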

IRTI + HOO + MC + Random + UCT (pre-discretization with 25 splits per depth)




Work

Implemented all four multi-step agents; they seem to be able to balance the pole in the Cartpole environment pretty well:
    1. Regression-based Meta Tree Learning agent (RMTL)
    2. Hierarchical Optimistic Meta Tree Learning agent (HOMTL)
    3. Regression-based Sequence Tree Learning agent (RSTL)
    4. Hierarchical Optimistic Sequence Tree Learning agent (HOSTL)
Planning
  1. Find a better parameter set for HOO for the Six Hump Camel Back function, as it is currently barely better than UCT with pre-discretization in that environment. I can then generate the same pictures as the ones above.
  2. Do experiments regarding the multi-step agents and possibly debug / optimize code.

Saturday, April 14, 2012

Meeting 12-04-2012

Work
  1. Fixing, debugging and improving IRTI and HOO. 
  2. Furthermore, a lot of parameter tuning for the above two algorithms.
Action Points
  1. Michael and I had a discussion about the exploration factor
    1. C should be related to the reward range
      1. constant C =  globalRangeSize * K
      2. adaptive C =  parentRangeSize * K
      3. Lukas proposed: adaptive C = (childRangeSize / globalRangeSize) * K
    2. For HOO, you should also multiply the last term (diam) by C or divide the first term (mean reward) by C (see the sketch after this list).
      1. R + C * exploration + C * diam 
      2. R / C + exploration + diam
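
To keep the options side by side, here is a small sketch of the three exploration constants and the two ways of folding C into the HOO score that we discussed; the function names are mine, the formulas are taken from the action points above.

  // The three exploration-constant variants from the discussion, plus the two
  // proposed ways of folding C into the HOO node score. Function names are mine;
  // the formulas follow the action points above.
  object ExplorationVariants {
    // 1. constant C = globalRangeSize * K
    def constantC(globalRangeSize: Double, k: Double): Double =
      globalRangeSize * k

    // 2. adaptive C = parentRangeSize * K
    def adaptiveC(parentRangeSize: Double, k: Double): Double =
      parentRangeSize * k

    // 3. Lukas's proposal: adaptive C = (childRangeSize / globalRangeSize) * K
    def ratioC(childRangeSize: Double, globalRangeSize: Double, k: Double): Double =
      (childRangeSize / globalRangeSize) * k

    // HOO score, option 1: scale both the exploration term and the diameter term by C.
    def hooScoreScaledTerms(meanReward: Double, exploration: Double, diam: Double, c: Double): Double =
      meanReward + c * exploration + c * diam

    // HOO score, option 2: divide the mean reward by C instead.
    def hooScoreScaledReward(meanReward: Double, exploration: Double, diam: Double, c: Double): Double =
      meanReward / c + exploration + diam
  }
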
Planning
  • Change the HOO formula (see action points)
  • Generate reward distributions (learning curve, greedy curve, error) of several algorithms
    • IRTI
    • HOO
    • UCT (pre-discretization)
    • Vanilla MC (random sampling, greedy returns best sample seen)
    • Random
  • I should ask Lukas how his IRTI algorithm works to find out why mine does not converge to sampling at the global maximum.

Friday, April 6, 2012

Feedback 06-04-2012

Work
  1. Looked into the sample Scala code for the parameters used in the experiments of the TLS paper. Unfortunately, I wasn't able to reproduce the same results yet. From the code, it seems the following settings are used (a sketch of the resulting split test is given at the end of this entry):
    1. Non-Adaptive C of 0.5
    2. T-test
    3. minNbSamples = 25
    4. minNbSamplesPerPopulation = 5
    5. significanceThreshold = 0.001
  2. The following four points show experiments with Regression Trees / HOO in combination with a constant / adaptive C in the Sinus environment (averaged over 1,000 tests).
  3. Regression Trees with constant C (= 0.5 * totalRewardRangeSize)




  4. Regression Trees with adaptive C (= 0.5 * parentRewardRangeSize)




  5. HOO with constant C (= 0.5 * totalRewardRangeSize)




  6. HOO with adaptive C (= 0.5 * parentRewardRangeSize)




  7. Regression Trees is not able to sample at the global maximum (as of now). It does not explore and split properly to find the best region.
  8. HOO does a much better job than Regression Trees in the Sinus environment and finds the global maximum most of the time.
  9. Although an adaptive C causes the algorithm to focus more on the promising regions, it also causes it to get stuck in local maxima sometimes. Nevertheless, over multiple tests, the error is lower for HOO with an adaptive C.
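
As a concrete reading of the settings in point 1 above, here is a sketch of a T-test-based split acceptance: the leaf needs at least minNbSamples samples, each side of a candidate split needs at least minNbSamplesPerPopulation, and the split is only accepted if a two-sample t-test on the rewards of the two sides is significant at significanceThreshold. The normal approximation of the p-value and all helper names are my own assumptions, not the TLS code.

  // Sketch of the T-test-based split acceptance suggested by the settings in
  // point 1: enough samples in the leaf, enough samples on each side of the
  // split, and a significant difference in mean reward between the two sides.
  // The normal approximation of the p-value and all helper names are assumptions.
  object TTestSplit {
    val minNbSamples = 25
    val minNbSamplesPerPopulation = 5
    val significanceThreshold = 0.001

    private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

    private def sampleVariance(xs: Seq[Double]): Double = {
      val m = mean(xs)
      xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
    }

    // Abramowitz & Stegun approximation of erf, used for the normal CDF below.
    private def erf(x: Double): Double = {
      val t = 1.0 / (1.0 + 0.3275911 * math.abs(x))
      val y = 1.0 - (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
        - 0.284496736) * t + 0.254829592) * t) * math.exp(-x * x)
      if (x >= 0) y else -y
    }

    private def normalCdf(x: Double): Double = 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

    // Welch's t statistic; the p-value uses a normal approximation, which is
    // only a rough stand-in for the t-distribution at these sample sizes.
    def acceptSplit(leftRewards: Seq[Double], rightRewards: Seq[Double]): Boolean = {
      val total = leftRewards.size + rightRewards.size
      if (total < minNbSamples ||
          leftRewards.size < minNbSamplesPerPopulation ||
          rightRewards.size < minNbSamplesPerPopulation) false
      else {
        val se = math.sqrt(sampleVariance(leftRewards) / leftRewards.size +
          sampleVariance(rightRewards) / rightRewards.size)
        val t = (mean(leftRewards) - mean(rightRewards)) / se
        val pValue = 2.0 * (1.0 - normalCdf(math.abs(t)))
        pValue < significanceThreshold
      }
    }
  }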