Wednesday, May 16, 2012

Feedback 16-05-2012

Work

Below is the planning from my last post, with my comments added per item.
  1. Investigate why the IRTI line is horizontal at the end of the Six Hump Camel Back experiments.
    1. The maximum tree depth was not the cause
    2. It was caused by the fact that the values get very close together at some point, so the exploration value is not able to "fix" random bad choices (at least not within a reasonable number of samples); an illustrative sketch follows after this list.
  2. Implement a visualization to show at which time a split occurs (at which sample number). 
    1. See the two posts with the results of the Sinus and Six Hump Camel Back environments.
    2. The "1-dimensional heatmap" or "vector or colors" representation did not look so nice as the graph representation
  3. Check if it is correct that the RMTL agent has a significant decrease in number of simulations per second when memorization is enabled, as seen in the table of the last post.
    1. This is correct
  4. Think about which parameters are relevant (to mention in the report) and remove the redundant ones.
    1. The posts with the Sinus and Six Hump results have been edited and now only show a reduced set of relevant parameters.
  5. Optimize the multi-step agents
    1. I have been tweaking the multi-step agents a lot.
  6. Think of a graph representation of the multi-step experiments and rerun them under more difficult settings, i.e. less time or the settings from the paper (with noisy rewards, noisy actions, etc).
    1. I tried a different setting for the CartPole which is a bit more difficult [1].
    2. I found out that Vanilla MC is really good and outperforms all the other algorithms; it can balance perfectly every time given only 50 ms per step. The reason is that it can achieve a high number of simulations.
    3. I saw that for greedy pick I used the best sample found and not the best sample of the leaf with the best mean (for the results of the post from 02-05); both variants are sketched after this list. In the first case the agent is comparable with the Vanilla agent and performs pretty well, while in the latter case the agent performs worse and is not able to achieve enough simulations for an accurate approximation.
    4. The Donut World environment has the property/flaw that optimizing (only) the next step is sufficient and no future planning is actually needed.
  7. Implement perfect recall on meta tree level
    1. Implemented a way in which multi-step action sequences with their reward sequences are stored and "replayed" whenever a split occurs, as discussed with Lukas and Andreas; a rough sketch follows after this list. It seems the agents perform better with this option.
  8. Furthermore, I have been writing the report.
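
Regarding item 1: to illustrate why the flat tail is so hard to avoid once the values are nearly equal, below is a small sketch that assumes a generic UCB-style selection over two noisy alternatives. This is only an analogy for the IRTI exploration term, with made-up numbers, not the actual agent code.

```python
# Illustrative analogy (item 1): with a UCB-style rule, two options whose values
# are almost identical take a very large number of samples to tell apart, so an
# early random bad choice is not "fixed" within a reasonable budget.
import math
import random

def pulls_until_separated(mu_better, mu_worse, noise=0.05, c=1.0,
                          max_pulls=200_000, seed=0):
    rng = random.Random(seed)
    mus = [mu_better, mu_worse]
    counts, sums = [0, 0], [0.0, 0.0]
    stable = 0
    for t in range(1, max_pulls + 1):
        # empirical mean + exploration bonus; unvisited options go first
        scores = [sums[i] / counts[i] + c * math.sqrt(math.log(t) / counts[i])
                  if counts[i] > 0 else float("inf") for i in range(2)]
        i = 0 if scores[0] >= scores[1] else 1
        counts[i] += 1
        sums[i] += mus[i] + rng.gauss(0.0, noise)
        means = [sums[j] / counts[j] if counts[j] else 0.0 for j in range(2)]
        stable = stable + 1 if means[0] > means[1] else 0
        if stable >= 1_000:   # better option has been clearly ahead for a while
            return t
    return None               # not separated within the budget

print(pulls_until_separated(0.60, 0.50))    # large gap: separated quickly
print(pulls_until_separated(0.501, 0.500))  # tiny gap: needs far more samples,
                                            # or does not separate at all here
```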
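
Regarding item 2: a minimal sketch of the two representations that were compared, using made-up split sample numbers purely for illustration; this is not the actual visualization code.

```python
# Sketch (item 2): graph representation vs. "1-dimensional heatmap" of split times.
import matplotlib.pyplot as plt
import numpy as np

splits = [120, 340, 800, 1500, 2600, 4100]   # hypothetical sample numbers of splits
total_samples = 5000

fig, (ax_graph, ax_heat) = plt.subplots(2, 1, figsize=(8, 4), sharex=True)

# Graph representation: cumulative number of splits as a step function.
ax_graph.step(splits, np.arange(1, len(splits) + 1), where="post")
ax_graph.set_ylabel("splits so far")

# Vector-of-colors representation: a single row colored by the split count so far.
counts = np.searchsorted(splits, np.arange(total_samples), side="right")
ax_heat.imshow(counts[np.newaxis, :], aspect="auto", cmap="viridis")
ax_heat.set_yticks([])
ax_heat.set_xlabel("sample number")

plt.tight_layout()
plt.show()
```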
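
Regarding item 6.3: the two greedy-pick variants written out as a sketch. The names samples, leaves, mean_reward and best_sample are hypothetical stand-ins for the agent's real data structures.

```python
# Sketch (item 6.3) of the two final-move selection rules that were mixed up.

def pick_best_sample_overall(samples):
    """The action of the single best (action, reward) sample seen anywhere
    in the tree; this is what the 02-05 results actually used."""
    action, _ = max(samples, key=lambda s: s[1])
    return action

def pick_best_sample_of_best_leaf(leaves):
    """The best sample inside the leaf with the highest mean reward; this
    alternative turned out to perform worse in the CartPole setting."""
    best_leaf = max(leaves, key=lambda leaf: leaf.mean_reward)
    action, _ = best_leaf.best_sample
    return action
```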
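
Regarding item 7: a rough sketch of the replay-on-split idea, not the actual implementation; all class and method names are hypothetical.

```python
# Sketch (item 7): stored multi-step (action sequence, reward sequence) pairs are
# "replayed" into the new children whenever a leaf of the meta tree splits, so no
# sampled information is thrown away.

class MetaNode:
    def __init__(self):
        self.episodes = []       # list of (action_sequence, reward_sequence)
        self.children = None     # None while this node is still a leaf
        self.split_rule = None   # callable: action_sequence -> child index

    def record(self, actions, rewards):
        """Store one finished rollout in the leaf it belongs to."""
        node = self
        while node.children is not None:
            node = node.children[node.split_rule(actions)]
        node.episodes.append((actions, rewards))

    def split(self, split_rule, n_children):
        """Split this leaf and replay its stored episodes into the new children."""
        self.split_rule = split_rule
        self.children = [MetaNode() for _ in range(n_children)]
        for actions, rewards in self.episodes:
            self.children[split_rule(actions)].episodes.append((actions, rewards))
        self.episodes = []       # the episodes now live in the children

# Example: split on the sign of the first action and replay the stored episodes.
root = MetaNode()
root.record([0.4, -0.2], [1.0, 0.5])
root.record([-0.7, 0.1], [0.2, 0.0])
root.split(lambda actions: 0 if actions[0] < 0 else 1, n_children=2)
```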
Planning
  1. Report writing
  2. Work on multi-step agents and experiments
  3. Implement Transposition Tree

[1] H. van Hasselt and M. A. Wiering, "Using Continuous Action Spaces to Solve Discrete Problems".

1 comment:

  1. Dear Colin,

    nice polishing of previous results.

    If you want to challenge simple MC and elicit the advantage of trees you should test the environments with additive noise (corrupt the actual reward with some zero-mean noise, e.g., normally distributed).

    Looking forward to the multi-step evaluations. Best regards, Michael
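
A minimal sketch of the suggested reward corruption, assuming an environment object with a step(action) method that returns a reward; the wrapper name and interface are hypothetical, not part of the existing code.

```python
# Wrap an environment so every reward is corrupted with zero-mean Gaussian noise.
import random

class NoisyRewardEnvironment:
    def __init__(self, env, sigma=0.1, seed=None):
        self.env = env            # the environment being wrapped
        self.sigma = sigma        # standard deviation of the additive noise
        self.rng = random.Random(seed)

    def step(self, action):
        reward = self.env.step(action)
        return reward + self.rng.gauss(0.0, self.sigma)
```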
