Wednesday, May 16, 2012

Feedback 16-05-2012

Work

Below is the planning from my last post, with my comments added per item.
  1. Investigate why the IRTI line is horizontal at the end of the Six Hump Camel Back experiments.
    1. The maximum tree depth was not the cause
    2. It was caused by the fact that the values get very close together at some point, so the exploration value is not able to "fix" random bad choices (at least not within a reasonable number of samples); an illustrative sketch follows after this list.
  2. Implement a visualization to show at which time a split occurs (at which sample number). 
    1. See the two posts with the results of the Sinus and Six Hump Camel Back environments.
    2. The "1-dimensional heatmap" or "vector or colors" representation did not look so nice as the graph representation
  3. Check if it is correct that the RMTL agent has a significant decrease in number of simulations per second when memorization is enabled, as seen in the table of the last post.
    1. This is correct
  4. Think about which parameters are relevant (to mention in the report) and remove the redundant ones.
    1. The posts with the Sinus and Six Hump results have been edited and now only show a reduced set of relevant parameters.
  5. Optimize the multi-step agents
    1. I have been tweaking the multi-step agents a lot.
  6. Think of a graph representation of the multi-step experiments and rerun them under more difficult settings, i.e. less time or the settings from the paper (with noisy rewards, noisy actions, etc).
    1. I tried a different setting for the CartPole which is a bit more difficult [1].
    2. I found out that Vanilla MC is really good and outperforms all the other algorithms; it can balance perfectly every time given only 50 ms per step. The reason is that it can achieve a high number of simulations.
    3. I saw that for greedy pick I used the best sample found and not the best sample of the leaf with the best mean (for the results of the post from 02-05); both variants are sketched after this list. In the first case the agent is comparable with the Vanilla agent and performs pretty well, while in the latter case the agent performs worse and is not able to achieve enough simulations for an accurate approximation.
    4. The Donut World environment has the property/flaw that optimizing (only) the next step is sufficient and no future planning is actually needed.
  7. Implement perfect recall on meta tree level
    1. Implemented a way in which multi-step action sequences with their reward sequences are stored and "replayed" whenever a split occurs, as discussed with Lukas and Andreas; a rough sketch follows after this list. It seems the agents perform better with this option.
  8. Furthermore, I have been writing the report.
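
Regarding item 1: to illustrate why the flat tail is so hard to avoid once the values are nearly equal, below is a small sketch that assumes a generic UCB-style selection over two noisy alternatives. This is only an analogy for the IRTI exploration term, with made-up numbers, not the actual agent code.

```python
# Illustrative analogy (item 1): with a UCB-style rule, two options whose values
# are almost identical take a very large number of samples to tell apart, so an
# early random bad choice is not "fixed" within a reasonable budget.
import math
import random

def pulls_until_separated(mu_better, mu_worse, noise=0.05, c=1.0,
                          max_pulls=200_000, seed=0):
    rng = random.Random(seed)
    mus = [mu_better, mu_worse]
    counts, sums = [0, 0], [0.0, 0.0]
    stable = 0
    for t in range(1, max_pulls + 1):
        # empirical mean + exploration bonus; unvisited options go first
        scores = [sums[i] / counts[i] + c * math.sqrt(math.log(t) / counts[i])
                  if counts[i] > 0 else float("inf") for i in range(2)]
        i = 0 if scores[0] >= scores[1] else 1
        counts[i] += 1
        sums[i] += mus[i] + rng.gauss(0.0, noise)
        means = [sums[j] / counts[j] if counts[j] else 0.0 for j in range(2)]
        stable = stable + 1 if means[0] > means[1] else 0
        if stable >= 1_000:   # better option has been clearly ahead for a while
            return t
    return None               # not separated within the budget

print(pulls_until_separated(0.60, 0.50))    # large gap: separated quickly
print(pulls_until_separated(0.501, 0.500))  # tiny gap: needs far more samples,
                                            # or does not separate at all here
```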
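
Regarding item 2: a minimal sketch of the two representations that were compared, using made-up split sample numbers purely for illustration; this is not the actual visualization code.

```python
# Sketch (item 2): graph representation vs. "1-dimensional heatmap" of split times.
import matplotlib.pyplot as plt
import numpy as np

splits = [120, 340, 800, 1500, 2600, 4100]   # hypothetical sample numbers of splits
total_samples = 5000

fig, (ax_graph, ax_heat) = plt.subplots(2, 1, figsize=(8, 4), sharex=True)

# Graph representation: cumulative number of splits as a step function.
ax_graph.step(splits, np.arange(1, len(splits) + 1), where="post")
ax_graph.set_ylabel("splits so far")

# Vector-of-colors representation: a single row colored by the split count so far.
counts = np.searchsorted(splits, np.arange(total_samples), side="right")
ax_heat.imshow(counts[np.newaxis, :], aspect="auto", cmap="viridis")
ax_heat.set_yticks([])
ax_heat.set_xlabel("sample number")

plt.tight_layout()
plt.show()
```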
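
Regarding item 6.3: the two greedy-pick variants written out as a sketch. The names samples, leaves, mean_reward and best_sample are hypothetical stand-ins for the agent's real data structures.

```python
# Sketch (item 6.3) of the two final-move selection rules that were mixed up.

def pick_best_sample_overall(samples):
    """The action of the single best (action, reward) sample seen anywhere
    in the tree; this is what the 02-05 results actually used."""
    action, _ = max(samples, key=lambda s: s[1])
    return action

def pick_best_sample_of_best_leaf(leaves):
    """The best sample inside the leaf with the highest mean reward; this
    alternative turned out to perform worse in the CartPole setting."""
    best_leaf = max(leaves, key=lambda leaf: leaf.mean_reward)
    action, _ = best_leaf.best_sample
    return action
```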
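
Regarding item 7: a rough sketch of the replay-on-split idea, not the actual implementation; all class and method names are hypothetical.

```python
# Sketch (item 7): stored multi-step (action sequence, reward sequence) pairs are
# "replayed" into the new children whenever a leaf of the meta tree splits, so no
# sampled information is thrown away.

class MetaNode:
    def __init__(self):
        self.episodes = []       # list of (action_sequence, reward_sequence)
        self.children = None     # None while this node is still a leaf
        self.split_rule = None   # callable: action_sequence -> child index

    def record(self, actions, rewards):
        """Store one finished rollout in the leaf it belongs to."""
        node = self
        while node.children is not None:
            node = node.children[node.split_rule(actions)]
        node.episodes.append((actions, rewards))

    def split(self, split_rule, n_children):
        """Split this leaf and replay its stored episodes into the new children."""
        self.split_rule = split_rule
        self.children = [MetaNode() for _ in range(n_children)]
        for actions, rewards in self.episodes:
            self.children[split_rule(actions)].episodes.append((actions, rewards))
        self.episodes = []       # the episodes now live in the children

# Example: split on the sign of the first action and replay the stored episodes.
root = MetaNode()
root.record([0.4, -0.2], [1.0, 0.5])
root.record([-0.7, 0.1], [0.2, 0.0])
root.split(lambda actions: 0 if actions[0] < 0 else 1, n_children=2)
```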
Planning
  1. Report writing
  2. Work on multi-step agents and experiments
  3. Implement Transposition Tree

[1] H. van Hasselt and M. A. Wiering, "Using Continuous Action Spaces to Solve Discrete Problems".

1 comment:

  1. Dear Colin,

    nice polishing of previous results.

    If you want to challenge simple MC and elicit the advantage of trees you should test the environments with additive noise (corrupt the actual reward with some zero-mean noise, e.g., normally distributed).

    Looking forward to the multi-step evaluations. Best regards, Michael
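
A minimal sketch of the suggested reward corruption, assuming an environment object with a step(action) method that returns a reward; the wrapper name and interface are hypothetical, not part of the existing code.

```python
# Wrap an environment so every reward is corrupted with zero-mean Gaussian noise.
import random

class NoisyRewardEnvironment:
    def __init__(self, env, sigma=0.1, seed=None):
        self.env = env            # the environment being wrapped
        self.sigma = sigma        # standard deviation of the additive noise
        self.rng = random.Random(seed)

    def step(self, action):
        reward = self.env.step(action)
        return reward + self.rng.gauss(0.0, self.sigma)
```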
