Sunday, May 27, 2012

Feedback 27-05-2012

Work
  1. Looked into the perfect recall and fixed it. RMTL is now able to balance the pole for 1000 steps in 94% of the runs (vs. 69% before). I've updated the table from a couple of posts back and also provided it below in this post.
  2. Implemented the transposition tree and it works pretty well. For the CartPole environment it performs better than Random, MC, HOMTL and HOSTL (see table). Offline it does not perform very well and probably lacks information about important states (not included in the table).
  3. I removed the number of simulations from the table as in my opinion they are not really comparable anymore: I changed the parameters for HOO so it could achieve more simulations (but it gathers less information per simulation). Furthermore, the Transposition Tree agent actually updates the tree structure each step (and not once per simulation), meaning that after one rollout it can update the tree e.g. 20 times, depending on the length of the rollout (a small sketch of this per-step update is given below the table). Therefore I chose to leave the simulation counts out of the table.
  4. I reran the noisy Donut World experiment, but now with reward noise. It appears that this noise does not cause a significant decrease in performance. Furthermore, I tried the transposition tree agent on this environment and it performed better than all other algorithms.

Algorithm          | Success rate | Average payoff
IRTI + TLS         | 94%          | 971.747
IRTI + HOLOP       | 85%          | 922.047
HOO + TLS          | 0%           | 77.254
HOO + HOLOP        | 3%           | 273.525
Transposition Tree | 65%          | 808.480
MC                 | 9%           | 389.590
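
Point 3 of the work list above mentions that the Transposition Tree agent updates its tree once per visited step of a rollout instead of once per simulation. A minimal sketch of that per-step update (Python; the env and tree interfaces are hypothetical placeholders, not my actual implementation):

  # One rollout of length T produces up to T tree updates, one per
  # visited step, instead of a single update per simulation.
  def rollout_and_update(env, tree, start_obs, horizon=20):
      trajectory = []                       # (observation, action, reward) per step
      obs = start_obs
      for _ in range(horizon):
          action = tree.select_action(obs)  # pick an action from the tree for this observation
          obs_next, reward, done = env.step(obs, action)
          trajectory.append((obs, action, reward))
          obs = obs_next
          if done:
              break
      # Walk the trajectory backwards, accumulate the return-to-go and
      # update the tree once for every visited step.
      ret = 0.0
      for obs_t, action_t, reward_t in reversed(trajectory):
          ret = reward_t + ret
          tree.update(obs_t, action_t, ret)
      return ret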

Friday, May 25, 2012

Meeting 25-05-2012

Action Points
  1. We started with discussing the planning proposed at the beginning of the year. In my case the planning was pretty accurate, except that I have not yet implemented the "transposition tree" and the writing started a bit later than planned.
  2. We should put the noise on the reward instead of on the actions, because adding noise to the actions can also change the optimal reward function, i.e. even when playing the best actions one would not be able to achieve the best reward. To draw the optimal line with noisy actions you would have to calculate products / intervals (to keep it general).
  3. For the multi-step results I should also make an MTL-UCT agent.
  4. I asked what to do with the state information of my "transposition tree". Michael proposed an idea and I brainstormed with Andreas afterwards:
    1. Assume the state/observation space to be Markov
    2. The first level is an "observation tree", discretizing the observation space.
    3. Each leaf, representing a region of this space, links to an action tree on the second level, which discretizes the action space and keeps track of which actions are best for the observation region of the level above.
    4. This observation-action tree can be re-used for each state the agent is in due to the Markov property assumed in 1 (a minimal sketch is given below).
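
A minimal sketch of this two-level idea (Python; class and method names are illustrative and the splitting rules are left abstract, so this only illustrates the structure and is not the actual implementation):

  # Sketch of the two-level "transposition tree" idea above.
  # First level: observation tree over the observation space.
  # Second level: one action tree per observation-tree leaf.
  # Splitting rules (IRTI/HOO style) are intentionally left out.

  class ActionTree:
      """Discretizes the action space for one region of the observation space."""
      def __init__(self, action_bounds):
          self.action_bounds = action_bounds   # e.g. [(low, high), ...] per action dimension
          self.samples = []                    # (action, observed return) pairs

      def update(self, action, ret):
          self.samples.append((action, ret))
          # A real implementation would split this node once enough
          # evidence is collected for a promising sub-region.

      def best_action(self):
          # Placeholder: the action with the highest observed return so far.
          return max(self.samples, key=lambda s: s[1])[0] if self.samples else None

  class ObservationTree:
      """Discretizes the observation space; each leaf owns an ActionTree."""
      def __init__(self, obs_bounds, action_bounds):
          self.obs_bounds = obs_bounds
          self.children = None                 # None while this node is a leaf
          self.split_dim = None
          self.split_value = None
          self.action_tree = ActionTree(action_bounds)

      def leaf_for(self, obs):
          node = self
          while node.children is not None:
              # Placeholder descent rule: compare one observation dimension
              # against the value this node was split on.
              node = node.children[0] if obs[node.split_dim] < node.split_value else node.children[1]
          return node

      def update(self, obs, action, ret):
          # Because the observation space is assumed Markov (point 1),
          # this single tree is re-used for every state the agent visits.
          self.leaf_for(obs).action_tree.update(action, ret)

      def best_action(self, obs):
          return self.leaf_for(obs).action_tree.best_action()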
Plans
  1. Change last experiment so that noise is on rewards.
  2. Add the MTL-UCT agent to the experiment
  3. Look into the global/perfect recall again
  4. Implement the "transposition tree"
  5. Writing

Feedback 25-05-2012

[Post removed; see post of 04-06-2012]


Friday, May 18, 2012

Feedback 18-05-2012

Results

Sinus environment: time-based


Six Hump Camel Back environment: time-based


CartPole environment: time-based

I've run the CartPole experiment again with more challenging properties:
  1. Goal: balance the pole for 1000 steps
  2. Payoff:
    1. For each step that the pole is balanced: +1
    2. If the pole's angle is > 12 degrees from upright position: -1 (and terminal state)
    3. The maximum expected payoff is 1000, following from the goal definition
  3. Added Gaussian noise with a standard deviation of 0.3 to both rewards and actions (a small sketch of this payoff scheme is given below this list)
  4. See [1] for a more detailed description of the environment properties
  5. Agent's evaluation time per step: 50 ms
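
To make the payoff scheme above concrete, here is a small illustrative sketch of how one run is scored (Python; the env/agent interfaces and the pole dynamics are placeholders, not the actual environment code):

  import random

  NOISE_STD = 0.3       # Gaussian noise added to actions and rewards
  MAX_STEPS = 1000      # goal: balance the pole for 1000 steps
  MAX_ANGLE = 12.0      # degrees from upright before the episode ends

  def run_episode(env, agent):
      payoff = 0.0
      for _ in range(MAX_STEPS):
          action = agent.choose_action(env.observation())       # 50 ms budget per step
          noisy_action = action + random.gauss(0.0, NOISE_STD)  # noise on the applied action
          angle = env.step(noisy_action)                        # pole angle after this step
          if abs(angle) > MAX_ANGLE:
              payoff += -1.0 + random.gauss(0.0, NOISE_STD)     # fall: -1 and terminal state
              break
          payoff += 1.0 + random.gauss(0.0, NOISE_STD)          # balanced step: +1
      return payoff                                             # expected maximum of 1000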


Algorithm          | Success rate | Average payoff
IRTI + TLS         | 94%          | 971.747
IRTI + HOLOP       | 85%          | 922.047
HOO + TLS          | 0%           | 77.254
HOO + HOLOP        | 3%           | 273.525
Transposition Tree | 65%          | 808.480
MC                 | 9%           | 389.590


[1] H. Van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), pp. 272–279, 2007.

Wednesday, May 16, 2012

Feedback 16-05-2012

Work

Below is my planning from my last post with comments.
  1. Investigate why the IRTI line is horizontal at the end of the Six Hump Camel Back experiments.
    1. The maximum tree depth was not the cause.
    2. It was caused by the fact that the values get really close together at some point and the exploration value is not able to "fix" random bad choices (at least not within a reasonable number of samples).
  2. Implement a visualization to show at which time a split occurs (at which sample number). 
    1. See the two posts with the results of the Sinus and Six Hump Camel Back environment
    2. The "1-dimensional heatmap" or "vector or colors" representation did not look so nice as the graph representation
  3. Check if it is correct that the RMTL agent has a significant decrease in number of simulations per second when memorization is enabled, as seen in the table of the last post.
    1. This is correct
  4. Think about which parameters are relevant (to mention in the report) and remove the redundant ones.
    1. The posts with the Sinus and Six Hump results have been edited and now only show a reduced set of relevant parameters.
  5. Optimize the multi-step agents
    1. I have been tweaking the multi-step agents a lot.
  6. Think of a graph representation of the multi-step experiments and rerun them under more difficult settings, i.e. less time or the settings from the paper (with noisy rewards, noisy actions, etc).
    1. I tried a different setting for the CartPole which is a bit more difficult [1]
    2. I found out that Vanilla MC is really good and outperforms all the other algorithms; it can balance perfectly every time given only 50 ms per step. The reason is that it can achieve a high number of simulations.
    3. I saw that for the greedy pick I used the best sample found overall and not the best sample of the leaf with the best mean (for the results of the post from 02-05). In the former case, the agent is comparable with the Vanilla agent and performs pretty well, while in the latter case the agent performs worse and is not able to achieve enough simulations for an accurate approximation.
    4. The Donut World environment has the property/flaw that optimizing (only) the next step is sufficient and no future planning is actually needed.
  7. Implement perfect recall on meta tree level
    1. Implemented a way in which multi-step action sequences and their reward sequences are stored and "replayed" whenever a split occurs, as discussed with Lukas and Andreas. It seems the agents perform better with this option (a small sketch of this replay is given below this list).
  8. Furthermore, I've been writing on the report.
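
A minimal sketch of the storing and replaying described in point 7 (Python; names are illustrative and the split decision itself is left out, so this is only a sketch of the idea):

  # "Perfect recall" on the meta-tree level: every sampled action
  # sequence is stored together with its reward sequence, and when a
  # node splits the stored samples are replayed into the new children,
  # so no information is lost by the split.
  class MetaTreeNode:
      def __init__(self, bounds):
          self.bounds = bounds          # per-step (low, high) ranges of the action sequence
          self.children = None
          self.memory = []              # (action_sequence, reward_sequence) pairs

      def contains(self, action_sequence):
          return all(lo <= a <= hi for a, (lo, hi) in zip(action_sequence, self.bounds))

      def add_sample(self, action_sequence, reward_sequence):
          self.memory.append((action_sequence, reward_sequence))

      def split(self, child_bounds):
          """Create the children and replay all stored samples into them."""
          self.children = [MetaTreeNode(b) for b in child_bounds]
          for actions, rewards in self.memory:
              for child in self.children:
                  if child.contains(actions):   # replay: re-insert into the matching child
                      child.add_sample(actions, rewards)
                      break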
Planning
  1. Report writing
  2. Work on multi-step agents and experiments
  3. Implement Transposition Tree

[1] H. Van Hasselt and M. A. Wiering, "Using Continuous Action Spaces to Solve Discrete Problems".

Wednesday, May 2, 2012

Meeting 02-05-2012

Discussion Points

Andreas, Lukas and I presented the results of the experiments (see previous posts for my results). I got some useful feedback and comments, which are summarized in the planning below.

Planning

  1. Investigate why the IRTI line is horizontal at the end of the Six Hump Camel Back experiments.
  2. Implement a visualization to show at which time a split occurs (at which sample number). 
  3. Check if it is correct that the RMTL agent has a significant decrease in number of simulations per second when memorization is enabled, as seen in the table of the last post.
  4. Think about which parameters are relevant (to mention in the report) and remove the redundant ones.
  5. Optimize the multi-step agents
  6. Think of a graph representation of the multi-step experiments and rerun them under more difficult settings, i.e. less time or the settings from the paper (with noisy rewards, noisy actions, etc).
  7. Implement perfect recall on meta tree level
  8. Implement Transposition Tree

Feedback 02-05-2012

Multi-step Results

I've run some experiments regarding the four multi-step algorithms in the CartPole environment:

- I used pretty "easy" settings for the environment (no transition/reward/observation noise, a small but sufficient action space (force applied to the cart), etc.).
- On the other hand, I limited the agent to (only) 100 ms per step.
- The rewards are equal to the number of steps the agent is able to balance the pole.
- Per algorithm, 100 runs were performed.
- Note that I have not yet improved/profiled the agents, nor tuned the parameters in detail. Furthermore, I could only do 100 runs due to the time constraints of today's meeting. Therefore, the results have to be considered preliminary.

Number of runs (out of 100) per reward bracket:

algorithm | <= 100 | > 100 | > 250 | > 500 | >= 1000
RMTL      | 0      | 0     | 0     | 1     | 99
RSTL      | 2      | 2     | 3     | 3     | 90
HOSTL     | 2      | 2     | 3     | 8     | 85
HOMTL     | 0      | 0     | 0     | 0     | 100

- The next table shows the number of simulations each algorithm can perform in 1 second (roughly; on average) during the first 10 steps in the CartPole environment.

algorithm | avg sims/s (memorization) | avg sims/s (no memorization)
RMTL      | 37000                     | 39000
RSTL      | 18000                     | 32000
HOSTL     | 7300                      | 7700
HOMTL     | 5000                      | 5000

