Wednesday, March 28, 2012

Meeting 28-03-2012


Action Points
  • We all talked about our progress, results so far and current work
  • I implemented IRTI, TLS, HOO and HOLOP, although the latter two are still under development
  • By observing the agents' behavior in RL Viz I noticed a couple of things which I should look into:
    • IRTI very rarely gets stuck in a non-optimal region (Six Hump Camel Back; approximately 1 out of 100 runs)
    • The TLS agent sometimes makes a bad move in the Double Integrator
    • The sinus function seems to be a difficult problem for IRTI; sampling mostly concentrates on the local maximum at the left
    • HOO performs better than IRTI on the sinus function but very badly in the Six Hump Camel Back environment
    • HOO is much slower than IRTI
    • HOO does not explore correctly
    • HOLOP performs badly, most likely because HOO is not working correctly yet
  • I should look into these problems; the planning below lists some possible solutions
Planning
  • Try to split sooner for the Sinus environment and observe results
  • Make nodes beforehand
  • Remove storing the action ranges within the nodes (to avoid array copying each split)
  • Scale the exploration term in the calculation of U in HOO
  • Update the U and B values before the selection step
  • Change the discounting of the reward so that the power of the discount factor matches the depth (see the sketch after this list)
  • Investigate saving of images in RL Viz
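
To make the last three points a bit more concrete, here is a rough sketch of what I have in mind. The U and B formulas follow the HOO paper, with an extra scaling factor on the exploration term; all names (HooNode, explorationScale, and so on) are illustrative and not the actual code.

  // Sketch of the planned HOO bookkeeping: U gets a scalable exploration term and
  // B is the usual min/max combination; both are refreshed before each selection step.
  // nu1 and rho are the smoothness parameters from the HOO paper.
  public class HooNode {
      HooNode left, right;   // children created when this node is split
      int depth;             // h: depth of the node in the HOO tree
      int visits;            // T_{h,i}: number of times this node has been played
      double meanReward;     // empirical mean reward observed in this node
      double u = Double.POSITIVE_INFINITY;  // U_{h,i}
      double b = Double.POSITIVE_INFINITY;  // B_{h,i}

      /** Recompute U and B bottom-up; to be called before every selection step. */
      void updateBounds(int totalPlays, double explorationScale, double nu1, double rho) {
          if (left != null) left.updateBounds(totalPlays, explorationScale, nu1, rho);
          if (right != null) right.updateBounds(totalPlays, explorationScale, nu1, rho);
          if (visits == 0) { u = b = Double.POSITIVE_INFINITY; return; }
          double exploration = explorationScale * Math.sqrt(2.0 * Math.log(totalPlays) / visits);
          double diameter = nu1 * Math.pow(rho, depth);   // the v1 * p^h term
          u = meanReward + exploration + diameter;
          double childBound = (left == null || right == null)
                  ? Double.POSITIVE_INFINITY
                  : Math.max(left.b, right.b);
          b = Math.min(u, childBound);
      }

      /** Meta / sequence tree backup: the power of gamma should match the depth
       *  (plan step) at which the reward was received. */
      static double discounted(double reward, int depth, double gamma) {
          return Math.pow(gamma, depth) * reward;
      }

      public static void main(String[] args) {
          HooNode root = new HooNode();
          root.visits = 1;
          root.meanReward = discounted(0.8, 0, 0.95);  // a reward received at depth 0
          root.updateBounds(1, 1.0, 1.0, 0.5);
          System.out.println("U = " + root.u + ", B = " + root.b);
      }
  }

Updating U and B bottom-up over the whole tree each iteration is the simplest thing that is clearly correct; if it turns out to be too slow I can restrict the update to the path that was actually played.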
Next meeting
  • Individual meeting: +- April 11
  • Joint meeting: +- April 25

Saturday, March 17, 2012

Meeting 16-03-2012

Work
  • Implemented the option of time-based learning (besides simulation-based learning); a small sketch of the budget loop follows after this list
  • Implemented discounted rewards for the backtracking step
  • Added a parameter for the maximum expansion depth of the meta tree
  • Did some optimization for IRTI / RMTL and experimented a bit
  • Implemented the Double Integrator environment mentioned in several papers (sketched after this list)
  • Started the implementation of HOO
    • Seems to work for one-step one-dimensional problems (i.e. one-step Donut World)
    • Fails in Six Hump Camel Back
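
On the time-based option mentioned above: it simply replaces the fixed simulation budget with a wall-clock deadline on the planning loop. A minimal sketch with hypothetical names (the real agent obviously does more per rollout):

  // Sketch of the two budget modes for the planning loop: a fixed number of
  // simulations or a wall-clock time limit. runOneSimulation stands in for one rollout.
  public class PlanningBudget {
      static int plan(boolean timeBased, int maxSimulations, long maxMillis, Runnable runOneSimulation) {
          long deadline = System.currentTimeMillis() + maxMillis;
          int simulations = 0;
          while (timeBased ? System.currentTimeMillis() < deadline : simulations < maxSimulations) {
              runOneSimulation.run();
              simulations++;
          }
          return simulations;  // how many rollouts fitted in the budget
      }

      public static void main(String[] args) {
          int n = plan(true, 10_000, 20, () -> { /* one rollout would run here */ });
          System.out.println("rollouts within 20 ms: " + n);
      }
  }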
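The double integrator as I implemented it follows the usual formulation: a point mass on a line, the acceleration as the continuous action, and a negative quadratic reward. The constants below (step size, bounds, start state) are placeholders rather than the exact values from the papers:

  // Sketch of the double integrator: state (position p, velocity v), continuous
  // action a (acceleration), negative quadratic reward. Constants are placeholders.
  public class DoubleIntegrator {
      static final double DT = 0.05;   // integration step (placeholder value)
      double p = 1.0, v = 0.0;         // start away from the origin, at rest

      /** Apply acceleration a for one step and return the reward. */
      double step(double a) {
          a = Math.max(-1.0, Math.min(1.0, a));  // clip the action to [-1, 1]
          p += v * DT;                           // Euler integration of the position
          v += a * DT;                           // ... and of the velocity
          return -(p * p + a * a);               // penalize distance from 0 and control effort
      }

      public static void main(String[] args) {
          DoubleIntegrator env = new DoubleIntegrator();
          double ret = 0.0;
          for (int t = 0; t < 200; t++) ret += env.step(-0.5 * env.p - 0.5 * env.v);
          System.out.println("return of a simple hand-coded policy: " + ret);
      }
  }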
Action Points
  • We looked at the results from the previous post, which looked good
    • I should look into the "artifacts" still present in samples "901-1000" (some lines of samples are still visible near the (local and global) optima)
    • I still have to add error bars
  • Kurt gave me some tips for the presentation and report
  • As it was not really clear to me from the literature, I asked about the choice of splits in HOO
    • Which dimension? Random
    • At which point in the dimension's range to split? Random
  • For HOLOP, however, dimensions corresponding to early steps in the sequence should be chosen more often, as they tend to contribute more to the return (see the HOLOP paper)
  • The last term in the upper bound formula (sometimes written as v1*p^h and sometimes as diam(Ph,i)) was also not clear to me. I'm still struggling a bit with this, but on the other hand it should not affect the workings of HOO too much; my current reading of the bound is written out below.
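
For later reference, this is my current reading of the bound, in the notation of the HOO paper (so my interpretation, not a definitive statement):

  U_{h,i} = \hat{\mu}_{h,i} + \sqrt{ \frac{2 \ln n}{T_{h,i}} } + \nu_1 \rho^{h}
  B_{h,i} = \min\{ U_{h,i},\ \max( B_{h+1,2i-1},\ B_{h+1,2i} ) \}

Here n is the total number of plays so far and T_{h,i} the number of times node (h,i) has been played. The last term appears because the smoothness assumption gives diam(P_{h,i}) <= \nu_1 \rho^{h}, which is why the two notations should refer to the same quantity.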
Planning
  • Finish presentation for thesis meeting March 21
  • Finalise / Debug HOO
  • Thesis report writing / restructuring
Next meeting
  • Joint meeting: Wednesday, March 28, 2012, 10:00-12:00

Monday, March 5, 2012

Feedback 05-03-2012

Work
  • Set up structure (chapters / sections) of the report 
  • Wrote a few blocks of text for the report
  • Thought about names for the 4 combinations (Regression Tree / HOO + Meta Tree / Sequence Tree)
    • Regression-based Meta Tree Learning (RMTL)
      • similar to Tree Learning Search (TLS)
    • Hierarchical Optimistic Sequence-based Tree Learning (HOSTL)
      • similar to Hierarchical Open-Loop Optimistic Planning (HOLOP)
    • Hierarchical Optimistic Meta Tree Learning (HOMTL)
    • Regression and Sequence-based Tree Learning (RSTL)
  • While thinking about the meta tree interface for the next meeting I already had a go at the implementation, resulting in a (seemingly) working RMTL / TLS agent
  • Below are some results for a multi-step problem using RMTL / TLS
    • Environment: Donut World
    • Average of 100 episodes
    • 3-step limit
    • 10,000 simulations per step
    • Maximum reward per step = 1 (when being exactly on the middle of the donut region)
    • Gradual reward decrease to 0 towards the (inner or outer) edge of the donut
    • Minimum reward = 0 (off the donut); a small sketch of this reward function follows after the result tables
  Step   Average reward   Minimum reward   Maximum reward
  1      0.977023         0.888396         0.999574
  2      0.983432         0.768632         0.999977
  3      0.993064         0.62132          1

  Average cumulative reward   Minimum cumulative reward   Maximum cumulative reward
  2.95352                     2.555052                    2.998965
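
As a reminder of how the Donut World reward is shaped, here is a small sketch; it only maps the distance from the donut's centre to a reward, and the radii and the linear fall-off are placeholders (the actual environment may use different values and a different shape of decrease):

  // Sketch of the Donut World reward as described above: 1.0 exactly on the middle
  // of the ring, dropping to 0 at the inner/outer edge, and 0 off the donut.
  // The radii and the linear fall-off are illustrative placeholders.
  public class DonutReward {
      static final double INNER = 0.5, OUTER = 1.0;            // placeholder radii
      static final double MIDDLE = 0.5 * (INNER + OUTER);      // centre line of the ring
      static final double HALF_WIDTH = 0.5 * (OUTER - INNER);  // centre line to edge

      /** Reward as a function of the distance r from the donut's centre. */
      static double reward(double r) {
          if (r < INNER || r > OUTER) return 0.0;          // off the donut
          return 1.0 - Math.abs(r - MIDDLE) / HALF_WIDTH;  // 1 on the middle, 0 at the edges
      }

      public static void main(String[] args) {
          System.out.println(reward(MIDDLE));  // 1.0: exactly on the middle of the donut
          System.out.println(reward(OUTER));   // 0.0: on the outer edge
      }
  }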

Planning
  • Preparation for meeting Wednesday March 7th
  • Report writing
  • Debug / investigate TLS / RMTL