Thursday, April 5, 2012

Feedback 05-04-2012

Work
  1. Lukas pointed out that my discounting in TLS was incorrect: the power of gamma should always match the depth, even when updating nodes lower in the tree. I fixed this (see the first sketch below this list).
  2. Instantiating nodes beforehand for HOO did not make a difference in the performance.
  3. Removing the storage of the action ranges within the nodes (to avoid array copying each split) makes only a small difference in simulations per second. For compatibility and readability I decided to stick to the initial implementation of having the action ranges within the nodes.
  4. Updating the U and B values before the selection step increased the number of simulations. This means a full tree traversal is done to update the U and B values of all nodes, updating the children before the parent (since the calculation of B takes the B-values of the children). To avoid a stack overflow with a recursive post-order traversal, I decided to (also) store references to the nodes in a list, in order of creation. Simply using a descendingIterator then ensures the children are always updated before their parent (see the second sketch below this list).
  5. I tried scaling the exploration term in the calculation of U in HOO, but did not (yet) succeed in making the HOO algorithm perform well in environments with rewards outside the range 0 to 1. The Sinus environment, however, has rewards in the range of about 0 to 0.6, and there HOO performs well as long as the splitting occurs in or near the middle (see next point).
  6. I implemented a choice of splitting behaviour for HOO (the decision of where to split, given the range of one of the action dimensions; see the third sketch below this list)
    1. Normally distributed around the middle of the range (with an option to change the shape of the "bell")
    2. Uniformly distributed
    3. Exactly in the middle
  7. Splitting sooner for the regression tree agent in the Sinus environment did not change the reward; it keeps sampling in the leftmost peak most of the time.
  8. I found a problem regarding the adaptive C value. This value can be 0, "removing" the exploration term whenever a node has count 1 in the selection step (because then min equals max, so the size of the range is 0). For regression trees this never occurs, since a node gathers more than one sample before splitting, but for HOO this can be a problem.
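For point 1, a minimal sketch of the corrected backup. This is not the actual code (the names Node, update and backup are hypothetical), but it shows the idea that a reward observed at depth t always contributes gamma^t, also to nodes lower in the tree:

    class Node {
        Node parent;
        int count;
        double valueSum;

        void update(double v) { count++; valueSum += v; }
    }

    class DiscountedBackup {
        static void backup(Node leaf, double reward, int rewardDepth, double gamma) {
            // the exponent is fixed by the depth at which the reward was observed
            double discounted = Math.pow(gamma, rewardDepth) * reward;
            for (Node n = leaf; n != null; n = n.parent) {
                n.update(discounted); // every ancestor receives the same gamma^depth * r
            }
        }
    }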
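For point 4, a sketch of the iterative bottom-up update (again with hypothetical names, and with the diameter term of U omitted). Since a child is always created after its parent, walking the creation-order list backwards guarantees that both children are updated before their parent, without any recursion:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Iterator;

    class HooNode {
        HooNode leftChild, rightChild;
        double u = Double.POSITIVE_INFINITY;
        double b = Double.POSITIVE_INFINITY;
        int count;
        double meanReward;

        void updateU(int n) {
            if (count > 0) {
                // exploration term only; the diameter term is left out in this sketch
                u = meanReward + Math.sqrt(2 * Math.log(n) / count);
            }
        }

        void updateB() {
            if (leftChild == null && rightChild == null) {
                b = u; // unexpanded children have B = +infinity, so min(U, +inf) = U
            } else {
                b = Math.min(u, Math.max(leftChild.b, rightChild.b));
            }
        }
    }

    class BoundUpdater {
        private final Deque<HooNode> creationOrder = new ArrayDeque<>();

        void register(HooNode node) { creationOrder.addLast(node); } // called at creation

        void updateAllBounds(int n) {
            for (Iterator<HooNode> it = creationOrder.descendingIterator(); it.hasNext(); ) {
                HooNode node = it.next();
                node.updateU(n);
                node.updateB(); // its children were created later, so they are already updated
            }
        }
    }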
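For point 6, a sketch of the three splitting behaviours (names hypothetical): given the [min, max] range of the chosen action dimension, return the point at which to split:

    import java.util.Random;

    class SplitPointChooser {
        enum Strategy { NORMAL_AROUND_MIDDLE, UNIFORM, EXACT_MIDDLE }

        private final Random rng = new Random();

        // 'shape' controls the width of the bell for the normal strategy;
        // smaller values concentrate splits closer to the middle
        double choose(Strategy strategy, double min, double max, double shape) {
            double middle = (min + max) / 2.0;
            switch (strategy) {
                case NORMAL_AROUND_MIDDLE:
                    double draw = middle + rng.nextGaussian() * shape * (max - min);
                    return Math.min(max, Math.max(min, draw)); // clamp to the range
                case UNIFORM:
                    return min + rng.nextDouble() * (max - min);
                case EXACT_MIDDLE:
                default:
                    return middle;
            }
        }
    }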
Planning
  1. Try to split sooner for the Sinus environment and observe results
  2. Scaling the exploration term in the calculation of U in HOO
  3. Investigate saving of images in RL Viz
  4. Generate some pictures / results
Next meeting
  • Individual meeting: Thursday, April 12, 2012, 11:00 - 12:00
  • Joint meeting: Wednesday, May 2, 13:00 - 15:00

Wednesday, March 28, 2012

Meeting 28-03-2012


Action Points
  • We all talked about our progress, results so far and current work
  • I implemented IRTI, TLS, HOO and HOLOP, although the latter two are still under development
  • By observing the agents' behaviour in RL Viz I noticed a couple of things which I should look into
    • IRTI very rarely gets stuck in a non-optimal region (Six Hump Camel Back; approximately 1 out of 100 times)
    • The TLS agent sometimes makes a bad move in the Double Integrator
    • The Sinus function seems to be a difficult problem for IRTI; sampling is mostly done at the local maximum on the left
    • HOO performs better than IRTI on the Sinus function, but very badly in the Six Hump Camel Back environment
    • HOO is much slower than IRTI
    • HOO does not explore correctly
    • HOLOP performs badly, most likely due to the incorrect workings of HOO
  • I should look into these problems. The planning below shows some solutions to the problems above
Planning
  • Try to split sooner for the Sinus environment and observe results
  • Instantiate nodes beforehand
  • Remove storing the action ranges within the nodes (to avoid array copying each split)
  • Scale the exploration term in the calculation of U in HOO
  • Update U and B values before the selection step
  • Change the discounting of reward (the power should match the depth)
  • Investigate saving of images in RL Viz
Next meeting
  • Individual meeting: around April 11
  • Joint meeting: around April 25

Saturday, March 17, 2012

Meeting 16-03-2012

Work
  • Implemented the option of time-based learning (besides simulation-based learning)
  • Implemented discounted rewards for the backtracking step
  • Added a parameter for the maximum expansion depth of the meta tree
  • Did some optimization for IRTI / RMTL and experimented a bit
  • Implemented the Double Integrator environment mentioned in several papers
  • Started implementing HOO
    • Seems to work for one-step one-dimensional problems (i.e. one-step Donut World)
    • Fails in Six Hump Camel Back
Action Points
  • We looked at the results from the previous post, which looked good
    • I should look into the "artifacts" still present in the samples "901-1000" (you still see some lines of samples near the (local and global) optima)
    • I still have to add error bars
  • Kurt gave me some tips for the presentation and report
  • As it was not really clear to me from the literature, I asked about the choice of splitting in HOO
    • Which dimension? Random
    • At which point in the dimension's range to split? Random
  • For HOLOP, however, dimensions corresponding to early stages in the sequence should be chosen more often, as they tend to have a bigger contribution to the return (see the HOLOP paper)
  • The last term in the upper bound formula (sometimes referred to as ν1·ρ^h and sometimes as diam(P_h,i)) was also not clear to me. I'm still struggling a bit with this, but on the other hand it should not affect the workings of HOO too much (see the formulas below this list).
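As a reference for the last point, the bounds as I reconstruct them from the HOO paper (Bubeck, Munos, Stoltz and Szepesvári, "X-Armed Bandits") are

    U_h,i(n) = μ̂_h,i(n) + √(2·ln(n) / T_h,i(n)) + ν1·ρ^h
    B_h,i(n) = min( U_h,i(n), max( B_h+1,2i-1(n), B_h+1,2i(n) ) )

where μ̂_h,i(n) is the mean reward observed in node (h,i), T_h,i(n) is the number of times it was played, and B = +∞ for nodes not yet sampled. The term ν1·ρ^h is an upper bound on diam(P_h,i), the diameter of the region covered by the node at depth h, which explains why both notations appear in the literature.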
Planning
  • Finish presentation for thesis meeting March 21
  • Finalise / Debug HOO
  • Thesis report writing / restructuring
Next meeting
  • Joint meeting: Wednesday, March 28, 2012, 10:00-12:00

Monday, March 5, 2012

Feedback 05-03-2012

Work
  • Set up structure (chapters / sections) of the report 
  • Wrote a few blocks of text for the report
  • Thought about the names for the 4 combinations (Regression Tree / HOO + Meta Tree / Sequence Tree)
    • Regression-based Meta Tree Learning (RMTL)
      • similar to Tree Learning Search (TLS)
    • Hierarchical Optimistic Sequence-based Tree Learning (HOSTL)
      • similar to Hierarchical Open-Loop Optimistic Planning (HOLOP)
    • Hierarchical Optimistic Meta Tree Learning (HOMTL)
    • Regression and Sequence-based Tree Learning (RSTL)
  • While thinking about the meta tree interface for the next meeting, I already had a go at the implementation, resulting in a (seemingly) working RMTL / TLS agent
  • Below are some results for a multi-step problem using RMTL / TLS
    • Environment: Donut World
    • average of 100 episodes
    • 3 step limit
    • 10,000 simulations per step
    • maximum reward per step = 1 (when being exactly in the middle of the donut region)
    • gradual reward decrease to 0 towards the (inner or outer) edge of the donut
    • minimum reward = 0 (off the donut)
Step    Average reward    Minimum reward    Maximum reward
1       0.977023          0.888396          0.999574
2       0.983432          0.768632          0.999977
3       0.993064          0.62132           1

Average cumulative reward    Minimum cumulative reward    Maximum cumulative reward
2.95352                      2.555052                     2.998965

Planning
  • Preparation for meeting Wednesday March 7th
  • Report writing
  • Debug / investigate TLS / RMTL

Wednesday, February 29, 2012

Feedback 29-02-2012

Work

  • I have implemented a visualization tool in Matlab which plots several statistics for the Six Hump Camel Back
    • (x, y) coordinates of (non-greedy) samples m to n
    • error against sample number 
    • best reward against sample number
    • reward against sample number
  • Below are some results


  • Kurt/Michael told me the F-test using 1 and n degrees of freedom is correct
    • See the Regression problems section of the F-test Wikipedia page for reference
    • In my case, two regression models are compared, representing before and after the split. The first has one parameter (the average before the split) and the other has two parameters (the average of the left child and the average of the right child after the split).
    • p2 - p1 is therefore 2 - 1 = 1 and n - p2 is n - 2
    • The residual sum of squares (or sum of squared errors) is calculated as follows
      • RSS = ∑ (y - f(x))² where y is the return of one sample and f(x) is the mean of the data
    • In my case the input consists of 6 values: countLeft, countRight, ySumLeft, ySumRight, ySumSquaredLeft, ySumSquaredRight
    • The totalCount, yTotalSum and yTotalSumSquared values of the parent node can be calculated by summing the left child's and the right child's values
    • The means of the three data sets are of course the sums divided by the counts
    • Given these values, the RSS can also be calculated as
      • RSS = ySumSquared - (2 * yMean * ySum) + (count * yMean²)
    • With this formula for the RSS, the Wikipedia formula can be evaluated
      • RSS1 = yTotalSumSquared - (2 * yTotalMean * yTotalSum) + (totalCount * yTotalMean²) 
      • RSS21 = yLeftSumSquared - (2 * yLeftMean * yLeftSum) + (leftCount * yLeftMean²) 
      • RSS22 = yRightSumSquared - (2 * yRightMean * yRightSum) + (rightCount * yRightMean²) 
      • RSS2 = RSS21 + RSS22 
      • Fcalc = (RSS1 - RSS2) / (RSS2 / (totalCount - 2))
    • Finally, a split should be introduced when the statistics before and after the split are shown to be significantly different, which is the case if Fcalc >= Fcrit, where Fcrit is the critical value from the F-distribution table for degrees of freedom 1 and (totalCount - 2); see the sketch below this list
  • It turned out that my F-test implementation based on variances (from the Scala code) returned the same Fcalc values as the RSS method explained above
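The split test above, condensed into a small sketch. The class and method names are hypothetical, but the arithmetic follows the RSS formulas listed above:

    class SplitFTest {
        // RSS = ∑ (y - mean)² expanded into sufficient statistics:
        // ySumSquared - 2·mean·ySum + count·mean²
        static double rss(double count, double ySum, double ySumSquared) {
            double mean = ySum / count;
            return ySumSquared - 2 * mean * ySum + count * mean * mean;
        }

        static double fCalc(double countLeft, double ySumLeft, double ySumSqLeft,
                            double countRight, double ySumRight, double ySumSqRight) {
            // the parent's statistics are the sums of the children's statistics
            double totalCount = countLeft + countRight;
            double rss1 = rss(totalCount, ySumLeft + ySumRight, ySumSqLeft + ySumSqRight);
            double rss2 = rss(countLeft, ySumLeft, ySumSqLeft)
                        + rss(countRight, ySumRight, ySumSqRight);
            // (p2 - p1) = 1 and (n - p2) = totalCount - 2
            return (rss1 - rss2) / (rss2 / (totalCount - 2));
        }
    }

A split is then introduced when fCalc >= Fcrit for degrees of freedom 1 and (totalCount - 2).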

Planning
  • Write everything relevant from this blog into the Latex draft already
  • Think about the names of all 4 combinations of algorithms
  • Implement TLS taking the regression tree component in a multi-step environment. Think about the representation of the sequence of the actions fed to the TLS component (and later to the HOLOP component)

Friday, February 24, 2012

Meeting 24-02-2012


Action Points
  • I explained that I just fixed a bug regarding the adaptive UCT constant. During selection, each child should use the same UCT constant (dependent on the range of the parent node), whereas at first I used the ranges of the children themselves, meaning the children would use different UCT constants. In some cases a "bad" child would then be preferred over a "good" child (with a high average / low number of trials) when the bad child has a low minimum (and therefore a large range size). A sketch of the fixed selection step follows below this list.
  • I showed the behaviour and results in RL Glue and the .csv data; this looked good
  • Two problems are still open and are to be investigated (Kurt; thanks in advance)
    • From the Scala code it can be concluded that a T-test is used for the Six Hump Camel Back experiment, while the paper itself states an F-test is used?
    • The F-values used in the Scala code are the entries for 1 and n degrees of freedom of the F-distribution table, while I thought it would be more logical to use the entry for n and n (because the degrees of freedom are equal for both sides)?
  • HOLOP and TLS and their combinations were explained to me. On the one hand you have a "tree of trees", taking either a regression tree (as in TLS) or HOO, and on the other hand you have a "tree of sequences", which can also be either a regression tree or HOO (as in HOLOP). A sequence can in this case be seen as the Cartesian product / concatenation of the actions taken during an episode.
  • Besides these four algorithms, Michael proposed UCT with pre-discretization. I said that I had already implemented a simple agent that pre-discretizes the action space (into a specified number of "parts") and uses a basic UCT algorithm to learn. I did this to test the MCTS / UCT methods / components which are also used in the regression trees.
  • I announced that I have a job and will be working two days a week from now on. This means my productivity will slightly decrease.
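A sketch of the fixed selection step mentioned above (names hypothetical): the exploration constant is derived from the parent's observed reward range, so all children are compared with the same constant:

    import java.util.List;

    class UctNode {
        int count;
        double rewardSum, minReward, maxReward;

        double meanReward() { return rewardSum / count; }
    }

    class UctSelection {
        static UctNode select(UctNode parent, List<UctNode> children) {
            double c = parent.maxReward - parent.minReward; // same constant for every child
            UctNode best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (UctNode child : children) {
                if (child.count == 0) return child; // visit unvisited children first
                double score = child.meanReward()
                        + c * Math.sqrt(Math.log(parent.count) / child.count);
                if (score > bestScore) { bestScore = score; best = child; }
            }
            return best;
        }
    }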
Planning
  • Implement (in Matlab) reading and visualizing the output from experiments
  • Write everything relevant from this blog into the Latex draft already
  • Think about the names of all 4 combinations of algorithms
  • Implement TLS taking the regression tree component in a multi-step environment. Think about the representation of the sequence of the actions fed to the TLS component (and later to the HOLOP component)
Next meeting
  • Wednesday, March 7, 2012, 11:00
  • Andreas and Lukas are also invited to this meeting

Wednesday, February 15, 2012

Feedback 15-02-2012

Work

  • Both the environment and agent visualizations in RL Glue indicate the best sample and the best regression tree leaf (shown as a 1D or 2D area given by its ranges) in green
  • All leaves in the regression tree remember the best sample seen, which is picked when the greedy action is chosen
  • Implemented memorization of the samples, adjustable by three parameters (see the sketch below this list)
    • Maximum number of samples to store at any time in a leaf
    • Maximum number of samples to pass to best child in case of a split
    • Maximum number of samples to pass to worst child in case of a split
  • I have looked into the Scala code and the algorithm is now able to converge to the global optimum within approximately 200 samples (as seen in the visualization):
    • A problem I noticed is that for the Six Hump Camel Back, a split was often introduced at the very edge of the state space, leaving a group of only one sample. I introduced a minimum number of samples each child should at least have for a split to be allowed.
    • I don't know which significance test was used for the experiments in the paper, but in the Scala code the F-test is commented out and a T-test is used. I therefore also implemented a T-test so I could compare the two.
    • Furthermore, I noticed that the F-distribution table they were using contained incorrect values. I have no clue how they arrived at those values. I experimented with, and kept the option to use, both value sets: their (0.1 and 0.001) tables and Apache's F-test method (which generates the "correct" values according to the literature, for any significance level).
    • The most important change was the adaptive UCT constant. This makes a huge difference. 
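A rough sketch of the three memorization parameters (all names hypothetical, and the handoff policy is only one plausible reading of the description above):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    class LeafSamples {
        final int maxStored;       // max samples stored in a leaf at any time
        final int maxToBestChild;  // max samples passed to the best child on a split
        final int maxToWorstChild; // max samples passed to the worst child on a split
        final List<double[]> samples = new ArrayList<>(); // each: {action..., return}

        LeafSamples(int maxStored, int maxToBestChild, int maxToWorstChild) {
            this.maxStored = maxStored;
            this.maxToBestChild = maxToBestChild;
            this.maxToWorstChild = maxToWorstChild;
        }

        void remember(double[] sample) {
            if (samples.size() < maxStored) samples.add(sample); // drop when full
        }

        // On a split, each child keeps at most its limit of the best samples.
        List<double[]> passToChild(List<double[]> childSamples, boolean isBestChild) {
            int limit = isBestChild ? maxToBestChild : maxToWorstChild;
            childSamples.sort(Comparator.comparingDouble((double[] s) -> s[s.length - 1]).reversed());
            return childSamples.subList(0, Math.min(limit, childSamples.size()));
        }
    }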
Planning
  • Holiday!
  • Implement the possibility to specify and read a certain properties file (instead of only a default file in the root folder, as of now)
  • Implement (in Matlab) reading and visualizing the output from experiments (i.e. the generated .csv files; see scala code received from Kurt for reference)
  • Run an experiment to compare with the results from the paper.