Thursday, April 5, 2012

Feedback 05-04-2012

Work
  1. Lukas pointed out that my discounting in TLS was incorrect: the power of gamma should always match the depth, even when updating nodes lower in the tree. I fixed this (see the first sketch below this list).
  2. Instantiating nodes beforehand for HOO did not make a difference in the performance.
  3. Removing the storage of the action ranges within the nodes (to avoid array copying each split) makes only a small difference in simulations per second. For compatibility and readability I decided to stick to the initial implementation of having the action ranges within the nodes.
  4. Updating the U and B values before the selection step increased the number of simulations. This means a full tree traversal is done to update the U and B values of all nodes, updating the children before the parent (since the calculation of B takes the B-values of the children). To avoid a stack overflow with a recursive post-order traversal, I decided to (also) store references to the nodes in a list, in order of creation. Simply using a descendingIterator then ensures the children are always updated before their parent (see the second sketch below this list).
  5. I tried scaling the exploration term in the calculation of U in HOO, but did not (yet) succeed in making the HOO algorithm perform well in environments with rewards outside the range 0 to 1. The Sinus environment, however, has rewards in the range of about 0 to 0.6, and there HOO performs well as long as the splitting occurs in or near the middle (see next point).
  6. I implemented a choice of splitting behaviour for HOO (the decision of where to split, given the range of one of the action dimensions; see the third sketch below this list)
    1. Normally distributed around the middle of the range (with an option to change the shape of the "bell")
    2. Uniformly distributed
    3. Exactly in the middle
  7. Splitting sooner for the regression tree agent in the Sinus environment did not change the reward; it keeps sampling in the leftmost peak most of the time.
  8. I found a problem regarding the adaptive C value. This value can be 0, "removing" the exploration term whenever a node has count 1 in the selection step (because then min equals max, so the size of the range is 0). For regression trees this never occurs, since a node gathers more than one sample before splitting, but for HOO this can be a problem.
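For point 1, a minimal sketch of the corrected backup. This is not the actual code (the names Node, update and backup are hypothetical), but it shows the idea that a reward observed at depth t always contributes gamma^t, also to nodes lower in the tree:

    class Node {
        Node parent;
        int count;
        double valueSum;

        void update(double v) { count++; valueSum += v; }
    }

    class DiscountedBackup {
        static void backup(Node leaf, double reward, int rewardDepth, double gamma) {
            // the exponent is fixed by the depth at which the reward was observed
            double discounted = Math.pow(gamma, rewardDepth) * reward;
            for (Node n = leaf; n != null; n = n.parent) {
                n.update(discounted); // every ancestor receives the same gamma^depth * r
            }
        }
    }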
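For point 4, a sketch of the iterative bottom-up update (again with hypothetical names, and with the diameter term of U omitted). Since a child is always created after its parent, walking the creation-order list backwards guarantees that both children are updated before their parent, without any recursion:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Iterator;

    class HooNode {
        HooNode leftChild, rightChild;
        double u = Double.POSITIVE_INFINITY;
        double b = Double.POSITIVE_INFINITY;
        int count;
        double meanReward;

        void updateU(int n) {
            if (count > 0) {
                // exploration term only; the diameter term is left out in this sketch
                u = meanReward + Math.sqrt(2 * Math.log(n) / count);
            }
        }

        void updateB() {
            if (leftChild == null && rightChild == null) {
                b = u; // unexpanded children have B = +infinity, so min(U, +inf) = U
            } else {
                b = Math.min(u, Math.max(leftChild.b, rightChild.b));
            }
        }
    }

    class BoundUpdater {
        private final Deque<HooNode> creationOrder = new ArrayDeque<>();

        void register(HooNode node) { creationOrder.addLast(node); } // called at creation

        void updateAllBounds(int n) {
            for (Iterator<HooNode> it = creationOrder.descendingIterator(); it.hasNext(); ) {
                HooNode node = it.next();
                node.updateU(n);
                node.updateB(); // its children were created later, so they are already updated
            }
        }
    }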
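For point 6, a sketch of the three splitting behaviours (names hypothetical): given the [min, max] range of the chosen action dimension, return the point at which to split:

    import java.util.Random;

    class SplitPointChooser {
        enum Strategy { NORMAL_AROUND_MIDDLE, UNIFORM, EXACT_MIDDLE }

        private final Random rng = new Random();

        // 'shape' controls the width of the bell for the normal strategy;
        // smaller values concentrate splits closer to the middle
        double choose(Strategy strategy, double min, double max, double shape) {
            double middle = (min + max) / 2.0;
            switch (strategy) {
                case NORMAL_AROUND_MIDDLE:
                    double draw = middle + rng.nextGaussian() * shape * (max - min);
                    return Math.min(max, Math.max(min, draw)); // clamp to the range
                case UNIFORM:
                    return min + rng.nextDouble() * (max - min);
                case EXACT_MIDDLE:
                default:
                    return middle;
            }
        }
    }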
Planning
  1. Try to split sooner for the Sinus environment and observe results
  2. Scaling the exploration term in the calculation of U in HOO
  3. Investigate saving of images in RL Viz
  4. Generate some pictures / results
Next meeting
  • Individual meeting: Thursday, April 12, 2012, 11:00 - 12:00
  • Joint meeting: Wednesday, May 2, 13:00 - 15:00

Wednesday, March 28, 2012

Meeting 28-03-2012


Action Points
  • We all talked about our progress, results so far and current work
  • I implemented IRTI, TLS, HOO and HOLOP, although the latter two are still under development
  • By observing the agents' behaviour in RL Viz I noticed a couple of things which I should look into
    • IRTI very rarely gets stuck in a non-optimal region (Six Hump Camel Back; approximately 1 out of 100 times)
    • The TLS agent sometimes makes a bad move in the Double Integrator
    • The Sinus function seems to be a difficult problem for IRTI; sampling is mostly done at the local maximum on the left
    • HOO performs better than IRTI on the Sinus function, but very badly in the Six Hump Camel Back environment
    • HOO is much slower than IRTI
    • HOO does not explore correctly
    • HOLOP performs badly, most likely due to the incorrect workings of HOO
  • I should look into these problems. The planning below shows some solutions to the problems above
Planning
  • Try to split sooner for the Sinus environment and observe results
  • Instantiate nodes beforehand
  • Remove storing the action ranges within the nodes (to avoid array copying each split)
  • Scale the exploration term in the calculation of U in HOO
  • Update U and B values before the selection step
  • Change the discounting of reward (the power should match the depth)
  • Investigate saving of images in RL Viz
Next meeting
  • Individual meeting: around April 11
  • Joint meeting: around April 25

Saturday, March 17, 2012

Meeting 16-03-2012

Work
  • Implemented the option of time-based learning (besides simulation-based learning)
  • Implemented discounted rewards for the backtracking step
  • Added a parameter for the maximum expansion depth of the meta tree
  • Did some optimization for IRTI / RMTL and experimented a bit
  • Implemented the Double Integrator environment mentioned in several papers
  • Started implementing HOO
    • Seems to work for one-step one-dimensional problems (i.e. one-step Donut World)
    • Fails in Six Hump Camel Back
Action Points
  • We looked at the results from the previous post, which looked good
    • I should look into the "artifacts" still present in the samples "901-1000" (you still see some lines of samples near the (local and global) optima)
    • I still have to add error bars
  • Kurt gave me some tips for the presentation and report
  • As it was not really clear to me from the literature, I asked about the choice of splitting in HOO
    • Which dimension? Random
    • At which point in the dimension's range to split? Random
  • For HOLOP, however, dimensions corresponding to early stages in the sequence should be chosen more often, as they tend to have a bigger contribution to the return (see the HOLOP paper)
  • The last term in the upper bound formula (sometimes referred to as ν1·ρ^h and sometimes as diam(P_h,i)) was also not clear to me. I'm still struggling a bit with this, but on the other hand it should not affect the workings of HOO too much (see the formulas below this list).
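As a reference for the last point, the bounds as I reconstruct them from the HOO paper (Bubeck, Munos, Stoltz and Szepesvári, "X-Armed Bandits") are

    U_h,i(n) = μ̂_h,i(n) + √(2·ln(n) / T_h,i(n)) + ν1·ρ^h
    B_h,i(n) = min( U_h,i(n), max( B_h+1,2i-1(n), B_h+1,2i(n) ) )

where μ̂_h,i(n) is the mean reward observed in node (h,i), T_h,i(n) is the number of times it was played, and B = +∞ for nodes not yet sampled. The term ν1·ρ^h is an upper bound on diam(P_h,i), the diameter of the region covered by the node at depth h, which explains why both notations appear in the literature.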
Planning
  • Finish presentation for thesis meeting March 21
  • Finalise / Debug HOO
  • Thesis report writing / restructuring
Next meeting
  • Joint meeting: Wednesday, March 28, 2012, 10:00-12:00

Monday, March 5, 2012

Feedback 05-03-2012

Work
  • Set up structure (chapters / sections) of the report 
  • Wrote a few blocks of text for the report
  • Thought about the names for the 4 combinations (Regression Tree / HOO + Meta Tree / Sequence Tree)
    • Regression-based Meta Tree Learning (RMTL)
      • similar to Tree Learning Search (TLS)
    • Hierarchical Optimistic Sequence-based Tree Learning (HOSTL)
      • similar to Hierarchical Open-Loop Optimistic Planning (HOLOP)
    • Hierarchical Optimistic Meta Tree Learning (HOMTL)
    • Regression and Sequence-based Tree Learning (RSTL)
  • While thinking about the meta tree interface for the next meeting, I already had a go at the implementation, resulting in a (seemingly) working RMTL / TLS agent
  • Below are some results for a multi-step problem using RMTL / TLS
    • Environment: Donut World
    • average of 100 episodes
    • 3 step limit
    • 10,000 simulations per step
    • maximum reward per step = 1 (when being exactly in the middle of the donut region)
    • gradual reward decrease to 0 towards the (inner or outer) edge of the donut
    • minimum reward = 0 (off the donut)
Step    Average reward    Minimum reward    Maximum reward
1       0.977023          0.888396          0.999574
2       0.983432          0.768632          0.999977
3       0.993064          0.62132           1

Average cumulative reward    Minimum cumulative reward    Maximum cumulative reward
2.95352                      2.555052                     2.998965

Planning
  • Preparation for meeting Wednesday March 7th
  • Report writing
  • Debug / investigate TLS / RMTL

Wednesday, February 29, 2012

Feedback 29-02-2012

Work

  • I have implemented a visualization tool in Matlab which plots several statistics for the Six Hump Camel Back
    • (x, y) coordinates of (non-greedy) samples m to n
    • error against sample number 
    • best reward against sample number
    • reward against sample number
  • Below are some results


  • Kurt/Michael told me the F-test using 1 and n degrees of freedom is correct
    • See the Regression problems section of the F-test Wikipedia page for reference
    • In my case, two regression models are compared, representing before and after the split. The first has one parameter (the average before the split) and the other has two parameters (the average of the left child and the average of the right child after the split).
    • p2 - p1 is therefore 2 - 1 = 1 and n - p2 is n - 2
    • The residual sum of squares (or sum of squared errors) is calculated as follows
      • RSS = ∑ (y - f(x))² where y is the return of one sample and f(x) is the mean of the data
    • In my case the input consists of 6 values: countLeft, countRight, ySumLeft, ySumRight, ySumSquaredLeft, ySumSquaredRight
    • The totalCount, yTotalSum and yTotalSumSquared values of the parent node can be calculated by summing the left child's and the right child's values
    • The means of the three data sets are of course the sums divided by the counts
    • Given these values, the RSS can also be calculated as
      • RSS = ySumSquared - (2 * yMean * ySum) + (count * yMean²)
    • With this formula for the RSS, the Wikipedia formula can be evaluated
      • RSS1 = yTotalSumSquared - (2 * yTotalMean * yTotalSum) + (totalCount * yTotalMean²) 
      • RSS21 = yLeftSumSquared - (2 * yLeftMean * yLeftSum) + (leftCount * yLeftMean²) 
      • RSS22 = yRightSumSquared - (2 * yRightMean * yRightSum) + (rightCount * yRightMean²) 
      • RSS2 = RSS21 + RSS22 
      • Fcalc = (RSS1 - RSS2) / (RSS2 / (totalCount - 2))
    • Finally, a split should be introduced when the statistics before and after the split are shown to be significantly different, which is the case if Fcalc >= Fcrit, where Fcrit is the critical value from the F-distribution table for degrees of freedom 1 and (totalCount - 2); see the sketch below this list
  • It turned out that my F-test implementation based on variances (from the Scala code) returned the same Fcalc values as the RSS method explained above
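The split test above, condensed into a small sketch. The class and method names are hypothetical, but the arithmetic follows the RSS formulas listed above:

    class SplitFTest {
        // RSS = ∑ (y - mean)² expanded into sufficient statistics:
        // ySumSquared - 2·mean·ySum + count·mean²
        static double rss(double count, double ySum, double ySumSquared) {
            double mean = ySum / count;
            return ySumSquared - 2 * mean * ySum + count * mean * mean;
        }

        static double fCalc(double countLeft, double ySumLeft, double ySumSqLeft,
                            double countRight, double ySumRight, double ySumSqRight) {
            // the parent's statistics are the sums of the children's statistics
            double totalCount = countLeft + countRight;
            double rss1 = rss(totalCount, ySumLeft + ySumRight, ySumSqLeft + ySumSqRight);
            double rss2 = rss(countLeft, ySumLeft, ySumSqLeft)
                        + rss(countRight, ySumRight, ySumSqRight);
            // (p2 - p1) = 1 and (n - p2) = totalCount - 2
            return (rss1 - rss2) / (rss2 / (totalCount - 2));
        }
    }

A split is then introduced when fCalc >= Fcrit for degrees of freedom 1 and (totalCount - 2).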

Planning
  • Write everything relevant from this blog into the Latex draft already
  • Think about the names of all 4 combinations of algorithms
  • Implement TLS taking the regression tree component in a multi-step environment. Think about the representation of the sequence of the actions fed to the TLS component (and later to the HOLOP component)

Friday, February 24, 2012

Meeting 24-02-2012


Action Points
  • I explained that I just fixed a bug regarding the adaptive UCT constant. During selection, each child should use the same UCT constant (dependent on the range of the parent node), whereas at first I used the ranges of the children themselves, meaning the children would use different UCT constants. In some cases a "bad" child would then be preferred over a "good" child (with a high average / low number of trials) when the bad child has a low minimum (and therefore a large range size). A sketch of the fixed selection step follows below this list.
  • I showed the behaviour and results in RL Glue and the .csv data; this looked good
  • Two problems are still open and are to be investigated (Kurt; thanks in advance)
    • From the Scala code it can be concluded that a T-test is used for the Six Hump Camel Back experiment, while the paper itself states an F-test is used?
    • The F-values used in the Scala code are the entries for 1 and n degrees of freedom of the F-distribution table, while I thought it would be more logical to use the entry for n and n (because the degrees of freedom are equal for both sides)?
  • HOLOP and TLS and their combinations were explained to me. On the one hand you have a "tree of trees", taking either a regression tree (as in TLS) or HOO, and on the other hand you have a "tree of sequences", which can also be either a regression tree or HOO (as in HOLOP). A sequence can in this case be seen as the Cartesian product / concatenation of the actions taken during an episode.
  • Besides these four algorithms, Michael proposed UCT with pre-discretization. I said that I had already implemented a simple agent that pre-discretizes the action space (into a specified number of "parts") and uses a basic UCT algorithm to learn. I did this to test the MCTS / UCT methods / components which are also used in the regression trees.
  • I announced that I have a job and will be working two days a week from now on. This means my productivity will slightly decrease.
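A sketch of the fixed selection step mentioned above (names hypothetical): the exploration constant is derived from the parent's observed reward range, so all children are compared with the same constant:

    import java.util.List;

    class UctNode {
        int count;
        double rewardSum, minReward, maxReward;

        double meanReward() { return rewardSum / count; }
    }

    class UctSelection {
        static UctNode select(UctNode parent, List<UctNode> children) {
            double c = parent.maxReward - parent.minReward; // same constant for every child
            UctNode best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (UctNode child : children) {
                if (child.count == 0) return child; // visit unvisited children first
                double score = child.meanReward()
                        + c * Math.sqrt(Math.log(parent.count) / child.count);
                if (score > bestScore) { bestScore = score; best = child; }
            }
            return best;
        }
    }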
Planning
  • Implement (in Matlab) reading and visualizing the output from experiments
  • Write everything relevant from this blog into the Latex draft already
  • Think about the names of all 4 combinations of algorithms
  • Implement TLS taking the regression tree component in a multi-step environment. Think about the representation of the sequence of the actions fed to the TLS component (and later to the HOLOP component)
Next meeting
  • Wednesday, March 7, 2012, 11:00
  • Andreas and Lukas are also invited to this meeting

Wednesday, February 15, 2012

Feedback 15-02-2012

Work

  • Both the environment and agent visualizations in RL Glue indicate the best sample and the best regression tree leaf (shown as a 1D or 2D area given by its ranges) in green
  • All leaves in the regression tree remember the best sample seen, which is picked when the greedy action is chosen
  • Implemented memorization of the samples, adjustable by three parameters (see the sketch below this list)
    • Maximum number of samples to store at any time in a leaf
    • Maximum number of samples to pass to best child in case of a split
    • Maximum number of samples to pass to worst child in case of a split
  • I have looked into the Scala code and the algorithm is now able to converge to the global optimum within approximately 200 samples (as seen in the visualization):
    • A problem I noticed is that for the Six Hump Camel Back, a split was often introduced at the very edge of the state space, leaving a group of only one sample. I introduced a minimum number of samples each child should at least have for a split to be allowed.
    • I don't know which significance test was used for the experiments in the paper, but in the Scala code the F-test is commented out and a T-test is used. I therefore also implemented a T-test so I could compare the two.
    • Furthermore, I noticed that the F-distribution table they were using contained incorrect values. I have no clue how they arrived at those values. I experimented with, and kept the option to use, both value sets: their (0.1 and 0.001) tables and Apache's F-test method (which generates the "correct" values according to the literature, for any significance level).
    • The most important change was the adaptive UCT constant. This makes a huge difference. 
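A rough sketch of the three memorization parameters (all names hypothetical, and the handoff policy is only one plausible reading of the description above):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    class LeafSamples {
        final int maxStored;       // max samples stored in a leaf at any time
        final int maxToBestChild;  // max samples passed to the best child on a split
        final int maxToWorstChild; // max samples passed to the worst child on a split
        final List<double[]> samples = new ArrayList<>(); // each: {action..., return}

        LeafSamples(int maxStored, int maxToBestChild, int maxToWorstChild) {
            this.maxStored = maxStored;
            this.maxToBestChild = maxToBestChild;
            this.maxToWorstChild = maxToWorstChild;
        }

        void remember(double[] sample) {
            if (samples.size() < maxStored) samples.add(sample); // drop when full
        }

        // On a split, each child keeps at most its limit of the best samples.
        List<double[]> passToChild(List<double[]> childSamples, boolean isBestChild) {
            int limit = isBestChild ? maxToBestChild : maxToWorstChild;
            childSamples.sort(Comparator.comparingDouble((double[] s) -> s[s.length - 1]).reversed());
            return childSamples.subList(0, Math.min(limit, childSamples.size()));
        }
    }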
Planning
  • Holiday!
  • Implement the possibility to specify and read a certain properties file (instead of only a default file in the root folder, as of now)
  • Implement (in Matlab) reading and visualizing the output from experiments (i.e. the generated .csv files; see scala code received from Kurt for reference)
  • Run an experiment to compare with the results from the paper.