Colin Schepers' Blog: February 2012

Wednesday, February 29, 2012

Feedback 29-02-2012

Work

I have implemented a visualization tool in Matlab which plots several statistics for the Six Hump Camel Back

(x, y) coordinates of (non-greedy) samples m to n
error against sample number
best reward against samples number
reward against samples number

Below are some results

Kurt/Michael told me the F-test using 1 to n degrees of freedom is correct

See Regression problems section of the F-test wikipedia page for reference
In my case, two regression models are being compared representing before and after split. The first has one parameter (average before split) and the other two parameters (average of left child after split and the average of the right child after splitting).
p2 - p1 is therefore 2 - 1 = 1 and n - p2 is n - 2
The residual sum of squares (of sum of squared errors) is calculated as follows

RSS = ∑ (y - f(x))² where y is the return of one sample and f(x) is the mean of the data

In my case the input consists of 6 values; countLeft, countRight, ySumLeft, ySumRight, ySumSquaredLeft, ySumSquaredRight
The totalCount, yTotalSum and yTotalSquaredSum values of the parent node can be calculated by summing the left child's value and the right child's value
The mean of the three data sets are of course the sum divided by the count
Given these values the RSS can also be calculated by

RSS = ySumSquared - (2 * yMean * ySum) + (count * yMean²)

With this formula to calculate the RSS, the wikipedia's formula can be solved

RSS1 = yTotalSumSquared - (2 * yTotalMean * yTotalSum) + (totalCount * yTotalMean²)
RSS21 = yLeftSumSquared - (2 * yLeftMean * yLeftSum) + (leftCount * yLeftMean²)
RSS22 = yRightSumSquared - (2 * yRightMean * yRightSum) + (rightCount * yRightMean²)
RSS2 = RSS21 + RSS22
Fcalc = (RSS1 - RSS2) / (RSS2 / (totalCount - 2))

Finally, a split should be introducted when the statistics before and after the split are shown to be significantly different which is the case if (Fcalc >= Fcrit), where Fcrit is the critical value from the F distribution table for degrees of freedom 1 and (totalCount - 2)

It turned out that my F-test implementation based on variances (from Scala code) returned the same Fcalc values as the method with RSS explained above

Planning

Write everything relevant from this blog into the Latex draft already
Think about the names of all 4 combinations of algorithms
Implement TLS taking the regression tree component in a multi-step environment. Think about the representation of the sequence of the actions fed to the TLS component (and later to the HOLOP component)

Friday, February 24, 2012

Meeting 24-02-2012

Action Points

I explained that I just fixed a bug regarding the adaptive UCT constant. During selection each child should take the same UCT constant (dependant on the range of the parent node), whereas at first I used the ranges of the children themselves meaning the children would use different UCT constants. In some cases a "bad" child would be preferred over a "good" child (with high average / low number of trails) when this bad child has a low minimum (and so a large range size).
I showed the behaviour and results in RL Glue and .cvs data; this looked good
Two problems are still open and are to be investigated (Kurt; thanks in advance)

From the Scala code it can be concluded that a T-test is used for the Six Hump Camel Back experiment, while the paper itself states an F-test is used?
The F-values used in the Scala code are of the entries 1 and n degrees of freedom of the F-Distribution table, while I though it would be more logical to use the entry n and n (because the degrees of freedom is equal for both sides)?

HOLOP and TLS and the combinations have been explained to me. On the one hand you have a "tree of trees" taking either a regression tree (as in TLS) or HOO and on the other hand you have a "tree of sequences" which can also either be in the form of a regression tree or HOO (as in HOLOP). A sequence can in this case be seen as the cartesian product / concatenation of the actions taken during an episode.
Besides these four algorithms, Michael proposed UCT with pre-discretization. I said that I already implemented a simple agent that pre-discretizes the action space (in a specified number of "parts") and uses a basic UCT algorithm to learn. I did this to test the MCTS / UCT methods / components which are also used in the regression trees.
I announced that I have a job and will be working two days a week from now on. This means my productivity will slightly decrease.

Planning

Implement (in Matlab) reading and visualizing the output from experiments
Write everything relevant from this blog into the Latex draft already
Think about the names of all 4 combinations of algorithms
Implement TLS taking the regression tree component in a multi-step environment. Think about the representation of the sequence of the actions fed to the TLS component (and later to the HOLOP component)

Next meeting

Wednesday, March 7, 2012, 11:00
Andreas and Lukas are also invited to this meeting

Wednesday, February 15, 2012

Feedback 15-02-2012

Work

Both the environment and agent visualization of RL Glue indicate the best sample and best regression tree leaf (indicated as a 1D or 2D area by their ranges) by the color green
All leafs in the regression tree remember the best sample seen which is picked in case the best greedy action is chosen
Implemented memorization of the samples adjustable by three parameters

Maximum number of samples to store at any time in a leaf
Maximum number of samples to pass to best child in case of a split
Maximum number of samples to pass to worst child in case of a split

I have looked into the Scala code and the algorithm is now able to converge to the global optimum within approximately 200 samples (as seen in the visualization);

A problem I noticed is that for the Six Hump Camel Back a lot of times a split was introduced at the very edge of the state space, leaving a group of only one sample. A minimum amount of samples a child should at least have in case of a split is introduced.
I don't know which significance test was used for the experiments in the paper, but in the Scala code the F-test is commented out and a T-test is used. I therefore also implemented a T-test so I could compare between both
Furthermore I noticed that the F-distribution table they were using showed incorrect values. I have no clue how they got to those values. I experimented with and have the option to use both value sets; their (0.1 and 0.001) tables and Apache's F-test method (generating the "correct" values according to the literature; for any significance level).
The most important change was the adaptive UCT constant. This makes a huge difference.

Planning

Holiday!
Implement the possibility to specify and read a certain properties file (instead of only a default file in the root folder, as of now)
Implement (in Matlab) reading and visualizing the output from experiments (i.e. the generated .csv files; see scala code received from Kurt for reference)
Run an experiment to compare with the results from the paper.

Monday, February 13, 2012

Meeting 13-02-2012

Action Points

Fortunately, I had correctly implemented the splitting criteron. There were two reasons for the incorrect behaviour of the splitting.

The parameters were not set correctly and changed them to match the parameters used for the experiments in [1]

F-test significance level = 0.001 (I had set this too high (0.05) and therefore splits were introduced to early)
The minimum amount of samples to collect before considering a split = 15
MCTS exploration constant = 0.5 * (reward range size of parent node)

Samples / statistics were completely discarted after a split to save memory. In my case, storing the samples would most likely be much more beneficial.

We talked about the experimental implementations (see posts last week). Everything look fine, although I should add the options as described in the planning below.
Results of an experiment I tried a couple of days ago showed that over time, there was sometimes a decrease in the reward when taking greedy actions. This is a flaw in the take greedy action method; storage of the best sample in each leaf should solve this problem.
I will be on holiday from Wednesday February 15 to Tuesday February 21

Planning

Store the best sample seen at each leaf of the regression tree which can be used when a greedy best action is requested
For the RL Glue visualizations; it might be nice to color the leaf/action with the highest average reward in a different color
Implement the memorization of samples (i.e. all attribute values and regression value of each sample) in each leaf. In case of a split these samples can be re-used by the two children (by re-inserting the samples at the parent)
Think about and add the option to change above mentioned memorization, i.e.

turn on memorization
turn off memorization
only memorize a certain number of samples
only re-insert samples going to the best child
etc.

Implement the possibility to specify and read a certain properties file (instead of only a default file in the root folder, as of now)
Implement (in Matlab) reading and visualizing the output from experiments (i.e. the generated .csv files; see scala code received from Kurt for reference)
Run an experiment to compare with the results from the paper.

Next Meeting

Friday, Februari 24, 2012, 11:00

[1] G. Van den Broeck and K. Driessens, “Automatic discretization of actions and states in Monte-Carlo tree search,” in Proceedings of the ECML/PKDD

2011 Workshop on Machine Learning and Data Mining in and around Games (T. Croonenborghs, K. Driessens, and O. Missura, eds.), pp. 1–12, Sep 2011.

Friday, February 10, 2012

Feedback 10-02-2012

Work

Implemented the loading and saving of a .properties file holding the constants for the algorithms. One default file is loaded from the root of the project folder and can be changed / replaced. For each experiment a folder is created named after the experiment name and a subfolder named after the current time stamp (so that every result of identical experiments are "unique"). This folder is populated with the .properties file used for that experiment, a text file holding the output generated while running the experiment and possibly other output files (like .csv files).
I looked again into the splitting criteria and read that the best option is to only find the best test and do an f-test to check whether the split is significant (no need to compare two tests to find the most significant test). For this no Hoeffding bound is needed.

Problems

I'm still not sure if checking for a significant split is correctly understood and implemented.
Running the Six-Hump Camel back does show some splitting but does not do what it should do.
I should ask if the values for the constants I'm using are any good.

Monday, February 6, 2012

Feedback 06-02

Work

Implemented the distinction between "in head" and "real" actions. I did this by changing the environments to support being hard-copied so that agents can use this copy when asked for an action to perform by RL Glue. Maybe not the nicest way to do it, but the other option of running two environment instances in RL Glue had two problems:

RL Glue's manual stated that it is only possible to run one instance of agent and one instance of environment and even if I would be possible (since I use the Java source code of RL Glue) it would require a large amount of changes in the code.
Environments can only be reset to the starting state and not set to a specific state which is a problem after one or more steps in the real world. The "in head" siumulations would not start at the current state of the real world.

I implemented regression tree induction

Possible to switch between both methods of FIMT and TG (binary tree and "first n examples")
Hoeffding bounds
Standard Deviation Reduction
Tau; tie breaking mechanism

Tested the regression tree with promising results using a one-step lookahead showing the automatic discretization of the action space for the current state the agent is in.

Problems

Do not exactly understand which and when to split;

FIMT ranks attributes by their best split and splits the best attribute when there is enough evidence that the best attribute is better. What if there is only one attribute (one action dimension)? Should I compare the best and second best split for one attribute?
One paper states that TG uses a standard F-test to decide whether a split is probably significant and the other states that an F-test is used to decide which is the best split? In the latter case, what is SDR used for then in TG?

Planning

Implement test results output (see tasks of last meeting)
Look further into incremental regression tree induction

Friday, February 3, 2012

Meeting 03-02

Action Points

As of now, I've implemented the agent in RL Glue not knowing the model of the environment and sampling by playing "real" actions. I should change this so that the agent knows the model and can sample "in his head". After a specified amount of samples, the agent returns a "real" action to RL Glue. I could either run two instances of RL Glue (in head and in real world) or give the agent acces to the environment classes.
In general, states (or observations) are not taken into consideration for TLS. In case of different starting states, one could add one depth to the tree where the root now accounts for deciding in which initial state the agent is in. The agent then starts sampling from this node.
An example of a VFDT for the context of TLS

Attribute values (X) = action values (one per action dimension)
Regression value (value to predict; y) = reward

Choosing split points for a continuous attribute

A test is in the form of exampleValueY <= splitPoint
Incrementally updated as new examples arrive. 7 values have to be stored per test;

split point value
number of examples that pass / fail the test
sum of y values that pass / fail the test
sum of squared y values that pass / fail the test

FIMT uses a binary tree representing all the examples
Other possibility proposed by Michael / Kurt and similar to TG's is to use the values of the first n examples as possible splitting point. After that, each new example only updates the statistics for these splitting points.

The R in the Hoeffding bounds formula represents the size of the range of the regression value; the reward range.

Tasks

Changes RL Glue structure so that agent can distinct between "in head" and "real" simulations.
Implement so that for each experiment, a new output folder is created containing (at least) the results and a file containing the values of all the constants used for that experiment. Possibly add being able to load this file.
Continue implementing regression trees, and make it possible to switch between the "binary tree" method of FIMT and the "first n examples" method (of TG).

Next Meeting

Monday, Februari 13, 2012, 13:00

Thursday, February 2, 2012

Feedback 02-02-2012

Work

Implemented abstract class for experiments; easy to setup an experiment now

Implemented some re-usable print outs experiments
Implemented method to output an array to a csv-file for easy evaluation of results
Implemented basic structure regarding MCTS, Regression trees and TLS
Implemented a simple UCT agent discretizing the action space

Parameter: discretization steps (number of actions per action dimension)
Parameter: Exploration constant (the "C" in the UCT formula)

Did more reading and some initial coding regarding VFDT / SDR / Hoeffding / FIMT

Problems Encountered

Due to observations, nodes in the tree can represent only estimates of the states, i.e. Cart Pole has different starting states meaning the root of the MCTS tree does not always represent the same starting state.
I do not yet fully understand VFDT / FIMT

How are split points defined in case of a continuous range? Papers only state: "for each possible split point"?
What is an example in this context?

Planning

Continue on incremental regression tree induction