## Wednesday, February 29, 2012

### Feedback 29-02-2012

Work

• I have implemented a visualization tool in Matlab which plots several statistics for the Six Hump Camel Back
• (x, y) coordinates of (non-greedy) samples m to n
• error against sample number
• best reward against samples number
• reward against samples number
• Below are some results

• Kurt/Michael told me the F-test using 1 to n degrees of freedom is correct
• See Regression problems section of the F-test wikipedia page for reference
• In my case, two regression models are being compared representing before and after split. The first has one parameter (average before split) and the other two parameters (average of left child after split and the average of the right child after splitting).
• p2 - p1 is therefore 2 - 1 = 1 and n - p2 is n - 2
• The residual sum of squares (of sum of squared errors) is calculated as follows
• RSS = ∑ (y - f(x))² where y is the return of one sample and f(x) is the mean of the data
• In my case the input consists of 6 values; countLeft, countRight, ySumLeft, ySumRight, ySumSquaredLeft, ySumSquaredRight
• The totalCount, yTotalSum and yTotalSquaredSum values of the parent node can be calculated by summing the left child's value and the right child's value
• The mean of the three data sets are of course the sum divided by the count
• Given these values the RSS can also be calculated by
• RSS = ySumSquared - (2 * yMean * ySum) + (count * yMean²)
• With this formula to calculate the RSS, the wikipedia's formula can be solved
• RSS1 = yTotalSumSquared - (2 * yTotalMean * yTotalSum) + (totalCount * yTotalMean²)
• RSS21 = yLeftSumSquared - (2 * yLeftMean * yLeftSum) + (leftCount * yLeftMean²)
• RSS22 = yRightSumSquared - (2 * yRightMean * yRightSum) + (rightCount * yRightMean²)
• Finally, a split should be introducted when the statistics before and after the split are shown to be significantly different which is the case if (Fcalc >= Fcrit), where Fcrit is the critical value from the F distribution table for degrees of freedom 1 and (totalCount - 2)
•  It turned out that my F-test implementation based on variances (from Scala code) returned the same Fcalc values as the method with RSS explained above

Planning
• Think about the names of all 4 combinations of algorithms
• Implement TLS taking the regression tree component in a multi-step environment. Think about the representation of the sequence of the actions fed to the TLS component (and later to the HOLOP component)

## Friday, February 24, 2012

### Meeting 24-02-2012

Action Points
• I explained that I just fixed a bug regarding the adaptive UCT constant. During selection each child should take the same UCT constant (dependant on the range of the parent node), whereas at first I used the ranges of the children themselves meaning the children would use different UCT constants. In some cases a "bad" child would be preferred over a "good" child (with high average / low number of trails) when this bad child has a low minimum (and so a large range size).
• I showed the behaviour and results in RL Glue and .cvs data; this looked good
• Two problems are still open and are to be investigated (Kurt; thanks in advance)
• From the Scala code it can be concluded that a T-test is used for the Six Hump Camel Back experiment, while the paper itself states an F-test is used?
• The F-values used in the Scala code are of the entries 1 and n degrees of freedom of the F-Distribution table, while I though it would be more logical to use the entry n and n (because the degrees of freedom is equal for both sides)?
• HOLOP and TLS and the combinations have been explained to me. On the one hand you have a "tree of trees" taking either a regression tree (as in TLS) or HOO and on the other hand you have a "tree of sequences" which can also either be in the form of a regression tree or HOO (as in HOLOP). A sequence can in this case be seen as the cartesian product / concatenation of the actions taken during an episode.
• Besides these four algorithms, Michael proposed UCT with pre-discretization. I said that I already implemented a simple agent that pre-discretizes the action space (in a specified number of "parts") and uses a basic UCT algorithm to learn. I did this to test the MCTS / UCT methods / components which are also used in the regression trees.
• I announced that I have a job and will be working two days a week from now on. This means my productivity will slightly decrease.
Planning
• Implement (in Matlab) reading and visualizing the output from experiments
• Think about the names of all 4 combinations of algorithms
• Implement TLS taking the regression tree component in a multi-step environment. Think about the representation of the sequence of the actions fed to the TLS component (and later to the HOLOP component)
Next meeting
• Wednesday, March 7, 2012, 11:00
• Andreas and Lukas are also invited to this meeting

## Wednesday, February 15, 2012

### Feedback 15-02-2012

Work

• Both the environment and  agent visualization of RL Glue indicate the best sample and best regression tree leaf (indicated as a 1D or 2D area by their ranges) by the color green
• All leafs in the regression tree remember the best sample seen which is picked in case the best greedy action is chosen
• Implemented memorization of the samples adjustable by three parameters
• Maximum number of samples to store at any time in a leaf
• Maximum number of samples to pass to best child in case of a split
• Maximum number of samples to pass to worst child in case of a split
• I have looked into the Scala code and the algorithm is now able to converge to the global optimum within approximately 200 samples (as seen in the visualization);
• A problem I noticed is that for the Six Hump Camel Back a lot of times a split was introduced at the very edge of the state space, leaving a group of only one sample. A minimum amount of samples a child should at least have in case of a split is introduced.
• I don't know which significance test was used for the experiments in the paper, but in the Scala code the F-test is commented out and a T-test is used. I therefore also implemented a T-test so I could compare between both
• Furthermore I noticed that the F-distribution table they were using showed incorrect values. I have no clue how they got to those values. I experimented with and have the option to use both value sets; their (0.1 and 0.001) tables and Apache's F-test method (generating the "correct" values according to the literature; for any significance level).
• The most important change was the adaptive UCT constant. This makes a huge difference.
Planning
• Holiday!
• Implement the possibility to specify and read a certain properties file (instead of only a default file in the root folder, as of now)
• Implement (in Matlab) reading and visualizing the output from experiments (i.e. the generated .csv files; see scala code received from Kurt for reference)
• Run an experiment to compare with the results from the paper.

## Monday, February 13, 2012

### Meeting 13-02-2012

Action Points
• Fortunately, I had correctly implemented the splitting criteron. There were two reasons for the incorrect behaviour of the splitting.
• The parameters were not set correctly and changed them to match the parameters used for the experiments in [1]
• F-test significance level = 0.001 (I had set this too high (0.05) and therefore splits were introduced to early)
• The minimum amount of samples to collect before considering a split = 15
• MCTS exploration constant = 0.5 * (reward range size of parent node)
• Samples / statistics were completely discarted after a split to save memory. In my case, storing the samples would most likely be much more beneficial.
• We talked about the experimental implementations (see posts last week). Everything look fine, although I should add the options as described in the planning below.
• Results of an experiment I tried a couple of days ago showed that over time, there was sometimes a decrease in the reward when taking greedy actions. This is a flaw in the take greedy action method; storage of the best sample in each leaf should solve this problem.
• I will be on holiday from Wednesday February 15 to Tuesday February 21
Planning
• Store the best sample seen at each leaf of the regression tree which can be used when a greedy best action is requested
• For the RL Glue visualizations; it might be nice to color the leaf/action with the highest average reward in a different color
• Implement the memorization of samples (i.e. all attribute values and regression value of each sample) in each leaf. In case of a split these samples can be re-used by the two children (by re-inserting the samples at the parent)
• Think about and add the option to change above mentioned memorization, i.e.
• turn on memorization
• turn off memorization
• only memorize a certain number of samples
• only re-insert samples going to the best child
• etc.
• Implement the possibility to specify and read a certain properties file (instead of only a default file in the root folder, as of now)
• Implement (in Matlab) reading and visualizing the output from experiments (i.e. the generated .csv files; see scala code received from Kurt for reference)
• Run an experiment to compare with the results from the paper.
Next Meeting
• Friday, Februari 24, 2012, 11:00
[1] G. Van den Broeck and K. Driessens, “Automatic discretization of actions and states in Monte-Carlo tree search,” in Proceedings of the ECML/PKDD
2011 Workshop on Machine Learning and Data Mining in and around Games (T. Croonenborghs, K. Driessens, and O. Missura, eds.), pp. 1–12, Sep 2011.

## Friday, February 10, 2012

### Feedback 10-02-2012

Work

• Implemented the loading and saving of a .properties file holding the constants for the algorithms. One default file is loaded from the root of the project folder and can be changed / replaced. For each experiment a folder is created named after the experiment name and a subfolder named after the current time stamp (so that every result of identical experiments are "unique"). This folder is populated with the .properties file used for that experiment, a text file holding the output generated while running the experiment and possibly other output files (like .csv files).
• I looked again into the splitting criteria and read that the best option is to only find the best test and do an f-test to check whether the split is significant (no need to compare two tests to find the most significant test). For this no Hoeffding bound is needed.
Problems
• I'm still not sure if checking for a significant split is correctly understood and implemented.
• Running the Six-Hump Camel back does show some splitting but does not do what it should do.
• I should ask if the values for the constants I'm using are any good.

## Monday, February 6, 2012

### Feedback 06-02

Work

• Implemented the distinction between "in head" and "real" actions. I did this by changing the environments to support being hard-copied so that agents can use this copy when asked for an action to perform by RL Glue. Maybe not the nicest way to do it, but the other option of running two environment instances in RL Glue had two problems:
• RL Glue's manual stated that it is only possible to run one instance of agent and one instance of environment and even  if I would be possible (since I use the Java source code of RL Glue) it would require a large amount of changes in the code.
• Environments can only be reset to the starting state and not set to a specific state which is a problem after one or more steps in the real world. The "in head" siumulations would not start at the current state of the real world.
• I implemented regression tree induction
• Possible to switch between both methods of FIMT and TG (binary tree and "first n examples")
• Hoeffding bounds
• Standard Deviation Reduction
• Tau; tie breaking mechanism
• Tested the regression tree with promising results using a one-step lookahead showing the automatic discretization of the action space for the current state the agent is in.
Problems
• Do not exactly understand which and when to split;
• FIMT ranks attributes by their best split and splits the best attribute when there is enough evidence that the best attribute is better. What if there is only one attribute (one action dimension)? Should I compare the best and second best split for one attribute?
• One paper states that TG uses a standard F-test to decide whether a split is probably significant and the other states that an F-test is used to decide which is the best split? In the latter case, what is SDR used for then in TG?
Planning
• Implement test results output (see tasks of last meeting)
• Look further into incremental regression tree induction

## Friday, February 3, 2012

### Meeting 03-02

Action Points
• As of now, I've implemented the agent in RL Glue not knowing the model of the environment and sampling by playing "real" actions. I should change this so that the agent knows the model and can sample "in his head". After a specified amount of samples, the agent returns a "real" action to RL Glue. I could either run two instances of RL Glue (in head and in real world) or give the agent acces to the environment classes.
• In general, states (or observations) are not taken into consideration for TLS. In case of different starting states, one could add one depth to the tree where the root now accounts for deciding in which initial state the agent is in. The agent then starts sampling from this node.
• An example of a VFDT for the context of TLS
• Attribute values (X) = action values (one per action dimension)
• Regression value (value to predict; y) = reward
•  Choosing split points for a continuous attribute
• A test is in the form of exampleValueY <= splitPoint
• Incrementally updated as new examples arrive. 7 values have to be stored per test;
• split point value
• number of examples that pass / fail the test
• sum of y values that pass / fail  the test
• sum of squared y values that pass / fail the test
• FIMT uses a binary tree representing all the examples
• Other possibility proposed by Michael / Kurt and similar to TG's is to use the values of the first n examples as possible splitting point. After that, each new example only updates the statistics for these splitting points.
• The R in the Hoeffding bounds formula represents the size of the range of the regression value; the reward range.
• Changes RL Glue structure so that agent can distinct between "in head" and "real" simulations.
• Implement so that for each experiment, a new output folder is created containing (at least) the results and a file containing the values of all the constants used for that experiment. Possibly add being able to load this file.
• Continue implementing regression trees, and make it possible to switch between the "binary tree" method of FIMT and the "first n examples" method (of TG).
Next Meeting
• Monday, Februari 13, 2012, 13:00

## Thursday, February 2, 2012

### Feedback 02-02-2012

Work
• Implemented abstract class for experiments; easy to setup an experiment now
• Implemented some re-usable print outs experiments
• Implemented method to output an array to a csv-file for easy evaluation of results
• Implemented basic structure regarding MCTS, Regression trees and TLS
• Implemented a simple UCT agent discretizing the action space
• Parameter: discretization steps (number of actions per action dimension)
• Parameter: Exploration constant (the "C" in the UCT formula)
• Did more reading and some initial coding regarding   VFDT / SDR / Hoeffding /  FIMT
Problems Encountered
• Due to observations, nodes in the tree can represent only estimates of the states, i.e. Cart Pole has different starting states meaning the root of the MCTS tree does not always represent the same starting state.
• I do not yet fully understand  VFDT /  FIMT
• How are split points defined in case of a continuous range? Papers only state: "for each possible split point"?
• What is an example in this context?
Planning
• Continue on incremental regression tree induction