Colin Schepers' Blog: Feedback 05-04-2012

Work

Lukas pointed out that my discounting in TLS was incorrect. The exponent of gamma should always match the depth in the tree, even when updating nodes lower in the tree. I fixed this.
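
As a minimal sketch of one reading of this fix (not the actual TLS code): assume a rollout stores the reward received at each depth in a hypothetical array `rewards`, and that the return credited to a node at depth `nodeDepth` sums the rewards from that depth onward. The exponent of gamma is then the absolute depth `d`, not the distance from the node being updated. The class and method names are made up for illustration.

```java
public final class DepthDiscount {
    /**
     * Return credited to the node at depth nodeDepth, where rewards[d] is the
     * reward received at depth d of the rollout (assumed layout).
     * The exponent of gamma is the absolute depth d, also for deeper nodes.
     */
    public static double returnFrom(double[] rewards, double gamma, int nodeDepth) {
        double sum = 0.0;
        for (int d = nodeDepth; d < rewards.length; d++) {
            sum += Math.pow(gamma, d) * rewards[d];
        }
        return sum;
    }
}
```

Using `gamma^(d - nodeDepth)` instead would treat every node as if it were the root, which is a different convention.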
Instantiating nodes beforehand for HOO did not make a difference in the performance.
Removing the storage of the action ranges within the nodes (to avoid copying arrays at each split) makes only a small difference in simulations per second. For compatibility and readability I decided to stick with the initial implementation, which keeps the action ranges within the nodes.
Updating U and B values before the selection step increased the number of simulations. This means a full tree traversal is done to update the U and B values of all nodes, updating the children before the parent (since the calculation of B takes the B-values of the children). To avoid a stack overflow when using a recursive post-order traversal, I decided to (also) store (references to) the nodes in a list to which the nodes are added in order of creation. Simply using a descendingIterator ensures that the children are always updated before their parent.
I tried scaling the exploration term in the calculation of U in HOO, but have not (yet) succeeded in getting the HOO algorithm to perform well in environments with rewards outside the range of 0 to 1. The rewards in the sinus environment, however, are in the range of about 0 to 0.6, and HOO performs well there as long as the splitting occurs in or near the middle (see next point).
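
The creation-order idea can be sketched as follows. This is only an illustration under assumptions: the `Node` class, its fields and the method names are hypothetical, U is taken as already computed, and B is computed as the minimum of U and the larger child B-value, with a missing child counting as positive infinity. Because a parent is always created before its children, walking the list backwards visits every child before its parent without any recursion.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

public final class CreationOrderUpdate {
    public static final class Node {
        public double u;          // upper bound U, assumed to be computed elsewhere
        public double b;          // B-value, derived from u and the children's b
        public Node left, right;  // null while the child does not exist yet
        Node(double u) { this.u = u; }
    }

    // Nodes in order of creation: a parent is always added before its children.
    private final Deque<Node> created = new ArrayDeque<>();

    public Node newNode(double u) {
        Node n = new Node(u);
        created.addLast(n);
        return n;
    }

    private static double bOf(Node child) {
        return child == null ? Double.POSITIVE_INFINITY : child.b;
    }

    /** Updates all B-values, newest node first, so children precede parents. */
    public void updateAllB() {
        Iterator<Node> it = created.descendingIterator();
        while (it.hasNext()) {
            Node n = it.next();
            n.b = Math.min(n.u, Math.max(bOf(n.left), bOf(n.right)));
        }
    }
}
```

`ArrayDeque` is used here because it offers `descendingIterator`; a `LinkedList` would work the same way.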
I implemented a choice in splitting behaviour for HOO (the decision of where to split, given the action range of one of the action dimensions):

Normally distributed around the middle of the range (with an option to change the shape of the "bell")
Uniformly distributed
Exactly in the middle
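
The three options above could look roughly like this. It is a sketch, not the actual implementation: the names `SplitChooser`, `Mode` and `sigmaFraction` are hypothetical, the "shape of the bell" is assumed to be a standard deviation given as a fraction of the range width, and normal draws that fall outside the range are assumed to be redrawn.

```java
import java.util.Random;

public final class SplitChooser {
    public enum Mode { NORMAL, UNIFORM, MIDDLE }

    /**
     * Picks a split point inside the action range [min, max].
     * sigmaFraction is only used for NORMAL: standard deviation as a
     * fraction of the range width (assumed parameterisation).
     */
    public static double splitPoint(double min, double max, Mode mode,
                                    double sigmaFraction, Random rng) {
        if (!(max > min)) {
            throw new IllegalArgumentException("range must have positive width");
        }
        double width = max - min;
        double mid = min + width / 2.0;
        switch (mode) {
            case MIDDLE:
                return mid;
            case UNIFORM:
                return min + rng.nextDouble() * width;
            case NORMAL:
                double s;
                do {
                    // Redraw until the split lies strictly inside the range.
                    s = mid + rng.nextGaussian() * sigmaFraction * width;
                } while (s <= min || s >= max);
                return s;
            default:
                throw new AssertionError(mode);
        }
    }
}
```

A small `sigmaFraction` keeps the normal splits close to the middle, while a large one approaches the uniform behaviour.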

Splitting sooner for the regression tree agent in the sinus environment did not change the reward; it keeps sampling in the leftmost peak most of the time.
I found a problem with the adaptive C value. This value can be 0, "removing" the exploration term whenever a node has count 1 in the selection step (because then min equals max, so the size of the range equals 0). For regression trees this never occurs, since a node has gathered more than 1 sample before splitting, but for HOO this can be a problem.
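
To make the problem concrete, here is a small sketch that assumes the adaptive C is simply the difference between the largest and smallest return seen at a node. The class `AdaptiveC` and the fallback in `cOrFallback` are hypothetical; the fallback is only one possible guard and not something I have implemented or evaluated.

```java
public final class AdaptiveC {
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private int count = 0;

    /** Records one observed return for this node. */
    public void add(double ret) {
        min = Math.min(min, ret);
        max = Math.max(max, ret);
        count++;
    }

    /** Size of the observed range; 0 when the node has count 0 or 1. */
    public double range() {
        return count == 0 ? 0.0 : max - min;
    }

    /** Hypothetical guard: use a constant while the range is still 0. */
    public double cOrFallback(double fallbackC) {
        double r = range();
        return r > 0.0 ? r : fallbackC;
    }
}
```

With a single sample, `range()` returns 0 and the exploration term vanishes, which is exactly the case that occurs for HOO nodes with count 1.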

Planning

Try to split sooner for the Sinus environment and observe results
Scaling the exploration term in the calculation of U in HOO
Investigate saving of images in RL Viz
Generate some pictures / results

Next meeting

Individual meeting: Thursday, April 12, 2012, 11:00 - 12:00
Joint meeting: Wednesday, May 2, 13:00 - 15:00

2 comments:

Michael Kaisers 05 April, 2012 17:59
Hey,
- can you use enumerations rather than bullets so I can refer to your points with numbers?
- your first point is a matter of definition and coherence. If you are reusing trees you may want to pretend at any depth that it is 'now', and only discount when backing up.
- the adaptive C (which is then NOT a constant) in combination with HOO is a new idea, so make sure to compare it to some constant C value.
- Are there any parameter values you can use for the reproduction of the sinus/RegressionTree results of the TLS article?

Good luck, Michael
Colin 05 April, 2012 19:19
Hi Michael,
1. It's now enumerated :)
2. Ok, I'm not reusing trees.
3. Ok, I'll compare to a constant C and observe results.
4. I will have another "dive" into the Scala code and keep on trying...
Thanks for the feedback!
Regards,
Colin

Thursday, April 5, 2012
