Wednesday, March 28, 2012

Meeting 28-03-2012


Action Points
  • We all talked about our progress, results so far and current work
  • I implemented IRTI, TLS, HOO and HOLOP, although the latter two are still under development
  • By observing the agents' behavior in RL Viz I noticed a couple of things which I should look into:
    • IRTI very rarely gets stuck in a non-optimal region (Six Hump Camel Back; approximately 1 out of 100 runs)
    • The TLS agent sometimes makes a bad move in the Double Integrator
    • The sinus function seems to be a difficult problem for IRTI; sampling mostly concentrates on the local maximum at the left
    • HOO performs better than IRTI on the sinus function but very badly in the Six Hump Camel Back environment
    • HOO is much slower than IRTI
    • HOO does not explore correctly
    • HOLOP performs badly, most likely because HOO is not working correctly yet
  • I should look into these problems; the planning below lists some possible solutions
Planning
  • Try to split sooner for the Sinus environment and observe results
  • Make nodes beforehand
  • Remove storing the action ranges within the nodes (to avoid array copying each split)
  • Scale the exploration term in the calculation of U in HOO
  • Update the U and B values before the selection step
  • Change the discounting of the reward so that the power of the discount factor matches the depth (see the sketch after this list)
  • Investigate saving of images in RL Viz
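
To make the last three points a bit more concrete, here is a rough sketch of what I have in mind. The U and B formulas follow the HOO paper, with an extra scaling factor on the exploration term; all names (HooNode, explorationScale, and so on) are illustrative and not the actual code.

  // Sketch of the planned HOO bookkeeping: U gets a scalable exploration term and
  // B is the usual min/max combination; both are refreshed before each selection step.
  // nu1 and rho are the smoothness parameters from the HOO paper.
  public class HooNode {
      HooNode left, right;   // children created when this node is split
      int depth;             // h: depth of the node in the HOO tree
      int visits;            // T_{h,i}: number of times this node has been played
      double meanReward;     // empirical mean reward observed in this node
      double u = Double.POSITIVE_INFINITY;  // U_{h,i}
      double b = Double.POSITIVE_INFINITY;  // B_{h,i}

      /** Recompute U and B bottom-up; to be called before every selection step. */
      void updateBounds(int totalPlays, double explorationScale, double nu1, double rho) {
          if (left != null) left.updateBounds(totalPlays, explorationScale, nu1, rho);
          if (right != null) right.updateBounds(totalPlays, explorationScale, nu1, rho);
          if (visits == 0) { u = b = Double.POSITIVE_INFINITY; return; }
          double exploration = explorationScale * Math.sqrt(2.0 * Math.log(totalPlays) / visits);
          double diameter = nu1 * Math.pow(rho, depth);   // the v1 * p^h term
          u = meanReward + exploration + diameter;
          double childBound = (left == null || right == null)
                  ? Double.POSITIVE_INFINITY
                  : Math.max(left.b, right.b);
          b = Math.min(u, childBound);
      }

      /** Meta / sequence tree backup: the power of gamma should match the depth
       *  (plan step) at which the reward was received. */
      static double discounted(double reward, int depth, double gamma) {
          return Math.pow(gamma, depth) * reward;
      }

      public static void main(String[] args) {
          HooNode root = new HooNode();
          root.visits = 1;
          root.meanReward = discounted(0.8, 0, 0.95);  // a reward received at depth 0
          root.updateBounds(1, 1.0, 1.0, 0.5);
          System.out.println("U = " + root.u + ", B = " + root.b);
      }
  }

Updating U and B bottom-up over the whole tree each iteration is the simplest thing that is clearly correct; if it turns out to be too slow I can restrict the update to the path that was actually played.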
Next meeting
  • Individual meeting: +- April 11
  • Joint meeting: +- April 25

Saturday, March 17, 2012

Meeting 16-03-2012

Work
  • Implemented the option of time-based learning (besides simulation-based learning); a small sketch of the budget loop follows after this list
  • Implemented discounted rewards for the backtracking step
  • Added a parameter for the maximum expansion depth of the meta tree
  • Did some optimization for IRTI / RMTL and experimented a bit
  • Implemented the Double Integrator environment mentioned in several papers (sketched after this list)
  • Started the implementation of HOO
    • Seems to work for one-step one-dimensional problems (i.e. one-step Donut World)
    • Fails in Six Hump Camel Back
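
On the time-based option mentioned above: it simply replaces the fixed simulation budget with a wall-clock deadline on the planning loop. A minimal sketch with hypothetical names (the real agent obviously does more per rollout):

  // Sketch of the two budget modes for the planning loop: a fixed number of
  // simulations or a wall-clock time limit. runOneSimulation stands in for one rollout.
  public class PlanningBudget {
      static int plan(boolean timeBased, int maxSimulations, long maxMillis, Runnable runOneSimulation) {
          long deadline = System.currentTimeMillis() + maxMillis;
          int simulations = 0;
          while (timeBased ? System.currentTimeMillis() < deadline : simulations < maxSimulations) {
              runOneSimulation.run();
              simulations++;
          }
          return simulations;  // how many rollouts fitted in the budget
      }

      public static void main(String[] args) {
          int n = plan(true, 10_000, 20, () -> { /* one rollout would run here */ });
          System.out.println("rollouts within 20 ms: " + n);
      }
  }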
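The double integrator as I implemented it follows the usual formulation: a point mass on a line, the acceleration as the continuous action, and a negative quadratic reward. The constants below (step size, bounds, start state) are placeholders rather than the exact values from the papers:

  // Sketch of the double integrator: state (position p, velocity v), continuous
  // action a (acceleration), negative quadratic reward. Constants are placeholders.
  public class DoubleIntegrator {
      static final double DT = 0.05;   // integration step (placeholder value)
      double p = 1.0, v = 0.0;         // start away from the origin, at rest

      /** Apply acceleration a for one step and return the reward. */
      double step(double a) {
          a = Math.max(-1.0, Math.min(1.0, a));  // clip the action to [-1, 1]
          p += v * DT;                           // Euler integration of the position
          v += a * DT;                           // ... and of the velocity
          return -(p * p + a * a);               // penalize distance from 0 and control effort
      }

      public static void main(String[] args) {
          DoubleIntegrator env = new DoubleIntegrator();
          double ret = 0.0;
          for (int t = 0; t < 200; t++) ret += env.step(-0.5 * env.p - 0.5 * env.v);
          System.out.println("return of a simple hand-coded policy: " + ret);
      }
  }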
Action Points
  • We looked at the results from the previous post, which looked good
    • I should look into the "artifacts" still present in samples "901-1000" (some lines of samples are still visible near the (local and global) optima)
    • I still have to add error bars
  • Kurt gave me some tips for the presentation and report
  • As it was not really clear to me from the literature, I asked about the choice of splits in HOO
    • Which dimension? Random
    • At which point in the dimension's range to split? Random
  • For HOLOP, however, dimensions corresponding to early steps in the sequence should be chosen more often, as they tend to contribute more to the return (see the HOLOP paper)
  • The last term in the upper bound formula (sometimes written as v1*p^h and sometimes as diam(Ph,i)) was also not clear to me. I'm still struggling a bit with this, but on the other hand it should not affect the workings of HOO too much; my current reading of the bound is written out below.
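
For later reference, this is my current reading of the bound, in the notation of the HOO paper (so my interpretation, not a definitive statement):

  U_{h,i} = \hat{\mu}_{h,i} + \sqrt{ \frac{2 \ln n}{T_{h,i}} } + \nu_1 \rho^{h}
  B_{h,i} = \min\{ U_{h,i},\ \max( B_{h+1,2i-1},\ B_{h+1,2i} ) \}

Here n is the total number of plays so far and T_{h,i} the number of times node (h,i) has been played. The last term appears because the smoothness assumption gives diam(P_{h,i}) <= \nu_1 \rho^{h}, which is why the two notations should refer to the same quantity.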
Planning
  • Finish presentation for thesis meeting March 21
  • Finalise / Debug HOO
  • Thesis report writing / restructuring
Next meeting
  • Joint meeting: Wednesday, March 28, 2012, 10:00-12:00

Monday, March 5, 2012

Feedback 05-03-2012

Work
  • Set up structure (chapters / sections) of the report 
  • Wrote a few blocks of text for the report
  • Thought about names for the 4 combinations (Regression Tree / HOO + Meta Tree / Sequence Tree)
    • Regression-based Meta Tree Learning (RMTL)
      • similar to Tree Learning Search (TLS)
    • Hierarchical Optimistic Sequence-based Tree Learning (HOSTL)
      • similar to Hierarchical Open-Loop Optimistic Planning (HOLOP)
    • Hierarchical Optimistic Meta Tree Learning (HOMTL)
    • Regression and Sequence-based Tree Learning (RSTL)
  • While thinking about the meta tree interface for the next meeting I already had a go at the implementation, resulting in a (seemingly) working RMTL / TLS agent
  • Below are some results for a multi-step problem using RMTL / TLS
    • Environment: Donut World
    • Average of 100 episodes
    • 3-step limit
    • 10,000 simulations per step
    • Maximum reward per step = 1 (when being exactly on the middle of the donut region)
    • Gradual reward decrease to 0 towards the (inner or outer) edge of the donut
    • Minimum reward = 0 (off the donut); a small sketch of this reward function follows after the result tables
  Step   Average reward   Minimum reward   Maximum reward
  1      0.977023         0.888396         0.999574
  2      0.983432         0.768632         0.999977
  3      0.993064         0.62132          1

  Average cumulative reward   Minimum cumulative reward   Maximum cumulative reward
  2.95352                     2.555052                    2.998965
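
As a reminder of how the Donut World reward is shaped, here is a small sketch; it only maps the distance from the donut's centre to a reward, and the radii and the linear fall-off are placeholders (the actual environment may use different values and a different shape of decrease):

  // Sketch of the Donut World reward as described above: 1.0 exactly on the middle
  // of the ring, dropping to 0 at the inner/outer edge, and 0 off the donut.
  // The radii and the linear fall-off are illustrative placeholders.
  public class DonutReward {
      static final double INNER = 0.5, OUTER = 1.0;            // placeholder radii
      static final double MIDDLE = 0.5 * (INNER + OUTER);      // centre line of the ring
      static final double HALF_WIDTH = 0.5 * (OUTER - INNER);  // centre line to edge

      /** Reward as a function of the distance r from the donut's centre. */
      static double reward(double r) {
          if (r < INNER || r > OUTER) return 0.0;          // off the donut
          return 1.0 - Math.abs(r - MIDDLE) / HALF_WIDTH;  // 1 on the middle, 0 at the edges
      }

      public static void main(String[] args) {
          System.out.println(reward(MIDDLE));  // 1.0: exactly on the middle of the donut
          System.out.println(reward(OUTER));   // 0.0: on the outer edge
      }
  }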

Planning
  • Preparation for meeting Wednesday March 7th
  • Report writing
  • Debug / investigate TLS / RMTL