- Fixing, debugging and improving IRTI and HOO.
- Furthermore, a lot of parameter tuning for the above two algorithms.
- Michael and I had a discussion about the exploration factor C (see the sketch after this list):
  - C should be related to the reward range:
    - constant: C = globalRangeSize * K
    - adaptive: C = parentRangeSize * K
    - Lukas proposed: adaptive C = (childRangeSize / globalRangeSize) * K
  - For HOO, the last term (diam) should also be multiplied by C, or the first term (mean reward) divided by C:
    - R + C * exploration + C * diam
    - R / C + exploration + diam
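A minimal sketch of the variants above, assuming K is a hand-tuned constant and the range sizes are tracked per node; all function and variable names here are hypothetical, not the actual implementation:

```python
def constant_c(global_range_size, K):
    # constant C, scaled by the global reward range
    return global_range_size * K

def adaptive_c(parent_range_size, K):
    # adaptive C, scaled by the reward range observed in the parent node
    return parent_range_size * K

def lukas_adaptive_c(child_range_size, global_range_size, K):
    # Lukas' proposal: the child's range normalised by the global range
    return (child_range_size / global_range_size) * K

# Two ways of folding C into the HOO B-value, as discussed above:
def b_value_scaled_terms(mean_reward, exploration, diam, C):
    # R + C * exploration + C * diam
    return mean_reward + C * exploration + C * diam

def b_value_scaled_reward(mean_reward, exploration, diam, C):
    # R / C + exploration + diam
    return mean_reward / C + exploration + diam
```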
Planning
- Change the HOO formula (see action points)
- Generate reward distributions (learning curve, greedy curve, error) for several algorithms (a sketch of the Vanilla MC baseline follows this list):
  - IRTI
  - HOO
  - UCT (pre-discretization)
  - Vanilla MC (random sampling; greedy returns the best sample seen)
  - Random
- I should ask Lukas how his IRTI algorithm works, to find out why mine does not converge to sampling at the global maximum.
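A minimal sketch, under assumed names and a toy 1-D objective, of the Vanilla MC baseline and the learning/greedy curves mentioned above (none of this is the actual experiment code):

```python
import random

def objective(x):
    # Placeholder 1-D reward function on [0, 1]; the real benchmark differs.
    return -(x - 0.7) ** 2

def vanilla_mc(budget, seed=0):
    rng = random.Random(seed)
    learning_curve = []  # reward of the sample drawn at each step
    greedy_curve = []    # reward of the best sample seen so far
    best_x, best_r = None, float("-inf")
    for _ in range(budget):
        x = rng.random()      # uniform random sampling
        r = objective(x)
        if r > best_r:
            best_x, best_r = x, r
        learning_curve.append(r)
        greedy_curve.append(best_r)
    # The error curve would be the gap between the known optimum and greedy_curve.
    return best_x, learning_curve, greedy_curve

best_x, learning, greedy = vanilla_mc(budget=1000)
print(best_x, greedy[-1])
```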