Monday, January 30, 2012

Feedback 30-01-2012


  • Finished the "Donut World" environment. Some small comments to discuss at the next meeting:
    • The agent can only turn within a limited range to the left or right (default 135.0 degrees); otherwise it could walk back and forth instead of forward along the donut.
    • If the agent would end up outside the bounds at the next step, it is placed at the point where it would touch the bounds. Another option could be: game over?
    • Several other options are configurable as parameters (see the Donut World sketch below the list):
      • step limit
      • agent's step size 
      • agent's maximum turn angle (in degrees)
      • agent's starting x, y, and angle
      • donut's x, y, radius and thickness
      • a single-value reward or a gradual reward range when the agent is on the donut
      • amount of noise (observation and/or transition noise)
  • Here is, once again, an overview of the environments that can be used to validate the algorithms. See the references/weblinks for more information.
    • Sine function optimization (upcoming; see [1])
      • 1 dimensional continuous observation and action space
      • step limit = 1
    • Six-hump camel back function optimization (see [1]; a one-step sketch follows below the list)
      • 2 dimensional continuous observation and action space
      • step limit = 1
    • Donut world
      • 3 dimensional continuous observation space (x, y and alpha of the agent)
      • 1 dimensional continuous action space (number of degrees to turn left or right)
    • Cart pole (discrete actions variant: link)
      • 4 dimensional continuous observation space
      • 1 dimensional continuous action space
    • Helicopter hovering
      • 12 dimensional continuous observation space
      • 4 dimensional continuous action space
    • Octopus arm
      • 82 dimensional continuous observation space
      • 32 dimensional continuous action space
  • Evaluation approach
    • All environments can be measured by the (cumulative) reward at each time step. These series of values can be benchmarked against other results; results will be averaged over multiple runs for better approximations (see the evaluation sketch below the list).
    • These rewards can be compared to those of other agents:
      • Random agent
      • Lukas' agent(s)
      • Some environments have been tested in the literature, which can be used for evaluation:
        • The two function optimization environments are discussed in [1]
      • The Cart pole environment with continuous actions is discussed in [2]
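
Below is a minimal sketch of how the Donut World step logic could look. All parameter and attribute names (step_size, max_turn, bound, etc.) are placeholders of my own, the bound is assumed to be a square box, and the single-value reward variant is used; the actual implementation may differ.

```python
import math
import random

class DonutWorld:
    """Minimal sketch of the Donut World environment (names and defaults are assumptions)."""

    def __init__(self, step_size=1.0, max_turn=135.0, bound=10.0,
                 donut_x=0.0, donut_y=0.0, radius=5.0, thickness=1.0,
                 noise=0.0, step_limit=100):
        self.step_size, self.max_turn, self.bound = step_size, max_turn, bound
        self.donut_x, self.donut_y = donut_x, donut_y
        self.radius, self.thickness = radius, thickness
        self.noise, self.step_limit = noise, step_limit
        self.x, self.y, self.angle = 0.0, 0.0, 0.0  # agent's starting x, y and angle
        self.t = 0

    def step(self, turn):
        # Clip the action to the allowed turning range (default +/- 135 degrees).
        turn = max(-self.max_turn, min(self.max_turn, turn))
        self.angle = (self.angle + turn) % 360.0
        # Move one step forward along the new heading, optionally with transition noise.
        dx = self.step_size * math.cos(math.radians(self.angle))
        dy = self.step_size * math.sin(math.radians(self.angle))
        self.x = min(self.bound, max(-self.bound, self.x + dx + random.gauss(0.0, self.noise)))
        self.y = min(self.bound, max(-self.bound, self.y + dy + random.gauss(0.0, self.noise)))
        # Single-value reward: 1 when the agent stands on the donut ring, 0 otherwise.
        dist = math.hypot(self.x - self.donut_x, self.y - self.donut_y)
        reward = 1.0 if abs(dist - self.radius) <= self.thickness / 2.0 else 0.0
        self.t += 1
        done = self.t >= self.step_limit
        return (self.x, self.y, self.angle), reward, done
```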
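
The two function-optimization environments boil down to a single step: the action is a point in the input domain, the reward is the (negated) function value, and the episode ends immediately. A hedged sketch using the standard six-hump camel back formula (class and method names are illustrative only, not taken from the actual code):

```python
def six_hump_camel_back(x1, x2):
    """Standard six-hump camel back benchmark function (to be minimised)."""
    return (4 - 2.1 * x1**2 + x1**4 / 3) * x1**2 + x1 * x2 + (-4 + 4 * x2**2) * x2**2

class FunctionOptimizationEnv:
    """One-step environment: the action is a point, the reward the negated function value."""

    def step(self, action):
        x1, x2 = action
        reward = -six_hump_camel_back(x1, x2)  # maximising reward = minimising f
        done = True  # step limit = 1
        return (x1, x2), reward, done
```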
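
Finally, a minimal sketch of the evaluation idea: record the cumulative reward after every step, average the curves over several independent runs, and compare against a baseline such as a random agent. It reuses the DonutWorld sketch above; the function names are placeholders.

```python
import random

def evaluate(make_env, policy, runs=10):
    """Average the cumulative-reward-per-step curves over several independent runs."""
    curves = []
    for _ in range(runs):
        env, total, curve, done = make_env(), 0.0, [], False
        obs = (env.x, env.y, env.angle)  # initial observation (sketch only)
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
            curve.append(total)
        curves.append(curve)
    # Average element-wise; all runs have the same length because of the fixed step limit.
    return [sum(c[t] for c in curves) / runs for t in range(len(curves[0]))]

# Example baseline: a random agent turning uniformly within the allowed range.
random_policy = lambda obs: random.uniform(-135.0, 135.0)
mean_curve = evaluate(DonutWorld, random_policy, runs=10)
```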
[1] G. Van den Broeck and K. Driessens, “Automatic discretization of actions and states in Monte-Carlo tree search,” in Proceedings of the ECML/PKDD 2011 Workshop on Machine Learning and Data Mining in and around Games (T. Croonenborghs, K. Driessens, and O. Missura, eds.), pp. 1–12, Sep 2011.
[2] H. Van Hasselt and M. A. Wiering, “Reinforcement learning in continuous action spaces,” in IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), pp. 272–279, 2007.

3 comments:

  1. I'll refer to your bullets by number:
    1.1 Good idea.
    1.2 Your choice. You may invent parameters for that too. Possibly, start on the disk and everything off the disk is game-over?
    1.3 Add noise as a parameter?
    2. Good overview; you could add information about state/action complexity in dimensions, possibly as a function of the planning horizon. You already ordered them by increasing complexity, so that's good.
    3.1 Not only at the end, but cumulative reward as a function of # steps?

    Keep up the good progress :)

  2. Why do you need bounds in the donut world? Coordinates are simply doubles, right? Eventually, you will run out of double values, but that should take a long time?

  3. Thanks for the feedback. I've edited the post as a result of the comments.

    1.2: I'm not really sure what's best. I made it so the agent bumps against the bounds because it sometimes occurred that the agent was no longer visible in the visualization. Besides, locations far away from the donut would not be interesting anyway.

    1.3: I had already implemented this. I've listed all the parameters on the post now.

    2 and 3.1: Changed.
