Monday, January 30, 2012

Feedback 30-01-2012


  • Finished the "Donut World" environment. Some small comments to discuss at the next meeting:
    • The agent can only turn within a limited range to the left or right (default 135.0 degrees); otherwise it could walk back and forth instead of forward along the donut.
    • If the agent would end up outside the bounds at the next step, it is placed at the point where it would touch the bounds. Another option could be: game over?
    • Several other options are configurable as parameters (see the Donut World sketch below the list):
      • step limit
      • agent's step size 
      • agent's maximum turn angle (in degrees)
      • agent's starting x, y, and angle
      • donut's x, y, radius and thickness
      • a single-value reward or a gradual reward range when the agent is on the donut
      • amount of noise (observation and/or transition noise)
  • Here is, once again, an overview of the environments that can be used to validate the algorithms. See the references/weblinks for more information.
    • Sine function optimization (upcoming; see [1])
      • 1 dimensional continuous observation and action space
      • step limit = 1
    • Six-hump camel back function optimization (see [1]; a one-step sketch follows below the list)
      • 2 dimensional continuous observation and action space
      • step limit = 1
    • Donut world
      • 3 dimensional continuous observation space (x, y and alpha of the agent)
      • 1 dimensional continuous action space (number of degrees to turn left or right)
    • Cart pole (discrete actions variant: link)
      • 4 dimensional continuous observation space
      • 1 dimensional continuous action space
    • Helicopter hovering
      • 12 dimensional continuous observation space
      • 4 dimensional continuous action space
    • Octopus arm
      • 82 dimensional continuous observation space
      • 32 dimensional continuous action space
  • Evaluation approach
    • All environments can be measured by the (cumulative) reward at each time step. These series of values can be benchmarked against other results; results will be averaged over multiple runs for better approximations (see the evaluation sketch below the list).
    • These rewards can be compared to those of other agents:
      • Random agent
      • Lukas' agent(s)
      • Some environments have been tested in the literature, which can be used for evaluation:
        • The two function optimization environments are discussed in [1]
      • The Cart pole environment with continuous actions is discussed in [2]
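
Below is a minimal sketch of how the Donut World step logic could look. All parameter and attribute names (step_size, max_turn, bound, etc.) are placeholders of my own, the bound is assumed to be a square box, and the single-value reward variant is used; the actual implementation may differ.

```python
import math
import random

class DonutWorld:
    """Minimal sketch of the Donut World environment (names and defaults are assumptions)."""

    def __init__(self, step_size=1.0, max_turn=135.0, bound=10.0,
                 donut_x=0.0, donut_y=0.0, radius=5.0, thickness=1.0,
                 noise=0.0, step_limit=100):
        self.step_size, self.max_turn, self.bound = step_size, max_turn, bound
        self.donut_x, self.donut_y = donut_x, donut_y
        self.radius, self.thickness = radius, thickness
        self.noise, self.step_limit = noise, step_limit
        self.x, self.y, self.angle = 0.0, 0.0, 0.0  # agent's starting x, y and angle
        self.t = 0

    def step(self, turn):
        # Clip the action to the allowed turning range (default +/- 135 degrees).
        turn = max(-self.max_turn, min(self.max_turn, turn))
        self.angle = (self.angle + turn) % 360.0
        # Move one step forward along the new heading, optionally with transition noise.
        dx = self.step_size * math.cos(math.radians(self.angle))
        dy = self.step_size * math.sin(math.radians(self.angle))
        self.x = min(self.bound, max(-self.bound, self.x + dx + random.gauss(0.0, self.noise)))
        self.y = min(self.bound, max(-self.bound, self.y + dy + random.gauss(0.0, self.noise)))
        # Single-value reward: 1 when the agent stands on the donut ring, 0 otherwise.
        dist = math.hypot(self.x - self.donut_x, self.y - self.donut_y)
        reward = 1.0 if abs(dist - self.radius) <= self.thickness / 2.0 else 0.0
        self.t += 1
        done = self.t >= self.step_limit
        return (self.x, self.y, self.angle), reward, done
```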
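
The two function-optimization environments boil down to a single step: the action is a point in the input domain, the reward is the (negated) function value, and the episode ends immediately. A hedged sketch using the standard six-hump camel back formula (class and method names are illustrative only, not taken from the actual code):

```python
def six_hump_camel_back(x1, x2):
    """Standard six-hump camel back benchmark function (to be minimised)."""
    return (4 - 2.1 * x1**2 + x1**4 / 3) * x1**2 + x1 * x2 + (-4 + 4 * x2**2) * x2**2

class FunctionOptimizationEnv:
    """One-step environment: the action is a point, the reward the negated function value."""

    def step(self, action):
        x1, x2 = action
        reward = -six_hump_camel_back(x1, x2)  # maximising reward = minimising f
        done = True  # step limit = 1
        return (x1, x2), reward, done
```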
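
Finally, a minimal sketch of the evaluation idea: record the cumulative reward after every step, average the curves over several independent runs, and compare against a baseline such as a random agent. It reuses the DonutWorld sketch above; the function names are placeholders.

```python
import random

def evaluate(make_env, policy, runs=10):
    """Average the cumulative-reward-per-step curves over several independent runs."""
    curves = []
    for _ in range(runs):
        env, total, curve, done = make_env(), 0.0, [], False
        obs = (env.x, env.y, env.angle)  # initial observation (sketch only)
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
            curve.append(total)
        curves.append(curve)
    # Average element-wise; all runs have the same length because of the fixed step limit.
    return [sum(c[t] for c in curves) / runs for t in range(len(curves[0]))]

# Example baseline: a random agent turning uniformly within the allowed range.
random_policy = lambda obs: random.uniform(-135.0, 135.0)
mean_curve = evaluate(DonutWorld, random_policy, runs=10)
```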
[1] G. Van den Broeck and K. Driessens, “Automatic discretization of actions and states in Monte-Carlo tree search,” in Proceedings of the ECML/PKDD 2011 Workshop on Machine Learning and Data Mining in and around Games (T. Croonenborghs, K. Driessens, and O. Missura, eds.), pp. 1–12, Sep 2011.
[2] H. Van Hasselt and M. A. Wiering, “Reinforcement learning in continuous action spaces,” in IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), pp. 272–279, 2007.

3 comments:

  1. I'll refer to your bullets by number:
    1.1 Good idea.
    1.2 Your choice. You may invent parameters for that too. Possibly, start on the disk and everything off the disk is game-over?
    1.3 Add noise as a parameter?
    2. Good overview; you could add information about state/action complexity in dimensions, possibly as a function of the planning horizon. You already ordered them by increasing complexity, so that's good.
    3.1 Not only at the end, but cumulative reward as a function of # steps?

    Keep up the good progress :)

  2. Why do you need bounds in the donut world? Coordinates are simply doubles, right? Eventually, you will run out of double values, but that should take a long time?

  3. Thanks for the feedback. I've edited the post as a result of the comments.

    1.2: I'm not really sure what's best. I made it so the agent bumps against the bounds because it sometimes occurred that the agent was no longer visible in the visualization. Besides, locations far away from the donut would not be interesting anyway.

    1.3: I had already implemented this. I've listed all the parameters on the post now.

    2 and 3.1: Changed.
