1 Introduction
The robot box-pushing problem has been studied in both the single-agent and multi-agent cases [7]. The problem entails one or more robotic agents working together to push a box from an arbitrary starting location to a goal location, through an environment that is often filled with obstacles that the box and the agents cannot pass through and must instead avoid. Robotic agents cooperatively pushing an object through an obstacle-filled environment has applications in warehouse automation and disaster relief. In the box-pushing problem, the benefit of multi-agent collaboration is twofold: the box can sometimes be too heavy for a single robot to push by itself, and the overall number of actions performed by the robot team to move the box to the goal location can be greatly reduced through collaboration and joint actions.
Previous work on the robot box-pushing problem can effectively be split into two categories: machine learning and non-machine-learning solutions. Among the non-machine-learning solutions, there have been several attempts to perform box pushing. Genetic algorithms (GA) have been applied to find a Pareto-optimal solution that minimizes the time taken and energy expended to push a box [1]. However, the GA-based methods did not include obstacles in the environment and are not necessarily suitable for obstacle-ridden environments. The noisy non-dominated sorting bee colony algorithm was applied to a box-pushing problem with few obstacles and agents, and focused more on handling a noisy environment than on the overall box-pushing problem [6]. Notably, the results of the Q-learning algorithm are affected by uncooperative actions in a way that is similar to noise. Research has also focused on path planning for box pushing given complete knowledge of the environment, by modeling objects in the environment as convex polygons and planning the box trajectory [5]. Assuming complete initial knowledge of the environment is not realistic in most applications, since the position of each robotic agent can affect the environment, so treating the environment as dynamic is preferable.

Machine learning solutions to the robot box-pushing problem consist primarily of reinforcement learning approaches, specifically Q-learning. Single-agent Q-learning has been extended to allow three agents to push a randomly placed box to the goal location, but no obstacles are present [8]. Q-learning has been applied to the box-pushing problem using decision trees for adaptive state aggregation, but the results show no significant improvement over single-agent Q-learning, and the environment contains only a single obstacle, making the test case trivial [2]. Bayesian-discrimination-function-based reinforcement learning (BRL) has been applied to allow multiple agents to push a box across a test area, but the area contains no obstacles [9]. In [7], single-agent and multi-agent Q-learning are applied to the box-pushing problem in an obstacle-ridden environment. The results show, counterintuitively, that multi-agent Q-learning cannot reliably reach the goal and the agents cannot learn from their experience, while single-agent Q-learning is reliable and does learn. Overall, machine learning solutions outperform non-machine-learning solutions, but results tend to cover trivial cases or fail to address aspects of the box-pushing problem.

In this paper, the implementation and results of [7] are extended by applying an additional Q-learning algorithm to the box-pushing problem, restructuring the action set for the agents, and changing the reward representation. For comparison, all of the algorithms tested in [7] are also tested. The results show that the extensions to the original method allow the multi-agent team to learn and converge to a policy more quickly than the previously implemented algorithms, while also improving on the single-agent case.
The remainder of the paper is structured as follows. Section 2 describes the Q-learning algorithms implemented for this problem, including the state space, actions, and reward function. Section 3 describes the experimental environment and presents the experimental results, along with a discussion. Finally, Section 4 concludes the paper.
2 Methodology
Four Q-learning algorithms are implemented in this paper: single-agent Q-learning, multi-agent Q-learning with separate Q-tables for each agent, multi-agent Q-learning with a shared Q-table, and cooperative Q-learning with more frequent updates. In this section, these algorithms will be described, along with a detailed description of the state-space representation, the possible actions for each agent, and the method for assigning reward to the agents.
In this paper, the Q-learning algorithms [4] are modeled as Markov decision processes (MDPs). An MDP is defined as a tuple $(S, A, R, \gamma)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $R$ is the reward function, and $\gamma$ is the discount factor.

2.1 Reinforcement Learning Algorithms
The single-agent Q-learning algorithm is given by

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \tag{1}$$

where $s$ is the current state, $a$ is the selected action, $\alpha$ is the learning rate, $r$ is the reward, $s'$ is the next state based on the transition function and the selected action, and $a'$ is the next action from state $s'$.
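As a concrete illustration, the tabular update in (1) can be sketched in a few lines of Python. The table sizes, parameter values, and toy transition below are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: 4 states, 2 actions (illustrative sizes only).
Q = np.zeros((4, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```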
The multi-agent (independent) Q-learning algorithm, in which each agent has its own Q-table, is given by

$$Q_i(s_i, a_i) \leftarrow Q_i(s_i, a_i) + \alpha \left[ r_i + \gamma \max_{a_i'} Q_i(s_i', a_i') - Q_i(s_i, a_i) \right] \tag{2}$$

where $Q_i$ is agent $i$'s Q-table, $s_i$ is the state of agent $i$ with $i = 1, \ldots, n$, $a_i$ is the action taken by agent $i$, $s_i'$ is the next state of agent $i$, and $a_i'$ is the next action taken by agent $i$.
Similarly, the multi-agent (independent) Q-learning algorithm in which all agents share a Q-table is given by

$$Q(s_i, a_i) \leftarrow Q(s_i, a_i) + \alpha \left[ r_i + \gamma \max_{a_i'} Q(s_i', a_i') - Q(s_i, a_i) \right] \tag{3}$$

with the difference being that $Q_i$ is now simply $Q$.
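The only difference between (2) and (3) is which table receives the update. A minimal sketch makes this concrete; the table sizes, parameters, and transition values are illustrative assumptions.

```python
import numpy as np

N_STATES, N_ACTIONS, N_AGENTS = 8, 6, 3
ALPHA, GAMMA = 0.1, 0.9

def td_update(Q, s, a, r, s_next):
    """Apply the independent Q-learning update (2)/(3) to whichever table is passed in."""
    Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])

# Eq. (2): one table per agent -- agent 0's experience never reaches agent 1's table.
separate = [np.zeros((N_STATES, N_ACTIONS)) for _ in range(N_AGENTS)]
td_update(separate[0], s=1, a=2, r=1.0, s_next=3)

# Eq. (3): a single table -- every agent's experience accumulates in the same place.
shared = np.zeros((N_STATES, N_ACTIONS))
for i in range(N_AGENTS):
    td_update(shared, s=1, a=2, r=1.0, s_next=3)

print(separate[1][1, 2], shared[1, 2])  # only the shared table saw all three updates
```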
Finally, the cooperative Q-learning algorithm [3] is given by

$$Q_i(s_i, a_i) \leftarrow w\, Q_i(s_i, a_i) + \frac{1 - w}{|N_i|} \sum_{j \in N_i} Q_j(s_j, a_j) + \alpha \left[ r_i + \gamma \max_{a_i'} Q_i(s_i', a_i') - Q_i(s_i, a_i) \right] \tag{4}$$

where $w$ is the weight determining whether the agent's own Q-table or the neighboring agents' Q-tables have more impact on the update, $|N_i|$ is the number of neighbors of agent $i$ within its detection range, $s_j$ is the state of neighbor $j$ with $j \in N_i$, and $a_j$ is the action of neighbor $j$.
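The cooperative update can be sketched as follows: agent $i$'s Q-value is blended with the mean of its neighbors' values before the usual TD step is applied. The neighbor sets, weight, and table sizes below are placeholder assumptions for illustration.

```python
import numpy as np

ALPHA, GAMMA, W = 0.1, 0.9, 0.7  # W weights the agent's own table vs. its neighbors'

def coop_update(Q_tables, i, neighbors, s, a, r, s_next):
    """Blend agent i's Q-value with its neighbors' values, then apply the TD step (cf. Eq. (4))."""
    Qi = Q_tables[i]
    own = Qi[s, a]
    if neighbors:
        neigh = np.mean([Q_tables[j][s, a] for j in neighbors])
        blended = W * own + (1 - W) * neigh
    else:
        blended = own
    Qi[s, a] = blended + ALPHA * (r + GAMMA * np.max(Qi[s_next]) - own)

Q_tables = [np.zeros((8, 6)) for _ in range(3)]
Q_tables[1][2, 0] = 1.0  # pretend neighbor 1 has already learned something useful here
coop_update(Q_tables, i=0, neighbors=[1, 2], s=2, a=0, r=0.5, s_next=4)
print(Q_tables[0][2, 0])  # 0.7*0 + 0.3*0.5 + 0.1*(0.5 + 0.9*0 - 0) = 0.2
```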
2.2 State Space Representation
Similar to other research in this area, the state space for the box-pushing environment is represented by 13 bits. Bits 0–4 encode the goal orientation $\theta_g$, which is converted to a bit string through the equation

$$b_g = \left\lfloor \frac{\theta_g}{360^\circ / 2^5} \right\rfloor \tag{5}$$

where $\lfloor \cdot \rfloor$ rounds toward negative infinity.
The remaining bits are coded based on the presence of obstacles within the respective agent's sensing range. The circle representing the agent's sensing range is split into eight sectors of equal size, each corresponding to one bit in the state representation. If an obstacle lies within the sensing range in a given sector, the corresponding bit is 1; otherwise it is 0. Bits are ordered counterclockwise starting from zero degrees, most significant to least significant. Fig. 1 shows a visual representation of the obstacle coding for the environment, together with the coded state for that particular example.
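The 13-bit coding described above can be sketched as follows. The 5-bit quantization step, the 45-degree sector geometry, and the bit-packing order are assumptions consistent with the description, not values taken from the paper.

```python
import math

def encode_state(goal_angle_deg, obstacle_angles_deg, sensing_range, obstacle_dists):
    """Pack the state as 5 goal-orientation bits followed by 8 obstacle-sector bits."""
    # Goal orientation quantized into 2**5 = 32 sectors (Eq. (5) reconstruction).
    goal_bits = int(math.floor((goal_angle_deg % 360.0) / (360.0 / 32)))
    # One bit per 45-degree sector, set if an in-range obstacle lies in that sector.
    sector_bits = 0
    for ang, dist in zip(obstacle_angles_deg, obstacle_dists):
        if dist <= sensing_range:
            sector = int((ang % 360.0) // 45)  # sectors counted counterclockwise from 0 deg
            sector_bits |= 1 << sector
    return (goal_bits << 8) | sector_bits

# Goal at 90 deg; one obstacle in range at 10 deg, one out of range at 200 deg.
state = encode_state(90.0, [10.0, 200.0], sensing_range=5.0, obstacle_dists=[3.0, 9.0])
print(format(state, '013b'))
```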
2.3 Action Representation
A representation of the actions is shown in Fig. 2(a). As an extension of previous work in this area, the actions were redefined and simplified as follows: actions 1 through 4 relate only to the translation of the box, whereas actions 5 and 6 rotate it. The box is considered heavy enough that the rotation actions do not influence the translation of the box; in other words, a rotation action simply rotates the box around its center point without changing the position of the center point. Since the actions can be applied to the box regardless of its orientation, a method is needed to ensure the correct movement. The result of actions 1 through 4 can be expressed as

$$x' = x \pm \frac{d}{\sqrt{1 + m^2}}, \qquad y' = y \pm \frac{m\, d}{\sqrt{1 + m^2}} \tag{6}$$

where $(x', y')$ is the new location of the box center after taking an action, $m$ is the slope of the line that represents the box angle, $d$ is the amount of movement resulting from the action, the sign depends on which action is taken, and $(x, y)$ is the old location of the box center point (see Fig. 2(b)).
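Under the reconstruction of (6), a translation action moves the box center a distance $d$ along (or against) the line of slope $m$; a short sketch, with the `forward` flag standing in for the action's sign:

```python
import math

def move_box(x, y, m, d, forward=True):
    """Translate the box center by distance d along the line of slope m (Eq. (6) reconstruction)."""
    denom = math.sqrt(1.0 + m * m)
    sign = 1.0 if forward else -1.0
    return x + sign * d / denom, y + sign * m * d / denom

# A box angled at 45 degrees (slope 1) pushed a distance sqrt(2):
x_new, y_new = move_box(0.0, 0.0, m=1.0, d=math.sqrt(2.0))
print(round(x_new, 6), round(y_new, 6))  # moves one unit along each axis
```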
2.4 Reward Function
Adopted from previous work [7], the reward function for all the algorithms implemented in this work is defined by a series of conditions that calculate the total reward. The total reward consists of three parts.
Part 1) The first part of the reward is dedicated to the distance between the box and the final goal:

$$r_1 = -\sqrt{\Delta x^2 + \Delta y^2} \tag{7}$$

where $\Delta x$ and $\Delta y$ are measured from the goal to the center point of the box.
Part 2) The second part of the reward emphasizes the rotation of the box:

$$r_2 = \begin{cases} +c_r, & |\theta' - \theta| \le \theta_{\max} \\ -c_r, & |\theta' - \theta| > \theta_{\max} \end{cases} \tag{8}$$

where $\theta$ is the previous angle of the box, $\theta'$ is the new angle of the box, and the difference $|\theta' - \theta|$ gives the number of degrees of rotation. The main benefit of this part is to discourage rotations of more than the threshold $\theta_{\max}$ degrees. The objective here is to prevent the box from constantly rotating while still providing positive rewards for the small rotations required to evade obstacles.
Part 3) The final part of the reward is concerned with avoiding the obstacles and is simplified as

$$r_3 = \begin{cases} -c_o, & \text{an obstacle is detected within the sensing range} \\ 0, & \text{otherwise} \end{cases} \tag{9}$$

This part of the reward was previously defined as a set of more complicated conditions, which in our tests did not make a significant difference.
The total reward is defined as follows:

$$r = w_1 r_1 + w_2 r_2 + w_3 r_3 \tag{10}$$

where $w_1$, $w_2$, and $w_3$ are weights that are assigned manually.
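Putting the three parts together, the reward computation can be sketched as below. The weights, the rotation threshold, and the penalty constants are illustrative placeholders; the paper's actual values are not reproduced here.

```python
import math

W1, W2, W3 = 1.0, 0.5, 2.0   # manually assigned weights, as in Eq. (10); values assumed
ROT_LIMIT_DEG = 45.0          # rotation threshold theta_max (placeholder value)

def total_reward(dx, dy, theta_old, theta_new, obstacle_in_range):
    r1 = -math.hypot(dx, dy)                    # Part 1: negative distance to the goal
    rot = abs(theta_new - theta_old)
    r2 = 1.0 if rot <= ROT_LIMIT_DEG else -1.0  # Part 2: penalize large rotations
    r3 = -1.0 if obstacle_in_range else 0.0     # Part 3: simplified obstacle penalty
    return W1 * r1 + W2 * r2 + W3 * r3

r = total_reward(dx=3.0, dy=4.0, theta_old=0.0, theta_new=10.0, obstacle_in_range=False)
print(r)  # 1.0*(-5.0) + 0.5*1.0 + 2.0*0.0 = -4.5
```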
3 Experimental Results
3.1 Experimental Environment and Setup
A simulation system was developed in MATLAB to implement the algorithms. A rectangular testing area was defined, with the obstacle locations limited to fixed ranges on the x-axis and y-axis. The obstacles were circles of fixed radius whose locations were generated randomly by the simulation program; these locations were kept the same for all the algorithms to give a fair comparison of the performance of the methods. The goal location was a circle of fixed position and radius. The box dimensions were fixed, and at the beginning of each episode the box was moved back to the origin.
The simulation for each algorithm was run for 80 episodes, with a fixed maximum number of iterations per episode. Each episode ends when either the box reaches the goal or the maximum number of iterations is reached.
The learning parameters, including the learning rate $\alpha$, the discount factor $\gamma$, the cooperative weight $w$, and the reward weights $w_1$, $w_2$, and $w_3$, were set to the same values for all the algorithms.
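The episode structure just described (run until the goal is reached or the iteration cap is hit, repeated over episodes) can be sketched as follows. The environment here is a toy stand-in, and the epsilon-greedy exploration, iteration cap, and parameter values are assumptions for illustration only.

```python
import numpy as np

MAX_ITERS, N_EPISODES = 1000, 80   # 80 episodes as in the experiments; the iteration cap is assumed
rng = np.random.default_rng(0)

def run_episode(Q, step_fn, start_state, goal_state, alpha=0.1, gamma=0.9, eps=0.1):
    """One episode: epsilon-greedy actions until the goal or the iteration cap is reached."""
    s = start_state
    for it in range(1, MAX_ITERS + 1):
        if rng.random() < eps:
            a = int(rng.integers(Q.shape[1]))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = step_fn(s, a)  # environment transition (toy stand-in)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
        if s == goal_state:
            return it              # iterations used this episode
    return MAX_ITERS

# Toy chain environment: action 1 moves toward the goal state, action 0 moves away.
def step_fn(s, a):
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 4 else -0.01)

Q = np.zeros((5, 2))
iters = [run_episode(Q, step_fn, start_state=0, goal_state=4) for _ in range(N_EPISODES)]
print(iters[0], iters[-1])  # per-episode iteration counts typically drop as learning progresses
```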
3.2 Single-agent Q-learning Results
As described in the methodology section of this paper, single-agent Q-learning is implemented first; the final selected path can be seen in Fig. 3(a).

As shown in Fig. 3(a), the algorithm has managed to choose a path that is very close to a straight line, although an exact straight-line path is not possible in this situation because of the obstacles.

A better representation of the performance of the algorithm is the number of iterations taken in each episode before reaching the goal (see Fig. 3(b)). This figure shows that at the beginning of the learning process, the algorithm needs numerous iterations (in this situation, up to 700 at one point) to find a path to the goal. As learning progresses, the algorithm finds the shortest path (in this case, after 20 episodes).

These results suggest that single-agent Q-learning is effective and can find the best path in a very reasonable amount of time, but a comparison with the other algorithms is needed to determine whether single-agent Q-learning is the most effective algorithm.
3.3 Multi-agent Q-learning Results with Separate Q-tables
As discussed in the methodology section, this algorithm uses multiple robots (in this case, three) to perform the task. Each robot takes an optimal action based on the method defined in Section 2.1 and updates its own Q-table based on the reward. Fig. 4(a) shows the final selected path using this approach.

As Fig. 4(a) shows, the box did reach the goal on the final path, but many redundant actions have taken place. It can be argued that the best path was not selected here, as the path is step-shaped. A better representation of the performance of this algorithm is shown in Fig. 4(b). The algorithm does not converge even after 80 episodes, although the number of iterations is certainly decreasing; the 80-episode limit was not sufficient here. The number of iterations in each episode is also noticeably high, another sign that the algorithm is not the most efficient one.
3.4 Multi-agent Q-learning Results with a Shared Q-table
The main difference between this algorithm and the previous one is that only one Q-table is used for all the robots. Each robot takes an optimal action and updates this single Q-table based on the reward; the Q-table is then shared among all the robots for future action selection. The final selected path for this algorithm is shown in Fig. 5(a).

Fig. 5(b) examines the performance of this algorithm. The result shows a significant improvement over multi-agent Q-learning with separate Q-tables: the number of required iterations converges after about 20 episodes. Using a single shared Q-table can therefore yield significantly better performance than using separate Q-tables for each robot.

It should be noted that although the box reaches the goal in every episode, this alone is not sufficient to count an algorithm as efficient. Since the environment dimensions are not extremely large, there is always some chance of reaching the goal after several random movements. The efficiency of an algorithm should therefore be assessed by how quickly the goal is reached.
3.5 Cooperative Q-learning with Frequent Updates
Finally, the proposed algorithm is implemented. The cooperative algorithm improves on multi-agent Q-learning in that each robot uses the results from the other robots to improve its action selection. "Frequent updates" refers to the fact that the Q-tables are updated more frequently than in either multi-agent Q-learning algorithm: each robot updates its Q-table after each action, instead of a single Q-table update for all the robots as in multi-agent Q-learning. The final path selected by this algorithm can be seen in Fig. 6(a), and its performance in Fig. 6(b).

The result in Fig. 6(b) shows a significant improvement over both multi-agent Q-learning algorithms. The algorithm needs significantly fewer iterations in each episode to reach the goal and converges to the best path after only 20 episodes. The number of iterations needed is also less than half that of the multi-agent Q-learning algorithms.

Overall, cooperative Q-learning with frequent updates produces significantly better results than the multi-agent Q-learning algorithms, whether with separate Q-tables or with a shared Q-table.
3.6 Comparison Between the Algorithms
To better discuss the performance of these four algorithms, Fig. 7 shows a comparison between them.
It is evident from this figure that cooperative Q-learning with frequent updates performs significantly better than the other algorithms. Not only does the box reach the goal in fewer iterations per episode, but the algorithm also converges surprisingly fast.

It is also noteworthy that, based on the comparison figure, single-agent Q-learning performs better than multi-agent Q-learning.
A comparison between the multi-agent and single-agent algorithms was also performed. Fig. 8(a) shows all the paths that were examined by a multi-agent algorithm before selecting the final one, while Fig. 8(b) shows the same result for the single-agent algorithm.

Significantly fewer paths are examined by the single-agent algorithm. The values of the Q-tables are the determining factor in either case, and it is very hard to explain the reason for these behaviors mathematically.
4 Conclusion
This work presented an implementation and comparison of four reinforcement learning algorithms: single-agent Q-learning, multi-agent Q-learning with separate Q-tables for each agent, multi-agent Q-learning with a shared Q-table, and cooperative Q-learning with more frequent updates.
A framework was implemented in MATLAB to easily compare the performance of these algorithms. The results were shown graphically as the final selected path for each algorithm and the number of iterations required in each episode before reaching the goal. The comparison showed that cooperative Q-learning with more frequent updates performed significantly better than the other algorithms. Notably, the multi-agent Q-learning algorithms (whether with separate Q-tables or a shared Q-table) did not perform as well as the single-agent Q-learning algorithm.

Cooperative Q-learning with more frequent updates reached the goal in significantly fewer iterations and also converged to the best path remarkably fast.

The overall conclusion is that the multi-agent Q-learning algorithms should be replaced with cooperative Q-learning whenever possible. In addition, the robots should update their Q-tables more frequently, as this resulted in remarkably better and faster convergence to the best path.
References

[1] Chakraborty, J., Konar, A., Nagar, A., Das, S.: Rotation and translation selective Pareto optimal solution to the box-pushing problem by mobile robots using NSGA-II. In: 2009 IEEE Congress on Evolutionary Computation, pp. 2120–2126 (May 2009). https://doi.org/10.1109/CEC.2009.4983203
[2] Hwang, K.S., Ling, J.L., Wang, W.H.: Adaptive reinforcement learning in box-pushing robots. In: 2014 IEEE International Conference on Automation Science and Engineering (CASE), pp. 1182–1187 (Aug 2014). https://doi.org/10.1109/CoASE.2014.6899476
[3] La, H.M., Lim, R., Sheng, W.: Multirobot cooperative learning for predator avoidance. IEEE Transactions on Control Systems Technology 23(1), 52–63 (Jan 2015). https://doi.org/10.1109/TCST.2014.2312392
[4] La, H.M., Lim, R.S., Sheng, W., Chen, J.: Cooperative flocking and learning in multi-robot systems for predator avoidance. In: 2013 IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems, pp. 337–342 (May 2013). https://doi.org/10.1109/CYBER.2013.6705469
[5] Parra-Gonzalez, E.F., Ramirez-Torres, J.G., Toscano-Pulido, G.: A new object path planner for the box pushing problem. In: 2009 Electronics, Robotics and Automotive Mechanics Conference (CERMA), pp. 119–124 (Sept 2009). https://doi.org/10.1109/CERMA.2009.54
[6] Rakshit, P., Konar, A., Nagar, A.K.: Multi-robot box-pushing in presence of measurement noise. In: 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 4926–4933 (July 2016). https://doi.org/10.1109/CEC.2016.7744422
[7] Wang, Y., de Silva, C.W.: Multi-robot box-pushing: Single-agent Q-learning vs. team Q-learning. In: 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3694–3699 (Oct 2006). https://doi.org/10.1109/IROS.2006.281729
[8] Wang, Y., de Silva, C.W.: An object transportation system with multiple robots and machine learning. In: Proceedings of the 2005 American Control Conference, pp. 1371–1376 vol. 2 (June 2005). https://doi.org/10.1109/ACC.2005.1470156
[9] Yasuda, T., Ohkura, K., Yamada, K.: Multi-robot cooperation based on continuous reinforcement learning with two state space representations. In: 2013 IEEE International Conference on Systems, Man, and Cybernetics, pp. 4470–4475 (Oct 2013). https://doi.org/10.1109/SMC.2013.760