# Accelerating Reinforcement Learning: Giving Computers Training Partners

Mihir Patel

Thomas Jefferson High School for Science and Technology

*This article was originally included in the 2018 print publication of Teknos Science Journal.*

The 3D humanoid rendered on the screen struggles to maintain its balance as it learns to walk. Built on MuJoCo, a physics engine capable of modeling real-world dynamics, the creation explores an environment created by researchers from OpenAI (Brockman et al., 2016). Algorithms churn away, teaching the humanoid to walk forward through reinforcement learning. This technique allows learning through simulation: each failure tells the humanoid how to avoid falling the next time, and each success teaches it how to better maintain its balance. Reinforcement learning addresses several key problems, such as the credit assignment problem, by associating each action, such as the movement of a joint, with a specific reward. By learning these reward values and attempting to maximize them, the humanoid can learn to walk.
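The reward-maximization loop described above can be sketched with a toy example. The corridor environment and tabular Q-learning below are illustrative assumptions, far simpler than the MuJoCo humanoid: an agent on a short line of states earns a reward only at the goal, and the learned Q-values propagate that credit back to the earlier actions that led there.

```python
import random

random.seed(0)
GOAL = 4                       # states 0..4; reward arrives only at state 4
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.3
MOVES = [-1, +1]               # action 0 = step left, action 1 = step right

# Q[s][a] estimates the discounted future reward of taking action a in state s
Q = [[0.0, 0.0] for _ in range(GOAL + 1)]

def greedy(s):
    """Pick the higher-valued action, breaking ties randomly."""
    if Q[s][0] == Q[s][1]:
        return random.randrange(2)
    return 0 if Q[s][0] > Q[s][1] else 1

for episode in range(300):
    s = 0
    for step in range(50):                       # cap episode length
        a = random.randrange(2) if random.random() < EPS else greedy(s)
        s2 = max(0, min(GOAL, s + MOVES[a]))
        r, done = (1.0, True) if s2 == GOAL else (0.0, False)
        # credit assignment: bootstrap from the next state's best value,
        # so the goal reward flows backward one state at a time
        target = r + (0.0 if done else GAMMA * max(Q[s2]))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2
        if done:
            break

policy = [greedy(s) for s in range(GOAL)]
print(policy)
```

After training, the greedy policy steps right from every state: the single reward at the goal has been assigned backward through the Q-values to every action along the way.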

In the last decade, reinforcement learning has exploded in popularity and surpassed all previous benchmarks on a wide variety of complex problems. Researchers have created advanced systems that master skills ranging from reflexes to tactics and strategy. For example, Google's DeepMind recently published a paper revealing AlphaZero, a successor to AlphaGo capable of learning patterns deep enough to beat top humans in games like Go, chess, and shogi. These unprecedented advancements have paved the way for massive leaps in AI technology such as self-driving cars. More interesting still is the algorithm's potential as a research tool. Want to design a more stable building? Set up a physics simulator with earthquakes, and reinforcement learning will create more durable structures through simulation. Want to create more aerodynamic planes? Just add wind, let the algorithm modify wing shapes, and watch it search for the best design.

There remains one problem. Because reinforcement learning requires many simulations, it is a slow process. Even with advanced hardware such as modern graphics processing units (GPUs) and tensor processing units (TPUs), it still takes days to solve problems. AlphaGo, for example, took over a month and a half to surpass top-tier human Go players, even while running on Google's high-performance compute clusters equipped with hardware so advanced it was not yet publicly available (Silver et al., 2017). These resource demands have restricted the field's usage to large corporations with the time and resources to run something this complex. It simply is not possible for most researchers or organizations to apply these algorithms to real-world problems.

To increase the accessibility and utility of reinforcement learning, researchers have strived to optimize the technique by changing the algorithm and how it learns from a simulation. Originally, best practice was to use deep Q-networks (DQN), which produce noisy estimates and tend to fluctuate excessively. As a result, researchers must run far more simulations before the learned behavior stabilizes into a robust pattern. Researchers, including teams at OpenAI, have advanced a different approach, called policy gradients, which handles such tasks with less variability (Li, 2017). One such policy gradient algorithm, trust region policy optimization (TRPO), produces far more consistent and reliable results in less time (Schulman, Levine, Moritz, Jordan, & Abbeel, 2017).
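The contrast can be made concrete with a minimal sketch. The two-armed bandit below is an illustrative assumption (far simpler than the tasks discussed here): instead of learning action values as DQN does, a policy gradient method nudges the policy's parameters directly in the direction that made a good outcome more likely, here via the classic REINFORCE update with a running baseline to reduce variance.

```python
import math
import random

random.seed(0)
TRUE_MEANS = [0.2, 0.8]        # arm 1 pays more on average
theta = [0.0, 0.0]             # softmax policy parameters
LR, baseline = 0.1, 0.0

def softmax(th):
    z = [math.exp(x) for x in th]
    s = sum(z)
    return [x / s for x in z]

for t in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1      # sample from the policy
    r = random.gauss(TRUE_MEANS[a], 0.1)            # noisy reward
    # REINFORCE: grad of log pi(a) under softmax is one_hot(a) - probs;
    # scale it by how much better the reward was than the baseline
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += LR * (r - baseline) * grad
    baseline += 0.01 * (r - baseline)               # running average of rewards

print(softmax(theta))          # probability mass shifts toward the better arm
```

TRPO builds on this same gradient but constrains how far each update can move the policy, which is part of what makes its progress more stable than vanilla policy gradients or DQN.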

Recently, researchers have developed a new way to improve reinforcement learning. In many settings where reinforcement learning is applied, such as board games, there are two opposing competitors. Typically, an administrator fixes one player to a predetermined algorithm and treats it as a constant. However, the second player can also be modeled with a reinforcement learning algorithm, so that two versions train simultaneously in the same simulation, each fighting to defeat the other. This leads to an interesting scenario where game theory becomes applicable (Pinto, Davidson, Sukthankar, & Gupta, 2017). Essentially, it creates a zero-sum game in which the two algorithms can learn tabula rasa, that is, with no domain knowledge or information from human trainers.
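A toy version of such a zero-sum, tabula-rasa setup can be sketched with two learners adapting to each other in rock-paper-scissors. This is an illustrative stand-in for the simulations above, using simple multiplicative-weights updates rather than a deep reinforcement learning algorithm:

```python
import math

PAYOFF = [[0, -1, 1],          # row player's payoff: rock, paper, scissors
          [1, 0, -1],
          [-1, 1, 0]]
ETA, T = 0.05, 5000

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

pA, pB = [0.2, 0.5, 0.3], [0.4, 0.2, 0.4]   # arbitrary starting strategies
avgA = [0.0, 0.0, 0.0]                       # time-averaged strategy of A

for t in range(T):
    for i in range(3):
        avgA[i] += pA[i] / T
    # each side's expected payoff for every pure action vs the opponent's mix
    uA = [sum(PAYOFF[i][j] * pB[j] for j in range(3)) for i in range(3)]
    uB = [-sum(PAYOFF[i][j] * pA[i] for i in range(3)) for j in range(3)]
    # both players update simultaneously: each is the other's moving target
    pA = normalize([pA[i] * math.exp(ETA * uA[i]) for i in range(3)])
    pB = normalize([pB[j] * math.exp(ETA * uB[j]) for j in range(3)])

print(avgA)
```

Neither player is given any knowledge of the game; the time-averaged strategy is driven toward the uniform mixed equilibrium purely by the pressure each learner exerts on the other, which is the game-theoretic flavor of adversarial training.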

When humans learn a game, playing at maximum difficulty from the start makes it hard to learn anything. It is more beneficial to play at a difficulty equal to or slightly above one's own level, much like training with a sparring partner. This project applies that concept to reinforcement learning by taking simulations in which two versions fight each other and training them incrementally. Early results show that this method lets the algorithm learn the underlying pattern far faster than when it is thrown against the hardest case immediately. In chess, for example, an algorithm that starts against a "grandmaster"-level opponent takes a long time to learn, but one that gradually moves up the ranks of opponents learns far faster, greatly reducing training time.
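A hypothetical sketch of this curriculum idea uses a simple tabular Q-learner on corridors of increasing length (the task and stage lengths are illustrative assumptions, not the project's actual benchmarks): the agent first masters an easy short corridor, then carries its learned values into progressively harder ones, so each stage starts from a competent rather than a random policy.

```python
import random

random.seed(1)
MAX_GOAL = 8                   # the hardest task: reach state 8 from state 0
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.3
Q = [[0.0, 0.0] for _ in range(MAX_GOAL + 1)]

def train(goal, episodes):
    """Q-learning on a corridor of the given length; Q persists across calls."""
    for _ in range(episodes):
        s = 0
        for _ in range(100):                     # cap episode length
            if random.random() < EPS or Q[s][0] == Q[s][1]:
                a = random.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, min(goal, s + (-1 if a == 0 else 1)))
            r, done = (1.0, True) if s2 == goal else (0.0, False)
            target = r + (0.0 if done else GAMMA * max(Q[s2]))
            Q[s][a] += ALPHA * (target - Q[s][a])
            s = s2
            if done:
                break

# curriculum: easy goal first, then harder ones, reusing what was learned
for goal in (2, 4, 8):
    train(goal, 150)

policy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(MAX_GOAL)]
print(policy)
```

Each earlier stage leaves behind Q-values that already point toward the goal, so the harder stages begin with useful behavior instead of a blind random walk; that head start is the toy analogue of facing a gradually strengthening opponent.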

This project will benchmark this idea on industry-standard challenges to determine the extent of the speedup and how much it varies across different types of patterns, ranging from reflex actions to strategy and tactics (Duan, Chen, Houthooft, Abbeel, & Schulman, 2016). The benchmark values on common tasks, including learning to hop, walk, and run, are based on values and methods obtained from Lerrel, a researcher who has conducted similar work (personal communication, Jan. 12, 2017). So far, this project has obtained results showing 33% faster training times, with the final models being more robust to environmental variation such as wind or friction. In the future, this research could decrease the time it takes for reinforcement learning to solve complex problems, aiding in solving today's cutting-edge AI problems.

References

[1] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv. Retrieved from https://arxiv.org/abs/1606.01540

[2] Duan, Y., Chen, X., Houthooft, R., Abbeel, P., & Schulman, J. (2016). Benchmarking Deep Reinforcement Learning for Continuous Control. 33rd International Conference on Machine Learning (ICML). Retrieved from arXiv database.

[3] Li, Y. (2017). Deep Reinforcement Learning: An Overview. arXiv. Retrieved from https://arxiv.org/abs/1701.07274

[4] Pinto, L., Davidson, J., Sukthankar, R., & Gupta, A. (2017). Robust Adversarial Reinforcement Learning. arXiv. Retrieved from https://arxiv.org/abs/1703.02702

[5] Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2017). Trust Region Policy Optimization. arXiv. Retrieved from https://arxiv.org/abs/1502.05477

[6] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., . . . Hassabis, D. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv. Retrieved from https://arxiv.org/abs/1712.01815