Towards Human-Level Safe Reinforcement Learning in Atari Library

Afriyadi Afriyadi (1*), Wiranto Herry Utomo (2)

(1) Faculty of Computing, President University
(2) Faculty of Computing, President University
(*) Corresponding Author

Abstract


Reinforcement learning (RL) is a powerful tool for training agents to perform complex tasks. However, RL agents often learn to behave in unsafe or unintended ways, especially during the exploration phase, when the agent is still learning about its environment. This research adapts safe exploration methods from the field of robotics and evaluates their effectiveness against algorithms commonly used in complex videogame environments without safe exploration. We also propose a method for hand-crafting catastrophic states: states known in advance to be unsafe for the agent to visit. Our results show that our method, combined with hand-crafted safety constraints, outperforms state-of-the-art algorithms at certain training iterations; that is, the agent learns to behave safely while still achieving good performance. These results have implications for the future development of human-level safe learning that combines model-based RL with complex videogame environments. By developing safe exploration methods, we can help ensure that RL agents can be used in a variety of real-world applications, such as self-driving cars and robotics.
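
The full text specifies how the catastrophic states are defined for each game. As a purely illustrative sketch of the idea, the Python snippet below shows one common way to attach such hand-crafted constraints to an Atari environment: a wrapper that penalizes and terminates the episode whenever a user-supplied unsafe-state predicate fires, so that visits to known-unsafe states become a strong learning signal during exploration. The `CatastrophicStateWrapper` name, the penalty value, and the placeholder predicate are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only (not the paper's implementation) of enforcing
# hand-crafted catastrophic states via an environment wrapper.
# Assumes gymnasium with the Atari extras (ale-py) installed.
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)  # needed on newer gymnasium versions


class CatastrophicStateWrapper(gym.Wrapper):
    """Terminates the episode with a penalty on entering a hand-crafted unsafe state."""

    def __init__(self, env, is_catastrophic, penalty=-100.0):
        super().__init__(env)
        self.is_catastrophic = is_catastrophic  # maps an observation to True/False
        self.penalty = penalty  # hypothetical value; tuned per game in practice

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.is_catastrophic(obs):
            # Penalize and end the episode so the agent learns, during
            # exploration, that this region of the state space is off-limits.
            reward += self.penalty
            terminated = True
            info["catastrophe"] = True
        return obs, reward, terminated, truncated, info


# Usage: wrap an Atari environment with a game-specific unsafe-state check.
env = CatastrophicStateWrapper(
    gym.make("ALE/Breakout-v5"),
    is_catastrophic=lambda obs: False,  # placeholder; replace with a real hand-crafted check
)
```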


Keywords


reinforcement learning; videogame environment; safety constraint; safe reinforcement learning


DOI: https://doi.org/10.32736/sisfokom.v12i3.1739
