Distributed processing environments such as grids are among the most important platforms for meeting users' processing needs. They have the capacity to meet those needs, but they also bring problems of their own, including job failures. Attempts to overcome this problem fall broadly into two categories: resource-side methods and job-side methods. Both need some prediction of resource or job status in order to take a proactive approach to failures. However, because these environments are dynamic, the prediction models quickly lose quality and can no longer support these methods effectively. This paper first identifies the reasons why predictor quality degrades in grid environments and proposes a solution to counter it, then applies the proposed solution in the context of job failures. Experiments on two environments, AuverGrid and Grid5000, show that the proposed method increases predictor quality by 0.02 and 0.06, respectively.