D.A. Cieslak, D. Thain, N.V. Chawla, “Short paper: Troubleshooting distributed systems via data mining”, Proceedings of the IEEE/HPDC, pp. 309-312, Paris, France, June 2006.
A.N. Duarte, F. Brasileiro, W. Cirne, J.A. Filho, “Collaborative fault diagnosis in grids through automated tests”, Proceedings of the IEEE/AINA, Vol. 1, pp. 69-74, Vienna, Austria, April 2006.
H. Li, D. Groep, L. Wolters, J. Templon, “Job failure analysis and its implications in a large-scale production grid”, Proceedings of the IEEE/SCIENCE, pp. 27–27, Amsterdam, The Netherlands, Dec. 2006.
K. Neocleous, M.D. Dikaiakos, V. Fragopoulou, E. Markatos, “Failure management in grids: The case of the EGEE infrastructure”, Parallel Processing Letters (World Scientific Publishing), Vol. 17, No. 4, pp. 391-410, 2007.
 D. Zeinalipour-Yazti, H. Papadakis, C. Georgiou, M.D. Dikaiakos, “Metadata ranking and pruning for failure detection in grids”, Parallel Processing Letters, Vol. 18, No. 3, pp. 371-390, Sep. 2008.
C. Dabrowski, “Reliability in grid computing systems”, Concurrency and Computation: Practice and Experience (CCPE), Special Issue on Open Grid Forum, Vol. 21, No. 8, pp. 927–959, June 2009.
A. Pellegrini, P. Di Sanzo, D.R. Avresky, “A machine learning-based framework for building application failure prediction models”, Proceedings of the IEEE/IPDPSW, pp. 1072-1081, Hyderabad, India, May 2015.
L. Shrinivas, J.F. Naughton, “Issues in applying data mining to grid job failure detection and diagnosis”, Proceedings of the IEEE/HPDC, pp. 239-240, Boston, June 2008.
Y. Yuan, Y. Wu, Q. Wang, G. Yang, W. Zheng, “Job failures in high performance computing systems: A large-scale empirical study”, Computers and Mathematics with Applications, Vol. 63, No. 2, pp. 365–377, Jan. 2012.
J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, B.H. Park, “Dynamic meta-learning for failure prediction in large-scale systems: A case study”, Proceedings of the IEEE/ICPP, pp. 157-164, Portland, OR, USA, Sep. 2008.
P. Garraghan, P. Townend, J. Xu, “An empirical failure-analysis of a large-scale cloud computing environment”, Proceedings of the IEEE/HASE, pp. 113-120, Miami Beach, FL, USA, Jan. 2014.
S. Fu, C-Z. Xu, “Exploring event correlation for failure prediction in coalitions of clusters”, Proceedings of the IEEE/ACM, pp. 1–12, Reno, NV, USA, Nov. 2007.
Z. Lan, P. Gujrati, Y. Li, Z. Zheng, R. Thakur, J. White, “A fault diagnosis and prognosis service for TeraGrid clusters”, Proceedings of the IEEE/GRID, Tsukuba, Japan, June 2007.
D.A. Cieslak, N.V. Chawla, D.L. Thain, “Troubleshooting thousands of jobs on production grids using data mining techniques”, Proceedings of the IEEE/ACM, pp. 217-224, 2008.
N.V. Chawla, D. Thain, R. Lichtenwalter, D.A. Cieslak, “Data mining on the grid for the grid”, Proceedings of the IEEE/IPDPS, pp. 1-8, Miami, FL, USA, April 2008.
H. Saadatfar, H. Fadishei, H. Deldari, “Predicting job failures in AuverGrid based on workload log analysis”, New Generation Computing, Vol. 30, No. 1, pp. 73-94, Jan. 2012.
 H. Saadatfar, H. Deldari, “A job submission manager for large-scale distributed systems based on job futurity predictor”, International Journal of Grid and Utility Computing, Vol. 5, No. 1, pp. 50-59, Dec. 2014.
A. Rosa, L.Y. Chen, W. Binder, “Predicting and mitigating jobs failures in big data clusters”, Proceedings of the IEEE/ACM, pp. 221-230, Shenzhen, China, May 2015.
X. Chen, C.D. Lu, K. Pattabiraman, “Failure analysis of jobs in compute clouds: A Google cluster case study”, Proceedings of the IEEE/ISSRE, pp. 167-177, Naples, Italy, Dec. 2014.
 A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, D.H.J. Epema, “The grid workloads archive”, Journal of Future Generation Computer Systems, Vol. 24, No. 7, pp. 672-686, Feb. 2008.