پیاده سازی الگوریتم خوشه بندی سلسله مراتبی تراکمی بصورت موازی با روش نگاشت و کاهش

نوع مقاله: مقاله پژوهشی

نویسندگان

1 دانشکده مهندسی کامپیوتر-واحد نجف آباد-دانشگاه آزاد اسلامی -نجف آباد- ایران

2 دانشکده مهندسی کامپیوتر- واحد نجف‌آباد، دانشگاه آزاد اسلامی، نجف‌آباد، ایران مرکز تحقیقات مه داده- واحد نجف‌آباد، دانشگاه آزاد اسلامی، نجف‌آباد، ایران

چکیده

: مدل نگاشت-کاهش یک مدل برای اجرای برنامه های کاربردی داده های بزرگ می باشد. همچنین این مدل، یک مدل برنامه نویسی موازی برای نوشتن برنامه‌هایی می‌باشد که می‌توانند بر روی ابر اجرا شوند. سازمان ها بطور فزاینده ای در حال تولید داده هستند که حاصل فرایندهای کسب وکار ، فعالیت های کاربران، ردیابی وب سایت ها، حسگرها، مالی، حسابداری و غیره تولید می شوند. الگوریتم های خوشه بندی داده، به عنوان ابزاری برای تجزیه و تحلیل حجم زیاد داده به کار می روند. هدف اصلی این الگوریتم ها، این است که داده ها را در خوشه هایی دسته بندی کنند، و اشیای داده در هر خوشه با یکدیگر شباهت دارند. در این مقاله، الگوریتم خوشه بندی سلسله مراتبی متراکم که یکی از تکنیک های داده کاوی می باشد با استفاده از طراحی نگاشت و کاهش پیاده سازی شده و سپس نتایج این الگوریتم با حالت بدون نگاشت و کاهش مورد مقایسه قرار می گیرد. آزمایش‌های انجام شده نشان می‌دهد با افزایش اندازه داده های ورودی، زمان اجرا کاهش می یابد. زمان اجرای الگوریتم به روش موازی نسبت به روش ترتیبی برای مجموعه داده‌ای به اندازه 200 شی داده، 16.80% و برای مجموعه داده‌ای به اندازه 1000 شی داده، 29.26% بهبود یافت. همچنین درصد استفاده از پردازنده کل سیستم در روش موازی از 22% به 94% ارتقاء یافت.

کلیدواژه‌ها


عنوان مقاله [English]

Implementation of Agglomerative Hierarchical Clustering Algorithm Applying the Map-Reduce Parallel Approach

نویسندگان [English]

  • Fahimeh Tavakoli 1
  • Faramarz Safi-Esfahani 2
1 Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Najafabad, Iran.
چکیده [English]

The map-reduce model is a method for executing large data applications. It is also a parallel programming model for writing applications that can be executed on the cloud. Organizations are increasingly producing data that is generated by business processes, user activities, website tracking, sensors, finance, accounting, and more. Data clustering algorithms are used as tools for analyzing large volumes of data. The main purpose of these algorithms is to categorize data into clusters so that the data objects in each cluster are more similar. In this paper, a dense hierarchical clustering algorithm, one of the data mining techniques, is implemented using map-reduce design and then the results of this algorithm are compared with the usual one. Experiments show that runtime decreases with increasing input data size. The runtime of the algorithm improved by 16.80% for the 200 data-point dataset, and 29.26% for the dataset with 1000 data points. The percentage of CPU usage in the parallel system also increased from 22% to 94%.

کلیدواژه‌ها [English]

  • Map-Reduce
  • Hadoop
  • clustering algorithms
  • parallel programming
[1]     E. Begoli and J. Horey, "Principles for Effective Knowledge Discovery from Big Data", Proceeding of the IEEE/IFIP, Helsinki, Finland, pp. 1-4, Aug 2012 (doi: 10.1109/WICSA-ECSA.212.32).

[2]     A. Jain and R. Dubes,” Algorithms for Clustering Data”, NJ: Prentice-Hall, Inc., January 1988.

[3]     Jain, A.K., Murty, M.N. and Flynn, P.J., “Data clustering: a review”, ACM computing surveys (CSUR), vol. 31, no. 3, pp. 264-323, 1999 (doi: 10.1145/331499.331504).

[4]     Dean, J. and Ghemawat, S., “MapReduce: simplified data processing on large clusters”, Communications of the ACM, 51(1), pp.107-113 (doi: 10.1145/1327452.1327492).

[5]     Rohlf, F.J., "Hierarchical clustering using the minimum spanning tree Computer", The Computer Journal, vol. 16, pp. 93-95, 1973.

[6]     F. Murtagh, "Multidimensional Clustering Algorithms," Physica-Verlag, 1985.

[7]     J. Ward, "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236-244, 1963.

[8]     G. Lance and W. Williams, "A General Theory of Classificatory Sorting Strategies 1. Hierarchical Systems", The Computer Journal, vol. 9, no. 4, pp. 373-380, 1967.

[9]     W. Pengcheng, H. Fangcheng, L. Li, S. Chuanfu and J. Li, "Research on large data set clustering method based on MapReduce," Neural Computing and Applications, vol. 32, p. 93–99, 2020 (doi: 10.1007/s00521- 018-3780-y)

[10] V. S. Bawane and S. M. Kale, "Clustering algorithms in map reduce: A Review", International Journal of Computer Applications, vol. 975, pp. 15-18, 2015.

[11] D. C. Radhika. P. Lalita., "Distributed clustering for big data with map reduce," IOSR Journal of Computer Engineering, vol. 19, no. 3, pp. 25-28, May 2017 (doi: 10.9790/0661-1903032528).

[12] S. Alshammari, M. Binti Zolkepli, B. A. Rusli, "Genetic algorithm based parallel k-means data   clustering algorithm using MapReduce programming paradigm on Hadoop environment (GAPKCA)", International Conference on Soft Computing and Data Mining, SCDM 2020: Recent Advances on Soft Computing and Data Mining, Langkawi, Malaysia, Jan. 2020.

[13] Y. Miao, J. Zhang, H. Feng, L. Qiu and Y. Wen, “A fast algorithm for clustering with map reduce”, vol. 7951, pp. 25-38, Berlin, Heidelberg: Lecture Notes in Computer Science, 2013.

[14] P. P. Anchalia, A. Koundinya and S. Nk, "MapReduce design of K-means clustering algorithm", Pro­ceeding of the IEEE/ ICISA, Suwon, South Korea, pp. 1-5, June 2013 (doi: 10.1109/ICISA.2013.6579448).

[15] W. Zhao, H. Ma, Q. He, “Parallel K-means clustering based on MapReduce", Cloud Computing, CloudCom 2009. Lecture Notes in Computer Science, vol. 5931, p. 674–679, Berlin Heidelberg: Springer-Verlag LNCS, 2009 (doi: 10.1007/978-3-642-10665-1_71).

[16]  J. Ragaventhiran, D. M. Kavitha, "Map-optimize-reduce: CAN tree assisted FP-growth algorithm for clusters-based FP mining on Hadoop", Future Generation Computer Systems, vol. 103, pp. 111-122, Feb. 2020 (doi: 10.1016/j.future.2019.09.041).

[17] T. Habib Sardar, Z. Ansari, "An analysis of MapReduce efficiency in document clustering using parallel K-means algorithm", Future Computing and Informatics Journal, vol. 3, no. 2, pp. 200-209, Dec. 2018 (doi: 10.1016/j.fcij.2018.03.003).

[18] M. A. B. Haj-Kacem, C.-E. Ben N’cir. N. Essoussi., "One-pass MapReduce-based clustering method for mixed large-scale data", Journal of Intelligent Information Systems, vol. 52, pp. 619–636, June 2019 (doi: 10.1007/s10844-017-0472-5).

[19] S. Chowdam., N. Kasiviswanath. P. C. Reddy., "Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop", Journal of Big Data, vol. 4, no. 1, pp. 1-27, 2017 (doi: 10.1186/ s40537-017-0087-2).

[20] D. Teffer, R. Srinivasan, J. Ghosh, "AdaHash: Hashing-based scalable, adaptive hierarchical clustering of streaming data on MapReduce frameworks", International Journal of Data Science and Analytics, vol. 8, pp. 257–267, 2019 (doi: 10.1007/s41060-018-0145-7)

[21] "wikipedia," [Online]. Available: https://en.wikipedia.org/wiki/Jaccard_index.

[22] M. Ngazimbi, "Data clustering using MapReduce", PhD Thesis, Boise State University, 2009.

[23] T. Sun, C. Shu, F. Li, H. Yu, L. Ma, Y. Fang, "An efficient hierarchical clustering method for large dat­as­e­t­s­ with Map-Reduce", Proceeding of the IEEE/PDCAT, pp. 494-499, 2009 (doi: 10.1109/PDCAT.2009.46).

[24] S. Papadimitriou, J. Sun, "DisCo: Distributed co-clustering with MapReduce:A case study towards petab­y­t­­e-scale end-to-end mining", Proceeding of the IEEE/ICDM, Pisa, Italy, Dec. 2008 (doi: 10.1109/ICDM. 2008.142).

[25] E. M. Rasmussen and P. Willett, "Efficiency of hierarchical agglomerative clustering using the ICL distributed array processor", Journal of Documentation, vol. 45, no. 1, pp. 1-24, 1989 (doi: 10.1108/eb0268 36).

[26] C. F. Olson, "Parallel Algorithms for Hierarchical Clustering", Parallel Computing, vol. 21, no. 8, pp. 1313-1325, Aug. 1995 (10.1016/0167-8191(95)00017-I).

[27] Z. Li, K. Li, D. Xiao, L. Yang, "An adaptive parallel hierarchical clustering algorithm", Proceeding of the HPCC, Springer, Berlin, Heidelberg, pp. 97-107, 2007 (doi: 10.1007/978-3-540-75444-2_15).

[28] S. Rajasekara, "Efficient parallel hierarchical clustering algorithms", IEEE Trans. on Parallel and Dist­ri­b­u­t­e­d Systems, vol. 16, no. 6, pp. 27-34, June 2005 (doi: 10.1109/TPDS.2005.72).

[29] M. Dash, S. Petrutiu, P. Scheuermann, "Efficient parallel hierarchical clustering", IEEE Trans. on Parallel and Distributed Systems, vol. 16, no. 6, pp. 497-502, 2005 (doi: 10.1109/TPDS.2005.72).

[30] S. Guha, R. Rastogi, K. Shim, "ROCK: A robust clustering algorithm for categorical attributes", Infor­mat­i­o­n System, vol. 32, no. 8, pp. 345- 366, 2000 (doi: 10.1016/S0306-4379(00)00022-3).

[31] G. Karypis, E. Han, V. Kumar, "CHAMELEON: A hierarchical clustering algorithm using dynamic modeling", Computer, vol. 32, no. 8, pp. 68-75, Aug. 1999 (doi: 10.1109/2.781637).

[32] E. L. Johnson, H. Kargupta, "Collective, hierarchical clustering from distributed, heterogeneous data", Large-Scale Parallel Data Mining, pp. 221-244, 2000 (doi: 10.1007/3-540-46502-2_12).

[33] N. Samatova, G. Ostrouchov, A. Geist and A. Melec, "RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets", Distributed and Parallel Databases, vol. 11, no. 2, pp. 157-180, 2002 (doi: 10.1023/A:1013988102576).

[34] H. Gao, J. Jiang, L. She, Y. Fu, "A new agglomerative hierarchical clustering algorithm implementation based on the map reduce framework", International Journal of Digital Content Technology and its Applic­ati­o­ns, vol. 4, no. 3, 2016.

[35] T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: An efficient data clustering method for very large databases", SIGMOD, vol. 25, no. 2, Association for Computing Machinery, Jun. 1996 (doi: 10.1145/ 235968.233324).

[36] "Apache Hadoop," [Online]. Available: http://hadoop.apache.org. [Accessed 2019].