Dhaka University Journal of Applied Science & Engineering

Issue: Vol. 7, No. 2, July 2022
Title:

A Clustering based Feature Selection Approach using Maximum Spanning Tree

Authors:
  • Md. Hasan Tarek
    Institute of Information Technology, Dhaka, Bangladesh
  • Suravi Akhter
    Institute of Information Technology, Dhaka, Bangladesh
  • Sumon Ahmed
    Institute of Information Technology, Dhaka, Bangladesh
  • Md Shariful Islam*
    Institute of Information Technology, Dhaka, Bangladesh
DOI:
Keywords:

Clustering, Maximum Spanning Tree, Feature Selection, Mutual Information

Abstract:

Mutual information (MI) based feature selection methods are gaining popularity because of their ability to capture both linear and nonlinear relationships among random variables, which helps them perform well in different fields of machine learning. Traditional MI-based feature selection algorithms use different techniques to estimate the joint performance of features and to select the relevant ones among them. However, in doing so, they may also incorporate redundant features. To solve this issue, we propose a feature selection method, namely Clustering based Feature Selection (CbFS), which clusters the features so that redundant and complementary features are grouped in the same cluster. A subset of representative features is then selected from each cluster. Experimental results for CbFS and four state-of-the-art methods are reported over twenty benchmark UCI datasets and three well-known network intrusion datasets. They show that CbFS outperforms the comparative methods in terms of accuracy and better identifies attack and normal instances in the security datasets.
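
The abstract describes the CbFS pipeline only at a high level, so the Python sketch below is one plausible reading of the general idea rather than the published algorithm: weight a complete graph by pairwise MI between features, extract a maximum spanning tree, cut the weakest edges to form clusters, and keep the most class-relevant feature from each cluster. The function name cbfs_like_select, the edge-cutting rule, and the representative criterion are illustrative assumptions; features are assumed discrete (or discretized) so that mutual_info_score is meaningful.

    # Minimal sketch of an MST-based feature clustering step; the exact
    # clustering and representative-selection rules of CbFS are not given
    # in the abstract, so the choices below are illustrative assumptions.
    import numpy as np
    from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.metrics import mutual_info_score

    def cbfs_like_select(X, y, n_clusters):
        """Pick one representative feature per MST-derived cluster (sketch)."""
        n = X.shape[1]

        # Pairwise MI between (discretized) features.
        mi = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                mi[i, j] = mi[j, i] = mutual_info_score(X[:, i], X[:, j])

        # Maximum spanning tree = minimum spanning tree on negated weights.
        # A tiny offset keeps zero-MI pairs as (weak) edges, so the graph
        # stays complete and the tree stays connected.
        w = -(mi + 1e-12)
        np.fill_diagonal(w, 0.0)
        mst = minimum_spanning_tree(w).toarray()

        # Cut the (n_clusters - 1) weakest edges (smallest MI) of the tree.
        edges = np.argwhere(mst != 0)
        order = np.argsort(-mst[edges[:, 0], edges[:, 1]])  # weakest MI first
        adj = np.zeros((n, n))
        for i, j in edges[order[n_clusters - 1:]]:
            adj[i, j] = adj[j, i] = 1.0
        _, labels = connected_components(adj, directed=False)

        # Representative per cluster: the feature most relevant to the class.
        relevance = mutual_info_classif(X, y, discrete_features=True)
        return sorted(int(np.argmax(np.where(labels == c, relevance, -np.inf)))
                      for c in range(labels.max() + 1))

Since SciPy ships only a minimum spanning tree routine, the maximum spanning tree is obtained here by negating the MI weights; cutting the k-1 weakest edges of a connected tree then yields exactly k clusters.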

References:
  1. Cyber Security Report. https://docs.broadcom.com/doc/istr-22-2017-en. Accessed on: 2022-07-05.
  2. N. Magendiran and J. Jayaranjani, “An efficient fast clustering-based feature subset selection algorithm for high-dimensional data,” International Journal of Innovative Research in Science, vol. 3, no. 1, pp. 405–408, 2014.
  3. P. Moradi and M. Rostami, “A graph theoretic approach for unsupervised feature selection,” Engineering Applications of Artificial Intelligence, vol. 44, pp. 33–45, 2015.
  4. Q. Song, J. Ni, and G. Wang, “A fast clustering-based feature subset selection algorithm for high-dimensional data,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, pp. 1–14, 2011.
  5. G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, “Conditional likelihood maximization: A unifying framework for information theoretic feature selection,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 27–66, 2012.
  6. M. Bennasar, Y. Hicks, and R. Setchi, “Feature selection using joint mutual information maximization,” Expert Systems with Applications, vol. 42, no. 22, pp. 8520–8532, 2015.
  7. M. Dash and H. Liu, “Feature selection for classification,” Intelligent Data Analysis, vol. 1, no. 3, pp. 131–156, 1997.
  8. S. Das, “Filters, wrappers and a boosting-based hybrid for feature selection,” in ICML, vol. 1, pp. 74–81, 2001.
  9. S. Sharmin, M. Shoyaib, A. A. Ali, M. A. H. Khan, and O. Chae, “Simultaneous feature selection and discretization based on mutual information,” Pattern Recognition, vol. 91, pp. 162–174, 2019.
  10. H. Yang and J. Moody, “Feature selection based on joint mutual information,” in Proceedings of the International ICSC Symposium on Advances in Intelligent Data Analysis. Citeseer, pp. 22–25, 1999.
  11. D. D. Lewis, “Feature selection and feature extraction for text categorization,” in Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23–26, 1992, 1992.
  12. R. Battiti, “Using mutual information for selecting features in supervised neural net learning,” IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537–550, 1994.
  13. N. X. Vinh, S. Zhou, J. Chan, and J. Bailey, “Can high-order dependencies improve mutual information based feature selection?” Pattern Recognition, vol. 53, pp. 46–58, 2016.
  14. P. Roy, S. Sharmin, A. A. Ali, and M. Shoyaib, “Discretization and feature selection based on bias corrected mutual information considering high-order dependencies,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp. 830–842, 2020.
  15. J. Yang and V. Honavar, “Feature subset selection using a genetic algorithm,” in Feature Extraction, Construction and Selection. Springer, pp. 117–136, 1998.
  16. T. Naghibi, S. Hoffmann, and B. Pfister, “A semidefinite programming based search strategy for feature selection with mutual information measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 8, pp. 1529–1541, 2014.
  17. M. R. Garey and D. S. Johnson, “Computers and intractability: A guide to the theory of NP-completeness,” 1979.
  18. K. Z. Mao, “Orthogonal forward selection and backward elimination algorithms for feature subset selection,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 34, no. 1, pp. 629–634, 2004.
  19. L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in Proceedings of the 20th international conference on machine learning (ICML-03), pp. 856–863, 2003.
  20. K. Kira, L. A. Rendell et al., “The feature selection problem: Traditional methods and a new algorithm,” in AAAI, vol. 2, pp. 129–134, 1992.
  21. I. Kononenko, “Estimating attributes: Analysis and extensions of RELIEF,” in European Conference on Machine Learning. Springer, pp. 171–182, 1994.
  22. M. A. Hall, “Correlation-based feature selection for machine learning,” 1999.
  23. W. Gao, L. Hu, and P. Zhang, “Feature redundancy term variation for mutual information-based feature selection,” Applied Intelligence, vol. 50, no. 4, pp. 1272–1288, 2020.
  24. H. Nkiama, S. Z. M. Said, and M. Saidu, “A subset feature elimination mechanism for intrusion detection system,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 4, pp. 148–157, 2016.
  25. T. O. Kvålseth, “Entropy and correlation: Some comments,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 17, no. 3, pp. 517–519, 1987.
  26. D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  27. J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, “Feature selection: A data perspective,” ACM Computing Surveys (CSUR), vol. 50, no. 6, p. 94, 2018.
  28. J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic & Soft Computing, vol. 17, 2011.
  29. NSL-KDD. [Online]. Available: https://www.unb.ca/cic/datasets/nsl.html
  30. AWID: Aegean Wi-Fi Intrusion Dataset. [Online]. Available: https://icsdweb.aegean.gr/awid/
  31. IDS 2017. [Online]. Available: https://www.unb.ca/cic/datasets/ids2017.html
  32. J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” The Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
  33. P. Nemenyi, “Distribution-free multiple comparisons,” PhD thesis, Princeton University, Princeton, 1963.
  34. M. H. Tarek, M. M. H. U. Mazumder, S. Sharmin, M. S. Islam, M. Shoyaib, and M. M. Alam, “RHC: Cluster based feature reduction for network intrusion detections,” in 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC). IEEE, pp. 378–384, 2022.
  35. M. H. Tarek, M. E. Kadir, S. Sharmin, A. A. Sajib, A. A. Ali, and M. Shoyaib, “Feature subset selection based on redundancy maximized clusters,” in 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, pp. 521–526, 2021.