Dhaka University Journal of Applied Science & Engineering

Issue: Vol. 7, No. 2, July 2022
Title:

A Clustering based Feature Selection Approach using Maximum Spanning Tree

Authors:
  • Md. Hasan Tarek
    Institute of Information Technology, Dhaka, Bangladesh
  • Suravi Akhter
    Institute of Information Technology, Dhaka, Bangladesh
  • Sumon Ahmed
    Institute of Information Technology, Dhaka, Bangladesh
  • Md Shariful Islam*
    Institute of Information Technology, Dhaka, Bangladesh
DOI:
Keywords:

Clustering, Maximum Spanning Tree, Feature Selection, Mutual Information

Abstract:

Mutual information (MI) based feature selection methods are gaining popularity because of their ability to capture both linear and nonlinear relationships among random variables, which helps them perform well in different fields of machine learning. Traditional MI-based feature selection algorithms use different techniques to estimate the joint performance of features and to select the relevant ones among them. However, in doing so, they may also incorporate redundant features. To solve this issue, we propose a feature selection method, namely Clustering based Feature Selection (CbFS), which clusters the features so that redundant and complementary features are grouped in the same cluster. A subset of representative features is then selected from each cluster. Experimental results for CbFS and four state-of-the-art methods are reported over twenty benchmark UCI datasets and three well-known network intrusion datasets. They show that CbFS outperforms the comparative methods in terms of accuracy and better identifies attack and normal instances in the security datasets.
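
The abstract describes the CbFS pipeline only at a high level, so the Python sketch below is one plausible reading of the general idea rather than the published algorithm: weight a complete graph by pairwise MI between features, extract a maximum spanning tree, cut the weakest edges to form clusters, and keep the most class-relevant feature from each cluster. The function name cbfs_like_select, the edge-cutting rule, and the representative criterion are illustrative assumptions; features are assumed discrete (or discretized) so that mutual_info_score is meaningful.

    # Minimal sketch of an MST-based feature clustering step; the exact
    # clustering and representative-selection rules of CbFS are not given
    # in the abstract, so the choices below are illustrative assumptions.
    import numpy as np
    from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.metrics import mutual_info_score

    def cbfs_like_select(X, y, n_clusters):
        """Pick one representative feature per MST-derived cluster (sketch)."""
        n = X.shape[1]

        # Pairwise MI between (discretized) features.
        mi = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                mi[i, j] = mi[j, i] = mutual_info_score(X[:, i], X[:, j])

        # Maximum spanning tree = minimum spanning tree on negated weights.
        # A tiny offset keeps zero-MI pairs as (weak) edges, so the graph
        # stays complete and the tree stays connected.
        w = -(mi + 1e-12)
        np.fill_diagonal(w, 0.0)
        mst = minimum_spanning_tree(w).toarray()

        # Cut the (n_clusters - 1) weakest edges (smallest MI) of the tree.
        edges = np.argwhere(mst != 0)
        order = np.argsort(-mst[edges[:, 0], edges[:, 1]])  # weakest MI first
        adj = np.zeros((n, n))
        for i, j in edges[order[n_clusters - 1:]]:
            adj[i, j] = adj[j, i] = 1.0
        _, labels = connected_components(adj, directed=False)

        # Representative per cluster: the feature most relevant to the class.
        relevance = mutual_info_classif(X, y, discrete_features=True)
        return sorted(int(np.argmax(np.where(labels == c, relevance, -np.inf)))
                      for c in range(labels.max() + 1))

Since SciPy ships only a minimum spanning tree routine, the maximum spanning tree is obtained here by negating the MI weights; cutting the k-1 weakest edges of a connected tree then yields exactly k clusters.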

References:
  1. Cyber Security Report. https://docs.broadcom.com/doc/istr-22-2017-en. Accessed on: 2022-07-05.
  2. N. Magendiran and J. Jayaranjani, “An efficient fast clustering-based feature subset selection algorithm for high-dimensional data,” International Journal of Innovative Research in Science, vol. 3, no. 1, pp. 405–408, 2014.
  3. P. Moradi and M. Rostami, “A graph theoretic approach for unsupervised feature selection,” Engineering Applications of Artificial Intelligence, vol. 44, pp. 33–45, 2015.
  4. Q. Song, J. Ni, and G. Wang, “A fast clustering-based feature subset selection algorithm for high-dimensional data,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, pp. 1–14, 2011.
  5. G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, “Conditional likelihood maximization: A unifying framework for information theoretic feature selection,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 27–66, 2012.
  6. M. Bennasar, Y. Hicks, and R. Setchi, “Feature selection using joint mutual information maximization,” Expert Systems with Applications, vol. 42, no. 22, pp. 8520–8532, 2015.
  7. M. Dash and H. Liu, “Feature selection for classification,” Intelligent Data Analysis, vol. 1, no. 3, pp. 131–156, 1997.
  8. S. Das, “Filters, wrappers and a boosting-based hybrid for feature selection,” in ICML, vol. 1, pp. 74–81, 2001.
  9. S. Sharmin, M. Shoyaib, A. A. Ali, M. A. H. Khan, and O. Chae, “Simultaneous feature selection and discretization based on mutual information,” Pattern Recognition, vol. 91, pp. 162–174, 2019.
  10. H. Yang and J. Moody, “Feature selection based on joint mutual information,” in Proceedings of the International ICSC Symposium on Advances in Intelligent Data Analysis. Citeseer, pp. 22–25, 1999.
  11. D. D. Lewis, “Feature selection and feature extraction for text categorization,” in Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23–26, 1992, 1992.
  12. R. Battiti, “Using mutual information for selecting features in supervised neural net learning,” IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537–550, 1994.
  13. N. X. Vinh, S. Zhou, J. Chan, and J. Bailey, “Can high-order dependencies improve mutual information based feature selection?” Pattern Recognition, vol. 53, pp. 46–58, 2016.
  14. P. Roy, S. Sharmin, A. A. Ali, and M. Shoyaib, “Discretization and feature selection based on bias corrected mutual information considering high-order dependencies,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp. 830–842, 2020.
  15. J. Yang and V. Honavar, “Feature subset selection using a genetic algorithm,” in Feature Extraction, Construction and Selection. Springer, pp. 117–136, 1998.
  16. T. Naghibi, S. Hoffmann, and B. Pfister, “A semidefinite programming based search strategy for feature selection with mutual information measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 8, pp. 1529–1541, 2014.
  17. M. R. Garey and D. S. Johnson, “Computers and intractability: A guide to the theory of NP-completeness,” 1979.
  18. K. Z. Mao, “Orthogonal forward selection and backward elimination algorithms for feature subset selection,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 34, no. 1, pp. 629–634, 2004.
  19. L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in Proceedings of the 20th international conference on machine learning (ICML-03), pp. 856–863, 2003.
  20. K. Kira, L. A. Rendell et al., “The feature selection problem: Traditional methods and a new algorithm,” in AAAI, vol. 2, pp. 129–134, 1992.
  21. I. Kononenko, “Estimating attributes: Analysis and extensions of RELIEF,” in European Conference on Machine Learning. Springer, pp. 171–182, 1994.
  22. M. A. Hall, “Correlation-based feature selection for machine learning,” 1999.
  23. W. Gao, L. Hu, and P. Zhang, “Feature redundancy term variation for mutual information-based feature selection,” Applied Intelligence, vol. 50, no. 4, pp. 1272–1288, 2020.
  24. H. Nkiama, S. Z. M. Said, and M. Saidu, “A subset feature elimination mechanism for intrusion detection system,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 4, pp. 148–157, 2016.
  25. T. O. Kvålseth, “Entropy and correlation: Some comments,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 17, no. 3, pp. 517–519, 1987.
  26. D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  27. J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, “Feature selection: A data perspective,” ACM Computing Surveys (CSUR), vol. 50, no. 6, p. 94, 2018.
  28. J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic & Soft Computing, vol. 17, 2011.
  29. NSL-KDD. [Online]. Available: https://www.unb.ca/cic/datasets/nsl.html
  30. AWID: Aegean Wi-Fi Intrusion Dataset. [Online]. Available: https://icsdweb.aegean.gr/awid/
  31. IDS 2017. [Online]. Available: https://www.unb.ca/cic/datasets/ids2017.html
  32. J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” The Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
  33. P. Nemenyi, “Distribution-free multiple comparisons,” PhD thesis, Princeton University, Princeton, 1963.
  34. M. H. Tarek, M. M. H. U. Mazumder, S. Sharmin, M. S. Islam, M. Shoyaib, and M. M. Alam, “RHC: Cluster based feature reduction for network intrusion detections,” in 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC). IEEE, pp. 378–384, 2022.
  35. M. H. Tarek, M. E. Kadir, S. Sharmin, A. A. Sajib, A. A. Ali, and M. Shoyaib, “Feature subset selection based on redundancy maximized clusters,” in 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, pp. 521–526, 2021.