HYBRID GENE SELECTION METHOD BASED ON MUTUAL INFORMATION TECHNIQUE AND DRAGONFLY OPTIMIZATION ALGORITHM

Sarah Ghanim Mahmood
Corresponding author
Master in Mathematics Sciences, Assistant Lecturer*
E-mail: sarahghanim@uohamdaniya.edu.iq

Raed Sabeeh Karyakos
Master in Mathematics, Assistant Lecturer*

Ilham M. Yacoob
Master in Mathematics Application Sciences, Assistant Lecturer*

*Department of Mathematics
College of Education
University of AL-Hamdaniya
Erbil road, Al-Hamdaniya District, Nineveh, Iraq, 41006

One of the most prevalent problems with big data is that many of the features are irrelevant. Gene selection has been shown to improve the outcomes of many algorithms, but it is a difficult task in microarray data mining because most microarray datasets have only a few hundred records but thousands of variables. This type of dataset increases the chances of incorrect predictions that arise by chance. Finding the most relevant genes is generally the most difficult part of creating a reliable classification model. Irrelevant and redundant attributes have a negative impact on the accuracy of classification algorithms. Many machine learning-based gene selection methods have been explored in the literature with the aim of improving the precision of dimensionality reduction. Gene selection is a technique for extracting the most relevant data from a series of datasets. Classification methods, which are used in machine learning, pattern recognition, and signal processing, will benefit from further developments in gene selection. The goal of feature selection is to select the smallest subset of features that carries as much information about the class as possible. This paper models gene selection as a binary optimization problem in discrete space, solves it with the binary dragonfly optimization algorithm «BDA», and evaluates candidate subsets with a fitness function based on the accuracy of a k-nearest neighbors classifier on the dataset. The experimental results revealed that the proposed algorithm, dubbed MI-BDA, outperforms other algorithms in terms of the cost of calculations and classification accuracy.


Introduction
In recent years, molecular biology and genetics research has evolved away from studying individual genes and toward exploring the entire genome. The DNA microarray is one technique that measures the expression levels of thousands of genes in a single experiment, making it ideal for comparing gene expression levels in tissues under various conditions, such as healthy versus diseased tissues [1]. Gene selection is frequently used to preprocess the original gene set for subsequent analysis because many genes in the original gene set are irrelevant or even redundant for a specific discriminant problem. From the point of view of discriminant analysis, gene selection can improve the classifier's generalization capacity and reduce the computational complexity of the learning operation. From the point of view of biologists, gene selection results in more compact gene sets, which lowers diagnostic costs and makes it easier to understand the roles of the related genes [2]. In a high-dimensional space with a small number of observations, comparing gene expression profiles and picking those that are best related to the examined types of data is a difficult problem in pattern recognition, which can be tackled using specialized data mining approaches [3]. Despite the rapid advancements in this subject, there is still a need for further understanding and research development. Feature selection has also been extensively studied and applied in the fields of data mining and machine learning [4]. In this context, a feature, also known as an attribute or variable, is a property of a process or system that has been measured or constructed from the original input variables [5]. The aim of feature selection is to find the best subset of «k-features», that is, the subset of k features that produces the least generalization error [6].

Literature review and problem statement
Filter-based gene selection, which selects the most useful features from a gene dataset for a more accurate diagnosis, was found to produce better results. However, the chosen set is not the best subset, because that work aimed only to reduce the number of genes; we suggest that swarm algorithms could be added to find the best subset of genes [7]. The mRMR-ReliefF selection algorithm has also been compared with ReliefF, mRMR, and other feature selection approaches on seven different datasets using two classifiers, Naive Bayes and SVM. RFACO-GS, a hybrid filter-wrapper gene selection algorithm based on the ReliefF algorithm and an improved ACO process, has been proposed as well. Using multiple public gene expression datasets, the experimental results show that the suggested methodology is very successful in lowering the dimensionality of gene expression datasets and choosing the most significant genes with high classification accuracy. However, the algorithm cannot ideally balance the size of the selected gene subset and the classification accuracy in all high-dimensional gene expression datasets, and it has a severe disadvantage in providing sufficient biological interpretations of the genes picked for cancer classification. As a result, more research into these issues would be beneficial for building a classifier of gene expression data [8]. In another study, the IGIS+ algorithm was used to select genes from ten microarray datasets. The authors compare the performance of their approach with that of previous gene selection techniques in terms of classification accuracy, number of selected genes, and number of wrapper evaluations needed. However, the results were compared only with KNN, which is known to be a traditional method. In our opinion, the comparison would have been more convincing if the test had been done with more modern methods, or if the traditional method had also been hybridized [9].
A research study was also presented that built a modular bioinformatics methodology that leverages publicly available human transcriptomics data to produce a score for each gene. The score indicates the overall relevance of the gene in representing transcriptional diversity, its correlation with other genes based on expression profiling, and known pathway annotation. However, genes that contain mutations may not appear in the publicly available human transcriptomics data used there, so it would be worth examining whether a correction mechanism could return mutated genes to their original form before working on them [10]. In another study, two novel binary variants of the grasshopper optimization algorithm (GOA) were developed and applied to feature selection (FS) problems. The first method is based on transfer functions, whereas the second uses a mechanism that repositions the current solution by considering the position of the best solution found so far. According to the results, discussions, and analyses, the suggested binary GOA (especially BGOA-M) has strengths among current FS algorithms and is worthy of attention for tackling difficult FS problems. Three algorithms were compared, but we believe that the bat algorithm, which can also work on binary data, would give good results if compared with the proposed algorithm [11]. To solve FS problems, an asynchronous binary salp swarm algorithm (SSA) with several update rules was presented. The statistical results show that the suggested TCSSA3 is superior in dealing with feature space exploration and exploitation for the vast majority of datasets. According to the discussions and analyses of the results, asynchronous tuning of the major parameter of the SSA, with a distinct leading salp for different areas of the salp chain, was advantageous in mitigating the possible shortcomings of the conventional algorithm. If the optimal number of update rules were also determined with the same algorithm, then the algorithm could be employed in more than one direction [12].
The performance of several feature selection techniques was investigated in another work using two different datasets. The findings revealed a considerable performance difference across feature selection algorithms when employing datasets with varied numbers of features, with accuracy differences ranging from 10 to 20 %. Furthermore, the benefits of filter feature selection strategies should not be overlooked. For forecasting student performance, the outcome of feature selection could have been examined through a confusion matrix and, better yet, the number of hybrid feature selection algorithms applied to the student datasets could have been limited [13]. In this paper, we use the mutual information technique and the binary dragonfly optimization algorithm «MI-BDA» to improve the selection of genes.
The «BDA» method and Mutual Information «MI» were used in this study to acquire subsets of features through two main phases. The first phase uses the MI technique to identify the features that affect the data classification process, relying on an objective function. The second phase uses the BDA approach to reduce the number of features found by the MI approach. The findings show that the proposed algorithm is efficient and effective, achieving higher classification accuracy while using fewer features than standard approaches.

The aim and objectives of the study
The aim of the study is to extract the most relevant data from a collection of datasets. Further refinements to the feature selection technique would have a positive impact on the classification process, which can be used in a variety of applications including machine learning, pattern recognition, and signal processing. To achieve this aim, the following objectives are accomplished:
- to improve gene selection by proposing a new method based on algorithms for selecting the best genes;
- to model gene selection in discrete space as a binary optimization problem that is solved with BDA, using the accuracy of the k-nearest neighbors classifier on the dataset in the chosen fitness function, and in this way to present the hybrid approach between the binary dragonfly algorithm and the mutual information approach.

Materials and methods
In this section, the methods used in conducting the research are presented, the operation of each method is explained, and the way in which the two methods are linked and hybridized is described.
Initially, information theory was developed to find fundamental limits on data compression and efficient communication [14]. Entropy is a crucial measure of information in this theory. It has been commonly used in many fields because it is capable of quantifying the uncertainty of random variables and effectively scaling the amount of information shared by them. For consistency, we only address finite random variables with discrete values [15]. Let X be a random variable with discrete values. Its uncertainty can be measured by the entropy H(X), which is defined as:

$$H(X) = -\sum_{x \in X} p(x) \log p(x), \tag{1}$$

where $p(x) = \Pr(X = x)$ is the probability density function of X. Note that the entropy does not depend on the actual values of the random variable, but only on its probability distribution.
Similarly, the joint entropy H(X, Y) of X and Y is:

$$H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x, y). \tag{2}$$

Conditional entropy refers to the uncertainty that remains in one variable when another variable is known. If the variable Y is given, the conditional entropy H(X/Y) of X with respect to Y is:

$$H(X/Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x/y), \tag{3}$$

where p(x/y) is the posterior probability of X given Y. It follows from this definition that if X depends entirely on Y, then H(X/Y) is zero. This implies that when Y is known, no more information is needed to describe X. At the other extreme, H(X/Y) = H(X) means that knowing Y tells us nothing about X. The mutual information I(X; Y) quantifies how much information is shared by the two variables X and Y:

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \tag{4}$$
If X and Y are closely related, the value of I(X; Y) will be high; if I(X; Y) is zero, the two variables are totally unrelated. I(X; Y) can also be rewritten as I(X; Y) = H(X)-H(X/Y). Analogously, when Z is known, the conditional mutual information of X and Y, defined as I(X; Y/Z) = H(X/Z)-H(X/Y, Z), is the amount of information that X and Y share given Z. That is, I(X; Y/Z) measures the information that Y offers about X that is not already found in Z [14,15].
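The identities above can be checked numerically. The following is a minimal sketch, assuming discrete (already discretized) variables, empirical frequencies as probabilities, and base-2 logarithms; the base and the toy data are assumptions made here for illustration, not values taken from the paper.

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy H(X) in bits, estimated from observed frequencies."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def joint_entropy(xs, ys):
    """Joint entropy H(X, Y) of two equally long discrete sequences."""
    return entropy(list(zip(xs, ys)))


def conditional_entropy(xs, ys):
    """H(X/Y) = H(X, Y) - H(Y): uncertainty left in X once Y is known."""
    return joint_entropy(xs, ys) - entropy(ys)


def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X/Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)


# Toy data (hypothetical): a discretized gene, a class label that the gene
# determines completely, and a variable that is independent of the gene.
gene = [0, 0, 1, 1]
label = [0, 0, 1, 1]
noise = [0, 1, 0, 1]

mi_dependent = mutual_information(gene, label)
mi_independent = mutual_information(gene, noise)
```

In this toy case the label is fully determined by the gene, so H(X/Y) vanishes and the mutual information equals H(X); for the independent variable the mutual information is zero.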
The dragonfly optimization algorithm was inspired by the behavior of dragonflies. It is a swarm intelligence technique for estimating the global best solution to a given optimization problem [11,16].
The swarming behavior of dragonflies and the corresponding mathematical models are described as follows [17].
The term «separation» refers to a strategy used by individuals to avoid colliding with their neighbors. This behavior is modeled mathematically as in (5):

$$S_i = -\sum_{j=1}^{N} (X - X_j), \tag{5}$$

where $X$ is the current position, $X_j$ is the position of the j-th neighboring individual, and $N$ is the size of the neighborhood. Alignment (orientation) describes how the velocity of an individual matches that of the other individuals in the neighborhood. This behavior is modeled mathematically as in (6):

$$A_i = \frac{\sum_{j=1}^{N} V_j}{N}, \tag{6}$$

where $V_j$ is the velocity of the j-th neighboring individual and $N$ is the size of the neighborhood.
Cohesion refers to the tendency of individuals to move toward the center of mass of the neighborhood. This behavior is modeled mathematically as in (7) [18]:

$$C_i = \frac{\sum_{j=1}^{N} X_j}{N} - X, \tag{7}$$

where $X$ is the current position, $X_j$ is the position of the j-th neighboring individual, and $N$ is the size of the neighborhood [19]. Attraction to food is modeled as in (8):

$$F_i = X^{+} - X, \tag{8}$$

where $X^{+}$ is the location of the food source and $X$ is the location of the current individual. Distraction from an enemy is modeled as in (9):

$$E_i = X^{-} + X, \tag{9}$$

where $X^{-}$ denotes the location of the enemy and $X$ denotes the position of the current individual.
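The five behaviors can be illustrated with a short computation. This is only a sketch, assuming the standard dragonfly formulation: separation S = -Σ(X - X_j), alignment A = ΣV_j / N, cohesion C = ΣX_j / N - X, food attraction F = X⁺ - X, and enemy distraction E = X⁻ + X. The function name `swarm_terms` and the toy positions are hypothetical.

```python
def swarm_terms(x, neighbours, velocities, food, enemy):
    """Return (S, A, C, F, E) for one dragonfly at position x.

    x, food, enemy: lists of floats (positions in the search space).
    neighbours, velocities: positions and step vectors of the N neighbours.
    """
    n = len(neighbours)
    dim = len(x)
    # Separation, Eq. (5): move away from the neighbours.
    S = [-sum(x[d] - nb[d] for nb in neighbours) for d in range(dim)]
    # Alignment, Eq. (6): average velocity of the neighbours.
    A = [sum(v[d] for v in velocities) / n for d in range(dim)]
    # Cohesion, Eq. (7): pull toward the neighbourhood's center of mass.
    C = [sum(nb[d] for nb in neighbours) / n - x[d] for d in range(dim)]
    # Food attraction, Eq. (8), and enemy distraction, Eq. (9).
    F = [food[d] - x[d] for d in range(dim)]
    E = [enemy[d] + x[d] for d in range(dim)]
    return S, A, C, F, E


# Toy example (hypothetical values): one dragonfly with two neighbours in 2-D.
S, A, C, F, E = swarm_terms(
    x=[1.0, 1.0],
    neighbours=[[2.0, 0.0], [4.0, 2.0]],
    velocities=[[1.0, 0.0], [3.0, 2.0]],
    food=[5.0, 5.0],
    enemy=[-1.0, 0.0],
)
```

Each term is a vector with one component per dimension, so the terms can be combined linearly into a single step vector.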
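To solve optimization problems, the dragonfly optimization algorithm (DA) uses two simple vectors: the step vector ($\Delta X$) and the position vector ($X$).
The step vector is updated as shown in (10):

$$\Delta X_{t+1} = (s S_i + a A_i + c C_i + f F_i + e E_i) + w \Delta X_t, \tag{10}$$

where $s$, $a$, $c$, $f$, and $e$ are the weights of separation, alignment, cohesion, food attraction, and enemy distraction, and $w$ is the inertia weight. In the binary version, the position is updated as in (11):

$$X_{t+1} = \begin{cases} \neg X_t, & r < T(\Delta X_{t+1}), \\ X_t, & r \ge T(\Delta X_{t+1}), \end{cases} \tag{11}$$

where $r$ is a random number with $r \in [0, 1]$ and $T(\Delta X_{t+1})$ is determined as in (12):

$$T(\Delta X) = \left| \frac{\Delta X}{\sqrt{\Delta X^{2} + 1}} \right|. \tag{12}$$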
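The binary position update can be sketched as follows, assuming the V-shaped transfer function T(ΔX) = |ΔX / sqrt(ΔX² + 1)| that is commonly used with the binary dragonfly algorithm, and a rule that flips a bit when a random number r in [0, 1] falls below T. The function names and the fixed seed are illustrative choices, not part of the paper.

```python
import math
import random


def transfer(delta):
    """V-shaped transfer function: maps a step value to a flip probability."""
    return abs(delta / math.sqrt(delta * delta + 1.0))


def update_position(bits, step, rng):
    """Flip each bit with probability T(step); otherwise keep it."""
    return [1 - b if rng.random() < transfer(s) else b
            for b, s in zip(bits, step)]


rng = random.Random(0)          # fixed seed, only to make the sketch repeatable
bits = [1, 0, 1, 0]
step = [0.0, 0.0, 0.0, 0.0]     # zero step -> T = 0 -> no bit can flip
unchanged = update_position(bits, step, rng)
```

A large step in either direction gives a flip probability close to one, while a step near zero leaves the bit almost surely unchanged, which is why the step vector can drive a search over binary positions.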
The pseudocode of the binary dragonfly optimization algorithm is as follows:
Generate the initial population of DA, X_j and ΔX_j, j = 1, …, N.
Generate initial values of A, a, and c.
Find the fitness function of each search agent.
While (t < Max_iter)
    For each DA
        Calculate A, C, and S by Eqs. (5) to (7).
        Update E and F by (8) and (9), and the main coefficients.
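For orientation, the loop can be sketched in runnable form. This is a simplified illustration under several assumptions that are not taken from the paper: all other agents are treated as neighbors, the coefficients are fixed hypothetical constants, the best and worst agents of the current swarm serve as food source and enemy, the step vector is the weighted sum of the five behaviors plus an inertia term, and bits are flipped through a V-shaped transfer function.

```python
import math
import random


def bda(fitness, dim, n_agents=8, max_iter=20, seed=0):
    """Simplified binary dragonfly search; returns (best_bits, history)."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_agents)]
    dX = [[0.0] * dim for _ in range(n_agents)]
    s, a, c, f, e, w = 0.1, 0.1, 0.7, 1.0, 1.0, 0.9  # hypothetical constants
    best = list(max(X, key=fitness))
    history = []
    for _ in range(max_iter):
        food = max(X, key=fitness)    # best agent acts as the food source
        enemy = min(X, key=fitness)   # worst agent acts as the enemy
        new_X, new_dX = [], []
        for i in range(n_agents):
            others = [j for j in range(n_agents) if j != i]
            n = len(others)
            pos, step = [], []
            for d in range(dim):
                S = -sum(X[i][d] - X[j][d] for j in others)      # separation
                A = sum(dX[j][d] for j in others) / n            # alignment
                C = sum(X[j][d] for j in others) / n - X[i][d]   # cohesion
                F = food[d] - X[i][d]                            # food
                E = enemy[d] + X[i][d]                           # enemy
                v = s * S + a * A + c * C + f * F + e * E + w * dX[i][d]
                t = abs(v / math.sqrt(v * v + 1.0))              # transfer
                step.append(v)
                pos.append(1 - X[i][d] if rng.random() < t else X[i][d])
            new_X.append(pos)
            new_dX.append(step)
        X, dX = new_X, new_dX
        candidate = max(X, key=fitness)
        if fitness(candidate) > fitness(best):
            best = list(candidate)
        history.append(fitness(best))   # best fitness found so far
    return best, history


# Toy objective (hypothetical): number of bits that match a target pattern.
target = [1, 0, 1, 1, 0, 0, 1, 0]
best, history = bda(lambda bits: sum(b == t for b, t in zip(bits, target)), dim=8)
```

The returned history records the best fitness found so far after each iteration, so it never decreases; in gene selection the toy objective would be replaced by a classifier-based fitness function.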
A Proposed Hybrid Algorithm Overview
The hybrid system MI-BDA uses the mutual information technique as a preliminary stage to obtain a collection of genes, in which the genes of a dataset are ordered according to their value for classification accuracy (from highest to lowest). After the genes have been ordered and identified, BDA (the binary dragonfly algorithm) is used to select a subset of the genes pre-selected by the MI technique. In BDA, genes are chosen by means of a vector of binary values (ones and zeros) that is generated at random and has the same length as the vector of genes: a gene that corresponds to a value of one is selected, and a gene that corresponds to a value of zero is ignored. Fig. 1 shows an example that distinguishes unselected genes from selected genes.

Fig. 1. An exemplification of the genes in Binary Drag
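The two stages can be sketched on toy data. The example assumes that expression values have already been discretized, that genes are ranked by their mutual information with the class label, and that a fixed number of top-ranked genes is passed on to BDA; the cut-off of two genes, the function names, and the data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(X; Y) in bits from empirical frequencies of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def rank_genes(columns, labels):
    """Stage 1: gene indices sorted by MI with the class, highest first."""
    scores = [mutual_information(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


def apply_mask(gene_ids, mask):
    """Stage 2 encoding: keep the genes whose mask bit is 1."""
    return [g for g, bit in zip(gene_ids, mask) if bit == 1]


# Hypothetical discretized data: one list per gene, one entry per sample.
columns = [
    [0, 1, 0, 1],   # gene 0: unrelated to the label
    [0, 0, 1, 1],   # gene 1: identical to the label
    [0, 0, 0, 1],   # gene 2: partly informative
]
labels = [0, 0, 1, 1]

ranked = rank_genes(columns, labels)
top = ranked[:2]                      # keep the two highest-scoring genes
selected = apply_mask(top, [1, 0])    # one BDA position vector over those genes
```

The binary vector only has to be as long as the pre-selected gene list, which is what keeps the search space of the second stage small.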
To evaluate classification accuracy, BDA uses the KNN classifier and scores each candidate subset with a fitness function [20,21] that is defined in terms of the following quantities: AC denotes the classification accuracy, G_q denotes the selected features, G_p denotes the features of the entire dataset, and w_1 denotes the random parameter that weights AC. The pseudocode of the proposed MI-BDA framework can be seen below:
        End for
        Set t = t + 1
    End
    Return the best position (the optimal genes).
End
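A fitness function of this kind can be sketched as follows. The exact formula is an assumption made here: a weighted sum w1 * AC + (1 - w1) * (G_p - G_q) / G_p, where G_q and G_p are taken to be the numbers of selected and total features, which is a common form in wrapper-based feature selection. The value w1 = 0.9, the leave-one-out evaluation, and the toy data are likewise hypothetical.

```python
import math
from collections import Counter


def knn_accuracy(rows, labels, k=1):
    """Leave-one-out accuracy of a k-nearest-neighbours classifier."""
    correct = 0
    for i, row in enumerate(rows):
        others = [(math.dist(row, r), labels[j])
                  for j, r in enumerate(rows) if j != i]
        nearest = sorted(others, key=lambda t: t[0])[:k]
        vote = Counter(lab for _, lab in nearest).most_common(1)[0][0]
        correct += vote == labels[i]
    return correct / len(rows)


def fitness(mask, rows, labels, w1=0.9, k=1):
    """ASSUMED form: w1 * AC + (1 - w1) * (G_p - G_q) / G_p, higher is better."""
    g_p = len(mask)               # total number of features
    g_q = sum(mask)               # number of selected features
    if g_q == 0:
        return 0.0                # an empty subset cannot classify anything
    reduced = [[v for v, bit in zip(row, mask) if bit] for row in rows]
    ac = knn_accuracy(reduced, labels, k)
    return w1 * ac + (1 - w1) * (g_p - g_q) / g_p


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
rows = [[0.0, 5.0], [0.1, 0.0], [1.0, 5.1], [1.1, 0.1]]
labels = [0, 0, 1, 1]

fit_informative = fitness([1, 0], rows, labels)
fit_all = fitness([1, 1], rows, labels)
```

With this form, a subset that keeps only the informative feature scores higher than the full feature set, both because the accuracy is higher and because fewer features are used.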

Results of the hybrid algorithm (MI-BDA)
MI-BDA has been applied to three different classification datasets (DLBCL, Prostate, and Ovarian) to verify the proposed algorithm. All the datasets used are binary and are taken from [22,23].

1. Dataset's description and average feature selection
In this section, Table 1 shows the datasets used in our research. The three datasets contain 77, 102, and 253 samples, and every sample contains 7,129, 12,600, and 15,154 features, respectively.
After choosing the datasets, we use our suggested method to select genes, as shown in Table 2. Table 2 reports the feature selection results for our method MI-BDA and for BDA.

2. The experimental effects
Finally, we compare our suggested method (MI-BDA) with the other method in Table 3, which shows the results on the training and testing datasets for the two methods, MI-BDA and BDA. Tables 2, 3 show that the hybrid algorithm MI-BDA achieved better classification accuracy and chose fewer features than the BDA algorithm, which reduces the cost of the calculations that the algorithm needs during the implementation phase; MI-BDA achieved the better result on both the training and the testing datasets. For dataset 2, the accuracy achieved by MI-BDA is 94.1818 percent, which is higher than the 90.1905 percent achieved by BDA.

Discussion of the research results of the hybrid algorithm (MI-BDA)
In this paper, a hybrid algorithm that selects genes in two phases was used. The MI method was used in the first stage and produced only a subset of the genes, whereas the BDA method was used in the second stage to reduce the genes generated in the first stage. As Table 3 shows for the training dataset, the accuracy of MI-BDA on dataset 1 is 91.619 percent, which is greater than BDA's 88.1159 percent, and on dataset 2 it is 92.8141 percent for MI-BDA but 65.4441 percent for BDA. This difference between the standard method and our method shows that MI-BDA gives more effective results.
In the fitness function, the dataset subsets were evaluated using the K-nearest neighbor (KNN) classifier. The proposed hybrid algorithm MI-BDA was compared to BDA, and MI-BDA demonstrated superior classification accuracy and performance across the three datasets. A subset of the selected genes was obtained with the improved algorithm, and the experimental results showed that the proposed algorithm outperforms BDA in terms of the cost of the calculations and the accuracy of classification. Future studies could develop and implement solution strategies and hybrid algorithms that combine this approach with a genetic algorithm and other heuristic methods.

Conclusions
1. Each dataset was separated into a training group of 80 % and a test group of 30 % of the data.
2. The gene selection method has been improved: for dataset 1, which has 7,129 features, the number of selected genes is 8.8 with MI-BDA and 2,950 with BDA, which means that the proposed method succeeded in finding the best subset of features. Our hybrid algorithm «MI-BDA» was compared with «BDA» on three datasets «DLBCL, Prostate, Ovarian», where it showed high competence in terms of gene selection, and the accuracy obtained by MI-BDA on the datasets is higher than that obtained by BDA.