SIGN LANGUAGE DACTYL RECOGNITION BASED ON MACHINE LEARNING ALGORITHMS



Introduction
A major advance in the field of information technology over the past ten years has been the digitalization of human-computer interaction at the visual level. This achievement primarily solves the communication problems of people with hearing disabilities and allows for rapid human-computer interaction. In this regard, the gesture is one of the main forms of visual communication between people. The actions and relative positions of body parts and their changes over time correspond to certain messages, and have recently also become promising for interaction between technical systems and humans. Thanks to the ability to detect visual communication primitives, gesture recognition has become one of the most widely researched topics in recent years [1, 2]. The results of automatic gesture recognition and classification are used to train people with hearing impairments and to help them communicate with strangers using sign language. They can also be used for quick messaging on digital smart devices. This is the social significance of sign language recognition. As video data has become ubiquitous in practical applications, research and development in gesture recognition automation is finding application in many human-machine communication systems.
The phonological structure of a sign language is usually divided into five elements: articulation point, hand configuration, movement type, hand orientation, and facial expressions [1]. Each gesture is perceived through a combination of these elements. These blocks represent valuable sign language elements and can be used by automated intelligent sign language recognition (SLR) systems. It should be borne in mind that in a sign language one gesture means one whole word. In contrast, dactylology is a particular form of speech based on the dactylic alphabet, in which each hand gesture illustrates a specific letter of the language. Each natural language, such as the Kazakh language, has its own dactylic alphabet, which differs from the dactylic alphabets of other languages.
Research on the development of a Kazakh dactyl sign language recognition system is currently insufficient for a complete representation of this language. When developing methods and systems for recognizing the Kazakh dactyl sign language, a number of difficulties arise, mainly associated with spelling, sign language and other features of the language [3]. The alphabet of the Kazakh language has 42 letters, of which 33 are borrowed from the Russian alphabet, while the remaining 9 are specific to this language. The same holds for the Kazakh dactyl language. Since the Kazakh language belongs to the Turkic language family, in which most words, letters and sounds are similar, these tasks are relevant for the Turkic-speaking peoples, who now number more than 200 million people. It should also be borne in mind that the Kazakh language, unlike its relatives, is just beginning to move from the Cyrillic to the Latin alphabet.
Another problem is the division of gestures into static ones, in which no hand movement is needed and the position of the hand and fingers remains stationary in space during the considered time, and dynamic ones, in which gestures are reproduced by moving the hand. In most cases, systems that provide real-time gesture reading support only one of these forms, that is, they rely on either static or dynamic data in their database.

Literature review and problem statement
The paper [1] provides an overview that builds a consistent taxonomy of recent research, divided into four main categories: development, structure, recognition of other hand gestures, and reviews. An analysis of glove-based systems against the characteristics of SLR devices was carried out, a technology development roadmap was drawn up, existing limitations were considered, and valuable information about technological environments was provided to help explore opportunities and challenges in this area. That work uses low-level features of the human hand as input for machine learning algorithms and then applies this data in the recognition and classification process. The main disadvantage of the approach is that special gloves for recognizing hand positions are not always at hand, and not everyone owns them.
The article [2] aims to recognize sign language characters using a model trained on images of American sign language letters. The use of capsule networks for the learning process was proposed, and the test results were compared with those of the LeNet architecture. The study found that capsule networks are useful for effective character recognition in sign language and give a more successful result than LeNet.
In [4], a technique for visual gesture recognition is proposed that combines several handcrafted spatial and spectral representations of gesture images with a convolutional neural network. The technique computes Gabor spectral representations of spatial images of hand gestures and uses an optimized neural network to classify the gestures into the appropriate classes. The authors considered various ways of combining both types of modalities to determine a model that increases the reliability and accuracy of recognition. It should be noted that this work emphasizes how sign language recognition is gradually moving from a variety of auxiliary tools to everyday devices, such as a smartphone or tablet, which a person can carry around without extra load, providing convenience and saving the budget of the average consumer.
In the paper [5], classification of the Turkish sign language (TSL) was implemented using finite-state automata (FSA) based on pose labelling, which uses depth values in location-based features. A grid-based signature space clustering scheme was developed, with cluster numbers used as objects for a set of connections. A pose labelling algorithm for recognizing a predefined set of gestures in TSL is proposed; the labels assigned to poses are used to classify gestures against a known vocabulary using the FSA. A set of complex gestures was selected to evaluate the technique; however, the scheme also extends to a new gesture simply by providing an appropriate FSA based on its poses. The general classification scheme deals only with pose labels rather than low-level spatiotemporal features, reducing the space and time requirements.
The authors of [6] described the results of using the long short-term memory (LSTM) model, which improved the machine translation of Google Translate.
One of the closest works on the study of the Kazakh sign language is [3], where the Kinect sensor is used for gesture recognition, and the coordinates of the hand skeleton and its key characteristics are processed through XML files using tools and calculations in MATLAB. Note, however, that this approach was implemented for the old Kazakh alphabet.
In the article [7], several real-time gesture recognition systems based on convolutional neural networks were compared. The system proposed in that paper can recognize words of a natural language from gestures, using a sign for each letter. The approach was evaluated on the American and Russian sign languages. For the American sign language, a dataset prepared by Massey University and the Institute of Information and Mathematical Sciences was used. Recognition quality for the Russian sign language lagged behind this high result due to the complexity of the real-world Russian sign language dataset. According to the results of the study, typing accuracy for the American Sign Language was high, which we took into account when designing our architecture.
The article [8] presents an effective framework for solving the problem of static gesture recognition based on data obtained from web cameras and the Kinect depth sensor. In that paper, a video sequence is taken as input data, and each frame of the sequence is classified separately, without inter-frame information. The accuracy of the proposed method was estimated on a collected image set consisting of 2700 frames.
In [9], an intelligent system for Turkish sign language recognition was developed, based on 33 basic signs of the Turkish Sign Language. A Microsoft Kinect v2 sensor was used to capture the signs. The proposed system is designed to help people with hearing and speech impairments communicate with others. We could apply this development for our own purposes; however, the linguistic difference between the Kazakh and Turkish languages is the main barrier.
The scientific work [10] presents a review of the scientific literature on sign language recognition systems. The studies in [11, 12] were identified and analyzed for their direct relevance to sign language recognition systems. The article [13] considers a classification based on six dimensions (data collection methods, static/dynamic signs, signing mode, one-handed/two-handed signs, classification technique, and recognition speed).
The research paper [14] analyzes statistics on the various data collection methods used in sign language systems, for which 12 sign languages were selected. Among these languages, the American sign language is analyzed first; the review of this language draws on [15, 16].
The article [17] presents a method for recognizing gestures of the American sign language using the method of principal component analysis to minimize the similarity of gesture classes. The scientific work [18] implements a system for recognizing the letters of the American alphabet using surface electromyography to allow people to spell words. The developers of the recognition system, presented in the article [19], used the MAdaline network for image processing and classification.
The article [3] describes the sign symbols used to record the structure of gestures in writing. The choice of L. S. Dimskis' sign notation in relation to the Kazakh sign language is also justified, and the features of representing the Kazakh sign language in this notation while compiling a dictionary of frequently used gestures are revealed.
As a result of the analysis of the existing methods proposed in the above works, most are characterized by insufficient accuracy and speed of gesture recognition. In addition, many studies require special conditions, such as wearing gloves or other devices, good lighting, etc.
Many scientific studies related to the recognition of the Kazakh sign language were conducted with the old alphabet, consisting of 42 letters [3, 20]. Moreover, the recognition accuracy did not exceed 90 %, so the recognition systems need to be updated and improved. The current Kazakh alphabet consists of 31 letters; changes were made to the spelling of specific Kazakh letters, and digraphs were introduced. Therefore, the development of an accurate and high-speed algorithm for recognizing the new Kazakh sign language in real time, in order to facilitate communication with people with hearing disabilities, is an urgent task.

The aim and objectives of the study
The aim of this work is to implement a program that recognizes the Kazakh dactylic sign language with the updated alphabet with the highest possible accuracy using machine learning methods. The scientific novelty of this work lies in the development of a new system that solves the problem of recognizing both dynamic and static gestures, combined into one database, for building practical human-computer interaction systems.
To achieve this aim, the following tasks must be solved:
– collect a dataset with images of each gesture for the training and testing samples, using the MediaPipe framework to implement hand and finger tracking and identify the key points of the hands in three-dimensional space;
– implement a program for recognizing the Kazakh dactylic sign language and classify gestures using machine learning algorithms;
– conduct a numerical evaluation of the quality of the algorithms in order to determine the best classifier for gesture recognition problems, and build a three-dimensional model containing metrics such as precision, recall and f1-measure for all gesture classes.

Materials and methods
In the course of the work, the American, Russian and Turkish sign languages were analyzed [8, 9, 15], and a program for recognizing the Kazakh sign language was implemented on their basis. In this paper, classical algorithms for gesture recognition are applied, combining two types of data into one database, which is reflected in the architecture of the recognition system. This is necessary because works devoted to the study of dactyl sign languages mainly consider only one specific language and one type of data (single frame or multiple frames), so their proposed solutions cannot be transferred directly to the Kazakh language.

1. Random Forest
The first method chosen to classify dactyl (fingerspelling) gestures is the random forest algorithm. Fig. 1 shows how the random forest algorithm works. Representing a set of decision trees, this algorithm combines them to produce a more accurate result. The training sample is divided into subsamples of a certain size, from which the trees are built. Each split in a tree is made by considering a random subset of features and selecting the best one, and the tree grows until the partitioning is exhausted, that is, until each leaf contains representatives of only one class. In the implementation used in our work, however, there are parameters that limit the height of the trees and the number of objects in each subsample when recognizing gestures.
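As an illustration, a minimal sketch of such a classifier with scikit-learn is given below; the synthetic data merely stands in for the flattened hand-frame images and their 31 class labels, and the parameter values are illustrative rather than the authors' settings.

```python
# A minimal sketch of a Random Forest for gesture classification
# (illustrative parameters; synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the real dataset: flattened hand-frame images, 31 classes.
X, y = make_classification(n_samples=3100, n_features=63, n_informative=40,
                           n_classes=31, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# max_depth and max_samples limit the tree height and the size of the
# subsample each tree is built from, as mentioned above.
rf = RandomForestClassifier(n_estimators=100, max_depth=20, max_samples=0.8,
                            n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```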

2. Support Vector Machine
The next method studied is the support vector machine (SVM) classifier. This algorithm can be divided into two parts: training the classifier and recognizing the characters supplied to the input.
At the first stage, a software implementation of the mathematical apparatus of the support vector machine is developed to create a classifier model. The SVM model separates different classes with a hyperplane in a multidimensional space. This hyperplane is generated iteratively to minimize the error. The purpose of the classifier is to divide the data into classes by finding the maximum-margin hyperplane. One of the important concepts in SVM is the support vectors: the data points located closest to the hyperplane, which are used to determine the dividing boundary. A hyperplane is thus a decision boundary separating sets of objects belonging to different classes.
In the second stage, the recognition and classification process is implemented. The hyperplane that separates the classes correctly is selected.
The main advantages of the SVM classifier are its high accuracy and its ability to work well in high-dimensional spaces. SVM classifiers use only a subset of the training points (the support vectors), so very little memory is used as a result.
However, SVMs have a long training time, so they are not suitable for large datasets in practice. Another disadvantage is that SVM classifiers do not work well with overlapping classes.
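As with the random forest, the sketch below is only an assumed configuration of an SVM on stand-in data, not the authors' implementation; the kernel and the value of C are illustrative.

```python
# A minimal sketch of an SVM classifier for the 31 gesture classes
# (illustrative parameters; synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=3100, n_features=63, n_informative=40,
                           n_classes=31, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# The RBF kernel maps the data into a higher-dimensional space in which the
# maximum-margin hyperplane between classes is searched for iteratively.
svm = SVC(kernel="rbf", C=10.0, gamma="scale")
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```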

3. Extreme Gradient Boosting
The third algorithm, used in this research paper, is the XGBoost classifier. This algorithm is based on gradient boosting of decision trees. First, we construct an ensemble of weak predictive models, in this case, decision trees. The training of the ensemble is performed sequentially. At each iteration, the deviations of the predictions of the already trained ensemble on the training sample are calculated. By adding the new tree's predictions to the trained ensemble's predictions, the average deviation of the model, which is the goal of the optimisation problem, is reduced. New trees will be added to the ensemble as long as the error is reduced.
The XGBoost algorithm is designed for classification tasks with structured and tabular data. Using the gradient descent architecture, the algorithm enhances the performance of weak classifiers. The main parameters of the algorithm are the number of trees, the step size that prevents overfitting, the minimum change in the loss function required to split a leaf into subtrees, the maximum depth of a tree, and the regularisation coefficient. A block structure is used to support parallelisation of the tree-building process, and training can be continued on new data. Parallelisation of the algorithm is possible due to the interchangeable nature of the loops used to build the training base: the outer loop enumerates the leaves of the trees, while the inner loop calculates the features. Such nesting hinders parallelisation, since the outer loop cannot proceed until the inner one has finished. Therefore, to improve the running time, the order of the loops is swapped: initialisation takes place when the data is read, and sorting is then performed in parallel threads. This swap improves the performance of the algorithm by distributing the computations across threads.
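A hedged sketch with the xgboost package is shown below; the parameter values only illustrate the tuning knobs named above and are not the settings used in the experiments.

```python
# A minimal sketch of the XGBoost classifier (illustrative parameters;
# synthetic stand-in data for the 31 gesture classes).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=3100, n_features=63, n_informative=40,
                           n_classes=31, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

xgb = XGBClassifier(
    n_estimators=200,    # number of trees added sequentially
    learning_rate=0.1,   # step size damping each tree's contribution
    gamma=0.1,           # minimum loss reduction required to split a leaf
    max_depth=6,         # maximum tree depth
    reg_lambda=1.0,      # regularisation coefficient
    n_jobs=-1,           # build trees using parallel threads
)
xgb.fit(X_train, y_train)
print("Test accuracy:", xgb.score(X_test, y_test))
```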

1. Creating a dataset
To create the dataset, a real-time image capture program was used first. After that, the hand, that is, the area of interest for further classification, was detected. After the hand is detected and tracked, its frame is drawn from the key points of the hand. This hand frame is rendered on an empty frame, which is saved in the dataset in the corresponding, predefined folder.
The first step of our research is to get an image from a webcam, since the program works in real time. This is followed by the process of hand detection with the MediaPipe neural network framework, as shown in Fig. 2. The ability to perceive the shape and movement of the hands is used to understand sign language and to control devices with hand gestures. Reliable real-time hand perception is a challenge for computer vision, as hands often occlude themselves or each other and lack high-contrast patterns. The MediaPipe framework, by building multimodal machine learning pipelines, returns accurate three-dimensional key points of the hand. Fig. 3 shows the drawing of a hand frame and of a three-dimensional hand skeleton using 21 key points from just one frame. Fig. 4 shows the process of selecting the area of interest, that is, the area of the detected hand (the examples in the pictures were made in the laboratory by our scientists, with their permission). The hand frame (Fig. 5) is moved to another, empty window, and the image is saved. The program code for saving the image is shown in Fig. 6.
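A condensed sketch of this collection step, assuming the MediaPipe Hands solution and OpenCV, is given below; the folder name, class label and sample count are placeholders, and this is not the exact code of Fig. 6.

```python
# A minimal sketch of collecting hand-frame images for one gesture class.
import os
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

label, out_dir, count = "A", "dataset/A", 0   # placeholder class folder
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened() and count < 5000:
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            # Draw the 21 key points of the hand on an empty frame and
            # save it to the folder of the current gesture class.
            blank = np.zeros(frame.shape, dtype=np.uint8)
            mp_draw.draw_landmarks(blank, results.multi_hand_landmarks[0],
                                   mp_hands.HAND_CONNECTIONS)
            cv2.imwrite(os.path.join(out_dir, f"{label}_{count}.png"), blank)
            count += 1
cap.release()
```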
The first version of the Kazakh alphabet consists of 42 letters. The development of a database for the Kazakh sign language, consisting of a dactylic alphabet of 42 gestures, is the initial step in creating a system for automatic recognition of individual hand gestures. The dactylic alphabet for the first Kazakh sign language is shown in Fig. 7.
In 2017, a decree was signed on the transition of the Kazakh alphabet from Cyrillic to Latin. After the changes, the new Kazakh alphabet includes 31 characters of the Latin alphabet, which completely covers all the sounds of the Kazakh language. This article is relevant because research related to gesture recognition for the updated Kazakh alphabet has not yet been conducted. Fig. 8 shows a dataset of 31 gestures, where each gesture corresponds to one letter of the new Kazakh alphabet. After forming the dataset, we proceed to the recognition process. The collected data is divided into training and test data. For the correct recognition and classification of the Kazakh sign language, the machine learning algorithms described above were applied.
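As a sketch of this step, the saved hand-frame images can be loaded into feature arrays and split into training and test samples as follows; the folder layout, image size and split ratio are assumptions for illustration.

```python
# A minimal sketch of forming the training and test samples from the
# saved hand-frame images (assumed "dataset/<letter>/" folder layout).
import os
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

X, y = [], []
for class_id, letter in enumerate(sorted(os.listdir("dataset"))):
    folder = os.path.join("dataset", letter)
    for name in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, name), cv2.IMREAD_GRAYSCALE)
        X.append(cv2.resize(img, (64, 64)).flatten())  # flatten to features
        y.append(class_id)
X, y = np.array(X), np.array(y)

# Divide the collected data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```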

2. Development of a program for recognizing the Kazakh sign language
After the dataset of images for each gesture is collected, a program for recognizing the Kazakh sign language using machine learning methods is implemented. As shown in the pseudocode of the algorithm (Fig. 9), real-time streaming video is accepted as input. Frames are then read from the camera capture. If a hand area is in the frame, the coordinate calculation function is performed; the coordinates are calculated from the detected key points. After the coordinates are determined, the function for drawing the hand frame is performed. The resulting image is converted into a data array and passed to the classification function. As a result of the classification, a text label for the gesture is obtained. Otherwise, when the hand is out of the frame, the label "None" is displayed, since no gesture is detected. Fig. 10 shows the output of the labels corresponding to each class of the shown hand frames. Fig. 11 shows the result of detecting each gesture class in the dataset.
The dataset consists of 31 classes. Each class contains more than 5000 drawings of the hand frame for a single gesture. As shown in the figure, the recognition of gestures, corresponding to the letters of the Kazakh alphabet, occurs in real time.
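A condensed sketch of this loop, following the pseudocode of Fig. 9, is given below; clf is assumed to be one of the classifiers trained above on the same flattened 64×64 hand-frame representation, and labels is an assumed mapping from class indices to letters.

```python
# A minimal sketch of the real-time recognition loop (Fig. 9 pseudocode).
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            # Draw the hand frame, flatten it to a feature vector and
            # classify it with the trained model.
            blank = np.zeros(frame.shape, dtype=np.uint8)
            mp_draw.draw_landmarks(blank, results.multi_hand_landmarks[0],
                                   mp_hands.HAND_CONNECTIONS)
            gray = cv2.cvtColor(blank, cv2.COLOR_BGR2GRAY)
            features = cv2.resize(gray, (64, 64)).flatten().reshape(1, -1)
            text = labels[int(clf.predict(features)[0])]
        else:
            text = "None"  # no hand in the frame, no gesture detected
        cv2.putText(frame, text, (10, 40), cv2.FONT_HERSHEY_SIMPLEX,
                    1.2, (0, 255, 0), 2)
        cv2.imshow("Kazakh sign language recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```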

3. Evaluation of the quality of the algorithms
To assess the quality of the classifiers, five metrics were used: precision, recall, f1-measure, accuracy and AUC ROC. Precision measures the proportion of samples assigned by the model to a class that actually belong to that class. That is, it is responsible for the ability to distinguish a given class from other classes. When the model makes many incorrect positive classifications, the value of this metric decreases.
Recall measures the model's ability to detect samples that belong to the positive class, that is, the ability to detect a particular class. Recall takes into account the correctness of the prediction of all positive samples. However, it ignores negative samples erroneously predicted as positive. The f1-measure combines these two metrics and is defined as their harmonic mean.
Accuracy is a metric that describes the overall accuracy of the model classification across all classes:

Accuracy = (TP + TN) / (TP + FP + FN + TN),

where TP, TN, FP and FN denote the numbers of true-positive, true-negative, false-positive and false-negative classifications, respectively.

Table 1 shows the values of these metrics for each class, and it shows that the overall accuracy for each class is at least 98-99 %. Fig. 12 shows the precision-recall diagram (x and y axes, respectively) and the corresponding F1 score (z-axis) for the Random Forest classifier. When precision reaches one and recall is zero, the F1 measure remains 0, ignoring the precision; if one of the parameters is small, the other will not matter, since the F1 measure emphasises the smaller value. Using the colour indicator on the right side of the figure, the ratio of precision and recall for each class can be seen. Another metric for assessing the classification quality is the ROC curve, which plots the relationship between the true-positive and false-positive rates.
The quantitative interpretation of this curve is given by the AUC indicator (Fig. 14), the area bounded by the ROC curve and the axis of the proportion of false-positive classifications. The higher the AUC, the better the classifier works. Table 2 shows the numerical AUC values of the Random Forest classifier for each class. Table 3 shows the quality metrics of the Support Vector Machine classifier. According to the data, this algorithm made slight mistakes in classifying objects of the 1st and 7th classes; in other cases, it showed good results. Fig. 15 shows the precision-recall diagram (x and y axes, respectively) and the corresponding F1 score (z-axis) for the Support Vector Machine classifier, interpreted in the same way as Fig. 12; the ROC curve was also constructed for this classifier.
The quantitative interpretation of this curve is given by the AUC indicator (Fig. 17). Table 4 shows the numerical AUC values of the Support Vector Machine classifier for each class. Table 5 shows the quality metrics of the XGBoost classifier. The Recall metric reaches its lowest value in detecting Class 1 objects, while the Precision metric, the ability to distinguish one class from the others, showed good results. Fig. 18 shows the precision-recall diagram (x and y axes, respectively) and the corresponding F1 score (z-axis) for the XGBoost classifier, again interpreted as in Fig. 12; the ROC curve was constructed for this classifier as well.
The quantitative interpretation of this curve is given by the AUC indicator (Fig. 20), the area bounded by the ROC curve and the axis of the proportion of false-positive classifications. The higher the AUC, the better the classifier works. Table 6 shows the numerical AUC values of the XGBoost classifier for each class.
As shown in Table 6, the values of the AUC ROC metric lie in the range between 0.92 and 0.99, which confirms the good quality of the algorithm. In this section, five metrics were assessed to check the quality of the algorithms.
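All five metrics discussed in this section can be computed with scikit-learn, as the hedged sketch below shows; clf, X_test and y_test are assumed to be the trained classifier and the test sample from the previous sections (for SVC, predict_proba requires probability=True at construction).

```python
# A minimal sketch of computing the five quality metrics for a classifier.
from sklearn.metrics import (accuracy_score, classification_report,
                             roc_auc_score)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))        # precision, recall, F1
print("Accuracy:", accuracy_score(y_test, y_pred))

# AUC ROC for the multi-class case, one-vs-rest, from class probabilities.
y_proba = clf.predict_proba(X_test)
print("Macro AUC:", roc_auc_score(y_test, y_proba, multi_class="ovr",
                                  average="macro"))
```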

Discussion of the experimental results of the comparative analysis of the algorithms obtained during the study
In this paper, a system for recognizing the Kazakh dactylic sign language, consisting of the dactylic alphabet of 31 gestures in real time, has been developed.
In other scientific studies [6, 20, 21] of sign speech recognition, the support vector machine was also used, as in our work, but compared to those works our recognition accuracy is higher. The peculiarity of the method proposed in our work is the combination of static and dynamic data types into one database, which makes it possible to interpret gestures in real time (dynamic gestures), as well as in cases when there is no need to track the hands (static gestures).
This task has limitations such as the quality of camera visibility and the illumination of the recognition zone; another key constraint is the need for moderate resource usage, since the computing devices impose their own limits.
The advantage of this work is the high recognition accuracy, which is very important for use in human-machine communication systems. Our research is also one of the first to implement a gesture recognition system for the updated Kazakh alphabet.
A shortcoming of the research is the FPS drawdowns, which affect the recognition speed of the machine learning algorithms. In the future, parallelization is planned in order to improve performance and increase the speed of the algorithms. To this end, we will consider solutions to the training-time problems that arise when working with a large number of training examples.

Conclusion
1. The presented research work is aimed at the correct recognition of the Kazakh sign language. To achieve this goal, a dataset was created that contains more than 5000 images for each of the 31 gestures. With fewer photographs, our results were less accurate. Lighting also affects the quality of recognition, and we took this parameter into account so that our development gives a satisfactory result.
2. The classification of gestures was carried out with three classification algorithms. The average accuracy of the Random Forest classifier was 98.86 %, the SVM algorithm showed 98.68 % accuracy, and XGBoost achieved 98.54 % correct recognition. In addition, the classifiers' quality was evaluated by execution speed and algorithm performance. In terms of training time, Random Forest was faster than the support vector machine and XGBoost. To check the accuracy, cross-validation was performed, with the data divided into five blocks. As for prediction speed in the real-time task, Random Forest, although it won in training speed, is inferior in execution speed, as FPS drawdowns begin. Thus, the prediction accuracy of the three methods is about the same; however, SVM and XGBoost have shown themselves to be better in execution speed when working in real time.
3. The conducted research allowed us to draw the following conclusions from the evaluation of the algorithms: the average precision for the RF algorithm was 0.859, for the SVM algorithm 0.895 and for XGBoost 0.794. The average recall for RF was 0.825, for the SVM algorithm 0.797, and for XGBoost 0.773. For most classes, these metrics showed good results; the values given here are the averages of these metrics over all classes for each algorithm.
In the future, it is planned to improve the performance of these classifiers by parallelizing them using CUDA and OpenCL technologies.