Development of a system for graphic captcha systems recognition using competing cellular automata

The peculiarities of using competing cellular automata for the recognition of complex captcha systems have been explored. For this purpose, the concept of competing cellular automata has been introduced and a mathematical model of their functioning and interaction has been developed. The model, based on set theory, describes movable cellular automata, which shift between the neighboring states of characters and thereby implement their transition rules. Based on this mathematical model, a recognition system for captcha images has been developed and implemented using JavaFX 2.0 technology, which made it cross-platform and ensured correct functioning on different operating systems. The libraries of cellular automata have been developed for the English language. Each symbol of the alphabet is represented as a system of states and is matched with a cellular automaton whose states describe the given symbol. The Java programming language was used for development and the OpenCV library for image handling, which made it possible to achieve high-quality recognition results. The architecture of the developed system for recognizing complex captcha images is presented as class diagrams of the main blocks with detailed descriptions of each class. Computer experiments have been carried out with different sets of distorted characters used in actual captcha systems, and recognition quality indices of the developed software have been obtained. It has been shown that the probability of correct captcha image recognition exceeds 80 % at a degree of character deformation of up to 20 %. At a degree of character deformation over 30 %, there is a high probability of false character recognition.
The advantages of the method of text character recognition based on competing cellular automata include the simplicity of the rules of engagement, easy parallelization of the recognition process, and the capability of recognizing distorted and partially overlapping characters, which are the basis of modern captcha systems.


Introduction
Captcha is most often used to prevent the use of online services by bots, in particular, to prevent automatic mailbox registrations, messaging, file downloads, mass mailings, etc. [1].
The relevance of the use of Captcha can be seen, for example, from the statistics of spam volumes. According to global e-mail service providers, spam volume reaches 97 % of the total number of emails. The use of Captcha protection can complicate the task of registering mailboxes by bots and thus reduce the volume of spam mailings.

Literature review and problem statement
Cellular automata are an attractive basis for developing new recognition systems. In the paper [2], the author has shown the indisputable advantages of using cellular automata (CA) in problems that call for parallel computing: they enable a simple implementation of complex image processing algorithms and do not require significant computing resources. Despite these advantages, the cellular automata concept is not often used in recognition. The only thorough research in this area is the work [3], the main part of which is devoted to the study of the characteristics of CAs in text recognition processes. The author uses sequences of different CAs to distinguish the characteristic features of text characters: loops, intersections, positions of the ends. The work [4] studies CAs that make the recognition of handwritten characters possible. The main drawbacks of such approaches are cumbersomeness and the need for system training.
In addition, in another work [5], the author proposed a new algorithm for the recognition of JPEG watermark images based on cellular automata. Another paper [6] presents specialized cellular automaton structures for the analysis of contour images.
In other papers [7,8], the authors proposed an approach to the segmentation of connected symbols in a text CAPTCHA and obtained recognition results for characters that are not separated. Another work [9] evaluates the latest research on the recognition of Captcha systems.
The world-famous service for online recognition [10] of Captcha systems, which actually is a plugin for popular browsers Chrome and Firefox, successfully recognizes characters, the structure of which is not modified by deformation, and the overlapping characters are generally ignored. The average detection time of one captcha is 8 seconds.
However, there is a real way to use the new type of CA in the process of character recognition suggested by the authors [11]. This approach is based on movable CAs, which must realize all their states on the corresponding symbol of the text. The variety of interpretations of characters, which arises in this case, is compensated by the developed mechanism of competition, when the CA with a maximum number of viable states "wins". This CA is the most correct reflection of the symbol under recognition.
At the same time, all the examined recognition methods and systems work ineffectively on typed and partially distorted characters and on partially overlapped characters, which form the basis of modern Captcha systems. Fig. 1 shows a variant of the Captcha systems used by such an Internet giant as Google. These Captcha systems are characterized by non-linear distortions of the text, shifting of symbols one by one, close placement of symbols, and different fonts. No noise is applied, but characters are not always merged without spaces, which complicates the recognition process itself.

The aim and objectives of the study
The aim of the work is to develop a system based on competing cellular automata for the recognition of deformed and partially overlapped characters, similar to those used in Google Captcha engines, which existing systems are unable to recognize.
To achieve this aim, it is necessary to accomplish the following objectives:
- to develop a mathematical model of movable cellular automata suitable for use in recognition tasks;
- to develop a mechanism of competition of cellular automata for increasing the efficiency of recognition;
- to develop the architecture and interface of the recognition system based on competing CAs;
- to study the effectiveness of the developed software.

Mathematical model of movable competing cellular automata
The set of cellular automata used in this work can be written as follows:

$$\{\sigma_i\}, \quad i = 1, \dots, N,$$

where $N$ is the number of symbols in the alphabet. Each automaton has its own set of states $U$ and label $\xi$ (color, position in the alphabet, etc.), and depends on the discrete time $t$:

$$\sigma_\xi = \sigma_\xi(U, \xi, t), \quad U = \{u_1, u_2, \dots, u_K\},$$

where $K$ is the number of states of the current automaton. The shift of the automaton from the current state $k$ to the next state $k+1$ can be described as follows:

$$u_k \to u_{k+1}, \quad k = 1, \dots, K-1.$$

The transition of the CA to the new state is controlled by the transition function $\varphi$:

$$u_{k+1} = \varphi(u_k).$$

The movable CA moves from the current state to the next using the rules generated by the transition function.
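The description above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `MovableAutomatonSketch`, the encoding of states as strings, and the stroke names used in `main` are hypothetical and are not taken from the developed software; the transition function is assumed to simply return the next state in a fixed order.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of one movable CA: state set U, label xi, discrete time k. */
final class MovableAutomatonSketch {
    private final char label;          // xi: here simply the letter the automaton stands for
    private final List<String> states; // U = {u_1, ..., u_K}, in traversal order (assumed encoding)
    private int k = 0;                 // index of the current state u_k (0-based)

    MovableAutomatonSketch(char label, List<String> states) {
        if (states.isEmpty()) {
            throw new IllegalArgumentException("an automaton needs at least one state");
        }
        this.label = label;
        this.states = List.copyOf(states);
    }

    char label() { return label; }

    int stateCount() { return states.size(); } // K

    String currentState() { return states.get(k); }

    /** Transition function phi: the state that follows u_k, or empty after the last state. */
    Optional<String> phi() {
        return k + 1 < states.size() ? Optional.of(states.get(k + 1)) : Optional.empty();
    }

    /** Performs the shift u_k -> u_{k+1}; returns false when no transition is left. */
    boolean step() {
        if (!phi().isPresent()) {
            return false;
        }
        k++;
        return true;
    }

    public static void main(String[] args) {
        // The state names are illustrative placeholders, not the states used by the authors.
        MovableAutomatonSketch l = new MovableAutomatonSketch(
                'L', List.of("vertical-stroke", "corner", "horizontal-stroke"));
        do {
            System.out.println(l.label() + ": " + l.currentState());
        } while (l.step());
    }
}
```

Keeping the label separate from the states mirrors the text: the states drive the movement, while the label is read only once the automaton has been selected.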
Let us assume that the image of the character to be recognized is presented in the form of a set of states similar to the states of the CA. Then $\Omega = \{\omega_1, \omega_2, \dots, \omega_P\}$ is the set of these states. The number of states $P$ of the character is unknown.
When getting on the character, the automaton will move through its states $\omega_p$, the number of which, $P$, is generally not equal to the number of states $K$ of the current automaton: $P \neq K$.
The transitions will be executed by the CA according to the states of the character, that is, a transition $u_k \to u_{k+1}$, $u_k, u_{k+1} \in U$, is allowed only if the character has a matching state. A CA for which this does not hold has no allowed transitions and is removed from the cellular-automaton field.
If the CA can realize all its states, specified by the transition function, on the current character, that is, if

$$\forall u_k \in U \;\; \exists\, \omega_p \in \Omega : \; u_k = \omega_p,$$

we will assume that this CA "successfully" describes the current character.
If

$$\exists\, u_k \in U \;\; \forall \omega_p \in \Omega : \; u_k \neq \omega_p,$$

that is, for at least one state of the CA there is no analogous state of the character, such an automaton is removed from the CA field.
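The removal rule can be sketched as a filter over the field of automata. The sketch assumes, purely for illustration, that both the states of an automaton and the states of the character are encoded as strings and that an "analogous" state means an equal string; the class name `FieldFilterSketch` and the state names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: keep only the automata that realize all their states on a character. */
final class FieldFilterSketch {

    /**
     * @param field           label -> states of each automaton placed on the character
     * @param characterStates states extracted from the character image (assumed given)
     * @return the automata for which every state has an analogous character state
     */
    static Map<Character, List<String>> survivors(Map<Character, List<String>> field,
                                                  Set<String> characterStates) {
        Map<Character, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<Character, List<String>> e : field.entrySet()) {
            // An automaton with at least one unmatched state is removed from the field.
            if (characterStates.containsAll(e.getValue())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Character, List<String>> field = new LinkedHashMap<>();
        field.put('F', List.of("vertical", "top-bar", "middle-bar"));
        field.put('T', List.of("top-bar", "center-stem"));
        field.put('E', List.of("vertical", "top-bar", "middle-bar", "bottom-bar"));

        Set<String> character = Set.of("vertical", "top-bar", "middle-bar", "bottom-bar");
        System.out.println(survivors(field, character).keySet());
    }
}
```

Note that more than one automaton can pass this filter for the same character, since the number of character states need not equal the number of automaton states.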
There may be several CAs that implement all their possible states on the current character. In order to choose from them the one that exactly matches this character, the competition mechanism is used.
Let three CAs with $K$, $L$, and $S$ states, respectively, move on a single character, each of them realizing all its states on it. We will assume that the competition is "won" by the CA with the largest number of states. So, it is necessary to find $\max(K, L, S)$.
Let $\max(K, L, S) = S$. Then the CA $\sigma_\eta = \sigma_\eta(S, \eta, t)$, that is, the automaton with $S$ states and label $\eta$, will be considered the one that describes the current character most "successfully". Reading its label $\eta$, we find out which character has been recognized.
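A minimal sketch of the competition step follows. It assumes that the automata which realized all their states are already known and are given as a map from label to number of states; the class name `CompetitionSketch` and the sample values are hypothetical. The text does not say how ties are resolved, so keeping the first automaton on a tie is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the competition: the surviving CA with the most states wins. */
final class CompetitionSketch {

    /** Returns the label of the surviving automaton with the largest number of states, if any. */
    static Optional<Character> winner(Map<Character, Integer> survivorStateCounts) {
        Character best = null;
        int bestCount = -1;
        for (Map.Entry<Character, Integer> e : survivorStateCounts.entrySet()) {
            if (e.getValue() > bestCount) { // strict '>' keeps the first automaton on a tie (assumption)
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        Map<Character, Integer> survivors = new LinkedHashMap<>();
        survivors.put('F', 3); // K states
        survivors.put('L', 2); // L states
        survivors.put('E', 4); // S states, the maximum
        System.out.println(winner(survivors).orElse('?')); // prints E
    }
}
```

Choosing the automaton with the most realized states favors the most specific description: in the sample data, an automaton for F also fits the strokes of an E, but the automaton for E accounts for more of the character.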
The algorithm of recognition itself is based on the construction of the CA and its graph of transitions, the states of which the given automaton moves through.

Architecture of the recognition system
The proposed recognition system has been implemented as a software product using JavaFX 2.0 technology. The libraries of cellular automata were developed for the English language. The Java programming language was used for development and the OpenCV library [12] for image handling, which made it possible to achieve high-quality recognition results.
The diagram of the main classes [13] that are responsible for working with the device camera is shown in Fig. 2.
The description of these basic classes of the block of working with the camera is provided in Table 1.
The unit for working with the camera consists of one main class, CameraManager, which describes the work of the camera in the mode of taking images for OCR, and three auxiliary classes, which describe the logic of camera settings and operation.
The diagram of the main classes responsible for pre-processing the image received from the camera or from a scanning device is shown in Fig. 3. The description of the basic classes of the image pre-processing unit is provided in Table 2.
The image preprocessor consists of one main class, ThinningAlgorithm, which describes the basic algorithms for image processing. Three auxiliary classes are responsible for choosing the appropriate image processing algorithm according to the image format and for obtaining the matrix.
A simplified diagram of the classes of the developed software that implement the work with cellular automata is shown in Fig. 4. It only shows the interaction of the classes that describe cellular automata, that is, only what is important for the recognition process. Table 3 describes the classes of the unit of cellular automata.
The system consists of two basic classes, CellularAutomata and AutomataSequence, which describe the work of the CAs and their sequences to launch the competition mechanism. The interaction of the cellular automata includes such elements as separating the image into individual characters, checking the conditions in the transition graph, etc. The states of each automaton that corresponds to a certain letter of the alphabet and the rule of transition through the graph are implemented in the RuleItem, RuleCell and SymbolA_Z classes. To simplify the diagram, descriptions of all letters are shown in one class, although they are actually implemented separately. These classes are responsible for implementing the movement of the CAs and describe the transition graph.
An index of the type of automaton that allows identifying it unambiguously is a color label. By reading this label, we can determine the recognized letter. The RuleCell-Result, LabelsChecker, ColorChecker classes correspond to this process. The result of this work will be a text recognized by competing cellular automata.
Interaction with the hardware and the output of the recognized text is performed using the standard functions of the Windows API. Image processing was performed using OpenCV open source libraries.

Description of the interface of the recognition system
The developed software has a very simple interface, since it is designed only for testing the developed recognition algorithms and methods, and not for commercial use. The software consists of one window, which is divided into two blocks. The upper block loads the captcha image, which is obtained from Captcha generators or simply saved from the browser. The lower block displays the result.
The main window of the program in the recognition mode is shown in Fig. 5. The system is designed to recognize displaced objects, closely arranged characters drawn in different fonts, and fuzzy images that consist of deformed characters combined into several groups.

Discussion of captcha image recognition research results
The quality of the developed system was evaluated by recognizing captcha images on a personal computer with the following configuration:
1. Processor - Intel(R) Core(TM) i7-3612QM CPU @ 2.10 GHz.
4. HDD - Seagate ST1000LM (931 GB, SATA II).
5. DVD-RW - LG DVD+-RW.
This computer is running Windows 10 Professional. Captcha generators [14] have been used to generate the incoming images. The results of the research are shown in Fig. 6 in the form of a graph of the dependence of recognition quality on the degree of deformation.
Recognition quality is averaged over ten independent experiments. Analysis of Fig. 6 shows that the dependence of the degree of recognition of CAPTCHA characters by the developed software on the degree of deformation is almost linear and consists of three sections. In the first section, from 0 to 20 % deformation, the system recognizes more than 80 % of the provided characters, that is, shows rather good results. The second section, from 20 % to 80 % deformation, where the slope of the graph increases, shows a gradual decrease in the probability of recognition: at 45 % deformation, the system recognizes only 50 % of the characters. The decrease continues, and the system almost ceases to recognize Captcha correctly at 80 % deformation (only 5 % of the characters are recognized correctly). In the third section, with a further increase in the degree of deformation, the probability of correct recognition decreases to zero.
Thus, one can argue that the developed method can be used successfully (with up to 70 % probability of correct recognition) for the recognition of Captcha whose characters are deformed by no more than 30 %. It should be noted, however, that the existing Captcha recognition systems cannot work with deformed characters at all, which is why the deformation has been introduced. The fact that the system developed in this paper can handle partially deformed and superimposed characters can be considered a significant advantage.
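As a rough consistency check of these figures, one can interpolate linearly between two of the points quoted for Fig. 6. The calculation assumes that the curve passes approximately through (20 %, 80 %) and (45 %, 50 %); the symbol $Q(d)$ for the recognition quality at deformation $d$ is introduced here only for illustration:

$$Q(30) \approx 80 - \frac{80 - 50}{45 - 20}\,(30 - 20) = 80 - 1.2 \cdot 10 = 68\ \%.$$

Under this assumption, the interpolated value is close to the stated figure of about 70 % at 30 % deformation.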
The disadvantages include the fact that the degree of confident recognition is limited to 30 % of character deformations. Of course, this is not enough. Further improvement of the mechanism of competing CAs should increase this area; however, it is clear that successful recognition of completely deformed characters (or those having common lines) is impossible without the involvement of Machine Learning, which is not the subject of this work. In future, the combination of the theory of CAs with the means of Machine Learning can lead to a significant breakthrough in recognition systems, working in extreme conditions with low-quality recognition objects.

Conclusions
1. A new class of movable CAs has been introduced. The motion of the CAs is described by the transition rules generated by the transition function.
2. A mechanism of competition of the CAs has been developed: the automaton with the maximum number of states implemented on a particular character "wins" the competition among all automata that simultaneously move through the character and is recognized as describing the current character most correctly.
3. The architecture and a streamlined single-window interface of the system for Captcha character recognition based on competing CAs were developed to study the adequacy of the model and the quality of recognition.
4. It is shown that the developed system demonstrates a high probability of correct recognition of Captcha characters (up to 70 %) at low degrees of deformation (up to 30 %). At higher degrees of deformation, the probability of correct recognition decreases significantly.

Introduction
Domain dictionaries (DD) are widely used in software design [1]. In particular, when determining the roles of