DEVELOPMENT OF AN ALGORITHM TO PROTECT USER COMMUNICATION DEVICES AGAINST DATA LEAKS

Copyright © 2021, A. Zadereyko, Y. Prokop, O. Trofymenko, N. Loginova, O. Plachinda


Introduction
In today's Internet space, huge amounts of information are circulating, most of which is data that users share while interacting with various Internet services. Structuring and analyzing these data make it possible to identify seemingly hidden patterns, to make predictions, and, with a systematic approach, to form behavioral trends of the Internet audience.
This situation is exacerbated by the energetic efforts of high-tech IT companies to introduce systems for collecting and analyzing users' digital data, which leads to the unspoken monopolization of the market of users' digital data. At the same time, the regulatory role of various state institutions in protecting users' rights to privacy, namely, the secrecy of correspondence and activities in the Internet space, is steadily decreasing. This trend is of increasing concern to Internet users, IT companies' employees, and non-governmental organizations. They draw attention to the inadmissibility of unauthorized collection and monetization of data without the consent of users of the Internet space [1]. Requirements for the implementation of measures to increase user privacy are regulated by ISO/IEC 24760-1:2019 (E) IT Security and Privacy.
In this regard, it is a relevant task to undertake research aimed at developing new approaches and tools to protect users' data on the Internet. Users have the right not only to know what information about them can be collected by Internet services but also to choose their level of privacy in the Internet space. Moreover, the development of tools to control and manage privacy in order to prevent unwanted processing of personally identifiable information is envisaged by the standard ISO/IEC 29100:2011.

Literature review and problem statement
Work [2] reports the results of a study showing that both governments and IT companies are interested in collecting user data in the Internet space. Data collected while tracking Domain Name System (DNS) requests by the MoreCowBell subsystem as part of the PRISM project were involved in the management of various public processes [3]. IT companies are very interested in collecting and analyzing user data in order to monetize their advertising as efficiently as possible [4]. In addition, various IT companies collect statistics about the use of Internet resources and process them automatically [5]. That confirms the hypothesis that information about users' actions is tracked and collected from communication devices connected to the Internet without the users' knowledge. However, the cited works do not offer ways to protect users' data that could exclude monitoring by IT companies.
The most promising direction in terms of ensuring the maximum possible accuracy of collecting information about user actions in the Internet space is the analysis of the DNS traffic of DNS clients installed on communication devices [6]. Paper [7] shows that DNS traffic analysis can identify the software installed on communication devices. One can also obtain data on the history of geolocation, accounts in Internet services, interests, religious preferences, financial status, medical needs, etc. The result of this analysis is the creation of a database of unique digital profiles of communication devices [8,9] and, consequently, accurate prediction of the behavioral response of the Internet audience and the development of possible scenarios for influencing its behavior [10]. That leaves users unprotected against the monitoring of their network traffic and makes it impossible for them to choose their level of privacy when communicating with the Internet space.
Researchers studied ways to prevent user data leaks. For example, work [11] identified and generalized the data collected by the Windows family operating systems on the user's communication devices, sent to Microsoft servers. This trend naturally leads to the search for solutions that give users the choice of what data they can access in the Internet space. In particular, article [12] presents the interface URetail in the form of radar, allowing the user to choose which of his/her personal data can be disclosed. However, the implementation of this approach is narrowly focused on the data collected in retail when shopping in online stores.
Many researchers have worked on methods for analyzing DNS traffic and on its encryption in order to protect users' DNS requests from monitoring and censorship. For example, the authors of paper [13] have concluded that the existing standard DNS traffic schemes are ineffective. Works [14][15][16][17] emphasize the relevance of DNS traffic protection and point to the need for a thorough analysis of possible leaks. For example, article [14] explores the principles of DNS operation and analyzes the Namecoin, GNU, and RAINS systems. Work [15] looks at the vulnerabilities of the DNS protocol and how malicious software exploits them. Study [16] identified the problem of DNS privacy leakage and analyzed the use of HTTPS/TLS (DoH/DoT) and SNI (ESNI) encryption technologies. DNS traffic leaks were evaluated in [17]. Papers [18][19][20] analyze the pros and cons of DNS traffic encryption using the DNS over TLS (DoT) and DNS over HTTPS (DoH) protocols. Study [18] found that even when encryption is enabled, users' data still leak through their DNS queries. In addition, it was found in [19] that the DoT and DoH protocols are supported by only a small number of DNS servers. Significantly, encryption requires additional computing resources and slows down the processing of DNS queries [20]. Work [21] analyzes the vulnerabilities of the DoT protocol. Article [22] explores the performance of the DoH protocol and the impact of DNS traffic encryption protocols on Internet space participants. However, the issue of implementing measures to increase the privacy of users when communicating on the Internet remains unresolved.
Ways to increase user privacy are discussed in works [23,24], which propose the introduction of filtering network traffic of communication devices. However, packet filtering, due to its specificity and the peculiarities of individual protocols to which filters are applied, is not a sufficient means to ensure the protection of user data. Network traffic filtering can be used as one of the means of blocking incoming and outgoing IP packets.
Redirecting traffic through an additional intermediate DNS server, placed between the DNS client and the remote DNS server, is proposed in [25,26]. Thus, the idea of using a Smart DNS Proxy Server to gain access to Internet resources that are unavailable due to geographical constraints is considered in [25]. Study [26] considers building the architecture of a network service that functions as a UDP proxy. However, these works do not use filtering and cryptographic transformation mechanisms to protect DNS requests from monitoring by Internet providers.
Taken together, the results of the above papers suggest that the way data are collected from user communication devices when DNS clients interact with the domain namespace has been insufficiently studied. All this allows us to argue that it is appropriate to conduct a study on the development of tools that can simultaneously localize DNS traffic leaks, hide the actual IP address of the communication device, and block the collection of user data.

The aim and objectives of the study
The aim of this study is to develop an algorithm to protect communication devices from unauthorized collection and leakage of user data on the Internet. The practical application of the developed algorithm would give users the opportunity to determine the level of their privacy.
To accomplish the aim, the following tasks have been set:
- to analyze the process of data sharing between DNS clients and the Internet services they interact with in order to identify leaks and ways of collecting data from users' communication devices;
- to develop an algorithm that blocks the leakage of data collected by developers of the software installed on a communication device, enabling users to choose their level of privacy when interacting with various Internet services;
- to audit the TCP/UDP traffic of various communication devices in order to identify services that send requests for user data collection;
- to check the proposed algorithm for the absence of DNS traffic leaks from the communication device.

Exploring the process of data exchange between DNS clients and Internet services
The physical connection of the user's communication device to the Internet space and its subsequent access to Internet resources begins with sending DNS requests to various Internet services. Any software installed on users' communication devices that executes DNS requests, such as web browsers, file managers, email clients, messengers, etc., can act as a DNS client. DNS clients interact with the Internet space and process DNS queries in the domain space in a strictly defined order [27].
In practice, to reduce response times to DNS queries and to reduce the load on root DNS servers, providers maintain their own DNS server cache [28]. If a DNS request matches a record already stored in the DNS server cache, the IP address is returned from the cache (Fig. 1).
Thus, all queries from DNS clients are accumulated in the DNS logs of the provider's server. Structuring and analyzing DNS query data can provide comprehensive information about a user's online activities. Various state security structures, advertising and analytical units of IT companies, as well as representatives of organized cybercrime, are becoming increasingly interested in the collection, storage, and analysis of such data. That is why user data is increasingly referred to as "digital gold". Given the above, a scheme for data exchange of a communication device with the Internet space is proposed (Fig. 2). Its analysis leads to the conclusion that the ultimate beneficiaries of user data, one way or another, are IT companies.
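The caching and logging behavior described above can be illustrated with a minimal sketch. It is a toy model, not the software of any real provider: the class `CachingResolver`, its fields, and the addresses (taken from the reserved documentation ranges) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class CachingResolver:
    """Toy model of a provider's caching DNS server (illustrative only)."""
    upstream: dict                                # stands in for the DNS hierarchy
    cache: dict = field(default_factory=dict)     # name -> IP address
    log: list = field(default_factory=list)       # (client IP, requested name)

    def resolve(self, client_ip: str, name: str):
        # Every request is logged, whether or not it is served from the cache.
        self.log.append((client_ip, name))
        if name in self.cache:
            return self.cache[name], "cache"
        address = self.upstream.get(name)         # None if the name is unknown
        if address is not None:
            self.cache[name] = address
        return address, "upstream"


resolver = CachingResolver(upstream={"example.org": "192.0.2.10"})
first = resolver.resolve("198.51.100.7", "example.org")
second = resolver.resolve("198.51.100.7", "example.org")
print(first, second, resolver.log)
```

The point of the sketch is that the cache shortens the answer path, while the log grows with every request regardless of where the answer came from.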
The mass scale and continuity of data collection from communication devices are achieved by IT companies offering free access to services for collecting and analyzing Internet statistics: Google Analytics, Yandex Metrika, Liveinternet, Rambler, etc. This approach allows IT companies that introduce systems for automated processing of the collected data not only to carry out digital profiling of communication devices but also to create unique digital profiles for each of their real users [8,9].
Not surprisingly, this trend is a concern for the leadership of a number of democratic countries. For example, EU countries have tightened, at the legislative level, control over and responsibility for infringements on the personal data of EU citizens on EU territory and beyond by adopting the General Data Protection Regulation (GDPR) [29]. However, even these strict measures do not, in fact, solve the main problem. They do not give users the ability to determine their own level of privacy by managing, in real time, the collection of their data while performing any actions in the Internet space.

Developing an algorithm to block data leaks from the user's communication device
The set of measures to prevent data collection from communication devices, and therefore reduce the likelihood of their digital profiling, includes two modules:
1. The DNS traffic leakage protection module, which operates by:
- sending DNS queries under the DoH protocol;
- redirecting DNS traffic to a DNS proxy server with a predefined level of privacy.
2. The data collection lock module, which operates by:
- blocking data collection plugins integrated into the Content Management System (CMS) of online resources;
- blocking the DNS traffic of system-wide and application software.
The first module, DNS traffic leak protection, is key. This is because Internet providers connect users to the Internet domain space through DNS servers that they control and keep mandatory logs of each user's DNS requests. It is obvious that ISPs can:
- link each user's IP address to all the domain names that the user has requested;
- store the accumulated data indefinitely;
- provide the accumulated data to authorized government agencies.
Thus, the users cannot be sure of their privacy by conducting Internet communication through the provider's DNS server.
In addition, ISPs by default force their users to connect to the provider's DNS server even if the user changes the settings to use a third-party DNS server. If such DNS settings are found in a communication device, ISPs use a transparent DNS proxy that redirects the user's DNS traffic to the provider's DNS server. Thus, the provider masks the real route of the user's DNS traffic. This technique makes it possible to secure the [...]
Another important factor in controlling users' DNS traffic is that the default DNS protocol does not encrypt DNS queries. Attempts to implement DNS traffic encryption have resulted in the development and implementation of the DNSCrypt, DoT, and DoH protocols. These protocols encrypt DNS traffic, creating a cryptographically secure channel between DNS clients and servers. It was this circumstance that prompted IT companies to declare support for the implemented DNS traffic encryption technologies and to create public DNS servers under their control that support the DNSCrypt, DoT, and DoH protocols (Table 1).
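To make the DoH mechanism more concrete, the following minimal sketch builds a DNS query in wire format and packs it into a GET request URL in the way RFC 8484 describes (unpadded base64url in the `dns` parameter). It only constructs the URL and sends nothing over the network; the server address `https://doh.example/dns-query` and the function names are hypothetical.

```python
import base64
import struct


def build_dns_query(name: str, qtype: int = 1) -> bytes:
    """Encode a single-question DNS query (qtype 1 = A record) in wire format."""
    # Header: ID 0, flags with only the "recursion desired" bit set,
    # one question, no answer/authority/additional records.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # The name is a sequence of length-prefixed labels ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # class 1 = IN
    return header + question


def doh_get_url(server_url: str, name: str) -> str:
    """Return a DoH GET URL that carries the query as unpadded base64url."""
    encoded = base64.urlsafe_b64encode(build_dns_query(name)).rstrip(b"=")
    return f"{server_url}?dns={encoded.decode('ascii')}"


# Hypothetical DoH endpoint; a real client would send this URL over HTTPS.
url = doh_get_url("https://doh.example/dns-query", "www.example.com")
print(url)
```

Because the query travels inside an ordinary HTTPS request, an observer on the path sees only an encrypted connection to the DoH server, not the requested domain name; the DoH server itself still sees the full query.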
The number of IT companies supporting DNS traffic encryption using these cryptographic protocols continues to increase, which unequivocally allows the following:
- to counter the substitution of DNS responses at DNS transit hubs;
- to bypass the blocking (censorship) of DNS traffic by providers;
- to make it impossible to log and then inspect DNS traffic;
- to reduce the role of providers connecting communication devices to the Internet space;
- to reduce the role of root DNS servers;
- to redistribute data collection on users' DNS traffic in favor of large IT companies.
In addition, most web browser developers have not only implemented the DoH protocol in their software products but have also implemented the ability to connect to the public DNS servers of leading IT companies [30]. The tendency to monopolize DNS traffic by IT companies significantly stresses the urgency of the issue of ensuring the real privacy of users, as it is these IT companies that own the services of collecting and analyzing Internet statistics. Examples of such services are Google Analytics, Yandex Metrika, Liveinternet, Rambler TOP, etc. In addition, it cannot be ruled out that IT companies may provide third parties or authorized government agencies with access to the DNS traffic history of users who have used the services of public DNS servers.
To ensure the privacy of Internet users, along with the use of the specified DNS traffic encryption technologies, it is suggested that DNS requests be redirected through DNS proxy servers of different anonymity classes. These DNS proxies must have a fixed lifespan and should not log DNS requests. The advantage of this approach is that it excludes the possibility of accumulating data on users' DNS requests not only by providers and authorized government agencies but also by IT companies.
However, a prerequisite for keeping DNS requests secure is the use of DNS proxy servers that support the DoH protocol. To ensure the highest user privacy, DNS traffic from the communication device should be redirected through an HIA (High anonymous) DNS proxy server. Such proxy servers hide the actual IP address of the DNS client and prevent the requested DNS server from determining that a DNS proxy is being used [31,32].
A scheme is proposed to redirect DNS communication device traffic through a DNS proxy server (Fig. 3).
The redirection of DNS queries from a communication device is executed by changing the route of DNS requests from the DNS client to the requested domain. Its distinctive features are:
- creating a local DNS server;
- redirecting DNS queries of DNS clients from a communication device to the local DNS server;
- redirecting DNS queries from the local DNS server to a pre-selected DNS proxy server using the proposed algorithm (Fig. 4).
[...]
3) create a regularly updated list of DNS proxy servers for later testing:
3.1) check if DNS queries can be sent under the HTTPS protocol. To create a list of operating DNS proxy servers that meet the specified anonymity requirements, one needs to test them for performance. To this end, DNS requests of the following form are executed consecutively through each DNS proxy server: [...]
[...]
b) Anonymous, ANM (Anonymous), which hides the real IP address of a DNS client but allows the requested DNS server to determine the use of DNS proxy servers;
c) High anonymity, HIA (High anonymous), which hides the IP address of a DNS client and does not make it possible for the requested DNS server to determine the use of DNS proxy servers;
3.5) connect to a DNS proxy server that provides the highest possible class of anonymity.
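The class assignment and the final selection step can be sketched as follows. This is only an illustration under stated assumptions: the two boolean test outcomes stand in for the multi-test, whose actual criteria are not reproduced here; the NOA class is assumed to mean that the client's real IP address remains visible; and all names (`ProxyCheck`, `classify`, `select_proxy`) are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class AnonymityClass(IntEnum):
    NOA = 0  # assumption: the client's real IP address is visible
    ANM = 1  # real IP hidden, but the use of a proxy is detectable
    HIA = 2  # real IP hidden and the use of a proxy is not detectable


@dataclass
class ProxyCheck:
    address: str
    supports_https: bool     # step 3.1: DNS queries can be sent over HTTPS
    real_ip_visible: bool    # hypothetical multi-test outcome
    proxy_detectable: bool   # hypothetical multi-test outcome


def classify(check: ProxyCheck) -> AnonymityClass:
    if check.real_ip_visible:
        return AnonymityClass.NOA
    if check.proxy_detectable:
        return AnonymityClass.ANM
    return AnonymityClass.HIA


def select_proxy(checks: list) -> Optional[ProxyCheck]:
    """Pick an HTTPS-capable proxy with the highest anonymity class (step 3.5)."""
    usable = [c for c in checks if c.supports_https]
    return max(usable, key=classify, default=None)


checks = [
    ProxyCheck("192.0.2.1", True, True, True),
    ProxyCheck("192.0.2.2", True, False, True),
    ProxyCheck("192.0.2.3", True, False, False),
    ProxyCheck("192.0.2.4", False, False, False),  # no HTTPS support: skipped
]
best = select_proxy(checks)
print(best.address, classify(best).name)
```

Ordering the classes as integers keeps the selection a single `max` call; a fuller version would also need a tie-breaking rule (for example, response time) when several proxies share the highest class.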
The criteria for distributing DNS proxy servers by anonymity class based on the results of the multi-test (step 3.3) are listed in Table 2.
The second module of the algorithm, the data collection lock module, blocks connections between the DNS clients of the communication device and specialized Internet data collection services. In addition, it blocks connections to third-party services and services of system and application software developers (Fig. 5).

Fig. 5. Sharing data between a communication device and Internet space
This is executed by filtering the TCP/UDP traffic that is responsible for communicating with the Internet services of:
- user data collection;
- system software;
- application software.
In practice, filtering is performed by firewalls that work at the network packet level and block all incoming and outgoing requests of the communication device's DNS clients that match the following:
- IP addresses of user data collection services;
- IP addresses of the service and third-party traffic of system and application software.
Table 3 gives the results of comprehensive long-term monitoring of TCP/UDP traffic from stationary and mobile communication devices. It lists the identified domain names and IP addresses of system and application software and of Internet services of data collection, analysis, and monetization that establish a connection to the communication device. They are arranged in accordance with the affiliation of IT companies.
Table 3. User data collection, analysis, and monetization services
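The matching rule applied by such a filter can be sketched in a few lines. This is not a firewall, only the decision logic: a connection is dropped when either endpoint falls into a blocked network. The networks below are placeholders from the reserved documentation ranges, not addresses from Table 3, and the function names are hypothetical.

```python
import ipaddress

# Placeholder networks from the documentation ranges; a real list would be
# built from audit data such as the addresses collected in Table 3.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(address: str) -> bool:
    """Return True if the address belongs to any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)


def filter_connections(connections):
    """Keep only (source, destination) pairs where neither end is blocked."""
    return [
        (src, dst)
        for src, dst in connections
        if not (is_blocked(src) or is_blocked(dst))
    ]


connections = [
    ("10.0.0.5", "192.0.2.44"),     # outgoing to a blocked service
    ("192.0.2.9", "10.0.0.5"),      # incoming from a blocked service
    ("10.0.0.5", "198.51.100.20"),  # not on the list, allowed
]
allowed = filter_connections(connections)
print(allowed)
```

In a deployment, the same list would be loaded into the packet filter of the operating system or router rather than evaluated in application code.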

TCP/UDP traffic audit results
Our analysis of DNS traffic related to the system and application software has made it possible to establish the domains, among Internet resources, that are accessed by system and application software (Table 3). Data on the domain owners were obtained from open sources:
- the sl-reverse.com domain is owned by CSC Digital Brand Services, an IT company specializing in digital brand management and digital marketing;
- the cloudfront.net domain is owned by Amazon, an IT company that specializes in providing a wide range of cloud services based on DNS traffic analysis;
- the te.net.net domain is owned by the IT firm Bodis, LLC, which provides monetization and domain traffic management services;
- the host.hit.gemius.pl domain is owned by Gemius, an IT company that does media research and develops tools used to optimize advertising campaigns;
- the 1e100.net domain is owned by the IT company Google;
- the compute-1.amazonaws.com and eu-central-1.amazonaws.com domains are owned by the IT company Amazon.
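Attributing an observed host name to an owner can be sketched as a suffix lookup against the domains listed above. The mapping repeats only the owners named in the text; the function `owner_of` and the sample host names are hypothetical and serve only to illustrate the matching.

```python
from typing import Optional

# Owners as stated in the text above.
DOMAIN_OWNERS = {
    "sl-reverse.com": "CSC Digital Brand Services",
    "cloudfront.net": "Amazon",
    "te.net.net": "Bodis, LLC",
    "host.hit.gemius.pl": "Gemius",
    "1e100.net": "Google",
    "compute-1.amazonaws.com": "Amazon",
    "eu-central-1.amazonaws.com": "Amazon",
}


def owner_of(hostname: str) -> Optional[str]:
    """Return the owner of the longest listed suffix of hostname, if any."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the full name first, then progressively shorter suffixes.
    for start in range(len(labels)):
        suffix = ".".join(labels[start:])
        if suffix in DOMAIN_OWNERS:
            return DOMAIN_OWNERS[suffix]
    return None


# Hypothetical host names of the kind seen in reverse DNS lookups.
for host in ("d1234.cloudfront.net", "ec2-0-0-0-0.compute-1.amazonaws.com", "example.org"):
    print(host, "->", owner_of(host))
```

Grouping observed connections by owner in this way is what makes it possible to arrange the audit results by the affiliation of IT companies.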
The data related to domain owners (Table 3) suggest that mobile application software such as Facebook, Instagram, Viber, and Telegram establishes connections to Internet services owned by the IT companies Google, Amazon, and Cloudflare.
To ensure user privacy, all connections to IP addresses listed in Table 3 should be blocked, which is determined by the functionality of the second module of the proposed algorithm.

Discussion of results of applying the algorithm that determines the absence of DNS traffic leaks from a communication device
We have proposed a data-sharing scheme between communication devices and the Internet space (Fig. 2), which helped establish that DNS client requests accumulate in the DNS logs of the provider's server. After structuring and analyzing DNS queries, DNS logs can be used by various government security agencies, advertising and analytics units at IT companies, as well as organized cybercrime, to obtain private information about users.
The proposed algorithm for blocking data leaks from the user's communication device consists of two modules: the DNS traffic leakage protection module and the data collection lock module. The first module sends DNS requests using the DoH protocol and redirects DNS traffic to a DNS proxy server with a predefined anonymity class. The second module blocks data collection plugins integrated into the Content Management System (CMS) of Internet resources and blocks third-party TCP/UDP traffic from system and application software.
Our analysis of the public DNS servers of IT companies that supported the implementation of the DNSCrypt, DoT, and DoH protocols (Table 1) revealed that IT companies can counteract the substitution of DNS responses at DNS transit nodes and bypass DNS traffic blocking by providers. In addition, the inability to log and then inspect DNS traffic reduced the role of providers connecting communication devices to the Internet space. A significant feature in the redistribution of users' DNS traffic is the decreased role of root DNS servers.
As a result of the verification of the developed algorithm, it is proposed to redirect DNS traffic through DNS proxy servers of different classes of anonymity (Fig. 3). That has made it possible to exclude the possibility of providers accumulating users' DNS requests. The advantage of the proposed algorithm is that it changes the route of DNS queries from a DNS client to the pre-selected DNS proxy server with the highest possible class of anonymity (Fig. 4).
The anonymity class of a DNS proxy server is determined by applying the devised multi-test against the testing criteria (Table 2). The second module of the developed algorithm blocks connections between the DNS clients of the communication device and specialized Internet data collection services. Connections to third-party services and services of system and application software developers (Fig. 5) are also blocked. The combination of the two modules of the proposed algorithm has allowed users to choose the level of their privacy when interacting with the Internet space.

Our comprehensive audit of the TCP/UDP traffic from various communication devices has revealed the IT companies' services involved in user data collection (Table 3).
The proposed algorithm has been checked for the absence of DNS traffic leaks from a communication device. The results showed no DNS traffic leaks when using an arbitrarily selected HIA class DNS proxy server (Table 4).
Thus, the task formulated for this study was solved with the help of the developed algorithm to protect communication devices from unauthorized collection and leakage of user data on the Internet. The combination of DNS redirection of communication devices' traffic through DNS proxy servers and the simultaneous filtering of TCP/UDP traffic in this algorithm is an advantage of the current research over the papers reviewed above [23][24][25][26]. At the same time, the application of the algorithm to block data leaks from communication devices showed no loss of operability of the system and application software. Users were able to choose their own level of privacy, managing the collection of their data while doing any actions in the Internet space in real-time.
The disadvantages of the proposed algorithm include the sequential scanning of each DNS proxy server, which causes a delay before the algorithm becomes operational. The delay was determined experimentally and ranges from 300 to 900 seconds, depending on the number of DNS proxy servers obtained from open Internet resources. That, in turn, makes it impossible to provide the required level of user privacy instantly, because tested and sorted NOA, ANM, and HIA DNS proxy servers are not yet available.
In addition, the DNS proxy testing process increases the total amount of DNS traffic generated by a communication device, which may not be acceptable to users paying for a fixed amount of Internet traffic.
The total testing time of DNS proxy servers can be reduced by scanning them in parallel (multi-threading). In that case, the total testing time would decrease approximately in proportion to the number of testing threads.
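The effect of parallel scanning can be illustrated with a small simulation. The sketch assumes that each proxy check is dominated by network waiting, which is imitated here with a short sleep; `check_proxy` is a placeholder rather than the actual multi-test, and the addresses are from the reserved documentation range.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def check_proxy(address: str) -> tuple:
    """Placeholder for the per-proxy test; sleeps to imitate network latency."""
    time.sleep(0.02)
    return address, True


def run_checks(addresses, workers: int):
    """Check all proxies using the given number of threads; return results and elapsed time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check_proxy, addresses))
    return results, time.perf_counter() - start


addresses = [f"192.0.2.{i}" for i in range(1, 21)]
sequential_results, sequential_time = run_checks(addresses, workers=1)
parallel_results, parallel_time = run_checks(addresses, workers=10)
print(f"1 thread: {sequential_time:.2f} s, 10 threads: {parallel_time:.2f} s")
```

Under the same assumption, a sequential scan of 900 seconds would take on the order of 90 seconds with ten threads. In practice the gain is bounded by the available bandwidth and by the slowest proxy in each batch, and parallel scanning does not reduce the total amount of test traffic.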
Further prospects for improving the proposed algorithm may include:
- introducing a User-Agent ID for DNS clients that communicate under the HTTP protocol;
- introducing a feature for setting the check time for a tested DNS proxy server;
- introducing a feature for recognizing the AnchorFree, CoDeen, and TinyProxy DNS proxies, owned by IT companies providing private surfing services;
- introducing a function that excludes the AnchorFree, CoDeen, and TinyProxy proxy servers from the list of working servers.
Implementing these features could reduce the time to test DNS proxy servers and improve user privacy.

Conclusions
1. We have analyzed the process of data exchange between DNS clients and the Internet services with which they interact. The study of the scheme of data exchange between a communication device and the Internet space has revealed the ways of data leakage from communication devices. Because all DNS client requests are accumulated in the provider's DNS logs, DNS query analysis makes it possible to form a digital profile of the communication device.
2. An algorithm has been developed to block the leakage of data collected by developers of the software installed on a communication device, in order to give users the ability to choose their privacy level. The practical application of the developed algorithm has made it possible to exclude the logging of DNS traffic by Internet providers and thus block the collection of user data from communication devices. The proposed algorithm could significantly reduce the accuracy of digital profiling of the user's communication devices. A significant advantage is the ability to give the user the choice of the desired level of privacy in the Internet space.
3. TCP/UDP traffic from various communication devices has been audited over a long time. The analysis revealed the domains and IP addresses of Internet resources that the system and application software of communication devices refers to. Internet data collection and monetization services that perform requests for user data are organized in accordance with the affiliation of IT companies.
4. Checking the proposed algorithm for the absence of DNS traffic leaks from a communication device showed no loss of operability of the system and application software. The selective blocking of Internet traffic was carried out by setting up a list of prohibited IP addresses of the network firewall in accordance with the experimentally obtained data.