PREPARATION AND PRELIMINARY ANALYSIS OF DATA ON ENERGY CONSUMPTION BY MUNICIPAL BUILDINGS

Introduction
At present, much attention is paid to the problems of efficient use of energy resources and energy efficiency. Measures implemented with the aim of saving energy are typically not effective enough. This is related to the absence of complete information about the objects that consume energy and about the specifics of their operation, the lack of the funds required to carry out activities aimed at increasing the energy efficiency of buildings, etc.
Solving a task on energy efficiency implies accounting of energy resources, analysis of energy-consuming modes, as well as decision-making and management. Practical implementation is, as a rule, based on constant monitoring and parameterization of the system, which leads to the accumulation of a large amount of historical data on energy consumption, obtained from commercial accounting nodes and electronic weather-dependent regulators. Solving the problems of energy efficiency and energy saving requires a comprehensive approach, which predetermines the need to improve existing approaches, or to search for new ones, in order to analyze data on energy consumption [1][2][3]. Processing large volumes of data predetermines the application of specialized algorithms and methods of statistical and intelligent analysis [4][5][6].
Data collection, preparation, and initial analysis make it possible to obtain valuable information on energy consumption before building a model using methods of machine learning. The initial analysis has a significant impact on the adequacy of the models being constructed. There are objective difficulties associated with the specificity of the subject area, the combination of data from multiple sources, and the lack of a unified procedure for data analysis whose results could be implemented into existing systems.
The development of information technology has led to an increase in the number of sectors where methods of machine learning are employed. Among them is the management of power consumption and the prediction of an energy resource cost. Methods of machine learning help in the analysis of parameters for a power management system and support making effective decisions by energy managers.
There is therefore a need to improve existing approaches to analyzing data on energy consumption, or to search for new ones, with the aim of making decisions directed at increasing the efficiency of energy resource use.
Implementation of decisions related to the automated monitoring and control over heating systems makes it possible to reduce the overall consumption of thermal energy and to accumulate large amounts of information on the functioning of the system and on the decisions made by an energy manager. That solves the tasks associated with the need to ensure regulatory sanitary and hygienic conditions in the premises that are heated, taking into consideration the influence of external factors.

Literature review and problem statement
Paper [7] considered the possibility to predict heat consumption in the system of centralized heat supply based on the Elman neural network using a genetic algorithm, which makes it possible to optimize the weights of models. Research [8] confirms that the results obtained by using neural networks are more precise compared with the ensemble random forest method when solving a problem on predicting energy consumption. However, both models can be used for an hourly prediction. The authors of [9] analyzed characteristics of energy consumption and factors affecting it, based on data on the operation of shops' facilities. Since the authors employed a variety of methods for intelligent analysis of a significant volume of data, the findings obtained make it possible to optimize construction costs and business equipment costs based on the built prediction model.
Despite the practical significance of such results, the authors did not sufficiently consider how the obtained results could be implemented into existing decision support systems, and the actual process of analysis is not sufficiently documented. In addition, because the approach was developed only to examine data on the energy consumption by shops, it relies on factors specific to that setting. Studies into the accuracy of models that employ artificial neural networks for short-term forecasting of heat supply to objects within the social, budget-funded sector have found that the best results were demonstrated by a model of the NARX type [10]. Because the data under consideration are typically of a mixed type, incorrectly structured, and/or have missing values, paper [11] synthesized models for data imputation based on several methods of machine learning. The authors compared them with a standard method that is included in most statistical packages.
Paper [12] presented a method for diagnosing energy efficiency based on a limited amount of data acquired from the accounts for electricity and data about weather conditions. The obtained solutions provide an opportunity, based on the results of a multi-parametric regression, to determine volume of the resource consumed by each system, as well as detailed information about the physical properties. The authors also proposed ways to obtain the necessary data. However, despite the advantages of the method, the issue of data preparation and preliminary analysis remains unexplored.
The development of aspects related to research and, in fact, to datasets about energy consumption, is outlined in paper [13]. The authors categorized existing datasets according to the three defined strategies for the generation and development of the catalogue of available data on energy consumption. The authors of this work confine themselves to collecting data and their grouping, as well as to the general presentation of their classification apparatus.
Researchers do not pay enough attention to the issue of data collection, preparation and conducting an initial analysis, which would make it possible to acquire information about power consumption prior to constructing a model and could significantly affect the adequacy of the model. Therefore, there is a need to build a unified methodology to examine data on the energy consumption in buildings with different functional purpose.

The aim and objectives of the study
The study that we conducted set the goal of improving the quality of analysis of energy consumption by municipal buildings through automating the procedure of processing primary data that were acquired from commercial accounting nodes or from the installed systems for monitoring power consumption.
To achieve the set aim, the following tasks have been solved:
- to identify the features of approaches to examining data on energy consumption, to adapt existing methodologies in accordance with the characteristics of the subject area, and to construct a procedure for initial analysis of data on energy consumption by buildings;
- to substantiate the structure of information flows of data on the energy consumption by different buildings, to implement a set of software solutions for retrieving data from different sources while providing the necessary level of their informativeness and quality, and to test the procedure for initial analysis of data on energy consumption by buildings of the educational institution;
- to develop web-based application software in order to visualize an analysis of thermal energy consumption by different apartments with different techniques for accounting and regulation of heat consumption.

1. Information data flows
A sector that consumes a significant portion of energy resources is the system of heat supply to facilities and residential objects in cities: the residential fund, buildings of social and educational purpose, etc.
The objects whose data were investigated were conditionally divided into four groups: the first group is composed of buildings with a heat meter, the second group includes buildings without a heat meter, the third group consists of private houses, the fourth group includes administrative premises.
We selected as administrative premises the academic buildings of Kremenchuk Mykhailo Ostrohradskyi National University (KrNU, Ukraine). KrNU has developed and installed an automated system for operative control and management (ASMU) over thermal points at separate academic buildings. The system is composed of standard industrial equipment. The developed software has made it possible to execute current control over temperature modes of heating systems and to operatively change setpoints for a heat carrier's temperature depending on the outside temperature and working conditions at buildings [14]. The result of ASMU work to handle the heat consumption by academic premises is the availability of data about actual operational modes of heating systems, as well as on energy consumption by separate buildings.
An objective and accessible source for obtaining initial data on buildings without an installed ASMU is the bills for utilities.
As regards the administrative and budget-funded buildings, the known data are the consumption of thermal energy based on a building's meter, the consumption of electricity based on a building's meter, the outer volume of the buildings, and the cost of thermal energy.
An analysis of the bills has revealed that for buildings with a heat meter there are available data on the area of an apartment, the cost of thermal energy, share of an apartment in the heating area of a building, consumption based on an apartment's meter, the building's meter, and the amount to be paid.
Similarly, for apartments in buildings without heat meters there are available data on the area of an apartment, the tariff, and the charged amount to pay for a service.
An analysis of the bills for private houses has shown that the known data include the volume of consumed gas, the heating area, and the cost of gas.

2. Procedure for initial analysis of data
Data Mining is a process whose goal is to find significant correlations, models, and trends, by processing a large volume of data using a variety of statistical and mathematical methods. It is typical to employ the sectoral standard for data examination, CRISP-DM. In accordance with it, a life cycle must be composed of six phases: understanding business processes, understanding the data, data preparation, modelling, evaluation, implementation.
When analyzing data on energy consumption, there is a need to solve certain business tasks:
- Which buildings are the most energy efficient?
- Which premises are the most comfortable for people?
- Which buildings are the most investment attractive?
- Which buildings need priority modernization?
- What is the quality of the provided heating service?
Therefore, it is advisable, when examining data, to be guided by one of the existing analysis methodologies. Given that analytical processing of data on energy consumption is a typical Data Mining project, it was decided to construct a unified procedure for initial analysis of data on energy consumption based on CRISP-DM, taking into consideration the specificity of the subject area.
The constructed procedure implies the following stages:
- collecting source data: requesting access to data from the list of the project's resources and downloading them;
- description of data: examination of the basic, or "surface", properties of the collected data and description of the results: the amount of data, types of values, encoding schemes;
- data examination: analysis of "grey" data by using queries, visualization, and reporting;
- checking the quality of data: existence of encoding errors and missing values;
- data selection: selection of elements and attributes;
- conducting a series of syntactic transformations required by the specifics of the selected analysis tools;
- data merge: merging data tables that contain different or the same information about the same objects;
- generation of attributes: overall power consumption; specific heat, electricity, and overall energy consumption; generated attributes reduced to the regulatory values for external and internal air temperature; the number of degree-days; a mark of the heating season (if necessary);
- analysis of general descriptive statistics for the dataset examined at a given stage;
- check for normality of distribution: conducting a Shapiro-Wilk test and, if required, a separate study of skewness and kurtosis taking into consideration the sample size;
- conducting a correlation analysis for the quantitative estimation of relationships among data;
- building boxplots or a kernel density estimate for comparing the probability distributions of the groups of data, followed by their analysis.
In accordance with the constructed procedure, we provide the following descriptive statistics: arithmetic mean, standard deviation, median, minimum, maximum, skewness, kurtosis, and standard error of the mean.
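The listed statistics can be computed for a single numeric attribute as in the following minimal sketch. The function name `describe` and the sample values are illustrative assumptions; skewness and excess kurtosis are taken here in their simple moment-based form, which may differ slightly from the estimators used in a particular statistical package.

```python
import math
import statistics


def describe(sample):
    """Return descriptive statistics for one numeric attribute (hypothetical helper)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    # Central moments for the moment-based skewness and excess kurtosis.
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    return {
        "mean": mean,
        "sd": sd,
        "median": statistics.median(sample),
        "min": min(sample),
        "max": max(sample),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis, 0 for a normal distribution
        "sem": sd / math.sqrt(n),  # standard error of the mean
    }


# Illustrative values only, not data from the study.
stats = describe([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```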
To check the normality of distribution, we perform the Shapiro-Wilk test. The null hypothesis tested is stated as follows: "The examined sample is drawn from a population that has a normal distribution". If the p-value obtained using the test turns out to be less than the predefined level of significance (for example, 0.05), the null hypothesis is rejected.
Considering the results on testing the normality of distribution, we define a correlation method: a Pearson coefficient is used for normally distributed data, the non-parametric Spearman coefficient is applied for data whose distribution is different from normal.
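The choice of a correlation method can be sketched as follows. This is an illustration only: the Shapiro-Wilk p-values are assumed to be computed beforehand by a statistical package and are passed in as `p_x` and `p_y`, and all function names are hypothetical. Spearman's coefficient is computed as Pearson's coefficient on average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation(x, y, p_x, p_y, alpha=0.05):
    """Use Pearson only if normality is not rejected for both samples."""
    if p_x >= alpha and p_y >= alpha:
        return "pearson", pearson(x, y)
    return "spearman", spearman(x, y)


# Illustrative values only; the p-values are assumed results of a Shapiro-Wilk test.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]
method, value = correlation(x, y, p_x=0.5, p_y=0.01)
```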

1. Structures of information data flows on energy consumption by different buildings
In order to simplify the data and make them more convenient for preparation, we developed structures of information flows for different objects.
Thus, for a building with a heat meter (Fig. 1), we first determine the total heating area:
$$A_t = \frac{A_{ap}}{dA_{ap}} \cdot 100\,\%, \qquad (1)$$
where $A_t$ is the total heating area, m²; $A_{ap}$ is the area of an apartment, m²; $dA_{ap}$ is the share of an apartment in the overall heating area, %.
We then determine specific heat consumption:
$$q = \frac{1163 \cdot \Delta E_b}{A_b}, \qquad (2)$$
where $q$ is the specific heat consumption of a building, kW·hour/m²; $A_b$ is the building's area, m²; $\Delta E_b$ is the consumption based on a meter, Gcal; 1163 is the number of kW·hour in 1 Gcal.
The energy efficiency class is determined from the coefficient of energy efficiency $CEE$ (3), which is calculated from $E_{max}$, the maximum specific need for heating, kW·hour/m², and $q$, the specific heat consumption, kW·hour/m².
Heating tariff is:
$$T = \frac{C_{te} \cdot \Delta E_b}{A_t}, \qquad (4)$$
where $T$ is the heating tariff, UAH/m²; $C_{te}$ is the cost of thermal energy, UAH/Gcal; $A_t$ is the total heating area, m²; $\Delta E_b$ is the consumption based on a meter, Gcal.
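The calculation for a building with a heat meter can be illustrated with a short sketch. The formulas are assumed from the symbol definitions and units given above (total heating area from the apartment's area and its percentage share, Gcal converted to kW·hour with 1 Gcal = 1163 kW·hour, tariff as cost of the consumed energy per square metre). The function names and input values are hypothetical, and the coefficient of energy efficiency is omitted because its formula is not given here.

```python
GCAL_TO_KWH = 1163.0  # 1 Gcal = 1163 kW·hour


def total_heating_area(a_ap, da_ap_percent):
    """Total heating area, m2, from an apartment's area and its share in percent."""
    return a_ap / da_ap_percent * 100.0


def specific_heat_consumption(de_b_gcal, a_b):
    """Specific heat consumption, kW·hour/m2, from metered consumption in Gcal."""
    return de_b_gcal * GCAL_TO_KWH / a_b


def heating_tariff(c_te, de_b_gcal, a_t):
    """Heating tariff, UAH/m2, from the cost of energy (UAH/Gcal) and consumption."""
    return c_te * de_b_gcal / a_t


# Illustrative values only, not taken from real bills.
a_t = total_heating_area(a_ap=50.0, da_ap_percent=2.0)
q = specific_heat_consumption(de_b_gcal=100.0, a_b=2500.0)
tariff = heating_tariff(c_te=1400.0, de_b_gcal=100.0, a_t=a_t)
```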
For buildings without a heat meter (Fig. 2), we determine the amount of heat energy used to heat the premises:
$$\Delta E_{ap} = \frac{Am_{ap}}{C_{te}}, \qquad (5)$$
where $\Delta E_{ap}$ is the amount of heat energy used to heat the premises, Gcal; $Am_{ap}$ is the amount to be paid, UAH; $C_{te}$ is the cost of heat energy, UAH/Gcal.
The following stages imply determining specific heat consumption (2) and the coefficient of energy efficiency (3).
Since the overall heating area is unknown, the indicators are determined for an apartment (Fig. 3).
In order to determine indicators of heat consumption by private houses with a natural gas meter (Fig. 3), we first must reduce initial data to standard conditions by multiplying the volume of gas consumed by its respective coefficient:
$$V_{st\_c} = V_{w\_c} \cdot C_{st}, \qquad (6)$$
where $V_{st\_c}$ is the volume of gas reduced to standard conditions, m³; $V_{w\_c}$ is the volume of gas under working conditions, m³; $C_{st}$ is the coefficient to reduce gas to standard conditions. The choice of coefficient depends on the location of the gas meter, the region or oblast, and the corresponding month.
Because a natural gas meter registers the total volume of gas used by consumers, there is a task to determine the volume of gas used to heat a building:
$$V_h = V_{st\_c} - V_{h\_n}, \qquad (7)$$
where $V_h$ is the volume of gas used for heating, m³; $V_{st\_c}$ is the volume of gas reduced to standard conditions, m³; $V_{h\_n}$ is the volume of gas that is used for household needs, m³.
We then determine the absolute heat consumption to heat a private house:
$$\Delta E_b = GCH \cdot V_h \cdot 10^{-6}, \qquad (8)$$
where $\Delta E_b$ is the absolute heat consumption to heat a house, Gcal; $GCH$ is the gas combustion heat (caloric content), kcal/m³; $V_h$ is the volume of gas used for heating, m³; the factor $10^{-6}$ converts kcal to Gcal.
Caloric content is the primary indicator of the quality of natural gas. The higher the caloric content, the less gas is required to meet specific needs. The values for caloric content are fixed by service providers and are usually available for use.
Heating tariff is:
$$T = \frac{C_g \cdot V_h}{A_b}, \qquad (9)$$
where $T$ is the heating tariff, UAH/m²; $C_g$ is the cost of gas, UAH/m³; $A_b$ is the building's area, m²; $V_h$ is the volume of gas used for heating, m³.
The obtained results make it possible to determine specific heat consumption and an energy efficiency class (2)-(3). For the administrative and budget-funded facilities (Fig. 4), we perform calculations similar to those for buildings with a heat meter (2)-(4), and thus derive values for specific heat consumption, the class of energy efficiency, and the tariff.
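The calculations for apartments without a heat meter and for private houses with a gas meter can be sketched in the same way. The formulas are assumed from the symbol definitions and units in the text (1 Gcal = 10^6 kcal); all function names, the coefficient of 1.02, and the other input values are hypothetical.

```python
KCAL_PER_GCAL = 1_000_000.0  # 1 Gcal = 10^6 kcal


def heat_from_bill(am_ap, c_te):
    """Heat energy used by premises, Gcal, from the amount paid and the cost per Gcal."""
    return am_ap / c_te


def gas_to_standard(v_w_c, c_st):
    """Volume of gas reduced to standard conditions, m3."""
    return v_w_c * c_st


def gas_for_heating(v_st_c, v_h_n):
    """Volume of gas used for heating, m3, after subtracting household needs."""
    return v_st_c - v_h_n


def heat_from_gas(v_h, gch):
    """Absolute heat consumption, Gcal, from gas volume and caloric content (kcal/m3)."""
    return gch * v_h / KCAL_PER_GCAL


def gas_heating_tariff(c_g, v_h, a_b):
    """Heating tariff, UAH/m2, from the cost of gas (UAH/m3) and the heated area."""
    return c_g * v_h / a_b


# Illustrative values only.
de_ap = heat_from_bill(am_ap=2800.0, c_te=1400.0)
v_st_c = gas_to_standard(v_w_c=500.0, c_st=1.02)
v_h = gas_for_heating(v_st_c, v_h_n=10.0)
de_b = heat_from_gas(v_h, gch=8000.0)
tariff_gas = gas_heating_tariff(c_g=7.0, v_h=v_h, a_b=100.0)
```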

2. Verification of the procedure for initial analysis of data
In order to verify the devised procedure, we employed data on the energy consumption by seven academic buildings at KrNU over the period from 2012 to 2016.
For the present study, we collected three datasets:
- Energy consumption by buildings of the educational institution over the period from 2012 to 2016, indicating the number of a building, month of the year, consumption of heat energy (expressed in Gcal), and electricity consumption (expressed in kW·hour) (Table 1). The volume of data: 5 columns, 420 rows. The format of data: the number of a building is a categorical variable; the rest of the data are numeric.
- Volume of the heat load, indicating the number of a building, the volume of heat load (expressed in Gcal/hour), and a building's volume (Table 2). The volume of data: 3 columns, 7 rows. The format of data: the number of a building is a categorical variable; the rest of the data are numeric.
- The average monthly ambient temperature over the period from 2012 to 2016, indicating the year, month, and the average monthly temperature (Table 3). The format of data: all data are numeric.

Table 1. The encoding scheme of the set "Energy consumption and indoor temperature"

Table 2. The encoding scheme of the set "Volume of heat load"
  Attribute                                 Code
  Number of a building                      k
  Volume, m³                                V_m3
  Volume of heat load, Gcal/year            Qop

Table 3. The encoding scheme of the set "Temperature"
  Attribute                                 Code
  Year                                      year
  Month                                     month
  Average monthly ambient temperature, °C   Tout_C

The initial study is limited to data over 2012-2016 (420 elements), which is why we must adjust the filters so that data from other periods are excluded from the selection. The examined datasets do not contain mistakes or missing values, and the encoding schemes are consistent. After checking the quality, we proceed to the data preparation stage.
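The quality check described above (missing values and completeness of the monthly records) might be automated as in the following sketch. It assumes the records are held as dictionaries keyed by the attribute codes `k`, `year`, `month`, `E_Gkal`, and `W_kWh`; the function `check_quality` and the generated rows are hypothetical and serve only to show that 7 buildings over 5 years give 420 monthly records.

```python
def check_quality(rows, required, buildings, years):
    """Return a list of human-readable problems found in the monthly records."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        for name in required:
            if row.get(name) is None:  # a missing value in a required attribute
                problems.append(f"row {index}: missing {name}")
        seen.add((row.get("k"), row.get("year"), row.get("month")))
    # Every building must have one record per month of every year.
    expected = {(k, y, m) for k in buildings for y in years for m in range(1, 13)}
    for k, y, m in sorted(expected - seen):
        problems.append(f"building {k}: no record for {y}-{m:02d}")
    return problems


# Illustrative data only: 7 buildings x 5 years x 12 months = 420 rows.
required = ["k", "year", "month", "E_Gkal", "W_kWh"]
buildings = range(1, 8)
years = range(2012, 2017)
rows = [
    {"k": k, "year": y, "month": m, "E_Gkal": 1.0, "W_kWh": 100.0}
    for k in buildings for y in years for m in range(1, 13)
]
problems = check_quality(rows, required, buildings, years)
```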
All of the above attributes were included in the dataset. Since the source collected data on energy consumption are represented in the first normal form, we made a transition to the third normal form, in order to eliminate transitive dependences. The result of merging the three tables is the unified table with monthly data on each building over 2012-2016, which contains the attributes selected at the previous steps of analysis.
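Merging the three tables into a unified monthly table can be sketched as a pair of key lookups: the heat load table is joined on the building number `k`, and the temperature table on the pair (`year`, `month`). This is a simplified illustration under the assumption that the tables are lists of dictionaries with the attribute codes used in the encoding schemes; the function name and the sample values are hypothetical.

```python
def merge_tables(consumption, heat_load, temperature, first_year=2012, last_year=2016):
    """Join monthly consumption with building attributes and outdoor temperature."""
    load_by_building = {row["k"]: row for row in heat_load}
    temp_by_month = {(row["year"], row["month"]): row["Tout_C"] for row in temperature}
    merged = []
    for row in consumption:
        if not first_year <= row["year"] <= last_year:
            continue  # filter out records from other periods
        building = load_by_building[row["k"]]
        merged.append({
            **row,
            "V_m3": building["V_m3"],
            "Qop": building["Qop"],
            "Tout_C": temp_by_month[(row["year"], row["month"])],
        })
    return merged


# Illustrative values only, not data from the study.
consumption = [
    {"k": 1, "year": 2012, "month": 1, "E_Gkal": 120.0, "W_kWh": 9000.0},
    {"k": 2, "year": 2012, "month": 1, "E_Gkal": 80.0, "W_kWh": 4000.0},
    {"k": 1, "year": 2017, "month": 1, "E_Gkal": 110.0, "W_kWh": 8500.0},
]
heat_load = [
    {"k": 1, "V_m3": 30000.0, "Qop": 0.9},
    {"k": 2, "V_m3": 15000.0, "Qop": 0.5},
]
temperature = [
    {"year": 2012, "month": 1, "Tout_C": -4.2},
    {"year": 2017, "month": 1, "Tout_C": -5.0},
]
unified = merge_tables(consumption, heat_load, temperature)
```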
In order to evaluate the initial data, there is a need for additional parameters that support a comparative analysis of different objects. To compare the energy consumption by different buildings, it is advisable to use the same indicators, for example, specific heat consumption (q1, kW·hour/m³), specific electricity consumption (q2, kW·hour/m³), etc.
The normative value for degree-days during the heating period (DDHP) is given by (10), where $Dd_N$ is the normed number of degree-days; $T_{in}$ = 20 °C is the indoor temperature; $T_{a\_n\_am}$ = -0.8 °C is the average normative ambient temperature.
The actual value for degree-days of a heating period (DDHP):
$$Dd = (Tin\_C - Tout\_C) \cdot Z_{act}, \qquad (11)$$
where $Dd$ is the actual number of degree-days; $Tin\_C$, °C, is the indoor temperature; $Tout\_C$, °C, is the average actual outdoor temperature; $Z_{act}$ is the actual duration of a heating period.
A DDHP coefficient $K_{Dd}$ is determined from the actual number of degree-days $Dd$ and the normed number of degree-days $Dd_N$.
Absolute heat consumption in kW·hour:
$$E\_kWh = 1163 \cdot E\_Gkal,$$
where $E\_kWh$ is the absolute heat consumption, kW·hour; $E\_Gkal$ is the heat energy consumption, Gcal; 1163 is the number of kW·hour in 1 Gcal.
Specific heat consumption is:
$$q1 = \frac{E\_kWh}{V},$$
where $q1$ is the specific heat consumption, kW·hour/m³; $E\_kWh$ is the heat energy consumption, kW·hour; $V$ is the volume of a building, m³.
Specific electricity consumption is:
$$q2 = \frac{W\_kWh}{V},$$
where $q2$ is the specific electricity consumption, kW·hour/m³; $W\_kWh$ is the consumption of electrical energy, kW·hour; $V$ is the volume of a building, m³.
Total energy consumption is:
$$Esum\_kWh = W\_kWh + E\_kWh,$$
where $Esum\_kWh$ is the total energy consumption, kW·hour; $W\_kWh$ is the consumption of electrical energy, kW·hour; $E\_kWh$ is the consumption of heat energy, kW·hour.
Specific total energy consumption is:
$$q3 = \frac{Esum\_kWh}{V},$$
where $q3$ is the specific total energy consumption, kW·hour/m³; $Esum\_kWh$ is the total energy consumption, kW·hour; $V$ is the volume of a building, m³.
Specific heat consumption reduced to the normative values for outdoor and indoor air temperature, $q1t$, kW·hour/m³, is determined from the specific heat consumption $q1$, kW·hour/m³, and the DDHP coefficient $K_{Dd}$.
Specific electricity consumption reduced to the normative values for outdoor and indoor air temperature, $q2t$, kW·hour/m³, is determined from the specific electricity consumption $q2$, kW·hour/m³, and the DDHP coefficient $K_{Dd}$. Specific total energy consumption reduced to the normative values for outdoor and indoor air temperature, $q3t$, kW·hour/m³, is determined in the same way from the specific total energy consumption $q3$, kW·hour/m³, and the DDHP coefficient $K_{Dd}$.
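The generated attributes can be illustrated with a short sketch that uses the attribute codes from the text. Two points are assumptions rather than statements of the procedure: the reduction to normative conditions is written as the common degree-day normalization (the indicator multiplied by the ratio of normed to actual degree-days), and the 180-day heating period used for both degree-day values is a hypothetical figure.

```python
GCAL_TO_KWH = 1163.0  # 1 Gcal = 1163 kW·hour


def degree_days(t_in, t_out, duration_days):
    """Degree-days of a heating period: temperature difference times duration."""
    return (t_in - t_out) * duration_days


def generate_attributes(row):
    """Add absolute and specific consumption indicators to one record."""
    e_kwh = row["E_Gkal"] * GCAL_TO_KWH
    esum_kwh = e_kwh + row["W_kWh"]
    volume = row["V_m3"]
    return {
        **row,
        "E_kWh": e_kwh,
        "q1": e_kwh / volume,         # specific heat consumption
        "q2": row["W_kWh"] / volume,  # specific electricity consumption
        "Esum_kWh": esum_kwh,
        "q3": esum_kwh / volume,      # specific total energy consumption
    }


def reduce_to_norm(q, dd, dd_n):
    """Assumed normalization: scale an indicator to the normed number of degree-days."""
    return q * dd_n / dd


# Illustrative values only, not data from the study.
record = generate_attributes({"k": 1, "E_Gkal": 10.0, "W_kWh": 2326.0, "V_m3": 23260.0})
dd = degree_days(t_in=20.0, t_out=0.0, duration_days=180)
dd_n = degree_days(t_in=20.0, t_out=-0.8, duration_days=180)
q1t = reduce_to_norm(record["q1"], dd, dd_n)
```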
When analyzing energy consumption by buildings, in the transition from general data to data on the heating and non-heating periods, we convert the sample of consumption data, where each entry corresponds to monthly indicators, into a new sample, where each building is matched with a series of sorted records characterizing each separate heating (non-heating) period. In order to analyze seasonal energy consumption, we form a new unified attribute, a tag for a heating season.
Next, we convert the table with data on monthly energy consumption by buildings of the educational establishment, where each entry corresponds to one month, into a new table.
In it, each consumer of energy is matched with a single record with general characteristics of its energy consumption (for example, the total amount of electricity consumed over a heating period, the number of degree-days over this period, etc.). Fig. 5-9 show correlation matrices for general data (Fig. 5), monthly data on heating (Fig. 6) and non-heating seasons (Fig. 7), and summary data on heating (Fig. 8) and non-heating seasons (Fig. 9).
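The conversion of monthly records into seasonal ones can be sketched as tagging each month with a heating season and summing consumption per building and season. The set of heating months (October to April) is an assumption made only for illustration, since actual season dates are set locally; the function names and sample values are hypothetical.

```python
HEATING_MONTHS = {10, 11, 12, 1, 2, 3, 4}  # assumption: October through April


def season_label(year, month):
    """Return a tag such as '2013-2014' for heating months, None otherwise."""
    if month not in HEATING_MONTHS:
        return None
    start = year if month >= 10 else year - 1
    return f"{start}-{start + 1}"


def seasonal_totals(rows):
    """Sum heat and electricity consumption per (building, heating season)."""
    totals = {}
    for row in rows:
        label = season_label(row["year"], row["month"])
        if label is None:
            continue  # non-heating month
        record = totals.setdefault((row["k"], label), {"E_Gkal": 0.0, "W_kWh": 0.0})
        record["E_Gkal"] += row["E_Gkal"]
        record["W_kWh"] += row["W_kWh"]
    return totals


# Illustrative values only, not data from the study.
monthly = [
    {"k": 1, "year": 2013, "month": 11, "E_Gkal": 10.0, "W_kWh": 100.0},
    {"k": 1, "year": 2014, "month": 1, "E_Gkal": 20.0, "W_kWh": 200.0},
    {"k": 1, "year": 2014, "month": 7, "E_Gkal": 0.0, "W_kWh": 50.0},
    {"k": 1, "year": 2014, "month": 10, "E_Gkal": 5.0, "W_kWh": 60.0},
]
totals = seasonal_totals(monthly)
```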
The next stage is to analyze the scatter plots. Fig. 10-15 show charts on the seasonal energy consumption during heating seasons (heat consumption and total energy consumption).
Accounting for specific heat consumption reduced to the normative indoor and outdoor air temperature (Fig. 12) makes it possible to determine changes in the trend. Over the heating season of 2013-2014, which had the highest average temperature of the heating season (the minimum number of degree-days), there is a substantial increase in specific heat consumption by buildings (a spike for building No. 1).
Specific heat consumption over the heating seasons of 2013-2014 and 2015-2016, which have a close enough number of degree-days, is significantly different. The analysis of the absolute total energy consumption over the heating periods from 2012 to 2016 suggests a trend towards lower energy consumption by buildings. Despite this, the value for the absolute total energy consumption of building No. 1 shows a spike (Fig. 13).
Over the heating season of 2012-2013, consumption by building No. 5 exceeded the consumption by building No. 2; starting from the season of 2013-2014, energy consumption by building No. 2 exceeds that by building No. 5.
In the transition to specific total energy consumption, the trend is maintained, and the dependence of seasonal data on temperature is only weakly expressed.
We observe a significant jump in the value for specific energy consumption by building No. 3, which is less pronounced in the data on absolute total energy consumption (over the period from 2012 to 2014) (Fig. 14).
Over the heating seasons of 2013-2014 and 2014-2015, consumption in kW·hour per unit volume by building No. 3 exceeded this indicator for building No. 1, which has the largest volume (Fig. 15).

3. Development of a web-based solution to analyze heat consumption by apartments
To simplify the procedure for analysis of data on the heat consumption by apartments, we developed an appropriate web application. The proposed tool for implementing the client-server architecture to examine data on heat consumption is the framework Shiny, which makes it possible to rapidly and conveniently develop a web-interface using open-source libraries and the programming language R.
A user with any device connected to the Internet can, after authorization or registration, obtain a toolset to process data on heat consumption by apartments. The data entered by the user are stored in a MySQL database. Because there are several types of users, the set of functions available to a given user is limited. When a user tries to perform a function that is not included in the list provided for that user type, an error message is displayed.
Depending on the type of a building whose data on heat consumption will be examined, based on the constructed structures of information flows, various forms of data are implied.
We have implemented a possibility for the user to upload a file with monthly data or to enter them in the developed electronic form, as well as to convert them into values for seasonal consumption and to create additional parameters for comparison: specific consumption, energy efficiency coefficient, etc. Additionally, the software stores and displays in a separate chart the values for the outdoor air temperature, obtained from Internet resources. The visualization settings make it possible not only to choose the Cartesian axes, but also to perform additional splitting based on qualitative variables: the number of rooms, availability of a meter, etc. Fig. 16 shows a fragment of the web-interface of the developed software that demonstrates seasonal indicators.
It is also possible to display the norms of consumption of heat energy for the system of centralized heating (heat supply). Generating and storing reports requires ensuring printing quality in the assigned format, so that such reports can be easily understood by someone who is not necessarily a specialist in energy management. The implementation is based on the Markdown markup language.