Implementation of a Parallel Algorithm of Image Segmentation Based on Region Growing

In computer vision and image processing, image segmentation remains a relevant research area that contains many partially answered research questions. Segmentation, one of the fields of greatest interest in digital image processing, is the process that breaks down an image into the different components that make it up. A technique widely used in the literature is Region Growing, which identifies textures through the use of characteristic feature vectors; however, its computational complexity is high. Traditional region-growing methods are based on the comparison of grey levels of neighbouring pixels and usually fail when the region to be segmented contains intensities similar to those of adjacent regions. If a broad tolerance is set for the thresholds, the detected limits will exceed the region to identify; on the contrary, if the threshold tolerance is decreased too much, the identified region will be smaller than the desired one. In texture analysis, many scenes can be seen as compositions of different textures. Visual texture refers to the impression of roughness or smoothness that some surfaces create through variations of tone or repetition of visual patterns. Texture analysis techniques are based on assigning to each region of the image one or several parameters indicating the characteristics of the texture present. This paper shows how a parallel algorithm was implemented to solve open problems in the area of image segmentation research. Region growing is an advanced approach to image segmentation in which neighbouring pixels are examined one by one and added to an appropriate region class if no border is detected. This process is iterated for each pixel within the boundary of the region. If adjacent regions are found, a region-fusion algorithm is used in which weak edges dissolve and firm edges remain intact; this requires a great deal of processing time, which makes a parallel implementation attractive.


Introduction
Region Growing offers several advantages over classic segmentation techniques. Unlike the gradient and Laplacian methods, the edges of the regions found by region growing are perfectly thin and connected. The algorithm is also very stable with respect to noise [1][2][3].
The main objective of image segmentation is to obtain an independent division of the domain of an image into a set of disjoint regions that are visually different with respect to some computed characteristics or properties, such as grey level, texture or colour; with these regions, it is possible to perform simple image analysis [4].
Segmentation, in concept, is a straightforward idea [5]: just by looking at an image, one can say which regions it contains [6].
The relevance of the work presented here lies in applying classic parallel-processing techniques to the region-growing technique in order to process large, high-quality images with very large numbers of pixels.

Literature review and problem statement
Image segmentation based on growing regions begins with the selection of a pixel, often referred to as a seed, that lies within the object of interest. Usually, the seed is chosen manually. Hsiao and Tynan present the results of research where the seed pixel (first point of the region) extends the region by processing its neighbours and adding them according to a predefined criterion [7, 8].
Felzenszwalb showed that there were unresolved issues related to the insertion criterion, which uses common characteristics such as the intensity, the colour and the texture of the seed and of the points that belong to the region [9, 11].
Each time a new point is inserted into the region, the characteristic used to perform the insertion is recalculated; for example, if the grey level is used, the average grey level of the region generated so far is recalculated. In this way, the region expands by adding neighbours until it finds one that does not meet the insertion condition imposed by the criterion; this approach was used in [10].
If a point has not been added to any region, it can be added to a nearby region if the difference between, for example, the grey level of this point and the average grey level of that region does not exceed a given threshold value.
This technique is useful when the intensity of the background and the object are very similar but are separated by an edge or another region.
The segmentation of any image would be straightforward if factors such as image noise and the number of pixels to process were not involved. In the last ten years, the power of computing has increased with the use of GPUs.
A way to overcome these difficulties was proposed in [10, 11]. Their research shows that the use of this type of graphics card has become accessible for research and that such cards are used in many different fields.
The GPU has doubled its performance every six months on average. At present, the computing power of high-performance GPUs can reach teraflops, which is much higher than that of the central processing unit of a computer [12]. All this suggests that it is advisable to study these architectures and the behaviour of their memory in order to improve computer-vision techniques.

The aim and objectives of the study
The aim of the study is to implement a parallel algorithm to solve general problems in the area of image segmentation research. In this first approximation, the computing power is applied to the region-growing technique, an advanced approach to image segmentation in which neighbouring pixels are examined one by one and added to an appropriate region class if no border is detected; this algorithm is widely used in the literature.
To achieve this aim, the following objectives are accomplished:
- understand and apply the techniques of parallel computing in the computer-vision field;
- establish a parallel strategy to implement the region-growing technique;
- validate the correct functioning of the algorithm;
- show the measures of performance and the improvements obtained.

GPU and their programmation
In 2007, NVIDIA released CUDA (Compute Unified Device Architecture), a free GPU program-development tool that allowed programming its GPUs and turned the technology into a general-purpose one [13, 14].
It has contributed to high-speed scientific computation, benefiting fields such as statistical engineering, Monte Carlo simulation, financial engineering, global climate-change simulation, 3D multimedia, biomedicine, national defence science, oil exploration, civil engineering, CAM, CAE and CAD [15, 16].
From an architectural point of view, the first generations of GPUs had a relatively small number of cores, but it quickly increased until today, when we talk about many-core devices with hundreds of cores on a single chip [17].
This increase in the number of cores caused a significant jump in 2003 in the floating-point computing capacity of GPUs with respect to CPUs. Fig. 1 shows the evolution of the floating-point performance of Intel- and NVIDIA-based technology during the last decade.
It can be seen that GPUs are far ahead of CPUs in terms of performance improvement, especially since 2009, when the ratio was approximately 10 to 1 [18, 19]. The significant differences between multi-core CPU and GPU performance are mainly due to a design-philosophy issue: while GPUs are designed to exploit data-level parallelism with massive parallelism and relatively simple logic, the design of a CPU is optimized for efficient sequential code execution [20].
CPUs use sophisticated control logic that allows instruction-level parallelism and out-of-order execution, and they use quite large cache memories to reduce data-access time.
There are also other issues, such as power consumption and memory-access bandwidth. Current GPUs have memory bandwidths around ten times higher than those of CPUs, among other reasons because CPUs must meet requirements inherited from operating systems, applications and input-output devices.
Electronic copy available at: https://ssrn.com/abstract=3703301

There has also been a very rapid evolution of GPU programming that has changed the purpose of these devices. GPUs in the early 2000s used programmable arithmetic units (shaders) to return the colour of each pixel on the screen. Since the programmer could fully control the arithmetic operations applied to the input colours and textures, researchers observed that the input could be any type of data.
Thus, if the input data were numerical data that had some meaning beyond a colour, the programmers could execute any of the calculations they needed on that data through the shaders [21].
Despite the limitations programmers faced in developing high-performance applications with these arithmetic operations, many efforts were devoted to developing general-purpose application programming interfaces and environments for GPUs. Some of these programming interfaces have been widely accepted in various sectors, although their use still requires some specialization [22].
OpenMP, developed by the OpenMP Architecture Review Board (ARB), has been successfully implemented in small to medium-scale shared-memory systems as well as large-scale systems. Its evolution brought support for the C, C++ and Fortran languages. As computer components decrease in size, architects have begun to consider different strategies to exploit the space on a chip. A recent trend is to implement chip multi-threading (CMT) in hardware [23, 24].
This term refers to the simultaneous execution of two or more threads within a processor. It can be implemented through several physical processor cores on a chip, through a single-core processor that replicates the features needed to maintain the state of multiple threads simultaneously, or through a combination of CMP and SMT. OpenMP support for these new microarchitectures needs to be evaluated and possibly improved [25].

Parallel implementation
Region-based segmentation consists of dividing an image into homogeneous areas of connected pixels, delimited by standard criteria applied to sets of sample pixels. The pixels of a region are similar in some features, usually colour, intensity and texture.
If these criteria are not well adjusted, the results will be incorrect and undesirable; the following problems arise: 1) the segmented region is smaller or larger than the actual one; 2) pseudo-objects arise; 3) fragmentation occurs.
As the main objective of segmentation is to divide an image into regions, segmentation methods such as thresholding achieve this objective only partially by seeking boundaries between regions based on discontinuities. Region-based segmentation, however, is a technique that works well to determine the region directly [2].
The basic formulation is defined as follows: image segmentation partitions the set X into subsets R_i, i = 1, 2, ..., n, with the following properties:
1) ⋃ R_i = X, i = 1, 2, ..., n: segmentation must be complete, so each pixel must belong to a region;
2) each R_i is a connected region: the points in a region must always be connected;
3) R_i ⋂ R_j = ∅ for i ≠ j: regions must be disjoint;
4) P(R_i) = TRUE, where P is a logical predicate defined over the points of set R_i: the properties that pixels in a segmented region must meet are clearly defined;
5) P(R_i ⋃ R_j) = FALSE for adjacent regions R_i and R_j: the regions are different in the sense of the predicate P.
With this information, the algorithm shown in Listing 1 was developed.
As shown in Listing 1, all the instructions within the while statement correspond to the structure that can be parallelized, because this block of statements can be sent to the different GPUs.
With the previous information, it is possible to parallelize the algorithm using OpenMP. Fig. 2 shows how the parallelization strategy is applied.
At the end there is a synchronization point, called the barrier, which waits for the tasks to be completed and then accumulates the sub-results obtained.

Discussion of experimental results
The results presented in this work cover three critical stages: 1) verification of algorithm operation; 2) execution and parallel performance; 3) parallel speedup.

1. Checking the operation of the algorithm
The algorithm was tested with different professional tomography and X-ray images. For a tomographic image of a slice of the brain, Fig. 3 shows the original image, Fig. 4 the segmented image and Fig. 5 both images overlapped. Fig. 6 shows the resulting histogram.
Another test was performed directly on a lateral X-ray plate of a human head. Fig. 7 shows the original image; Fig. 8 shows the original and segmented images overlapped.
Two fundamental problems were observed: the selection of appropriate seeds to define the regions of interest, and the choice of the properties used to add pixels.
The selection of the starting points depends on the image to be segmented.
The selection of similarity criteria depends on the problem and the type of image available.
Another test was performed directly on a 4K video stream to measure the real-time behaviour of the parallelized processing. Fig. 9 shows the stream images and the processed images.

2. Execution and parallel performance
CT images from The Cancer Imaging Archive, with contrast and patient-age metadata, obtained as a dataset from kaggle.com, were used to perform the parallel performance tests.
The sequential program, developed in the C language, and its parallel equivalent, developed with OpenMP on an NVIDIA Tegra GPU platform, were executed (see Fig. 9). Table 1 shows the GPU+OpenMP execution time versus the sequential time.
The speedup is defined as the ratio of the execution time of the serial program to that of the best parallel algorithm used to solve the problem. It is necessary to emphasize that there will always be a serial fraction of the time that cannot be parallelized.
The speedup is defined in (4), where S is the speedup, Ts the sequential processing time and Tp the parallel processing time:
S = Ts / Tp. (4)
The values obtained with this equation are shown in Table 2.
As can be seen in Tables 1 and 2, the calculation of the performance limits of the parallelization is critical, since it represents how the software will behave when the hardware is scaled. The previous performance considerations indicate that the application has a good chance of success on multi-core hardware.
Our study is limited to 4K images with a maximum resolution of 4,096×2,160; however, it is possible to process pictures and video of higher resolution by scaling the proposed hardware architecture. It is necessary to remember that our method is local and does not take into account the global view of the problem. It must also be emphasized that the algorithm is sensitive to noise, so a threshold function must first be applied to the image; there can also be a continuous path of colour-related points connecting any two points of the image, which can affect the performance.

Conclusions
1. Understand and apply the techniques of parallel computing in the computer-vision field: in the past, supercomputers were large systems that consumed precious resources, and parallel computing could only be performed on a computer cluster. Now, using a GPU system for parallel computing, even the largest images can be processed with sophisticated algorithms. In our experiments, a 256-core Maxwell GPU together with a 4-core ARM Cortex-A57 CPU was used, and each process completed in 20 ns.
2. Establish a parallel strategy to implement the region-growing technique: the study showed that the improvement of technologies and mechanisms for obtaining data from images has led to more detailed raster information with a higher level of resolution and spatial precision, which has generated a gradual growth in the volume of the models. Each model has significant computational complexity in its analysis, so the processing time can range from tractable to very hard to compute. Our strategy works well, and it is possible to achieve scalability by applying the algorithm on clusters.
3. Validate the correct functioning of the algorithm: the complete segmentation was performed with the parameters x=198, y=359 and max-dist=0.2 on a DICOM image. Excellent results were obtained on grey-level and colour images, and the algorithm is robust enough to be applied to satellite and ecological images of larger dimensions.
4. Show the measures of performance and improvements: Table 2 shows the excellent results of the functionality test of the proposed algorithm. The 256 cores were sufficient to execute the algorithm on images of any tested size, and it is possible to use the proposed hardware architecture for other, more complex algorithms in all fields of computer vision.

Fig. 2. The strategy of the algorithm shown in Listing 1

Table 1. Execution time