Defining Asymptotic Parallel Time Complexity of Data-dependent Algorithms

The scientific research community has reached a stage of maturity where its strong need for high-performance computing has spread into everyday engineering and industrial applications. To satisfy this need, parallel computers provide an efficient and economical way to solve large-scale and/or time-constrained problems. As a consequence, the end-users of these systems have a vested interest in defining the asymptotic time complexity of parallel algorithms in order to predict their performance on a particular parallel computer. The asymptotic parallel time complexity of data-dependent algorithms depends on the number of processors, the data size, and other parameters. Discovering these other parameters is a challenging problem and the key to obtaining a good estimate of performance order. Typical examples of such applications are sorting algorithms, searching algorithms, and solvers of the traveling salesman problem (TSP). This article applies a knowledge discovery methodology to the problem of defining the asymptotic parallel time complexity of data-dependent algorithms. The methodology begins by designing a considerable number of experiments and measuring their execution times. Then, an interactive and iterative process explores the data in search of patterns and/or relationships, detecting some parameters that affect performance. Once the key parameters that characterise time complexity are known, new hypotheses can be formulated to restart the process and produce an improved time complexity model. Finally, the methodology predicts the performance order for new data sets on a particular parallel computer by replacing a numerical identification.
As a case study, a global pruning traveling salesman problem implementation (GP-TSP) has been chosen to analyze the influence of indeterminism on the performance prediction of data-dependent parallel algorithms and to show the usefulness of the defined knowledge discovery methodology. The successive hypotheses generated to define the asymptotic parallel time complexity of the TSP were corroborated one by one. The experimental results confirm the expected capability of the proposed methodology; the predictions of performance order were rather good compared with real execution times (in the order of 85%).


§1 Introduction
Computational science (CS) is often referred to as the third science, complementing both theoretical and laboratory science 29). As Fig. 1 shows, there is a symbiotic relationship between the three sciences: theoretical findings guide the experimentalists, experimental data is used to build and validate computational research, and computational research provides the theorists with new directions and ideas. In particular, as shown in Fig. 2, the components of CS are applications, algorithms, and architectures. CS is a scientific endeavor (an application) that is supported by concepts and skills of mathematics (algorithms) and computer science (architecture). Central to any computational science problem is a model of the problem. Building models that abstract the components of a real problem lets us make predictions of what might happen. CS takes advantage not only of improvements in computer hardware but also of improvements in computer algorithms and mathematical techniques. It allows resolving issues that were previously too difficult to tackle due to the intricacy of the mathematics, the large number of calculations involved, or a combination of both. Besides, it permits visualization, analysis, and interpretation of large data sets in ways that can inform some complex problems.
For a data-dependent parallel algorithm, similar input data sets may cause significant execution time variations. Measuring the time required to reach the solution (the fundamental performance metric) of any parallel algorithm for all possible input values would make it possible to answer any question about how the algorithm will respond under any set of conditions. Nevertheless, it is impossible to make all of these measurements.
The asymptotic parallel time complexity of a data-dependent algorithm depends not only on the number of processors (P) and on the problem size (N) but also on unknown parameters. Unfortunately, these other parameters are unpredictable and seemingly dependent on the properties of the data. Discovering these properties is a challenging problem and the key to obtaining a good estimate of the performance order of data-dependent parallel algorithms. Typical examples of these types of problems are sorting algorithms, searching algorithms, and solvers of the satisfiability problem, graph partitioning, the knapsack problem, bin packing, motion planning, and the traveling salesman problem, amongst many others. Many interesting related works in the literature deal with one implementation that solves one of the problems mentioned above and define its complexity. However, there are no studies showing how to evaluate the asymptotic time complexity of data-dependent algorithms.
The complexity of a sorting algorithm is established in terms of the size of the problem and in terms of the disorder of the given problem instance 12). As a consequence, different measures of disorder have been presented for these kinds of algorithms 9,12,35). The complexity of the problem of deciding whether a given propositional formula in conjunctive normal form is satisfiable, and of its variations, has been widely studied 32,2). The asymptotic complexity of the knapsack problem is defined by O(nW), where n is the number of items and W the knapsack capacity. Many articles discuss different implementations; most of them deal with the partially-ordered knapsack 23). It is well known that an exhaustive search for a traveling salesman problem (TSP) has an exponential time complexity 36). An exact solution takes O(n!) time, which is prohibitively long. For this reason, polynomial-time heuristic search approaches have been proposed 21,22).
This paper provides a general methodology that covers all the issues of the problem of defining the asymptotic parallel time complexity of data-dependent algorithms, using a computational approach. This is a good starting point for understanding some facts related to non-deterministic algorithms. Any contribution in this sense, however small, represents a great advance in this subject. The scientific experimental methodology begins by designing a considerable number of examples and measuring their execution times. A well-designed example reflects the working hypotheses at that time. It guides the experimenters in selecting what experiments actually need to be performed. A data-mining tool then explores the collected data in search of hidden patterns and/or relationships, detecting the main parameters that affect performance. The patterns and/or relationships are modelled numerically in order to generate an analytical formulation of the execution time required. Once the key parameters that characterise time complexity are known, new hypotheses can be formulated to restart the process and produce an improved time complexity model. Finally, the methodology predicts how the algorithm will perform on a particular parallel computer when it is given new data sets. The entire process is neither more nor less than the scientific method, a way to ask and answer scientific questions by making observations and doing experiments 20), computational experiments in this case.
The traveling salesman problem (TSP) has been chosen as a case study.
The TSP is of considerable importance not only from a theoretical point of view but also from a practical one. The drilling of printed circuit boards, the assignment of routes for the planes of a specified fleet, and the handling of materials in a warehouse are some examples. Therefore, there is a concrete need for defining the TSP asymptotic time complexity. As a representative of these practical problems, a parallel global pruning traveling salesman problem implementation (GP-TSP) has been analyzed. The goals of this research are both to examine the influence of indeterminism on the performance prediction of data-dependent parallel algorithms and to show the usefulness of the defined scientific experimental process for this kind of algorithm. A significant number of experiments confirm the expected capability of the proposed methodology.
The remainder of this article is organized as follows. The next section presents a general "knowledge discovery" methodology for the problem of defining the asymptotic parallel time complexity of data-dependent algorithms. Section 3 describes the case study: the traveling salesman problem. Besides, it provides detailed coverage of a TSP implementation and of the ideas underlying the discovery process of the significant input parameters. Section 4 summarizes and draws the main conclusions of this work.

§2 Knowledge discovery methodology
The scientific experimental knowledge discovery methodology attempts to define the asymptotic parallel time complexity of data-dependent algorithms.
The process of knowledge discovery is certainly not new. It is typical of the experimental sciences. An experimental science is based on observations obtained by performing repeated controlled experiments. Before computers were used to automate this process, researchers in mathematics, physics, or statistics were using probability techniques to model historical data.
The following subsections describe how the scientific experimental methodology works.

Design of experiments to propose a model
First of all, it is important to understand the algorithm domain and the relevant prior knowledge, and to analyze its behavior in depth, step by step. It is a trial-and-error method that requires specialists to identify, manually or automatically, the relevant parameters that can affect the execution time of the studied algorithm. Discovering the key parameters is the basis for obtaining a good prediction capacity. Including too many parameters may lead to an accurate but too complex or even unsolvable model. Hence, great care should be taken in selecting parameters and a reasonable trade-off should be made. CS takes advantage not only of improvements in computer hardware but also, probably more importantly, of improvements in computer data-mining algorithms and mathematical techniques.
Creating a well-designed experiment involves articulating a goal, choosing an output that characterizes an aspect of that goal, and specifying the data that will be used in the study, taking into account the working hypotheses at that time. The experiments must provide a good training data set, first to measure the quality of the model / hypotheses and then to fit the model. After the necessary training data set has been defined, the studied parallel algorithm must process each training data item to obtain its corresponding execution time (left side of Fig. 3).
The term knowledge discovery in databases (KDD) refers to the process of analyzing data from different perspectives and summarizing it into useful information. Technically, KDD is the process of finding correlations or patterns among dozens of fields in large relational databases. A KDD process involves data preparation (cleaning, preprocessing, and transformation), defining a data-mining study, reading the valuable data and building a model, understanding the model, and finally predicting. Every step is shown in Fig. 3. Understanding the previous models and the final model will help researchers to improve their general knowledge about the problem of interest. It is an interactive and iterative process, involving numerous steps with many decisions that the end-user carries out 18).
The stage of defining a data-mining study encompasses both the choice of a data-mining technique (classification, regression, clustering, dependency modeling, summarization of data, or change and deviation detection) and the selection of the particular algorithm to apply according to the chosen technique. It is important for these techniques to have the ability to identify relevant attributes and to disregard the superfluous ones.
Regarding the analysis of the problem, a clustering study could be performed to potentially identify groups. Current clustering techniques can be broadly classified into two categories: partitional and hierarchical. Given a set of records, partitional clustering obtains a partition of the records into k subsets (clusters), so that the data in each subset (ideally) share some common traits (often proximity according to some defined distance measure). Hierarchical clustering algorithms produce a nested sequence of partitions, working either bottom-up or top-down. In this work, the values of some important parameters that affect the performance of the studied algorithm, together with their corresponding measured execution times, are analyzed with a k-means clustering algorithm (a partitional algorithm) included in a data-mining tool in order to summarize them into useful information (patterns). A concise description of the result including only a handful of necessary attributes is more desirable than one with hundreds of attributes (see the right side of Fig. 3). By studying the attributes that characterise time complexity, it becomes possible to formulate new hypotheses, restart the process, and produce an improved time complexity model (dashed lines in Fig. 3).
In summary, Fig. 3 shows the high-level steps involved in the overall knowledge acquisition process: designing the training data set, executing the studied parallel algorithm, preparing the data, defining a data-mining study and building a model, understanding the model, and finally predicting. The design of experiments is directly related to the suspected hypotheses. The solid lines represent the compulsory path to follow in the methodology and the dashed lines represent paths of refinement.

Validation of the model
A new data set is proposed in order to validate the created model (left side in Fig. 4). The validation data set constitutes a hold-out sample: it has not been considered in the building of the model. This makes it possible to estimate the error in the predictions without assuming that the execution times follow a particular distribution. By instantiating the analytical formulation (called asymptotic time complexity in Fig. 4) with validation data and with particular values of a parallel computer architecture, it is possible to make predictions of performance order. The quality analysis is an important issue in this stage and has to include relevant measurements. Each prediction is then compared to the measured execution time to obtain the prediction error (right side in Fig. 4). The average of the square of this prediction error makes it possible to compare different models and to assess the accuracy of the model in making predictions.
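The comparison of models by the average squared prediction error can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the function name `mean_squared_error` and the execution times are hypothetical.

```python
def mean_squared_error(predicted, measured):
    """Average of the squared prediction errors over a validation set."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(predicted)


# Hypothetical execution times in seconds, for illustration only.
measured = [10.0, 20.0, 40.0]
model_a = [12.0, 18.0, 41.0]   # predictions of a first candidate model
model_b = [15.0, 25.0, 30.0]   # predictions of a second candidate model

mse_a = mean_squared_error(model_a, measured)
mse_b = mean_squared_error(model_b, measured)

# The model with the smaller average squared error is preferred.
better = "A" if mse_a < mse_b else "B"
```

Because the validation set is held out from model building, a lower value here indicates better predictive accuracy rather than a better fit to the training data.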

Predicting the performance order
The objective of predictive modeling is always to build a model that best predicts outcomes for future data. In this study, building a model means giving a formulation of complexity order. Then, every prediction of performance order for new input data on a particular parallel computer is obtained by instantiating the asymptotic time complexity.

The entire methodology
The entire methodology is shown in Fig. 5. The upper part of the picture describes how the model is built and fitted in order to define the final asymptotic time complexity formulation (sections 2.1 and 2.2). The lower part of the picture shows the prediction framework (section 2.3).
Every stage in the defined methodology can involve a backward motion to previous steps (dashed lines in Fig. 5) in order to obtain extra or more precise information.
Fig. 5 The entire novel methodology.

§3 A case study: the traveling salesman problem
The traveling salesman problem (TSP) is one of the most famous problems in the field of combinatorial optimization. In spite of the apparent simplicity of its formulation, the TSP is a data-dependent problem that is complex to solve. Not only has the complexity of its solution been a continuing challenge to many researchers 7,33,36,3), but so has the prediction of the time required to reach the solution.
This section gives an overview of the existing TSP problems as well as of many practical issues that can be formulated as TSP problems. In addition, a parallel Euclidean TSP implementation is presented. The objective is to show the usefulness of the general knowledge discovery process.

TSP problem statement
The TSP for C cities is the problem of finding a tour that visits every city exactly once and returns to the starting city so that the sum of the distances between consecutive cities is minimized. The requirement of returning to the starting city does not change the computational complexity of the problem 13).
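As an illustration of this statement, the sketch below computes the length of a closed tour and finds the optimum by exhaustive enumeration. It assumes Euclidean distances in the plane; the names `tour_length` and `brute_force_tsp` and the four cities are hypothetical and only serve the example.

```python
import itertools
import math


def tour_length(cities, order):
    """Length of the closed tour visiting `cities` in `order` and returning to the start."""
    total = 0.0
    for i in range(len(order)):
        a = cities[order[i]]
        b = cities[order[(i + 1) % len(order)]]
        total += math.dist(a, b)
    return total


def brute_force_tsp(cities, start=0):
    """Try every permutation of the remaining cities: (C - 1)! tours in total."""
    others = [i for i in range(len(cities)) if i != start]
    best_order, best_len = None, math.inf
    for perm in itertools.permutations(others):
        order = (start, *perm)
        length = tour_length(cities, order)
        if length < best_len:
            best_order, best_len = order, length
    return best_order, best_len


# Four hypothetical cities on the corners of a 3 x 4 rectangle.
cities = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
order, length = brute_force_tsp(cities)
```

For the rectangle, the shortest tour is its perimeter. The factorial number of candidate tours is what makes this approach impractical beyond a handful of cities.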

TSP practical problems
The TSP is of considerable importance not only from a theoretical point of view but also from a practical one. Issues having the TSP structure occur in the analysis of the structure of crystals 8), in handling materials in a warehouse 30), in clustering of data arrays 25), in sequencing jobs on a single machine 15), in physical mapping problems 1), in genome rearrangement 31), and in phylogenetic tree construction 24), amongst many others. Related variations on the TSP include the resource constrained traveling salesman problem, which has applications in scheduling with an aggregate deadline 28). The prize collecting traveling salesman problem 5) and the orienteering problem 17) are special cases of this. The objective of the max TSP is to find a tour of maximum length 6). The maximum scatter TSP is the problem of computing a path on a set of points that maximizes the minimum edge length in the path. This kind of algorithm is motivated by applications in manufacturing and medical imaging 4). Most importantly, the TSP often emerges as a subproblem in more complex combinatorial problems. The best known of these is the problem of determining, for a fleet of vehicles, which customers should be served by each vehicle and in what order each vehicle should visit the customers assigned to it 11).
But why is it so important to define the asymptotic time complexity in any of the above examples? For instance, a model for the vehicle routing problem can provide answers to questions such as 'How much computing time is needed to finish daily scheduling?'; 'What happens to the complexity if data and machine sizes are modified?'; 'How many machines are needed to finish the work within a given time order?' Besides, the purpose of parallel complexity theory is to explore the extent to which efficient and fast parallel computation is possible 10).

GP-TSP implementation
A Euclidean global pruning TSP *1 implementation (called GP-TSP) is presented to obtain the exact TSP solution on a parallel machine. It is used to analyze the influence of indeterminism in defining the asymptotic time complexity and also to show the usefulness of the methodology. The implementation is a branch-and-bound algorithm which recursively searches all possible Hamiltonian paths and prunes large parts of the search space by maintaining a global variable containing the length of the shortest path found up to that moment.
Each city is represented by two coordinates in the Euclidean plane; thus, a symmetric distance exists for each pair of cities. Considering C different cities, the Master defines a certain level L to divide the tasks. Tasks are the evaluations of the possible permutations of C − 1 cities in L elements. The granularity G of a task is the number of cities that defines the task sub-tree: G = C − L. At execution time, the Master sends tasks together with a variable containing the length of the shortest path found up to that moment.
A diagram of the possible permutations for 5 cities, considering that the salesman starts and ends his trip at city 1, can be seen in Fig. 6. The Master can divide this problem into one task of level 0, four tasks of level 1, or twelve tasks of level 2, for example. The tasks of the first level would be represented by the cities 1 and 2 for the first task, 1 and 3 for the second, followed by 1 and 4, and 1 and 5.

*1 Since any distance measure (Euclidean, Manhattan, Chebyshev, ...) would be equivalent, it was decided to use the Euclidean one. Besides, for simplicity, cities were considered in R^2 instead of R^n. Finding an optimal Hamiltonian circuit with the least weight in a complete weighted graph (where the vertices, the edges, and the weights represent the cities, the roads, and the cost or distance of each road, respectively) would be an equivalent formulation. Consequently, the ideas of this article can be generalized.

Fig. 6 Possible paths for the salesman considering 5 cities.
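The task counts quoted for the 5-city example follow from counting the ordered selections of L cities among the C − 1 cities other than the starting one. The sketch below checks them; the function names `task_count` and `granularity` are hypothetical.

```python
import math


def task_count(num_cities, level):
    """Number of tasks when the Master splits at `level`: ordered choices of
    `level` cities out of the `num_cities - 1` cities other than the start."""
    return math.perm(num_cities - 1, level)


def granularity(num_cities, level):
    """G = C - L, as defined in the text."""
    return num_cities - level


# C = 5 cities, levels 0, 1 and 2.
counts = [task_count(5, level) for level in range(3)]
grains = [granularity(5, level) for level in range(3)]
```

A deeper level yields more, but smaller, tasks: the count grows as (C − 1)! / (C − 1 − L)! while the granularity shrinks linearly.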
Workers are responsible for calculating the distance of the permutations left in the task and for sending the minimum path and distance of these permutations to the Master. Since the performance of the GP-TSP is not only I/O-bound but also CPU-bound, it is very important to minimize the number of distance computations. Pruning conditions must be applied not only to avoid accessing irrelevant sets of cities but also to minimize the number of distances computed.
If the length of a partial path is bigger than the current minimal length, this path is not expanded further and a part of the search space is pruned in the GP-TSP. Hence, the estimate of execution times becomes a complex function.
The GP-TSP algorithm internally works on a symmetric matrix of Euclidean distances. Fig. 7(a) shows an example of a strictly lower triangular matrix of Euclidean distances, while Fig. 7(b) shows a pruning process. In Fig. 7(b), each arrow has an associated distance coefficient between the two cities it connects. The total distance of the first path (on the left) is 40 units. The distance between 1 and 2 on the second path (on the right) is already 42 units. It is then not necessary for the GP-TSP algorithm to continue calculating distances from city 2 onwards because it is impossible to improve the distance for this branch.
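The pruning rule can be illustrated with a sequential sketch. It is not the authors' GP-TSP code: the Master/Worker distribution and the exchange of the global bound between processes are not modelled, and the function name `gp_tsp_sequential` and the sample cities are hypothetical.

```python
import math


def gp_tsp_sequential(cities, start=0):
    """Exact TSP by depth-first branch and bound with one shared bound."""
    n = len(cities)
    dist = [[math.dist(a, b) for b in cities] for a in cities]
    best = {"length": math.inf, "tour": None, "expanded": 0}

    def search(path, visited, partial):
        # Pruning rule from the text: a partial path already longer than the
        # best complete tour found so far cannot lead to an improvement.
        if partial > best["length"]:
            return
        best["expanded"] += 1
        if len(path) == n:
            total = partial + dist[path[-1]][start]  # close the tour
            if total < best["length"]:
                best["length"], best["tour"] = total, list(path)
            return
        for city in range(n):
            if city not in visited:
                visited.add(city)
                path.append(city)
                search(path, visited, partial + dist[path[-2]][city])
                path.pop()
                visited.remove(city)

    search([start], {start}, 0.0)
    return best["tour"], best["length"], best["expanded"]


# Four hypothetical cities on the corners of a 3 x 4 rectangle.
cities = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
tour, length, expanded = gp_tsp_sequential(cities)
```

The number of expanded nodes depends on how early a short tour is found, which in turn depends on the coordinates and on the starting city. This is one way to see why the execution time is data-dependent.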

Discovering the significant GP-TSP input parameters
Using simple experiments, varying one or two values at a time, it is possible to infer that the time required by the parallel GP-TSP algorithm depends on the number of processors (P), on the number of cities (C), and on other parameters. The values of these other parameters are data-dependent. As a result of the investigation, the sum of the distances from one city to the other cities (SD) and the mean deviation of the SD values (MDSD) are currently the numerical parameters that characterize the different input data beyond the number of cities.
But how have these final parameters been obtained? Next, the way to discover the above dependencies and [...] is described.
Specification of the parallel machine: The results have been obtained on a [...]-node homogeneous PC cluster (Pentium IV 3.0 GHz, DDR 400 MHz, Gigabit Ethernet) at the Computer Architecture and Operating Systems Department, University Autonoma of Barcelona. All the communications have been accomplished using a switched network with a mean distance between two communication end-points of two hops. The switches enable dynamic routes in order to overlap communication.

[ 1 ] First hypothesis: location of the cities (geographical pattern)
Initial experiments have provided evidence that, for a given number of cities, the time required for the completion of the algorithm differs with the pattern of distribution of the cities. In order to explain the general process and to show its progress and results, an example data set has been chosen and is followed throughout this section. It consists of 5 different geographical patterns (named G1 to G5) of 15 cities each (C1 to C15). The GP-TSP algorithm receives the number of cities (C) and their coordinates, the level (L), and the number of processors (P). It proceeds by recursively searching all possible paths, applying the pruning strategy whenever possible, and finally returning the minimal path and the time spent. Table 1 shows the execution times (ET, in sec.) by pattern (columns G1 to G5) and starting city (SC, C1...C15) for the GP-TSP using only 8 nodes of the parallel machine. It is important to observe the dispersion of times while the number of processors and the number of cities are kept constant.
Clustering, a data-mining technique, has been applied to discover the internal information of these values and then to decrease the data-dependence. It divides the data set into distinct clusters, so that the data in each cluster share some common trait (similar execution time). The clustering has been done using the well-known k-means clustering algorithm 26) included in Cluster-Frame 16). Cluster-Frame is a dynamic and open clustering environment.
To obtain groups comparable to the patterns used at the beginning, k has been fixed at 5 (k is the number of searched clusters). The initial centroids (one for each cluster) have been randomly selected by the clustering application. The k-means algorithm aims at minimizing a squared error function. The objective function with n data points and k disjoint subsets is:
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left| x_i^{(j)} - c_j \right|^2 \qquad (1)$$
where $\left| x_i^{(j)} - c_j \right|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centroid $c_j$. Eq. 1 is an indicator of the distance of the n data points from their respective cluster centroids.
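The minimization of the squared-error function can be sketched for scalar execution times as follows. This is a plain implementation of Lloyd's iteration, not the Cluster-Frame code; the initial centroids are given explicitly (instead of being drawn at random) so that the run is reproducible, and the times are hypothetical values, not those of Table 1.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Lloyd's algorithm on scalar values, starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: (v - centroids[j]) ** 2)
                  for v in values]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for j, c in enumerate(centroids):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else c)
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def squared_error(values, centroids, labels):
    """Sum of squared distances of every value to its assigned centroid."""
    return sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))


# Hypothetical execution times in seconds.
times = [8.0, 9.0, 10.0, 36.0, 38.0, 90.0, 94.0]
centroids, labels = kmeans_1d(times, centroids=[8.0, 36.0, 90.0])
error = squared_error(times, centroids, labels)
```

Each of the two steps can only lower or keep the squared error, so the iteration converges, although possibly to a local minimum that depends on the initial centroids.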
Columns Cl in Table 1 show the assigned cluster for each sample. For the clusters 1 to 5, the centroid values have been 92.23, 16.94, 37.17, 10.19, and 7.94 seconds, respectively.
The quality evaluation involves the validation of the above-mentioned hypothesis. For each experiment, the assigned cluster has been compared with the previously defined geographical pattern. The percentage of hits expresses the capacity of prediction. A simple observation is that the execution times have been clustered in a way similar to the patterns fixed at the start (Fig. 8). In this example, the capacity of prediction has been 65% for the GP-TSP. This means that there is a close relationship between the patterns and the run times.

Conclusions:
The initial hypothesis for the GP-TSP has been corroborated; the capacity of prediction has been greater than 60% for the full range of experiments tested. The percentage obtained also gives evidence of the existence of other significant parameters. A deep analysis of the results revealed an open issue that remained for discussion and resolution: the singular execution times within each pattern. Therefore, another major hypothesis was formulated. At this stage, the asymptotic parallel time complexity was defined as O(P, C, pattern).

[ 2 ] Second hypothesis: location of the cities and starting city
The data set is the same as previously used. Comparing each chart of Fig. 8 with its corresponding column in Table 1, it is easy to infer some important facts:
• The two furthest cities (1, 2) in Fig. 8(a) correspond to the two highest time values, for starting cities C1 and C2, in Table 1 (G1).
• The four furthest cities (1, 4) in Fig. 8(b) correspond to the four highest execution time values, for starting cities C1 to C4, in Table 1 (G2).
• The six furthest cities in Fig. 8(c) correspond to the six highest time values in Table 1 (G3).
• The cities in Fig. 8(d) are distributed between two zones; therefore, the times turn out to be similar enough, see Table 1 (G4).
• Finally, the cities in Fig. 8(e) are close enough; in consequence, their times are quite similar, see Table 1 (G5).
Another important observation is that the mean of execution times by pattern decreases as the cities approach, see again Table 1.

Conclusions:
Sampling demonstrates that the location of the cities and the starting city (SC) play an important role in run times; the hypothesis has been corroborated. However, an open issue remains for discussion and solution: how to relate a pattern (in general) to a numerical value that reflects execution time. Such a relationship establishes a numerical characterization of patterns. On this basis, a new hypothesis was formulated. At this point, the asymptotic time complexity for the GP-TSP was redefined as O(P, C, pattern, SC).
Here $d(p) = \sum_{q \in G} d(p, q)$ is the distance of a vertex p of the connected graph G (Eq. 2), where d(p, q) is the distance between p and q and the summation extends over all vertices q of G. This measure is an inverse measure of centrality. At this stage, the worked inputs are the sum of the distances from one city (x, y) to the other cities (SD, as shown in Eq. 2) and the mean deviation of the SD values (MDSD).
The greater the sum of the distances, the lower the centrality. If a particular city is very far from the others, its SD will be considerably greater than the rest and consequently the execution time will also increase. This can be observed in Table 2. Therefore, the SD value is an indicator of execution time. Why is it necessary to consider MDSD in addition to SD as a significant parameter? Quite similar SD values from the same pattern (same column) of Table 2 imply similar execution times. In G1, the SD values for starting cities C4 and C10 are 230.11 and 234.84, respectively. Their execution times (ET) are similar, 72.64 and 74.96 seconds respectively. But this relation does not hold for similar SD values from different patterns (different experiments). The SD value for G1 and starting city C3 and the SD value for G2 and starting city C10 are similar (315.51 and 323.12, respectively), but the execution times are considerably different. The distinct values of MDSD for G1 and G2 explain the variation of execution times for these similar SD values.
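The two parameters can be computed directly from the city coordinates, as the following sketch shows. Two points are assumptions rather than statements of the source: the mean deviation is taken here as the mean absolute deviation about the arithmetic mean, and the function names and the four cities are hypothetical.

```python
import math


def sum_of_distances(cities):
    """SD for every city: sum of Euclidean distances to all other cities."""
    return [sum(math.dist(p, q) for q in cities) for p in cities]


def mean_deviation(values):
    """Mean absolute deviation about the arithmetic mean (assumed definition)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# Hypothetical pattern: three cities close together and one far away.
cities = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (30.0, 40.0)]
sd = sum_of_distances(cities)
mdsd = mean_deviation(sd)

# The outlying city has the largest SD, i.e. the lowest centrality.
farthest = max(range(len(cities)), key=lambda i: sd[i])
most_central = min(range(len(cities)), key=lambda i: sd[i])
```

A pattern whose cities are spread evenly gives SD values close to each other and hence a small MDSD, whereas a pattern with outliers gives a large one. This is why the pair (SD, MDSD) can separate cases that SD alone cannot.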
Conclusions:
The asymptotic parallel time complexity for the GP-TSP algorithm should be defined as O(P, C, SD, MDSD). A new fact was discovered in the process. By choosing the city with the minimum associated SD value, it is possible to obtain the exact TSP solution in less time. Even better results would be reached if the algorithm began by considering the L cities closest to that city.
Paula FRITZSCHE and Dolores REXACHS and Emilio LUQUE

Prediction of GP-TSP performance order
To start, the concepts of geometric pattern matching under Euclidean motion were applied to measure the mismatch between a new data set and historical patterns. The Hausdorff distance as a function of relative position was the distance used 19). Then, the comparison module was definitively replaced by an identification number in order to make the prediction. Comparing graphical patterns entails a higher level of uncertainty than using a good analytical expression.
The redefinition of the asymptotic parallel time complexity for the non-deterministic TSP algorithm represents a significant qualitative change. At this moment, the GP-TSP has a time complexity of O(P, C, SD, MDSD). The analytical formulation allows making predictions for a new data set on a particular parallel computer. Fig. 9 shows the prediction framework.
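For reference, the Hausdorff distance between two finite sets of cities can be sketched as follows. This is only the plain distance for one fixed relative position; the minimization over Euclidean motions mentioned above is not included, and the function names and point sets are hypothetical.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point of `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical patterns: a unit square and the same square with one corner moved.
pattern = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
new_set = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 4.0)]
mismatch = hausdorff(pattern, new_set)
```

A single displaced city dominates the result, which illustrates how sensitive a purely geometric comparison is compared with a numerical characterization of the data set.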
Fig. 9 The prediction of performance order framework.

Two relevant GP-TSP experiments
Additional TSP experiments have been carried out to test certain hypotheses. The tests show how important the geographical patterns of cities are in comparison to knowing only their coordinates. Two groups of experiments, each following a specific pattern, have helped to confirm our hypotheses. Both groups are presented to illustrate this section.
[ 1 ] Importance of the geographical pattern
Applying geometric transformations (shifting, scaling, and rotation) to patterns is without a doubt a trivial test, yet it is an excellent case study for understanding the importance of geographical patterns. When each of the transformations is applied to a set of cities, similar times are expected using the same algorithm. This leads to the conclusion that the time required to reach the solution of the GP-TSP algorithm is invariant under certain transformations of the geographical patterns.
• The coordinates of a city shifted by Δx in the x-dimension and Δy in the y-dimension are given by
$$x' = x + \Delta x, \qquad y' = y + \Delta y \qquad (3)$$
where x and y are the original and x′ and y′ are the new coordinates.
• The coordinates of a city scaled by a factor S_x in the x-direction and y-direction (the distances among cities are enlarged when S_x is greater than 1 and reduced when S_x is between 0 and 1) are given by
$$x' = S_x \, x, \qquad y' = S_x \, y \qquad (4)$$
• The coordinates of a city rotated through an angle θ about the origin of the coordinate system are given by
$$x' = x \cos\theta + y \sin\theta, \qquad y' = -x \sin\theta + y \cos\theta \qquad (5)$$
[...] similar starting from each city. The mean deviations of execution times were smaller than 1%.

This article shows that a knowledge discovery methodology based on simplicity, including only the major factors in performance, can define the asymptotic parallel time complexity of data-dependent algorithms with useful accuracy. Nevertheless, it is important to understand that the performance of a parallel application results from at least the following factors: algorithm, implementation, underlying processor architecture, and interconnection technology.
In short, the general knowledge discovery methodology begins by designing a considerable number of experiments and measuring their execution times.
A well-designed test guides the researcher in choosing what experiments actually need to be performed in order to provide a representative sample. A data-mining tool then explores the collected data in search of patterns and/or relationships, detecting the main parameters that affect performance. Once the key parameters that characterise performance are known, new hypotheses can be formulated to restart the process and produce an improved time complexity model. Finally, the methodology predicts the performance order for new data sets on a particular parallel computer by replacing a numerical identification.
As a case study, a global pruning TSP (named GP-TSP) algorithm has been examined. It has been used to analyze the influence of indeterminism on performance prediction and also to show the practicality and the advantages of the methodology. The asymptotic time complexity for the parallel GP-TSP algorithm depends on the number of processors (P), the number of cities (C), and other parameters. As a result of the investigation, right now the sum of the distances from one city to the other cities (SD) and the mean deviation of these

Fig. 1 Computational science as the third science.

Fig. 7 (a) Matrix of distances and (b) a pruning process in the GP-TSP algorithm.

Fig. 8 shows the five patterns used for the 15 cities.

[ 3 ] Third hypothesis: sum of distances and mean deviation of sum of distances
What parameters could be used to quantitatively characterize different geographical patterns in the distribution of cities? In graph theory, the distance of a vertex p, d(p), of a connected graph G is defined by
$$d(p) = \sum_{q \in G} d(p, q) \qquad (2)$$

Table 4 Mean and mean deviation of execution times (in sec.) by pattern and starting cities.