##
Dağıtık veri tabanlarında sorgu optimizasyonu

Dağıtık veri tabanlarında sorgu optimizasyonu

##### Dosyalar

##### Tarih

1995

##### Yazarlar

Tezel, Banu

##### Süreli Yayın başlığı

##### Süreli Yayın ISSN

##### Cilt Başlığı

##### Yayınevi

Fen Bilimleri Enstitüsü

##### Özet

Bu çalışmada, Dağıtık Veritabanı Sistemlerinin Yönetimi' nde önemli problemlerden biri olarak yer alan "Sorgu Optimizasyonu" konusu incelenmiştir. Merkezi bir veritabanı sistemi sözkonusu olduğunda bile veritabanı üzerinde sorgulamayı optimize yapmak zor bir problemdir.Farklı veritabanları, veritabanı yönetim sistemleri, işletim sistemleri, ağ protokolleri vs. bileşenlerinden oluşan dağıtık bir veritabanı yönetim sisteminde bu problemin daha zor ve kompleks bir hal alacağı aşikardır. Tez konu itibari ile iki ana bölümden oluşmaktadır.İlk bölümde,dağıtık sorgu optimizasyonu ile ilgili temel kavram ve teknikler üzerinde durulmuştur.Daha sonra bu Merkezileşmiş Sorgu Optimizasyon algoritmalarından INGRES, System R incelenmiştir.Ikinci bölümde ise Dağıtık INGRES,R*,SDD-1 ve AHY algoritmaları incelenmiş ve karşılaştırmaları yapılmıştır.Bölümler içinde konularla ilgili örnekler bulunmaktadır.

Query optimization represents both a challenge and an opportunity for relational systems; Incidentally, the advantage of optimization is not merely that users do not have to worry about how best to state their queries (i.e., how to phrase requests in order to get the best performance out of the system). On the contrary, there is a real possibility that the optimizer might actually do better than a human programmer. There are several reasons for this state of affairs. 1. A good optimizer will have a wealth of information available to it that human programmers typically do not have. To be specific, it should have certain statistical information, such as the cardinality of each domain, the cardinality of each table, the number of values in each column, the number of times each different value occurs in each column, and so on. (This information will be kept in the system catalog, of course.) As a result, the optimizer should be able to make a more accurate assessment of the efficiency of any given strategy for implementing a particular request, and thus should be more likely to choose the most efficient implementation. 2. Furthermore, if the database statistics change significantly (e.g., if the database is physically reorganized), then a different choice of strategy may be desirable; in other words, reoptimization may be required, in a relational system, reoptimization is trivial - it simply involves a reprocessing of the original relational request by the system optimizer. In a nonrelational system, by contrast, reoptimization involves rewriting the program, and will probably therefore not be done at all. 3. Third, the optimizer is a program, and therefore by definition is much more patient than a typical application programmer. The optimizer is quite capable of considering literally hundreds of different implementation strategies for a given request, whereas it is extremely unlikely that a human programmer would ever consider more than there or four. 4. Fourth, the optimizer can be regarded in a sense as embodying the skills and services of "the best" human programmers. As a consequence, it has the effect of making those skills and services available to everybody which means, of course, that it VI is making an otherwise scarce set of resources available to a wide range of users, in an efficient and cost effective manner. All of the above should serve as evidence in support of the claim, made at the beginning of this 2nd section that optimizability -i.e., the fact that relational requests are optimizable- is in fact a strength of relational systems. The overall purpose of the optimizer, then, is to choose an efficient strategy for evaluating a given relational expression. In the second section two of the most popular query optimization techniques for centralized systems are presented. This presentation is a prerequisite to understanding distributed query optimization for there reasons. First, a distributed query is translated into local queries, each of which is processed in a centralized way. Second, distributed query optimization techniques are often extensions of the techniques for centralized systems. Finally, centralized query optimization is a simpler problem; the minimization of communication costs makes distributed query optimization more complex. In this section some common techniques for query decomposition, two popular relational database systems are compared: INGRES and System R. Furthermore, both systems have distributed versions whose optimization algorithms are extensions of the centralized version. The optimization techniques of these systems differ significantly. INGRES employs a dynamic optimization algorithm and System Ruses a static optimization algorithm based on exhaustive search using statistics about the database. We note that most commercial relational DBMSs ikhmekerii vadiariiz of üne exhaustive search approach for its efficiency and compatibility with query compilation. In the third section the basic concepts and techniques for distributed query optimization are presented. Their application in the basic query optimazition algorithms of Distributed INGRES, R*, SDD-1, and AHY are reviewed. With the availability of faster communication networks (e.g., local area networks), more recent distributed DBMSs consider local processing costs as well. Performance evaluation of R* has shown significant contributions of local costs, even for general networks, which have become faster. The main factor affecting the performance of an execution strategy is the size of the intermediate relations that are produced during the execution. When a subsequent operation is located at a different site, the intermediate relation must be VII transmitted over the network. Therefore, it is of prime interest to estimate the size of the intermediate results of relational algebra operations in order to minimize the size of data transfers. This estimation is based on statistical information about the base relatioons and formulas to predict the cardinalities of the results of the relational operations. Important inputs to the query optimization problem are the database statistics and the formulas used to estimate the size of intermediate results. There is a direct trade -off between performance and the accuracy (and maintenance cost) of statistics. The critical operation is generally the join of distributed relations. For most frequent joins that are not on foreign keys, join selectivity factors should be of great benefit. The use of statistics can be avoided by applying a simple algorithm based on heuristics to trasform a query. However, it has been recognized in that exhaustive search of the solution space, based on statistics, significantly outperforms the heuristic approaches. When done statically ( at compile time ), the overhead incurred by exhaustive search is rapidly amortized if the query becomes complexs or is executed frequently. As a prerequisite to understanding distributed query optimization, we have introduced centralized query optimization. The two most popular algorithms used by INGRES and System R were presented. These algorithms form the basis of a distributed optimization algorithm for distributed INGRES and R*,respectivly. Two approaches to solve distributed join queries, which are usually considered the most important type of queries have been seen. The first illustrated by Distributed INGRES and R*, considers join ordering. Semijoins are beneficial only when a join has good selectivity, in which case the semijoins act as powerful size reducers. The first system that make extensive use of semijoins assumed a slow wide area networks, the local processing cost. However, with fast local area networks, the local processing cost and sometimes even more important. Therefore, semijoins should be employed carefully and not systematiclly since they tend to increase the local processing cost.Join and semijoin techniques should be both employed as shown in, because each technique is better than the other, depending on database-dependent parameters. Semijoins implemented by hashed bit arrays are shown in to be very beneficial. The query optimization algorithm of Distributed INGRES (Epstein ct. al. 1978) is derived from the algorithm used in centralized INGRES (see Section 9.2. 1.). Therefore, it consists of dynamically optimizing the processing strategy of a given viu query. The objective function of the algorithm is to minimize a combination of both the communication cost and the response time. Because these two objectives are conflicting (increasing communication cost may well decrease response time), the function can give a greater weight to one or the other. Note that this query optimization algorithm ignores the cost of transmitting the data to the result site. The algorithm also takes advantage of fragmentation, but only horizontal fragmentation is handled for simplicity. The distributed query optimization algorithm of R* (Selinger and Adiba, 1980) and (Lohman et al., 1985) is a substantial extrential extension ofo the techniques developed for System R's optimizer (see Section 9.2.2). Therefore, it uses a compilation approach where an exhaustive search of all alternative strategies is performed in order to choose the one with the least cost. Although predicting and enumerating these strategies is costly, the overhead of exhaustive search is rapidly amortized if the query is executed frequently. Although the algorithm described in (Selinger and Adiba, 1980) deals with fragmentation, the implemented version of R* supports neither fragmentation nor replication. Therefore, the R* query processing algorithm deals only with relations as basic units. Query compilation is a distributed tast in R*, coordinated by a master site, where the query is initiated. The optimizer of the master site makes all intersite decisions, such as the selection of the execution sites and the fragments as well as the method for transferring data. Another promising approach to solving join queries is through the use of auxiliary and small data structures, called join indices which have the potential of minimizing both local processing cost and communication cost. The performance analyses reported in already exhibit the value of join indices in a centralized context for optimizing both complex relational queries and recursive queries. Join indices can be useful in a distributed context as well. In this section,mostly on join queries for two reasons: join queries are the most frequent queries in the relational framework and they have been studied extensively are focused.Furthermore, the number of joins involved in queries expressed in languages of higher expressive power than relational calculus can be extremely large, making the join ordering more crucial. However, the optimization of general queries containing joins, unions, and aggregate functions is a harder problem. Distributing unions over joins is a simple and good approach since the query can be reduced as a union of join subqueries, which are optimized individually. Note also that the unions are more frequent in distrubuted DBMSs then in centralized DBMSs because they permit the localisation of horizontally fragmented relations. IX The consepts and solutions presented in this section have been illustrated by their use in four query processing algorithms, chosen for their generality and practical interest (all of which were implemented). However, other methods can be derived from these concepts. A good aproach is to specialize the optimization algorithm to take advantage of the network topology, such as star networks, satellite networks and broadcast networks. Query processing in a distributed system requeries the transmission of data between computers in a network. The arrangement of data transmissions and local data processing is known as a distribution strategy for a query. Two cost measures, response time and total time are used to judge the quality of a distribution strategy. Simple algorithms re presented that derive distribution stategies which have minimal response time minimal total time, for a spcial class of queries. These optimal algorithms are used as a basis to develop a general query processing algorithm. Distributed query examples are presented and the complexity of the general algorithm is analyzed. The integration of a query processing subsystem into a distributed database management system is discussed.

Query optimization represents both a challenge and an opportunity for relational systems; Incidentally, the advantage of optimization is not merely that users do not have to worry about how best to state their queries (i.e., how to phrase requests in order to get the best performance out of the system). On the contrary, there is a real possibility that the optimizer might actually do better than a human programmer. There are several reasons for this state of affairs. 1. A good optimizer will have a wealth of information available to it that human programmers typically do not have. To be specific, it should have certain statistical information, such as the cardinality of each domain, the cardinality of each table, the number of values in each column, the number of times each different value occurs in each column, and so on. (This information will be kept in the system catalog, of course.) As a result, the optimizer should be able to make a more accurate assessment of the efficiency of any given strategy for implementing a particular request, and thus should be more likely to choose the most efficient implementation. 2. Furthermore, if the database statistics change significantly (e.g., if the database is physically reorganized), then a different choice of strategy may be desirable; in other words, reoptimization may be required, in a relational system, reoptimization is trivial - it simply involves a reprocessing of the original relational request by the system optimizer. In a nonrelational system, by contrast, reoptimization involves rewriting the program, and will probably therefore not be done at all. 3. Third, the optimizer is a program, and therefore by definition is much more patient than a typical application programmer. The optimizer is quite capable of considering literally hundreds of different implementation strategies for a given request, whereas it is extremely unlikely that a human programmer would ever consider more than there or four. 4. Fourth, the optimizer can be regarded in a sense as embodying the skills and services of "the best" human programmers. As a consequence, it has the effect of making those skills and services available to everybody which means, of course, that it VI is making an otherwise scarce set of resources available to a wide range of users, in an efficient and cost effective manner. All of the above should serve as evidence in support of the claim, made at the beginning of this 2nd section that optimizability -i.e., the fact that relational requests are optimizable- is in fact a strength of relational systems. The overall purpose of the optimizer, then, is to choose an efficient strategy for evaluating a given relational expression. In the second section two of the most popular query optimization techniques for centralized systems are presented. This presentation is a prerequisite to understanding distributed query optimization for there reasons. First, a distributed query is translated into local queries, each of which is processed in a centralized way. Second, distributed query optimization techniques are often extensions of the techniques for centralized systems. Finally, centralized query optimization is a simpler problem; the minimization of communication costs makes distributed query optimization more complex. In this section some common techniques for query decomposition, two popular relational database systems are compared: INGRES and System R. Furthermore, both systems have distributed versions whose optimization algorithms are extensions of the centralized version. The optimization techniques of these systems differ significantly. INGRES employs a dynamic optimization algorithm and System Ruses a static optimization algorithm based on exhaustive search using statistics about the database. We note that most commercial relational DBMSs ikhmekerii vadiariiz of üne exhaustive search approach for its efficiency and compatibility with query compilation. In the third section the basic concepts and techniques for distributed query optimization are presented. Their application in the basic query optimazition algorithms of Distributed INGRES, R*, SDD-1, and AHY are reviewed. With the availability of faster communication networks (e.g., local area networks), more recent distributed DBMSs consider local processing costs as well. Performance evaluation of R* has shown significant contributions of local costs, even for general networks, which have become faster. The main factor affecting the performance of an execution strategy is the size of the intermediate relations that are produced during the execution. When a subsequent operation is located at a different site, the intermediate relation must be VII transmitted over the network. Therefore, it is of prime interest to estimate the size of the intermediate results of relational algebra operations in order to minimize the size of data transfers. This estimation is based on statistical information about the base relatioons and formulas to predict the cardinalities of the results of the relational operations. Important inputs to the query optimization problem are the database statistics and the formulas used to estimate the size of intermediate results. There is a direct trade -off between performance and the accuracy (and maintenance cost) of statistics. The critical operation is generally the join of distributed relations. For most frequent joins that are not on foreign keys, join selectivity factors should be of great benefit. The use of statistics can be avoided by applying a simple algorithm based on heuristics to trasform a query. However, it has been recognized in that exhaustive search of the solution space, based on statistics, significantly outperforms the heuristic approaches. When done statically ( at compile time ), the overhead incurred by exhaustive search is rapidly amortized if the query becomes complexs or is executed frequently. As a prerequisite to understanding distributed query optimization, we have introduced centralized query optimization. The two most popular algorithms used by INGRES and System R were presented. These algorithms form the basis of a distributed optimization algorithm for distributed INGRES and R*,respectivly. Two approaches to solve distributed join queries, which are usually considered the most important type of queries have been seen. The first illustrated by Distributed INGRES and R*, considers join ordering. Semijoins are beneficial only when a join has good selectivity, in which case the semijoins act as powerful size reducers. The first system that make extensive use of semijoins assumed a slow wide area networks, the local processing cost. However, with fast local area networks, the local processing cost and sometimes even more important. Therefore, semijoins should be employed carefully and not systematiclly since they tend to increase the local processing cost.Join and semijoin techniques should be both employed as shown in, because each technique is better than the other, depending on database-dependent parameters. Semijoins implemented by hashed bit arrays are shown in to be very beneficial. The query optimization algorithm of Distributed INGRES (Epstein ct. al. 1978) is derived from the algorithm used in centralized INGRES (see Section 9.2. 1.). Therefore, it consists of dynamically optimizing the processing strategy of a given viu query. The objective function of the algorithm is to minimize a combination of both the communication cost and the response time. Because these two objectives are conflicting (increasing communication cost may well decrease response time), the function can give a greater weight to one or the other. Note that this query optimization algorithm ignores the cost of transmitting the data to the result site. The algorithm also takes advantage of fragmentation, but only horizontal fragmentation is handled for simplicity. The distributed query optimization algorithm of R* (Selinger and Adiba, 1980) and (Lohman et al., 1985) is a substantial extrential extension ofo the techniques developed for System R's optimizer (see Section 9.2.2). Therefore, it uses a compilation approach where an exhaustive search of all alternative strategies is performed in order to choose the one with the least cost. Although predicting and enumerating these strategies is costly, the overhead of exhaustive search is rapidly amortized if the query is executed frequently. Although the algorithm described in (Selinger and Adiba, 1980) deals with fragmentation, the implemented version of R* supports neither fragmentation nor replication. Therefore, the R* query processing algorithm deals only with relations as basic units. Query compilation is a distributed tast in R*, coordinated by a master site, where the query is initiated. The optimizer of the master site makes all intersite decisions, such as the selection of the execution sites and the fragments as well as the method for transferring data. Another promising approach to solving join queries is through the use of auxiliary and small data structures, called join indices which have the potential of minimizing both local processing cost and communication cost. The performance analyses reported in already exhibit the value of join indices in a centralized context for optimizing both complex relational queries and recursive queries. Join indices can be useful in a distributed context as well. In this section,mostly on join queries for two reasons: join queries are the most frequent queries in the relational framework and they have been studied extensively are focused.Furthermore, the number of joins involved in queries expressed in languages of higher expressive power than relational calculus can be extremely large, making the join ordering more crucial. However, the optimization of general queries containing joins, unions, and aggregate functions is a harder problem. Distributing unions over joins is a simple and good approach since the query can be reduced as a union of join subqueries, which are optimized individually. Note also that the unions are more frequent in distrubuted DBMSs then in centralized DBMSs because they permit the localisation of horizontally fragmented relations. IX The consepts and solutions presented in this section have been illustrated by their use in four query processing algorithms, chosen for their generality and practical interest (all of which were implemented). However, other methods can be derived from these concepts. A good aproach is to specialize the optimization algorithm to take advantage of the network topology, such as star networks, satellite networks and broadcast networks. Query processing in a distributed system requeries the transmission of data between computers in a network. The arrangement of data transmissions and local data processing is known as a distribution strategy for a query. Two cost measures, response time and total time are used to judge the quality of a distribution strategy. Simple algorithms re presented that derive distribution stategies which have minimal response time minimal total time, for a spcial class of queries. These optimal algorithms are used as a basis to develop a general query processing algorithm. Distributed query examples are presented and the complexity of the general algorithm is analyzed. The integration of a query processing subsystem into a distributed database management system is discussed.

##### Açıklama

Tez (Yüksek Lisans) -- İstanbul Teknik Üniversitesi, Fen Bilimleri Enstitüsü, 1995

##### Anahtar kelimeler

Dağıtık veri tabanı sistemleri,
Optimizasyon,
Sorgulama dili,
Distributed database systems,
Optimization,
Query lanquage