Kısmi eşleme erişimi ve buna uygun dosya yapıları

Oğuzhan, Öztaş
Süreli Yayın başlığı
Süreli Yayın ISSN
Cilt Başlığı
Fen Bilimleri Enstitüsü
Institute of Science and Technology
Kısmi-Esleme erişini çeşitli anaçlar için ihtiyaç duyulan bir erisin yöntemidir, özellikle ofis otomasyonunda çok kullanılır. Bu erisin üzerine çeşitli yöntemler vardır. Bunlardan bazılarını su şekilde sıralayabiliriz ; 1.- indekslenmis Tanımlayıcı Dosyalar 2.- Tersine Çevrilmiş Dosyalar 3.- tnza Dosyaları 4.- Genisletilebilir Hash (Ext. Hashing) 5.- Tanımlayıcı ve Hash Biz burada bu yapılardan her birini genel hatlarıyla anlatıp, örneklerle daha iyi anlaşılmalarını sağladıktan sonra bu teknikleri genel ve basit bir yapı üzerinde birbiri ile mukayese ederek aralarında en iyi sonuç vereni saptamaya çalıştık. Sinülasyon deneyleri kısmındaki tablo değerleri ve grafik üzerindeki gösterinden de anlaşılacağı gibi indekslenmiş tanımlayıcı dosya yapısı bizce diğer yapılar içerisinde en iyi performansı sağlamıştır.
Partial-match retrieval is a problem involving searching on secondary keys. A number of approaches have been suggested. One way to tackle the problem is to search the entire file in response to each query, using a string or a regular expression based pattern matching algorithm. Although regular expressions provide considerable flexibility in posing queries, particularly those concerned with text, the requirement of having to search through the entire file to answer each query limits this technique to files that are not too large, unless queries can be batched or special hardware is available. A number of alternative schemes have been suggested to reduce the amount of information that needs to be searched to answer a query. One such class of schemes associates with each record a short representative code word. Searches can then be performed over the shorter code word file to determine which records are potential answers to a given query (or, equivalently, to exclude records that cannot possible be answers). Another common approach is to construct an inverted index list for each field. A query can be answered by intersecting the index lists for each specified field. Some of the methods for partial-match retrieval are shown in the following case: 1.- Extendible Hashing 2.- Indexed Descriptor Files 3.- Signature Files - vi i - 1. -Review of Extendible Hashing and Its Appication to Partial Hatch Retrieval The file is contained in a number of pages ( or buckets). There are two kinds of pages. Leaf pages contain the records themselves and directory pages contain the directory. The directory is organized as a linear array of m=2'1 entries where d is called the global depth of the directory. m is always a power of two and changes (doubled or halved) in response to the changes in the volume of the file. Each entry in the directory contains a pointer to a leaf page. Each leaf page has a header that contains its local depth d*. For a leaf page with local depth d'» there are 2lA~A' ' directory entries pointing to it. Fig. 1 shows a file organization when d=3. Fig. 1-- An extendible hashing file with global depth=3. - viii - Associated with the file is a hashing function h:KEY - >D where KEY is the domain of the primary key of the file, D={0, 1, 2, 3,..., 2<1",*>c-l >, and 2*,""c is the maximum allowable size of the directory. To locate or insert a record with primary key value V, we calculate the pseudokey V as the first d bits of h(V). V is then used as an index into the directory. The indexed directory entry contains a pointer to the leaf page where the record will be found or inserted. When inserting a record into a full leaf page with local depth d', d'< d, we split the bucket into two buckets, distribute the records between the two buckets according to their pseudokey values, and then change the pointers in the appropriate directory entries. However, when inserting a record into a full leaf page with local depth d'=d, we double the size of the directory; global depth d is then increased by one. Each directory entry becomes two complement directory entries with identical pointer values. Now the overflow leaf page can be split similarly as before. Lloyd and Ramamohanarao extended this single-key file organization to handle partial-match retrievals in the following way. Let n be the number of fields in the record structure of the file. With each field of the record structure, associate a hashing function ht:Ft - >B4, 1=1, 2,..., n where F* is the key space of the ith field and Bi is the pseudokey space of a certain length nu. The index of a record Cri, ra,..., r") into the directory is assembled using the pseudokeys hı(rı), h2(r2),..., h"(r") and the choice vector (..., i3, iz,ii) If i4=p in the choice vector, then the jth (rightmost) bit in the index should come from the pth pseudokey h_,(r»). The choice vector has important consequences on the performance of the file. It dynamically adapts to the current state of the system according to the distribution of occurrences of each field in the record structure in the file operations performed thus far. This is done by deferring the calculation of ij until the directory size grows to 2j. The procedure, called Minimal Marginal Increase (MMI), used to calculate the choice vector, and the theory behind it are discussed in C73. - ix - The operations of insertion, searching, bucket splitting, directory doubling, etc., are then similar to the single-key case. Example 1: Let the global depth = 5, record R = (vx, v2, v3, v*) and the choice vector = (..., 4, 2, 2, 3, 2). Let h(Vi) = a"... aiao, h(vz) = b»... bzbib©, h(v3) = cm... CiCo, and h(v«) = dm... dido uhere am... aia0, bm... bib©, Cm... CiCo, and dm... dido are binary numbers with a sufficiently large number m of digits. Then the address of record E = d©babıc0bo. 2.- Indexed Descriptor Files Pfaltz et al. applied the technique of disjoint coding to a file structure called the indexed descriptor file. The idea behind the technique is to speed up retrieval of records by encoding information about the records in an efficient manner. Basically, the infor mation in a single record is represented by a descriptor word and the descriptor words of all records stored in one bucket (on disk) are bitwise OR'd together to form a bucket descriptor D». The directory of an indexed descriptor file is just the collection of all bucket descriptors. A descriptor D, of a record r = (ri, r2|...» r*) where râ?D4 for l<i. There are k functions H4:D«. - >?1, 2,...,m4} for ISiik. In a descriptor D«- of a record r = (ri, ra,..., r», ), the H±(r4)th bit in F* is set to 1 (the remaining w4-l bits are set to 0). There are exactly k bits set to 1 in a record descriptor.</i
Tez (Yüksek Lisans) -- İstanbul Teknik Üniversitesi, Fen Bilimleri Enstitüsü, 1992
Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 1992
Anahtar kelimeler
Büro otomasyonu, Erişim yöntemleri, Kısmi eşleme erişimi, Office automation, Access methods, Partial match retrieval