Adapting the tree structure to the Lempel-Ziv data compression algorithm

Date
1993
Authors
Ulus, Tolga
Journal Title
Journal ISSN
Volume Title
Publisher
Fen Bilimleri Enstitüsü
Abstract
In this study, the binary tree was chosen as the data structure used in the Lempel-Ziv algorithm, one of the data compression algorithms. Thanks to this data structure, the algorithm gains considerable speed. Data compression offers great benefits both in data storage and in communication: more information can be stored on the same medium, or more information can be sent over the same bandwidth in less time. Texts contain various types of redundancy, such as character distribution, character repetition and frequently used patterns. Compressing a text amounts to removing a certain proportion of these redundancies. Existing compression methods target one or several of these redundancies; ideal compression would remove all of the redundancy in the text completely. Over the past half century, various compression methods have been proposed. One group consists of methods prepared specifically for texts with particular characteristics. Another group is called static: static methods make two passes over the text, building the codes in the first pass and using them to produce the compressed text in the second. Dynamic methods, by contrast, make only a single pass over the text, building or modifying the coding and producing the compressed text at the same time. Because they adapt to the changing characteristics of the text, they achieve higher compression than static methods. Among the dynamic methods, Lempel-Ziv requires a table of character strings to be kept and maintained. When this table is kept as a sequential array, the compression algorithm runs rather slowly; when the table is kept as a binary tree, compression is much faster. For expansion the situation is reversed: keeping the table as a sequential array gives faster expansion.
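As an illustration of the data-structure choice described above, the following minimal sketch (in Python, with purely illustrative names; the thesis's actual node layout and key scheme are not reproduced here) keeps string entries in a binary search tree. A reasonably balanced tree answers each lookup in a number of comparisons proportional to the logarithm of the table size, whereas a sequential array must be scanned entry by entry.

    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    def insert(root, key, value):
        # Standard binary-search-tree insertion on string keys.
        if root is None:
            return Node(key, value)
        if key < root.key:
            root.left = insert(root.left, key, value)
        elif key > root.key:
            root.right = insert(root.right, key, value)
        return root

    def lookup(root, key):
        # Each comparison discards one whole subtree, so a balanced tree is
        # searched in logarithmic time; a sequential array is scanned linearly.
        while root is not None:
            if key == root.key:
                return root.value
            root = root.left if key < root.key else root.right
        return None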
In this study, a variety of data compression techniques are presented. The study concentrates especially on the Lempel-Ziv algorithm, one of the latest developments in the area. A binary tree structure is proposed that speeds up the algorithm over the classical array structure.

Data compression is the process of transforming a series of bytes into a new string that contains the same information but whose length is as small as possible. It has important applications in the areas of information storage and data communication. Most computer applications need a large amount of data storage, and computer communication networks transfer large volumes of data over communication lines. Compressing the data to be stored or transferred reduces the cost of storage and transmission. When files are stored in compressed form, the capacity of the storage medium is approximately doubled; similarly, when the amount of data to be transmitted is reduced, the capacity of the communication channel is increased. Thus, data compression provides an economical and indispensable option for storing and transmitting information. Data compression may be achieved in either software or hardware. In hardware, a compressor is a device that reads a series of bytes of data and converts it into a shorter string, and a decompressor is a device that converts the compressed data back to its original form. When data compression is done in software, a compression algorithm does the former and an expansion or decompression algorithm does the latter.

The principle that enables data compression is that the same information is retained in the text after some bytes are replaced with shorter representations. In other words, data compression is achieved by removing the redundancy in the text. Several types of redundancy are found in texts; the categories are not independent, but overlap to some extent. The major redundancy types are character distribution, character repetition, high-usage patterns and positional redundancy. In typical texts, some bytes occur more frequently than others, and some bytes may never be used. Thus, a few bits per byte may be saved on average by using fewer bits for the most frequently used bytes and more bits for the least frequently used ones. In some files, such as texts and business files, long repetitions of a single character occur, and these can be encoded more compactly than by simply repeating the character symbol; examples are the space characters in text files, strings of zeros in business files and, especially, graphic files. Certain sequences of characters reappear with relatively high frequency and can therefore be represented with fewer bits for a saving of space; for instance, a period followed by two spaces appears very frequently in texts and could be encoded with fewer bits. If certain characters appear consistently at a predictable place in each block of data, then they are at least partially redundant: in database files, certain record fields may almost always have the same entry, such as "none", in them. Text, on the other hand, has no positional redundancy.

In general, data compression techniques can be divided into two categories, the nonreversible and the reversible. In nonreversible techniques, the size of the physical representation of the data is reduced while only a subset of the information is preserved.
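As a hedged illustration of character-distribution redundancy, the short sketch below (illustrative code, not part of the thesis) computes the zeroth-order entropy of a byte string: the average number of bits per symbol an ideal character-level code could approach, compared with the 8 bits of a fixed-length representation.

    import math
    from collections import Counter

    def bits_per_symbol(data: bytes) -> float:
        # Zeroth-order entropy: a lower bound on the average codeword length
        # (in bits per symbol) of any code that encodes each byte separately.
        counts = Counter(data)
        total = len(data)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    sample = b"AAAAABBBCC"              # a deliberately skewed distribution
    print(bits_per_symbol(sample))      # about 1.49 bits per symbol instead of 8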
For example, in some data sets "leading zeros" or "trailing blanks" may be considered irrelevant information. Note that these techniques are, by definition, dependent on the semantics of the data. In reversible techniques, all of the information is considered to be relevant, and the compression/decompression process recovers the original data. Reversible procedures can be further divided into two groups, semantic-independent and semantic-dependent techniques. Semantic-independent techniques can be used on any data with varying degrees of effectiveness; they do not use any information regarding the information content of the data. Semantic-dependent techniques, on the other hand, depend on the context and semantics of the data to provide for redundancy reduction.

A code is a mapping of source messages into codewords. Codes can be categorized as block-block, block-variable, variable-block or variable-variable, where block-block indicates that the source messages and codewords are of fixed length and variable-variable codes map variable-length source messages into variable-length codewords. The oldest and most widely used codes, ASCII and EBCDIC, are examples of block-block codes, mapping an alphabet of 64 (or 256) single characters onto 6-bit (or 8-bit) codewords. These are not discussed further, since they do not provide compression; the other classes of codes are of interest here.

Data compression schemes are also classified as either static or dynamic. A static method is one in which the mapping from the set of messages to the set of codewords is fixed before transmission begins, so that a given message is represented by the same codeword every time it appears in the message ensemble. In these methods, the mapping is based on the probabilities with which the source messages appear in the source ensemble: messages that appear frequently are represented by short codewords, while messages with smaller probabilities map to longer codewords. These probabilities are determined before transmission begins. A code is dynamic if the mapping from the set of messages to the set of codewords changes over time. These methods compute an approximation to the probabilities of occurrence as the ensemble is being transmitted, and the assignment of codewords to messages is based on the relative frequencies of occurrence at each point in time. A message may be represented by a short codeword early in transmission because it occurs frequently at the beginning of the ensemble, even though its probability of occurrence over the total ensemble is low. Later, when more probable messages begin to occur with higher frequency, the short codeword will be mapped to one of those higher-probability messages, and the earlier message will be mapped to a longer codeword. Dynamic codes are also referred to in the literature as adaptive, in that they adapt to changes in ensemble characteristics over time. All of the adaptive methods are one-pass methods; only one scan of the ensemble is required. Static methods require two passes: one pass to compute probabilities and determine the mapping, and a second pass for transmission. Thus, as long as the encoding and decoding times of an adaptive method are not substantially greater than those of a static method, the fact that an initial scan is not needed implies a speed improvement in the adaptive case. In addition, the mapping determined in the first pass of a static coding method must be transmitted along with the compressed ensemble by the encoder to the decoder.
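The toy sketch below (illustrative only; integer ranks stand in for real variable-length codewords, with rank 0 corresponding to the shortest) makes the two passes of a static scheme explicit.

    from collections import Counter

    def static_encode(message):
        # Pass 1: determine symbol frequencies over the whole ensemble and fix
        # the mapping; more frequent symbols receive smaller ranks (standing in
        # for shorter codewords).
        ranked = [sym for sym, _ in Counter(message).most_common()]
        codebook = {sym: rank for rank, sym in enumerate(ranked)}
        # Pass 2: encode with the now-fixed mapping; the codebook itself must
        # also be transmitted to the decoder.
        return codebook, [codebook[sym] for sym in message]

    # An adaptive method would instead make a single pass, updating its running
    # counts (and therefore the mapping) after every symbol, with the decoder
    # repeating exactly the same updates to stay synchronized.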
There are two dimensions along which data compression schemes may be measured: algorithm complexity and amount of compression. When data compression is used in a data transmission application, the goal is speed. Speed of transmission depends on the number of bits sent, the time required for the encoder to generate the coded message, and the time required for the decoder to recover the original ensemble. In a data storage application, although the degree of compression is the primary concern, it is nonetheless necessary that the algorithm be efficient in order for the scheme to be practical. For a static scheme, there are three algorithms to analyse: the map construction algorithm, the encoding algorithm, and the decoding algorithm. For a dynamic method, there are just two: the encoding algorithm and the decoding algorithm. Several common measures of compression have been suggested, among them average message length and compression ratio. Average message length is simply the average length of a coded message. Compression ratio is defined in several ways; the most common is the ratio of the length of the coded message to the length of the original ensemble.

Semantic-dependent data compression techniques are designed to respond to specific types of applications, one of which is image representation and processing. One of these methods is run-length encoding, where successive occurrences of a symbol are replaced by a single symbol along with its frequency. Another is difference mapping, in which the image is represented as an array of differences in brightness (or color) between adjacent pixels rather than the brightness values themselves. Semantic-dependent data compression is also of interest in business data processing: runs of zeros and blanks can be compressed using run-length encoding, and other methods include dictionary substitution, selection of common phrases, selection of common suffixes and prefixes, and selection of common character pairs.

Semantic-independent data compression techniques are based on the probabilities of messages in the ensemble. Some are static codings: Shannon-Fano coding, static Huffman coding, universal coding and arithmetic coding. The others are dynamic: dynamic Huffman coding, Lempel-Ziv coding and BSTW coding. Shannon-Fano coding has the advantage of simplicity. The code is constructed as follows: the source messages and their probabilities are listed in order of nonincreasing probability. This list is then divided so as to form two groups of as nearly equal total probabilities as possible. Each message in the first group receives 0 as the first digit of its codeword; the messages in the second group have codewords beginning with 1. Each of these groups is then divided according to the same criterion, and additional code digits are appended. The process continues until each subset contains only one message. Huffman coding takes the probabilities of the source letters and constructs a full binary tree. Initially, there is a set of singleton trees, each having the probability of a source letter. At each step, the method merges the two trees of smallest probability into one tree whose probability is the sum of the two and whose root has the two merged trees as children. The process continues until a single tree, whose probability is 1, is left. Codewords are assigned to the messages by following the path from the root to each leaf, for example with a right child meaning a 0 and a left child meaning a 1.
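A minimal Huffman construction along the lines just described might look as follows (a sketch, not the thesis's code; it labels the left branch 0 and the right branch 1, the mirror image of the convention above, which yields an equivalent code):

    import heapq
    from itertools import count

    def huffman_codes(probabilities):
        tie = count()                     # tie-breaker so heapq never compares trees
        heap = [(p, next(tie), sym) for sym, p in probabilities.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)   # the two smallest probabilities ...
            p2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(tie), (t1, t2)))  # ... merged into one tree
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):       # internal node: descend into both children
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"   # guard for a one-symbol alphabet
        walk(heap[0][2], "")
        return codes

    print(huffman_codes({"a": 0.5, "b": 0.3, "c": 0.2}))
    # {'a': '0', 'c': '10', 'b': '11'}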
The advantage of universal codes is that they do not require the exact probabilities with which the source messages appear. Universality is achieved by mapping messages in order of decreasing probability to codewords in order of increasing length. Another advantage of universal codes is that the codeword sets are fixed: it is not necessary to compute a codeword set based on the statistics of an ensemble. Since the ranking of the source messages is the essential parameter in universal coding, a universal code may be thought of as representing an enumeration of the source messages or as representing the integers. The most commonly used universal codes are the Elias codes and the Fibonacci codes. Arithmetic coding represents a source ensemble by an interval between 0 and 1 on the real number line. Each symbol of the ensemble narrows this interval, and as the interval becomes smaller, the number of bits needed to specify it grows. A high-probability message narrows the interval less than a low-probability message, so that high-probability messages contribute fewer bits to the coded message.

Adaptive Huffman coding determines the mapping from source messages to codewords on the basis of a running estimate of the source message probabilities. The code is adaptive and requires only one pass over the data. The encoder learns the characteristics of the source, and the decoder must learn along with the encoder by continually updating the Huffman tree in order to stay in synchronization with the encoder. BSTW coding uses a list as an auxiliary data structure and gives shorter encodings to words near the front of this list. The list is updated at each transmission using the move-to-front rule: when a message is to be transmitted, the sender transmits its current position on the list and then updates the list by moving the message to position 1 and shifting each of the other messages down one position. The receiver alters its list in the same way. For the transmission of positions, universal codes may be used.

Lempel-Ziv coding defines the set of source messages as the algorithm executes. The Lempel-Ziv algorithm parses the source ensemble into a collection of segments of gradually increasing length. At each step, the longest prefix of the remaining ensemble that matches an existing table entry is parsed off, along with the character following this prefix in the ensemble. The new source message, which is the concatenation of the existing table entry and that character, is added to the code table. If a tree structure is used in the implementation of the table, the algorithm speeds up and the space needed by the algorithm decreases.
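A hedged sketch of this parsing step is given below. A Python dictionary stands in for the string table, whereas the thesis's point is precisely how that table is organized (sequential array versus binary tree); the names and the end-of-input handling are illustrative.

    def lz_compress(text):
        table = {"": 0}            # code table: string -> index
        output = []                # pairs (index of longest matching prefix, next character)
        prefix = ""
        for ch in text:
            if prefix + ch in table:
                prefix += ch       # keep extending the longest match
            else:
                output.append((table[prefix], ch))
                table[prefix + ch] = len(table)   # new entry = matched prefix + following character
                prefix = ""
        if prefix:                 # flush whatever match is still pending at end of input
            output.append((table[prefix[:-1]], prefix[-1]))
        return output

    print(lz_compress("abababab"))
    # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (0, 'b')]

Each emitted pair decodes to the table string at the given index followed by the extra character, so the decoder can rebuild the same table as it reads the pairs.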
Description
Thesis (M.Sc.) -- İstanbul Teknik Üniversitesi, Fen Bilimleri Enstitüsü, 1993
Keywords
Engineering Sciences, Algorithms, Tree structure, Lempel-Ziv coding, Data compression, Coding
Citation