Düşük Bit Hızlarında Konuşma Kodlama Ve Uygulamaları

Date
1998
Authors
Aşkın, Tarık
Journal Title
Journal ISSN
Volume Title
Publisher
Fen Bilimleri Enstitüsü
Institute of Science and Technology
Abstract
Today, ever more capacity is expected from transmission media. However large the bandwidths made available by technological advances, the bandwidths of the signals to be transmitted grow just as quickly. The need to increase the channel capacity of digital communication has therefore made the efficient use of the communication channel one of the principal research topics, creating a need for efficient coders for the transmission and storage of digitized speech signals (for example in mobile communication, satellite communication, and voice recording and messaging systems). Various techniques are now applied to bring the rate down to 4.8 kbit/s and below without degrading speech quality. Coding at low bit rates becomes possible by removing the redundancies present in the speech signal. In addition, the human auditory system is not equally sensitive to distortion at all frequencies and has a limited dynamic range. The speech coding techniques in use exploit properties of both the production and the perception of speech in order to lower the bit rate. The implementation of all these complex algorithms has been made possible by rapid advances in VLSI/DSP technology, which allow these methods to run in real time. When the CELP method was first proposed by Schroeder and Atal in 1985, its implementation required a processing capacity of 333 MIPS; that is, on the Cray-1 computer of the day, processing 1 second of speech took 125 seconds of CPU time. Starting from this point, this thesis investigates low bit rate speech coding techniques and their implementation on digital signal processors (DSPs); building on these techniques, DSP-based implementations were produced for bit rates around 4.8 kbit/s. The goal of the work was not to reach still lower rates but, at comparable rates, to use the resources (the capabilities of the DSP) more efficiently and to design higher-quality coders. To this end, the real-time requirements for implementing CELP coders on a DSP, and fixed-point versus floating-point approaches, were examined. In a typical CELP speech coding algorithm, most of the processing time is spent on the closed-loop LTP and codebook searches, whose complexity arises from the convolution, cross-correlation and autocorrelation computations they involve. Simplification methods are therefore proposed to carry out these operations more efficiently. The analysis-by-synthesis loop of the coder was also examined, and methods were investigated for updating the short-term synthesis filter parameters once the excitation parameters have been determined. It is shown that updating the synthesis filter parameters, as in the bandwidth expansion and interpolation approaches, yields significant improvements in performance. Finally, a CELP coder incorporating all of the proposed improvements was implemented in the high-level language C and ported to a fixed-point digital signal processor. The results are intended for use in the voice mail module of a digital private branch exchange (PBX).
In an age where the word gigabit has become common when talking about channel or disk capacity, the aim of compression, or low bit rate speech coding, is not obvious to everyone; it is justified by the myriad of new applications demanding fewer and fewer bits per second and by the rapidly expanding speech corpora to be stored. Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract. The vocal tract extends from the opening in the vocal cords (called the glottis) to the mouth, and in an average man is about 17 cm long. It introduces short-term correlations (of the order of 1 ms) into the speech signal, and can be thought of as a filter with broad resonances called formants. The frequencies of these formants are controlled by varying the shape of the tract, for example by moving the position of the tongue. An important part of many speech codecs is the modelling of the vocal tract as a short-term filter. As the shape of the vocal tract varies relatively slowly, the transfer function of its modelling filter needs to be updated only relatively infrequently (typically every 20 ms or so). The vocal tract filter is excited by air forced into it through the vocal cords. Speech sounds can be broken into three classes depending on their mode of excitation:

- Voiced sounds are produced when the vocal cords vibrate open and closed, interrupting the flow of air from the lungs to the vocal tract and producing quasi-periodic pulses of air as the excitation. The rate of the opening and closing gives the pitch of the sound. This can be adjusted by varying the shape of, and the tension in, the vocal cords, and the pressure of the air behind them. Voiced sounds show a high degree of periodicity at the pitch period, which is typically between 2 and 20 ms.

- Unvoiced sounds result when the excitation is a noise-like turbulence produced by forcing air at high velocities through a constriction in the vocal tract while the glottis is held open. Such sounds show little long-term periodicity, although short-term correlations due to the vocal tract are still present.

- Plosive sounds result when a complete closure is made in the vocal tract, and air pressure is built up behind this closure and released suddenly.

Some sounds cannot be considered to fall into any one of the three classes above, but are a mixture. For example, voiced fricatives result when both vocal cord vibration and a constriction in the vocal tract are present. Although there are many possible speech sounds which can be produced, the shape of the vocal tract and its mode of excitation change relatively slowly, and so speech can be considered quasi-stationary over short periods of time (of the order of 20 ms). Speech signals show a high degree of predictability, due sometimes to the quasi-periodic vibrations of the vocal cords and also due to the resonances of the vocal tract. Speech coders attempt to exploit this predictability in order to reduce the data rate necessary for good quality voice transmission.
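To make the short-term vocal tract filter concrete, the following is a minimal sketch of fitting an all-pole (LPC) model to a single frame using the autocorrelation method and the Levinson-Durbin recursion. The frame length, model order and toy input are common textbook choices, not parameters taken from the thesis.

```c
/* Sketch: fit an all-pole (LPC) model of the vocal tract to one 20 ms
 * speech frame via autocorrelation + Levinson-Durbin.
 * Order 10 and 160-sample frames (8 kHz) are illustrative choices. */
#include <stdio.h>

#define FRAME 160   /* 20 ms at 8 kHz  */
#define ORDER 10    /* LPC model order */

/* Autocorrelation r[0..ORDER] of the frame s[0..FRAME-1]. */
static void autocorr(const double *s, double *r)
{
    for (int lag = 0; lag <= ORDER; lag++) {
        r[lag] = 0.0;
        for (int n = lag; n < FRAME; n++)
            r[lag] += s[n] * s[n - lag];
    }
}

/* Levinson-Durbin recursion: solves for predictor coefficients a[1..ORDER]
 * in A(z) = 1 - sum a_k z^-k; returns the final residual energy. */
static double levinson(const double *r, double *a)
{
    double err = r[0], tmp[ORDER + 1] = {0};
    for (int i = 1; i <= ORDER; i++) {
        double k = r[i];
        for (int j = 1; j < i; j++)
            k -= tmp[j] * r[i - j];
        k /= err;
        a[i] = k;
        for (int j = 1; j < i; j++)
            a[j] = tmp[j] - k * tmp[i - j];
        for (int j = 1; j <= i; j++)
            tmp[j] = a[j];
        err *= (1.0 - k * k);        /* prediction error shrinks each order */
    }
    return err;
}

int main(void)
{
    double s[FRAME], r[ORDER + 1], a[ORDER + 1] = {0};
    for (int n = 0; n < FRAME; n++)       /* toy input: periodic ramp */
        s[n] = (n % 40) / 40.0;
    autocorr(s, r);
    double e = levinson(r, a);
    printf("residual energy %.4f, a1 = %.4f\n", e, a[1]);
    return 0;
}
```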
Commonly Used Speech Codecs

Coding algorithms seek to minimize the bit rate in the digital representation of a signal without an objectionable loss of signal quality in the process. High quality is attained at low bit rates by exploiting signal redundancy as well as the knowledge that certain types of coding distortion are imperceptible because they are masked by the signal. The models of signal redundancy and distortion masking are becoming increasingly sophisticated, leading to continuing improvements in the quality of low bit rate signals. The main speech coding techniques used today, and those which may be used in the future, are briefly discussed here. To simplify the description of speech codecs they are often broadly divided into three classes: waveform codecs, source codecs and hybrid codecs. Typically waveform codecs are used at high bit rates and give very good quality speech. Source codecs operate at very low bit rates but tend to produce speech which sounds synthetic. Hybrid codecs use techniques from both source and waveform coding, and give good quality speech at intermediate bit rates.

Waveform Codecs:

Waveform codecs attempt, without using any knowledge of how the signal to be coded was generated, to produce a reconstructed signal whose waveform is as close as possible to the original. This means that in theory they should be signal independent and work well with non-speech signals. Generally they are low complexity codecs which produce high quality speech at rates above about 16 kbits/s. When the data rate is lowered below this level the reconstructed speech quality degrades rapidly. The simplest form of waveform coding is Pulse Code Modulation (PCM), which merely involves sampling and quantizing the input waveform. Narrow-band speech is typically band-limited to 4 kHz and sampled at 8 kHz. If linear quantization is used then around twelve bits per sample are needed to give good quality speech, giving a bit rate of 96 kbits/s. This bit rate can be reduced by using non-uniform quantization of the samples. In speech coding an approximation to a logarithmic quantizer is often used. Such quantizers give a signal-to-noise ratio which is almost constant over a wide range of input levels, and at a rate of eight bits/sample (or 64 kbits/s) give a reconstructed signal which is almost indistinguishable from the original. Such logarithmic quantizers were standardised in the 1960s, and are still widely used today. In America μ-law companding is the standard, while in Europe the slightly different A-law companding is used. They have the advantages of low complexity and delay with high quality reproduced speech, but require a relatively high bit rate and have a high susceptibility to channel errors.

A commonly used technique in speech coding is to attempt to predict the value of the next sample from the previous samples. This is possible because of the correlations present in speech samples due to the effects of the vocal tract and the vibrations of the vocal cords, as discussed earlier. If the predictions are effective then the error signal between the predicted and the actual speech samples will have a lower variance than the original speech samples, and so this error signal can be quantized with fewer bits than the original speech signal. This is the basis of Differential Pulse Code Modulation (DPCM) schemes: they quantize the difference between the original and predicted signals. The results from such codecs can be improved if the predictor and quantizer are made adaptive, so that they change to match the characteristics of the speech being coded. This leads to Adaptive Differential PCM (ADPCM) codecs. In the mid 1980s the CCITT standardised an ADPCM codec operating at 32 kbits/s which gave speech quality very similar to the 64 kbits/s PCM codecs.
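As an illustration of the logarithmic quantization described above, the sketch below implements the continuous μ-law companding curve for 8-bit coding. The real G.711 standard uses a segmented, piecewise-linear approximation of this curve, which is omitted here for brevity.

```c
/* Sketch of logarithmic (mu-law) companding as used for 64 kbit/s PCM:
 * compress a sample to 8 bits, then expand it back. */
#include <math.h>
#include <stdio.h>

#define MU 255.0

/* Compress x in [-1,1] to an 8-bit code (1 sign bit + 7 magnitude bits). */
static unsigned char mulaw_encode(double x)
{
    double mag = log(1.0 + MU * fabs(x)) / log(1.0 + MU);   /* in [0,1] */
    int code = (int)(mag * 127.0 + 0.5);
    return (unsigned char)((x < 0 ? 0x80 : 0x00) | code);
}

/* Expand an 8-bit code back to [-1,1]. */
static double mulaw_decode(unsigned char c)
{
    double mag = (c & 0x7F) / 127.0;
    double x = (pow(1.0 + MU, mag) - 1.0) / MU;
    return (c & 0x80) ? -x : x;
}

int main(void)
{
    for (double x = -1.0; x <= 1.0; x += 0.25)
        printf("%+5.2f -> %3u -> %+6.3f\n",
               x, mulaw_encode(x), mulaw_decode(mulaw_encode(x)));
    return 0;
}
```

The near-constant signal-to-noise ratio mentioned in the text follows from the logarithm: quantization steps are fine for small samples and coarse for large ones, so the relative error stays roughly constant across input levels.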
Later, ADPCM codecs operating at 16, 24 and 40 kbits/s were also standardised. The waveform codecs described above all code speech with an entirely time domain approach. Frequency domain approaches are also possible, and have certain advantages. For example, in Sub-Band Coding (SBC) the input speech is split into a number of frequency bands, or sub-bands, and each is coded independently using, for example, an ADPCM-like coder. At the receiver the sub-band signals are decoded and recombined to give the reconstructed speech signal. The advantage of doing this comes from the fact that the noise in each sub-band depends only on the coding used in that sub-band. More bits can therefore be allocated to perceptually important sub-bands, so that the noise in these frequency regions is low, while in other sub-bands a higher coding noise can be tolerated because noise at these frequencies is perceptually less important. Adaptive bit allocation schemes may be used to exploit these ideas further. Sub-band codecs tend to produce communications to toll quality speech in the range 16-32 kbits/s. Due to the filtering necessary to split the speech into sub-bands they are more complex than simple DPCM coders, and introduce more coding delay. However, the complexity and delay are still relatively low when compared to most hybrid codecs. Another frequency domain waveform coding technique is Adaptive Transform Coding (ATC), which uses a fast transformation (such as the discrete cosine transform) to split blocks of the speech signal into a large number of frequency bands. The number of bits used to code each transform coefficient is adapted depending on the spectral properties of the speech, and toll quality reproduced speech can be achieved at bit rates as low as 16 kbits/s.

Source Codecs:

Source coders operate using a model of how the source was generated, and attempt to extract from the signal being coded the parameters of that model. It is these model parameters which are transmitted to the decoder. Source coders for speech are called vocoders, and work as follows. The vocal tract is represented as a time-varying filter and is excited with either a white noise source, for unvoiced speech segments, or a train of pulses separated by the pitch period, for voiced speech. The information which must be sent to the decoder is therefore the filter specification, a voiced/unvoiced flag, the necessary variance of the excitation signal, and the pitch period for voiced speech. This is updated every 10-20 ms to follow the non-stationary nature of speech. The model parameters can be determined by the encoder in a number of different ways, using either time or frequency domain techniques, and the information can be coded for transmission in various ways. Vocoders tend to operate at around 2.4 kbits/s or below, and produce speech which, although intelligible, is far from natural sounding. Increasing the bit rate much beyond 2.4 kbits/s is not worthwhile because of the inbuilt limitation in the coder's performance due to the simplified model of speech production used. The main use of vocoders has been in military applications, where natural sounding speech is not as important as a very low bit rate that allows heavy protection and encryption.
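The two-state excitation model behind such vocoders can be sketched in a few lines: a pulse train at the pitch period for voiced frames, white noise for unvoiced frames. The frame length, pitch and gain values below are illustrative, not taken from any particular standard.

```c
/* Toy sketch of the two-state vocoder excitation described above. */
#include <stdlib.h>
#include <stdio.h>

#define FRAME 160   /* 20 ms at 8 kHz */

/* Fill e[0..FRAME-1] with the excitation for one frame. */
static void excitation(double *e, int voiced, int pitch, double gain)
{
    for (int n = 0; n < FRAME; n++) {
        if (voiced)
            e[n] = (n % pitch == 0) ? gain : 0.0;          /* pulse train */
        else
            e[n] = gain * (2.0 * rand() / RAND_MAX - 1.0); /* white noise */
    }
}

int main(void)
{
    double e[FRAME];
    excitation(e, 1, 50, 1.0);   /* voiced frame, 50-sample pitch (160 Hz) */
    printf("e[0]=%.1f e[50]=%.1f e[51]=%.1f\n", e[0], e[50], e[51]);
    excitation(e, 0, 0, 0.5);    /* unvoiced frame */
    return 0;
}
```

The decoder passes this excitation through the transmitted vocal tract filter; the crudeness of the two-state choice is precisely the quality ceiling the text attributes to vocoders.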
Hybrid Codecs:

Hybrid codecs attempt to fill the gap between waveform and source codecs. As described above, waveform coders are capable of providing good quality speech at bit rates down to about 16 kbits/s, but are of limited use at rates below this. Vocoders, on the other hand, can provide intelligible speech at 2.4 kbits/s and below, but cannot provide natural sounding speech at any bit rate. Although other forms of hybrid codecs exist, the most successful and commonly used are time domain Analysis-by-Synthesis (AbS) codecs. Such coders use the same linear prediction filter model of the vocal tract as found in LPC vocoders. However, instead of applying a simple two-state voiced/unvoiced model to find the necessary input to this filter, the excitation signal is chosen by attempting to match the reconstructed speech waveform as closely as possible to the original speech waveform. AbS codecs were first introduced in 1982 by Atal and Remde with what was to become known as the Multi-Pulse Excited (MPE) codec. Later the Regular-Pulse Excited (RPE) and the Code-Excited Linear Predictive (CELP) codecs were introduced. These coders are discussed briefly here.

AbS codecs work by splitting the input speech to be coded into frames, typically about 20 ms long. For each frame, parameters are determined for a synthesis filter, and then the excitation to this filter is determined by finding the excitation signal which, when passed through the given synthesis filter, minimises the error between the input speech and the reconstructed speech. Hence the name Analysis-by-Synthesis: the encoder analyses the input speech by synthesising many different approximations to it. Finally, for each frame the encoder transmits information representing the synthesis filter parameters and the excitation to the decoder, where the given excitation is passed through the synthesis filter to give the reconstructed speech. The synthesis filter is usually an all-pole, short-term, linear filter intended to model the correlations introduced into the speech by the action of the vocal tract. It may also include a pitch filter to model the long-term periodicities present in voiced speech. Alternatively, these long-term periodicities may be exploited by including an adaptive codebook in the excitation generator, so that the excitation signal includes a component at the estimated pitch period. Generally MPE and RPE codecs will work without a pitch filter, although their performance will be improved if one is included. For CELP codecs, however, a pitch filter is extremely important, for reasons discussed below.

The error weighting block is used to shape the spectrum of the error signal in order to reduce its subjective loudness. This is possible because the error signal in frequency regions where the speech has high energy will be at least partially masked by the speech. The weighting filter emphasises the noise in the frequency regions where the speech content is low, so minimising the weighted error concentrates the energy of the error signal in the frequency regions where the speech has high energy. There the error is at least partially masked by the speech, and its subjective importance is reduced. Such weighting is found to produce a significant improvement in the subjective quality of the reconstructed speech for AbS codecs.
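One common form of this weighting filter in the AbS literature is W(z) = A(z/g1)/A(z/g2), where A(z) = 1 - sum a_i z^-i is the short-term prediction error filter; moving the zeros and poles inward by g1 and g2 broadens the formant peaks so that more error is tolerated under them. The sketch below applies such a filter directly; the g values are illustrative, not taken from the thesis.

```c
/* Sketch of a perceptual error-weighting filter W(z) = A(z/g1)/A(z/g2),
 * with A(z) = 1 - sum a_i z^-i the short-term predictor polynomial.
 * g1 = 0.9, g2 = 0.6 below are illustrative values only. */
#include <stdio.h>

#define ORDER 10

static void weight(const double *a,   /* a[1..ORDER], predictor coeffs */
                   const double *x, double *y, int len,
                   double g1, double g2)
{
    double xm[ORDER + 1] = {0}, ym[ORDER + 1] = {0};  /* filter memories */
    for (int n = 0; n < len; n++) {
        double v = x[n], p1 = 1.0, p2 = 1.0;
        for (int i = 1; i <= ORDER; i++) {   /* FIR part: A(z/g1) on x   */
            p1 *= g1;
            v -= a[i] * p1 * xm[i];
        }
        for (int i = 1; i <= ORDER; i++) {   /* IIR part: 1/A(z/g2) on y */
            p2 *= g2;
            v += a[i] * p2 * ym[i];
        }
        for (int i = ORDER; i > 1; i--) { xm[i] = xm[i-1]; ym[i] = ym[i-1]; }
        xm[1] = x[n];
        ym[1] = v;
        y[n] = v;
    }
}

int main(void)
{
    double a[ORDER + 1] = {0, 0.8};                 /* toy 1-tap predictor */
    double x[8] = {1, 0, 0, 0, 0, 0, 0, 0}, y[8];   /* impulse response    */
    weight(a, x, y, 8, 0.9, 0.6);
    for (int n = 0; n < 8; n++) printf("%.4f ", y[n]);
    printf("\n");
    return 0;
}
```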
The distinguishing feature of AbS codecs is how the excitation waveform u(n) for the synthesis filter is chosen. Conceptually, every possible waveform is passed through the filter to see what reconstructed speech signal each excitation would produce. The excitation which gives the minimum weighted error between the original and the reconstructed speech is then chosen by the encoder and used to drive the synthesis filter at the decoder. It is this 'closed-loop' determination of the excitation which allows AbS codecs to produce good quality speech at low bit rates. However, the numerical complexity involved in passing every possible excitation signal through the synthesis filter is huge, and some means of reducing this complexity, without compromising the performance of the codec too badly, must usually be found. The differences between MPE, RPE and CELP codecs arise from the representation of the excitation signal used.

In multi-pulse codecs the excitation signal is given by a fixed number of non-zero pulses for every frame of speech. The positions of these non-zero pulses within the frame, and their amplitudes, must be determined by the encoder and transmitted to the decoder. In theory it would be possible to find the very best values for all the pulse positions and amplitudes, but this is not practical due to the excessive complexity it would entail, so in practice some sub-optimal method of finding the pulse positions and amplitudes must be used. Typically about 4 pulses per 5 ms are used, and this leads to good quality reconstructed speech at a bit rate of around 10 kbits/s.

Like the MPE codec, the Regular-Pulse Excited (RPE) codec uses a number of non-zero pulses to give the excitation signal. However, in RPE codecs the pulses are regularly spaced at some fixed interval, and the encoder needs to determine only the position of the first pulse and the amplitudes of all the pulses. Less information therefore needs to be transmitted about pulse positions, and so for a given bit rate the RPE codec can use many more non-zero pulses than MPE codecs. For example, at a bit rate of about 10 kbits/s around 10 pulses per 5 ms can be used in RPE codecs, compared to 4 pulses for MPE codecs. This allows RPE codecs to give slightly better reconstructed speech quality than MPE codecs, although they also tend to be more complex. The pan-European GSM mobile telephone system uses a simplified RPE codec, with long-term prediction, operating at 13 kbits/s to provide toll quality speech.

Although MPE and RPE codecs can provide good quality speech at rates of around 10 kbits/s and higher, they are not suitable for rates much below this, due to the large amount of information that must be transmitted about the excitation pulses' positions and amplitudes. If the bit rate is reduced by using fewer pulses, or by coarsely quantizing their amplitudes, the reconstructed speech quality deteriorates rapidly. Currently the most commonly used algorithm for producing good quality speech at rates below 10 kbits/s is Code Excited Linear Prediction (CELP). This approach was proposed by Schroeder and Atal in 1985, and differs from MPE and RPE in that the excitation signal is effectively vector quantized. The excitation is given by an entry from a large vector quantizer codebook, together with a gain term to control its power. Typically the codebook index is represented with about 10 bits (giving a codebook size of 1024 entries) and the gain is coded with about 5 bits. Thus the bit rate necessary to transmit the excitation information is greatly reduced: around 15 bits, compared to the 47 bits used, for example, in the GSM RPE codec.
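The closed-loop search itself has a simple structure: each codeword is filtered through the (weighted) synthesis filter, the optimal gain follows in closed form, and the entry maximising the match is kept. Minimising |x - g y|^2 over g gives g = (x.y)/(y.y), so picking the codeword maximising (x.y)^2/(y.y) minimises the error. The sketch below uses an impulse-response formulation with illustrative sizes (a 64-entry codebook rather than 1024).

```c
/* Sketch of the closed-loop CELP codebook search described above. */
#include <stdio.h>

#define SUB   40   /* subframe length (5 ms at 8 kHz)        */
#define CBSZ  64   /* codebook size (1024 in a real coder)   */

/* Convolve codeword c with synthesis-filter impulse response h. */
static void synth(const double *c, const double *h, double *y)
{
    for (int n = 0; n < SUB; n++) {
        y[n] = 0.0;
        for (int k = 0; k <= n; k++)
            y[n] += h[k] * c[n - k];
    }
}

/* Return index of the best codeword; *gain receives its optimal gain. */
static int search(const double cb[CBSZ][SUB], const double *h,
                  const double *x, double *gain)
{
    int best = 0;
    double bestscore = -1.0, y[SUB];
    for (int i = 0; i < CBSZ; i++) {
        synth(cb[i], h, y);
        double xy = 0.0, yy = 0.0;
        for (int n = 0; n < SUB; n++) { xy += x[n] * y[n]; yy += y[n] * y[n]; }
        if (yy > 0.0 && xy * xy / yy > bestscore) {
            bestscore = xy * xy / yy;   /* match criterion (x.y)^2/(y.y) */
            best = i;
            *gain = xy / yy;
        }
    }
    return best;
}

int main(void)
{
    static double cb[CBSZ][SUB], h[SUB] = {1.0, 0.7, 0.3}, x[SUB];
    unsigned s = 1;
    for (int i = 0; i < CBSZ; i++)       /* toy pseudo-random codebook */
        for (int n = 0; n < SUB; n++)
            cb[i][n] = ((s = s * 1103515245u + 12345u) >> 16) / 32768.0 - 1.0;
    synth(cb[7], h, x);                  /* target = filtered codeword 7 */
    double g;
    printf("best index %d, gain %.3f\n", search(cb, h, x, &g), g);
    return 0;
}
```

The per-entry convolution is what makes the exhaustive search so expensive; this is the cost the next paragraph quantifies.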
Originally the codebook used in CELP codecs contained white Gaussian sequences. This was because it was assumed that the long- and short-term predictors would remove nearly all the redundancy from the speech signal, producing a random, noise-like residual, and it had been shown that the short-term probability density function (pdf) of this residual was nearly Gaussian. Schroeder and Atal found that using such a codebook to produce the excitation for the long- and short-term synthesis filters could produce high quality speech. However, choosing which codebook entry to use in an analysis-by-synthesis procedure meant that every excitation sequence had to be passed through the synthesis filters to see how close the reconstructed speech it produced would be to the original. The complexity of the original CELP codec was therefore much too high for real-time implementation: it took 125 seconds of Cray-1 CPU time to process 1 second of the speech signal. Since 1985 much work has been done on reducing the complexity of CELP codecs, mainly by altering the structure of the codebook. Large advances have also been made in the speed of DSP chips, so that it is now relatively easy to implement a real-time CELP codec on a single, low cost DSP chip. Several important speech coding standards have been defined based on the CELP principle, for example the American Department of Defense (DoD) 4.8 kbits/s codec and the CCITT low-delay 16 kbits/s codec.
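The thesis proposes simplifications of exactly these convolution and correlation computations. One well-known structural trick in this spirit is the overlapped (shifted) codebook, in which consecutive codewords share most of their samples, so the filtered codeword can be updated recursively instead of recomputed by a full convolution. The sketch below uses a shift of one sample for brevity (y_{i+1}(n) = v h(n) + y_i(n-1), where v is the newly prepended sample), which reduces the cost per entry from O(L^2) to O(L); real coders such as the DoD 4.8 kbits/s CELP use a shift-by-2 overlapped codebook. This is an illustration of the general idea, not the thesis's specific method.

```c
/* Sketch: recursive update of a filtered overlapped codeword.
 * Codeword i+1 = codeword i shifted right one sample, with new sample v
 * prepended; then y_{i+1}(n) = v*h(n) + y_i(n-1). */
#include <stdio.h>

#define SUB 40    /* subframe length */

/* Update filtered codeword y in place for a one-sample shift. */
static void shift_update(double *y, const double *h, double v)
{
    for (int n = SUB - 1; n > 0; n--)
        y[n] = v * h[n] + y[n - 1];
    y[0] = v * h[0];
}

int main(void)
{
    double h[SUB] = {1.0, 0.5, 0.25}, c[SUB] = {0}, y[SUB] = {0};
    /* Start from the all-zero codeword, prepend samples one by one,
     * then compare the recursive result with a direct convolution. */
    for (int step = 0; step < 3; step++) {
        double v = step + 1.0;
        for (int n = SUB - 1; n > 0; n--) c[n] = c[n - 1];
        c[0] = v;
        shift_update(y, h, v);
    }
    double direct = 0.0;              /* direct convolution at sample 2 */
    for (int k = 0; k <= 2; k++) direct += h[k] * c[2 - k];
    printf("recursive %.3f vs direct %.3f\n", y[2], direct);
    return 0;
}
```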
The CELP coding principle has been very successful in producing communications to toll quality speech at bit rates between 4.8 and 16 kbits/s. The CCITT standard 16 kbits/s codec produces speech which is almost indistinguishable from 64 kbits/s log-PCM coded speech, while the DoD 4.8 kbits/s codec gives good communications quality speech. Recently much research has been done on codecs operating below 4.8 kbits/s, the aim being a codec at 2.4 or 3.6 kbits/s with speech quality equivalent to the 4.8 kbits/s DoD CELP. The CELP codec structure can be improved and used at rates below 4.8 kbits/s by classifying speech segments into one of a number of types (for example voiced, unvoiced and transition frames). The different speech segment types are then coded with a specially designed encoder for each type: for unvoiced frames, for example, the encoder will not use any long-term prediction, whereas for voiced frames long-term prediction is essential.

Technology Targets

Given that there is no rigorous mathematical formula for speech entropy, a natural target in speech coding is the achievement of high quality at bit rates at least a factor of two lower than those that currently provide high quality: 4 kbps for telephone speech, 8 kbps for wideband speech and 24 kbps for CD-quality speech. These numbers represent a bit rate of about 0.5 bit per sample in each case (for example, telephone speech sampled at 8 kHz at 0.5 bit per sample gives 4 kbps). Another challenge is the realization of robust algorithms in the context of real-life imperfections such as input noise, transmission errors and packet losses. Finally, an overarching set of challenges has to do with realizing the above objectives with usefully low levels of implementation complexity. In all of these pursuits, we are limited by our knowledge in several individual disciplines, and in the way these disciplines interact. Advances are needed in our understanding of coding, communication and networking, speech production and hearing, and digital signal processing. In discussing directions of research it is impossible to be exhaustive, and in predicting which directions will be successful we do not necessarily expect to be accurate. Nevertheless, it may be useful to set down some broad research directions, with a range that covers the obvious as well as the speculative. The last part of this section is addressed to this task.
Future Directions

Coding, Communication, and Networking: In recent years there has been significant progress in the fundamental building blocks of source coding: flexible methods of time-frequency analysis, adaptive vector quantization, and noiseless coding. Compelling applications of these techniques to speech coding are relatively less mature. Complementary advances in channel coding and networking include coded modulation for wireless channels and embedded transmission protocols for networking. Joint designs of source coding, channel coding and networking will be especially critical in wireless communication of speech, particularly in the context of multimedia applications.

Speech Production and Perception: Simple models of periodicity and simple source models of the vocal tract need to be supplemented (or replaced) by models of articulation and excitation that provide a more direct and compact representation of the speech-generating process. Likewise, stylized models of distortion masking need to be replaced by models that maximize masking in the spectral and temporal domains. These models need to be based on better overall models of hearing, and on experiments with real speech signals (rather than simplified stimuli such as tones and noise).

Digital Signal Processing: In current technology, a single general-purpose signal processor is capable of nearly 100 million arithmetic operations per second, and one square centimeter of silicon memory can store about 25 megabits of information. The memory and processing power available on a single chip are both expected to continue to increase significantly over the next several years. Processor efficiency, as measured in MIPS per milliwatt of power consumption, is also expected to improve by at least one order of magnitude. However, to accommodate coding algorithms of much higher complexity on these devices, continued advances are needed in the way processor architectures are matched to complex algorithms, especially in configurations that permit graceful control of speech quality as a function of processor cost and power dissipation. The issues of power consumption and battery life are particularly critical for personal communication services and portable information terminals.
Conclusion

In this thesis, low bit rate speech coding techniques and their implementation on digital signal processors are investigated, and applications around 4.8 kbps are developed on a DSP basis. The aim was not primarily to reduce the bit rate of the coded signal, but to use the resources (the DSP) efficiently and to improve the quality. When Schroeder and Atal first introduced the CELP coding approach in 1985, the required CPU power was around 333 MIPS; at that time, processing 1 second of speech took 125 seconds of CPU time on the Cray-1 computer. With this in mind, simplification methods are proposed and applied for the closed-loop LTP and codebook searches. It is also shown that, in the analysis-by-synthesis loop of the coder, the short-term synthesis parameters should be updated once the excitation parameters have been determined, in the manner of bandwidth expansion and interpolation, to improve the coding quality.

Chapter 2 investigates the basic properties and modelling of the speech signal, to form a basis for the later chapters. Chapter 3 discusses speech coding principles and applications, and examines the considerations governing the selection of a speech coding algorithm for various application areas. Chapter 4 surveys coding systems and describes quantization types and coding schemes. Chapter 5 considers the tools of low bit rate speech coding. Chapter 6 gives a detailed description of the Analysis-by-Synthesis (AbS) technique, the basic approach behind CELP and CELP-like coding algorithms. Finally, Chapter 7 discusses real-time implementation issues, covering fixed-point versus floating-point implementation, currently available DSPs, the processing requirements of a CELP coder, and the proposed codebook search simplifications. Ways of improving the AbS structure used in CELP speech codecs are also studied, and methods for updating the short-term synthesis filter once the excitation parameters have been determined are examined. It is shown that significant improvements can be achieved by updating the synthesis filter, similar to those obtained using the well-known methods of interpolation and bandwidth expansion. Furthermore, all the proposed methods are applied to implement a CELP codec using a fixed-point DSP and the high-level language C. The result will be used for the voice mail application on a PBX system.
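For concreteness, the two classical parameter-update operations the conclusion refers to can be sketched on direct-form LPC coefficients a[1..p]: bandwidth expansion scales a_i by g^i with g slightly below 1, and interpolation blends the previous and current frames' parameters. The constants below are illustrative; in practice interpolation is usually done in a stabler domain such as line spectral pairs, and direct-form coefficients are used here only to keep the sketch short.

```c
/* Sketch of bandwidth expansion and frame-to-frame interpolation of
 * short-term synthesis filter parameters (direct-form LPC coeffs). */
#include <stdio.h>

#define ORDER 10

/* a_i <- g^i * a_i : moves the poles toward the origin, widening formants. */
static void bandwidth_expand(double *a, double g)
{
    double p = 1.0;
    for (int i = 1; i <= ORDER; i++) {
        p *= g;
        a[i] *= p;
    }
}

/* a_i <- (1-w)*prev_i + w*cur_i for a subframe weight w in [0,1]. */
static void interpolate(const double *prev, const double *cur,
                        double *a, double w)
{
    for (int i = 1; i <= ORDER; i++)
        a[i] = (1.0 - w) * prev[i] + w * cur[i];
}

int main(void)
{
    double prev[ORDER + 1] = {0, 1.2, -0.5}, cur[ORDER + 1] = {0, 1.0, -0.4};
    double a[ORDER + 1];
    interpolate(prev, cur, a, 0.25);   /* first subframe of the new frame  */
    bandwidth_expand(a, 0.994);        /* roughly 15 Hz expansion at 8 kHz */
    printf("a1 = %.4f, a2 = %.4f\n", a[1], a[2]);
    return 0;
}
```

Both operations smooth the evolution of the synthesis filter between analysis updates, which is the source of the quality improvement reported above.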
Description
Tez (Doktora) -- İstanbul Teknik Üniversitesi, Fen Bilimleri Enstitüsü, 1999
Thesis (Ph.D.) -- Istanbul Technical University, Institute of Science and Technology, 1999
Keywords
CELP, Coding, Speech coding
Citation