## Commercial Identification Using Audio Fingerprinting (Ses Parmakizi Kullanılarak Reklam Tanıma)

2015-06-24
Çabuk, Hüseyin
##### Publisher
Institute of Science and Technology (Fen Bilimleri Enstitüsü)
##### Abstract
The use of smartphones grows every day. People now use their smartphones alongside many of their daily activities, and watching television is one of the most common of these: 84% of smartphone users use their phones while watching television.

One reason people reach for the phone while watching television is to search for more detailed information about what is on screen. Text-based search is tedious and slow. The advertising industry in particular needs other solutions that let users reach information about advertised products more easily and quickly.

This work aims to develop an algorithm that lets a smartphone application listen to a commercial playing on television and identify it. Audio fingerprinting methods were chosen for commercial identification. The method is intended to have the properties expected of audio identification systems: robustness, reliability, small data size, granularity, short search time, and low computational cost. Unlike songs, different commercials can contain the same music or speech segments, so the reliability check was designed to differ from the one used in music identification systems.

Hashes generated from the time and frequency differences between peak points in the spectrogram are used as the audio fingerprint. At least some of the spectrogram peaks are expected to survive even under noise and signal distortion.

After a prototype of the method was developed, experiments revealed some weaknesses. Solutions were developed for these problems, which are generally caused by small shifts of spectrogram peaks in the time or frequency direction. Contributions were also made so that the method reaches the same success rates with less data. For the reliability check, a solution with two threshold parameters was devised.

After the structural contributions were completed, many experiments were run to determine the system parameters that give the best success rates.

The experiments used two test sets, one of songs and one of commercials, together with subsets degraded by various levels of white noise, pink noise, brown noise, clipping, a bar environment effect, a live recording effect, a smartphone recording effect, and a smartphone playback effect, as well as versions recorded with a smartphone in a shopping mall in İstanbul.

The results were evaluated in terms of recall, precision, data size, and identification time, and compared with the base method. The proposed method gave better results than the base method on all of these criteria.

Smartphone usage is increasing rapidly. People now often use their smartphones while doing something else, and one of the most common such activities is watching television: 84% of smartphone owners use their phones while watching television.

One of the main reasons people use their smartphones while watching television is to search for more detailed information about what they are watching. Text-based searching is inconvenient and time-consuming. The advertising industry in particular needs another way to let customers find information about its products quickly and easily.

This thesis aims to develop an algorithm that a smartphone application can use to identify a TV commercial by listening through the microphone. Audio fingerprinting techniques were chosen for commercial identification. The algorithm is intended to have the properties that audio fingerprinting systems should have: robustness, reliability, granularity, small fingerprint size, search speed, and scalability. Some commercials share the same music or speech segments, so a reliability check different from those used in song identification systems is needed. The audio fingerprinting literature was reviewed, and one algorithm was chosen as the base to satisfy these requirements.

In the base algorithm, the time and frequency distances between peaks in the audio spectrogram are used to generate the fingerprint. A time-frequency point is a candidate peak if it has higher energy than all its neighbors in a region centered on the point. At least some of the peaks are expected to survive noise and signal distortion.

The base algorithm was implemented as a prototype, and initial tests revealed some weaknesses.
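
The peak definition and the peak-pair fingerprint described above can be illustrated with a minimal Python sketch. It is not the thesis implementation (which is written in C++): the function names, the neighborhood `radius`, the `max_dt` and `fan_out` limits, and the choice to include the anchor frequency in the hash are all assumptions made for illustration.

```python
from typing import List, Tuple

Peak = Tuple[int, int, float]  # (time_frame, freq_bin, energy)


def find_peaks(spec: List[List[float]], radius: int = 1) -> List[Peak]:
    """Return points whose energy exceeds every neighbor within `radius`.

    spec[t][f] is the energy at time frame t and frequency bin f.
    The square neighborhood and its radius are illustrative assumptions.
    """
    n_t, n_f = len(spec), len(spec[0])
    peaks = []
    for t in range(n_t):
        for f in range(n_f):
            e = spec[t][f]
            is_peak = all(
                e > spec[tt][ff]
                for tt in range(max(0, t - radius), min(n_t, t + radius + 1))
                for ff in range(max(0, f - radius), min(n_f, f + radius + 1))
                if (tt, ff) != (t, f)
            )
            if is_peak:
                peaks.append((t, f, e))
    return peaks


def make_hashes(peaks: List[Peak], max_dt: int = 8, fan_out: int = 3):
    """Pair each anchor peak with up to `fan_out` later peaks.

    Each entry is ((anchor_freq, freq_difference, time_difference), anchor_time).
    """
    ordered = sorted(peaks)  # by time frame, then frequency bin
    hashes = []
    for i, (t1, f1, _) in enumerate(ordered):
        paired = 0
        for t2, f2, _ in ordered[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # later peaks are even further away in time
            if dt == 0:
                continue  # skip peaks in the same time frame
            hashes.append(((f1, f2 - f1, dt), t1))
            paired += 1
            if paired == fan_out:
                break
    return hashes
```

Because each hash depends only on differences between nearby peaks, it stays the same when the query starts at an arbitrary point in the recording, which is what makes matching on short excerpts possible.
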
The weaknesses found in the initial tests were mostly caused by small shifts of peaks in the time or frequency direction. At high noise levels in particular, these shifts grew larger and reduced success rates sharply. Several solutions were proposed.

First, in the database search step, the hashes generated from the distances between peaks were looked up in the fingerprint database together with some alternative values. Tests showed higher success rates, although the database search took longer, as expected. Because the search still ran faster than real time, the approach was judged applicable.

The initial tests also showed that using all the peaks in the search step lowered success rates and increased search time and computation cost, because most peaks could not survive high noise levels. A further contribution therefore uses only the strongest *n* peaks in the search step, with *n* calculated for every second of the query using an equation given in the thesis (`eq:hash_count_rate`). This made it possible to reach the same success rates with less data.

Peak shifts also caused problems in the scoring step: the correct answer received a lower score than it should have, so it took longer to exceed the threshold. To overcome this, another contribution, called histogram normalization, was introduced, which led to higher success rates.

Another weakness appeared in the tests with the commercial test sets. Because more than one commercial can contain the same music or speech segments, queries made with those segments produced false positives under the original hypothesis testing method.
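
The search with alternative values and the strongest-*n* peak selection could be sketched in Python roughly as follows. This is only an illustration under assumptions: the hash layout `(anchor_freq, freq_difference, time_difference)`, the tolerance `tol`, and all function names are hypothetical, and *n* is passed in directly because the thesis equation for it is not reproduced here.

```python
from itertools import product


def strongest_peaks(peaks, n):
    """Keep the n highest-energy peaks; each peak is (time, freq, energy)."""
    return sorted(peaks, key=lambda p: p[2], reverse=True)[:n]


def alternative_hashes(h, tol=1):
    """Yield h plus variants whose frequency and time differences
    are shifted by up to `tol` bins (assumed tolerance)."""
    f1, df, dt = h
    for ddf, ddt in product(range(-tol, tol + 1), repeat=2):
        if dt + ddt > 0:  # a pair must still point forward in time
            yield (f1, df + ddf, dt + ddt)


def lookup(index, h, tol=1):
    """Collect (track_id, track_time) entries for h and its alternatives.

    `index` is a plain dict mapping hash -> list of entries.
    """
    hits = []
    for alt in alternative_hashes(h, tol):
        hits.extend(index.get(alt, []))
    return hits
```

With `tol=1` each query hash expands to at most nine lookups, which is consistent with the longer search time reported above; restricting the query to the strongest peaks offsets part of that cost.
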
The hypothesis testing step was also improved, using a method with two threshold parameters called matching rate and power rate. Tests showed that the false positive rate dropped sharply after this contribution.

After the structural contributions were completed, a number of tests were run to find the values of the three system parameters, $\alpha$, $\beta$, and $\sigma$, that give the best success rates. Two test sets were used, one of songs and one of commercials. Several degradations were applied to them: white noise, pink noise, brown noise, clipping, a bar environment effect, a smartphone recording effect, a smartphone playback effect, and a live recording effect. A further test set was generated by recording with a smartphone in a real shopping mall in İstanbul.

Test results were examined in terms of recall, precision, required data size, and search time, and then compared with the base algorithm. These measures correspond to the performance requirements of any audio fingerprinting system: robustness, reliability, fingerprint size, granularity, and search time.

The developed system outperformed the base algorithm: the contributions decreased search time and storage needs and increased recall and precision.

The developed system currently runs as a standalone Windows application written in C++. When a song or commercial that was introduced to the system beforehand is played, the user starts the test by clicking a button in the user interface; the application then listens through the microphone and tries to find a match in the database. The fingerprints of the songs and commercials introduced to the system beforehand are currently stored in a text file.
Before testing, another button is used to read these fingerprints from the text file into memory. Because memory is limited, the number of song and commercial fingerprints is limited too; the tests used at most 500 songs and commercials.

For the system to be usable in real-life applications, a real database must replace the text file as fingerprint storage. The database structure needs to be indexable so that search times do not grow as the fingerprint count goes up; otherwise the system will be unusable. A client-server architecture must also be developed, with a smartphone application as the client and the fingerprint database as the server. The smartphone application should listen through the microphone, generate the fingerprint, and send the fingerprint to the server instead of the full audio data. The server should search for the fingerprint in the database, perform scoring and hypothesis testing, and respond to the client accordingly.

Fingerprint generation speed in the smartphone application should be measured, and if generation is slower than real time, it should be optimized. Searching and scoring on the server side are well suited to parallel processing; if the server is built on a parallel architecture, search and scoring times can be reduced considerably.
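
As a rough illustration of the indexable fingerprint storage and server-side scoring proposed above, the following Python sketch uses an in-memory inverted index and votes on time offsets. It is an assumption-laden stand-in, not the thesis design: the class and method names are hypothetical, a real deployment would use a database rather than a dict, and the two-threshold hypothesis test is omitted because its definition is not given here.

```python
from collections import Counter, defaultdict


class FingerprintIndex:
    """In-memory inverted index: hash -> list of (track_id, track_time)."""

    def __init__(self):
        self._table = defaultdict(list)

    def add(self, track_id, hashes):
        """Register a track; `hashes` is a list of (hash, time) pairs."""
        for h, t in hashes:
            self._table[h].append((track_id, t))

    def query(self, hashes):
        """Vote on (track_id, time offset) and return (best_track, votes).

        Hashes from the true match agree on one offset between track time
        and query time, so their votes pile up in a single histogram bin.
        """
        votes = Counter()
        for h, t_query in hashes:
            for track_id, t_track in self._table.get(h, ()):
                votes[(track_id, t_track - t_query)] += 1
        if not votes:
            return None, 0
        (track_id, _offset), count = votes.most_common(1)[0]
        return track_id, count
```

In a client-server split, only the list of `(hash, time)` pairs would travel from the phone to the server, and lookup cost depends on the number of query hashes rather than on the number of stored tracks.
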
##### Description
Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2015
##### Keywords
Audio Fingerprinting, Digital Music Processing, Music Identification, Commercial Identification