A few notes on the SBRECOG speech recognition demo

With SBRECOG, I am presenting a speaker-dependent speech recognizer that
runs on DOS machines with Sound Blaster compatible sound cards. The
recognition can be quite good if the conditions are optimal:

-- sufficiently distinct test sets that consist of words of two or more
   syllables
-- good recording conditions

Sets that work well for me are 4-6 element sets consisting of Italian
numbers or the aviation alphabet.

The program is based on the paper "Untersuchungen zur Verteilung von
Nulldurchgangsabstaenden in Sprachsignalen" (a study on the distribution
of zero crossing distances in speech signals) by Michael Kirstein,
published in IKP-Forschungsberichte II/62, Hamburg 1977. Under the next
two headings I try to summarize the paper, but of course only as far as
I have understood it and think it relevant for the program.

I. Related works

A number of works published since the late 1940s give reason to assume
that the zero crossings of a speech signal contain sufficient information
to allow the discrimination of phonemes, or at least of words:

-- Licklider & Pollack (1948) show that clipped speech remains
   understandable. In SBRECOG the amplitudes of the individual samples
   are clipped to a magnitude of 1, i.e. the signal is reduced to 1 bit
   per sample
-- Chang, Pihl & Essigmann (1951) examine how the densities of zero
   crossings and extrema (rho0 and rho0') are related to the first and
   second formant in voiced sounds
-- Peterson (1951) shows that the values of these formants in the
   spectrum of vowels are proportional to rho0 and rho0'
-- Chang, Pihl & Wiren (1952) introduce the "intervalgram", a graphical
   representation of the intervals between zero crossings
-- Kirstein (1971) talks about "Kumulanten" ("cumulants"),
   characteristic concentrations of intervals (horizontal lines in the
   intervalgram)

Kirstein also quotes the rather pessimistic Burghard & Hess (1971), who
conclude that zero crossing interval distributions do not allow the
discrimination of vowels.

II. Problem and method

Windows of common sizes such as 10 or 20 ms are too narrow to give a
stable "view" of a speech signal; the distributions found are not
significant. That is why whole word utterances are chosen as the subject
of study.

-- The signal s(t) is clipped to a square signal _s_(t) = c*sgn(s(t))
-- the zero crossing intervals are collected
-- their distribution is examined, i.e. it is counted how many intervals
   have the size i, how many the size i*2 and so on

Kirstein makes his PDP 15 micro examine the signal in real time; to
reduce the necessary computations he watches the positive part of the
signal only. Thus he reaches a sampling frequency of 32 kHz. He admits
that the speech signal is "not at all symmetrical to the zero line", but
thinks the results are usable anyway.

The smallest interval that can be measured (at the resulting time
resolution) is 31.6 mu-s; the biggest that gets counted is 6.3 ms. Thus
there are 200 possible intervals, stretching over a frequency range of
79..15,823 Hz. These 200 intervals are grouped into 16 classes; the idea
is that one class stretches over the bandwidth of about one formant.
Here is how Kirstein assigned intervals to the 16 classes (the mu-s
column gives the lower bound of a class, i.e. its first interval times
31.6 mu-s, where the original table lists one):

   class   interv    mu-s
     1       1         31
     2       2
     3       3
     4       4
     5       5        158
     6       6
     7       7
     8       8
     9       9-10     284
    10      11-12     347
    11      13-15     410
    12      16-19     505
    13      20-25     632
    14      26-38     821
    15      39-78    1232
    16      79-200   2496

The signal durations turned out to vary significantly between speakers
and even between different productions by one speaker. As the number of
zero crossings of course varies with the length of a signal, relative
frequencies must be calculated.

What we have at this point is a 16-dimensional vector representing each
word. Kirstein examines a number of statistical methods for comparing
the similarity of two vectors. The one yielding the best results in his
study is a contingency matrix.
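
The steps described so far (clipping, collecting zero crossing intervals,
sorting them into the 16 classes, taking relative frequencies) can be
sketched in C roughly as follows. This is a minimal illustration under
stated assumptions, not the code of SBRECOG and not Kirstein's program:
the names class_upper, interval_class and zc_features are made up here,
the input is assumed to be signed 16-bit samples, intervals are measured
in sample periods between all successive sign changes (Kirstein watches
the positive part only), and the class bounds are those of the table
above, which fit his 31.6 mu-s resolution rather than SBRECOG's.

```c
#include <assert.h>
#include <stddef.h>

#define N_CLASSES 16

/* Upper bound (inclusive, in sample periods) of each of the 16 classes,
   copied from Kirstein's table. */
static const int class_upper[N_CLASSES] = {
    1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 19, 25, 38, 78, 200
};

/* Map an interval length to a class index 0..15, or -1 if the interval
   lies outside 1..200 and is therefore not counted. */
static int interval_class(int interval)
{
    int k;

    if (interval < 1)
        return -1;
    for (k = 0; k < N_CLASSES; k++)
        if (interval <= class_upper[k])
            return k;
    return -1;
}

/* Fill features[] with the relative frequency of each interval class
   for one utterance; returns the number of intervals counted.
   Assumption: signed samples, zero counts as positive. */
static int zc_features(const short *samples, size_t n,
                       double features[N_CLASSES])
{
    long counts[N_CLASSES] = {0};
    long total = 0;
    size_t i, last = 0;
    int have_last = 0, prev_sign, k;

    assert(samples != NULL && features != NULL);

    prev_sign = (n > 0) ? (samples[0] >= 0) : 0;
    for (i = 1; i < n; i++) {
        int sign = (samples[i] >= 0);      /* 1-bit clipping */
        if (sign != prev_sign) {           /* zero crossing found */
            if (have_last) {
                k = interval_class((int)(i - last));
                if (k >= 0) {
                    counts[k]++;
                    total++;
                }
            }
            last = i;
            have_last = 1;
            prev_sign = sign;
        }
    }
    /* Relative frequencies make utterances of different length comparable. */
    for (k = 0; k < N_CLASSES; k++)
        features[k] = total ? (double)counts[k] / (double)total : 0.0;
    return (int)total;
}
```

Calling zc_features() on one recorded word yields the 16-dimensional
vector discussed below; words of different duration become comparable
because only the proportions of the classes are kept.
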
The contingency matrix method is similar to the one employed by
information theorists to calculate the "information transmission rate"
or "Transinformation" (Meyer-Eppler, 1969). The formula combines input
entropy + output entropy - overall entropy to calculate a measure of the
transmitted information:

   T = Sum_i=1..r ( Sum_j=1..c ( p_ij * log_2 (p_ij / (p_i. * p_.j)) ) )

where c, r are the columns and rows of the matrix (c: dimension of the
vectors, r: number of vectors to be compared); p_ij are the matrix
cells, p_i. the row sums and p_.j the column sums.

Kirstein decides to smooth the vectors (by averaging each element with
its weighted nearest neighbours). This turned out to be disastrous in my
implementation, so I left out the smoothing.

III. About my implementation

My main interest was voice recognition in the telephone network, thus I
had to make do with a smaller bandwidth and a sampling rate of around
11 kHz. It is easy to see why the number of possible interval sizes is
reduced to 64 instead of Kirstein's 200 (see the related comments in the
code). Although their classification, which must eventually yield the
16-dimensional vector, is quite crucial for the performance of this
method, I must admit I did it quite ad hoc: I printed out a couple of
matrices and decided that they looked characteristic enough...

The performance of my program of course changes considerably with
different CPU speeds, as the sampling frequency is not constant across
machines. If you do not achieve satisfactory results, try changing the
#define value of CPUSPEED to the clock rate of your machine, or lower.
I didn't test the program on machines other than 286s, so given the
quite different CPU designs you may have to set CPUSPEED to a value that
doesn't match that of your computer at all... The playback rate (which
you observe during the training of words) is no clue here, as the
recognition of course depends only on the recording speed. Just fiddle
around with these things a bit.

The "user interface" of the program is so primitive that you will master
it without my explaining it here. Just note that there are basically two
ways of improving the recognition of a test set:

-- You can have multiple dictionary entries for different realisations
   of one word. You may want to attach different ID strings to the
   dictionary entries (like "bravo_1", "bravo_2", "bravo_fast",
   "bravo_slow"...), so that you can see how often each of the entries
   is picked by the program.
-- Or you can have the parameter vectors in the dictionary calculated as
   the average of two or more realisations (the program supports two
   only). This is what the program means by asking "Would you like
   another test set to be averaged with the set entered".

The Sound Blaster interface "direct.obj" was written by Joel Lucsy of
Vroom Diggy Diggy Software and is part of a freeware package, "Blast".
I am including only the Blast files necessary to compile my demo. If you
want to use the package for your own programs, I suggest you let archie
search your favourite ftp servers for it.

Why am I publishing this demo program? I would like to see people start
further experiments inspired by the ideas presented here. The material
is free to use and share. I hope you will feel somewhat obliged to make
your enhancements and applications free software, too.

If you have any further questions or comments, you can contact me by
electronic mail at

   kiehl@ldv01.uni-trier.de

or by conventional mail

   until 06-31-1993            from 07-01-1993

   Johannes Kiehl              Johannes Kiehl
   Postfach 2441               Postfach 2441
   D - W 5500 Trier            D - 54214 Trier