Problems in evidence involving speaker identification

Forensic Phonetics:

Issues in speaker identification evidence

Andrew Butcher

Centre for Human Communication Research

Flinders Medical Research Institute

Flinders University, Adelaide, Australia

Abstract

The field of forensic phonetics has developed over the last 20 years or so and embraces a number of areas involving analysis of the recorded human voice. The area in which expert opinion is most frequently sought is that of speaker identification – the question of whether two or more recordings of speech (from suspect and perpetrator) are from the same speaker. Automated analysis (in which Australia is a world leader) is only possible where recording conditions are identical. In the most frequently encountered real-world forensic situation, comparison is required between a police interview recording and recordings made via telephone intercepts or listening devices. This necessitates a complex procedure, involving auditory and acoustic comparison of both linguistic and non-linguistic features of the speech samples in order to build up a profile of the speaker. The most commonly used measures are average fundamental frequency and the first and second formant frequencies of vowels. Much work is still needed to develop appropriate statistical procedures for the evaluation of phonetic evidence. This means estimating the probability of finding the observed differences between samples from the same speaker and the probability of finding those same differences between samples from two different speakers. Thus there needs to be an acceptance that the outcome will not be an absolute identification or exclusion of the suspect. By itself, your voice is not a complete giveaway.

1. The field of forensic phonetics

The use of phonetics as a forensic tool has developed over the past 20 years or so (Hollien 1990; Baldwin & French 1991), but with the rapid expansion in the number of cases depending on the evidence of covert audio and video recordings in recent years, forensic phonetics now plays a crucial role in an increasing number of criminal trials. A forensic phonetician may be asked to prepare reports in a number of areas, of which the following four are the most frequently encountered:

1.1 Speaker identification. This is by far the most commonly required task and the subject of the remainder of this paper.

1.2 Disputed utterances. In view of the usually very poor quality of covert police recordings (especially those made via a listening device), there is often ample scope for a defendant to challenge the prosecution’s version of what was actually said in the course of a recorded conversation. Forensic phoneticians may be asked to prepare a report on the quality of the recording and the intelligibility of the speech. They may also be asked to prepare an ‘objective’ transcript of the recording.

1.3 Tape authentication. Occasionally a defendant (or a civil litigant) may have cause to question whether an audio recording has been tampered with in some way. Usually the claim is that certain sections have been excised or perhaps transposed. It is not generally within the competence of a phonetician to give an opinion as to the physical condition of a tape, but there may be evidence within the acoustic signal (‘pops’ or abrupt changes in either the signal itself or the background noise) which would be indicative of electronic editing. However, currently available software makes ‘seamless’ editing comparatively easy, and a phonetician may be needed to give an opinion on the only remaining evidence of any tampering – linguistic evidence in the form of unnatural changes in rhythm, tempo or intonation.

1.4 Voice line-ups. The practice of confronting witnesses of a crime with a tape-recorded ‘voice line-up’, where the voice of a suspect is included amongst a series of ‘foils’, may be used to obtain evidence of identification in cases where, in the course of committing a crime, an unseen or masked perpetrator has spoken in the presence of the witnesses. This recording is played to the witness(es) and they are asked to state whether they can identify any of the voices as that of the perpetrator. In order to be entirely fair to the suspect, there are a number of criteria which need to be observed (Broeders & Rietveld 1995; Hollien, Huntley, Künzel & Hollien P 1995). As with visual identification parades, it is a general principle of fairness in the conducting of voice line-ups that there should be no feature of any of the voices or the recordings which would cause non-witnesses to pick out a particular speaker (whether suspect or foil) as being different from the rest. A phonetician may be consulted on aspects of the construction of the tape and the administration of the confrontation.

2. Speaker Identification: analysis and measurement

I would estimate that at least 90% of my work as a forensic phonetician is concerned with the identity of speakers in audio recordings. There is a good deal of misunderstanding surrounding the capabilities of speech technology in this area. Some of this misunderstanding dates from the 1960s, when the “Voiceprint” technique became a favourite tool of certain police forces, most notably in the USA. This methodology, which involved the visual inspection and impressionistic comparison of sound spectrograms, was regarded sceptically by the scientific community at the time, and has since been entirely discredited (Hollien 1990, 2002; Gruba & Poza 1995). The term “Voiceprint” suggests that the technique is analogous to forensic techniques such as fingerprinting or DNA analysis. There are a number of reasons why this is an inappropriate analogy. Firstly, there is no single feature of the voice which is unique to every speaker. Unlike the vanishingly small possibility in the case of fingerprints or DNA molecules, it is quite possible for two speakers to be, for all practical purposes, identical in some respect. Secondly, most (if not all) of the features of the voice which are measurable in recordings of the quality typically encountered in the forensic context are capable of being consciously changed by the speaker. These include voice pitch, aspects of voice quality, consonantal articulation, and vowel quality. At present it is not impossible for a skilled mimic to defeat the forensic voice identification procedure. Thirdly, for most of the voice features, we do not have sufficient data on the normal population to know what the chances are of two speakers being similar or identical with respect to that feature. Finally, acoustic parameters vary as a consequence of differences in recording conditions as well as of differences in the voice itself.
Australia leads the world in the technology of automatic speaker recognition (in 2001 a team from the RCSAVT Speech Research Lab at Queensland University of Technology won two of the categories for single speaker detection tasks in the National Institute of Standards & Technology’s benchmark tests on speaker recognition), but automatic speaker recognition is not yet able to separate out variation due to speaker differences from variation due to recording conditions (and it is doubtful whether it will ever be able to). Thus automatic speaker recognition techniques are of limited use in the typical forensic situation, where a voice recorded over the telephone or via a listening device is to be compared with a voice recorded in a police interview room. The intervention of a phonetically and linguistically qualified human operator is required. The main components of the procedure are an auditory analysis and an acoustic analysis, each of which in turn has a number of component parts. Voice ID is therefore more appropriately compared with a technique such as a ‘photo-fit’ type of procedure, where a number of features are considered as part of an overall profile.

2.1. Auditory analysis

This part of the analysis involves careful and repeated listening by the expert, noting features of the voices in question under four basic headings. Firstly, voice quality features are ascertained. This means describing ‘voice’ in the technical sense – i.e. the sound made by the vibration of the vocal folds – and ignoring for the moment any variations contributed by the resonances of the throat, mouth and nasal passages above. It can be done using one of a number of descriptive frameworks (e.g. Isshiki & Takeuchi 1970; Laver 1980; Wendler, Rauhut & Krüger 1986; Oates & Russell 1998), whereby aspects of the voice can be quantified according to parameters such as ‘roughness’, ‘strain’, ‘creakiness’, ‘breathiness’ and so on – terms which are meaningful to other phoneticians and speech scientists and which describe in as accurate and objective a way as possible the auditory impressions of the listener. Secondly, the investigator attends to the non-linguistic characteristics of the speech which are not produced by the larynx. This means listening to the effects of the long-term setting of the throat, the tongue and lips and the resonances of the nasal passages and sinuses. This is known as the articulatory setting, and here too, established descriptive frameworks are available (Laver 1980; Esling 1994) which rate the voice according to such parameters as ‘hypernasality’, ‘pharyngealisation’, ‘labialisation’, as well as vertical position of the larynx. The third set of parameters relates to aspects of (mainly vowel) articulation which provide clues to the speaker’s geographical and social background. In long-established linguistic communities such as in the United Kingdom and Europe, this part of the analysis can provide very useful information. In a recently established community such as (non-Aboriginal) Australia, the information which can be gleaned is usually quite scanty.
Australian English accents are traditionally classified on a three-point scale as being ‘Broad’, ‘General’ or ‘Cultivated’ (Mitchell & Delbridge 1965), but there are very few features which enable us to pinpoint the speaker’s geographical origins with any accuracy. One or two pronunciations are peculiar to Queensland and another one or two distinguish speakers with a South Australian background. A more recent phenomenon is the “pan-ethnic” accent (sometimes known as “wogspeak”) which has developed among second- and subsequent-generation Australians of non-English-speaking background (Warren 1999). The final component of the auditory analysis is the identification of any idiosyncratic pronunciation features which may be present. The more commonly occurring idiosyncrasies involve the articulation of consonants, and include various types of ‘lisp’, the labialising of ‘r’ (‘rabbit’ becomes something like ‘wabbit’) and the pronunciation of ‘th’ as ‘v’. Apart from this, speakers may exhibit various kinds of dysfluency, including stuttering, ‘cluttering’ and slurring of words.

2.2 Acoustic analysis

In order to carry out an acoustic analysis the recording must be digitised to a computer hard drive or compact disc (a sampling rate of 22.05 kHz and a 16-bit resolution are normally used). The recordings are usually edited so as to contain only the voice of the speaker under investigation. Published recommended minimal sample sizes for forensic speaker comparison range from 15 s to 120 s. With regard to fundamental frequency measurement (F0), one recent review of the forensic phonetic literature concludes: “If the communicative behaviour may be considered ‘normal’, 15-20 sec of speech will be sufficient to calculate speaker F0” (Braun 1995). The analyses described below can be performed using any one of a number of currently available speech analysis software packages.

2.2.1 Fundamental frequency

The rate of vibration of the vocal folds during voiced segments of speech is what the listener perceives as the pitch of the voice. This is known as the fundamental frequency, and is measured in cycles per second or ‘Hertz’ (Hz). Obviously this is capable of variation by the speaker, and indeed this is one of the main ways of conveying both grammatical and emotional meaning in speech.
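As a rough illustration of how such a measurement can be obtained from a digitised recording, the following minimal sketch estimates the fundamental frequency of a single voiced frame by finding the lag at which the frame best matches a shifted copy of itself (a normalised autocorrelation peak). This is only one of several possible methods and is not necessarily the one used by any particular speech analysis package; the function name, the 60-400 Hz search range and the synthetic 105 Hz test tone are assumptions made for the example.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from its normalised autocorrelation peak.

    The search range f0_min..f0_max is an illustrative assumption.
    Returns None if no usable lag is found (e.g. a silent frame).
    """
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]  # remove any DC offset
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(x) - 1)
    best_lag, best_r = None, -1.0
    for lag in range(lag_min, lag_max + 1):
        a = x[:len(x) - lag]   # frame without its last `lag` samples
        b = x[lag:]            # the same frame shifted by `lag` samples
        energy = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))
        if energy == 0.0:
            continue
        r = sum(p * q for p, q in zip(a, b)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag is None:
        return None
    return sample_rate / best_lag


# Synthetic stand-in for a voiced frame: a pure 105 Hz tone sampled at 22.05 kHz.
SAMPLE_RATE = 22050
demo_frame = [math.sin(2 * math.pi * 105.0 * i / SAMPLE_RATE) for i in range(1260)]
demo_f0 = estimate_f0(demo_frame, SAMPLE_RATE)
print(demo_f0)
```

In practice the estimate would be repeated over many voiced frames and summarised as a mean and standard deviation for the speaker; real speech is far less regular than the test tone used here.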

Figure 1: Waveform (above) and pitch contour (below) of the utterance “We went to Woolloomooloo”

Figure 1 shows a waveform and pitch contour for an Australian English sentence. The waveform at the top represents the tiny variations in air pressure caused by the transmission of the sound waves. The bottom trace shows the variation in frequency of those vibrations over time: the fundamental frequency. Each speaker has a particular range of fundamental frequency which s/he habitually uses and within which s/he feels most comfortable and this is an important measure for forensic purposes, because it is one of the few measures for which we know the distribution amongst the adult population at large. The average speaking fundamental frequency for an adult Caucasian male is 113 Hz, and 50% of the male population lie somewhere between 100 and 130 Hz in spontaneous speech (Künzel 1989). The corresponding average for females is 225 Hz. Figure 2 shows how this measure may be used in building up a voice profile. In this case the voice of a person issuing a ransom demand over the telephone is compared with the voices of two suspects (Butcher & Moody 1999). Clearly the fundamental frequency of suspect 1 is much closer to that of the perpetrator than is the fundamental frequency of suspect 2. Furthermore both the perpetrator and suspect 1 differ markedly from the population mean and in the same direction.

Figure 2: Graph of mean fundamental frequency of three speakers in a number of recordings. The vertical lines represent one standard deviation either side of the mean. The dashed line represents the mean for the adult male population.

2.2.2 Long term average spectrum

A spectrum is a plot of energy against frequency. It shows the distribution of energy throughout the frequency range during a very small time ‘slice’ of sound. A long-term spectral energy profile is derived by averaging a large number of spectral slices over a longer sample of speech, thus eliminating information on the details of individual sounds. This is, in theory, the best measure of what we perceive as voice quality and vocal effort, as well as the overall effects of long-term articulatory settings. It is this kind of measure which is used in most automated speaker recognition procedures (Butcher & Moody 1999). Unfortunately, such measures also reflect differences in recording conditions, and often these may be sufficiently large to mask any similarities between speakers. Figure 3 illustrates this problem. In Figure 3a the voice of a suspect recorded via a mobile telephone is compared with unknown voices from four other calls made from the same phone. Clearly there is a high degree of similarity between the voices. In Figure 3b, however, the voice of the suspect is shown under three different conditions: recorded on standard audio cassette via telephone, recorded in free field on microcassette and recorded in free field on VHS-C cassette. In this case the three spectra look quite different – in particular there is a large discrepancy between the telephone recording and the two free-field recordings, which represent the two recording conditions most commonly offered for comparison in the forensic situation. Clearly this measure can only be used in the limited number of situations where the conditions under which recordings have been made can be assumed to be similar.
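The averaging of spectral slices described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular analysis package: the frame length, hop size, Hann window and decibel scaling are assumptions chosen for the example, a deliberately simple (slow) discrete Fourier transform is used so that the code needs only the standard library, and the input is a synthetic 1000 Hz tone rather than speech.

```python
import cmath
import math


def power_spectrum(frame):
    """Power in each non-negative frequency bin of one Hann-windowed frame (naive DFT)."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n))
        spectrum.append(abs(total) ** 2)
    return spectrum


def long_term_average_spectrum(samples, frame_len=64, hop=32):
    """Average the power spectra of overlapping frames; return one dB value per bin."""
    starts = range(0, len(samples) - frame_len + 1, hop)
    frames = [samples[s:s + frame_len] for s in starts]
    if not frames:
        raise ValueError("recording is shorter than one frame")
    totals = [0.0] * (frame_len // 2 + 1)
    for frame in frames:
        for k, power in enumerate(power_spectrum(frame)):
            totals[k] += power
    # A small constant avoids log(0) in bins that contain no energy.
    return [10 * math.log10(t / len(frames) + 1e-12) for t in totals]


# Synthetic stand-in for a recording: a 1000 Hz tone sampled at 8 kHz.
# With 64-sample frames the bins are 125 Hz apart, so the tone sits in bin 8.
RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * i / RATE) for i in range(512)]
ltas = long_term_average_spectrum(tone)
peak_bin = ltas.index(max(ltas))
print(peak_bin, peak_bin * RATE / 64)
```

Because every frame passes through the same recording channel, any filtering imposed by a telephone line or a cassette recorder is averaged into the profile along with the speaker's own characteristics, which is why spectra obtained under different recording conditions cannot be compared directly.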

Figure 3: Long-term average spectra (a) from 5 separate phone calls, allegedly by a single speaker and (b) from the same speaker, recorded under three different conditions

2.2.3 Vowel formant frequencies

When a speaker pronounces a vowel sound, a number of resonances are produced in the vocal tract (the tube formed by the mouth and throat cavities). These are known as formants. The frequencies of the lowest two or three formants change according to the ‘colour’, ‘quality’ or ‘timbre’ of the vowel. Formants can be measured from a sound spectrogram, which is a kind of three-dimensional spectrum. As with the spectrum, the distribution of energy is shown over the frequency range, but in this case we can see how this distribution varies as a function of time. Frequency is shown on the vertical axis and time on the horizontal axis, whilst the amount of energy present is represented by the darkness of the shading. The formants appear as dark horizontal bands, whose vertical position varies according to the nature of the vowel. This is illustrated in Figure 4. For example, if a number of speakers pronounce the short ‘a’ vowel in words such as ‘cat’, ‘bad’, ‘sack’ etc., one might expect to find some small, but consistent differences between speakers, if the sample is large enough, and likewise for each of the other vowels of the language.

Figure 4: A sound spectrogram of the words ‘head, had, hard’, spoken by an adult male. The dark horizontal bands (F1, F2, F3) in the vowels represent areas of higher energy known as FORMANTS.

A useful way of summarising vowel formant frequency data from a given speaker is to plot the mean values of the first formant against the mean values of the second formant for all the vowels. This provides a characteristic pattern or ‘vowel space’ for the speaker, as shown in Figure 5, which is based on data measured from the voice of a murder suspect during interview. In this figure the first formant frequency is shown on the vertical axis and the second formant frequency on the horizontal axis. The origins of the axes are placed in the top right-hand corner, so that the positions of the points on the chart relate approximately to the position of the tongue and jaw: vowels pronounced with a forward position of the tongue and spread lips appear on the left and those with a retracted tongue and rounded lips appear on the right. Vowels with a raised tongue and closed jaw are at the top and vowels with a lower tongue and open jaw at the bottom. The individual letters represent a point positioned at the intersection of the means of the first and second formant frequencies of the vowel in question. The ellipses represent a distance of two standard deviations around the mean for that vowel, i.e. the area which would include 95% of the speaker’s vowels of that type.

Figure 5: Formant plot of short vowels of a suspect in a police interview recording. The phonetic symbols represent a point positioned at the intersection of the mean first and second formant frequencies of the vowel in question and the ellipses represent two standard deviations around the means. From left to right, the symbols represent ‘i’ as in ‘ring’, ‘e’ as in ‘left’, ‘a’ as in ‘that’, ‘u’ as in ‘up’, ‘o’ as in ‘got’, and ‘oo’ as in ‘good’.

Figure 6: Comparison of short vowels from a suspect in a police interview recording with corresponding vowels of a speaker recorded via a listening device. The ellipses are the same as in Figure 5 – i.e. they represent two standard deviations around the means for the suspect’s voice. The phonetic symbols represent individual short vowels from the unknown speaker.

In Figure 6 the same ellipses are superimposed on a set of data points representing the formant frequencies of vowels from an unknown speaker recorded via a listening device. The degree of overlap between the two speakers can be roughly quantified by calculating the proportion of vowel points from the unknown speaker which fall within the appropriate ellipse of the suspect speaker. In this particular diagram, only 50% of the unknown speaker’s vowels fall within the corresponding ellipse of the suspect speaker. Based on this data alone, there would have to be considerable doubt that the speakers are the same.
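The overlap calculation can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: the ellipse is taken to be axis-aligned, with semi-axes of two standard deviations in F1 and F2 (any correlation between the two formants is ignored), and all formant values are invented for the purpose of the example rather than taken from the case described.

```python
from statistics import mean, stdev


def fraction_inside_ellipse(reference_tokens, unknown_tokens, n_sd=2.0):
    """Proportion of unknown (F1, F2) tokens falling inside the reference ellipse.

    The ellipse is centred on the reference means, with semi-axes of
    n_sd standard deviations in F1 and F2 (axis-aligned: an assumption).
    Both arguments are lists of (F1, F2) pairs in Hz for ONE vowel category.
    """
    f1_values = [f1 for f1, _ in reference_tokens]
    f2_values = [f2 for _, f2 in reference_tokens]
    centre_f1, centre_f2 = mean(f1_values), mean(f2_values)
    radius_f1 = n_sd * stdev(f1_values)
    radius_f2 = n_sd * stdev(f2_values)
    inside = sum(
        1
        for f1, f2 in unknown_tokens
        if ((f1 - centre_f1) / radius_f1) ** 2
        + ((f2 - centre_f2) / radius_f2) ** 2 <= 1.0
    )
    return inside / len(unknown_tokens)


# Invented values (Hz) for a single vowel category, for illustration only.
suspect_tokens = [(600, 1800), (620, 1700), (640, 1900), (660, 1750), (680, 1850)]
unknown_tokens = [(650, 1820), (700, 1800), (720, 1800), (640, 2000)]
overlap = fraction_inside_ellipse(suspect_tokens, unknown_tokens)
print(overlap)  # two of the four unknown tokens fall inside the ellipse
```

For a whole vowel system the same calculation would be repeated for each vowel category and the counts pooled, so that each unknown token is tested only against the ellipse for its own vowel.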

Data from a different case are shown in Figures 7 and 8. In these plots the mean frequencies of the vowel sets are compared. In Figure 7 the combined mean values from a perpetrator’s vowels in a number of phone calls are compared with the values for the equivalent vowels spoken by 20 adult male speakers of General Australian English from the Australian National Database of Spoken Language (Millar, Vonwiller, Harrington & Dermody 1994). The two patterns look quite different, and the overall mean difference between the values of the perpetrator and those of this sample of the general population is 12.2%. Figure 8 shows the same set of perpetrator vowels compared with those of a suspect. The degree of similarity between the two patterns appears much greater, and indeed the mean difference between the values for the perpetrator and those for the suspect is 3.3%. Thus the formant frequencies of the perpetrator are considerably closer to those of the suspect than they are to those of the general population. Experience suggests that a variation of 5% or less is of the order expected within a single speaker.
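The text does not spell out exactly how the mean percentage difference is calculated, so the following is only one plausible reading: for each vowel shared by the two data sets, the absolute difference in each mean formant value is expressed as a percentage of the comparison value, and these percentages are averaged over all vowels and both formants. The function name and the formant values are hypothetical.

```python
def mean_percent_difference(means_a, means_b):
    """Mean absolute percentage difference between two sets of vowel formant means.

    Each argument maps a vowel label to its (mean F1, mean F2) in Hz.
    Differences are expressed relative to the values in means_b (an assumption).
    Only vowels present in both sets are compared.
    """
    shared_vowels = sorted(set(means_a) & set(means_b))
    if not shared_vowels:
        raise ValueError("the two data sets share no vowel categories")
    differences = []
    for vowel in shared_vowels:
        for value_a, value_b in zip(means_a[vowel], means_b[vowel]):
            differences.append(abs(value_a - value_b) / value_b * 100.0)
    return sum(differences) / len(differences)


# Invented mean values (Hz), for illustration only.
perpetrator_means = {"i": (330, 2200), "a": (770, 1650)}
comparison_means = {"i": (300, 2000), "a": (700, 1500)}
difference = mean_percent_difference(perpetrator_means, comparison_means)
print(round(difference, 1))  # each of the four values differs by 10%
```

A figure computed in this way could then be set against the rule of thumb given above, that a variation of 5% or less is of the order expected within a single speaker.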

These, then, are the major parameters that may be used to build up a profile of two or more voices for the purposes of forming an opinion as to their overall similarity.

Figure 7: Comparison of vowels from a perpetrator with vowels from the Australian National Database of Spoken Language. Each phonetic symbol represents the mean for that vowel in one of the two sets of data. All means for a given data set are connected by a line.

Figure 8: Comparison of vowels from a perpetrator with vowels from a suspect. Each phonetic symbol represents the mean for that vowel in one of the two sets of data. All means for a given data set are connected by a line.

3. Presenting the evidence

3.1 Problems with ‘Probability’

Having carried out the analyses and formed an opinion, the phonetician must now present the evidence to the court and express his opinions based upon it. The usual expectation of lawyers appears to be that the expert give his opinion in the form of an answer – preferably in numerical terms – to the question “Given the degree of similarity between the speech samples, what is the probability of the two voices belonging to the same speaker?” And the answer that is required is something along the lines of: “Given the high degree of similarity between the two speech samples, there is a very high (90%) probability that these two samples are from the same speaker”. Some expert (but non-phonetician) witnesses in the field appear to be prepared to make statements of this kind. This is, however, highly inappropriate and has no statistical basis. The witness is in fact expressing the probability of a hypothesis, given the evidence. This is not only logically incorrect, but, according to my understanding, also legally incorrect, as this is ultimately the job of the court and not of the expert witness. Essentially what forensic phoneticians have traditionally been asked to do is akin to answering the question “Given that this creature has wings, what are the chances of it being a bird?” The question the expert witness should be answering, however, is the equivalent of “Given that this creature is a bird, what are the chances of it having wings?” Translated back into the real world, this means “If we assume that the two speech samples are from the same speaker, what is the likelihood of them displaying this degree of similarity?” In other words s/he should be expressing the probability of the evidence, given the hypothesis.

3.2 The Likelihood Ratio

Ideally the evidence of the expert witness should be expressed within the framework of Bayesian statistics (Robertson & Vignaux 1995). This means answering a question of the type “How much more likely is this creature to have wings if it were a bird than if it were not a bird?” or in reality “How much more likely is the given degree of similarity between samples if they were by the same speaker than if they were by different speakers?”. This involves the use of a likelihood ratio, which is arrived at in the following way (Rose 2002). The phonetician observes and quantifies a certain degree of similarity (X) between the perpetrator and suspect speech samples. Let’s assume, for the sake of argument, that published research has shown that, with paired samples of X degree of similarity, 85% are from the same speaker. This means that the probability of observing X degree of similarity between samples from the same speaker would be 85% and the probability of finding X degree of similarity between different speakers would be 15%. The likelihood ratio is then 85 divided by 15 or 5.67.

A likelihood ratio greater than 1.0 supports the prosecution hypothesis – i.e. shows that the degree of similarity found between the speech samples is more likely if they were by the same speaker than if they were by different speakers. A likelihood ratio less than 1.0 supports the defence hypothesis – i.e. shows that the degree of similarity found between the speech samples is more likely if they were by different speakers than if they were by the same speaker. The value of the likelihood ratio thus quantifies the strength of the evidence, and likelihood ratios from different areas of forensic evidence can be combined. Each successive likelihood ratio should be evaluated in terms of the degree of confidence in the assertion of guilt before consideration of the evidence in question (the so-called ‘prior odds’) (Robertson & Vignaux 1995).
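The arithmetic of the likelihood ratio, and the way in which it updates the prior odds, can be sketched as follows. The 85% and 15% figures are the hypothetical values used for the sake of argument above; the prior odds of 1 to 4 are likewise invented for the example. Combining several likelihood ratios by simple multiplication assumes that the pieces of evidence are independent of one another, which would need to be justified in any real case.

```python
def likelihood_ratio(p_evidence_if_same, p_evidence_if_different):
    """Probability of the evidence under the same-speaker hypothesis divided by
    its probability under the different-speaker hypothesis."""
    return p_evidence_if_same / p_evidence_if_different


def posterior_odds(prior_odds, *likelihood_ratios):
    """Bayes' rule in odds form: multiply the prior odds by each likelihood ratio.

    Multiplying several ratios together assumes independent items of evidence.
    """
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds


def odds_to_probability(odds):
    """Convert odds in favour of a hypothesis into a probability."""
    return odds / (1.0 + odds)


# Hypothetical figures from the illustration in the text: 85% versus 15%.
lr = likelihood_ratio(0.85, 0.15)
print(round(lr, 2))  # 5.67

# Invented prior odds of 1 to 4 in favour of the same-speaker hypothesis.
odds_after_voice_evidence = posterior_odds(0.25, lr)
probability_after = odds_to_probability(odds_after_voice_evidence)
print(round(probability_after, 3))
```

Note the division of labour this implies: the expert supplies only the likelihood ratio, while the prior odds, and therefore the final probability, remain a matter for the court.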

4. Speaker Identification: where we are now

At the beginning of the previous subsection I used the word “ideally” and indeed the subsequent paragraphs describe the ideal situation. The key sentence is the one beginning “Let’s assume, for the sake of argument, that published research has shown …”. Unfortunately, however, we cannot assume any such thing at this time. Our knowledge of what is ‘normal’ or ‘average’ for the population is severely lacking in most areas and the data that we do have is inevitably limited to the majority population groups – i.e. in the case of Australia to the Anglo-Celtic community – and to somewhat artificial ‘laboratory’ conditions. Furthermore, the statistical modelling of the highly complex variation that occurs in speech is still in its infancy, and is still a long way from being able to cope with the distinction between variation due to speaker differences and variation due to differences in recording conditions – as we have seen, a crucial requirement in the forensic context. Thus statements as to probability made by forensic phoneticians are at this stage limited by these two significant constraints. Whilst every scrap of available quantitative data on the general population will be taken into account, such statements will inevitably rely heavily on the extensive experience and accumulated knowledge of the individual expert.

References

BALDWIN J & FRENCH P (1991) Forensic Phonetics. London & New York: Pinter.

BRAUN A (1995) Fundamental frequency – how speaker-specific is it? In: A BRAUN & J-P KÖSTER (eds) Studies in Forensic Phonetics. Trier: Wissenschaftlicher Verlag Trier, 9-23.

BROEDERS APA & RIETVELD ACM (1995) Speaker identification by earwitness. In: A BRAUN & J-P KÖSTER (eds) Studies in Forensic Phonetics. Trier: Wissenschaftlicher Verlag Trier.

BUTCHER AR & MOODY MP (1999) The case of the ‘third voice’: a rare opportunity for closed set comparison in the forensic context. Paper presented at the Annual Conference of the International Association for Forensic Phonetics, York, England.

ESLING JH (1994) Voice quality. In R.E. Asher & J.M.Y. Simpson (eds) The Encyclopedia of Language and Linguistics. Oxford: Pergamon Press, 4950-4953.

GRUBA JS & POZA FT (1995) Voicegram identification evidence. 54 American Jurisprudence Trials 1.

HOLLIEN H (1990) The Acoustics of Crime. New York & London: Plenum.

HOLLIEN H (2002) Forensic Voice Identification. San Diego: Academic Press.

HOLLIEN H, HUNTLEY RA, KÜNZEL HJ & HOLLIEN PA (1995) Criteria for earwitness lineups. Forensic Linguistics 2, 143-153.

ISSHIKI N & TAKEUCHI Y (1970) Factor analysis of hoarseness. Studia Phonologica 5, 37-44.

KÜNZEL HJ (1989) How well does average fundamental frequency correlate with speaker height and weight? Phonetica 46, 117-125.

LAVER J (1980) The Phonetic Description of Voice Quality. Cambridge: Cambridge University Press.

OATES JM & RUSSELL A (1998) Learning voice analysis using an interactive multi-media package: Development and preliminary evaluation. Journal of Voice 12, 500-512.

MILLAR JB, VONWILLER J, HARRINGTON JM & DERMODY P (1994) The Australian National Database of Spoken Language. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Adelaide, 67-100.

MITCHELL AG & DELBRIDGE A (1965) The pronunciation of English in Australia (revised edition). Sydney: Angus and Robertson.

ROBERTSON B & VIGNAUX GA (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. New York: John Wiley & Sons.

ROSE P (2002) Forensic Speaker Identification. London: Taylor & Francis.

WARREN J (1999) ‘Wogspeak’: transformations of Australian English. Journal of Australian Studies 62, 86-94.

WENDLER J, RAUHUT A & KRÜGER H (1986) Classification of voice qualities. Journal of Phonetics 14, 483-488.