Analysis of Pathological Speech Signals
Análisis de Señales de Voz Patológicas

Tomás Arias Vergara

Tesis doctoral presentada para optar al título de Doctor en Ingeniería Electrónica y de Computación

Directores
Prof. Juan Rafael Orozco Arroyave, Doctor (PhD) en Ingeniería Electrónica y de Computación
Prof. Elmar Nöth, Doctor (PhD) en Ciencias de la Computación
Prof. Maria Schuster, Doctor (PhD) en Medicina Clínica

Universidad de Antioquia
Facultad de Ingeniería
Doctorado en Ingeniería Electrónica y de Computación
Medellín, Antioquia, Colombia

y

Friedrich-Alexander-Universität Erlangen–Nürnberg
Facultad de Ingeniería
Doctorado en Ciencias de la Computación
Erlangen, Alemania

2022

Cita: (Arias-Vergara, 2022)
Referencia (estilo APA 7, 2020): Arias-Vergara, T. (2022). Analysis of Pathological Speech Signals [Tesis doctoral]. Universidad de Antioquia, Medellín, Colombia.
Doctorado en Ingeniería Electrónica y de Computación, Cohorte XVII.
Grupo de Investigación en Telecomunicaciones Aplicadas (GITA)
Centro de Investigaciones Ambientales y de Ingeniería (CIA)
Repositorio Institucional: http://bibliotecadigital.udea.edu.co
Universidad de Antioquia - www.udea.edu.co
Rector: John Jairo Arboleda Céspedes
Decano/Director: Jesús Francisco Vargas Bonilla
Jefe de departamento: Augusto Enrique Salazar Jiménez

El contenido de esta obra corresponde al derecho de expresión de los autores y no compromete el pensamiento institucional de la Universidad de Antioquia ni desata su responsabilidad frente a terceros. Los autores asumen la responsabilidad por los derechos de autor y conexos.
https://co.creativecommons.net/tipos-de-licencias/

Acknowledgments

The development of this thesis would not have been possible without the help and support of many people. I’m very thankful to my family for their support throughout my academic life, especially to my mom for her wisdom and guidance.
She has always given me reasons to keep going, go beyond my limits, and dream for the best. Thanks to my brothers, Matias and Simon, for their invaluable support in many difficult situations that I could never have faced alone. I’m very grateful to my supervisors, Prof. Dr.-Ing. Juan Rafael Orozco-Arroyave, Prof. Dr.-Ing. Elmar Nöth, and Prof. Dr. med. Maria Schuster. I can only offer them my sincere appreciation for all of their advice, encouragement, and the learning opportunities they gave me. Rafa has been my supervisor since I was an undergraduate student. He allowed me to explore the world of academic research. His guidance helped me accomplish several of my career goals and influenced many of my life decisions. Elmar opened the doors of the Pattern Recognition Lab, and from the very first moment I arrived in Germany, he was very supportive academically and personally. I’m very thankful for all the fruitful discussions we had and for his time helping me become a better researcher. I’m thankful to Maria for the trust she put in me to carry on this project. She always encouraged me to get the best out of every task I’ve performed and offered me the best conditions to accomplish my goals. I also want to thank my colleagues from the GITA lab, Camilo, Paula, Parra, Patricia, Orlando, Cristian, Daniel, Lucho, Manuel, and Nicanor. In particular, I’m very thankful to Camilo and Paula. Together, Camilo and I developed new ideas, discussed the results of many experiments, and shared many great moments. Although we went on different paths in the end, I will always be grateful to him for all of his help. I’m also very thankful to Paula. There were difficult moments towards the end of my Ph.D. when she was my only support and the one who gave me the strength to continue given the circumstances. She became my “partner in crime”, and I hope to support her as much as she supported me, now that she has started her own Ph.D.
I would like to thank my colleagues at the Pattern Recognition Lab, Philipp, Sebastian, Tino, Hendrik, and Dalia. They helped me a lot when I arrived in Germany and were always very kind and friendly. I want to express my deep gratitude to Philipp. He has helped me a lot with many technical and practical matters during my stay in Germany. I consider him a close friend, and I’m glad for the opportunity to have worked with him at the Pattern Recognition Lab. And last but not least, I want to thank the patients and volunteers of Fundalianza Parkinson Colombia, the clinic of the Ludwig-Maximilians University in Munich, and the people from the Augustinum retirement home. Without their help and willingness to collaborate in this work, none of this would have been possible. Thanks to them for letting me be part of their group.

Abstract

The present thesis addresses the automatic analysis of speech disorders resulting from Parkinson’s disease and hearing loss. For Parkinson’s disease, the progression of speech symptoms is evaluated considering speech recordings captured over the short term (4 months) and the long term (5 years). Machine learning methods are used to perform three tasks: (1) automatic classification of patients vs. healthy speakers, (2) regression analysis to predict the dysarthria level and the neurological state, and (3) speaker embeddings to analyze the progression of the speech symptoms over time. For hearing loss, automatic acoustic analysis is performed to evaluate whether the duration and onset of deafness (before or after speech acquisition) influence the speech production of cochlear implant users. Additionally, articulation, prosody, and phonemic analyses are performed to show that cochlear implant users present altered speech production even after hearing rehabilitation. Automatic acoustic analysis is performed considering phonation, articulation, prosody, and phonemic features.
Phoneme precision is characterized using the posterior probabilities obtained from recurrent neural networks trained on German and Spanish. The phonemic analysis considers three main dimensions: manner of articulation, place of articulation, and voicing. This thesis also proposes a methodology for automatically detecting the voice onset time of voiceless stop consonants. Furthermore, this thesis studies the acoustic cues that reflect changes in the speech of elderly people due to the aging process. Regression analysis is performed to estimate a person’s age using the phonation, articulation, prosody, and phonemic features. Additionally, the use of smartphones for health care applications is considered here.

Zusammenfassung

Die vorliegende Dissertation befasst sich mit der automatischen Analyse von Sprachstörungen infolge von Parkinson und Hörverlust. Bei der Parkinson-Krankheit wird der Verlauf der Sprachsymptome anhand von Sprachaufzeichnungen bewertet, die kurzfristig (4 Monate) und langfristig (5 Jahre) aufgenommen wurden. Methoden des maschinellen Lernens werden verwendet, um drei Aufgaben zu erfüllen: (1) automatische Klassifikation von Patienten vs. gesunden Sprechern, (2) Regressionsanalyse zur Vorhersage des Dysarthrie-Levels und des neurologischen Zustands und (3) Sprechereinbettungen zur Analyse des Verlaufs der Sprachsymptome im Laufe der Zeit. Bei den Patienten mit Hörverlust wird eine automatische akustische Sprachanalyse durchgeführt, um zu beurteilen, ob die Dauer und das Einsetzen der Taubheit (vor oder nach dem Spracherwerb) die Sprachproduktion von Cochlea-Implantat-Trägern beeinflussen. Darüber hinaus werden Artikulations-, Prosodie- und Phonemanalysen durchgeführt, um zu zeigen, dass Träger von Cochlea-Implantaten auch nach einer Hörrehabilitation eine veränderte Sprachproduktion unterschiedlichen Ausmaßes aufweisen. Für die automatische akustische Analyse werden Phonations-, Artikulations-, Prosodie- und phonemische Merkmale berücksichtigt.
Die Phonempräzision wird durch die Posterior-Wahrscheinlichkeiten charakterisiert, die aus rekurrenten neuronalen Netzen gewonnen werden, die auf Deutsch und Spanisch trainiert wurden. Die phonemische Analyse fokussiert auf drei Hauptdimensionen: Artikulationsart, Artikulationsort und Stimmgebung. Diese Arbeit schlägt auch eine Methodik zur automatischen Erkennung des Stimmeinsatzes nach stimmlosen Stoppkonsonanten vor. Darüber hinaus untersucht diese Arbeit die akustischen sprachlichen Charakteristika, die Veränderungen bei älteren Menschen aufgrund des Alterungsprozesses widerspiegeln. Eine Regressionsanalyse wird durchgeführt, um das Alter einer Person unter Verwendung von Phonations-, Artikulations-, Prosodie- und phonemischen Merkmalen zu schätzen. Darüber hinaus wird hier der Einsatz von Smartphones für Anwendungen im Gesundheitswesen betrachtet.

Resumen

La presente tesis aborda el análisis automático de los trastornos del habla derivados de la enfermedad de Parkinson y la pérdida auditiva. En el caso de la enfermedad de Parkinson, el progreso de los síntomas del habla se evalúa considerando grabaciones de voz capturadas a corto plazo (4 meses) y a largo plazo (5 años). Se utilizan métodos de aprendizaje automático para realizar tres tareas: (1) clasificación automática de pacientes frente a hablantes sanos, (2) análisis de regresión para predecir el nivel de disartria y el estado neurológico, y (3) modelos de hablante para el análisis longitudinal del progreso de los desórdenes en la voz. En el caso de la pérdida auditiva, se realiza un análisis acústico automático para evaluar si la duración y el inicio de la sordera (antes o después de la adquisición del habla) influyen en la producción del habla de los usuarios de implantes cocleares. Además, se realizan análisis de articulación, prosodia y fonémicos para demostrar que los usuarios de implantes cocleares presentan una producción del habla alterada incluso después de la rehabilitación auditiva.
El análisis acústico automático se realiza considerando características de fonación, articulación, prosodia y características fonémicas. La precisión en la producción de fonemas se caracteriza mediante las probabilidades posteriores obtenidas de redes neuronales recurrentes entrenadas en alemán y español. El análisis fonémico considera tres dimensiones principales: modo de articulación, lugar de articulación y sonorización. Esta tesis también propone una metodología para la detección automática del tiempo de inicio de la voz en consonantes oclusivas sordas. Además, en este trabajo se analiza la influencia de la edad en el análisis acústico. El análisis de regresión se realiza para estimar la edad de una persona utilizando las características de fonación, articulación, prosodia y fonémicas. También, en esta tesis se considera el uso de smartphones para aplicaciones en el sector médico.

Contents

1 Introduction
  1.1 Motivation
  1.2 Speech disorders in selected populations
    1.2.1 Parkinson’s disease
    1.2.2 Hearing loss
    1.2.3 Aging
  1.3 Hypotheses
  1.4 Objectives
    1.4.1 General objective
    1.4.2 Specific objectives
  1.5 Contribution of this thesis
  1.6 Structure of the thesis
2 Speech production process
  2.1 Speech chain
    2.1.1 Physiological processes of speech production
  2.2 Impact of Parkinson’s disease on speech motor control
    2.2.1 Neuropathophysiology of motor control related to Parkinson’s disease
    2.2.2 Motor speech disorders in Parkinson’s disease
  2.3 Auditory system and speech control
    2.3.1 Overview of the auditory system
    2.3.2 Cochlear implants (CIs)
    2.3.3 Auditory feedback and speech control
3 State-of-the-art
  3.1 Severity estimation of Parkinson’s disease from speech
  3.2 Speech analysis of cochlear implant users
  3.3 Aging and speech
  3.4 Smartphone-based applications for health care
    3.4.1 Applications for Parkinson’s disease
    3.4.2 Applications for hearing loss
4 Automatic analysis of pathological speech signals
  4.1 Speech processing techniques: an overview
    4.1.1 Short-time analysis
    4.1.2 Time-frequency analysis
    4.1.3 Filterbank analysis
    4.1.4 Voice Activity Detection
  4.2 Pathological speech modeling
    4.2.1 Phonation analysis
    4.2.2 Articulation analysis
    4.2.3 Phonemic analysis
    4.2.4 Prosody analysis
  4.3 Machine learning methods
    4.3.1 Support Vector Machine for classification
    4.3.2 Support Vector Machine for regression
    4.3.3 Neural Networks
    4.3.4 Convolutional Neural Networks
    4.3.5 Recurrent Neural Networks
  4.4 Speaker models
    4.4.1 Gaussian Mixture Models
    4.4.2 i–vectors
    4.4.3 x–vectors
5 Data collection
  5.1 Parkinson’s disease
    5.1.1 PCGITA (Spanish)
    5.1.2 PD At-home (Spanish)
    5.1.3 PD Longitudinal (Spanish)
    5.1.4 Apkinson (Spanish)
  5.2 Cochlear implants
    5.2.1 LMU TAPAS (German)
    5.2.2 LMU Onset (German)
  5.3 Supporting datasets
    5.3.1 Young healthy controls (Spanish)
    5.3.2 PhonDat 1 Corpus (German)
    5.3.3 Verbmobil subset (German)
    5.3.4 TEDx Spanish Corpus - TSC (Spanish)
6 Experiments and results
  6.1 Models for speech analysis
    6.1.1 Phoneme posterior probabilities
    6.1.2 Automatic detection of voice onset time
  6.2 Parkinson’s disease patients
    6.2.1 Automatic methods for the assessment of PD from speech
    6.2.2 Speaker embeddings to monitor Parkinson’s disease
  6.3 Cochlear Implant users
    6.3.1 Quantification of phoneme precision to evaluate onset and duration of deafness
    6.3.2 Segmental and suprasegmental speech analysis of postlingually deafened CI users
  6.4 Aging and speech
  6.5 Smartphone-based applications for health care
    6.5.1 Apkinson
    6.5.2 Cochlear Implant Testing App - CITA
7 Summary
  7.1 Automatic methods for speech analysis
  7.2 Parkinson’s disease patients
  7.3 Cochlear implant users
  7.4 Aging and speech
  7.5 Smartphone-based applications for health care
Appendices
A Speech tasks
  A.1 Spanish speech protocol
    A.1.1 Vowel phonation
    A.1.2 Sentences
    A.1.3 Read text
    A.1.4 Speech diadochokinesia
    A.1.5 Monologue
  A.2 German speech protocol
    A.2.1 Read text
    A.2.2 Rhino sentences
    A.2.3 PLAKSS words
B Publications
  B.1 Journal publications
  B.2 Conference publications
List of Figures
List of Tables
Acronyms
Bibliography

Chapter 1

Introduction

1.1 Motivation

Oral communication of adults and children can be affected by developmental or acquired speech disorders resulting from motor/neurological impairments (e.g., brain injuries, Parkinson’s disease) or sensory/perceptual disorders (e.g., hearing loss)[1]. On the one hand, neurological diseases such as Parkinson’s disease (PD) affect certain regions of the brain and the muscles involved in the speech production process, leading to different motor speech impairments such as imprecise articulation, a slower speaking rate, monotonous speech, and a hoarse voice quality, among others (Ho et al., 1999; Trail et al., 2005).
On the other hand, perceptual disorders such as sensorineural hearing loss cause decreased speech intelligibility, changes in phoneme articulation, abnormal nasalization, a slower speaking rate, and decreased variability of the fundamental frequency (Hudgins and Numbers, 1942; Langereis et al., 1997; Leder et al., 1987). One of the aims of pathological speech processing is the development of technology to support the diagnosis and monitoring of different medical conditions through speech (Gupta et al., 2016). This thesis focuses on the automatic acoustic analysis of speech signals captured from PD patients and people with hearing loss. Furthermore, as the speech of elderly people changes due to the aging process, a clinical condition, or both, the description of acoustic cues in the speech that reflect such differences is a topic that deserves special attention.

PD is a neurodegenerative disease characterized by the progressive loss of dopaminergic neurons in the substantia nigra of the midbrain (Hornykiewicz, 1998). The primary motor symptoms of PD include tremor, slowness, rigidity of the limbs and trunk, postural instability, swallowing disorders, and speech impairments. Many of the symptoms are controlled with medication; however, there is no clear evidence indicating positive effects of those treatments on the speech impairments (Skodda et al., 2010), although there is evidence showing that speech therapy combined with pharmacological treatment improves the communication ability of PD patients (Schultz and Grant, 2000). The evaluation of PD requires the patient to be present at the clinic, which is time-consuming and expensive for both the patient and the healthcare system (Yang et al., 2020). Continuous monitoring of PD patients could therefore help to make timely decisions regarding their medication and therapy.

[1] www.asha.org/Practice-Portal/Clinical-Topics/Articulation-and-Phonology
In the case of hearing loss, there are different treatments available for the different types and degrees of deafness. A cochlear implant (CI) is the most suitable device for severe and profound deafness, when hearing aids do not sufficiently improve speech perception. A CI uses a sound processor to capture audio signals and send them to a receiver implanted under the skin behind the ear. The receiver transforms the signal into electrical impulses, which are sent to electrodes implanted in the cochlea. However, CI users often present altered speech production and limited speech understanding even after hearing rehabilitation. Thus, if the speech deficits were better known, the rehabilitation could be addressed more appropriately (Pomaville and Kladopoulos, 2013). CI users require assistance before, during, and after surgery from audiologists, medical specialists in otorhinolaryngology, and speech-language pathologists[2]; however, speech production quality is seldom assessed in outcome evaluations. Including speech technology could therefore lead to a reliable outcome evaluation that contributes to rehabilitation success.

[2] www.asha.org/Practice-Portal/Professional-Issues/Cochlear-Implants/

This thesis addresses the automatic evaluation of speech production from PD patients and CI users by combining signal processing techniques with machine learning methods. Such methods are also considered to analyze the effect of age as another possible source of changes in speech production. Additionally, since the use of smartphones for health care has become more frequent, some of the speech processing techniques addressed in this thesis are implemented in Android-based applications.

1.2 Speech disorders in selected populations

1.2.1 Parkinson’s disease

Clinical diagnosis

Parkinson’s disease is characterized by a combination of symptoms regarding motor control. Moreover, next to the motor symptoms, other symptoms such as mood changes, cognitive decline, and
sleep disorders might occur (Poewe, 2008). There is no standard method to diagnose PD. Doctors rely on the clinical history and a physical examination to assess the patients. Additionally, the severity of the disease is evaluated by expert neurologists using different scales such as the Movement Disorder Society–Unified Parkinson Disease Rating Scale (MDS-UPDRS) (Goetz et al., 2008). This is a perceptual scale used to assess the motor and non-motor abilities of the patients, with 65 items distributed in four sections:

• Section 1 (MDS-UPDRS-I, 13 items) concerns the non-motor experiences of daily living, such as cognitive impairment, depressed mood, and fatigue.
• Section 2 (MDS-UPDRS-II, 13 items) considers motor experiences of daily living, such as eating, dressing, handwriting, and tremor.
• Section 3 (MDS-UPDRS-III, 33 items) is used to evaluate the motor capabilities of the patient, including speech production, upper/lower limb movement, postural stability, and gait.
• Section 4 (MDS-UPDRS-IV, 6 items) concerns motor complications such as the time spent without medication (OFF state) and the time spent with dyskinesia (involuntary movements), among others.

Speech production is evaluated by the neurologist during the patient’s visit to the clinic. The patients are asked to talk about different subjects in order to assess several aspects, including speech volume, intelligibility, and modulation of words, among others. The speech item of the MDS-UPDRS scale considers the following categories for the evaluation (Table 1.1):

Table 1.1: Speech scoring system from the MDS-UPDRS-III.
Score | Category | Definition
0 | Normal | No speech problems
1 | Slight | Loss of voice intensity or modulation
2 | Mild | Some words are unclear
3 | Moderate | Speech is difficult to understand
4 | Severe | Speech is unintelligible

The MDS-UPDRS-III also includes the Hoehn & Yahr (H&Y) scale, which comprises a set of five severity levels, where 1 is associated with minimal or no functional disability and 5 is assigned to patients who are confined to bed or a wheelchair unless aided. There are two variants of the scale: the original one, with integer values for the stages from 1 to 5, and a modified one, with the addition of stages 1.5 and 2.5 for a total of seven severity levels (Hoehn et al., 1998).

The MDS-UPDRS scale is suitable for assessing the neurological state of the patients. However, speech production is evaluated in only one item. Given the complexity of speech, a single item summarizing different aspects such as voice, articulation, fluency, intonation, speaking rate, and intelligibility is not sufficient. The symptoms of motor speech disorders caused by PD are often associated with hypokinetic dysarthria, resulting from problems controlling the muscles and articulators involved in the speech production process. A more suitable clinical scale to evaluate speech impairments is the Frenchay Dysarthria Assessment–2 (FDA–2) (Enderby and Palmer, 2008), a perceptual scale used to evaluate dysarthria considering 34 items distributed in eight sections. Table 1.2 shows the aspects considered in the FDA–2 scale. The patients are asked to perform different tasks in each section. The category Complementary refers to factors that might influence speech production. All sections (excluding Complementary) are rated on a 9-point scale.

Table 1.2: List of items evaluated in the FDA–2 scale.
Category | Items
Reflexes | Cough, swallow, dribble/drool
Respiration | At rest, in speech
Lips | At rest, spread, seal, alternate, in speech
Palate | Fluids, maintenance, in speech
Laryngeal | Time, pitch, volume, in speech
Tongue | At rest, protrusion, elevation, lateral, alternate, in speech
Intelligibility | Producing words, sentences, conversation
Complementary | Hearing, sight, teeth, language, mood, posture, speech rate, sensation (upper lip and tongue tip)

A modified version of the FDA–2 scale, the mFDA, was proposed by Orozco-Arroyave et al. (2018) and was designed to be applied considering only the speech recordings of the patient; therefore, the patient is not required to visit the clinic for assessment. The mFDA is administered considering different speech tasks, including sustained phonation of the vowel /a/, reading, monologues, and the alternating and sequential production of the syllables /pa-ta-ka/, /pa-ka-ta/, /pe-ta-ka/, /pa/, /ta/, and /ka/. The scale has a total of 13 items, each ranging from 0 (normal or completely healthy) to 4 (very impaired); thus, the total mFDA score ranges from 0 to 52. Table 1.3 shows the details of the mFDA scale.

Table 1.3: List of items evaluated in the mFDA scale.
Category | Item | Speech task
Respiration | Duration of the recording | Sustained phonation of the vowel /a/
Respiration | Breathing capacity | Multiple repetitions of /pa-ta-ka/, /pa-ka-ta/, /pe-ta-ka/
Lips | Strength of lip closure | Multiple repetitions of the syllable /pa/
Lips | Lip control | Reading, monologue
Palate | Nasality | Reading, monologue
Palate | Velar movement | Multiple repetitions of the syllable /ka/
Larynx | Phonatory capability 1 | Sustained phonation of the vowel /a/
Larynx | Phonatory capability 2 | Reading, monologue
Larynx | Monotonicity | Reading, monologue
Larynx | Effort to produce speech | Reading, monologue
Tongue | Velocity to move the tongue 1 | Multiple repetitions of /pa-ta-ka/ and /pa-ka-ta/
Tongue | Velocity to move the tongue 2 | Multiple repetitions of the syllable /ta/
Intelligibility | Speech intelligibility | Reading, monologue

The main limitation of the MDS-UPDRS and the mFDA is their lack of precision, since the severity of the disease is evaluated based on a perceptual score that depends on the experience of the clinician.

Speech production

PD affects the speech of the patients in different ways. For instance, stability and periodicity problems are caused by an inadequate closing of the vocal folds, which is related to muscle rigidity (Hanson et al., 1984). Thus, perturbations in the vibration of the vocal folds can be measured by estimating features based on the fundamental frequency (F0) from the sustained phonation of vowels (Almeida et al., 2019; Skodda et al., 2013; Tsanas et al., 2010). Articulation deficits are mainly related to a reduced amplitude and velocity of lip, tongue, and jaw movements, causing a reduced articulatory capability of PD patients to produce vowels and continuous speech (Ackermann and Ziegler, 1991; Skodda et al., 2011).
Such reduction can be measured by computing the triangular Vowel Space Area (tVSA) formed with the formant frequencies F1 and F2 extracted from the vowels /a/, /i/, and /u/, while articulation problems in continuous speech can be detected by analyzing the transitions from voiced to voiceless sounds (and vice versa) and computing spectral features such as the Mel-Frequency Cepstral Coefficients (MFCCs) (Orozco-Arroyave, 2016; Skodda et al., 2011). PD can also influence speech at the segmental level (individual sounds/phonemes) and the suprasegmental level (speech prosody). For instance, at the segmental level, some studies have found that the difficulties of PD patients in controlling laryngeal movements affect the production of stop consonants, e.g., /p/, /t/, /k/, /b/, /d/, /g/ (Fischer and Goberman, 2010). Such difficulties are typically measured by means of the Voice Onset Time (VOT), which is defined as the time interval between the initial burst of a stop consonant and the onset of voicing of the following vowel. The VOT durations produced by patients often differ from those of age-matched healthy speakers (Argüello-Vélez et al., 2020; Montaña et al., 2018; Novotný et al., 2015; Tykalova et al., 2017). Speech deficits at the segmental level can also be detected by estimating the probability of occurrence of phonemes in a speech sequence (phoneme posterior probabilities), which can be achieved by training a deep neural network to learn the representation of several phoneme classes grouped according to different phonological rules (Cernak et al., 2015; Vásquez-Correa et al., 2019). Suprasegmental speech deficits include variation in intonation, reduced loudness, and a variable speech rate, among others (Jones, 2009). These deficits can be measured by means of the F0 contour, the energy content of the signal, and the number of speech units (words, voiced segments) produced by the speakers.
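As an illustration, the tVSA described above is simply the area of the triangle whose vertices are the (F1, F2) pairs of the three corner vowels, which the shoelace formula gives directly. The formant values below are hypothetical placeholders, not measurements from the corpora used in this thesis:

```python
# Sketch: triangular Vowel Space Area (tVSA) from the corner vowels /a/, /i/, /u/.
# The formant values are illustrative placeholders only.

def tvsa(f_a, f_i, f_u):
    """Area (Hz^2) of the triangle spanned by the (F1, F2) pairs of /a/, /i/, /u/,
    computed with the shoelace formula."""
    (x1, y1), (x2, y2), (x3, y3) = f_a, f_i, f_u
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

# (F1, F2) in Hz for each corner vowel -- hypothetical speaker
corners = {"a": (750.0, 1300.0), "i": (300.0, 2300.0), "u": (350.0, 800.0)}
area = tvsa(corners["a"], corners["i"], corners["u"])
print(f"tVSA = {area:.0f} Hz^2")  # tVSA = 312500 Hz^2
```

A reduced tVSA (a smaller triangle) would then reflect the centralized vowel production reported for PD patients.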
Chapter 2 contains more details about the relationship between PD and the speech production system.

1.2.2 Hearing loss

Clinical diagnosis

Hearing loss can appear due to various causes such as aging (senescence), trauma, or inflammation, among others, and often without a known cause. Hearing loss can be acquired or it can be congenital, e.g., because of genetic alterations, intrauterine infections, or malformations. The treatment for hearing loss depends on its severity and cause. The grade of the impairment can be categorized as normal, mild, moderate, severe, or profound depending on audiometry descriptors. Such descriptors are usually obtained by a pure-tone audiometry test, which consists of a threshold search performed by reproducing sinusoidal waveforms (through speakers or headphones) at different frequencies (125 Hz, 250 Hz, 500 Hz, and from 1000 Hz to 8000 Hz in steps of 1000 Hz) and intensity levels. The patient is asked to indicate whether the sounds are perceived by raising a hand or pressing a button. Figure 1.1 shows an audiogram indicating the degree of hearing loss for different loudness and frequency values. For instance, a person that can only hear sounds between 40 dB and 60 dB might suffer from moderate hearing loss.

Figure 1.1: Audiogram indicating the degree and type of hearing loss for different loudness and frequency values. The hearing thresholds correspond to the range of values adopted by the World Health Organization (Olusanya et al., 2019).

Although the pure-tone audiometry test provides useful information about the hearing status of a person, expert clinicians do not rely solely on such a test to determine the adequate treatment of the patient. Treatment options are provided to the patient depending on the type of hearing loss, which can be conductive, sensorineural, or a mixture of both (Weber and Klein, 1999).
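The mapping from a pure-tone threshold to a severity grade can be sketched as a simple lookup. The cut-off values below are approximate and for illustration only; clinical grading follows standardized tables such as the WHO ranges referenced above:

```python
def hearing_loss_grade(threshold_db):
    """Map a pure-tone hearing threshold (dB HL) to a coarse severity
    grade. Cut-offs are approximate, illustrative values only."""
    if threshold_db <= 25:
        return "normal"
    elif threshold_db <= 40:
        return "mild"
    elif threshold_db <= 60:
        return "moderate"
    elif threshold_db <= 80:
        return "severe"
    return "profound"

# A person who only hears sounds above ~50 dB falls in the moderate range.
print(hearing_loss_grade(50))  # moderate
```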
On the one hand, conductive hearing loss occurs due to damage produced in the outer or middle ear or due to a malformation (e.g., of the ear canal or middle ear), causing the person to perceive sounds at low intensity levels. Usually, hearing aids can be used as a treatment option because they amplify the sounds to improve audio perception. Some types of conductive hearing loss can also be treated with medication or surgery. On the other hand, sensorineural hearing loss is related to disorders in the inner ear (cochlea) or the auditory nerve system, resulting in disabling hearing impairment. Usually, therapy consists of the amplification of sounds by hearing aids, which are adapted to the hearing loss at different frequencies in the hearing range. In cases of more profound hearing loss and deafness (in the following summarized as deafness), amplification of sounds is not enough to provide sufficient hearing for speech perception. In this case, CIs are the most suitable devices for treatment. Contrary to hearing aids, a CI bypasses the damaged portions of the ear and directly stimulates the auditory nerve. In the cochlea, frequencies are arranged from high frequencies at the base to low frequencies at the apex. The implant inserted in the cochlea follows this natural representation of the sounds, called “tonotopy”, and stimulates the nerves that correspond to the region of excitation. Although hearing with a CI is quite different from normal hearing, speech understanding can be restored (Lenarz, 2017; Pisoni et al., 2017). Regarding the outcome after cochlear implantation, some aspects need to be considered. The time of occurrence of sensorineural hearing loss also affects the speech perception and production of CI users.
On the one hand, prelingual onset of deafness refers to people who lost their hearing capability before the acquisition of spoken language; their speech production is affected because they have never monitored their own speech (Smith, 1975). On the other hand, postlingual onset of deafness refers to people who lost their hearing after speech acquisition; nevertheless, their speech production might still be affected by the lack of sufficient and stable auditory feedback (Leder and Spitzer, 1990).

Speech production

People suffering from severe/profound deafness may experience different speech production disorders. At the segmental level, such disorders include voicing errors, phoneme misarticulation, and vowel errors, among others (Gold, 1980; Waldstein, 1990). Voicing errors might be caused by failed attempts to coordinate respiration, phonation (voicing), and articulation, resulting in a confusion of the voiced-voiceless distinction. Thus, similar to PD patients, voicing errors can be detected by automatic extraction of voiced sounds, i.e., speech segments with F0 values different from zero. Phoneme production errors are caused by different reasons. For instance, the studies reviewed by Osberger and McGarr (1982) revealed a general trend of hearing-impaired people to better produce the most visible phonemes, e.g., phonemes produced with the lips and/or teeth. Consonant errors can also occur due to incorrect timing of the articulators, e.g., causing nasalization of non-nasal speech sounds due to improper velar control (Kato and Yoshino, 1988; Stevens et al., 1976). Such phoneme articulation errors might cause decreased speech intelligibility, which can be evaluated with Automatic Speech Recognition (ASR) systems, phoneme posterior probabilities, among others. At the suprasegmental level, the speech of severely and profoundly hearing-impaired speakers also exhibits deviations from normal speech in timing and voice quality.
On the one hand, people suffering from hearing loss have been reported to speak slower than healthy people due to the prolongation of speech and non-speech segments (consonants, vowels, pauses), and the insertion of pauses within sentences (Oster, 1990). On the other hand, voice quality problems include abnormally high F0 values (particularly in adolescent and adult males) and insufficient or excessive variation of F0 within a sentence (Gold, 1980). Thus, similar to the speech of PD patients, some of the suprasegmental aspects of speech can be evaluated by computing F0-related features, duration, speech rate, and energy, among others. Chapter 2 contains more details about the role of auditory feedback in the speech production system.

1.2.3 Aging

The speech of the elderly is sometimes described as “slurred”, which comprises slight changes in voicing, articulation, and prosody. The changes in organs and tissues involved in voice production which are associated with the aging process include facial skeleton growth (Israel, 1973), pharyngeal muscle atrophy (Zaino and Benventano, 1977), tooth loss (Adams, 1991), reduced mobility of the jaw (Kahane, 1981), tongue musculature atrophy, and weakening of the pharyngeal musculature. The precise nature of vocal resonance changes is unclear; however, a consistent pattern seems to be a lengthening of the vocal tract with age (Linville, 1996). These changes alter the phonation and articulation dimensions of speech; for instance, elderly people exhibit significantly greater frequency perturbation than young speakers (Benjamin, 1981). There are also differences in the stability of F0 and in the amplitude of vocal fold vibration relative to young and middle-aged adults (Xue and Deliyski, 2001). Changes in F0 and the formant frequencies have also been observed in longitudinal analyses.
Particularly, changes in the first formant frequency are believed to compensate for the decline of F0 in order to maintain the auditory distance between F0 and F1 (Reubold et al., 2010). The influence of some of these parameters on speech assessment has been addressed before when measuring speech intelligibility by considering an Automatic Speech Recognition (ASR) system. In the experiments performed by Vipperla et al. (2010) on adult and older voices, the authors found that elderly people show increased jitter and shimmer, and that these variations have an impact on average phoneme recognition.

1.3 Hypotheses

Since different factors influencing speech production are considered in this thesis, the following hypotheses are investigated:

• It is possible to evaluate the speech production of PD patients, CI users, and elderly speakers using similar signal processing techniques.
• Since PD is a progressive disease that also affects speech, it is possible to assess the progression per patient from speech signals captured in different recording sessions.
• The duration and onset of deafness influence the speech production of CI users in different ways; thus, automatic acoustic analysis can be used to detect these changes.
• It is possible to use smartphone applications to evaluate the speech production of PD patients and CI users.
• Aging affects different aspects of speech production, and such changes can be captured by most of the features considered to analyze pathological speech.

1.4 Objectives

1.4.1 General objective

To propose a methodology for the monitoring of pathological speech signals combining different signal processing techniques and machine learning methods.

1.4.2 Specific objectives

• To identify the contribution of different speech dimensions for the automatic assessment of pathological speech signals.
• To analyze and select the most suitable features to detect changes in pathological speech signals.
• To combine different speech processing techniques and machine learning methods for the automatic assessment of pathological speech signals.

1.5 Contribution of this thesis

• Collection of a speech corpus from PD patients and CI users. The recordings were captured in clinical settings and at the patients’ homes using smartphones.
• A methodology for the automatic detection of VOT in voiceless stop sounds using a deep neural network approach.
• A methodology to monitor the progression of PD patients over time using automatic acoustic analysis.
• A methodology to quantify the phoneme production of CI users using a deep neural network approach.
• A methodology to evaluate the impact of age on different acoustic measurements.
• Implementation of signal processing techniques on smartphones to evaluate the speech production of PD patients and CI users.
• Participation in the development of the mobile application Apkinson, used to collect speech and movement data from PD patients.
• Participation in the development of the mobile application CITA (Cochlear Implant Testing App), which is intended to collect data from CI users in order to evaluate the speech perception and production of the patients. The source code of CITA is based on Apkinson.

1.6 Structure of the thesis

Chapter 2 includes information about the physiological processes of speech production, the influence of PD on speech motor control, and speech disorders associated with the disease. This chapter also gives an overview of the auditory system, cochlear implants, and the role of auditory feedback in speech motor control. Chapter 3 includes information about the contributions in the state-of-the-art methods related to predicting the severity of PD from speech signals, automatic methods used for the analysis of speech production in CI users, and smartphone-based applications developed to evaluate PD and hearing loss.
Chapter 4 describes the speech processing techniques and acoustic features used to model speech disorders. Additionally, this chapter includes the machine learning methods used in this thesis for classification, regression analysis, and speaker modeling. Chapter 5 includes details about the PD patients, CI users, and healthy speakers considered in this thesis. Additional databases used to support the training of models used for automatic speech analysis are also described. Chapter 6 includes the experiments and results obtained for the automatic analysis of PD patients and CI users from speech signals, and the effect of aging on speech production. Chapter 7 summarizes the addressed aspects of pathological speech analysis.

Chapter 2 Speech production process

2.1 Speech chain

In the speech chain model described by Denes and Pinson (1993), oral communication consists of a sequence of events happening on three levels: linguistic, physiological, and acoustic. The process to produce intelligible speech starts in the speaker’s brain, at the linguistic level (Figure 2.1). First, the speaker collects his/her thoughts, decides what words to say, and arranges these words to form sentences according to language-dependent rules. The speech production process continues at the physiological level, with the neural activity inside the brain sending the necessary instructions to activate the muscles that control the vocal folds, tongue, lips, and jaw, among others. The speech production process is completed at the acoustic level, where the movements of the vocal muscles (combined with the air coming from the lungs) generate speech sound waves. Once the speech is produced, it travels through the air, activating the hearing mechanism of the listeners. The auditory feedback plays a key role in oral communication because it helps the speakers to continuously monitor the quality and intelligibility of their own speech.
2.1.1 Physiological processes of speech production

In general, the speech production process involves the complex coordination and activation of different muscles and limbs in the respiratory, laryngeal, and oral motor systems. The respiratory system is essential to produce speech by generating air pressure from the lungs during the expiratory and inspiratory phases. The airflow passes a small valve, the glottis, which is formed by the two vocal folds. During respiration, the vocal folds are in a lateral position. During phonation, the vocal folds close, resulting in vibrations of the soft mucosal tissue as a result of the subglottal pressure and the airflow passing through the glottis (Van den Berg, 1958).

Figure 2.1: The speech production process starts in the brain, at the linguistic level, continues with the neural and motor activity at the physiological level, and is completed with the generation and transmission of sound waves at the acoustic level. The auditory feedback allows the speakers to monitor their own speech. Based on Denes and Pinson (1993).

During oscillation, the vocal folds convert the air into a rapid sequence of airflow pulses generating audible sounds (voice source sounds), which are perceived as a buzz whose frequency is proportional to the vibration rate. During the production of the airflow pulses, the vocal folds pass through four main stages: closed, opening, open, and closing (Figure 2.2). Speech sounds produced in this way are commonly known as voiced sounds. If the vocal folds remain open, then the source of energy for speech production is a stable stream of air coming from the lungs which is made audible by other articulator(s) at some place in the vocal tract. The speech sounds that are not produced by vibration of the vocal folds are commonly known as unvoiced sounds.
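The voiced/unvoiced distinction described above is often approximated computationally on short-time frames: voiced frames show relatively high energy and a low zero-crossing rate, while unvoiced frames tend to show the opposite. A minimal sketch follows; the thresholds are illustrative assumptions, not validated values:

```python
import numpy as np

def classify_frames(signal, sr, frame_ms=25, energy_thr=0.01, zcr_thr=0.25):
    """Label each frame 'voiced' or 'unvoiced' using short-time energy
    and zero-crossing rate. Thresholds are illustrative assumptions."""
    n = int(sr * frame_ms / 1000)
    labels = []
    for start in range(0, len(signal) - n + 1, n):
        frame = signal[start:start + n]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        labels.append("voiced" if energy > energy_thr and zcr < zcr_thr
                      else "unvoiced")
    return labels

# Synthetic check: a 100 Hz tone (voiced-like) followed by weak white
# noise (unvoiced-like), half a second each at 16 kHz.
sr = 16000
t = np.arange(sr // 2) / sr
tone = 0.5 * np.sin(2 * np.pi * 100 * t)
noise = 0.05 * np.random.default_rng(0).standard_normal(sr // 2)
labels = classify_frames(np.concatenate([tone, noise]), sr)
```

Real systems typically rely on a pitch tracker rather than fixed thresholds, but the underlying intuition is the same.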
The oral motor system includes the articulatory mechanisms necessary to modulate the voice source, which allows us to produce speech sounds with different acoustic properties. Such properties depend on the shape of the vocal tract, which can be modified by moving the principal articulators, namely the tongue, lower jaw, lips, and velum. The oral motor system also includes the nasal, oral, and pharyngeal cavities, which act as resonance chambers to transform the stream of air into sounds with additional acoustic characteristics (Benesty et al., 2007; Denes and Pinson, 1993; Fant, 1980). Figure 2.2 shows a diagram of the main articulators and resonators (oral, nasal, and pharyngeal cavities) involved in the speech production process. The air coming from the lungs is the source to generate speech sounds. The muscles in the larynx act as a valve to control the air stream coming from the lungs. The coordination and movements of the different articulators together with the nasal, oral, and pharyngeal cavities provide the acoustic properties necessary to generate different speech sounds.

Figure 2.2: Schematic views of the speech production system. (Left) Vocal fold vibration pattern during the production of voiced speech segments. (Right) Components of the vocal tract used to produce speech sounds. Based on Benesty et al. (2007) and Denes and Pinson (1993).

For instance, the vowel /a/ is commonly produced by a combination of tongue, jaw, and vocal fold movements. The vibration of the vocal folds creates the voice source sound, which is then modulated by opening the mouth (lowering of the jaw) and holding the tongue in a low position. Another example is the production of plosive sounds such as /p/, which is produced by blocking (for a short period of time) the air stream with the lips, building enough air pressure to produce the sound when the closure is released. Generally, the vocal folds remain open when producing the consonant /p/.
Nasal cavities are also used to generate speech sounds. For instance, the nasal consonants /n/ and /m/ are produced during vibration of the vocal folds by blocking the air stream in the oral cavity with the lips (in the case of /m/) or the tip of the tongue (in the case of /n/). Additionally, the velum partially blocks the air to the oral cavity and routes it to the nasal cavity.

2.2 Impact of Parkinson’s disease on speech motor control

2.2.1 Neuropathophysiology of motor control related to Parkinson’s disease

Motor deficits in PD can be analyzed by considering the interaction of the basal ganglia, the motor cortex, and the thalamus (Figure 2.3). The basal ganglia are a group of neural formations (subcortical structures) including the striatum (putamen and caudate nucleus), the Globus Pallidus with its internal (GPi) and external (GPe) segments, the subthalamic nucleus (STN), and the substantia nigra pars compacta (SNpc) and pars reticulata (SNpr). Anatomically, the STN belongs to the subthalamus and the substantia nigra to the midbrain; however, they play a key role in the functioning of the basal ganglia. Motor impairments in PD are mainly caused by a degeneration of dopaminergic neurons in the SNpc, located in the midbrain. The main function of the subcortical structures in the basal ganglia is to send signals to the thalamus, which then influence the activity in the motor cortex. This interaction can be analyzed considering the most basic circuit model of the basal ganglia proposed by Albin et al. (1989) more than 30 years ago. Although more complex connections in the basal ganglia have been discovered since then (Bostan and Strick, 2018; Redgrave et al., 2010), the basic model proposed in the late 80s is still valid to understand some of the most important aspects of motor control related to PD (Milardi et al., 2019).
Figure 2.4 shows a diagram of the neural circuits and neurotransmission mechanisms involved in the communication between the cerebral cortex and the basal ganglia. Basically, the circuit model involves two main parallel loops:

1. The first loop is a cortex-to-cortex circuit in which the motor cortex sends signals to the striatum, from which neural projections travel to the globus pallidus and then continue to the thalamus, which in turn sends information to the motor cortex.
2. The second loop involves activity from the substantia nigra, which projects dopaminergic neurons to the striatum, causing two opposite effects on two different receptors, the D1 and D2 dopamine receptors: excitation (in D1) and inhibition (in D2).

The excitation and inhibition of movements are regulated by the dopaminergic input to the striatum (from the SNpc) and go to the basal ganglia via the direct and indirect pathways:

1 These figures are adapted versions of https://commons.wikimedia.org/wiki/File:Basal_ganglia_circuits.svg and https://commons.wikimedia.org/wiki/File:Midbrainsection.svg (last retrieved 02/02/2021), under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Figure 2.3: Schematic views of the motor cortex, the thalamus, and components of the basal ganglia. (A) shows a lateral view of the left hemisphere of the human brain. The dashed vertical lines represent two coronal cuts (B and C) of posterior sections of the brain. (D) shows a superior view of the midbrain signaling the substantia nigra (with SNpc and SNpr) in a healthy (left) and Parkinson’s disease (right) brain.
GPi: Globus pallidus internal segment; GPe: Globus pallidus external segment; STN: Subthalamic nucleus; SNpc: substantia nigra pars compacta; SNpr: substantia nigra pars reticulata. Adapted from Häggström (2021) and Madhero (2021).

• Direct pathway: The main function of the direct pathway is to excite the motor cortex and to facilitate movement. This pathway begins in the motor cortex, where the neural impulses enter the basal ganglia through the striatum via glutamatergic neurons, which produce an excitatory neurotransmitter called glutamate. Then, the neurons from the striatum send their axons to the GPi and SNpr via GABAergic inhibitory projections. The neurons from the GPi/SNpr communicate with the thalamus, also via inhibitory projections. Then, the excitatory pathways of the thalamus go to the motor cortex, resulting in increased motor activity.
• Indirect pathway: The main function is to inhibit motor activity by suppressing involuntary movement. The pathway begins in the motor cortex by projecting glutamate to the striatum. The neurons in the striatum send their axons to the GPe, then continue to the STN and the GPi/SNpr, which in turn suppress the activity of the thalamus on the motor cortex.

Figure 2.4: Diagram of the internal connections between the motor cortex and the basal ganglia. The dashed red lines indicate inhibitory projections and the green lines indicate excitatory projections. In the direct pathway the striatum communicates directly with the GPi and SNpr. In the indirect pathway, the striatum communicates with the GPi and SNpr through the GPe and the STN. The dopamine projected from the SNpc to the striatum causes excitatory and inhibitory effects on D1 and D2 receptors, respectively. GABA: γ-aminobutyric acid; GPi: Globus pallidus internal segment; GPe: Globus pallidus external segment; STN: Subthalamic nucleus; SNpc: substantia nigra pars compacta; SNpr: substantia nigra pars reticulata.
Based on Obeso et al. (2000).

In summary, dopamine helps to regulate the excitability of the neurons in the striatum, which is involved in body movement. In a healthy brain, the signal that is forwarded from the motor cortex (and continues to the body) is the result (in part) of a balanced activation of neurons in the direct and indirect pathways. In PD patients, decreased dopamine levels cause an increased inhibition of the GPe in the indirect pathway. In parallel, there is a decreased inhibition of the GPi activity in the direct pathway. The result is an increased activity in the GPi/SNpr output of the basal ganglia, which makes it difficult for the patients to control their movements (Obeso et al., 2000).

2.2.2 Motor speech disorders in Parkinson’s disease

The speech production disorders often associated with PD are known as hypokinetic dysarthria, which is the result of a dysfunction in the internal pathways of the basal ganglia. As described by Duffy (2000), hypokinetic dysarthria is characterized by a reduction in the range of movements, rigidity, and slow repetitive movements affecting different dimensions of speech such as phonation, articulation, and prosody. Phonation problems include tight breathiness, hoarse speech, voice tremor, and bowing of the vocal folds. Phonatory deviations are usually evaluated during sustained phonation of vowels. In the case of articulation, the reduced range of movements of the jaw, lips, and tongue results in prolongation of speech sounds, problems to initiate speech, and imprecise articulation of sounds, which can become evident during speech tasks including conversations, reading, and alternating and sequential production of syllables (/pa/, /ta/, /ka/, and /pa-ta-ka/). In the case of prosody, the most common speech disorders include a reduction in the variability of pitch (monopitch) and loudness (monoloudness), rapid speech rate, and reduced loudness.
Prosodic deviations are mainly detected during conversational and read speech tasks.

2.3 Auditory system and speech control

2.3.1 Overview of the auditory system

The auditory system is composed of the outer, middle, and inner ear (cochlea) and regions in the brain including the auditory cortex. Figure 2.5 shows a diagram of the components of the ear. Sound waves travel through the ear canal (an air-filled path) setting the eardrum into vibration. The middle ear (an air-filled chamber) acts as a mechanical bridge between the eardrum and the inner ear by means of three small bones (malleus, incus, and stapes). The movements of the eardrum are transmitted by these bones to the oval window, which is the entrance to the inner ear: the cochlea is a fluid-filled cavity (perilymphatic fluid) with three scalae, coiled in the shape of a snail shell. In the middle scala lies the organ of Corti on the basilar membrane, which contains the hair cells. The mechanical vibrations produced in the middle ear are transformed into electrical signals by the hair cells on the basilar membrane within the cochlea (Figure 2.5).

2 This figure is an adapted version of https://en.wikipedia.org/wiki/File:Anatomy_of_the_Human_Ear_cs.svg (last retrieved 02/02/2021), under the Creative Commons Attribution-Share Alike 2.5 Generic license.
3 This figure is an adapted version of https://medienportal.siemens-stiftung.org/en/cochlea-transparent-uncoiled-101976 (last retrieved 02/02/2021), under the Creative Commons Attribution-ShareAlike 4.0 International license.

Specifically, when the oval window is pushed in by the stapes, the fluids in the cochlea are moved towards the apex,
generating pressure waves at different points in the basilar membrane, which in turn bend the hair cells, releasing a neurotransmitter that fires the auditory neurons that connect the ear with the brain. There are two different types of hair cells: the inner hair cells, which function as receptors, and the outer hair cells, which amplify the incoming signal. A deviation of the basilar membrane leads to a bending of the tiny hairs on top of these cells, which results in a rhythmic elongation and shortening of the outer hair cells according to the frequency representation at their location and, by that, to an increased basilar membrane vibration. The flow of fluid inside the cochlea produced by the inward movement of the oval window is accommodated by the round window at the other end of the cochlea (Denes and Pinson, 1993). The information about the frequencies of the acoustic signals is encoded by the auditory system by locating the places of the basilar membrane in which the pressure waves produce the maximum displacement (vibration) amplitude. For instance, the place of maximum displacement for high frequencies occurs near the base (the stiffest part), while for lower frequencies, the place of maximum vibration displacement occurs towards the apex (Loizou, 1999). After the sound waves are transmitted and transformed into electrical impulses in the inner ear, the receptor neurons transmit the signals over a pathway of nerves (passing through regions of the medulla and the midbrain) connected to the auditory cortex. The phenomenon of frequency-localization-organization called “tonotopy” persists from the cochlea over the neurons to the cortex.

Figure 2.5: (Left) Schematic view of the outer, middle, and inner ear. (Right) Portion of the cochlea in the inner ear. Sound waves are transformed into electrical signals by the bending of the hair cells on the basilar membrane. Adapted from Brockmann (2021).
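This tonotopic frequency-to-place relationship is often modeled with the Greenwood function. The sketch below uses the constants commonly cited for the human cochlea (A = 165.4, a = 2.1, k = 0.88, with x the relative distance along the basilar membrane from apex to base); it is shown only as an illustration of how the characteristic frequency rises from apex to base:

```python
def greenwood_frequency(x):
    """Characteristic frequency (Hz) at relative position x along the
    basilar membrane (0 = apex, 1 = base), using the Greenwood function
    with constants commonly cited for the human cochlea."""
    A, a, k = 165.4, 2.1, 0.88
    return A * (10 ** (a * x) - k)

# Frequency increases monotonically from apex (low) to base (high).
print(round(greenwood_frequency(0.0)))  # 20  (~20 Hz near the apex)
print(round(greenwood_frequency(1.0)))  # ~20000 (near the base)
```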
2.3.2 Cochlear implants (CIs)

As described in Section 1.2.2, sensorineural hearing loss is caused by disorders in the inner ear occurring at birth, due to a disease, as the result of an infection, among others. For instance, meningitis is an infection that can destroy the hair cells within the cochlea. Thus, without the hair cells, the connection between the ear and the central nervous system is broken (Weber and Klein, 1999). CIs bypass the damaged parts by triggering the hearing nerves via direct electrical stimulation through electrodes inserted in the cochlea (Figure 2.6). In general, a CI consists of an external speech processor, which captures, preprocesses, and transforms the speech signals into electrical impulses that are sent to an array of electrodes implanted inside the cochlea of the patient. Commonly, the insertion of the electrodes is performed through the round window. The insertion depth depends on the size of the cochlea and can reach distances close to the apex (Carlson, 2020; Lenarz, 2017). The implants may have 12 or 22 electrodes (only half of them active) along the cochlea. There are a number of factors that can influence the frequency resolution of the sounds perceived with the help of a CI (Brant and Eliades, 2020; Loizou, 1999). Some factors are:

1. The distance between the electrode contacts and the auditory neurons. Neural activation decreases as the result of a decreased strength of the electrical stimulation in the targeted neuron region.
2. The spread of the electrical stimulation. The propagation of the electrical current in the electrodes is spread by the perilymphatic fluid along the cochlea; thus, the electrical excitation is not focused on a single region and might excite the surrounding neurons.
3. The number of auditory neurons available for electrical stimulation is limited. In order for the CI to work properly, there has to be neural tissue left to receive electrical current.
4.
The insertion performed by the surgeon is sometimes difficult, resulting in a diminished number of activated electrodes.

Considering what is mentioned above, it is clear why a CI user may notice differences between the sounds perceived and the sounds produced, even after cochlear implantation (Lane et al., 1995).

4 This figure is an adapted version of https://www.embopress.org/doi/full/10.15252/emmm.201911618 (last retrieved 02/02/2021), under the Creative Commons Attribution 4.0 license.

Figure 2.6: (Left) Schematic view of a cochlea (and cross-section) with normal hearing. (Right) Schematic view of a cochlea (and cross-section) with an implant. Commonly, the electrode array is implanted through the round window. The electrical stimulation of the electrode contact spreads in a region of the target neurons. Adapted from Dieter et al. (2020).

2.3.3 Auditory feedback and speech control

Auditory feedback is the precondition for the constant monitoring and correction of our own speech and, by that, for the development and maintenance of speech movements. As described by Tourville et al. (2008), speech motor control is characterized by feedback and feedforward control. On the one hand, in feedback control, the performance of the movements is evaluated during execution and any deviation is corrected according to sensory information. On the other hand, in feedforward control, the performance of the movements depends on previously learned commands without relying on sensory information. These mechanisms of speech control have been examined from different angles. Some examples of the impact of auditory feedback on these two processes include:

• Voice control, as when a speaker raises his/her voice because the self-perceived loudness is too low or simply to overcome background noise (Lombard effect; Lombard, 1911).
• Speech disfluency caused by delayed auditory feedback (Stuart et al., 2002).

• Adaptation of formant frequencies when a speaker hears a persistent shift in the formants of their own speech (Purcell and Munhall, 2006; Tourville et al., 2008).

Normally, speech production is constantly monitored and compared to an internal speech model in the brain, which is acquired and maintained with the use of auditory feedback (Perkell et al., 2000). In the Directions Into Velocities of Articulators (DIVA) model of speech production proposed by Guenther (1994), the speech movements are planned considering a speech sound map (in the motor cortex) that is activated to: (1) learn speech sound targets and (2) control the necessary articulatory movements to achieve different acoustic goals (Guenther and Hickok, 2016). With ongoing hearing loss, the speech sound map can change slightly; moreover, sensorimotor control decreases, as one tends to use only as much force and effort for movements as necessary (Guenther et al., 2004; Perkell et al., 2007). This has a considerable impact on the speech of people with hearing impairment. For instance, when hearing loss occurs after speech acquisition (post-lingual onset of deafness), somatosensory feedback at first maintains precise speech production. If there is a persistent lack of auditory feedback, speech production may eventually deteriorate due to a diminished precision of articulation.

Summary

The speech production process requires the complex coordination of regions in the brain, the vocal tract, and the auditory system. Depending on the clinical condition, different aspects of speech can be affected, and thus it is possible to detect these changes using automatic acoustic analysis. The following chapter describes the techniques and methods used to model pathological speech signals and to detect speech production changes by analyzing aspects related to phonation, articulation, and prosody.
Chapter 3 State-of-the-art

3.1 Severity estimation of Parkinson's disease from speech

Typically, the assessment of the neurological state of PD patients from speech is performed using regression analysis, which consists of training a model to learn the relationship between acoustic features (extracted from the speech signals) and the clinical score of the patient. Several studies have addressed the prediction of clinical scores of PD patients. Asgari and Shafran (2010) proposed a methodology to predict the UPDRS-III score (motor sub-score) from speech recordings of 61 PD patients and 21 Healthy Controls (HC). Phonation, articulation, and prosody analyses were performed by extracting acoustic features from the sustained phonation of the vowel /a/, the rapid repetition of /pa-ta-ka/, and the reading of three standard texts. The set of features includes F0, jitter (cycle-to-cycle variation of pitch), shimmer (cycle-to-cycle variation of the glottal waveform), spectral entropy (entropy of the log power spectrum), cepstral coefficients (shape of the spectral envelope), and the number and duration of voiced and unvoiced frames, among others. A feature vector was formed for each speaker, and a Support Vector Regressor (SVR) was trained to predict the patients' UPDRS scores. The authors reported that it is possible to estimate the UPDRS-III with a Mean Absolute Error (MAE) of 5.66 using an ε-SVR with a cubic polynomial kernel. Tsanas et al. (2010) performed regression analysis to estimate the UPDRS scores of 42 PD patients (28 male, 14 female). Speech recordings with the sustained phonation of vowels were captured once per week for six months. However, the neurological state of the patients was assessed only three times during that period: at the beginning, three months later, and at the end. Thus, the authors used a piece-wise linear interpolation in order to obtain the missing UPDRS scores.
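That interpolation step can be sketched with NumPy; the assessment days and scores below are hypothetical, chosen only to mirror the setup of three clinical assessments and weekly speech recordings:

```python
import numpy as np

# Hypothetical setup: UPDRS assessed only three times over ~6 months
# (day 0, day 90, day 180), while speech was recorded once per week.
assessment_days = np.array([0.0, 90.0, 180.0])
assessment_updrs = np.array([28.0, 31.0, 36.0])

recording_days = np.arange(0, 181, 7)  # one recording per week

# Piece-wise linear interpolation of the missing weekly target scores.
weekly_updrs = np.interp(recording_days, assessment_days, assessment_updrs)
```

Each weekly recording then has a regression target, at the cost of assuming the clinical score changes linearly between assessments.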
Speech signals were modeled considering acoustic features based on pitch/amplitude perturbation, noise, and entropy. Regression analysis was performed using least squares, the Least Absolute Shrinkage and Selection Operator (LASSO), and Classification And Regression Trees (CARTs). Additionally, the MAE was used to evaluate the performance of the proposed approach in estimating the total UPDRS and the scores of the motor section (UPDRS-III). The authors reported that the CART approach performed best, with an MAE of 7.5 points in the estimation of the total value of the UPDRS scale. The scores of the motor section of the UPDRS were estimated with an MAE of 6 points. Skodda et al. (2013) presented a study where speech deterioration was evaluated over time. The speech of 80 PD patients (48 male, 32 female) was recorded between 2002 and 2012 in two recording sessions. The time between the first and second sessions ranged from 12 to 88 months. A control group of 60 healthy persons (30 male, 30 female) was also considered. The participants were asked to read a text and to produce a sustained phonation of the vowel /a/. In both sessions, the patients were assessed by expert neurologists according to the UPDRS-III. The audio signals were perceptually evaluated considering four aspects of speech: voice, articulation, prosody, and fluency. Acoustic analysis was performed to describe these speech aspects. Voice was modeled with a set of features including jitter, shimmer, and average pitch. The Vowel Articulation Index (VAI) and the proportion of pauses within polysyllabic words were considered for articulation. Prosody was analyzed by estimating the standard deviation of the pitch. In addition, fluency was evaluated considering the speech rate and the pause ratio. To assess the progression of speech and voice impairments, the authors compared the features extracted in the first session to those extracted in the second session.
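The jitter and shimmer measures that recur throughout these studies are relative cycle-to-cycle variations of the pitch period and of the peak amplitude, respectively. A minimal sketch of the "local" variants, over hypothetical per-cycle measurements (analysis tools such as Praat offer more elaborate variants, e.g., averaging over several cycles):

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute cycle-to-cycle period difference, relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Mean absolute cycle-to-cycle peak-amplitude difference, relative to the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Hypothetical glottal-cycle measurements (periods in seconds, linear amplitudes).
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
amps = [0.80, 0.78, 0.82, 0.79, 0.81]
jitter = local_jitter(periods)    # ~2% period perturbation
shimmer = local_shimmer(amps)     # ~3.4% amplitude perturbation
```

Higher values of either measure indicate less stable vocal fold vibration.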
The authors found significant differences for shimmer, speech rate, pause ratio, and VAI when features extracted from the first session were compared to the same features extracted from the second session. However, as the authors stated, the results were not conclusive due to methodological limitations, such as the long time elapsed between the two recording sessions. Bayestehtashk et al. (2015) considered three regression techniques to predict the UPDRS scores: ridge regression, LASSO regression, and linear SVR. Speech recordings of 168 patients were collected in a single recording session. Automatic methods for the acoustic analysis of PD were also addressed in the Parkinson's Condition sub-challenge, which was part of the INTERSPEECH 2015 Computational Paralinguistics Challenge (Schuller et al., 2015). The challenge consisted of predicting the MDS-UPDRS-III score. Recordings of 50 patients (25 male, 25 female) included in the PC-GITA database (Orozco-Arroyave et al., 2014) were considered to form the train and development subsets. The test set included a total of 11 new patients recorded in non-controlled noise conditions, i.e., without a sound-proof booth. A total of 42 speech tasks were considered. The neurological state of the patients was assessed by an expert neurologist according to the MDS-UPDRS-III subscale. The winners of the challenge (Grósz et al., 2015) reported a Spearman's correlation of 0.65 between the real MDS-UPDRS-III scores and the predicted values using deep rectifier neural networks and Gaussian processes. Orozco-Arroyave et al. (2016) presented a methodology to estimate the neurological state (MDS-UPDRS-III) of 158 PD patients: 50 Colombians (25 male), 88 Germans (47 male), and 20 Czechs (all male). The regression process was performed using a linear ε-SVR. The speech tasks included the reading of isolated words, sentences, and a standard text, as well as a monologue.
In order to model articulation problems, the authors extracted the energy in the transitions from unvoiced to voiced (onset) and from voiced to unvoiced (offset) segments, considering different frequency bands distributed according to the Bark and Mel scales. Speech intelligibility was evaluated using an automatic speech recognition system. According to the authors, the neurological state of the patients (MDS-UPDRS-III) can be estimated with a Spearman's correlation of up to 0.74 when several speech tasks are modeled considering the fusion of articulation and intelligibility measures. The openSMILE toolkit, which allows computing more than 6000 descriptors (Eyben et al., 2010), was also considered for feature extraction. The authors reported that the neurological state of the patients could be assessed with an MAE of 5.5. A study for the monitoring of PD progression was also presented by Gómez-Vilda et al. (2017). The authors considered speech recordings from 8 male patients captured twice, with four weeks between sessions. Speech recordings of 100 healthy speakers were considered as a baseline. The participants were asked to perform the sustained phonation of the vowels /a/, /e/, /i/, /o/, /u/, and to read a short sentence and a standard text. The authors used two methods to estimate the features: (i) vocal tract inversion using an adaptive lattice filter and (ii) biomechanical inversion of a 2-mass model of the vocal folds. The features include jitter, shimmer, harmonicity, vocal fold body mechanical stress, and tremor during vibration of the vocal folds. During the recording sessions, the patients continued their pharmacological treatment and received speech therapy. Each patient was evaluated according to the H&Y scale. The relationship between the neurological scale and the acoustic features was evaluated using hypothesis testing based on Bayesian likelihood. According to the authors, the tremor and biomechanical features evolve differently with the treatment.
The authors suggest defining different time intervals between evaluations to obtain more conclusive results. Sztahó et al. (2017) proposed a method to estimate the severity of PD using rhythm-based features. The authors considered speech recordings of 51 PD patients (25 male) and 27 healthy speakers (14 male) from Hungary. All of the patients were evaluated according to the H&Y scale. The speech tasks consisted of a monologue and the reading of a standard text. The set of rhythm features includes the standard deviation of the duration of consonants and vowels, the average duration of speech/pauses, the pause ratio, the percentage of consonants/vowels, the articulation rate, and the raw and normalized Pairwise Variability Index (rPVI, nPVI) of the consonants and vowels. Regression analysis was performed to estimate the severity of the disease using linear regression, SVR, Artificial Neural Networks (ANNs), and Deep Neural Networks (DNNs). The authors obtained a Spearman's correlation coefficient of up to 0.744 (SVR, reading task) between the predicted and the target H&Y scores. Hemmerling and Wojcik-Pedziwiatr (2020) estimated the severity of PD by extracting acoustic features from the sustained phonation of the vowels /a/, /e/, /i/, /o/, and /u/. The set of features includes average F0, jitter, shimmer, energy, spectral moments, MFCCs, and Perceptual Linear Prediction (PLP) coefficients, among others. For this, speech recordings of 27 PD patients from Poland were captured five times over 180 minutes after taking levodopa medication. Additionally, an expert neurologist estimated the UPDRS score of the patients in the five recording sessions. The motor UPDRS scores of the patients were estimated using multiple linear regression, Random Forest (RF) regression, and SVR. The authors reported that the lowest error between predictions and clinical scores was obtained for the vowel /a/ (MAE=1.85) when the regression analysis was performed with RF.
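The pipeline shared by most of the studies above (one acoustic feature vector per speaker, an SVR, and evaluation with MAE and Spearman's correlation against the clinical score) can be sketched as follows. The data here are synthetic stand-ins for acoustic features and clinical scores, and the hyperparameters are illustrative, not those of any cited study:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 speakers x 12 acoustic features; the clinical score
# is a noisy function of the features (real studies use jitter, shimmer, etc.).
X = rng.normal(size=(100, 12))
y = 30 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=2.0, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# epsilon-SVR, as used in several of the studies above (kernel choice varies).
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.5))
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

mae = mean_absolute_error(y_te, y_pred)      # reported as "MAE" above
rho, _ = spearmanr(y_te, y_pred)             # reported as "Spearman's correlation"
```

MAE measures the average deviation in the units of the clinical scale, while Spearman's correlation checks whether the predicted ranking of severities matches the clinical one; the studies above report one or both.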
Other studies have also considered regression analysis to estimate the dysarthria level of PD patients. Cernak et al. (2017) evaluated the changes in the voice quality of the speakers by considering the mFDA score related to larynx deficits (Table 1.3). The authors trained an SVR with phoneme posterior probabilities extracted from recordings of 50 PD patients and 50 HC speakers from Colombia. The speech tasks include the rapid repetition of /pa-ta-ka/, the reading of a standard text, and a monologue. The authors reported Spearman's correlation coefficients of up to 0.57 between the predicted scores and the larynx mFDA score. García et al. (2017) predicted the neurological state and dysarthria level of 50 PD patients according to the MDS-UPDRS-III and mFDA scores, respectively. Acoustic analysis was performed by considering different pitch, loudness, duration, and filterbank parameters. These features were extracted from 4 speech tasks, including the rapid repetition of syllables (e.g., /pa-ta-ka/), a monologue, and the reading of a text and of different sentences. Then, the i–vector approach was considered to obtain the speaker models (or embeddings) of 50 PD patients and 50 HC speakers from Colombia (see Chapter 4). The authors reported that it was possible to predict the MDS-UPDRS-III with a Spearman's correlation of 0.63 when phonation and articulation features extracted from the sentences were considered to train the i–vectors. Additionally, the mFDA was predicted with a Spearman's correlation of 0.72 when considering the rapid repetition of /pa-ta-ka/ modeled with phonation, articulation, and prosody features. Vásquez-Correa et al. (2018) estimated the dysarthria level of 68 PD patients and 50 HC speakers from Colombia. The set of speech tasks included the sustained phonation of the Spanish vowels, the reading of 10 sentences, a standard text, a monologue, and the rapid repetition of /pa-ta-ka/, /pa-ka-ta/, /pe-ta-ka/, /pa/, /ta/, and /ka/.
Automatic acoustic analysis was performed with i–vector speaker models obtained from phonation, articulation, prosody, and intelligibility-based features. Additionally, three variants of ridge regression (linear, kernel, Bayesian) and two variants of SVR were considered to estimate the mFDA scores of the patients and the HC speakers. The authors reported that the highest Spearman's correlation coefficient was 0.69, obtained for articulation features extracted from continuous speech. Karan et al. (2020) combined F0 and Hilbert spectral features to estimate the mFDA score of 70 PD patients. The authors considered speech recordings with the sustained phonation of the vowels /a/, /e/, /i/, /o/, and /u/ and the reading of 10 isolated words. Regression analysis was performed with an ε-SVR. The authors reported Spearman's correlations of 0.75 (for the vowel /o/) and 0.77 (for the word reina, "queen"). Table 3.1 summarizes the studies related to the severity estimation of PD. In general, the sustained phonation of vowels and the reading of a standard text are the most frequently used speech tasks to assess the patient's neurological state. As described in Section 1.2.1, the sustained phonation task allows the detection of phonation problems, while the acoustic analysis of the reading task allows evaluating articulation and prosody problems. The most common biomarkers considered to model speech problems include pitch (F0, jitter), harmonicity (e.g., the harmonics-to-noise ratio), and the spectral energy of the signal. Furthermore, the SVR has proven suitable for modeling the relationship between the acoustic features and the clinical score.

3.2 Speech analysis of cochlear implant users

Oral communication skills of severely and profoundly hearing-impaired speakers can be improved by cochlear implantation.
Such an improvement has been observed as a better contrast in the production of consonants and as a decrease in average F0, loudness, and duration of speech segments. Nevertheless, the speech production of CI users remains affected even after rehabilitation by cochlear implantation. Plant and Oster (1986) investigated pitch, duration, and articulation changes in the speech of one female speaker recorded in two sessions: before and after cochlear implantation. The speech tasks consisted of the reading of a text and of a list of words. Pitch and duration were evaluated from the reading of the text by computing the average and standard deviation of the F0 contour, the total phonation time, the average duration of the pauses, and an estimated value of the articulation rate (the number of syllables divided by the total phonation time). Articulation was evaluated by extracting the vowels from the list of words and computing the ratio between the first and second formants (F1/F2) to detect shifts in the vowel space area. The authors reported that after implantation, the speech parameters of the CI user moved towards "normality" values, which were obtained by performing the same analysis on the recording of an age-matched typical hearing speaker. As stated by the authors, the main limitation of that study was that only one speaker was considered. Furthermore, the authors believe that the speech improvement of the CI user may be the result of training.

Table 3.1: Summary of works related to the severity estimation of PD. Longitudinal analysis refers to studies that consider several speech recordings captured in different sessions from the same patients.

Authors | Subjects | Acoustic parameters | Method (best result) | Clinical scale | Longitudinal analysis
Asgari 2010 | 61 PD/21 HC | Loudness, duration, entropy, harmonicity, pitch, spectral energy | SVR | UPDRS-III | No
Tsanas 2010 | 42 PD | Pitch, harmonicity, nonlinear analysis | CART | Total UPDRS, UPDRS-III | Yes
Skodda 2013 | 80 PD | Pitch, articulation, fluency, harmonicity | Shapiro-Wilk statistical test | UPDRS-III | Yes
Bayestehtashk 2015 | 168 PD | Loudness, duration, entropy, harmonicity, pitch, spectral energy | SVR | UPDRS-III | No
Grósz 2015‡ | 61 PD/50 HC | Articulation | Gaussian processes | MDS-UPDRS-III | No
Orozco-Arroyave 2016 | 158 PD∗ | Articulation, intelligibility | SVR | MDS-UPDRS-III | No
Gómez-Vilda 2017 | 8 PD/100 HC | Pitch, harmonicity, vocal fold tremor, body mass features | Bayesian likelihood | H&Y | Yes
Sztahó 2017 | 51 PD/27 HC | Speech rate, duration, rhythm | SVR | H&Y | No
Cernak 2017 | 50 PD/50 HC | Phoneme posterior probabilities | SVR | mFDA (Larynx) | No
García-Ospina 2017 | 50 PD/50 HC | Pitch, loudness, articulation, duration | i–vectors | mFDA, MDS-UPDRS-III | No
Vásquez-Correa 2018 | 68 PD/50 HC | Speaker embeddings with i–vectors | SVR | mFDA | No
Hemmerling 2020 | 27 PD | Pitch, loudness, spectral energy, filterbank features | SVR | UPDRS-III | No
Karan 2020 | 70 PD | Pitch, Hilbert spectral features | SVR | mFDA | No
∗ This study includes speakers from Colombia (50), Germany (88), and the Czech Republic (20).
‡ Winners of the Parkinson's Condition sub-challenge (Schuller et al., 2015).

Perkell et al. (1992) performed acoustic analysis considering speech recordings of four postlingually deafened CI users. The recording sessions were performed pre- and post-activation of the speech processor. Post-activation recordings were captured at different week intervals. The features considered for the analysis were F0, F1, F2, Sound Pressure Level (SPL), duration, and the amplitude difference between the first two harmonic peaks in the log-magnitude spectrum. The speech tasks consisted of reading nine vowels (included in predefined words) spoken in a carrier sentence.
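Several of the works in this section quantify vowel articulation through the first two formant frequencies. A common summary measure is the vowel space area (VSA): the area of the polygon spanned by the corner vowels in the F1-F2 plane, which shrinks when vowels are centralized. A minimal sketch with the shoelace formula; the formant values below are hypothetical, for illustration only:

```python
import numpy as np

def vowel_space_area(formants):
    """Shoelace area (in Hz^2) of the polygon spanned by (F1, F2) vertices.
    `formants` is a list of (F1, F2) pairs ordered around the polygon."""
    f1 = np.array([p[0] for p in formants], dtype=float)
    f2 = np.array([p[1] for p in formants], dtype=float)
    return 0.5 * abs(np.dot(f1, np.roll(f2, -1)) - np.dot(f2, np.roll(f1, -1)))

# Hypothetical corner-vowel formants (Hz) for /i/, /a/, /u/.
normal = [(300, 2300), (750, 1300), (350, 800)]
reduced = [(380, 2000), (650, 1350), (420, 950)]  # centralized vowels

shrunk = vowel_space_area(reduced) < vowel_space_area(normal)
```

A reduced VSA relative to matched typical speakers is the pattern reported for CI users in the studies summarized below.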
The authors reported that, after activation, many of the acoustic parameters moved toward the values reported in previous studies with healthy speakers. However, these results were based on the outcome of only four speakers. Lane et al. (1995) measured the VOT in stop-initial syllables produced by five CI users. Short-term and long-term analyses were performed. For the short-term analysis, the recordings were captured after turning off the speech processor of the patients for 24 hours, then turning it on, and then off again. For the long-term analysis, speech recordings were captured before and after activation of the speech processor at intervals of 0, 4, 12, 26, 52, and 104 weeks. The speech task consisted of the reading of the six English stop consonants embedded in a carrier sentence. The VOT measurements were performed manually. The authors examined the effect of processor activation and found increased VOT measurements in the voiced stop consonants for the short-term analysis and increased VOT values for the long-term analysis. The authors suggest that the changes in voiced stops are related to concurrent changes in pitch and SPL. In the case of voiceless stops, the changes are linked to the auditory validation of phonemic settings. One limitation of this study is the small number of speakers considered for the experiments. Gould et al. (2001) examined the speech intelligibility of four postlingually deafened adults before activation and after 6 and 12 months of activation. The participants were instructed to produce ten repetitions. Speech intelligibility was measured for vowels and consonants individually using a metric called the percentage of transmitted information. The authors reported an overall improvement in word intelligibility; however, such an improvement was not consistent for individual consonants or vowels. Blamey et al. (2001) analyzed the speech production of nine children for six years after implantation.
Speech intelligibility was assessed by considering phonetic transcriptions of conversational speech. The transcriptions were used to measure the percentage of correctly produced words. The authors observed an increase in speech intelligibility, length, and phonemic accuracy during the six years. However, the rate of improvement was considerably slower than that observed in normally-hearing children, who developed their linguistic skills at a younger age. Hassan et al. (2012) evaluated speech nasalization considering 25 postlingual CI users and 25 age-matched HC from Saudi Arabia. The patients were divided into three groups according to the duration of hearing loss before implantation: (1) less than three years (7 patients), (2) between 3 and 6 years (8 patients), and (3) more than six years (10 patients). For the evaluation, percentage scores of nasalance were obtained from two sentences read by the participants. The scores were obtained with a nasometer, which measures the acoustic output from the oral and nasal cavities. Nasalance scores were obtained for each patient before implantation and after 6, 12, and 24 months of CI activation. The authors reported that for the three groups of patients there is a tendency for the nasalance scores to decrease over time; however, the level of nasality was still higher than in the control group. Furthermore, the authors found that the degree of nasality and the improvement over time depend on the duration of hearing loss. In the study presented by Ubrig et al. (2011), deviations in the phonation of CI users were investigated. For this, the authors considered speech recordings of 40 postlingual CI users and 12 postlingually hearing-impaired adults without implants from Brazil. Two recording sessions were performed for the CI users: before implantation and 6-9 months after activation of the device.
Acoustic analysis was performed by computing the average and standard deviation of the F0 contour obtained from recordings of the sustained phonation of the vowel /a/ and the reading of a standard text. The authors found a significant reduction of the F0 variability when comparing the first to the second recording session. Other works have investigated the impact of the onset of hearing loss, i.e., pre- vs. post-lingual, on the speech of CI users. Vowel articulation of pre- and post-lingually deafened CI users was evaluated by Neumeyer et al. (2010). Speech recordings of 10 CI users (5 prelingual) and 10 age-matched normal hearing speakers from Germany were considered for the test. Articulation analysis was performed by computing the vowel space of /a/, /e/, /i/, /o/, and /u/, extracted from target words included within 20 standard sentences. The acoustic parameters extracted from the vowels were the first and second formant frequencies. The authors reported a reduction of the vowel space area for the CI users compared to the normal hearing speakers; in particular, such a reduction is mainly caused by the misarticulation of the back vowels (/o/, /u/). One reason the authors give is that such vowels are produced with tongue movements that are not visible to the CI users. The authors did not report differences between pre- and post-lingual CI users. They suggest that since postlingual CI users spent years without sufficient hearing and auditory feedback before implantation, their articulatory capability was diminished. Pre- and post-lingual CI users have also been found to have a limited production contrast of sibilant sounds, e.g., /s/ and /S/. Todd et al. (2011) analyzed speech recordings from 33 CI children (all prelingual) and 43 age-matched HC native English speakers from the United States. All children were asked to read 18 words with the sibilant sound (/s/ or /S/) in the initial position.
The target phonemes were manually transcribed and evaluated by trained native speakers. The acoustic analysis was performed by computing the energy in the Bark scale from a Hamming window of 40 ms located in the middle of the sibilant sound. Then, only the Bark band with the highest energy was selected for evaluation. From the transcription analysis, it was observed that the CI children produced the /s/ with less accuracy than the /S/. Furthermore, the children produced these two phonemes with less accuracy than the HC. Regarding the acoustic analysis, the authors found that the CI children produced the sibilant sounds with less energy than the control group, which results in a reduced contrast between /s/ and /S/. The authors suggest that such a diminished contrast may be caused by the poor frequency resolution of the implant. Similarly, Neumeyer et al. (2015) analyzed the German sibilant sounds /s/ and /S/ produced by 48 CI users (24 prelingual) and 48 HC speakers. The patients were divided into four groups depending on the onset of hearing loss (pre-/post-lingual) and the time between hearing loss and cochlear implantation (before/after language acquisition). The study participants were asked to read a carrier sentence containing two words that differ in only one consonant and have different meanings: Tasche (bag) and Tasse (cup). Acoustic analysis was performed by manually segmenting the sibilant sounds from the recordings and then computing the first spectral moment. From the results, the authors concluded that the sibilant production of CI users deviates from normal speech, and that the onset of deafness plays a role in the degree of the deviation, but that the duration between the onset of hearing loss and implantation has no significant impact on sibilant production.
The authors explained that such deviations might occur because the spectral resolution of the implant is lower at higher frequencies; thus, CI users shift the production of the sibilant sounds into the frequency range they can perceive. The speech intelligibility of pre- and post-lingual CI users can also be affected in different ways. Ruff et al. (2017) performed an automatic evaluation of speech intelligibility using an ASR system. The authors considered recordings of 50 CI users (14 prelingual, 36 postlingual) and 50 HC native German speakers for the experiments. The patients were divided into three groups: (1) prelingually deafened CI users with more than two years of deafness before surgery, (2) postlingually deafened CI users with less than two years of deafness before surgery, and (3) postlingually deafened CI users with more than two years of deafness before surgery. The study participants were asked to read a total of 97 words that contain every phoneme of the German language in different positions within the words. Then, the Word Recognition Rate (WR) was computed from the automatic transcriptions obtained with the ASR system. The system was trained with 27 hours of speech recordings, using the 97 words from the test as the vocabulary. The authors found that CI users with a postlingual onset of hearing loss and a short duration of deafness (< 2 years before surgery) have a higher WR than postlingually deafened users with a long duration of deafness and than prelingually deafened users. Furthermore, the postlingual CI users with a short duration of deafness showed a WR similar to that of the HC speakers. Gautam et al. (2019) presented a review of more than 25 studies (from 1983 to 2017) related to speech and voice changes due to hearing loss and the effect of CIs in adults and children. The acoustic parameters evaluated in those works include pitch, loudness, consonant contrast, speech duration/rate, vowel articulation (VSA), and VOT.
Changes in speech and voice due to hearing loss include: (1) increased pitch, loudness, and duration of speech, (2) reduced VSA and VOT, and (3) a slower speech rate. The studies in the literature have reported that most of these parameters move towards normality after cochlear implantation; however, speech and voice deviations are still present. Table 3.2 shows a summary of the works reviewed in this section. Although the speech production of CI users has been addressed before, the number of studies considering automatic methods for acoustic analysis is limited. From the works reviewed, it can be observed that speech and voice parameters such as pitch and loudness deviate from normality values even after implantation. Furthermore, the poor contrast in the production of some phonemes, such as /s/ and /S/, has been associated with the limited resolution of the CI to provide good perception to the patients.

Table 3.2: Summary of works related to acoustic analysis of speech production of CI users.
Authors | Subjects | Acoustic parameters | Method | Effect of hearing loss | Automatic analysis
Plant 1986 | 1 CI/1 HC | Pitch; duration; vowel articulation | Mean and variation of F0; voiced segments; formant frequencies | Reduced F0; longer duration; reduced VSA | No
Perkell 1992 | 4 CI | Pitch; duration; loudness; vowel articulation | Mean F0; vowel duration; mean SPL; formant frequencies | Reduced F0; longer duration; reduced loudness; reduced VSA | No
Lane 1995 | 5 CI | Duration | Voiced and voiceless VOT | Reduced VOT | No
Gould 2001 | 4 CI | Intelligibility | Percentage of information transmitted | Poor speech intelligibility | No
Blamey 2001 | 9 CI | Intelligibility | Percentage of correct words | Poor speech intelligibility | No
Neumeyer 2010 | 10 CI | Vowel articulation | Formant frequencies to estimate the VSA | Reduced VSA | No
Todd 2011 | 33 CI | Consonant articulation | Bark energies to evaluate the production contrast between /s/ and /S/ | Poor contrast | No
Ubrig 2011 | 40 CI/12 HI∗ | Pitch | Mean and standard deviation of F0 | Higher variation of F0 | No
Hassan 2012 | 25 CI/25 HC | Nasality | Nasometer to estimate the nasality level | Higher level of nasality | No
Neumeyer 2015 | 48 CI | Consonant articulation | Spectral moment to evaluate the production contrast between /s/ and /S/ | Poor contrast | No
Ruff 2017 | 50 CI/50 HC | Intelligibility | Word recognition rate using an ASR system | Lower word recognition rate | Yes
Gautam 2019 | NA† | Pitch; loudness; duration; speech rate; vowel articulation; consonant articulation | Mean F0 and jitter; mean SPL and shimmer; word/syllable duration and VOT; speaking rate; formant frequencies (VSA); /s/ vs /S/, /r/ vs /l/ | Increased pitch; increased loudness; longer durations; slower rate; reduced VSA; poor contrast | -
∗HI: Hearing impaired. †Information about the number of speakers not available.

3.3 Aging and speech

Some studies have analyzed the impact of aging on speech.
Xue and Deliyski (2001) considered sustained phonations of the English vowel /a/ and computed fifteen phonation measures from the Multi-Dimensional Voice Program. The set of measures includes F0, jitter, the Pitch Perturbation Quotient (PPQ), the Relative Average Perturbation (RAP), the variability of F0, the Amplitude Perturbation Quotient (APQ), shimmer, and the Noise-to-Harmonics Ratio (NHR), among others. A total of 44 speakers (21 male and 23 female) aged between 70 and 80 years were considered and compared with respect to the norms for young and middle-aged adults published by Deliyski and Gress (1998). The authors performed statistical analyses and reported that the voice of elderly people is significantly different from (usually poorer than) the voice of young and middle-aged adults. Goy et al. (2013) considered several phonation measures to assess the stability of vocal fold vibration and to quantify the noise in the voices of 159 younger speakers aged between 18 and 28 years and 133 older adults aged between 63 and 86 years. The authors concluded that the instability of vocal fold vibration increases with age. The dysphonia severity index was also measured, and only older females exhibited higher values than younger females; no statistical differences were observed between younger and older males. Another study that evaluated the influence of aging on the speech of elderly people, considering phonation and articulation analyses, was presented by Torre and Barlow (2009). A total of 27 young speakers with a mean age of 25.6 years and 59 older people with a mean age of 75.2 years were considered. Each participant was asked to read a set of 22 consonant-vowel-consonant words. The vowels and oral stops of each word were extracted and analyzed using Praat (Boersma and Weenink, 2001). The authors analyzed several acoustic properties, including F0, the first three formant frequencies, and the VOT.
According to the results, there was a decrease of F0 with age for women and an increase of F0 with age for men. This finding is consistent with the results reported by Benjamin (1981). The authors also highlighted that older men showed shorter VOTs than both younger men and younger women, which is also reported by Benjamin (1982). A greater variability in F0, the three formants, and the VOT was systematically observed in the speech produced by older adults compared to their younger same-sex counterparts. As the natural aging process in humans entails several alterations in speech production and perception, the impact of aging on the detection of voice disorders is still an open problem, and its relevance in clinical practice was studied by Pernambuco et al. (2017). The relationship between age and s