Analysis of Pathological Speech Signals
Análisis de Señales de Voz Patológicas

Tomás Arias Vergara

Tesis doctoral presentada para optar al título de Doctor en Ingeniería Electrónica y de Computación

Directores
Prof. Juan Rafael Orozco Arroyave, Doctor (PhD) en Ingeniería Electrónica y de Computación
Prof. Elmar Nöth, Doctor (PhD) en Ciencias de la Computación
Prof. Maria Schuster, Doctor (PhD) en Medicina Clínica

Universidad de Antioquia
Facultad de Ingeniería
Doctorado en Ingeniería Electrónica y de Computación
Medellín, Antioquia, Colombia

y

Friedrich-Alexander-Universität Erlangen–Nürnberg
Facultad de Ingeniería
Doctorado en Ciencias de la Computación
Erlangen, Alemania

2022

Cita: (Arias-Vergara, 2022)
Referencia (estilo APA 7, 2020): Arias-Vergara, T. (2022). Analysis of Pathological Speech Signals [Tesis doctoral]. Universidad de Antioquia, Medellín, Colombia.
Doctorado en Ingeniería Electrónica y de Computación, Cohorte XVII.
Grupo de Investigación en Telecomunicaciones Aplicadas (GITA)
Centro de Investigaciones Ambientales y de Ingeniería (CIA)
Repositorio Institucional: http://bibliotecadigital.udea.edu.co
Universidad de Antioquia - www.udea.edu.co
Rector: John Jairo Arboleda Céspedes
Decano/Director: Jesús Francisco Vargas Bonilla
Jefe de departamento: Augusto Enrique Salazar Jiménez

El contenido de esta obra corresponde al derecho de expresión de los autores y no compromete el pensamiento institucional de la Universidad de Antioquia ni desata su responsabilidad frente a terceros. Los autores asumen la responsabilidad por los derechos de autor y conexos.
https://co.creativecommons.net/tipos-de-licencias/

Acknowledgments

The development of this thesis would not have been possible without the help and support of many people. I’m very thankful to my family for their support throughout my academic life, especially to my mom for her wisdom and guidance.
She has always given me reasons to keep going, go beyond my limits, and dream for the best. Thanks to my brothers, Matias and Simon, for their invaluable support in many difficult situations that I could never have faced alone. I’m very grateful to my supervisors, Prof. Dr.-Ing. Juan Rafael Orozco-Arroyave, Prof. Dr.-Ing. Elmar Nöth, and Prof. Dr. med. Maria Schuster. I can only offer them my sincere appreciation for all of their advice, encouragement, and the learning opportunities they gave me. Rafa has been my supervisor since I was an undergraduate student. He allowed me to explore the world of academic research. His guidance helped me accomplish several of my career goals and influenced many of my life decisions. Elmar opened the doors of the Pattern Recognition Lab, and from the very first moment I arrived in Germany, he was very supportive academically and personally. I’m very thankful for all the fruitful discussions we had and for his time helping me become a better researcher. I’m thankful to Maria for the trust she put in me to carry on this project. She always encouraged me to get the best out of every task I’ve performed and offered me the best conditions to accomplish my goals. I also want to thank my colleagues from the GITA lab, Camilo, Paula, Parra, Patricia, Orlando, Cristian, Daniel, Lucho, Manuel, and Nicanor. In particular, I’m very thankful to Camilo and Paula. Together, Camilo and I developed new ideas, discussed the results of many experiments, and shared many great moments. Although we went on different paths in the end, I will always be grateful to him for all of his help. I’m also very thankful to Paula. There were difficult moments towards the end of my Ph.D. when she was my only support and the one who gave me the strength to continue given the circumstances. She became my “partner in crime”, and I hope to support her as much as she supported me, now that she has started her own Ph.D.
I would like to thank my colleagues at the Pattern Recognition Lab, Philipp, Sebastian, Tino, Hendrik, and Dalia. They helped me a lot when I arrived in Germany and were always very kind and friendly. I want to express my deep gratitude to Philipp. He has helped me a lot with many technical and practical matters during my stay in Germany. I consider him a close friend, and I’m glad for the opportunity to have worked with him at the Pattern Recognition Lab. And last but not least, I want to thank the patients and volunteers of Fundalianza Parkinson Colombia, the clinic of the Ludwig-Maximilians University in Munich, and the people from the Augustinum retirement home. Without their help and willingness to collaborate in this work, none of this would have been possible. Thanks to them for letting me be part of their group.

Abstract

The present thesis addresses the automatic analysis of speech disorders resulting from Parkinson’s disease and hearing loss. For Parkinson’s disease, the progression of speech symptoms is evaluated considering speech recordings captured over the short term (4 months) and the long term (5 years). Machine learning methods are used to perform three tasks: (1) automatic classification of patients vs. healthy speakers, (2) regression analysis to predict the dysarthria level and the neurological state, and (3) speaker embeddings to analyze the progression of the speech symptoms over time. For hearing loss, automatic acoustic analysis is performed to evaluate whether the duration and onset of deafness (before or after speech acquisition) influence the speech production of cochlear implant users. Additionally, articulation, prosody, and phonemic analyses are performed to show that cochlear implant users present altered speech production even after hearing rehabilitation. Automatic acoustic analysis is performed considering phonation, articulation, prosody, and phonemic features.
Phoneme precision is characterized using the posterior probabilities obtained from recurrent neural networks trained on German and Spanish. The phonemic analysis considers three main dimensions: manner of articulation, place of articulation, and voicing. This thesis also proposes a methodology for automatically detecting the voice onset time of voiceless stop consonants. Furthermore, this thesis studies the acoustic cues that reflect changes in the speech of elderly people due to the aging process. Regression analysis is performed to estimate a person’s age using the phonation, articulation, prosody, and phonemic features. Additionally, the use of smartphones for health care applications is considered here.

Zusammenfassung

Die vorliegende Dissertation befasst sich mit der automatischen Analyse von Sprachstörungen infolge von Parkinson und Hörverlust. Bei der Parkinson-Krankheit wird der Verlauf der Sprachsymptome anhand von Sprachaufzeichnungen bewertet, die kurzfristig (4 Monate) und langfristig (5 Jahre) aufgenommen wurden. Methoden des maschinellen Lernens werden verwendet, um drei Aufgaben zu erfüllen: (1) automatische Klassifikation von Patienten vs. gesunden Sprechern, (2) Regressionsanalyse zur Vorhersage des Dysarthrie-Levels und des neurologischen Zustands und (3) Sprechereinbettungen zur Analyse des Verlaufs der Sprachsymptome im Laufe der Zeit. Bei den Patienten mit Hörverlust wird eine automatische akustische Sprachanalyse durchgeführt, um zu beurteilen, ob die Dauer und das Einsetzen der Taubheit (vor oder nach dem Spracherwerb) die Sprachproduktion von Cochlea-Implantat-Trägern beeinflussen. Darüber hinaus werden Artikulations-, Prosodie- und Phonemanalysen durchgeführt, um zu zeigen, dass Träger von Cochlea-Implantaten auch nach einer Hörrehabilitation eine veränderte Sprachproduktion unterschiedlichen Ausmaßes aufweisen. Für die automatische akustische Analyse werden Phonations-, Artikulations-, Prosodie- und phonemische Merkmale berücksichtigt.
Die Phonempräzision wird durch die Posterior-Wahrscheinlichkeiten charakterisiert, die aus rekurrenten neuronalen Netzen gewonnen werden, die auf Deutsch und Spanisch trainiert wurden. Die phonemische Analyse fokussiert auf drei Hauptdimensionen: Artikulationsart, Artikulationsort und Stimmgebung. Diese Arbeit schlägt auch eine Methodik zur automatischen Erkennung des Stimmeinsatzes nach stimmlosen Stoppkonsonanten vor. Darüber hinaus untersucht diese Arbeit die akustischen sprachlichen Charakteristika, die Veränderungen bei älteren Menschen aufgrund des Alterungsprozesses widerspiegeln. Eine Regressionsanalyse wird durchgeführt, um das Alter einer Person unter Verwendung von Phonations-, Artikulations-, Prosodie- und phonemischen Merkmalen zu schätzen. Darüber hinaus wird hier der Einsatz von Smartphones für Anwendungen im Gesundheitswesen betrachtet.

Resumen

La presente tesis aborda el análisis automático de los trastornos del habla derivados de la enfermedad de Parkinson y la pérdida auditiva. En el caso de la enfermedad de Parkinson, el progreso de los síntomas del habla se evalúa considerando grabaciones de voz capturadas a corto plazo (4 meses) y a largo plazo (5 años). Se utilizan métodos de aprendizaje automático para realizar tres tareas: (1) clasificación automática de pacientes frente a hablantes sanos, (2) análisis de regresión para predecir el nivel de disartria y el estado neurológico, y (3) modelos de hablante para el análisis longitudinal del progreso de los desórdenes en la voz. En el caso de la pérdida auditiva, se realiza un análisis acústico automático para evaluar si la duración y el inicio de la sordera (antes o después de la adquisición del habla) influyen en la producción del habla de los usuarios de implantes cocleares. Además, se realizan análisis de articulación, prosodia y fonémicos para demostrar que los usuarios de implantes cocleares presentan una producción del habla alterada incluso después de la rehabilitación auditiva.
El análisis acústico automático se realiza considerando características de fonación, articulación, prosodia y características fonémicas. La precisión en la producción de fonemas se caracteriza mediante las probabilidades posteriores obtenidas de redes neuronales recurrentes entrenadas en alemán y español. El análisis fonémico considera tres dimensiones principales: modo de articulación, lugar de articulación y sonorización. Esta tesis también propone una metodología para la detección automática del tiempo de inicio de la voz en consonantes oclusivas sordas. Además, en este trabajo se analiza la influencia de la edad en el análisis acústico. El análisis de regresión se realiza para estimar la edad de una persona utilizando las características de fonación, articulación, prosodia y fonémicas. También, en esta tesis se considera el uso de smartphones para aplicaciones en el sector médico.

Contents

1 Introduction
  1.1 Motivation
  1.2 Speech disorders in selected populations
    1.2.1 Parkinson’s disease
    1.2.2 Hearing loss
    1.2.3 Aging
  1.3 Hypotheses
  1.4 Objectives
    1.4.1 General objective
    1.4.2 Specific objectives
  1.5 Contribution of this thesis
  1.6 Structure of the thesis
2 Speech production process
  2.1 Speech chain
    2.1.1 Physiological processes of speech production
  2.2 Impact of Parkinson’s disease on speech motor control
    2.2.1 Neuropathophysiology of motor control related to Parkinson’s disease
    2.2.2 Motor speech disorders in Parkinson’s disease
  2.3 Auditory system and speech control
    2.3.1 Overview of the auditory system
    2.3.2 Cochlear implants (CIs)
    2.3.3 Auditory feedback and speech control
3 State-of-the-art
  3.1 Severity estimation of Parkinson’s disease from speech
  3.2 Speech analysis of cochlear implant users
  3.3 Aging and speech
  3.4 Smartphone-based applications for health care
    3.4.1 Applications for Parkinson’s disease
    3.4.2 Applications for hearing loss
4 Automatic analysis of pathological speech signals
  4.1 Speech processing techniques: an overview
    4.1.1 Short-time analysis
    4.1.2 Time-frequency analysis
    4.1.3 Filterbank analysis
    4.1.4 Voice Activity Detection
  4.2 Pathological speech modeling
    4.2.1 Phonation analysis
    4.2.2 Articulation analysis
    4.2.3 Phonemic analysis
    4.2.4 Prosody analysis
  4.3 Machine learning methods
    4.3.1 Support Vector Machine for classification
    4.3.2 Support Vector Machine for regression
    4.3.3 Neural Networks
    4.3.4 Convolutional Neural Networks
    4.3.5 Recurrent Neural Networks
  4.4 Speaker models
    4.4.1 Gaussian Mixture Models
    4.4.2 i–vectors
    4.4.3 x–vectors
5 Data collection
  5.1 Parkinson’s disease
    5.1.1 PCGITA (Spanish)
    5.1.2 PD At-home (Spanish)
    5.1.3 PD Longitudinal (Spanish)
    5.1.4 Apkinson (Spanish)
  5.2 Cochlear implants
    5.2.1 LMU TAPAS (German)
    5.2.2 LMU Onset (German)
  5.3 Supporting datasets
    5.3.1 Young healthy controls (Spanish)
    5.3.2 PhonDat 1 Corpus (German)
    5.3.3 Verbmobil subset (German)
    5.3.4 TEDx Spanish Corpus - TSC (Spanish)
6 Experiments and results
  6.1 Models for speech analysis
    6.1.1 Phoneme posterior probabilities
    6.1.2 Automatic detection of voice onset time
  6.2 Parkinson’s disease patients
    6.2.1 Automatic methods for the assessment of PD from speech
    6.2.2 Speaker embeddings to monitor Parkinson’s disease
  6.3 Cochlear Implant users
    6.3.1 Quantification of phoneme precision to evaluate onset and duration of deafness
    6.3.2 Segmental and suprasegmental speech analysis of postlingually deafened CI users
  6.4 Aging and speech
  6.5 Smartphone-based applications for health care
    6.5.1 Apkinson
    6.5.2 Cochlear Implant Testing App - CITA
7 Summary
  7.1 Automatic methods for speech analysis
  7.2 Parkinson’s disease patients
  7.3 Cochlear implant users
  7.4 Aging and speech
  7.5 Smartphone-based applications for health care
Appendices
A Speech tasks
  A.1 Spanish speech protocol
    A.1.1 Vowel phonation
    A.1.2 Sentences
    A.1.3 Read text
    A.1.4 Speech diadochokinesia
    A.1.5 Monologue
  A.2 German speech protocol
    A.2.1 Read text
    A.2.2 Rhino sentences
    A.2.3 PLAKSS words
B Publications
  B.1 Journal publications
  B.2 Conference publications
List of Figures
List of Tables
Acronyms
Bibliography

Chapter 1

Introduction

1.1 Motivation

Oral communication of adults and children can be affected by developmental or acquired speech disorders resulting from motor/neurological impairments (e.g., brain injuries, Parkinson’s disease) or sensory/perceptual disorders (e.g., hearing loss)[1]. On the one hand, neurological diseases such as Parkinson’s disease (PD) affect certain regions of the brain and the muscles involved in the speech production process, leading to different motor speech impairments such as imprecise articulation, a slower speaking rate, monotonous speech, and a hoarse voice quality, among others (Ho et al., 1999; Trail et al., 2005).
On the other hand, perceptual disorders such as sensorineural hearing loss cause decreased speech intelligibility, changes in phoneme articulation, abnormal nasalization, a slower speaking rate, and decreased variability of the fundamental frequency (Hudgins and Numbers, 1942; Langereis et al., 1997; Leder et al., 1987). One of the aims of pathological speech processing is the development of technology to support the diagnosis and monitoring of different medical conditions through speech (Gupta et al., 2016). This thesis focuses on the automatic acoustic analysis of speech signals captured from PD patients and people with hearing loss. Furthermore, as the speech of elderly people changes due to the aging process, a clinical condition, or both, the description of acoustic cues in the speech that reflect such differences is a topic that deserves special attention.

PD is a neurodegenerative disease characterized by the progressive loss of dopaminergic neurons in the substantia nigra of the midbrain (Hornykiewicz, 1998). The primary motor symptoms of PD include tremor, slowness, rigidity of the limbs and trunk, postural instability, swallowing disorders, and speech impairments. Many of the symptoms are controlled with medication; however, there is no clear evidence indicating positive effects of those treatments on the speech impairments (Skodda et al., 2010), although there is evidence showing that speech therapy combined with pharmacological treatment improves the communication ability of PD patients (Schultz and Grant, 2000). The evaluation of PD requires the patient to be present at the clinic, which is time-consuming and expensive for both the patient and the healthcare system (Yang et al., 2020). Continuous monitoring of PD patients could therefore help to make timely decisions regarding their medication and therapy.

[1] www.asha.org/Practice-Portal/Clinical-Topics/Articulation-and-Phonology
In the case of hearing loss, there are different treatments available for the different types and degrees of deafness. A cochlear implant (CI) is the most suitable device for severe and profound deafness, when hearing aids do not sufficiently improve speech perception. A CI uses a sound processor to capture audio signals and send them to a receiver implanted under the skin behind the ear. The receiver transforms the signal into electrical impulses, which are sent to electrodes implanted in the cochlea. However, CI users often present altered speech production and limited speech understanding even after hearing rehabilitation. Thus, if the speech deficits were better known, the rehabilitation could be addressed more appropriately (Pomaville and Kladopoulos, 2013). CI users require assistance before, during, and after surgery from audiologists, medical specialists in otorhinolaryngology, and speech-language pathologists[2]; however, speech production quality is seldom assessed in outcome evaluations. Including speech technology could therefore lead to a reliable outcome evaluation that contributes to rehabilitation success.

[2] www.asha.org/Practice-Portal/Professional-Issues/Cochlear-Implants/

This thesis addresses the automatic evaluation of speech production from PD patients and CI users by combining signal processing techniques with machine learning methods. Such methods are also considered to analyze the effect of age as another possible source of changes in speech production. Additionally, since the use of smartphones for health care has become more frequent, some of the speech processing techniques addressed in this thesis are implemented in Android-based applications.

1.2 Speech disorders in selected populations

1.2.1 Parkinson’s disease

Clinical diagnosis

Parkinson’s disease is characterized by a combination of symptoms regarding motor control. Moreover, next to the motor symptoms, other symptoms such as mood changes, cognitive decline, and
sleep disorders might occur (Poewe, 2008). There is no standard method to diagnose PD. Doctors rely on the clinical history and a physical examination to assess the patients. Additionally, the severity of the disease is evaluated by expert neurologists using different scales such as the Movement Disorder Society–Unified Parkinson Disease Rating Scale (MDS-UPDRS) (Goetz et al., 2008). This is a perceptual scale used to assess the motor and non-motor abilities of the patients, with 65 items distributed in four sections:

• Section 1 (MDS-UPDRS-I, 13 items) concerns the non-motor experiences of daily living, such as cognitive impairment, depressed mood, and fatigue.
• Section 2 (MDS-UPDRS-II, 13 items) considers motor experiences of daily living, such as eating, dressing, handwriting, and tremor.
• Section 3 (MDS-UPDRS-III, 33 items) is used to evaluate the motor capabilities of the patient, including speech production, upper/lower limb movement, postural stability, and gait.
• Section 4 (MDS-UPDRS-IV, 6 items) concerns motor complications such as the time spent without medication (OFF state) and the time spent with dyskinesia (involuntary movements), among others.

Speech production is evaluated by the neurologist during the patient’s visit to the clinic. The patients are asked to talk about different subjects in order to assess several aspects, including speech volume, intelligibility, and modulation of words, among others. The speech item of the MDS-UPDRS scale considers the following categories for the evaluation (Table 1.1):

Table 1.1: Speech scoring system from the MDS-UPDRS-III.
Score | Category | Definition
0 | Normal | No speech problems
1 | Slight | Loss of voice intensity or modulation
2 | Mild | Some words are unclear
3 | Moderate | Speech is difficult to understand
4 | Severe | Speech is unintelligible

The MDS-UPDRS-III also includes the Hoehn & Yahr (H&Y) scale, which comprises a set of five severity levels, where 1 is associated with minimal or no functional disability and 5 is assigned to patients who are confined to bed or a wheelchair unless aided. There are two variants of the scale: the original one, with integer values for the stages from 1 to 5, and a modified one, with the addition of stages 1.5 and 2.5 for a total of seven severity levels (Hoehn et al., 1998).

The MDS-UPDRS scale is suitable for assessing the neurological state of the patients. However, speech production is evaluated in only one item. Given the complexity of speech, a single item summarizing different aspects such as voice, articulation, fluency, intonation, speaking rate, and intelligibility is not sufficient. The symptoms of motor speech disorders caused by PD are often associated with hypokinetic dysarthria, resulting from problems controlling the muscles and articulators involved in the speech production process. A more suitable clinical scale to evaluate speech impairments is the Frenchay Dysarthria Assessment–2 (FDA–2) (Enderby and Palmer, 2008), a perceptual scale used to evaluate dysarthria considering 34 items distributed in eight sections. Table 1.2 shows the aspects considered in the FDA–2 scale. The patients are asked to perform different tasks in each section. The category Complementary refers to factors that might influence speech production. All sections (excluding Complementary) are rated on a 9-point scale.

Table 1.2: List of items evaluated in the FDA–2 scale.
Category | Items
Reflexes | Cough, swallow, dribble/drool
Respiration | At rest, in speech
Lips | At rest, spread, seal, alternate, in speech
Palate | Fluids, maintenance, in speech
Laryngeal | Time, pitch, volume, in speech
Tongue | At rest, protrusion, elevation, lateral, alternate, in speech
Intelligibility | Producing words, sentences, conversation
Complementary | Hearing, sight, teeth, language, mood, posture, speech rate, sensation (upper lip and tongue tip)

A modified version of the FDA–2 scale, the mFDA, was proposed by Orozco-Arroyave et al. (2018) and was designed to be applied considering only the speech recordings of the patient; therefore, the patient is not required to visit the clinic for assessment. The mFDA is administered considering different speech tasks, including sustained phonation of the vowel /a/, reading, monologues, and the alternating and sequential production of the syllables /pa-ta-ka/, /pa-ka-ta/, /pe-ta-ka/, /pa/, /ta/, and /ka/. The scale has a total of 13 items, each ranging from 0 (normal or completely healthy) to 4 (very impaired); thus, the total mFDA score ranges from 0 to 52. Table 1.3 shows the details of the mFDA scale.

Table 1.3: List of items evaluated in the mFDA scale.
Category | Item | Speech task
Respiration | Duration of the recording | Sustained phonation of the vowel /a/
Respiration | Breathing capacity | Multiple repetitions of /pa-ta-ka/, /pa-ka-ta/, /pe-ta-ka/
Lips | Strength of lip closure | Multiple repetitions of the syllable /pa/
Lips | Lip control | Reading, monologue
Palate | Nasality | Reading, monologue
Palate | Velar movement | Multiple repetitions of the syllable /ka/
Larynx | Phonatory capability 1 | Sustained phonation of the vowel /a/
Larynx | Phonatory capability 2 | Reading, monologue
Larynx | Monotonicity | Reading, monologue
Larynx | Effort to produce speech | Reading, monologue
Tongue | Velocity to move the tongue 1 | Multiple repetitions of /pa-ta-ka/ and /pa-ka-ta/
Tongue | Velocity to move the tongue 2 | Multiple repetitions of the syllable /ta/
Intelligibility | Speech intelligibility | Reading, monologue

The main limitation of the MDS-UPDRS and the mFDA is their lack of precision, since the severity of the disease is evaluated based on a perceptual score that depends on the experience of the clinician.

Speech production

PD affects the speech of the patients in different ways. For instance, stability and periodicity problems are caused by an inadequate closing of the vocal folds, which is related to muscle rigidity (Hanson et al., 1984). Thus, perturbations in the vibration of the vocal folds can be measured by estimating features based on the fundamental frequency (F0) from the sustained phonation of vowels (Almeida et al., 2019; Skodda et al., 2013; Tsanas et al., 2010). Articulation deficits are mainly related to a reduced amplitude and velocity of lip, tongue, and jaw movements, causing a reduced articulatory capability of PD patients to produce vowels and continuous speech (Ackermann and Ziegler, 1991; Skodda et al., 2011).
Such reduction can be measured by computing the triangular Vowel Space Area (tVSA) formed with the formant frequencies F1 and F2 extracted from the vowels /a/, /i/, and /u/, while articulation problems in continuous speech can be detected by analyzing the transitions from voiced to voiceless sounds (and vice versa) and computing spectral features such as the Mel-Frequency Cepstral Coefficients (MFCCs) (Orozco-Arroyave, 2016; Skodda et al., 2011). PD can also influence speech at the segmental level (individual sounds/phonemes) and the suprasegmental level (speech prosody). For instance, at the segmental level, some studies have found that the difficulties of PD patients in controlling laryngeal movements affect the production of stop consonants, e.g., /p/, /t/, /k/, /b/, /d/, /g/ (Fischer and Goberman, 2010). Such difficulties are typically measured by means of the Voice Onset Time (VOT), which is defined as the time interval between the initial burst of a stop consonant and the onset of voicing of the following vowel. The VOT durations produced by patients often differ from those of age-matched healthy speakers (Argüello-Vélez et al., 2020; Montaña et al., 2018; Novotný et al., 2015; Tykalova et al., 2017). Speech deficits at the segmental level can also be detected by estimating the probability of occurrence of phonemes in a speech sequence (phoneme posterior probabilities), which can be achieved by training a deep neural network to learn the representation of several phoneme classes grouped according to different phonological rules (Cernak et al., 2015; Vásquez-Correa et al., 2019). Suprasegmental speech deficits include variation in intonation, reduced loudness, and a variable speech rate, among others (Jones, 2009). These deficits can be measured by means of the F0 contour, the energy content of the signal, and the number of speech units (words, voiced segments) produced by the speakers.
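As an illustration, the tVSA described above is simply the area of the triangle whose vertices are the (F1, F2) pairs of the three corner vowels, which the shoelace formula gives directly. The formant values below are hypothetical placeholders, not measurements from the corpora used in this thesis:

```python
# Sketch: triangular Vowel Space Area (tVSA) from the corner vowels /a/, /i/, /u/.
# The formant values are illustrative placeholders only.

def tvsa(f_a, f_i, f_u):
    """Area (Hz^2) of the triangle spanned by the (F1, F2) pairs of /a/, /i/, /u/,
    computed with the shoelace formula."""
    (x1, y1), (x2, y2), (x3, y3) = f_a, f_i, f_u
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

# (F1, F2) in Hz for each corner vowel -- hypothetical speaker
corners = {"a": (750.0, 1300.0), "i": (300.0, 2300.0), "u": (350.0, 800.0)}
area = tvsa(corners["a"], corners["i"], corners["u"])
print(f"tVSA = {area:.0f} Hz^2")  # tVSA = 312500 Hz^2
```

A reduced tVSA (a smaller triangle) would then reflect the centralized vowel production reported for PD patients.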
Chapter 2 contains more details about the relationship between PD and the speech production system.

1.2.2 Hearing loss

Clinical diagnosis

Hearing loss can appear due to various causes such as aging (senescence), trauma, or inflammation, among others, and often without a known cause. Hearing loss can be acquired or it can be congenital, e.g., because of genetic alterations, intrauterine infections, or malformations. The treatment for hearing loss depends on its severity and cause. The grade of the impairment can be categorized as normal, mild, moderate, severe, or profound depending on audiometry descriptors. Such descriptors are usually obtained by a pure-tone audiometry test, which consists of a threshold search performed by reproducing sinusoidal waveforms (through speakers or headphones) at different frequencies (125 Hz, 250 Hz, 500 Hz, and from 1000 Hz to 8000 Hz in steps of 1000 Hz) and intensity levels. The patient is asked to indicate whether the sounds are perceived by raising a hand or pressing a button. Figure 1.1 shows an audiogram indicating the degree of hearing loss for different loudness and frequency values. For instance, a person that can only hear sounds between 40 dB and 60 dB might suffer from moderate hearing loss.

Figure 1.1: Audiogram indicating the degree and type of hearing loss for different loudness and frequency values. The hearing thresholds correspond to the range of values adopted by the World Health Organization (Olusanya et al., 2019).

Although the pure-tone audiometry test provides useful information about the hearing status of a person, expert clinicians do not rely solely on such a test to determine the adequate treatment of the patient. Treatment options are provided to the patient depending on the type of hearing loss, which can be conductive, sensorineural, or a mixture of both (Weber and Klein, 1999).
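The mapping from a pure-tone threshold to a severity grade can be sketched as a simple lookup. The cut-off values below are approximate and for illustration only; clinical grading follows standardized tables such as the WHO ranges referenced above:

```python
def hearing_loss_grade(threshold_db):
    """Map a pure-tone hearing threshold (dB HL) to a coarse severity
    grade. Cut-offs are approximate, illustrative values only."""
    if threshold_db <= 25:
        return "normal"
    elif threshold_db <= 40:
        return "mild"
    elif threshold_db <= 60:
        return "moderate"
    elif threshold_db <= 80:
        return "severe"
    return "profound"

# A person who only hears sounds above ~50 dB falls in the moderate range.
print(hearing_loss_grade(50))  # moderate
```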
On the one hand, conductive hearing loss occurs due to damage produced in the outer or middle ear or due to a malformation (e.g., of the ear canal or middle ear), causing the person to perceive sounds at low intensity levels. Usually, hearing aids can be used as a treatment option because they amplify the sounds to improve audio perception. Some types of conductive hearing loss can also be treated with medication or surgery. On the other hand, sensorineural hearing loss is related to disorders in the inner ear (cochlea) or the auditory nerve system, resulting in disabling hearing impairment. Usually, therapy consists of the amplification of sounds by hearing aids, which are adapted to the hearing loss at different frequencies in the hearing range. In cases of more profound hearing loss and deafness (in the following summarized as deafness), amplification of sounds is not enough to provide sufficient hearing for speech perception. In this case, CIs are the most suitable devices for treatment. Contrary to hearing aids, a CI bypasses the damaged portions of the ear and directly stimulates the auditory nerve. In the cochlea, frequencies are arranged from high frequencies at the base to low frequencies at the apex. The implant inserted in the cochlea follows this natural representation of the sounds, called “tonotopy”, and stimulates the nerves that correspond to the region of excitation. Although hearing with a CI is quite different from normal hearing, speech understanding can be restored (Lenarz, 2017; Pisoni et al., 2017). Regarding the outcome after cochlear implantation, some aspects need to be considered. The time of occurrence of sensorineural hearing loss also affects the speech perception and production of CI users.
On the one hand, prelingual onset of deafness refers to people who lost their hearing capability before the acquisition of spoken language; their speech production is affected because they have never monitored their own speech (Smith, 1975). On the other hand, postlingual onset of deafness refers to people who lost their hearing after speech acquisition; nevertheless, their speech production might still be affected by the lack of sufficient and stable auditory feedback (Leder and Spitzer, 1990).

Speech production

People suffering from severe/profound deafness may experience different speech production disorders. At the segmental level, such disorders include voicing errors, phoneme misarticulation, and vowel errors, among others (Gold, 1980; Waldstein, 1990). Voicing errors might be caused by failed attempts to coordinate respiration, phonation (voicing), and articulation, resulting in a confusion of the voiced-voiceless distinction. Thus, similar to PD patients, voicing errors can be detected by automatic extraction of voiced sounds, i.e., speech segments with F0 values different from zero. Phoneme production errors are caused by different reasons. For instance, the studies reviewed by Osberger and McGarr (1982) revealed a general trend of hearing-impaired people to better produce the most visible phonemes, e.g., phonemes produced with the lips and/or teeth. Consonant errors can also occur due to incorrect timing of the articulators, e.g., causing nasalization of non-nasal speech sounds due to improper velar control (Kato and Yoshino, 1988; Stevens et al., 1976). Such phoneme articulation errors might cause decreased speech intelligibility, which can be evaluated with Automatic Speech Recognition (ASR) systems, phoneme posterior probabilities, among others. At the suprasegmental level, the speech of severely and profoundly hearing-impaired speakers also exhibits deviations from normal speech in timing and voice quality.
On the one hand, people suffering from hearing loss have been reported to speak slower than healthy people due to the prolongation of speech and non-speech segments (consonants, vowels, pauses), and the insertion of pauses within sentences (Oster, 1990). On the other hand, voice quality problems include abnormally high F0 values (particularly in adolescent and adult males) and insufficient or excessive variation of F0 within a sentence (Gold, 1980). Thus, similar to the speech of PD patients, some of the suprasegmental aspects of speech can be evaluated by computing F0-related features, duration, speech rate, and energy, among others. Chapter 2 contains more details about the role of auditory feedback in the speech production system.

1.2.3 Aging

The speech of the elderly is sometimes described as “slurred”, which comprises slight changes in voicing, articulation, and prosody. The changes in organs and tissues involved in voice production which are associated with the aging process include facial skeleton growth (Israel, 1973), pharyngeal muscle atrophy (Zaino and Benventano, 1977), tooth loss (Adams, 1991), reduced mobility of the jaw (Kahane, 1981), tongue musculature atrophy, and weakening of the pharyngeal musculature. The precise nature of vocal resonance changes is unclear; however, a consistent pattern seems to be a lengthening of the vocal tract with age (Linville, 1996). These changes alter the phonation and articulation dimensions of speech; for instance, elderly people exhibit significantly greater frequency perturbation than young speakers (Benjamin, 1981). There are also differences in the stability of F0 and in the amplitude of vocal fold vibration relative to young and middle-aged adults (Xue and Deliyski, 2001). Changes in F0 and the formant frequencies have also been observed in longitudinal analyses.
Particularly, changes in the first formant frequency are believed to compensate for the decline of F0 in order to maintain the auditory distance between F0 and F1 (Reubold et al., 2010). The influence of some of these parameters on speech assessment has been addressed before when measuring speech intelligibility by considering an Automatic Speech Recognition (ASR) system. In the experiments performed by Vipperla et al. (2010) on adult and older voices, the authors found that elderly people show increased jitter and shimmer, and that these variations have an impact on average phoneme recognition.

1.3 Hypotheses

Since different factors influencing speech production are considered in this thesis, the following hypotheses are investigated:

• It is possible to evaluate the speech production of PD patients, CI users, and elderly speakers using similar signal processing techniques.
• Since PD is a progressive disease that also affects speech, it is possible to assess the progression per patient from speech signals captured in different recording sessions.
• The duration and onset of deafness influence the speech production of CI users in different ways; thus, automatic acoustic analysis can be used to detect these changes.
• It is possible to use smartphone applications to evaluate the speech production of PD patients and CI users.
• Aging affects different aspects of speech production, and such changes can be captured by most of the features considered to analyze pathological speech.

1.4 Objectives

1.4.1 General objective

To propose a methodology for the monitoring of pathological speech signals combining different signal processing techniques and machine learning methods.

1.4.2 Specific objectives

• To identify the contribution of different speech dimensions for the automatic assessment of pathological speech signals.
• To analyze and select the most suitable features to detect changes in pathological speech signals.
• To combine different speech processing techniques and machine learning methods for the automatic assessment of pathological speech signals.

1.5 Contribution of this thesis

• Collection of a speech corpus from PD patients and CI users. The recordings were captured in clinical settings and at the patients’ homes using smartphones.
• A methodology for the automatic detection of VOT in voiceless stop sounds using a deep neural network approach.
• A methodology to monitor the progression of PD patients over time using automatic acoustic analysis.
• A methodology to quantify the phoneme production of CI users using a deep neural network approach.
• A methodology to evaluate the impact of age on different acoustic measurements.
• Implementation of signal processing techniques on smartphones to evaluate the speech production of PD patients and CI users.
• Participation in the development of the mobile application Apkinson, used to collect speech and movement data from PD patients.
• Participation in the development of the mobile application CITA (Cochlear Implant Testing App), which is intended to collect data from CI users in order to evaluate the speech perception and production of the patients. The source code of CITA is based on Apkinson.

1.6 Structure of the thesis

Chapter 2 includes information about the physiological processes of speech production, the influence of PD on speech motor control, and speech disorders associated with the disease. This chapter also gives an overview of the auditory system, cochlear implants, and the role of auditory feedback in speech motor control. Chapter 3 includes information about the contributions in the state-of-the-art methods related to predicting the severity of PD from speech signals, automatic methods used for the analysis of speech production in CI users, and smartphone-based applications developed to evaluate PD and hearing loss.
Chapter 4 describes the speech processing techniques and acoustic features used to model speech disorders. Additionally, this chapter includes the machine learning methods used in this thesis for classification, regression analysis, and speaker modeling. Chapter 5 includes details about the PD patients, CI users, and healthy speakers considered in this thesis. Additional databases used to support the training of models used for automatic speech analysis are also described. Chapter 6 includes the experiments and results obtained for the automatic analysis of PD patients and CI users from speech signals, and the effect of aging on speech production. Chapter 7 summarizes the addressed aspects of pathological speech analysis.

Chapter 2 Speech production process

2.1 Speech chain

In the speech chain model described by Denes and Pinson (1993), oral communication consists of a sequence of events happening on three levels: linguistic, physiological, and acoustic. The process to produce intelligible speech starts in the speaker’s brain, at the linguistic level (Figure 2.1). First, the speaker collects his/her thoughts, decides what words to say, and arranges these words to form sentences according to language-dependent rules. The speech production process continues at the physiological level, with the neural activity inside the brain sending the necessary instructions to activate the muscles that control the vocal folds, tongue, lips, and jaw, among others. The speech production process is completed at the acoustic level, where the movements of the vocal muscles (combined with the air coming from the lungs) generate speech sound waves. Once the speech is produced, it travels through the air, activating the hearing mechanism of the listeners. The auditory feedback plays a key role in oral communication because it helps the speakers to continuously monitor the quality and intelligibility of their own speech.
2.1.1 Physiological processes of speech production

In general, the speech production process involves the complex coordination and activation of different muscles and limbs in the respiratory, laryngeal, and oral motor systems. The respiratory system is essential to produce speech by generating air pressure from the lungs during the expiratory and inspiratory phases. The airflow passes a small valve, the glottis, which is formed by the two vocal folds. During respiration, the vocal folds are in a lateral position. During phonation, the vocal folds close, resulting in vibrations of the soft mucosal tissue as a result of the subglottal pressure and the airflow passing through the glottis (Van den Berg, 1958).

Figure 2.1: The speech production process starts in the brain, at the linguistic level, continues with the neural and motor activity at the physiological level, and is completed with the generation and transmission of sound waves at the acoustic level. The auditory feedback allows the speakers to monitor their own speech. Based on Denes and Pinson (1993).

During oscillation, the vocal folds convert the air into a rapid sequence of airflow pulses generating audible sounds (voice source sounds), which are perceived as a buzz whose frequency is proportional to the vibration rate. During the production of the airflow pulses, the vocal folds pass through four main stages: closed, opening, open, and closing (Figure 2.2). Speech sounds produced in this way are commonly known as voiced sounds. If the vocal folds remain open, then the source of energy for speech production is a stable stream of air coming from the lungs which is made audible by other articulator(s) at some place in the vocal tract. The speech sounds that are not produced by vibration of the vocal folds are commonly known as unvoiced sounds.
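The voiced/unvoiced distinction described above is often approximated computationally on short-time frames: voiced frames show relatively high energy and a low zero-crossing rate, while unvoiced frames tend to show the opposite. A minimal sketch follows; the thresholds are illustrative assumptions, not validated values:

```python
import numpy as np

def classify_frames(signal, sr, frame_ms=25, energy_thr=0.01, zcr_thr=0.25):
    """Label each frame 'voiced' or 'unvoiced' using short-time energy
    and zero-crossing rate. Thresholds are illustrative assumptions."""
    n = int(sr * frame_ms / 1000)
    labels = []
    for start in range(0, len(signal) - n + 1, n):
        frame = signal[start:start + n]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        labels.append("voiced" if energy > energy_thr and zcr < zcr_thr
                      else "unvoiced")
    return labels

# Synthetic check: a 100 Hz tone (voiced-like) followed by weak white
# noise (unvoiced-like), half a second each at 16 kHz.
sr = 16000
t = np.arange(sr // 2) / sr
tone = 0.5 * np.sin(2 * np.pi * 100 * t)
noise = 0.05 * np.random.default_rng(0).standard_normal(sr // 2)
labels = classify_frames(np.concatenate([tone, noise]), sr)
```

Real systems typically rely on a pitch tracker rather than fixed thresholds, but the underlying intuition is the same.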
The oral motor system includes the articulatory mechanisms necessary to modulate the voice source, which allows us to produce speech sounds with different acoustic properties. Such properties depend on the shape of the vocal tract, which can be modified by moving the principal articulators, namely the tongue, lower jaw, lips, and velum. The oral motor system also includes the nasal, oral, and pharyngeal cavities, which act as resonance chambers to transform the stream of air into sounds with additional acoustic characteristics (Benesty et al., 2007; Denes and Pinson, 1993; Fant, 1980). Figure 2.2 shows a diagram of the main articulators and resonators (oral, nasal, and pharyngeal cavities) involved in the speech production process. The air coming from the lungs is the source to generate speech sounds. The muscles in the larynx act as a valve to control the air stream coming from the lungs. The coordination and movements of the different articulators together with the nasal, oral, and pharyngeal cavities provide the acoustic properties necessary to generate different speech sounds.

Figure 2.2: Schematic views of the speech production system. (Left) Vocal fold vibration pattern during the production of voiced speech segments. (Right) Components of the vocal tract used to produce speech sounds. Based on Benesty et al. (2007) and Denes and Pinson (1993).

For instance, the vowel /a/ is commonly produced by a combination of tongue, jaw, and vocal fold movements. The vibration of the vocal folds creates the voice source sound, which is then modulated by opening the mouth (lowering of the jaw) and holding the tongue in a low position. Another example is the production of plosive sounds such as /p/, which is produced by blocking (for a short period of time) the air stream with the lips, building enough air pressure to produce the sound when the closure is released. Generally, the vocal folds remain open when producing the consonant /p/.
Nasal cavities are also used to generate speech sounds. For instance, the nasal consonants /n/ and /m/ are produced during vibration of the vocal folds by blocking the air stream in the oral cavity with the lips (in the case of /m/) or the tip of the tongue (in the case of /n/). Additionally, the velum partially blocks the air to the oral cavity and routes it to the nasal cavity.

2.2 Impact of Parkinson’s disease on speech motor control

2.2.1 Neuropathophysiology of motor control related to Parkinson’s disease

Motor deficits in PD can be analyzed by considering the interaction of the basal ganglia, the motor cortex, and the thalamus (Figure 2.3). The basal ganglia are a group of neural formations (subcortical structures) including the striatum (putamen and caudate nucleus), the Globus Pallidus with its internal (GPi) and external (GPe) segments, the subthalamic nucleus (STN), and the substantia nigra pars compacta (SNpc) and pars reticulata (SNpr). Anatomically, the STN belongs to the subthalamus and the substantia nigra to the midbrain; however, they play a key role in the functioning of the basal ganglia. Motor impairments in PD are mainly caused by a degeneration of dopaminergic neurons in the SNpc, located in the midbrain. The main function of the subcortical structures in the basal ganglia is to send signals to the thalamus, which then influence the activity in the motor cortex. This interaction can be analyzed considering the most basic circuit model of the basal ganglia proposed by Albin et al. (1989) more than 30 years ago. Although more complex connections in the basal ganglia have been discovered since then (Bostan and Strick, 2018; Redgrave et al., 2010), the basic model proposed in the late 80s is still valid to understand some of the most important aspects of motor control related to PD (Milardi et al., 2019).
Figure 2.4 shows a diagram of the neural circuits and neurotransmission mechanisms involved in the communication between the cerebral cortex and the basal ganglia. Basically, the circuit model involves two main parallel loops:

1. The first loop is a cortex-to-cortex circuit in which the motor cortex sends signals to the striatum, from which neural projections travel to the globus pallidus and then continue to the thalamus, which in turn sends information to the motor cortex.
2. The second loop involves activity from the substantia nigra, which projects dopaminergic neurons to the striatum, causing two opposite effects on two different receptors, the D1 and D2 dopamine receptors: excitation (in D1) and inhibition (in D2).

The excitation and inhibition of movements are regulated by the dopaminergic input to the striatum (from the SNpc) and go to the basal ganglia via the direct and indirect pathways:

1 These figures are adapted versions of https://commons.wikimedia.org/wiki/File:Basal_ganglia_circuits.svg and https://commons.wikimedia.org/wiki/File:Midbrainsection.svg (last retrieved 02/02/2021), under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Figure 2.3: Schematic views of the motor cortex, the thalamus, and components of the basal ganglia. (A) shows a lateral view of the left hemisphere of the human brain. The dashed vertical lines represent two coronal cuts (B and C) of posterior sections of the brain. (D) shows a superior view of the midbrain signaling the substantia nigra (with SNpc and SNpr) in a healthy (left) and Parkinson’s disease (right) brain.
GPi: Globus pallidus internal segment; GPe: Globus pallidus external segment; STN: Subthalamic nucleus; SNpc: substantia nigra pars compacta; SNpr: substantia nigra pars reticulata. Adapted from Häggström (2021) and Madhero (2021).

• Direct pathway: The main function of the direct pathway is to excite the motor cortex and to facilitate movement. This pathway begins in the motor cortex, where the neural impulses enter the basal ganglia through the striatum via glutamatergic neurons, which produce an excitatory neurotransmitter called glutamate. Then, the neurons from the striatum send their axons to the GPi and SNpr via GABAergic inhibitory projections. The neurons from the GPi/SNpr communicate with the thalamus, also via inhibitory projections. Then, the excitatory pathways of the thalamus go to the motor cortex, resulting in increased motor activity.
• Indirect pathway: The main function is to inhibit motor activity by suppressing involuntary movement. The pathway begins in the motor cortex by projecting glutamate to the striatum. The neurons in the striatum send their axons to the GPe, then continue to the STN and the GPi/SNpr, which in turn suppress the activity of the thalamus on the motor cortex.

Figure 2.4: Diagram of the internal connections between the motor cortex and the basal ganglia. The dashed red lines indicate inhibitory projections and the green lines indicate excitatory projections. In the direct pathway the striatum communicates directly with the GPi and SNpr. In the indirect pathway, the striatum communicates with the GPi and SNpr through the GPe and the STN. The dopamine projected from the SNpc to the striatum causes excitatory and inhibitory effects on D1 and D2 receptors, respectively. GABA: γ-aminobutyric acid; GPi: Globus pallidus internal segment; GPe: Globus pallidus external segment; STN: Subthalamic nucleus; SNpc: substantia nigra pars compacta; SNpr: substantia nigra pars reticulata.
Based on Obeso et al. (2000).

In summary, dopamine helps to regulate the excitability of the neurons in the striatum, which is involved in body movement. In a healthy brain, the signal that is forwarded from the motor cortex (and continues to the body) is the result (in part) of a balanced activation of neurons in the direct and indirect pathways. In PD patients, decreased dopamine levels cause an increased inhibition of the GPe in the indirect pathway. In parallel, there is a decreased inhibition of the GPi activity in the direct pathway. The result is an increased activity in the GPi/SNpr output of the basal ganglia, which makes it difficult for the patients to control their movements (Obeso et al., 2000).

2.2.2 Motor speech disorders in Parkinson’s disease

The speech production disorders often associated with PD are known as hypokinetic dysarthria, which is the result of a dysfunction in the internal pathways of the basal ganglia. As described by Duffy (2000), hypokinetic dysarthria is characterized by a reduction in the range of movements, rigidity, and slow repetitive movements affecting different dimensions of speech such as phonation, articulation, and prosody. Phonation problems include tight breathiness, hoarse speech, voice tremor, and bowing of the vocal folds. Phonatory deviations are usually evaluated during sustained phonation of vowels. In the case of articulation, the reduced range of movements of the jaw, lips, and tongue results in prolongation of speech sounds, problems to initiate speech, and imprecise articulation of sounds, which can become evident during speech tasks including conversations, reading, and alternating and sequential production of syllables (/pa/, /ta/, /ka/, and /pa-ta-ka/). In the case of prosody, the most common speech disorders include a reduction in the variability of pitch (monopitch) and loudness (monoloudness), rapid speech rate, and reduced loudness.
Prosodic deviations are mainly detected during conversational and read speech tasks.

2.3 Auditory system and speech control

2.3.1 Overview of the auditory system

The auditory system is composed of the outer, middle, and inner ear (cochlea) and regions in the brain including the auditory cortex. Figure 2.5 shows a diagram of the components of the ear. Sound waves travel through the ear canal (an air-filled path) setting the eardrum into vibration. The middle ear (an air-filled chamber) acts as a mechanical bridge between the eardrum and the inner ear by means of three small bones (malleus, incus, and stapes). The movements of the eardrum are transmitted by these bones to the oval window, which is the entrance to the inner ear: the cochlea is a fluid-filled cavity (perilymphatic fluid) with three scalae, coiled in the shape of a snail shell. In the middle scala lies the organ of Corti on the basilar membrane, which contains the hair cells. The mechanical vibrations produced in the middle ear are transformed into electrical signals by the hair cells on the basilar membrane within the cochlea (Figure 2.5).

2 This figure is an adapted version of https://en.wikipedia.org/wiki/File:Anatomy_of_the_Human_Ear_cs.svg (last retrieved 02/02/2021), under the Creative Commons Attribution-Share Alike 2.5 Generic license.
3 This figure is an adapted version of https://medienportal.siemens-stiftung.org/en/cochlea-transparent-uncoiled-101976 (last retrieved 02/02/2021), under the Creative Commons Attribution-ShareAlike 4.0 International license.

Specifically, when the oval window is pushed in by the stapes, the fluids in the cochlea are moved towards the apex,
generating pressure waves at different points in the basilar membrane, which in turn bend the hair cells, releasing a neurotransmitter that fires the auditory neurons that connect the ear with the brain. There are two different types of hair cells: the inner hair cells, which function as receptors, and the outer hair cells, which amplify the incoming signal. A deviation of the basilar membrane leads to a bending of the tiny hairs on top of these cells, which results in a rhythmic elongation and shortening of the outer hair cells according to the frequency representation at their location and, by that, to an increased basilar membrane vibration. The flow of fluid inside the cochlea produced by the inward movement of the oval window is accommodated by the round window at the other end of the cochlea (Denes and Pinson, 1993). The information about the frequencies of the acoustic signals is encoded by the auditory system by locating the places of the basilar membrane in which the pressure waves produce the maximum displacement (vibration) amplitude. For instance, the place of maximum displacement for high frequencies occurs near the base (the stiffest part), while for lower frequencies, the place of maximum vibration displacement occurs towards the apex (Loizou, 1999). After the sound waves are transmitted and transformed into electrical impulses in the inner ear, the receptor neurons transmit the signals over a pathway of nerves (passing through regions of the medulla and the midbrain) connected to the auditory cortex. The phenomenon of frequency-localization-organization called “tonotopy” persists from the cochlea over the neurons to the cortex.

Figure 2.5: (Left) Schematic view of the outer, middle, and inner ear. (Right) Portion of the cochlea in the inner ear. Sound waves are transformed into electrical signals by the bending of the hair cells on the basilar membrane. Adapted from Brockmann (2021).
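This tonotopic frequency-to-place relationship is often modeled with the Greenwood function. The sketch below uses the constants commonly cited for the human cochlea (A = 165.4, a = 2.1, k = 0.88, with x the relative distance along the basilar membrane from apex to base); it is shown only as an illustration of how the characteristic frequency rises from apex to base:

```python
def greenwood_frequency(x):
    """Characteristic frequency (Hz) at relative position x along the
    basilar membrane (0 = apex, 1 = base), using the Greenwood function
    with constants commonly cited for the human cochlea."""
    A, a, k = 165.4, 2.1, 0.88
    return A * (10 ** (a * x) - k)

# Frequency increases monotonically from apex (low) to base (high).
print(round(greenwood_frequency(0.0)))  # 20  (~20 Hz near the apex)
print(round(greenwood_frequency(1.0)))  # ~20000 (near the base)
```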
2.3.2 Cochlear implants (CIs)

As described in Section 1.2.2, sensorineural hearing loss is caused by disorders in the inner ear occurring at birth, due to a disease, as the result of an infection, among others. For instance, meningitis is an infection that can destroy the hair cells within the cochlea. Thus, without the hair cells, the connection between the ear and the central nervous system is broken (Weber and Klein, 1999). CIs bypass the damaged parts by triggering the hearing nerves via direct electrical stimulation through electrodes inserted in the cochlea (Figure 2.6). In general, a CI consists of an external speech processor, which captures, preprocesses, and transforms the speech signals into electrical impulses that are sent to an array of electrodes implanted inside the cochlea of the patient. Commonly, the insertion of the electrodes is performed through the round window. The insertion depth depends on the size of the cochlea and can reach distances close to the apex (Carlson, 2020; Lenarz, 2017). The implants may have 12 or 22 electrodes (only half of them active) along the cochlea. There are a number of factors that can influence the frequency resolution of the sounds perceived with the help of a CI (Brant and Eliades, 2020; Loizou, 1999). Some factors are:

1. The distance between the electrode contacts and the auditory neurons. Neural activation decreases as the result of a decreased strength of the electrical stimulation in the targeted neuron region.
2. The spread of the electrical stimulation. The propagation of the electrical current in the electrodes is spread by the perilymphatic fluid along the cochlea; thus, the electrical excitation is not focused on a single region and might excite the surrounding neurons.
3. The number of auditory neurons available for electrical stimulation is limited. In order for the CI to work properly, there has to be neural tissue left to receive electrical current.
4.
The insertion performed by the surgeon is sometimes difficult, resulting in a diminished number of activated electrodes.

Considering what is mentioned above, it is clear why a CI user may notice differences between the sounds perceived and the sounds produced, even after cochlear implantation (Lane et al., 1995).

4 This figure is an adapted version of https://www.embopress.org/doi/full/10.15252/emmm.201911618 (last retrieved 02/02/2021), under the Creative Commons Attribution 4.0 license.

Figure 2.6: (Left) Schematic view of a cochlea (and cross-section) with normal hearing. (Right) Schematic view of a cochlea (and cross-section) with an implant. Commonly, the electrode array is implanted through the round window. The electrical stimulation of the electrode contact spreads in a region of the target neurons. Adapted from Dieter et al. (2020).

2.3.3 Auditory feedback and speech control

Auditory feedback is the precondition for the constant monitoring and correction of our own speech and, by that, for the development and maintenance of speech movements. As described by Tourville et al. (2008), speech motor control is characterized by feedback and feedforward control. On the one hand, in feedback control, the performance of the movements is evaluated during execution and any deviation is corrected according to sensory information. On the other hand, in feedforward control, the performance of the movements depends on previously learned commands without relying on sensory information. These mechanisms of speech control have been examined from different angles. Some examples of the impact of auditory feedback on these two processes include:

• Voice control, as when a speaker raises his/her voice because the self-perceived loudness is too low or simply to overcome background noise (Lombard effect; Lombard, 1911).
• Speech disfluency caused by delayed auditory feedback (Stuart et al., 2002).

• Adaptation of formant frequencies when a speaker hears a persistent shift in the formants of their own speech (Purcell and Munhall, 2006; Tourville et al., 2008).

Normally, speech production is constantly monitored and compared to an internal speech model in the brain, which is acquired and maintained with the use of auditory feedback (Perkell et al., 2000). In the Directions Into Velocities of Articulators (DIVA) model of speech production proposed by Guenther (1994), the speech movements are planned considering a speech sound map (in the motor cortex) that is activated to: (1) learn speech sound targets and (2) control the necessary articulatory movements to achieve different acoustic goals (Guenther and Hickok, 2016). With ongoing hearing loss, the speech sound map can change slightly; moreover, sensorimotor control decreases, as one tends to use only as much force and effort for movements as necessary (Guenther et al., 2004; Perkell et al., 2007). This has a considerable impact on the speech of people with hearing impairment. For instance, when hearing loss occurs after speech acquisition (post-lingual onset of deafness), somatosensory feedback at first maintains precise speech production. If there is a persistent lack of auditory feedback, speech production may eventually deteriorate due to a diminished precision of articulation.

Summary

The speech production process requires the complex coordination of regions in the brain, the vocal tract, and the auditory system. Depending on the clinical condition, different aspects of speech can be affected, and thus it is possible to detect these changes using automatic acoustic analysis. The following chapter describes the techniques and methods used to model pathological speech signals and to detect speech production changes by analyzing aspects related to phonation, articulation, and prosody.
Chapter 3 State-of-the-art

3.1 Severity estimation of Parkinson's disease from speech

Typically, the assessment of the neurological state of PD patients from speech is performed using regression analysis, which consists of training a model to learn the relationship between acoustic features (extracted from the speech signals) and the clinical score of the patient. Several studies have addressed the prediction of clinical scores of PD patients. Asgari and Shafran (2010) proposed a methodology to predict the UPDRS-III score (motor sub-score) from speech recordings of 61 PD patients and 21 Healthy Controls (HC). Phonation, articulation, and prosody analyses were performed by extracting acoustic features from the sustained phonation of the vowel /a/, the rapid repetition of /pa-ta-ka/, and the reading of three standard texts. The set of features includes F0, jitter (cycle-to-cycle variation of pitch), shimmer (cycle-to-cycle variation of the glottal waveform), spectral entropy (entropy of the log power spectrum), cepstral coefficients (shape of the spectral envelope), and the number and duration of voiced and unvoiced frames, among others. A feature vector was formed for each speaker, and a Support Vector Regressor (SVR) was trained to predict the patients' UPDRS scores. The authors reported that it is possible to estimate the UPDRS-III with a Mean Absolute Error (MAE) of 5.66 using an ε-SVR with a cubic polynomial kernel. Tsanas et al. (2010) performed regression analysis to estimate the UPDRS scores of 42 PD patients (28 male, 14 female). Speech recordings with the sustained phonation of vowels were captured once per week for six months. However, the neurological state of the patients was assessed only three times during that period: at the beginning, three months later, and at the end. Thus, the authors used a piece-wise linear interpolation in order to obtain the missing UPDRS scores.
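That interpolation step can be sketched with NumPy; the assessment days and scores below are hypothetical, chosen only to mirror the setup of three clinical assessments and weekly speech recordings:

```python
import numpy as np

# Hypothetical setup: UPDRS assessed only three times over ~6 months
# (day 0, day 90, day 180), while speech was recorded once per week.
assessment_days = np.array([0.0, 90.0, 180.0])
assessment_updrs = np.array([28.0, 31.0, 36.0])

recording_days = np.arange(0, 181, 7)  # one recording per week

# Piece-wise linear interpolation of the missing weekly target scores.
weekly_updrs = np.interp(recording_days, assessment_days, assessment_updrs)
```

Each weekly recording then has a regression target, at the cost of assuming the clinical score changes linearly between assessments.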
Speech signals were modeled considering acoustic features based on pitch/amplitude perturbation, noise, and entropy. Regression analysis was performed using least squares, the Least Absolute Shrinkage and Selection Operator (LASSO), and Classification And Regression Trees (CARTs). Additionally, the MAE was used to evaluate the performance of the proposed approach in estimating the total UPDRS and the scores of the motor section (UPDRS-III). The authors reported that the CART approach performed best, with an MAE of 7.5 points in the estimation of the total value of the UPDRS scale. The scores of the motor section of the UPDRS were estimated with an MAE of 6 points. Skodda et al. (2013) presented a study where speech deterioration was evaluated over time. The speech of 80 PD patients (48 male, 32 female) was recorded between 2002 and 2012 in two recording sessions. The time between the first and second sessions ranged from 12 to 88 months. A control group of 60 healthy persons (30 male, 30 female) was also considered. The participants were asked to read a text and to produce a sustained phonation of the vowel /a/. In both sessions, the patients were assessed by expert neurologists according to the UPDRS-III. The audio signals were perceptually evaluated considering four aspects of speech: voice, articulation, prosody, and fluency. Acoustic analysis was performed to describe these speech aspects. Voice was modeled with a set of features including jitter, shimmer, and average pitch. The Vowel Articulation Index (VAI) and the proportion of pauses within polysyllabic words were considered for articulation. Prosody was analyzed by estimating the standard deviation of the pitch. In addition, fluency was evaluated considering the speech rate and the pause ratio. To assess the progression of speech and voice impairments, the authors compared the features extracted in the first session to those extracted in the second session.
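The jitter and shimmer measures that recur throughout these studies are relative cycle-to-cycle variations of the pitch period and of the peak amplitude, respectively. A minimal sketch of the "local" variants, over hypothetical per-cycle measurements (analysis tools such as Praat offer more elaborate variants, e.g., averaging over several cycles):

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute cycle-to-cycle period difference, relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Mean absolute cycle-to-cycle peak-amplitude difference, relative to the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Hypothetical glottal-cycle measurements (periods in seconds, linear amplitudes).
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
amps = [0.80, 0.78, 0.82, 0.79, 0.81]
jitter = local_jitter(periods)    # ~2% period perturbation
shimmer = local_shimmer(amps)     # ~3.4% amplitude perturbation
```

Higher values of either measure indicate less stable vocal fold vibration.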
The authors found significant differences for shimmer, speech rate, pause ratio, and VAI when features extracted from the first session were compared to the same features extracted from the second session. However, as the authors stated, the results were not conclusive due to methodological limitations, such as the long time elapsed between the two recording sessions. Bayestehtashk et al. (2015) considered three regression techniques to predict the UPDRS scores: ridge regression, LASSO regression, and linear SVR. Speech recordings of 168 patients were collected in a single recording session. Automatic methods for the acoustic analysis of PD were also addressed in the Parkinson's Condition sub-challenge, which was part of the INTERSPEECH 2015 Computational Paralinguistics Challenge (Schuller et al., 2015). The challenge consisted of predicting the MDS-UPDRS-III score. Recordings of 50 patients (25 male, 25 female) included in the PC-GITA database (Orozco-Arroyave et al., 2014) were considered to form the train and development subsets. The test set included a total of 11 new patients recorded in non-controlled noise conditions, i.e., without a sound-proof booth. A total of 42 speech tasks were considered. The neurological state of the patients was assessed by an expert neurologist according to the MDS-UPDRS-III subscale. The winners of the challenge (Grósz et al., 2015) reported a Spearman's correlation of 0.65 between the real MDS-UPDRS-III scores and the predicted values using deep rectifier neural networks and Gaussian processes. Orozco-Arroyave et al. (2016) presented a methodology to estimate the neurological state (MDS-UPDRS-III) of 158 PD patients: 50 Colombians (25 male), 88 Germans (47 male), and 20 Czechs (all male). The regression process was performed using a linear ε-SVR. The speech tasks included the reading of isolated words, sentences, and a standard text, as well as a monologue.
In order to model articulation problems, the authors extracted the energy in the transitions from unvoiced to voiced (onset) and from voiced to unvoiced (offset) segments, considering different frequency bands distributed according to the Bark and Mel scales. Speech intelligibility was evaluated using an automatic speech recognition system. According to the authors, the neurological state of the patients (MDS-UPDRS-III) can be estimated with a Spearman's correlation of up to 0.74 when several speech tasks are modeled considering the fusion of articulation and intelligibility measures. The openSMILE toolkit, which allows computing more than 6000 descriptors (Eyben et al., 2010), was also considered for feature extraction. The authors reported that the neurological state of the patients could be assessed with an MAE of 5.5. A study for the monitoring of PD progression was also presented by Gómez-Vilda et al. (2017). The authors considered speech recordings from 8 male patients captured twice, with four weeks between sessions. Speech recordings of 100 healthy speakers were considered as a baseline. The participants were asked to perform the sustained phonation of the vowels /a/, /e/, /i/, /o/, /u/, and to read a short sentence and a standard text. The authors used two methods to estimate the features: (i) vocal tract inversion using an adaptive lattice filter and (ii) biomechanical inversion of a 2-mass model of the vocal folds. The features include jitter, shimmer, harmonicity, vocal fold body mechanical stress, and tremor during vibration of the vocal folds. During the recording sessions, the patients continued their pharmacological treatment and received speech therapy. Each patient was evaluated according to the H&Y scale. The relationship between the neurological scale and the acoustic features was evaluated using hypothesis testing based on Bayesian likelihood. According to the authors, the tremor and biomechanical features evolve differently with the treatment.
The authors suggest defining different time intervals between evaluations to obtain more conclusive results. Sztahó et al. (2017) proposed a method to estimate the severity of PD using rhythm-based features. The authors considered speech recordings of 51 PD patients (25 male) and 27 healthy speakers (14 male) from Hungary. All of the patients were evaluated according to the H&Y scale. The speech tasks consisted of a monologue and the reading of a standard text. The set of rhythm features includes the standard deviation of the duration of consonants and vowels, the average duration of speech/pauses, the pause ratio, the percentage of consonants/vowels, the articulation rate, and the raw and normalized Pairwise Variability Index (rPVI, nPVI) of the consonants and vowels. Regression analysis was performed to estimate the severity of the disease using linear regression, SVR, Artificial Neural Networks (ANNs), and Deep Neural Networks (DNNs). The authors obtained a Spearman's correlation coefficient of up to 0.744 (SVR, reading task) between the predicted and the target H&Y scores. Hemmerling and Wojcik-Pedziwiatr (2020) estimated the severity of PD by extracting acoustic features from the sustained phonation of the vowels /a/, /e/, /i/, /o/, and /u/. The set of features includes average F0, jitter, shimmer, energy, spectral moments, MFCCs, and Perceptual Linear Prediction (PLP) coefficients, among others. For this, speech recordings of 27 PD patients from Poland were captured five times over 180 minutes after taking levodopa medication. Additionally, an expert neurologist estimated the UPDRS score of the patients in the five recording sessions. The motor UPDRS scores of the patients were estimated using multiple linear regression, Random Forest (RF) regression, and SVR. The authors reported that the lowest error between predictions and clinical scores was obtained for the vowel /a/ (MAE=1.85) when the regression analysis was performed with RF.
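The pipeline shared by most of the studies above (one acoustic feature vector per speaker, an SVR, and evaluation with MAE and Spearman's correlation against the clinical score) can be sketched as follows. The data here are synthetic stand-ins for acoustic features and clinical scores, and the hyperparameters are illustrative, not those of any cited study:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 speakers x 12 acoustic features; the clinical score
# is a noisy function of the features (real studies use jitter, shimmer, etc.).
X = rng.normal(size=(100, 12))
y = 30 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=2.0, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# epsilon-SVR, as used in several of the studies above (kernel choice varies).
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.5))
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

mae = mean_absolute_error(y_te, y_pred)      # reported as "MAE" above
rho, _ = spearmanr(y_te, y_pred)             # reported as "Spearman's correlation"
```

MAE measures the average deviation in the units of the clinical scale, while Spearman's correlation checks whether the predicted ranking of severities matches the clinical one; the studies above report one or both.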
Other studies have also considered regression analysis to estimate the dysarthria level of PD patients. Cernak et al. (2017) evaluated the changes in the voice quality of the speakers by considering the mFDA score related to larynx deficits (Table 1.3). The authors trained an SVR with phoneme posterior probabilities extracted from recordings of 50 PD patients and 50 HC speakers from Colombia. The speech tasks include the rapid repetition of /pa-ta-ka/, the reading of a standard text, and a monologue. The authors reported Spearman's correlation coefficients of up to 0.57 between the predicted scores and the larynx mFDA score. García et al. (2017) predicted the neurological state and dysarthria level of 50 PD patients according to the MDS-UPDRS-III and mFDA scores, respectively. Acoustic analysis was performed by considering different pitch, loudness, duration, and filterbank parameters. These features were extracted from 4 speech tasks, including the rapid repetition of syllables (e.g., /pa-ta-ka/), a monologue, and the reading of a text and of different sentences. Then, the i–vector approach was considered to obtain the speaker models (or embeddings) of 50 PD patients and 50 HC speakers from Colombia (see Chapter 4). The authors reported that it was possible to predict the MDS-UPDRS-III with a Spearman's correlation of 0.63 when phonation and articulation features extracted from the sentences were considered to train the i–vectors. Additionally, the mFDA was predicted with a Spearman's correlation of 0.72 when considering the rapid repetition of /pa-ta-ka/ modeled with phonation, articulation, and prosody features. Vásquez-Correa et al. (2018) estimated the dysarthria level of 68 PD patients and 50 HC speakers from Colombia. The set of speech tasks included the sustained phonation of the Spanish vowels, the reading of 10 sentences, a standard text, a monologue, and the rapid repetition of /pa-ta-ka/, /pa-ka-ta/, /pe-ta-ka/, /pa/, /ta/, and /ka/.
Automatic acoustic analysis was performed with i–vector speaker models obtained from phonation, articulation, prosody, and intelligibility-based features. Additionally, three variants of ridge regression (linear, kernel, Bayesian) and two variants of SVR were considered to estimate the mFDA scores of the patients and the HC speakers. The authors reported that the highest Spearman's correlation coefficient was 0.69, obtained for articulation features extracted from continuous speech. Karan et al. (2020) combined F0 and Hilbert spectral features to estimate the mFDA score of 70 PD patients. The authors considered speech recordings with the sustained phonation of the vowels /a/, /e/, /i/, /o/, and /u/ and the reading of 10 isolated words. Regression analysis was performed with an ε-SVR. The authors reported Spearman's correlations of 0.75 (for the vowel /o/) and 0.77 (for the word reina, "queen"). Table 3.1 summarizes the studies related to the severity estimation of PD. In general, the sustained phonation of vowels and the reading of a standard text are the most frequently used speech tasks to assess the patient's neurological state. As described in Section 1.2.1, the sustained phonation task allows the detection of phonation problems, while the acoustic analysis of the reading task allows evaluating articulation and prosody problems. The most common biomarkers considered to model speech problems include pitch (F0, jitter), harmonicity (e.g., the harmonics-to-noise ratio), and the spectral energy of the signal. Furthermore, the SVR has proven suitable for modeling the relationship between the acoustic features and the clinical score.

3.2 Speech analysis of cochlear implant users

Oral communication skills of severely and profoundly hearing-impaired speakers can be improved by cochlear implantation.
Such an improvement has been observed as a better contrast in the production of consonants and as a decrease in average F0, loudness, and duration of speech segments. Nevertheless, the speech production of CI users remains affected even after rehabilitation by cochlear implantation. Plant and Oster (1986) investigated pitch, duration, and articulation changes in the speech of one female speaker recorded in two sessions: before and after cochlear implantation. The speech tasks consisted of the reading of a text and of a list of words. Pitch and duration were evaluated from the reading of the text by computing the average and standard deviation of the F0 contour, the total phonation time, the average duration of the pauses, and an estimated value of the articulation rate (the number of syllables divided by the total phonation time). Articulation was evaluated by extracting the vowels from the list of words and computing the ratio between the first and second formants (F1/F2) to detect shifts in the vowel space area. The authors reported that after implantation, the speech parameters of the CI user moved towards "normality" values, which were obtained by performing the same analysis on the recording of an age-matched typical hearing speaker. As stated by the authors, the main limitation of that study was that only one speaker was considered. Furthermore, the authors believe that the speech improvement of the CI user may be the result of training.

Table 3.1: Summary of works related to the severity estimation of PD. Longitudinal analysis refers to studies that consider several speech recordings captured in different sessions from the same patients.

Authors | Subjects | Acoustic parameters | Method (best result) | Clinical scale | Longitudinal analysis
Asgari 2010 | 61 PD/21 HC | Loudness, duration, entropy, harmonicity, pitch, spectral energy | SVR | UPDRS-III | No
Tsanas 2010 | 42 PD | Pitch, harmonicity, nonlinear analysis | CART | Total UPDRS, UPDRS-III | Yes
Skodda 2013 | 80 PD | Pitch, articulation, fluency, harmonicity | Shapiro-Wilk statistical test | UPDRS-III | Yes
Bayestehtashk 2015 | 168 PD | Loudness, duration, entropy, harmonicity, pitch, spectral energy | SVR | UPDRS-III | No
Grósz 2015‡ | 61 PD/50 HC | Articulation | Gaussian processes | MDS-UPDRS-III | No
Orozco-Arroyave 2016 | 158 PD∗ | Articulation, intelligibility | SVR | MDS-UPDRS-III | No
Gómez-Vilda 2017 | 8 PD/100 HC | Pitch, harmonicity, vocal fold tremor, body mass features | Bayesian likelihood | H&Y | Yes
Sztahó 2017 | 51 PD/27 HC | Speech rate, duration, rhythm | SVR | H&Y | No
Cernak 2017 | 50 PD/50 HC | Phoneme posterior probabilities | SVR | mFDA (Larynx) | No
García-Ospina 2017 | 50 PD/50 HC | Pitch, loudness, articulation, duration | i–vectors | mFDA, MDS-UPDRS-III | No
Vásquez-Correa 2018 | 68 PD/50 HC | Speaker embeddings with i–vectors | SVR | mFDA | No
Hemmerling 2020 | 27 PD | Pitch, loudness, spectral energy, filterbank features | SVR | UPDRS-III | No
Karan 2020 | 70 PD | Pitch, Hilbert spectral features | SVR | mFDA | No
∗ This study includes speakers from Colombia (50), Germany (88), and the Czech Republic (20).
‡ Winners of the Parkinson's Condition sub-challenge (Schuller et al., 2015).

Perkell et al. (1992) performed acoustic analysis considering speech recordings of four postlingually deafened CI users. The recording sessions were performed pre- and post-activation of the speech processor. Post-activation recordings were captured at different week intervals. The features considered for the analysis were F0, F1, F2, Sound Pressure Level (SPL), duration, and the amplitude difference between the first two harmonic peaks in the log-magnitude spectrum. The speech tasks consisted of reading nine vowels (included in predefined words) spoken in a carrier sentence.
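Several of the works in this section quantify vowel articulation through the first two formant frequencies. A common summary measure is the vowel space area (VSA): the area of the polygon spanned by the corner vowels in the F1-F2 plane, which shrinks when vowels are centralized. A minimal sketch with the shoelace formula; the formant values below are hypothetical, for illustration only:

```python
import numpy as np

def vowel_space_area(formants):
    """Shoelace area (in Hz^2) of the polygon spanned by (F1, F2) vertices.
    `formants` is a list of (F1, F2) pairs ordered around the polygon."""
    f1 = np.array([p[0] for p in formants], dtype=float)
    f2 = np.array([p[1] for p in formants], dtype=float)
    return 0.5 * abs(np.dot(f1, np.roll(f2, -1)) - np.dot(f2, np.roll(f1, -1)))

# Hypothetical corner-vowel formants (Hz) for /i/, /a/, /u/.
normal = [(300, 2300), (750, 1300), (350, 800)]
reduced = [(380, 2000), (650, 1350), (420, 950)]  # centralized vowels

shrunk = vowel_space_area(reduced) < vowel_space_area(normal)
```

A reduced VSA relative to matched typical speakers is the pattern reported for CI users in the studies summarized below.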
The authors reported that, after activation, many of the acoustic parameters moved toward the values reported in previous studies with healthy speakers. However, these results were based on the outcome of only four speakers. Lane et al. (1995) measured the VOT in stop-initial syllables produced by five CI users. Short-term and long-term analyses were performed. For the short-term analysis, the recordings were captured after turning off the speech processor of the patients for 24 hours, then turning it on, and then off again. For the long-term analysis, speech recordings were captured before and after activation of the speech processor at intervals of 0, 4, 12, 26, 52, and 104 weeks. The speech task consisted of the reading of the six English stop consonants embedded in a carrier sentence. The VOT measurements were performed manually. The authors examined the effect of processor activation and found increased VOT measurements in the voiced stop consonants for the short-term analysis and increased VOT values for the long-term analysis. The authors suggest that the changes in voiced stops are related to concurrent changes in pitch and SPL. In the case of voiceless stops, the changes are linked to the auditory validation of phonemic settings. One limitation of this study is the small number of speakers considered for the experiments. Gould et al. (2001) examined the speech intelligibility of four postlingually deafened adults before activation and after 6 and 12 months of activation. The participants were instructed to produce ten repetitions. Speech intelligibility was measured for vowels and consonants individually using a metric called the percentage of transmitted information. The authors reported an overall improvement in word intelligibility; however, such an improvement was not consistent for individual consonants or vowels. Blamey et al. (2001) analyzed the speech production of nine children for six years after implantation.
Speech intelligibility was assessed by considering phonetic transcriptions of conversational speech. The transcriptions were used to measure the percentage of correctly produced words. The authors observed an increase in speech intelligibility, length, and phonemic accuracy during the six years. However, the rate of improvement was considerably slower than that observed in normally-hearing children, who developed their linguistic skills at a younger age. Hassan et al. (2012) evaluated speech nasalization considering 25 postlingual CI users and 25 age-matched HC from Saudi Arabia. The patients were divided into three groups according to the duration of hearing loss before implantation: (1) less than three years (7 patients), (2) between 3 and 6 years (8 patients), and (3) more than six years (10 patients). For the evaluation, percentage scores of nasalance were obtained from two sentences read by the participants. The scores were obtained with a nasometer, which measures the acoustic output from the oral and nasal cavities. Nasalance scores were obtained for each patient before implantation and after 6, 12, and 24 months of CI activation. The authors reported that for the three groups of patients there is a tendency for the nasalance scores to decrease over time; however, the level of nasality was still higher than in the control group. Furthermore, the authors found that the degree of nasality and the improvement over time depend on the duration of hearing loss. In the study presented by Ubrig et al. (2011), deviations in the phonation of CI users were investigated. For this, the authors considered speech recordings of 40 postlingual CI users and 12 postlingually hearing-impaired adults without implants from Brazil. Two recording sessions were performed for the CI users: before implantation and 6-9 months after activation of the device.
Acoustic analysis was performed by computing the average and standard deviation of the F0 contour obtained from recordings of the sustained phonation of the vowel /a/ and the reading of a standard text. The authors found a significant reduction of the F0 variability when comparing the first to the second recording session. Other works have investigated the impact of the onset of hearing loss, i.e., pre- vs. post-lingual, on the speech of CI users. Vowel articulation of pre- and post-lingually deafened CI users was evaluated by Neumeyer et al. (2010). Speech recordings of 10 CI users (5 prelingual) and 10 age-matched normal hearing speakers from Germany were considered for the test. Articulation analysis was performed by computing the vowel space of /a/, /e/, /i/, /o/, and /u/, extracted from target words included within 20 standard sentences. The acoustic parameters extracted from the vowels were the first and second formant frequencies. The authors reported a reduction of the vowel space area for the CI users compared to the normal hearing speakers; in particular, such a reduction is mainly caused by the misarticulation of the back vowels (/o/, /u/). One reason the authors give is that such vowels are produced with tongue movements that are not visible to the CI users. The authors did not report differences between pre- and post-lingual CI users. They suggest that since postlingual CI users spent years without sufficient hearing and auditory feedback before implantation, their articulatory capability was diminished. Pre- and post-lingual CI users have also been found to have a limited production contrast of sibilant sounds, e.g., /s/ and /S/. Todd et al. (2011) analyzed speech recordings from 33 CI children (all prelingual) and 43 age-matched HC native English speakers from the United States. All children were asked to read 18 words with the sibilant sound (/s/ or /S/) in the initial position.
The target phonemes were manually transcribed and evaluated by trained native speakers. The acoustic analysis was performed by computing the energy in the Bark scale from a Hamming window of 40 ms located in the middle of the sibilant sound. Then, only the Bark band with the highest energy was selected for evaluation. From the transcription analysis, it was observed that the CI children produced the /s/ with less accuracy than the /S/. Furthermore, the children produced these two phonemes with less accuracy than the HC. Regarding the acoustic analysis, the authors found that the CI children produced the sibilant sounds with less energy than the control group, which results in a reduced contrast between /s/ and /S/. The authors suggest that such a diminished contrast may be caused by the poor frequency resolution of the implant. Similarly, Neumeyer et al. (2015) analyzed the German sibilant sounds /s/ and /S/ produced by 48 CI users (24 prelingual) and 48 HC speakers. The patients were divided into four groups depending on the onset of hearing loss (pre-/post-lingual) and the time between hearing loss and cochlear implantation (before/after language acquisition). The study participants were asked to read a carrier sentence containing two words that differ in only one consonant and have different meanings: Tasche (bag) and Tasse (cup). Acoustic analysis was performed by manually segmenting the sibilant sounds from the recordings and then computing the first spectral moment. From the results, the authors concluded that the sibilant production of CI users deviates from normal speech, and that the onset of deafness plays a role in the degree of the deviation, but that the duration between the onset of hearing loss and implantation has no significant impact on sibilant production.
The authors explained that such deviations might occur because the spectral resolution of the implant is lower at higher frequencies; thus, CI users shift the production of the sibilant sounds into the frequency range they can perceive. The speech intelligibility of pre- and post-lingual CI users can also be affected in different ways. Ruff et al. (2017) performed an automatic evaluation of speech intelligibility using an ASR system. The authors considered recordings of 50 CI users (14 prelingual, 36 postlingual) and 50 HC native German speakers for the experiments. The patients were divided into three groups: (1) prelingually deafened CI users with more than two years of deafness before surgery, (2) postlingually deafened CI users with less than two years of deafness before surgery, and (3) postlingually deafened CI users with more than two years of deafness before surgery. The study participants were asked to read a total of 97 words that contain every phoneme of the German language in different positions within the words. Then, the Word Recognition Rate (WR) was computed from the automatic transcriptions obtained with the ASR system. The system was trained with 27 hours of speech recordings, using the 97 words from the test as the vocabulary. The authors found that CI users with a postlingual onset of hearing loss and a short duration of deafness (< 2 years before surgery) have a higher WR than postlingually deafened users with a long duration of deafness and than prelingually deafened users. Furthermore, the postlingual CI users with a short duration of deafness showed a WR similar to that of the HC speakers. Gautam et al. (2019) presented a review of more than 25 studies (from 1983 to 2017) related to speech and voice changes due to hearing loss and the effect of CIs in adults and children. The acoustic parameters evaluated in those works include pitch, loudness, consonant contrast, speech duration/rate, vowel articulation (VSA), and VOT.
Changes in speech and voice due to hearing loss include: (1) increased pitch, loudness, and duration of speech, (2) reduced VSA and VOT, and (3) a slower speech rate. The studies in the literature have reported that most of these parameters move towards normality after cochlear implantation; however, speech and voice deviations are still present. Table 3.2 shows a summary of the works reviewed in this section. Although the speech production of CI users has been addressed before, the number of studies considering automatic methods for acoustic analysis is limited. From the works reviewed, it can be observed that speech and voice parameters such as pitch and loudness deviate from normality values even after implantation. Furthermore, the poor contrast in the production of some phonemes, such as /s/ and /S/, has been associated with the limited resolution of the CI to provide good perception to the patients.

Table 3.2: Summary of works related to acoustic analysis of speech production of CI users.
Authors | Subjects | Acoustic parameters | Method | Effect of hearing loss | Automatic analysis
Plant 1986 | 1 CI/1 HC | Pitch; duration; vowel articulation | Mean and variation of F0; voiced segments; formant frequencies | Reduced F0; longer duration; reduced VSA | No
Perkell 1992 | 4 CI | Pitch; duration; loudness; vowel articulation | Mean F0; vowel duration; mean SPL; formant frequencies | Reduced F0; longer duration; reduced loudness; reduced VSA | No
Lane 1995 | 5 CI | Duration | Voiced and voiceless VOT | Reduced VOT | No
Gould 2001 | 4 CI | Intelligibility | Percentage of information transmitted | Poor speech intelligibility | No
Blamey 2001 | 9 CI | Intelligibility | Percentage of correct words | Poor speech intelligibility | No
Neumeyer 2010 | 10 CI | Vowel articulation | Formant frequencies to estimate the VSA | Reduced VSA | No
Todd 2011 | 33 CI | Consonant articulation | Bark energies to evaluate the production contrast between /s/ and /S/ | Poor contrast | No
Ubrig 2011 | 40 CI/12 HI∗ | Pitch | Mean and standard deviation of F0 | Higher variation of F0 | No
Hassan 2012 | 25 CI/25 HC | Nasality | Nasometer to estimate the nasality level | Higher level of nasality | No
Neumeyer 2015 | 48 CI | Consonant articulation | Spectral moment to evaluate the production contrast between /s/ and /S/ | Poor contrast | No
Ruff 2017 | 50 CI/50 HC | Intelligibility | Word recognition rate using an ASR system | Lower word recognition rate | Yes
Gautam 2019 | NA† | Pitch; loudness; duration; speech rate; vowel articulation; consonant articulation | Mean F0 and jitter; mean SPL and shimmer; word/syllable duration and VOT; speaking rate; formant frequencies (VSA); /s/ vs /S/, /r/ vs /l/ | Increased pitch; increased loudness; longer durations; slower rate; reduced VSA; poor contrast | -
∗HI: Hearing impaired. †Information about the number of speakers not available.

3.3 Aging and speech

Some studies have analyzed the impact of aging on speech.
Xue and Deliyski (2001) considered sustained phonations of the English vowel /a/ and computed fifteen phonation measures from the Multi-Dimensional Voice Program. The set of measures includes F0, jitter, the Pitch Perturbation Quotient (PPQ), the Relative Average Perturbation (RAP), the variability of F0, the Amplitude Perturbation Quotient (APQ), shimmer, and the Noise-to-Harmonics Ratio (NHR), among others. A total of 44 speakers (21 male and 23 female) aged between 70 and 80 years were considered and compared with respect to the norms for young and middle-aged adults published by Deliyski and Gress (1998). The authors performed statistical analyses and reported that the voice of elderly people is significantly different from (usually poorer than) the voice of young and middle-aged adults. Goy et al. (2013) considered several phonation measures to assess the stability of vocal fold vibration and to quantify the noise in the voices of 159 younger speakers aged between 18 and 28 years and 133 older adults aged between 63 and 86 years. The authors concluded that the instability of vocal fold vibration increases with age. The dysphonia severity index was also measured, and only older females exhibited higher values than younger females; no statistical differences were observed between younger and older males. Another study that evaluated the influence of aging on the speech of elderly people, considering phonation and articulation analyses, was presented by Torre and Barlow (2009). A total of 27 young speakers with a mean age of 25.6 years and 59 older people with a mean age of 75.2 years were considered. Each participant was asked to read a set of 22 consonant-vowel-consonant words. The vowels and oral stops of each word were extracted and analyzed using Praat (Boersma and Weenink, 2001). The authors analyzed several acoustic properties, including F0, the first three formant frequencies, and the VOT.
According to the results, there was a decrease of F0 with age for women and an increase of F0 with age for men. This finding is consistent with the results reported by Benjamin (1981). The authors also highlighted that older men showed shorter VOTs than both younger men and younger women, which is also reported by Benjamin (1982). A greater variability in F0, the three formants, and the VOT was systematically observed in the speech produced by older adults compared to their younger same-sex counterparts. As the natural aging process in humans entails several alterations in speech production and perception, the impact of aging on the detection of voice disorders is still an open problem, and its relevance in clinical practice was studied by Pernambuco et al. (2017). The relationship between age and s