
From the July-August 2024 issue
Volume 112, Number 4, Page 198
DOI: 10.1511/2024.112.4.198

The study of the human voice is a highly complex field. Whether or not a person has a voice disorder can be a matter of perception. Some singers may have very breathy voices, a quality considered unique and beautiful, but someone speaking with the same kind of breathiness may feel that they have a problem. Another patient may have hoarseness and show anomalies in the vibrations of their vocal folds, but if they feel that they have a normal voice, how would a clinician diagnose a disorder? Maryam Naghibolhosseini studies communicative sciences and disorders and is director of the Analysis of Voice and Hearing Laboratory at Michigan State University. She uses a suite of engineering and technology approaches—including advanced image processing, machine learning, and statistical analysis techniques—combined with physiology and perception, to better understand, diagnose, and individually treat a range of vocal disorders. In 2023, Naghibolhosseini was the recipient of Sigma Xi’s Young Investigator Award. She spoke with American Scientist editor-in-chief Fenella Saunders about her research. (This interview has been edited for length and clarity.)


Courtesy of Maryam Naghibolhosseini

You started with more of an engineering background, so what made you decide to study voice and hearing systems?

I started as a passionate engineer. You can really see the application of your work in real-world problems, which to me is really satisfying. But I wanted to be able to use those engineering tools to help people improve their health. So my focus was always on the biological applications of engineering. I got interested in studying otoacoustic emissions—how our ears generate sound. We use these sounds that we record from the hearing system to understand how our hearing functions. And this is quite important when we are looking at, for example, infant hearing, or hearing in people who have cognitive problems and cannot properly communicate their issues with you. It’s an objective tool, and it made me interested in thinking about more objective methods and tools to evaluate human health in general.

That brought me to study voice science and voice disorders, along with our hearing system. Our hearing system and our voice production system are related, because when I’m talking, you’re listening to my voice. So your hearing system is involved with this process, and it is also involved when we evaluate someone’s voice. I do all that processing using my hearing system, so it plays an important role in voice perception. So I’m looking at both voice production and voice perception as a whole. But of course, I use my engineering background to develop all the methodologies and tools that we use in our research.


Is the vocal production system connected to the vocal perception system?

When we think about connected systems, it brings us back to the brain, because our brain controls all these processes. For example, when an infant is going through speech development stages, they have to hear sound in order to be able to generate it. But the high-level processing of our voice production and voice perception systems are separate. The main part of our voice production system is located in our larynx, and it’s called the phonatory system. It’s a very complex system that is controlled by more than 20 muscles within the larynx, and then we have other muscles that are connected with other parts of the body, such as the jaw or sternum. In addition, our respiratory system plays a role in voice production; we need to exhale to be able to talk. And then we have the articulatory system, for example our mouths, which are actually shaping the sound that is coming toward you. It’s a collaboration of several subsystems working together.

And then the hearing system is a whole separate system that has several other structures. We have, for example, the peripheral auditory system, and then we have the central auditory system that processes the more high-level information in the sound signals. We are looking at how the brain actually processes the sound at different stages. And we are trying to use artificial intelligence to simulate the same kind of processing approach in order to act the same way as someone who listens to your voice, basically repeating the same kind of voice perception process, but through an AI system.

You also study dysphonia, so how is that diagnosed?

When we say dysphonia, it means that the quality of our sound is not quite right. For example, I can have a really breathy voice, and you might have a hard time hearing my voice, or my voice may be interrupted. I can talk, but there is a problem in the biomechanics or the physics of the voice production in the larynx. That’s basically what we mean by dysphonia.

We study neurogenic voice disorders that are very difficult to diagnose clinically because they sound similar to some other disorders. The mechanisms that are happening at the vocal fold level, or in the larynx or voice box, are different from one another, but they are very difficult to distinguish in a clinical setting. So that’s why we want to develop some research tools, and that’s where AI comes in. We use high-speed video to capture thousands of images from the vibrations of the vocal folds while someone is talking. This tool enables us to capture a lot of information that is critical in order to understand and distinguish these disorders. We use AI to analyze the images and extract information or features from them. AI can go through our dataset very quickly and capture hidden features that we cannot see by just clicking on the images and going through them; that kind of review is also practically impossible in a clinical setting, where time is limited. So AI facilitates our search for information in the huge datasets that we collect.
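
To give a sense of the kind of feature such an analysis can pull out of high-speed video, here is a minimal Python sketch, assuming a hypothetical file name, region of interest, and threshold, that estimates a glottal-area-like signal by counting dark pixels frame by frame; the open glottis usually appears darker than the surrounding tissue. It is only an illustration of the general idea, not the laboratory's pipeline, which relies on far more sophisticated, learning-based image analysis.

```python
# Illustrative sketch only: estimate a glottal-area-like signal from
# high-speed laryngeal video by thresholding dark pixels in each frame.
# File path, region of interest, and threshold are hypothetical placeholders.
import numpy as np
import cv2  # OpenCV

def glottal_area_waveform(video_path, roi=(100, 100, 200, 200), thresh=60):
    """Return a per-frame count of dark pixels inside a region of interest.

    The open glottis typically appears as a dark region between the brighter
    vocal folds, so the dark-pixel count is a crude proxy for glottal area.
    Real pipelines use learned segmentation instead of a fixed threshold.
    """
    cap = cv2.VideoCapture(video_path)
    x, y, w, h = roi
    areas = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        patch = gray[y:y + h, x:x + w]
        # Pixels darker than the threshold are counted as "glottis."
        areas.append(int(np.count_nonzero(patch < thresh)))
    cap.release()
    return np.asarray(areas)

if __name__ == "__main__":
    waveform = glottal_area_waveform("hsv_recording.avi")  # hypothetical file name
    print(f"analyzed {len(waveform)} frames")
```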

How else is artificial intelligence useful in your research?

We can also use AI in a way that is just like facial recognition, in which AI can determine where the eyes or eyebrows or nose are located and then also capture the emotions of people based on their features. We can use the same approach with our high-speed videos, in which we have many different tissue structures. We want to classify those images automatically, without needing someone to go through every single image in order to determine every structure. So we design AI tools that can go through the data in a matter of minutes and capture all these various structures. We determine what features we’re interested in, we teach the AI, and it goes and finds these landmarks. Then we can study the movements and the behaviors of various tissues. So we use AI in two different capacities. One way is to capture the kind of information that we can also see but that is hard for us to find by manually going through the data. The other way is as a tool to capture hidden information. AI can bring out the kind of information that would help us to do a differential diagnosis of various disorders, but at the same time, this is information that we cannot capture easily ourselves.
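
The landmark-finding idea can be sketched in code. Below is a hypothetical, minimal PyTorch example of a small convolutional network that learns to regress a few (x, y) landmark coordinates from annotated grayscale frames; the architecture, the landmark set, and the random stand-in data are illustrative assumptions, not the models used in this research.

```python
# Hypothetical sketch of landmark regression on laryngeal video frames,
# in the spirit of facial-keypoint detection. Not the lab's actual model.
import torch
import torch.nn as nn

NUM_LANDMARKS = 4  # e.g., anterior commissure and vocal process positions (illustrative)

class LandmarkNet(nn.Module):
    """Tiny CNN mapping a 1x128x128 grayscale frame to NUM_LANDMARKS (x, y) pairs."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64x64
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32x32
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, NUM_LANDMARKS * 2),  # (x, y) per landmark
        )

    def forward(self, x):
        return self.head(self.features(x)).view(-1, NUM_LANDMARKS, 2)

# Supervised training loop over human-annotated frames (data are random stand-ins).
model = LandmarkNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

frames = torch.rand(8, 1, 128, 128)        # stand-in for annotated video frames
targets = torch.rand(8, NUM_LANDMARKS, 2)  # stand-in for labeled coordinates
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(frames), targets)
    loss.backward()
    optimizer.step()
```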

What information is AI capturing that people can’t detect otherwise?

It could be information that is hidden in the temporal domain in our data. For example, we may have about 300,000 images for maybe just one minute of recording. Instead of the traditional way of using AI, where we teach it the features that we’re interested in, another approach is to not tell the AI what to look for. Instead, we give the AI the raw data, and then the AI does this complex analysis of the dataset, which I would call a trial-and-error type of work. We still need to provide the AI some guidance on how to do this analysis. But basically, the AI works on its own as a tool for feature extraction. It creates this huge information map for us with different types of information that we might be interested in, such as the time intervals that you have to look at in order to distinguish a specific breathiness quality in the voice, along with the strain that you might hear in the voice.
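
One common way to let a model "work on its own" as a feature extractor is an autoencoder, which learns to compress each frame into a short vector and reconstruct it; the compressed vectors can then be mined for temporal patterns. The sketch below is a hypothetical, minimal PyTorch illustration under those assumptions, not the specific method used in this research.

```python
# Hypothetical sketch of unsupervised feature extraction with an autoencoder.
# The learned bottleneck vectors (one per frame) can later be examined for
# temporal patterns that distinguish voice qualities. Not the lab's pipeline.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Compress a 1x64x64 frame to a 32-dimensional feature vector and back."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),    # 32x32
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),  # 64x64
        )

    def forward(self, x):
        z = self.encoder(x)        # per-frame feature vector
        return self.decoder(z), z

model = FrameAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

frames = torch.rand(16, 1, 64, 64)  # stand-in for high-speed video frames
for _ in range(5):
    recon, features = model(frames)
    loss = nn.functional.mse_loss(recon, frames)  # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# `features` is a (frames x 32) matrix of learned descriptors, one row per frame.
```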

How do you record these high-speed videos while someone is talking?

We use two methods. One way is to connect the high-speed camera to what is called a rigid endoscope. Basically, it’s a cylinder that goes through the mouth and toward the back of the tongue. Then you can capture videos from the vibrations of the vocal folds. The camera itself is positioned outside, connected with the scope. When the endoscope is in the mouth, the person can just produce some vowels, or do what we call a glide, going up and down in pitch. But it’s very limited in terms of what the person can say.

Our other method connects the high-speed camera to a flexible scope that can go through the nose to a position above the vocal folds. From there, we can capture videos of the vibrations of the vocal folds. The benefit is that the person can actually speak a combination of various words and vowels and consonants. We recently completed some of this data collection at the Mayo Clinic in Arizona. Some of the voice disorders we’re looking at are called task-specific voice disorders, which means that the way the symptoms of the disorders appear very much depends on what the person is saying. One person might be able to say, “Aaaaah,” with no problem; everything sounds normal. But as soon as they start actually talking and putting sentences together, that’s where you notice the problem. This specific disorder is related to laryngeal dystonia, and it’s very rare. But it is a really interesting disorder because it also can happen to people who are singers, and it can affect only the singing voice, leaving speech untouched. People might be able to talk with normal speech, but as soon as they start singing you can see the symptoms appearing. We have one subject whom we are looking at now to see whether we can better understand this disorder.

What are your current research results with laryngeal dystonia?

One interesting thing that we are looking at now is how the behavior of voice production, what we call laryngeal behavior, would be affected based on what the person is actually saying. So we are looking at the phonemic content of speech, meaning that we are looking at how people go from different vowels to different consonants, and how that might elicit a more severe or a milder symptom in the patient.

An interesting thing that we observe is that we have a lot of variability from one subject to another with the same disorder. We put them in the same category clinically when they come in, and the treatment approach is usually the same—for laryngeal dystonia, it’s Botox injection. The treatment is mainly based on laryngologists’ experience of what has worked.

That’s why we are trying to look into the variability between individual subjects, toward our longer-term goal of designing individualized treatment approaches. You might have some symptoms that overlap, but if you do the same kind of treatment, would that work or not? So we’re trying to answer these types of questions, which is related to some of the work that we do in terms of modeling voice production.

We are also designing mechanical models to see how the flow of air from the lungs interacts with the vocal folds and the movements of the tissue to generate sound, and what happens if there is a disruption in some of these systems. How would that show itself in someone’s voice? And how would we approach it in order to think about treatment strategies if we observe these anomalies in various places?
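
Lumped-element models are one standard way to study this kind of airflow and tissue interaction. The sketch below is a deliberately simplified, single-mass spring and damper model of one vocal fold, in which the airflow injects energy only while the glottis is open (approximated here as a velocity-dependent driving force) and a stiffer, more damped regime takes over when the folds collide. The parameter values are rough, order-of-magnitude guesses chosen to produce self-sustained oscillation; this is not the laboratory's model.

```python
# Toy lumped-element (one-mass) vocal fold oscillator, for illustration only.
# The aerodynamic force is crudely approximated as a velocity-dependent push
# while the glottis is open; a stiffer, more damped regime represents collision
# of the folds. All parameter values are rough, order-of-magnitude choices.
import numpy as np
from scipy.integrate import solve_ivp

m = 1e-4       # effective vibrating mass of one fold (kg)
k = 100.0      # tissue stiffness (N/m); natural frequency near 160 Hz
b = 0.02       # tissue damping (N*s/m)
x0 = 1e-4      # rest half-width of the glottis (m)
Ps = 800.0     # subglottal pressure (Pa), a typical conversational value
a = 5e-5       # aerodynamic coupling coefficient (illustrative)
k_col = 300.0  # extra stiffness during collision (N/m)
b_col = 0.1    # extra damping during collision (N*s/m)

def rhs(t, y):
    x, v = y                    # fold displacement and velocity
    force = -k * x - b * v      # elastic restoring force and tissue damping
    if x > -x0:                 # glottis open: airflow feeds energy into the tissue
        force += a * Ps * v
    else:                       # folds in contact: stiffer and more dissipative
        force += -k_col * (x + x0) - b_col * v
    return [v, force / m]

sol = solve_ivp(rhs, (0.0, 0.2), [1e-5, 0.0],
                max_step=1e-5, rtol=1e-6, atol=1e-9)

# Estimate the oscillation frequency from upward zero crossings after startup.
x, t = sol.y[0][sol.t > 0.1], sol.t[sol.t > 0.1]
crossings = t[1:][(x[:-1] < 0) & (x[1:] >= 0)]
if len(crossings) > 1:
    print(f"self-sustained oscillation at ~{1.0 / np.mean(np.diff(crossings)):.0f} Hz")
else:
    print("no sustained oscillation with these parameters")
```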

How could a disorder affect only a person’s singing voice?

Specifically regarding singer’s dystonia, and thinking about this disorder only appearing in the singing voice, one possible hypothesis is that it could be an early sign of a neurological disease in the body. We also use voice as a biomarker in order to study other health-related issues in the human body. Voice production involves the respiratory system, and then we have the vibration of the vocal folds; everything has to be absolutely precise in order to have a voice that is nice and clean, with no disorder-related symptoms. Thinking about that, if someone has a mental health problem, it is going to be reflected in the voice. If someone has a neurological problem, it’s going to affect voice. There are so many research groups looking at, for example, diagnosis of Parkinson’s disease by evaluating voice.

“We use voice as a biomarker to study other health-related issues, because if someone has a neurological problem, it’s going to affect voice.”

In our lab, we are going to use voice to evaluate depression and anxiety in patients. We want to design a tool that anyone can use to evaluate just their voice. Voice is used in many different ways as a biomarker, and many neurological diseases can affect voice. Weakness in the muscles is going to very quickly affect voice, although the changes may not be as perceivable in what we hear. And so that’s why having these sorts of objective tools to evaluate voice can actually open a new door to the kind of information that we cannot simply hear in the voice, or observe in acoustic signals.
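
As one concrete, simplified illustration of what an objective acoustic measure can look like, the Python sketch below uses the open-source librosa library to track the fundamental frequency of a recording and derive a rough, frame-based proxy for jitter (cycle-to-cycle pitch instability), along with mel-frequency cepstral coefficients of the kind often fed to screening models. The file name is a placeholder and the measures are simplified stand-ins, not the laboratory's clinical protocol.

```python
# Illustrative sketch: extract a few objective acoustic measures from a voice
# recording with librosa. The file name is a placeholder, and the "jitter"
# here is a rough frame-based proxy, not a clinical-grade measure.
import numpy as np
import librosa

y, sr = librosa.load("sustained_vowel.wav", sr=None)  # hypothetical recording

# Fundamental-frequency (pitch) track over the recording.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
f0_voiced = f0[voiced_flag & ~np.isnan(f0)]

# Frame-based proxy for jitter: relative variation of successive pitch periods.
periods = 1.0 / f0_voiced
jitter_proxy = np.mean(np.abs(np.diff(periods))) / np.mean(periods)

# Mel-frequency cepstral coefficients, a common input to screening models.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(f"mean F0: {np.mean(f0_voiced):.1f} Hz")
print(f"jitter proxy: {100 * jitter_proxy:.2f} %")
print(f"MFCC matrix shape: {mfcc.shape}")
```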

Could AI be used to create a personalized model of a person’s voice?

I think AI has a lot of potential in terms of summarizing all these different measurements and information that we have about one person. For instance, looking at the voices of patients with laryngeal cancer, surgeons use large questionnaires that contain 600 or 700 questions to make a decision about how to proceed with surgery or treatment of tumors in the larynx. Think about a surgeon sitting down and looking at these various answers. How should they pull together everything in order to make the right decision, based on this huge amount of information? So one way AI could be helpful, I think, is narrowing down the options for doctors, providing more summarized information directly related to the pathophysiology of the disorder. I don’t see AI ever doing everything by itself, honestly, especially when it comes to humans, because we are all so different, and a doctor makes decisions based on real experience.

What is the next step in your research?

Our goal is to be able to develop protocols that eventually could be used clinically. The research goal is to understand the basics of voice production, and how our voice might be affected, for example, by mental health status, or how our voice might be affected if someone has laryngeal cancer and undergoes radiation therapy. We are interested in laryngeal cancer because it’s among the top three cancers that show the highest suicide rates. And one reason might be that it’s related to our voice and to losing the ability to communicate.

We are trying to better understand voice, but also be able to use this knowledge in clinical settings in order to do a more accurate diagnosis, which would lead to designing more accurate treatment approaches that are also tailored toward the needs of each individual. We’re interested to see how we can actually use voice as a biomarker to measure other problems in the human body. We are just starting data collection at Henry Ford Health, so hopefully we are going to have some interesting results in the near future about these projects, too.
