Nithin Rao Koluguri

As a Senior Research Scientist @ NVIDIA Conversational AI, I work on developing speech recognition, speaker recognition, speaker verification, and diarization systems. My research interests include speech signal processing, natural language processing, and machine learning.

I received my master's degree from the University of Southern California (USC) with a major in Electrical and Computer Engineering. During my master's, I carried out research at the Signal Analysis and Interpretation Laboratory (SAIL), where I was advised by Prof. Shrikanth Narayanan.

Email  /  CV  /  Github  /  LinkedIn  /  Google Scholar  /  Blog

Prior Work Experience

Applied Machine Learning Intern at Bose CE Applied Research Group

  • Developed and deployed NLP and computer vision ML systems on the Google Pixel phone and Bose wearables

Research Assistant at SPIRE LAB, IISc Bangalore (Advised by: Prof. Prasanta Kumar Ghosh)

  • Built a speech classifier to detect Amyotrophic Lateral Sclerosis (ALS) and Parkinson's disease (PD) using voice as a biomarker
  • Developed an unsupervised system for robust bird sound detection using an enhanced Multiple Window Savitzky-Golay (MWSG) spectrogram

Software Developer at Robert Bosch Bangalore

  • Worked on customer and production diagnostics in telematics projects, designing new features for the Daimler customer
  • Implemented text-to-speech (TTS) outputs for car multimedia features such as navigation, SMS readout, and hands-free control, and tested the outputs using the TTFIS tool

Research

I am interested in understanding signal-level properties of audio, speech, and images. My research experience so far spans natural language processing and applying machine learning algorithms to speech and images.

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition
Krishna C Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg

International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
project page

We evaluate various discrete audio representations that can be used as an alternative to mel-spectrograms for speech and speaker recognition.

Investigating End-to-End ASR Architectures for Long Form Audio Transcription
Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, Somshubra Majumdar, Dima Rekesh, Vahid Noroozi, Jagadeesh Balam, Boris Ginsburg

International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
project page

We investigate end-to-end ASR architectures for long-form audio transcription that can perform inference in a single pass.

Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach
Tae Jin Park, Kunal Dhawan, Nithin Koluguri, Jagadeesh Balam

International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
project page

We propose a novel contextual beam search approach for speaker diarization that leverages large language models to improve speaker diarization performance.

Fast conformer with linearly scalable attention for efficient speech recognition
Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023
project page

We propose a novel attention mechanism for Conformer architecture that is linearly scalable with respect to the input sequence length.

Open Automatic Speech Recognition Leaderboard
Sanchit, Hugging Face Team, Nvidia NeMo Team, SpeechBrain Team, Vaibhav Srivastav, Somshubra Majumdar, Nithin Koluguri, Adel
project page

We present the Open Automatic Speech Recognition Leaderboard, a platform for benchmarking and comparing ASR models.

A Compact End-to-End Model with Local and Global Context for Spoken Language Identification
Fei Jia, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

International Speech Communication Association (Interspeech), 2023
project page

We present AmberNet, a compact end-to-end neural network for spoken language identification. AmberNet consists of 1D depth-wise separable convolutions and Squeeze-and-Excitation layers with global context, followed by statistics pooling and linear layers.
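As an illustrative sketch (not the released AmberNet implementation), a 1D depth-wise separable convolution factors a standard convolution into a per-channel (depth-wise) filtering step followed by a 1x1 point-wise channel-mixing step:

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernels, pw_weights):
    """Toy 1D depth-wise separable convolution (no padding, no bias).

    x          : (channels, time) input features
    dw_kernels : (channels, k) one filter per input channel (depth-wise step)
    pw_weights : (out_channels, channels) 1x1 point-wise mixing step
    """
    c, _ = x.shape
    # Depth-wise: each channel is convolved with its own filter ('valid' mode).
    dw = np.stack([np.convolve(x[i], dw_kernels[i], mode="valid") for i in range(c)])
    # Point-wise: a 1x1 convolution mixes channels at every time step.
    return pw_weights @ dw  # (out_channels, time - k + 1)
```

The appeal is cost: per output frame this needs roughly `c*k + c*out_channels` multiplies instead of the `c*out_channels*k` of a full convolution.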

Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation
Tae Jin Park, He Huang, Coleman Hooper, Nithin Rao Koluguri, Kunal Dhawan, Ante Jukić, Jagadeesh Balam, Boris Ginsburg

International Speech Communication Association (Interspeech), 2023
project page

We introduce a sophisticated multi-speaker speech data simulator, specifically engineered to generate multi-speaker speech recordings.

The CHiME-7 Challenge: System Description and Performance of NeMo Team's DASR System
Tae Jin Park, He Huang, Ante Jukic, Kunal Dhawan, Krishna C Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg

International Speech Communication Association (Interspeech), 2023
project page

We describe the NeMo team's DASR system for the CHiME-7 Challenge, an ensembled diarization and speech recognition system built on a novel multi-scale decoder.

Multi-scale Speaker Diarization with Dynamic Scale Weighting
Taejin Park, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

International Speech Communication Association (Interspeech), 2022
project page

An advanced multi-scale diarization system based on a multi-scale diarization decoder.

TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context
Nithin Rao Koluguri, Taejin Park, Boris Ginsburg

International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022
project page

A novel neural network architecture for extracting speaker representations. TitaNet employs 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers for global context, followed by a channel-attention-based statistics pooling layer that maps variable-length utterances to a fixed-length embedding (t-vector).
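The pooling step can be sketched in NumPy; this is a simplified stand-in for TitaNet's actual layer (the MLP shapes `w`, `b`, `v` are illustrative assumptions): attention weights over frames yield a weighted mean and standard deviation per channel, concatenated into one fixed-length vector regardless of utterance length.

```python
import numpy as np

def attentive_stats_pooling(feats, w, b, v):
    """Toy attention-based statistics pooling.

    feats : (channels, time) frame-level features
    w, b  : (hidden, channels), (hidden,) attention MLP parameters
    v     : (channels, hidden) projects hidden units to per-channel scores
    Returns a fixed-length vector of weighted means and std devs (2 * channels).
    """
    scores = v @ np.tanh(w @ feats + b[:, None])     # (channels, time)
    alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax over time
    mean = (alpha * feats).sum(axis=1)               # attention-weighted mean
    var = (alpha * feats ** 2).sum(axis=1) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-9, None))          # attention-weighted std dev
    return np.concatenate([mean, std])               # fixed-length embedding
```

Because the statistics are computed over the time axis, a 3-second and a 30-second utterance both map to the same embedding dimensionality.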

SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification
Nithin Rao Koluguri, Jason Li, Vitaly Lavrukhin, Boris Ginsburg

International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
project page

We propose SpeakerNet, a new neural architecture for speaker recognition and speaker verification tasks. It uses a 1D convolutional encoder and an x-vector-based statistics pooling decoder.

Meta-learning for robust child-adult classification from speech
Nithin Rao Koluguri, Manoj Kumar, So Hyun Kim, Catherine Lord, Shrikanth Narayanan

International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
project page

We demonstrate improvements over state-of-the-art speaker embeddings (x-vectors) for child-adult speaker classification using prototypical networks.

Comparison Of Speech Tasks And Recording Devices For Voice Based Automatic Classification Of Healthy Subjects And Patients With Amyotrophic Lateral Sclerosis
Suhas B.N, Deep Patel, Nithin Rao Koluguri, Prasanta Ghosh*

International Speech Communication Association (Interspeech), 2019
project page

We evaluated the role of different speech tasks and recording devices in detecting ALS from speech.

Spectrogram Enhancement Using Multiple Window Savitzky-Golay (MWSG) Filter for Robust Bird Sound Detection
Nithin Rao Koluguri, Nisha G Meenakshi*, Prasanta Ghosh*

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017
project page

We propose a novel unsupervised method to denoise a spectrogram using the Multiple Window Savitzky-Golay algorithm, and use it to enhance bird sound cues for detecting bird calls in noisy environments.
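The MWSG method combines several tapered windows; as a much-simplified, single-window stand-in, a Savitzky-Golay smoother (a local least-squares polynomial fit) applied along the time axis of each frequency bin can be sketched as:

```python
import numpy as np

def savgol_coeffs(window, polyorder):
    """Least-squares Savitzky-Golay smoothing coefficients (window must be odd)."""
    half = window // 2
    pos = np.arange(-half, half + 1)
    A = np.vander(pos, polyorder + 1, increasing=True)  # (window, polyorder + 1)
    # Row 0 of the pseudo-inverse evaluates the fitted polynomial at the centre.
    return np.linalg.pinv(A)[0]

def smooth_spectrogram(spec, window=5, polyorder=2):
    """Smooth each frequency bin of a (freq, time) spectrogram along time."""
    c = savgol_coeffs(window, polyorder)
    return np.stack([np.convolve(row, c[::-1], mode="same") for row in spec])
```

A useful property for a smoke test: a polynomial of degree at most `polyorder` passes through the filter unchanged (away from the zero-padded edges), so smooth trends survive while frame-to-frame noise is attenuated.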

Academic Reviewer

Conferences:

  • 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • 2023 International Speech Communication Association (Interspeech)

Projects

Generating sausages from text using WFSTs without acoustic model (DR Project)
Nithin Rao Koluguri, Prashanth G, Prof. Panayiotis Georgiou, 2019

Developed a method to mimic an ASR system for a given noise type at a certain dB level, generating n-best paths from text alone using sausages and a confusion matrix. The generated text can be used to quickly improve an ASR correction model.

Visual Question Answering ‐ Attention and Fusion based approaches
Nithin Rao Koluguri, Digbalay Bose, Namrata Tam, Aditya Mate, 2019

Explored Visual Question Answering methods, applying attention- and fusion-based approaches, specifically bottom-up and top-down attention. Explored BERT embeddings and different input representations to build a solid VQA system.

Image recognition system based on speech command
Nithin Rao Koluguri, Nikhil, Gaurav Newalkar, 2016

Designed an image recognition system driven by spoken commands. The goal of the project is to help visually impaired persons find the coordinates of the object they are looking for using speech alone. Overall, the system achieved an accuracy of 93.75% on spoken digits.

Blogs

Decoding an audio file using a pre-trained model with Kaldi

How to get timestamps of an audio file using pre-trained model with Kaldi

Visual Question Answering ‐ Attention and Fusion based approaches


Inspired by this website.