A list of all the posts and pages found on the site. For you robots out there, an XML version is also available for digesting.



Transfer Learning in Bayesian Neural Networks



Let’s imagine we want to auto-encode images and detect unusual ones, using a Variational Autoencoder. However, our images come from different sources: they could have been captured by different photographers with different styles. If we treat all images as one dataset, every new photo from an extravagant photographer who uses bright colours and sharp contrast will have a low likelihood under our model, even though it might fit in well with the rest of her oeuvre.
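The effect described above can be illustrated with a much simpler density model than a Variational Autoencoder. The sketch below is only a stand-in: it fits a univariate Gaussian to a hypothetical "contrast" feature, once over all photos pooled together and once per photographer. The feature values and photographer names are invented for illustration.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def fit_gaussian(values):
    """Maximum-likelihood mean and standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

# Hypothetical contrast feature per photo, grouped by photographer.
photos = {
    "muted_a": [0.20, 0.25, 0.22, 0.18, 0.24],
    "muted_b": [0.21, 0.19, 0.23, 0.26, 0.20],
    "vivid": [0.85, 0.90, 0.88, 0.92, 0.87],
}

# One model for everything versus one model per source.
pooled = fit_gaussian([v for values in photos.values() for v in values])
per_source = {name: fit_gaussian(values) for name, values in photos.items()}

new_photo = 0.89  # a new photo by the "vivid" photographer
pooled_ll = gaussian_logpdf(new_photo, *pooled)
source_ll = gaussian_logpdf(new_photo, *per_source["vivid"])

print(f"pooled log-likelihood:     {pooled_ll:.2f}")
print(f"per-source log-likelihood: {source_ll:.2f}")
```

Under the pooled model the new photo looks unusual, while under the photographer's own model it is typical. Fitting fully separate models ignores what the sources have in common, which is the gap that sharing information between sources is meant to close.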



Nora the Empathetic Psychologist

Published in Proc. Interspeech, 2017

Nora is a new dialog system that mimics a conversation with a psychologist by screening for stress, anxiety, and depression. She understands, empathizes, and adapts to users using emotional intelligence modules trained via statistical modelling such as Convolutional Neural Networks. These modules also enable her to personalize the content of each conversation.

Recommended citation: Genta Indra Winata, Onno Kampman, Yang Yang, Anik Dey, Pascale Fung (2017). "Nora the Empathetic Psychologist." Proc. Interspeech. (3437--3438).

Investigating Audio, Visual, and Text Fusion Methods for End-to-End Automatic Personality Prediction

Published in Proceedings of the Association for Computational Linguistics (ACL), 2018

We propose a tri-modal architecture to predict Big Five personality trait scores from video clips, with separate channels for audio, text, and video data. For each channel, stacked Convolutional Neural Networks are employed. The channels are fused both at the decision level and by concatenating their respective fully connected layers. It is shown that a multimodal fusion approach outperforms each single modality channel, with an improvement of 9.4% over the best individual modality (video). Full backpropagation is also shown to be better than a linear combination of modalities, meaning complex interactions between modalities can be leveraged to build better models. Furthermore, we can see the prediction relevance of each modality for each trait. The described model can be used to increase the emotional intelligence of virtual agents.

Recommended citation: Onno Kampman, Elham J Barezi, Dario Bertero, Pascale Fung (2018). "Investigating Audio, Visual, and Text Fusion Methods for End-to-End Automatic Personality Prediction." Proc. ACL. 2(606--611).
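The two fusion strategies named in the abstract above differ in where the modalities are combined. The following is a minimal sketch of the data flow only, not the paper's model: the weights, scores, and feature values are hypothetical, and in the paper the fused layers are trained with backpropagation rather than fixed by hand.

```python
def decision_level_fusion(scores_per_modality, weights):
    """Combine per-modality trait predictions with a weighted average."""
    total = sum(weights[m] for m in scores_per_modality)
    n_traits = len(next(iter(scores_per_modality.values())))
    return [
        sum(weights[m] * scores_per_modality[m][i] for m in scores_per_modality) / total
        for i in range(n_traits)
    ]

def concatenate_features(features_per_modality, order):
    """Join per-modality feature vectors into one input for a shared layer."""
    fused = []
    for modality in order:
        fused.extend(features_per_modality[modality])
    return fused

# Hypothetical Big Five predictions from each single-modality channel.
scores = {
    "audio": [0.4, 0.4, 0.4, 0.4, 0.4],
    "text": [0.6, 0.6, 0.6, 0.6, 0.6],
    "video": [0.8, 0.8, 0.8, 0.8, 0.8],
}
weights = {"audio": 1.0, "text": 1.0, "video": 2.0}  # assumed, not learned
late_fused = decision_level_fusion(scores, weights)

# Hypothetical fully connected layer activations from each channel.
features = {"audio": [1.0, 2.0], "text": [3.0], "video": [4.0, 5.0]}
early_fused = concatenate_features(features, order=["audio", "text", "video"])
```

Decision-level fusion can only mix final predictions linearly, whereas layers placed on top of the concatenated vector can model interactions between modalities.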

Attention-Based LSTM for Psychological Stress Detection from Spoken Language Using Distant Supervision

Published in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

We propose a Long Short-Term Memory (LSTM) network with an attention mechanism to classify psychological stress from self-conducted interview transcriptions. We apply distant supervision by automatically labeling tweets based on their hashtag content, which complements and expands the size of our corpus. This additional data is used to initialize the model parameters, after which the model is fine-tuned using the interview data. This improves the model robustness, especially by expanding the vocabulary size. The bidirectional LSTM model with attention is found to be the best model in terms of accuracy (74.1%) and F-score (74.3%). Furthermore, we show that distant supervision fine-tuning enhances the model performance by 1.6% accuracy and 2.1% F-score. The attention mechanism helps the model to select informative words.

Recommended citation: Genta Indra Winata, Onno Kampman, Pascale Fung (2018). "Attention-Based LSTM for Psychological Stress Detection from Spoken Language Using Distant Supervision." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
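How an attention mechanism selects informative words, as mentioned in the abstract above, can be sketched as a weighted pooling over per-word hidden states. This is a generic dot-product attention under stated assumptions: the hidden states and the query vector are hypothetical values, and the paper's exact scoring function is not given here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Score each hidden state against the query, then take the weighted sum."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(len(query))
    ]
    return weights, context

# Hypothetical LSTM outputs for a three-word utterance (two dimensions each).
hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
query = [1.0, 1.0]  # assumed learned attention vector

weights, context = attention_pool(hidden_states, query)
most_attended = weights.index(max(weights))
```

The weights sum to one, so they can be read as how much each word contributes to the sentence representation passed to the classifier.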

Towards Universal End-to-End Affect Recognition from Multilingual Speech by ConvNets

Published in arXiv preprint, 2019

We propose an end-to-end affect recognition approach using a Convolutional Neural Network (CNN) that handles multiple languages, with applications to emotion and personality recognition from speech. We lay the foundation of a universal model that is trained on multiple languages at once. As affect is shared across all languages, we are able to leverage shared information between languages and improve the overall performance for each one. We obtained an average improvement of 12.8% on emotion and 10.1% on personality when compared with the same model trained on each language alone. It is end-to-end because we directly take narrow-band raw waveforms as input. This allows us to accept as input audio recorded from any source and to avoid the overhead and information loss of feature extraction. It outperforms a similar CNN using spectrograms as input by 12.8% for emotion and 6.3% for personality, based on F-scores. Analysis of the network parameters and layer activations shows that the network learns and extracts significant features in the first layer, in particular pitch, energy and contour variations. Subsequent convolutional layers instead capture language-specific representations through the analysis of supra-segmental features. Our model represents an important step for the development of a fully universal affect recognizer, able to recognize additional descriptors, such as stress, and for the future implementation into affective interactive systems.

Recommended citation: Dario Bertero, Onno Kampman, Pascale Fung (2019). "Towards Universal End-to-End Affect Recognition from Multilingual Speech by ConvNets." arXiv preprint.
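Taking raw waveforms as input, as the abstract above describes, means the first convolutional layer slides its filters directly over audio samples instead of over spectrogram frames. The sketch below shows that single operation in plain Python. The 8 kHz sampling rate, the test tone, and the hand-written filter are assumptions for illustration; in a trained network the filter values are learned.

```python
import math

def conv1d(signal, kernel, stride=1):
    """Valid-mode 1D cross-correlation, which deep learning calls convolution."""
    width = len(kernel)
    return [
        sum(signal[start + j] * kernel[j] for j in range(width))
        for start in range(0, len(signal) - width + 1, stride)
    ]

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

# Hypothetical raw input: 10 ms of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
waveform = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(80)]

# Hypothetical filter: a first difference, which responds to rising amplitude.
kernel = [-1.0, 1.0]
feature_map = relu(conv1d(waveform, kernel, stride=2))
```

With 80 samples, a kernel of width 2, and a stride of 2, the layer produces 40 activations, which later layers would process further.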