Sound Localization and Processing for Inducing Synesthetic Experiences in Virtual Reality

Aleksei Tepljakov, Sergei Astapov, Dirk Draheim, Eduard Petlenkov, and Kristina Vassiljeva

Talk Outline

  • Why synesthesia and Virtual Reality?

  • Problem statement

  • Description of the proposed solution

  • Experimental results

  • Conclusion and future work

Synesthesia and Virtual Reality

  • Synesthesia is the act of experiencing one sense modality as another, e.g., a person may vividly experience flashes of colors when listening to a series of sounds.
  • Recent advances in Virtual Reality technology make it possible to induce such experiences thanks to the effect of presence achieved in the virtual environment.
  • In this contribution, we focus on localization and visual interpretation of sound in Virtual Reality.

Leibniz' Monad Theory and Applications

  • The original trigger for this particular synesthetic scenario was a series of discussions held in the context of the Leibniz anniversary year 2016.
  • In Leibniz's theory, monads are the smallest building blocks of mind that interact only via their senses.
  • The eventual goal of the present project is thus to create the experience of a full exchange of senses.
  • This has numerous medical and artistic applications.

Problem statement

To achieve a synesthetic experience, we need to:

  • Precisely localize the sound source.
  • Analyze the sound and extract its characteristics.
  • Visualize the sound in the 3D virtual space.

Sound localization

  • We use a conical array of microphones.
  • We propose a DOA (direction of arrival) method: compared to SRP-PHAT, it avoids frequency-domain computations and is therefore computationally more efficient.
  • Furthermore, the proposed DOA method reduces the number of microphone pairs needed for cross-correlation.

Sets of microphones

  • For azimuth $\phi$ estimation we have the set of pairs
    \[ A_{h}=\left\{ \left(m_{i}^{h},m_{j}^{h}\right)\in S_{2}^{M_{h}}\biggm|\alpha_{ij}<\frac{\pi}{2}\right\} , \]
    where $S_2^{M_h}$ is the set of all combinations of horizontal microphone pairs.
  • For elevation $\theta$ estimation we have
    \[ A_{v}=\left\{ \left(m_{i}^{h},m_{j}^{v}\right)\bigm|m_{i}^{h}\in A_{act},\,j\in[1,M_{v}]\right\} \cup S_{2}^{M_{v}}, \]
    where $S_{2}^{M_{v}}$ is the set of all combinations of vertical microphone pairs, and $A_{act}$ is the set of active horizontal microphones.

Sound localization: AOA estimation

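  • Assuming a far-field acoustic source, the angle of arrival for the microphone pair $(m_i, m_j)$ is
    \[ \hat{\varphi}_{ij}=\sin^{-1}\left(\frac{\tau_{ij}\cdot c}{l}\right)=\sin^{-1}\left(\frac{\Delta k_{ij}/f_{s}\cdot c}{l}\right), \tag{1} \]
    where $\tau_{ij}$ is the time delay between the two microphones, $c$ is the speed of sound, $l$ is the distance between the microphones, and $f_{s}$ is the sampling frequency.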
  • To estimate $\tau_{ij}$, we apply cross-correlation over a frame of $N$ samples:
    \[ R_{ij}\left(\mathrm{\Delta}k\right)=\sum_{k=0}^{N-1}x_{m_{i}}[k]\cdot x_{m_{j}}[k-\mathrm{\Delta}k]. \tag{2} \]
  • Then, the TDOA (time difference of arrival) in samples is
    \[ \mathrm{\Delta}k_{ij}=\arg\max_{\mathrm{\Delta}k}\left(R_{ij}\left(\mathrm{\Delta}k\right)\right). \tag{3} \]
  • Finally, the AOA estimates $(\phi,\theta)$ are computed using equations (1) through (3).
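
The per-pair estimate in (1) through (3) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the `max_lag` search bound, and the default speed of sound of 343 m/s are assumptions introduced here.

```python
import numpy as np

def tdoa_samples(x_i, x_j, max_lag):
    """Lag dk (in samples) that maximizes R_ij(dk) = sum_k x_i[k] * x_j[k - dk]."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    n = len(x_i)
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.empty(len(lags))
    for idx, dk in enumerate(lags):
        if dk >= 0:
            # valid k range is dk .. n-1
            r[idx] = np.dot(x_i[dk:], x_j[:n - dk])
        else:
            # valid k range is 0 .. n-1+dk
            r[idx] = np.dot(x_i[:n + dk], x_j[-dk:])
    return int(lags[np.argmax(r)])

def aoa_from_tdoa(dk, fs, l, c=343.0):
    """Far-field angle (radians) from a sample delay, as in equation (1).

    c = 343 m/s is an assumed speed of sound; the argument is clipped to
    [-1, 1] so that measurement noise cannot push arcsin out of its domain.
    """
    arg = np.clip(dk / fs * c / l, -1.0, 1.0)
    return float(np.arcsin(arg))
```

Restricting the search to `max_lag` samples reflects the fact that the physical delay between two microphones cannot exceed $l/c$ seconds, so larger lags need not be evaluated.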

Acoustic feature extraction

  • Human perception of the frequency content of speech signals does not follow a linear scale.
  • We therefore use the mel scale, which maps a frequency $f$ in Hz to a perceptual frequency $f_m$ in mels:
    \[ f_m=2595\log_{10}\left(1+\frac{f}{700}\right). \]
  • We analyze the audio signal using the MFCC (mel-frequency cepstral coefficients) method, which has also been successfully applied to modeling music.
  • The corresponding algorithm returns several features of the signal; in this work, we consider the auditory spectrum portion, denoted hereinafter as $A_{spec}$.
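
As a small worked example, the mel-scale formula above and its algebraic inverse can be written in Python. The function names are chosen here for illustration only.

```python
import math

def hz_to_mel(f_hz):
    """Mel-scale value for a frequency in Hz: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping, obtained by solving the formula above for f."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```

With these constants, 1000 Hz maps to approximately 1000 mel, and equal steps in mel correspond to progressively wider steps in Hz at higher frequencies.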

Sound Visualization

  • In the VR environment the incoming sound waves are visualized as spheres moving towards the listener.
  • The color, size, velocity of travel, and sampling rate for generating the spheres can be determined experimentally.
  • The incoming waveforms are broken down into frames and analyzed as discussed previously.

Color Mapping

  • The size of a single sphere is determined by the scaled maximum amplitude in a waveform frame.
  • The color of the sphere is determined by the dominant feature in the auditory spectrum. A transform is defined as
    \[ \xi:\mathscr{I}\rightarrow\mathscr{C}, \]
    where $\mathscr{I}\subset\mathbb{N}$ is the set of indices of the dominant feature in $A_{spec}$, and $\mathscr{C}\subset\mathbb{R}^{3}$ is the set of parameterized color specifications in a particular color space.
  • For this work, we consider the RGB color space.
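
One possible realization of the size rule and of the transform $\xi$ is sketched below. The specific choices are assumptions made for illustration, not the mapping used by the authors: the size bounds, the `full_scale` normalization, and the hue ramp from red to blue are all hypothetical.

```python
import colorsys
import numpy as np

def sphere_size(frame, full_scale=1.0, min_size=0.05, max_size=1.0):
    """Sphere size from the scaled maximum amplitude of one waveform frame.

    full_scale, min_size, and max_size are hypothetical tuning parameters.
    """
    peak = min(float(np.max(np.abs(frame))) / full_scale, 1.0)
    return min_size + (max_size - min_size) * peak

def xi(index, n_features):
    """Map a dominant-feature index to an RGB triple in [0, 1]^3.

    Assumed mapping: the hue runs linearly from red (index 0) towards blue.
    """
    hue = (index / n_features) * (2.0 / 3.0)
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)

def dominant_color(a_spec_frame):
    """RGB color for one frame of A_spec, using the index of its largest entry."""
    a = np.asarray(a_spec_frame, dtype=float)
    return xi(int(np.argmax(a)), len(a))
```

Going through HSV keeps the colors fully saturated while still returning a point in RGB space, which matches the choice of $\mathscr{C}$ as an RGB specification.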

Experimental setup: Microphone array

Experimental setup: Full configuration

Experimental setup: Data

  • A sound source is manually moved with constant velocity within a plane at a distance of about $r=1.5$ m from the conical array.
  • An audio clip of modern music with no distinct spectral features is used as the sound source.
  • The AOA estimation discussed above is carried out with a window of $t_{s}=0.1$ s.
  • The resulting angles (with an average tolerance of about $3^{\circ}$) are filtered, and a trajectory of motion is recovered.
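
The slides do not specify which filter is applied to the angle estimates. As one plausible option, a sliding median filter suppresses isolated outliers in a sequence of per-window angles; the function name and the window width below are assumptions.

```python
import numpy as np

def median_filter(angles_deg, width=5):
    """Sliding median over a sequence of angle estimates (odd width assumed).

    The sequence is padded by repeating its edge values so that the output
    has the same length as the input.
    """
    a = np.asarray(angles_deg, dtype=float)
    half = width // 2
    padded = np.pad(a, half, mode="edge")
    return np.array([np.median(padded[i:i + width]) for i in range(len(a))])
```

This simple version does not handle wrap-around at $\pm 180^{\circ}$, so it is only suitable when the source stays away from that boundary.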

Experimental configuration

Acoustic localization results

Experimental signal analysis

  • The MFCC is calculated for the sound clip recorded by the central microphone of the circular array.
  • The sound amplitude and the dominant spectral features are encoded as sphere size and color, respectively, as proposed above.
  • Thus, all necessary parameters for the VR sound visualization system have been successfully obtained.

Signal analysis results

Conclusions and further research

  • We have developed a prototype for sound localization, processing, and visualization for inducing a synesthetic experience in a VR environment.
  • Experimental data was successfully processed using the proposed approach, yielding usable results.
  • Further research is necessary and has several branches: real-time application; implementation and verification in an embedded system; expansion of the microphone array for accurate detection of multiple sound sources; and study of the induced synesthetic effect in real subjects.

Thank you for your attention!

For more information visit