XR experiences combining vision and sound

Extended reality (XR) technologies are poised to reshape human-computer interaction (HCI) by moving beyond traditional approaches. Two other fields experiencing similar growth are natural language processing (NLP) and computer vision (CV), driven largely by the rise of data-driven methods in machine learning (ML) and artificial intelligence (AI). VOX aspires to fuse these parallel fields by designing and developing AI models that integrate language as a core interaction medium, together with visual understanding. The focus is on producing pre-trained XR models that intertwine the spatial and semantic knowledge of XR and NLP systems. This could kick-start a new era of applications built around a holistic understanding of users' goals, free of devices and controllers.