In recent years, various novel techniques have emerged in the realm of deep learning for enhanced pattern recognition. Multimodal learning is a widely used approach that enables simultaneous data input across multiple modalities, including video, audio, and text. For customer relationship management in the field of marketing, comprehensive analyses that combine multiple data sources, such as behavior logs and survey responses, are actively pursued to assess customer loyalty and predict future behavior. However, these multimodal datasets are typically aggregated into a single dataset of variables hand-crafted by analysts, which may not fully exploit the high predictive performance achieved by feature extraction in deep learning. In this study, we employ a source-target multi-head attention transformer encoder in conjunction with serial feature fusion, which enables the creation of a multimodal, context-aware deep learning model. The model simultaneously processes two high-dimensional datasets: time-series panel data aggregating daily smartphone app usage and cross-sectional data containing demographic variables and survey responses. Both datasets are then effectively fused at the upper layers of the model to produce the output. Results of the exhaustive analysis demonstrate that the proposed model outperforms other major deep learning architectures.
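To make the architecture concrete, the following is a minimal PyTorch sketch of how a source-target multi-head attention encoder with serial feature fusion might combine the two inputs. All layer sizes, input dimensions (n_apps, n_static), and the prediction head are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of the described architecture, assuming PyTorch.
# Dimensions and the prediction head are illustrative assumptions.
import torch
import torch.nn as nn

class SourceTargetFusionModel(nn.Module):
    def __init__(self, n_apps=200, n_static=50, d_model=64, n_heads=4):
        super().__init__()
        # Embed daily app-usage vectors (time-series panel data).
        self.ts_proj = nn.Linear(n_apps, d_model)
        # Embed demographic/survey variables (cross-sectional data).
        self.static_proj = nn.Linear(n_static, d_model)
        # Self-attention encoder layer for temporal context.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        # Source-target attention: static features query the time series.
        self.cross_attn = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True)
        # Serial feature fusion: concatenate both representations
        # at the upper layers before the prediction head.
        self.head = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 1))

    def forward(self, ts, static):
        # ts: (batch, days, n_apps); static: (batch, n_static)
        h_ts = self.encoder(self.ts_proj(ts))
        q = self.static_proj(static).unsqueeze(1)   # (batch, 1, d_model)
        ctx, _ = self.cross_attn(q, h_ts, h_ts)     # attend over time steps
        fused = torch.cat([ctx.squeeze(1), q.squeeze(1)], dim=-1)
        return self.head(fused)                     # e.g., a loyalty score

model = SourceTargetFusionModel()
out = model(torch.randn(8, 30, 200), torch.randn(8, 50))
print(out.shape)  # torch.Size([8, 1])
```

In this sketch, the cross-sectional embedding queries the encoded time series via cross-attention (one plausible reading of source-target attention), and the two resulting representations are concatenated serially at the upper layers before the final prediction, mirroring the fusion strategy described above.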