Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework that generates synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand's motion via dual-noise initialization, sampling distinct latent noise for each hand while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism that suppresses symmetric (common-mode) noise to highlight asymmetric hand-specific features, while adaptively enhancing inter-hand coordination during denoising. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics.
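To make the dual-noise initialization concrete, the sketch below illustrates one plausible reading of it: each hand's reverse diffusion starts from its own independently sampled latent noise, while both streams are conditioned on the same predicted hand positions. All names, shapes, and the `position_predictor` / `*_denoiser` interfaces are illustrative assumptions, not the released implementation.

```python
import torch

def sample_dual_noise(batch, frames, dim, device="cpu"):
    """Hypothetical sketch of dual-noise initialization: each hand stream
    starts denoising from its own latent noise of the same shape."""
    z_left = torch.randn(batch, frames, dim, device=device)   # left-hand noise
    z_right = torch.randn(batch, frames, dim, device=device)  # right-hand noise
    return z_left, z_right

# Assumed usage (components are placeholders for the actual modules):
# pos_cond = position_predictor(audio_features)              # shared positional condition
# z_l, z_r = sample_dual_noise(B, T, D)
# left_motion  = left_denoiser.sample(z_l,  cond=pos_cond)   # decoupled left stream
# right_motion = right_denoiser.sample(z_r, cond=pos_cond)   # decoupled right stream
```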
The overall framework of our proposed method. It consists of three main modules: (1) Position Predictor, which predicts hand positions from piano audio; (2) Hand-Coordinated Asymmetric Attention (HCAA), applied to the intermediate U-Net features after a Multi-Head Self-Attention (MHSA) layer, enabling asymmetric feature interaction between the two hands; and (3) Decoupled Gesture Generator, which employs two independent diffusion models to generate motions for the left and right hands separately.
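The sketch below outlines one way the HCAA idea could be realized on the post-MHSA U-Net features: subtract the symmetric (common-mode) component shared by both hand streams, let each hand's asymmetric residual attend to the other hand's features, and gate the resulting coordination term adaptively. The module structure, gating scheme, and shapes are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class HCAASketch(nn.Module):
    """Speculative sketch of Hand-Coordinated Asymmetric Attention:
    common-mode suppression followed by gated cross-hand attention."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.cross_lr = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_rl = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # adaptive coordination gain

    def forward(self, f_left, f_right):
        # f_left, f_right: (B, T, dim) features taken after an MHSA layer.
        common = 0.5 * (f_left + f_right)                    # symmetric (common-mode) part
        a_left, a_right = f_left - common, f_right - common  # asymmetric, hand-specific parts
        # Each hand's asymmetric features query the other hand's features.
        l2r, _ = self.cross_lr(a_left, f_right, f_right)
        r2l, _ = self.cross_rl(a_right, f_left, f_left)
        g = torch.sigmoid(self.gate)
        return f_left + g * l2r, f_right + g * r2l
```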
We visualize the motions generated by our model alongside the corresponding ground truth to assess their similarity and expressiveness.
(For the best viewing experience, please ensure your sound is enabled. If you are not hearing any audio, we recommend using Google Chrome.)
(Blue=Ground-Truth, Gold=Synthesis)
Fast-paced music is often accompanied by dense note sequences and rapid rhythmic shifts, imposing higher demands on both the timeliness and accuracy of hand motion generation. In such musical contexts, the motion generation model must not only accurately capture the rapid positional changes of the hands in 3D space, but also produce coordinated gestures that align seamlessly with the rhythm of the notes, thereby ensuring a natural and fluid piano performance.
In slow-paced music, hand movements tend to follow a more relaxed rhythm, with smoother and more natural transitions between gestures. While such scenarios impose less stringent demands on the temporal responsiveness of the motion generation model, they require higher quality in terms of spatial precision and local coherence. The model must be capable of producing smooth, fine-grained hand motions that faithfully capture the expressive nuances characteristic of slow piano performances.