One of the sub-projects of Google Research is MusicLM, a model that generates high-fidelity music from text descriptions. For example, given a prompt such as "a calming violin melody backed by a distorted guitar riff", MusicLM produces a realistic audio clip that matches the style and mood of the description. MusicLM uses a hierarchical sequence-to-sequence approach that generates music at 24 kHz and remains consistent over several minutes. It can also be conditioned on both a text caption and a melody, such as a whistled or hummed tune, transforming the melody in the style described by the caption.
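To make the "hierarchical" idea concrete, here is a minimal, purely illustrative Python sketch of a two-stage text-to-music pipeline: a coarse stage that sketches what is played, followed by a fine stage that fills in how it sounds, decoded to a 24 kHz waveform. Every function name, token rate, and vocabulary size below is a placeholder invented for illustration; only the 24 kHz output rate comes from the description above, and none of this is the actual MusicLM implementation or API.

```python
import numpy as np

# Illustrative constants: MusicLM outputs 24 kHz audio; the token rates are made up.
SAMPLE_RATE = 24_000
SEMANTIC_TOKENS_PER_SEC = 25    # assumption: coarse "what is played" tokens
ACOUSTIC_TOKENS_PER_SEC = 600   # assumption: fine "how it sounds" tokens

def embed_text(prompt: str) -> np.ndarray:
    """Stand-in for a joint music/text embedding model (placeholder)."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal(128)

def generate_semantic_tokens(text_emb: np.ndarray, seconds: int) -> np.ndarray:
    """Stage 1: coarse semantic tokens conditioned on the text embedding (placeholder)."""
    rng = np.random.default_rng(int(abs(text_emb[0]) * 1e6))
    return rng.integers(0, 1024, size=seconds * SEMANTIC_TOKENS_PER_SEC)

def generate_acoustic_tokens(semantic: np.ndarray) -> np.ndarray:
    """Stage 2: fine acoustic tokens conditioned on the stage-1 output (placeholder)."""
    rng = np.random.default_rng(int(semantic.sum()) % 2**32)
    seconds = len(semantic) // SEMANTIC_TOKENS_PER_SEC
    return rng.integers(0, 4096, size=seconds * ACOUSTIC_TOKENS_PER_SEC)

def decode_to_waveform(acoustic: np.ndarray) -> np.ndarray:
    """Stand-in for a neural codec decoder producing 24 kHz audio (placeholder)."""
    seconds = len(acoustic) // ACOUSTIC_TOKENS_PER_SEC
    return np.zeros(seconds * SAMPLE_RATE, dtype=np.float32)

prompt = "a calming violin melody backed by a distorted guitar riff"
text_emb = embed_text(prompt)
semantic = generate_semantic_tokens(text_emb, seconds=10)
acoustic = generate_acoustic_tokens(semantic)
audio = decode_to_waveform(acoustic)
print(audio.shape)  # (240000,) -> 10 seconds at 24 kHz
```

The point of the two stages is that long-range structure is cheaper to model on the sparse semantic tokens, while audio fidelity is handled separately on the dense acoustic tokens.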
To support the development of MusicLM, Google Research released MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts. MusicCaps can be used to train and evaluate models for music generation from text, as well as for other tasks such as music captioning, retrieval, and classification.
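For experimentation, the dataset can be pulled with the Hugging Face `datasets` library. This sketch assumes MusicCaps is mirrored on the Hub as "google/MusicCaps" and that each row carries a YouTube clip reference plus an expert-written caption; the exact column names ("ytid", "caption") are an assumption rather than a guarantee.

```python
from datasets import load_dataset

# Load the MusicCaps metadata (captions reference YouTube clips; audio is fetched separately).
musiccaps = load_dataset("google/MusicCaps", split="train")
print(len(musiccaps))  # expected to be roughly 5.5k text-music pairs

example = musiccaps[0]
print(example["ytid"], example["caption"][:80])
```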