M1 is a research project exploring large-scale music generation using diffusion transformers. This repository contains the implementation of our proposed architecture combining recent advances in diffusion models, transformer architectures, and music processing.
We propose a novel approach to music generation that combines:
- Diffusion-based generative modeling
- Multi-query attention mechanisms
- Hierarchical audio encoding
- Text-conditional generation
- Scalable training methodology
Our core research hypotheses:

- Diffusion transformers can capture long-range musical structure better than traditional autoregressive models
- Multi-query attention mechanisms can improve training efficiency without sacrificing quality (see the sketch after this list)
- Hierarchical audio encoding preserves both local and global musical features
- Text conditioning enables semantic control over generation
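As a concrete reference for the multi-query attention hypothesis, the sketch below shows the core idea in PyTorch: all query heads share a single key/value projection, shrinking the K/V parameters and cache relative to standard multi-head attention. This is an illustrative sketch only; the `MultiQueryAttention` class name and shapes are assumptions, not the layer implemented in `m1/models/`.

```python
import torch
from torch import nn


class MultiQueryAttention(nn.Module):
    """Illustrative multi-query attention: `heads` query heads, one shared K/V head."""

    def __init__(self, dim: int = 512, heads: int = 8, dim_head: int = 64):
        super().__init__()
        self.heads = heads
        self.dim_head = dim_head
        self.to_q = nn.Linear(dim, heads * dim_head, bias=False)  # per-head queries
        self.to_kv = nn.Linear(dim, 2 * dim_head, bias=False)     # single shared K and V
        self.to_out = nn.Linear(heads * dim_head, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q = self.to_q(x).view(b, n, self.heads, self.dim_head).transpose(1, 2)  # (b, h, n, d)
        k, v = self.to_kv(x).chunk(2, dim=-1)                                   # (b, n, d) each
        k, v = k.unsqueeze(1), v.unsqueeze(1)                                   # broadcast over heads
        attn = (q @ k.transpose(-2, -1)) * self.dim_head ** -0.5                # (b, h, n, n)
        out = attn.softmax(dim=-1) @ v                                          # (b, h, n, d)
        return self.to_out(out.transpose(1, 2).reshape(b, n, self.heads * self.dim_head))


# Quick shape check
x = torch.randn(2, 128, 512)
print(MultiQueryAttention()(x).shape)  # torch.Size([2, 128, 512])
```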
```
                                       ┌─────────────────┐
                                       │  Time Encoding  │
                                       └────────┬────────┘
                                                │
┌──────────────┐                       ┌────────▼────────┐
│ Audio Input  │──► Mel Spectrogram ──►│    Diffusion    │
└──────────────┘                       │   Transformer   │──► Generated Audio
┌──────────────┐   ┌────────────┐      │      Block      │
│  Text Input  │──►│ T5 Encoder │─────►│                 │
└──────────────┘   └────────────┘      └─────────────────┘
```
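The time-encoding branch in the diagram maps the scalar diffusion timestep to a vector the transformer can condition on. A common baseline for this (and a natural reference point for the time-embedding comparison study listed under the planned experiments) is the sinusoidal embedding sketched below; this is a generic sketch, not necessarily the exact encoding used in `m1/models/`.

```python
import math
import torch


def sinusoidal_time_embedding(timesteps: torch.Tensor, dim: int = 512) -> torch.Tensor:
    """Map integer diffusion timesteps (shape [batch]) to [batch, dim] embeddings.

    Standard sinusoidal encoding: half the channels use sin, half cos,
    with frequencies spaced geometrically between 1 and 1/10000.
    """
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps.float()[:, None] * freqs[None, :]            # [batch, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # [batch, dim]


# Example: embed timesteps for a batch of 4 noisy samples
t = torch.randint(0, 1000, (4,))
print(sinusoidal_time_embedding(t, dim=512).shape)  # torch.Size([4, 512])
```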
```python
# Key architectural dimensions
MODEL_CONFIG = {
    'dim': 512,        # Base dimension
    'depth': 12,       # Number of transformer layers
    'heads': 8,        # Attention heads
    'dim_head': 64,    # Dimension per head
    'mlp_dim': 2048,   # FFN dimension
    'dropout': 0.1     # Dropout rate
}

# Audio processing parameters
AUDIO_CONFIG = {
    'sample_rate': 16000,
    'n_mels': 80,
    'n_fft': 1024,
    'hop_length': 256
}
```
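For reference, these audio parameters correspond to 62.5 mel frames per second of 16 kHz audio (sample_rate / hop_length). A minimal sketch of a matching mel-spectrogram front end using torchaudio is shown below; the project's actual data pipeline (presumably in `m1/data/`) may differ.

```python
import torch
import torchaudio

# Mel front end matching AUDIO_CONFIG above (illustrative, not the repo's pipeline)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

waveform = torch.randn(1, 16000)   # 1 second of mono audio at 16 kHz
mel = mel_transform(waveform)      # (channels, n_mels, frames)
print(mel.shape)                   # torch.Size([1, 80, 63])
```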
Planned initial experiments:

- Baseline model training on synthetic data
- Ablation studies on attention mechanisms
- Time embedding comparison study
- Audio encoding architecture experiments
We plan to build a research dataset from multiple sources:
- Initial Development Dataset
  - 10k Creative Commons music samples
  - Focused on single-instrument recordings
  - Clear genre categorization
- Scaled Dataset (Future Work)
  - Spotify API integration
  - SoundCloud API integration
  - Public domain music archives
Planned training configurations:
```yaml
initial_training:
  batch_size: 32
  gradient_accumulation: 4
  learning_rate: 1e-4
  warmup_steps: 1000
  max_steps: 100000

evaluation_metrics:
  - spectral_convergence
  - magnitude_error
  - musical_consistency
  - genre_accuracy
```
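The first two metrics have standard definitions from the audio-synthesis literature; a minimal sketch of how they could be computed on magnitude spectrograms is given below. This is illustrative only: the project's evaluation code may differ, and `musical_consistency` and `genre_accuracy` still need to be specified.

```python
import torch


def spectral_convergence(mag_ref: torch.Tensor, mag_gen: torch.Tensor) -> torch.Tensor:
    """Frobenius-norm distance between magnitude spectrograms, relative to the reference."""
    return torch.norm(mag_ref - mag_gen, p="fro") / torch.norm(mag_ref, p="fro")


def log_magnitude_error(mag_ref: torch.Tensor, mag_gen: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Mean absolute error between log-magnitude spectrograms."""
    return torch.mean(torch.abs(torch.log(mag_ref + eps) - torch.log(mag_gen + eps)))


# Example on random "spectrograms" of shape (n_mels, frames)
ref, gen = torch.rand(80, 63) + 1e-3, torch.rand(80, 63) + 1e-3
print(spectral_convergence(ref, gen).item(), log_magnitude_error(ref, gen).item())
```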
```bash
# Clone repository
git clone https://github.com/Agora-Lab-AI/m1.git
cd m1

# Create environment
conda create -n m1 python=3.10
conda activate m1

# Install dependencies
pip install -r requirements.txt

# Run tests
pytest tests/
```
```python
import torch
from loguru import logger

from m1.model import (
    ModelConfig,
    AudioConfig,
    MusicDiffusionTransformer,
    DiffusionScheduler,
    train_step,
    generate_audio,
)


# Example usage
def main():
    # Configure logging
    logger.add("music_diffusion.log", rotation="500 MB")
    logger.info("Setting up model configurations")

    # Set device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    logger.info(f"Using device: {device}")

    # Initialize configurations
    model_config = ModelConfig(
        dim=512,
        depth=12,
        heads=8,
        dim_head=64,
        mlp_dim=2048,
        dropout=0.1
    )
    audio_config = AudioConfig(
        sample_rate=16000,
        n_mels=80,
        audio_length=1024,
        hop_length=256,
        win_length=1024,
        n_fft=1024
    )

    # Initialize model, noise scheduler, and optimizer
    model = MusicDiffusionTransformer(model_config, audio_config).to(device)
    scheduler = DiffusionScheduler(num_inference_steps=1000)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Example forward pass with random audio and token IDs
    logger.info("Preparing example forward pass")
    batch_size = 4
    example_audio = torch.randn(batch_size, audio_config.audio_length).to(device)
    example_text = {
        'input_ids': torch.randint(0, 1000, (batch_size, 50)).to(device),
        'attention_mask': torch.ones(batch_size, 50).bool().to(device)
    }

    # Training step
    logger.info("Executing training step")
    loss = train_step(
        model,
        scheduler,
        optimizer,
        example_audio,
        example_text,
        device
    )
    logger.info(f"Training loss: {loss:.4f}")

    # Generation example
    logger.info("Generating example audio")
    generation_text = {
        'input_ids': torch.randint(0, 1000, (1, 50)).to(device),
        'attention_mask': torch.ones(1, 50).bool().to(device)
    }
    generated_audio = generate_audio(
        model,
        scheduler,
        generation_text,
        device,
        audio_config.audio_length
    )
    logger.info(f"Generated audio shape: {generated_audio.shape}")


if __name__ == "__main__":
    main()
```
```
m1/
├── configs/          # Training configurations
├── m1/
│   ├── models/       # Model architectures
│   ├── diffusion/    # Diffusion scheduling
│   ├── data/         # Data loading/processing
│   └── training/     # Training loops
├── notebooks/        # Research notebooks
├── scripts/          # Training scripts
└── tests/            # Unit tests
```
This is an active research project in early stages. Current focus:
- Implementing and testing base architecture
- Setting up data processing pipeline
- Designing initial experiments
- Building evaluation framework
Key papers informing this work:
- "Diffusion Models Beat GANs on Image Synthesis" (Dhariwal & Nichol, 2021)
- "Structured Denoising Diffusion Models" (Sohl-Dickstein et al., 2015)
- "High-Resolution Image Synthesis with Latent Diffusion Models" (Rombach et al., 2022)
We welcome research collaborations! Areas where we're looking for contributions:
- Novel architectural improvements
- Efficient training methodologies
- Evaluation metrics
- Dataset curation tools
For research collaboration inquiries:
- Submit an issue
- Start a discussion
- Email: [email protected]
This research code is released under the MIT License.
If you use this code in your research, please cite:
```bibtex
@misc{m1music2024,
  title={M1: Experimental Music Generation via Diffusion Transformers},
  author={M1 Research Team},
  year={2024},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/Agora-Lab-AI/m1}}
}
```
This is experimental research code:
- Architecture and training procedures may change significantly
- Not yet optimized for production use
- Results and capabilities are being actively researched
- Breaking changes should be expected
We're sharing this code to foster collaboration and advance the field of AI music generation research.