Text embeddings' vocabulary and PyTorch state_dicts containing weights of the AudioCLIP model trained on AudioSet:
- bpe_simple_vocab_16e6.txt.gz – CLIP's vocabulary (origin)
- CLIP.pt – vanilla CLIP (text Transformer & ResNet-50 image-head, origin)
- ESRNXFBSP.pt – ESResNeXt trained on AudioSet (standalone)
- AudioCLIP trained on AudioSet (+ video frames):
  - AudioCLIP-Full-Training.pt – training of all three heads (text, image and audio)
  - AudioCLIP-Partial-Training.pt – training of the audio-head only
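
As a quick sanity check after downloading, the vocabulary file can be opened with the Python standard library alone. The sketch below is only illustrative: the helper name `read_gzipped_lines` is hypothetical and not part of AudioCLIP or CLIP, and it assumes the file is gzip-compressed UTF-8 text, as its `.txt.gz` extension suggests.

```python
import gzip
from pathlib import Path


def read_gzipped_lines(path):
    """Return the non-empty lines of a gzip-compressed UTF-8 text file.

    Hypothetical helper for inspection only; it does not build a tokenizer.
    """
    # mode="rt" makes gzip decode the bytes to text while reading.
    with gzip.open(Path(path), mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]
```

Assuming the file was saved to the working directory, `read_gzipped_lines("bpe_simple_vocab_16e6.txt.gz")` should return its entries as a list of strings, which is enough to confirm the download is intact before handing the file to a tokenizer.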