questions and suggestions about training #6
Thank you for the work. It's great and it fills a gap. I want to restore some Chinese songs. As far as I can tell, Apollo's training data consists of English song stems and mixes. Would the model be biased when applied to East Asian songs? I tried inference and heard a slight bias in the output. Generally, Western mixes sound more powerful and dynamic than Asian ones, and the repaired Chinese songs come out with more energy in their rhythm parts. Maybe the repair could be gentler for East Asian songs. I'm downloading the original training data and will study it. My question is: are the stems necessary in the training data if the goal is only to repair final mixes? Can I train on mixtures alone? As for hyperparameters, could you offer some advice on fine-tuning from your checkpoint? It is fine-tuning, but it may turn out to be a deep one. I would appreciate any advice on training recipes. I may also develop more augmentations for song repair, but I have little experience training audio models. |
Thank you for your feedback! The training of this model took one week, and we trained for a total of 200 epochs. You’re right, I forgot to remove the wandb API key, and I have now taken care of it. Additionally, I’ve implemented your suggestion to add |
Thank you for your insights! There is likely some bias in the model due to stylistic differences, which can lead to different learned features. If you're inferring from mixed audio, you don't necessarily need the stems, as I aimed for the model to have some generalization ability and to expand the training data. I recommend fine-tuning directly from the checkpoint, which will make the model more applicable to your new data without requiring extensive additional resources. However, pay close attention to the learning rate; I believe setting it to one-tenth of the original would be optimal. Feel free to experiment with augmentations for song repairing as you progress! |
Hi, I'm starting to fine-tune Apollo on my own WAV files: 80 files totaling 300 minutes of music mixtures. Can you provide utilities for data preprocessing? In your repo, the data were processed into HDF5 files before training. Alternatively, can you give some tips on loading these WAV files directly with your implementation of MusdbMoisesdbDataset? If they can't be used as-is, please point out the unavoidable preprocessing steps. I believe source activity detection is not necessary for me, but rescaling and downsampling are vital. Could you provide a starting-point script or something similar? Also, you mentioned the learning rate should be one-tenth; is that for both G and D? Thanks. |
And does the released checkpoint include the discriminator? If not, would it be weird to initialize the model with only the generator? |
Thank you for your questions and insights!
optimizer_g:
  _target_: torch.optim.AdamW
  lr: 0.001
  weight_decay: 0.01
optimizer_d:
  _target_: torch.optim.AdamW
  lr: 0.0001
  weight_decay: 0.01
  betas: [0.5, 0.99]
If you're fine-tuning, I recommend starting with a learning rate that is one-tenth of these values. Yes, this adjustment applies to both G and D.
Let me know if you need further assistance! |
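For readers who want the one-tenth rule spelled out, here is a minimal sketch. It copies the optimizer values quoted above into a plain Python dict and scales both learning rates by 0.1. The names `BASE_OPTIMIZERS` and `scale_learning_rates` are hypothetical, and placing `betas` under `optimizer_d` is an assumption based on the order of the keys; the repo's actual config loading is not shown here.

```python
import copy

# Values copied from the config quoted in this thread.
# Assumption: `betas` belongs to optimizer_d, since it follows that block.
BASE_OPTIMIZERS = {
    "optimizer_g": {
        "_target_": "torch.optim.AdamW",
        "lr": 0.001,
        "weight_decay": 0.01,
    },
    "optimizer_d": {
        "_target_": "torch.optim.AdamW",
        "lr": 0.0001,
        "weight_decay": 0.01,
        "betas": [0.5, 0.99],
    },
}


def scale_learning_rates(optimizers, factor=0.1):
    """Return a copy of the optimizer config with every `lr` multiplied by `factor`."""
    scaled = copy.deepcopy(optimizers)  # leave the original config untouched
    for cfg in scaled.values():
        cfg["lr"] = cfg["lr"] * factor
    return scaled


# Fine-tuning config: G goes from 1e-3 to about 1e-4, D from 1e-4 to about 1e-5.
FINETUNE_OPTIMIZERS = scale_learning_rates(BASE_OPTIMIZERS)
```

The same factor is applied to G and D, matching the advice above; everything else (weight decay, betas) is left unchanged.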
Here are some key topics in my mind:
|
Thank you for raising these insightful points!
Scaling the Model: You could certainly scale up the model to match the larger dataset you have. Increasing the model size may help to capture more complex details in the audio, especially if your dataset grows substantially beyond the current 30 hours.
Hyperparameter Importance: I believe both depth and width are crucial for improving fidelity and accuracy. You might want to experiment with the hidden dimensions as well as the number of layers. The Roformer architecture, for example, balances both depth and width effectively. Paying attention to its scaling strategy might be insightful.
Hidden Dim Considerations: In my experience, the default hidden dimension of 256 works well for high-accuracy audio repair tasks. However, increasing it can enhance capacity at the cost of higher VRAM consumption. With a 24GB card, there is some room to experiment with larger dimensions, though you'll need to balance it carefully to avoid running into memory issues.
Multiple Encoding Loss & Generalization: You bring up an excellent point about multiple encodings over time. We didn't explicitly consider this in our model, but it's likely beneficial to simulate these effects in data augmentation to improve generalization. Simulating repeated lossy encoding or artifacts like those from vinyl records or tapes could help the model learn to deal with a wider range of degradation, which would be quite valuable in real-world applications.
Hope this helps! Let me know if you have further questions or if you'd like to discuss specific implementation details. |
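One way to picture the repeated-encoding augmentation is as a chain of a random number of lossy round trips. The sketch below is only an illustration and is not part of the Apollo codebase. `toy_lossy_roundtrip` is a hypothetical stand-in (a 3-tap smoothing followed by coarse quantization), not a real codec; in practice one would pass a function that encodes to and decodes from an actual lossy format.

```python
import random


def toy_lossy_roundtrip(samples, bits):
    """Hypothetical stand-in for one encode/decode pass (NOT a real codec).

    Smooths with a 3-tap moving average, then quantizes to `bits` bits.
    """
    levels = 2 ** (bits - 1)
    n = len(samples)
    smoothed = []
    for i in range(n):
        left = samples[i - 1] if i > 0 else samples[i]
        right = samples[i + 1] if i < n - 1 else samples[i]
        smoothed.append((left + samples[i] + right) / 3.0)
    # Quantize and clamp to the nominal [-1, 1] range.
    return [max(-1.0, min(1.0, round(x * levels) / levels)) for x in smoothed]


def repeated_degradation(samples, rng, max_passes=3, roundtrip=toy_lossy_roundtrip):
    """Apply between 1 and `max_passes` lossy round trips with random settings."""
    passes = rng.randint(1, max_passes)
    out = list(samples)
    for _ in range(passes):
        bits = rng.choice([6, 8, 10])  # assumed quality settings, chosen arbitrarily
        out = roundtrip(out, bits)
    return out, passes
```

The training pair would then be the clean segment as the target and the output of `repeated_degradation` as the input, so the model sees degradation of varying depth rather than a single encoding pass.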
Hi. I have finished several days of training from scratch. It early-stopped at about epoch 250. The D loss dropped quickly in the first epochs and then stayed around 0.5 for the remaining epochs, and the val loss dropped below -24 by the end. Is that normal? If so, I also wonder whether a larger batch size would improve training. I'm using batch size 2. I noticed that, across the intermediate checkpoints, the outputs for my fixed test audio differ a lot between epochs, both by ear and in the spectrogram, although they all seem to fill the spectrum. That might mean the optimizer was moving between unstable local optima from epoch to epoch, or even from batch to batch. Does a batch size of 8 or 16 make a noticeable difference when training such a model? |
Thank you for sharing your training details! Based on the training loss graph you provided (WandB link), it seems that your model has roughly converged around a loss of 0.48. This does seem reasonable, but whether it's "normal" depends on your specific dataset and model architecture. As for your question about batch size, I believe it’s worth experimenting with. Increasing the batch size to 8 or 16 could help stabilize training and reduce the variability in outputs between epochs, as larger batch sizes generally provide a more consistent gradient estimate. That said, the impact of batch size can vary depending on the nature of your model and data, so I would recommend trying it out if your hardware permits. Let me know how it goes, and feel free to share your results! |
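The claim that larger batches give a more consistent gradient estimate can be checked with a toy experiment. The sketch below treats each per-sample gradient as independent unit-variance Gaussian noise (an assumption, real gradients are not this simple) and measures how much the batch average fluctuates for batch sizes 2, 8, and 16. The function name `batch_mean_spread` is hypothetical.

```python
import random
import statistics


def batch_mean_spread(batch_size, num_batches=2000, seed=0):
    """Standard deviation of the batch-averaged 'gradient' over many batches.

    Toy model: each per-sample gradient is drawn from N(0, 1).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(num_batches):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(batch))
    return statistics.pstdev(means)


# Spread of the averaged gradient for the batch sizes discussed in this thread.
spread = {b: batch_mean_spread(b) for b in (2, 8, 16)}
```

Under this toy model the spread shrinks roughly as 1/sqrt(batch size), so going from 2 to 16 reduces the noise by a factor of about 2.8. Whether that translates into more stable checkpoints for a GAN-style setup is an empirical question.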
Thank you for bringing us this great audio restoration project.
I'm training a vocal stem enhancement model on a single RTX 3090 in bfloat16 precision (which cut the VRAM usage in half).
In the paper you used 8 GPUs to train the model. How long did it take to train the model?
Will I have to train for 8 times longer than you did because I'm using a single GPU?
Also I think you forgot to remove your wandb api key from train.py
Also, I recommend adding torch.cuda.empty_cache() in audio_litmodule.py, after self.validation_step_outputs.clear(), because after validation the VRAM usage was higher than during the first epoch.
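A minimal sketch of the suggested change, written as a standalone class so it runs without the repo. The class name `ValidationCacheSketch` and the hook placement are assumptions; only `self.validation_step_outputs.clear()` and `torch.cuda.empty_cache()` come from the suggestion itself. The `torch` import is guarded so the sketch also runs on machines without PyTorch.

```python
import gc

try:
    import torch
except ImportError:  # allow the sketch to run without PyTorch installed
    torch = None


class ValidationCacheSketch:
    """Hypothetical stand-in for the module in audio_litmodule.py."""

    def __init__(self):
        self.validation_step_outputs = []

    def on_validation_epoch_end(self):
        # Drop references to the stored validation outputs first ...
        self.validation_step_outputs.clear()
        gc.collect()
        # ... then ask PyTorch to release cached, unused GPU memory.
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()
```

Note that `torch.cuda.empty_cache()` only returns cached blocks that are no longer referenced, so clearing the output list first is what makes the call useful.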