how is the progress? #20

Open
fenghe12 opened this issue Sep 24, 2024 · 1 comment

Comments

fenghe12 commented Sep 24, 2024

We collected some face video data and constructed a new dataset (named FaceVid-1K); it may be available next month.

johndpope (Owner) commented Oct 30, 2024

Hi @fenghe12

I'm actively building out the IMF neural video codec - it's another Microsoft paper (not Stable Diffusion).
https://github.com/johndpope/IMF/branches
I got the model working / training - in a way, it's superior to MegaPortraits - no keypoints / no warping.
It's decoder-centric. Have a read of the paper - it's quite intriguing - and I've been able to plug in / upgrade different modules to make it better.

Here's the training run for IMF:
https://wandb.ai/snoozie/IMF/runs/xscj3hjo?nw=nwusersnoozie

The reconstruction is driven by a 32-float latent with some StyleGAN modulation. It's very lightweight.
It's working - but I'm struggling to get it onto any client (WASM / iOS / ONNX ...) without breaking the model or degrading it to the point of being unusable.
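
For reference, the ONNX leg of that is roughly the shape below - a minimal export sketch with a stand-in decoder, not the actual IMF model or its real input signature:

```python
import torch
import torch.nn as nn

# Stand-in for the real decoder (hypothetical; the actual IMF decoder is far more involved).
class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(32, 3 * 64 * 64)   # 32-float latent -> small RGB frame

    def forward(self, latent):
        return self.fc(latent).view(-1, 3, 64, 64)

model = TinyDecoder().eval()
latent = torch.randn(1, 32)

torch.onnx.export(
    model, (latent,), "decoder.onnx",
    opset_version=17,
    input_names=["latent"], output_names=["frame"],
    dynamic_axes={"latent": {0: "batch"}, "frame": {0: "batch"}},
)
```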

Hopefully Google can fix this:
google-ai-edge/ai-edge-torch#305

https://github.com/AlexanderLutsenko/nobuco
There's this library to convert PyTorch to TensorFlow.js - but it's a real headache because TensorFlow uses BHWC and PyTorch uses BCHW, so all the logic is flip-flopped around.
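
To illustrate the mismatch, this is the kind of permute that has to happen at every boundary (plain PyTorch, just to show the layouts):

```python
import torch

x_bchw = torch.randn(1, 3, 256, 256)    # PyTorch layout: [B, C, H, W]
x_bhwc = x_bchw.permute(0, 2, 3, 1)     # TensorFlow / TF.js layout: [B, H, W, C]
x_back = x_bhwc.permute(0, 3, 1, 2)     # and back again
assert x_back.shape == x_bchw.shape
```

Any op that bakes in a channel axis (reshapes, grid sampling, normalisation over C) has to be re-checked after conversion.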

Working through this paper - the performance of VASA is kinda unique.
I've almost exhausted IMF now, and will circle back to take another look at this paper with fresh eyes.

UPDATE

I dumped a bunch of fresh code from Claude -
the plan is to get the dataset working / validating,
and then wire up the training.
https://github.com/johndpope/VASA-1-hack/blob/main/dataset_testing.py

There's some flux here in the code / models.
I need to adjust the code to use YAML configs / accelerate.
https://github.com/johndpope/VASA-1-hack/blob/main/train.py#L746
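
The rough shape I have in mind is below - a sketch only, assuming a hypothetical configs/train.yaml with lr / batch_size / epochs keys (not the repo's actual schema) and a toy model:

```python
import yaml
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Hypothetical config file and keys - placeholders, not the real train.py schema.
with open("configs/train.yaml") as f:
    cfg = yaml.safe_load(f)        # e.g. {"lr": 1e-4, "batch_size": 4, "epochs": 1}

accelerator = Accelerator()        # handles device placement, mixed precision, multi-GPU
model = nn.Linear(512, 512)        # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["lr"])
loader = DataLoader(TensorDataset(torch.randn(64, 512)), batch_size=cfg["batch_size"])

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(cfg["epochs"]):
    for (x,) in loader:
        loss = ((model(x) - x) ** 2).mean()
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```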

UPDATE -
I added a VASADatasetTester -
the dataset / emotion detection testing
is running and passing.


UPDATE

I cherry-picked the models from MegaPortraits to do the encoding for stage 1:

python train_stage_1.py

but I'm hitting OOM - I don't remember this being broken - have to debug https://github.com/johndpope/MegaPortrait-hack

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.59 GiB of which 173.75 MiB is free. Process 4093651 has 262.20 MiB memory in use. Including non-PyTorch memory, this process has 20.43 GiB memory in use. Of the allocated memory 20.02 GiB is allocated by PyTorch, and 38.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
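
The usual knobs to reach for first (a general sketch, not the actual fix in MegaPortrait-hack): the allocator setting from the error message, mixed precision, and gradient accumulation with smaller micro-batches:

```python
import os
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# As the error message suggests - must be set before CUDA is initialised.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)                  # stand-in for the stage 1 model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(32, 1024)), batch_size=2)  # small micro-batch

scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accum_steps = 4                                           # effective batch size = 2 * 4

for step, (x,) in enumerate(loader):
    x = x.to(device)
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = ((model(x) - x) ** 2).mean() / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```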

Update - Monday 5th November
I partially solved the memory problem:
https://wandb.ai/snoozie/megaportraits/overview

It's training stage 1 here - overfitting example.
It has updated code to do warping.

Next step - attempt to import this into stage 2.
There are a million more things to do for stage 1 to match the MegaPortraits paper.
They have high-res / distillation training (teacher / student) - notably missing here.
EMOPortraits has many losses that could also be added.

But I'm more interested in testing out my latest VASA motion generator code.


Update Nov 19th

So I abandoned the MegaPortraits code / logic and cherry-picked the EMOPortraits volumetric avatar.
This is SOTA, albeit crippled with a Creative Commons license -
I have some code that is not released (I don't want my code tainted with CC).

  • I take 10 videos and run them through the volumetric feature extractor;
    for each window:

Canonical Volume: [B, T, C, D, H, W] = [1, 50, 96, 16, 64, 64]
Size = 1 * 50 * 96 * 16 * 64 * 64 * 4 bytes (float32) ≈ 1.2 GB per window

ID Embed: we only save one per video, so this is negligible.

For a 5-second video at 30 fps = 150 frames:

Number of windows = (150 - 50) / 25 + 1 = 5 windows (due to 50% overlap)
Total data per video = 5 windows * ~1.2 GB ≈ 6 GB
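
Quick check of those figures:

```python
# Quick check of the window count and per-window storage above.
b, t, c, d, h, w = 1, 50, 96, 16, 64, 64
bytes_per_window = b * t * c * d * h * w * 4              # float32
print(bytes_per_window / 2**30)                           # ≈ 1.17 GiB per window

frames, win, stride = 150, 50, 25                         # 5 s @ 30 fps, 50% overlap
n_windows = (frames - win) // stride + 1
print(n_windows, n_windows * bytes_per_window / 2**30)    # 5 windows, ≈ 5.9 GiB per video
```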

To get this into the diffusion transformer I hit OOM errors - and basically hit a wall with the 3090 GPU.
I had a rethink - extract the stage 1 features up front and save them to an h5 file.
I'm gobsmacked at how much data is necessary to store this.
Looking to tweak this somehow before attempting stage 2 training again.
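
One obvious tweak would be storing the cached volumes as float16 with chunked gzip compression - a sketch assuming h5py, with made-up file / dataset names:

```python
import h5py
import numpy as np

# Hypothetical cache layout - one group per video, one dataset per window.
# T shrunk from 50 to 2 here just to keep the example light.
volume = np.zeros((1, 2, 96, 16, 64, 64), dtype=np.float32)   # stand-in canonical volume

with h5py.File("stage1_cache.h5", "w") as f:
    grp = f.create_group("video_0000")
    grp.create_dataset(
        "window_00",
        data=volume.astype(np.float16),        # fp16 halves the footprint up front
        compression="gzip",
        compression_opts=4,
        chunks=(1, 1, 96, 16, 64, 64),         # one chunk per frame so frames can be read lazily
    )
    grp.create_dataset("id_embed", data=np.zeros(512, dtype=np.float16))
```

float16 plus chunked gzip should cut the ~6 GB per video down considerably before trying anything smarter.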
