runpod.io kohya_ss directions (in thread) #379

ghost · 2023-03-14T15:29:01Z


I had some trouble with the other linux ports (& the kohya_ss-linux that runpod has as a template)

instead you can use the latest bmaltais/kohya_ss fork:

deploy their existing RunPod Stable Diffusion v1.5 template

connect to JupyterLab via the Connect button
open a new terminal
in /workspace do git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
bash ubuntu_setup.sh

wait a while

accelerate config answers: this machine, no distributed training, NO, NO, NO, all, fp16

the ubuntu_setup.sh installs tk, but it didn't seem to work for me, I had to do this manually afterward with:

apt-get update -y
apt-get install python3-tk -y

ubuntu_setup also creates the venv, for reference the command to create it is python3 -m venv venv

if you ever need to activate the venv yourself use source venv/bin/activate

before running gui.sh, in the JupyterLab file explorer tab:

open kohya_gui.py

find interface.launch(**launch_kwargs) change it to interface.launch(**launch_kwargs,share=True)
save kohya_gui.py
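
If you would rather script that edit than do it by hand, here is a minimal Python sketch. It assumes kohya_gui.py contains exactly the call quoted above; the helper name enable_share is made up for illustration. Unlike a blind search-and-replace, it does nothing when the file is already patched, so it is safe to run twice.

```python
from pathlib import Path

# Both strings are copied from the step above; enable_share is a hypothetical helper.
OLD_CALL = "interface.launch(**launch_kwargs)"
NEW_CALL = "interface.launch(**launch_kwargs,share=True)"


def enable_share(gui_file):
    """Switch the launch call to share=True. Return True if the file was changed."""
    path = Path(gui_file)
    text = path.read_text(encoding="utf-8")
    if NEW_CALL in text:
        return False  # already patched, leave the file alone
    if OLD_CALL not in text:
        raise ValueError(f"launch call not found in {path}")
    path.write_text(text.replace(OLD_CALL, NEW_CALL), encoding="utf-8")
    return True
```

Usage would be something like enable_share("/workspace/kohya_ss/kohya_gui.py") before running gui.sh.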

back to terminal tab:

bash gui.sh

click the public Gradio link

in kohya gui:

Folders tab: you have to set the image and output folders manually, e.g. /workspace/images and /workspace/
training data can be uploaded via Jupyter
you can also use runpodctl (better to zip the data before sending): https://github.com/runpod/runpodctl
get the latest binary for your local PC
usage in Windows cmd: runpodctl send filename.ext
copy the command it prints
paste that command into a Jupyter terminal and the file transfers
install unzip if needed apt install unzip
ex usage unzip file.zip -d toFolder
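
If apt is not available on the pod, the same extraction can be done from Python's standard library. This is only a sketch of an alternative to unzip file.zip -d toFolder; the function name extract_zip is an assumption for illustration.

```python
import zipfile
from pathlib import Path


def extract_zip(archive, dest):
    """Extract every member of `archive` into `dest`, creating `dest` if needed."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_path)
        return zf.namelist()  # names of the extracted members
```

For example, extract_zip("file.zip", "toFolder") mirrors the unzip command above.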

kohya gui notes

Optimizer AdamW8Bit might error
if it errors, try switching optimizer to AdamW

saidben2022 · 2023-03-18T22:54:47Z


thank you very much, you really saved me a lot of hassle
it doesn't seem to work when I test it with an AMD CPU

1 reply

drphero Mar 20, 2023

Exact same problem here. Any fixes?

ppetrucz · 2023-03-20T13:57:08Z


The ubuntu_setup.sh has an error but doesn't block the rest of the script:
root@ba79bdcd8d42:/workspace/kohya_ss# bash ubuntu_setup.sh
installing tk
ubuntu_setup.sh: line 3: sudo: command not found

Is it something expected?

7 replies

Norian11 Mar 20, 2023

I suppose that's the TK installation failure the author mentions; I think it happens to everyone. Right after the script finished, I installed TK with the two commands the author of this post provided. But I still can't use the UI because of the error I wrote about in another comment below.

ppetrucz Mar 20, 2023

I've installed TK manually as the author noted, previously, this error might not be related to it.

Norian11 Mar 20, 2023

Well, line 3 of that file says "sudo apt install python3-tk", so I'm pretty sure that is the TK installation. I just installed it after the ubuntu_setup.sh script, but as I said, even after launching the UI I can't use Kohya because of another error. Better to wait until those errors are solved, since they make this unusable. I'm thinking of using a Linux Desktop template, which makes everything easier.

Norian11 Mar 20, 2023

I think they updated ubuntu_setup.sh about 12 hours ago, and now tk installs automatically without any problem, at least now that I'm trying this again.

APZmedia Apr 5, 2023

Hi! I get this error
Traceback (most recent call last):
  File "/workspace/kohya_ss/kohya_gui.py", line 4, in <module>
    from dreambooth_gui import dreambooth_tab
  File "/workspace/kohya_ss/dreambooth_gui.py", line 13, in <module>
    from library.common_gui import (
  File "/workspace/kohya_ss/library/common_gui.py", line 1, in <module>
    from tkinter import filedialog, Tk
ModuleNotFoundError: No module named 'tkinter'

I tried installing tkinter using
apt-get install python3-tk

ppetrucz · 2023-03-21T10:02:31Z


Everything was installed correctly, although I did remove sudo and added apt update before installing python3-tk.
I tried training and the following is the output.

Folder 100_nnrmml: 0 images found
Folder 100_nnrmml: 0 steps
max_train_steps = 0
stop_text_encoder_training = 0
lr_warmup_steps = 0
accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="/workspace/anna_Lora_New/image" --resolution=512,512 --output_dir="/workspace/anna_Lora_New/" --logging_dir="/workspace/anna_Lora_New/" --network_alpha="128" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-5 --unet_lr=0.0001 --network_dim=128 --output_name="nnrmml_v1" --lr_scheduler_num_cycles="1" --learning_rate="0.0001" --lr_scheduler="cosine_with_restarts" --train_batch_size="9" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="fp16" --seed="1234" --caption_extension=".txt" --optimizer_type="AdamW8bit" --max_data_loader_n_workers="0" --caption_dropout_rate="0.1" --bucket_reso_steps=32 --xformers --bucket_no_upscale
2023-03-21 09:58:45.628678: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-21 09:58:45.701866: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-03-21 09:58:45.719668: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-21 09:58:46.044821: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-03-21 09:58:46.044852: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-03-21 09:58:46.044856: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-03-21 09:58:47.005335: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-21 09:58:47.077119: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-03-21 09:58:47.095235: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-21 09:58:47.414579: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-03-21 09:58:47.414612: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-03-21 09:58:47.414616: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
prepare tokenizer
Use DreamBooth method.
prepare images.
found directory /workspace/anna_Lora_New/image/100_nnrmml contains 0 image files
ignore subset with image_dir='/workspace/anna_Lora_New/image/100_nnrmml': no images found / 画像が見つからないためサブセットを無視します
0 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 9
resolution: (512, 512)
enable_bucket: False

[Dataset 0]
loading image sizes.
0it [00:00, ?it/s]
prepare dataset
No data found. Please verify arguments (train_data_dir must be the parent of folders with images) / 画像がありません。引数指定を確認してください（train_data_dirには画像があるフォルダではなく、画像があるフォルダの親フォルダを指定する必要があります）

The images are in the folder correctly. Locally it works.
There are several libraries missing according to the output.

4 replies

ppetrucz Mar 21, 2023

Doing this solves the missing library issues, but still no images found. Make sure to run gui.sh in the same terminal you set the envs.

cd /workspace/kohya_ss/venv/lib/python3.10/site-packages/tensorrt
ln -s libnvinfer_plugin.so.8 libnvinfer_plugin.so.7
ln -s libnvinfer.so.8 libnvinfer.so.7

cd /workspace/kohya_ss/venv/lib/python3.10/site-packages/nvidia/cuda_runtime/lib
ln -s libcudart.so.12 libcudart.so.11.0

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/workspace/kohya_ss/venv/lib/python3.10/site-packages/tensorrt/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/workspace/kohya_ss/venv/lib/python3.10/site-packages/nvidia/cuda_runtime/lib/
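
The export lines above only affect the shell they are typed in and the processes started from it, which is why gui.sh has to be launched from the same terminal. As a rough illustration, the same idea in Python: build the extended value and pass it explicitly to the child process. The helper name extend_library_path is an assumption; the directories are the ones from the commands above.

```python
import os


def extend_library_path(current, extra_dirs):
    """Append directories to a search-path string, skipping ones already present."""
    parts = [p for p in current.split(os.pathsep) if p]
    for directory in extra_dirs:
        if directory not in parts:
            parts.append(directory)
    return os.pathsep.join(parts)


# Sketch of launching the GUI with the extended path (not executed here):
#   import subprocess
#   site = "/workspace/kohya_ss/venv/lib/python3.10/site-packages"
#   env = dict(os.environ)
#   env["LD_LIBRARY_PATH"] = extend_library_path(
#       env.get("LD_LIBRARY_PATH", ""),
#       [site + "/tensorrt/", site + "/nvidia/cuda_runtime/lib/"],
#   )
#   subprocess.run(["bash", "gui.sh"], cwd="/workspace/kohya_ss", env=env)
```

Passing env to the child explicitly avoids depending on which terminal the exports were typed in.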

ppetrucz Mar 21, 2023

The problem was that my images had a .JPG extension (capital letters), and lora_gui.py only checks for the lowercase jpg ending. I added a check for JPG there, but something later in the pipeline most probably checks again, because the images were found at the beginning but not afterwards. After renaming the images to *.jpg it works, and I'm able to train the LoRA.
On an RTX 3090, with 18 images and a batch size of 9, it needs 200 steps and finishes in under 4 minutes.
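
For anyone hitting the same uppercase-extension problem, here is a small Python sketch that renames the files instead of patching lora_gui.py. The set of image suffixes and the function name are assumptions for illustration; adjust them to your data.

```python
from pathlib import Path

# Assumed list of image types; extend it if your dataset uses others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def lowercase_image_suffixes(folder):
    """Rename files such as IMG.JPG to IMG.jpg; return the new names."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        suffix = path.suffix
        if not path.is_file() or suffix == suffix.lower():
            continue
        if suffix.lower() not in IMAGE_SUFFIXES:
            continue
        target = path.with_suffix(suffix.lower())
        # Refuse to overwrite a different file that already has the lowercase name.
        if target.exists() and not target.samefile(path):
            raise FileExistsError(f"{target} already exists")
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

Run it on the folder that actually holds the images (for example the 100_nnrmml subfolder), not on its parent.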

use AdamW optimizer | {}
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 1800
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 200
num epochs / epoch数: 1
batch size per device / バッチサイズ: 9
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 200
steps: 0%| | 0/200 [00:00<?, ?it/s]epoch 1/1
steps: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [03:14<00:00, 1.03it/s, loss=0.11]save trained model to /workspace/annaLoraNew/model/nnrmml_v1.safetensors
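
The step count in the log above follows from simple arithmetic: 18 images times 100 repeats (the 100_ folder prefix) gives 1800 samples, and at batch size 9 that is 200 batches per epoch. A simplified sketch, which ignores regularization images, bucketing and gradient accumulation; the function name is made up for illustration:

```python
import math


def optimization_steps(num_images, repeats, batch_size, epochs=1):
    """Simplified step count: batches per epoch times number of epochs."""
    batches_per_epoch = math.ceil(num_images * repeats / batch_size)
    return batches_per_epoch * epochs


print(optimization_steps(18, 100, 9))  # 200, matching the log above
```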

drphero Mar 21, 2023

Strangely my virtual environment doesn't have tensorrt. Should I just manually install it with pip?

ppetrucz Mar 23, 2023

I think so. You can follow this:
#379 (comment)

ppetrucz · 2023-03-21T21:49:25Z


I've written down the process of how I make it work on runpod. Hope it helps.

Run the SD 1.5 RunPod template with at least 15 GB of container storage and 30 GB of persistent storage:

https://www.runpod.io/console/gpu-secure-cloud?template=runpod-stable

Execute the following commands in a Jupyter terminal.

1. Optional, kill the SD server if you don't need it

fuser -k 3000/tcp

2. Setup kohya_ss

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
sed -i 's/sudo apt install python3-tk/apt update -y; apt install -y python3-tk/g' ubuntu_setup.sh && sed -i "s/interface.launch(\*\*launch_kwargs)/interface.launch(\*\*launch_kwargs,share=True)/g" kohya_gui.py
./ubuntu_setup.sh

Choose "This machine", "No distributed training", 3x "NO", then type "all", and finally select fp16

3. Install tensorrt

source /workspace/kohya_ss/venv/bin/activate
pip install tensorrt

4. Fix missing libnvinfer.so.7 and libnvinfer_plugin.so.7 libraries

cd /workspace/kohya_ss/venv/lib/python3.10/site-packages/tensorrt
ln -s libnvinfer_plugin.so.8 libnvinfer_plugin.so.7
ln -s libnvinfer.so.8 libnvinfer.so.7
cd /workspace/kohya_ss/venv/lib/python3.10/site-packages/nvidia/cuda_runtime/lib
ln -s libcudart.so.12 libcudart.so.11.0
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/workspace/kohya_ss/venv/lib/python3.10/site-packages/tensorrt/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/workspace/kohya_ss/venv/lib/python3.10/site-packages/nvidia/cuda_runtime/lib/

5. Not sure if it's needed but I always rerun the ubuntu_setup.sh after post-installing tensorrt, it downloads some additional things. Choose the same options as before.

cd /workspace/kohya_ss
./ubuntu_setup.sh

6. Optional install gdown and pull your data

pip install gdown && apt install -y unzip
cd /workspace && gdown https://drive.google.com/uc?id=YOUR_FILE_ID_HERE

7. Start the server

cd /workspace/kohya_ss && ./gui.sh

16 replies

ppetrucz Mar 23, 2023

Another question guys, what are your methods to download models, I think I use wget command, there is something better?

I usually have the models on google drive and just use gdown to pull it.

Norian11 Mar 23, 2023

Oh, gdown creates a copy? I always used !mv in Colab and then I had to upload everything to Drive again after I finished XD. Thanks for the tip

ohminy Mar 24, 2023

I finished step 2 (setup kohya_ss) and selected fp16.

Then I tried to install tensorrt with the command below, but:

root@14c1f5703258:/workspace/kohya_ss# source /workspace/kohya_ss/venv/bin/activate
bash: /workspace/kohya_ss/venv/bin/activate: No such file or directory

Why doesn't it work? Why can't I find /workspace/kohya_ss/venv/bin/activate?

please help me...

HuanchengHu Mar 24, 2023


Which pod are you using? I encountered the same issue with the Fast Stable Diffusion pod, but it works for me with the Stable Diffusion 1.5 pod.

joegibes Mar 28, 2023

I have a few updates for @ppetrucz 's excellent walkthrough. I was gonna post a separate script version, but you can copy/paste it yourself and comment out the appropriate lines. Changes:

the accelerate config format seems to have changed as of 03/28; you can add 0; 0; no; no; fp16 to the script to automate the config (insert it after ubuntu_setup.sh is run).
orjson (one of the requirements) requires Rust/cargo and a pip update.
Add this before the Step 5 re-run of ubuntu_setup.sh:

apt update
apt install build-essential 
pip install --upgrade pip
curl https://sh.rustup.rs -sSf | sh

(Not sure if gcc toolkit is actually needed)
(Also, Note that orjson==3.8.7 might work without needing rust/cargo, but I haven't checked compatibility: ijl/orjson#366)

Note that to successfully run LORA training, I did have to change AdamW8bit to AdamW, and enable Gradient checkpointing and Memory efficient attention. Otherwise, I got the library error... Maybe some more work is needed to find the root cause. (from bitsandbytes-foundation/bitsandbytes#169 (comment))
EDIT: Just using AdamW works.

ohminy · 2023-03-24T09:32:21Z


Thank you, I changed to the SD 1.5 pod and that problem is solved. But there is another issue: NameError: st2optimizer8bit_blockwise. When I changed from AdamW8bit to AdamW the error went away, but training is very slow... T.T


0 replies

jeanhadrien · 2023-03-26T20:32:07Z


Here's a sh script to do everything at once and start the GUI:

run.sh

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
bash ubuntu_setup.sh
apt-get update -y
apt-get install python3-tk -y
sed -i 's/interface\.launch(\*\*launch_kwargs/interface.launch(\*\*launch_kwargs,share=True/g' kohya_gui.py
bash gui.sh

put that in /workspace and do bash run.sh

another script to restart gui if needed :

restart.sh

cd kohya_ss
bash gui.sh

2 replies

APZmedia Apr 5, 2023

After cloning the repo, running ubuntu_setup.sh fails with "No such file or directory". The only two scripts are setup.sh and gui.sh.
If we run setup.sh instead of ubuntu_setup.sh, it installs, but there is a tkinter error even though the tkinter installation works.

Traceback (most recent call last):
  File "/workspace/kohya_ss/kohya_gui.py", line 4, in <module>
    from dreambooth_gui import dreambooth_tab
  File "/workspace/kohya_ss/dreambooth_gui.py", line 13, in <module>
    from library.common_gui import (
  File "/workspace/kohya_ss/library/common_gui.py", line 1, in <module>
    from tkinter import filedialog, Tk
ModuleNotFoundError: No module named 'tkinter'

DoguCatto Apr 9, 2023


same

ghost · 2023-03-29T01:04:20Z


if you don't want Auto1111 to run, edit pod, expand the Environment Variables section, add a key RUNPOD_STOP_AUTO with value 1
(this functionality is built into the start.sh file found in the root of the server, you can add your own code there for kohya_ss)

killing the Auto1111 server with fuser -k 3000/tcp will not work because by default it is on infinite sleep with a relauncher script

0 replies

dpyy · 2023-03-30T03:26:29Z


can someone please fix NameError: st2optimizer8bit_blockwise

3 replies

ppetrucz Mar 30, 2023

Not sure about fixing it but till then, as a workaround you can use adamW.

dpyy Mar 30, 2023

how to do that?

joegibes Mar 31, 2023

@dpyy change the setting in the gui, might be down lower in Advanced Settings.

SGKino · 2023-03-30T18:11:38Z


Here is the problem I got,
when I am trying to train my dataset with GUI
and don't know how to fix...

Traceback (most recent call last):
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/routes.py", line 384, in run_predict
output = await app.get_blocks().process_api(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1024, in process_api
result = await self.call_function(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 836, in call_function
prediction = await anyio.to_thread.run_sync(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/workspace/kohya_ss/lora_gui.py", line 422, in train_model
repeats = int(folder.split('_')[0])
ValueError: invalid literal for int() with base 10: '.ipynb'
Folder 100_CandyL: 15 images found
Folder 100_CandyL: 1500 steps
Traceback (most recent call last):
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/routes.py", line 384, in run_predict
output = await app.get_blocks().process_api(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1024, in process_api
result = await self.call_function(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 836, in call_function
prediction = await anyio.to_thread.run_sync(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/workspace/kohya_ss/lora_gui.py", line 422, in train_model
repeats = int(folder.split('_')[0])
ValueError: invalid literal for int() with base 10: '.ipynb'

0 replies

djcedr · 2023-03-30T18:28:52Z


I've seen that before. It might be because you have spaces somewhere in your filenames, probably in the image names...


2 replies

SGKino Mar 30, 2023

Thank you for your reply.
But where should I type this command

rm -rf .ipynb_checkpoints

SGKino Mar 30, 2023

I finally got the location right...
Is somewhere in my training data set folder...thanks for your advice
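
For reference, the cleanup can also be done for a whole dataset tree at once. This is a minimal Python sketch, roughly equivalent to running rm -rf .ipynb_checkpoints in every subfolder; the function name is an assumption for illustration.

```python
import shutil
from pathlib import Path


def remove_ipynb_checkpoints(root):
    """Delete every .ipynb_checkpoints directory under `root`; return their paths."""
    removed = []
    # Collect matches first so deleting does not disturb the directory walk.
    for path in sorted(Path(root).rglob(".ipynb_checkpoints")):
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(str(path))
    return removed
```

Jupyter recreates these hidden folders when you open files in its browser, so it may be worth running this again right before training.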

SGKino · 2023-03-31T16:26:46Z


I have a new problem now.
I can run gui.sh successfully and it opens the public URL,
but I can't use it because of a 504 Gateway Timeout.

Can anyone tell me what I should do?
Is it a problem with my network or with RunPod?

0 replies

DoguCatto · 2023-04-09T07:15:14Z


this effing tkinter has been preventing me from using kohya_ss on runpod for weeks now,
I'm literally pulling out my hair, frustrated since I have no coding background,
all guides and tutorials gave me the same error, and after trying everything from this page, it still gives me the same error:

Validating that requirements are satisfied.
All requirements satisfied.
Traceback (most recent call last):
  File "/workspace/kohya_ss/kohya_gui.py", line 4, in <module>
    from dreambooth_gui import dreambooth_tab
  File "/workspace/kohya_ss/dreambooth_gui.py", line 13, in <module>
    from library.common_gui import (
  File "/workspace/kohya_ss/library/common_gui.py", line 1, in <module>
    from tkinter import filedialog, Tk
ModuleNotFoundError: No module named 'tkinter'
(venv) root@a15b90ed32cb:/workspace/kohya_ss#

8 replies

Sintho Apr 14, 2023

"fuser -k 3000/tcp" should do the trick

Kida007 Apr 21, 2023

worksss.

C1ao3Bo0 Apr 22, 2023

Successfully ran it with this method, but I'm having the following issue when trying to train a LoRA. Any thoughts? Thanks.

Traceback (most recent call last):
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/routes.py", line 401, in run_predict
output = await app.get_blocks().process_api(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1302, in process_api
result = await self.call_function(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1025, in call_function
prediction = await anyio.to_thread.run_sync(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/workspace/kohya_ss/lora_gui.py", line 432, in train_model
repeats = int(folder.split('_')[0])
ValueError: invalid literal for int() with base 10: '.ipynb'

Kida007 Apr 22, 2023


There must be a .ipynb_checkpoints folder in your image folder. It's hidden, so run "ls -a" to see it, then delete it.
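
To see why the hidden folder triggers that ValueError: the traceback shows the GUI reading the repeat count from the part of each folder name before the first underscore, and for ".ipynb_checkpoints" that part is ".ipynb". The sketch below reproduces the parsing in a tolerant way and lists which subfolders would be rejected. parse_repeats and scan_dataset are hypothetical helpers, not functions from kohya_ss.

```python
import os


def parse_repeats(folder_name):
    """Return the repeat count from a name like '100_CandyL', or None if absent."""
    prefix = folder_name.split("_")[0]
    try:
        return int(prefix)
    except ValueError:
        return None  # e.g. '.ipynb_checkpoints' gives the prefix '.ipynb'


def scan_dataset(train_data_dir):
    """Split subfolders into {name: repeats} and a list of names that fail to parse."""
    valid, skipped = {}, []
    # os.listdir includes hidden entries, like `ls -a` does.
    for name in sorted(os.listdir(train_data_dir)):
        if not os.path.isdir(os.path.join(train_data_dir, name)):
            continue
        repeats = parse_repeats(name)
        if repeats is None:
            skipped.append(name)
        else:
            valid[name] = repeats
    return valid, skipped
```

Anything that ends up in the skipped list is a folder to delete or rename before starting training.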

C1ao3Bo0 Apr 22, 2023

dam!! you are awesome!! a millions thx!!

kodxana · 2023-04-14T14:48:31Z


Here is my setup script: https://github.com/kodxana/SCforRunpod/blob/main/kohya_ss-installer.sh
It's made to use with Kasm based template: https://runpod.io/gsc?template=9thomk0pjf&ref=vfker49t
Fixed recent issues with accelerate config so it should work fine.

3 replies

drphero Apr 17, 2023

The script seems to complete just fine, but the gui doesn't work for opening folders or selecting files for some reason.

duskfallcrew May 6, 2023

oO you do have Jupyter's file browser in the background; it's limited, yes, but as a suggestion you could also add emjoy finder to the things you add ad hoc to your setup.

monydochev May 9, 2023

Thank you. It works.

duskfallcrew · 2023-05-06T01:19:57Z


Has anyone noticed that you CAN use ocamlfuse for Google Drive? runpodctl never works for me; it confuses me and drives me nuts. But evidently, in theory, you can install ocamlfuse on anything that's not Colab XD and Google's the one that developed it.

0 replies

picobyte · 2023-05-11T21:05:24Z


Edit: ignore this, unless you're interested in the cause, use the script 15ky3 mentioned, and see my comment below it if your python 3 still complains about tkinter after that.

I tried running on runpod, After install the tkinter bug strikes. Nothing mentioned here really resolves it. There is no ubuntu_upgrade.sh anymore, only upgrade.sh. If you install the python3-tk via apt-get the links still point to python 3.10, you need to make python link to python 3.8, because only that one has tkinter. I just removed the links and recreated the new ones.

cd venv/bin

rm python3

ln -s /usr/bin/python3.8 python3

however, then you're missing pip, which you need to install using
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

however, before installing it, you need to

apt install python3.8-distutils
and then
python3.8 get-pip.py
but the install scripts are broken, they keep trying to get back to python 3.10, which does not come installed with tkinter.

0 replies

15ky3 · 2023-05-11T21:13:49Z


Maybe you can use my script for vast.ai provided here.
I don't know if it works on RunPod, but give it a try :)

If it doesn't work with the script, maybe you should execute:
export LD_LIBRARY_PATH=/opt/conda/lib:$LD_LIBRARY_PATH
export MKL_THREADING_LAYER=1

0 replies

picobyte · 2023-05-11T21:27:09Z


thanks, might have helped, but what I also needed was:

apt-get install python3.10-tk
apt install python3.10-distutils
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3.10 get-pip.py

If you do not specify 3.10 explicitly you get Python 3.8, which is ignored when running gui.sh.
Now I have it running on RunPod Fast Stable Diffusion. Note, however, that this installation seems to interfere with the stable-diffusion-webui; at least I now get an error when running step 2: Install/Update AUTOMATIC1111 repo. The easiest fix is to reset the pod to reinstall that.
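
Since the root cause above is that tkinter was installed for a different Python than the one the venv uses, a quick check from inside the venv can save some guessing. A minimal sketch; the function name is made up, and the suggested apt package name only follows the python3.10-tk pattern used above, so treat it as an assumption for other versions.

```python
import sys


def tkinter_status():
    """Report which interpreter is running and whether it can import tkinter."""
    try:
        import tkinter  # noqa: F401  (only checking that the import succeeds)
        available = True
    except ImportError:
        available = False
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return {
        "executable": sys.executable,
        "python": version,
        "tkinter": available,
        # Assumed naming pattern, taken from the python3.10-tk command above.
        "apt_package": f"python{version}-tk",
    }


if __name__ == "__main__":
    print(tkinter_status())
```

Run it with the venv's own interpreter (venv/bin/python3), because that is the Python gui.sh ends up using.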

0 replies

Norian11 · 2023-06-19T18:11:42Z


If someone is having problems with this error, which appeared very recently:
runtimeerror: detected that pytorch and torchvision were compiled with different cuda versions. pytorch has cuda version=11.7 and torchvision has cuda version=11.8. please reinstall the torchvision that matches your pytorch install.

with the virtual environment activated, I just updated PyTorch to use CUDA 11.8:
pip uninstall torch -y

pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
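
After reinstalling, a rough way to confirm that both packages come from the same CUDA build is to compare the "+cuXXX" suffix of their version strings. This is only a heuristic sketch: wheels from the default PyPI index may carry no suffix at all, and the helper names are assumptions for illustration.

```python
def cuda_tag(version):
    """Return the local build tag after '+', e.g. 'cu118', or None if there is none."""
    _, sep, local = version.partition("+")
    return local if sep else None


def same_cuda_build(torch_version, torchvision_version):
    """True when both version strings carry the same build tag."""
    return cuda_tag(torch_version) == cuda_tag(torchvision_version)


# Inside the venv you could check the installed packages like this:
#   import torch, torchvision
#   print(torch.__version__, torchvision.__version__)
#   print(same_cuda_build(torch.__version__, torchvision.__version__))
```

With the versions from the pip command above, same_cuda_build("2.0.0+cu118", "0.15.1+cu118") is True.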

1 reply

unk1911 Jun 23, 2023

didn't work for me, same error

bmaltais · 2023-06-24T00:58:18Z

Maintainer

I have reworked the RunPod setup. It should now be easier than ever to run on RunPod.

Run the Torch 2.0.1 template.

cd /workspace
git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
git checkout dev2
./setup.sh -p

I hope I have not broken everything else in the process... If I did, I apologise, and let's work on making RunPod support top notch.

2 replies

drphero Jun 25, 2023

Install seems to work, however no venv is created. I manually installed python3.10-venv and python3-venv and it still won't make and use it. I am able to manually make a venv with python3.10 -m venv but it won't use the venv that is already created inside the kohya_ss folder.

EDIT:
If the ! is removed on line 187 of setup.sh, it will create a venv. I'm not sure if this is a typo or intended. But if people want to run kohya_ss and something like automatic1111 on the same pod, using venv would be useful.

synapsestorm Jul 25, 2023

Probably a noob question, but if I have a multi-gpu runpod, would I need to consider the distributed training when installing Kohya?
