HowTo: Running FLUX.1 [dev] on A770 (Forge, ComfyUI)
System Requirements
- Windows PC
- at least 32GB RAM
- Intel Arc A770
Resources
- text encoders (clip_l, t5xxl): https://huggingface.co/comfyanonymous/flux_text_encoders/tree/main
- vae: https://huggingface.co/black-forest-labs/FLUX.1-schnell/blob/main/ae.safetensors
- diffusion model: https://huggingface.co/city96/FLUX.1-dev-gguf (q4_1 recommended)
- ComfyUI workflow: https://comfyanonymous.github.io/ComfyUI_examples/flux/#flux-dev
Installation
- Update the Intel Arc driver: https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html
- Install the oneAPI Base Toolkit: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html (an optional device check is sketched after this list)
- Install Git: https://git-scm.com/downloads
- Install Miniforge: https://conda-forge.org/download/
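After installing the driver and oneAPI, you can optionally confirm the GPU is visible to oneAPI. This is a minimal sketch; sycl-ls ships with the oneAPI Base Toolkit, and the exact device strings may differ on your system.
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
sycl-ls
The A770 should be listed as a Level Zero (and OpenCL) GPU device.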
stable-diffusion-webui-forge
Setup
1. Run "Miniforge Prompt",create env then install torch.
conda create -n forge python==3.11 libuv
conda activate forge
pip install torch==2.1.0.post3 torchvision==0.16.0.post3 torchaudio==2.1.0.post3 intel-extension-for-pytorch==2.1.40+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
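(Optional) A minimal check that the XPU build of torch sees the A770, assuming the versions installed above (run it in the same forge env, after calling oneAPI's setvars.bat so the runtime libraries are available):
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__); print(torch.xpu.is_available(), torch.xpu.get_device_name(0))"
It should print the versions, True, and a device name containing something like "Arc(TM) A770".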
2. Clone Forge into a forge directory (or any name you prefer).
cd <WHERE_TO_DOWNLOAD>
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge forge
3. Until https://github.com/lllyasviel/stable-diffusion-webui-forge/pull/1162 is merged, check the diff and apply it (or simply overwrite backend\nn\flux.py with https://raw.githubusercontent.com/lllyasviel/stable-diffusion-webui-forge/3a8cf833e148f88e37edd17012ffaf3af7480d40/backend/nn/flux.py, as sketched below). This step is no longer needed after commit cc37858.
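If you take the overwrite route, a minimal sketch using the raw URL above (curl is bundled with recent Windows 10/11; run it from the directory you cloned Forge into):
cd forge
curl -L -o backend\nn\flux.py https://raw.githubusercontent.com/lllyasviel/stable-diffusion-webui-forge/3a8cf833e148f88e37edd17012ffaf3af7480d40/backend/nn/flux.py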
4. Place the resources:
- diffusion model to models/Stable-diffusion
- vae to models/VAE
- clip_l and t5xxl to models/text_encoder
- Modify webui-user.bat
@echo off
set COMMANDLINE_ARGS=--use-ipex --disable-xformers --unet-in-bf16 --always-low-vram
set SKIP_VENV=1
call %USERPROFILE%\miniforge3\Scripts\activate.bat forge
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
call webui.bat
5. Double-click webui-user.bat in File Explorer and wait for the installation to finish.
- It may take a long time (for me, about 20 minutes):
Startup time: 1254.7s (prepare environment: 1231.5s, launcher: 5.1s, import torch: 6.6s, initialize shared: 0.5s, other imports: 3.1s, list SD models: 1.5s, load scripts: 3.1s, create ui: 2.2s, gradio launch: 1.1s).
Test
- (recommended) Go to "Settings", search for "cpu", and change RNG to "CPU".
- Set Checkpoint to flux1-dev-Q4_0.gguf, and VAE / Text encoder to clip_l, ae, and t5xxl_fp16 in the top selectors.
- Set the prompt to "hello, world", size to 1024x1024, seed to 42, then press the "Generate" button.
ComfyUI
Setup
1. Create a conda env, clone ComfyUI, and install the requirements (an optional check follows the commands).
conda create -n comfyui python==3.11 libuv
conda activate comfyui
pip install torch==2.1.0.post3 torchvision==0.16.0.post3 torchaudio==2.1.0.post3 intel-extension-for-pytorch==2.1.40+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
cd <WHERE_TO_DOWNLOAD>
git clone https://github.com/comfyanonymous/ComfyUI && cd ComfyUI
pip install -r requirements.txt
cd custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF && cd ..
pip install gguf "numpy<2.0"
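(Optional) Installing ComfyUI's requirements normally leaves the already-installed XPU wheels alone, but if you want to be sure torch was not replaced, a quick check:
pip show torch intel-extension-for-pytorch
The versions should still read 2.1.0.post3 and 2.1.40+xpu.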
2. Place the resources:
- clip_l and t5xxl to models/clip
- vae to models/vae
- diffusion models to models/diffusion_models (or models/checkpoints, depending on the model)
NOTE: the unet directory is deprecated, so use diffusion_models instead.
(Optional) You can also reuse the models from Forge by creating extra_model_paths.yaml. See the Tip section.
3. Create run.bat.
call %USERPROFILE%\miniforge3\Scripts\activate.bat comfyui
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
python main.py --auto-launch --disable-xformers --bf16-unet --lowvram
Test
- Double-click run.bat in File Explorer, then drag & drop the Flux dev workflow image.
- Replace the Load Diffusion Model node with a Unet Loader (GGUF) node, select flux1-dev-Q4_0.gguf, and connect it to the Model Sampling Flux node.
- Press the "Queue" button.
Tip: Sharing models between Forge and ComfyUI
ComfyUI has a brilliant feature for loading models from other tools.
You just need to create extra_model_paths.yaml in the root of ComfyUI.
Here's a slightly modified version of the example; I just added clip and diffusion_models.
forge:
    base_path: <YOUR_FORGE_DIRECTORY>
    checkpoints: models/Stable-diffusion
    clip: models/text_encoder
    configs: models/Stable-diffusion
    diffusion_models: models/Stable-diffusion
    vae: models/VAE
    loras: |
        models/Lora
        models/LyCORIS
    upscale_models: |
        models/ESRGAN
        models/RealESRGAN
        models/SwinIR
    embeddings: embeddings
    hypernetworks: models/hypernetworks
    controlnet: models/ControlNet
However, Forge uses one directory for both checkpoints and diffusion models, while ComfyUI uses separate directories.
You can simply point both checkpoints and diffusion_models at the Stable-diffusion directory, like below.
checkpoints: models/Stable-diffusion
diffusion_models: models/Stable-diffusion
But in that case, you may see all models in both the "Load Checkpoint" node and the "Load Diffusion Model" node.
So I suggest creating checkpoints and diffusion_models directories and symlinking them into the Stable-diffusion directory:
cd <YOUR_FORGE_DIRECTORY>\models
mkdir diffusion_models
mkdir checkpoints
cd Stable-diffusion
mklink /d dfs ..\diffusion_models
mklink /d ckpts ..\checkpoints
Then change the yaml file so that checkpoints and diffusion_models point to the new directories:
forge:
    base_path: <YOUR_FORGE_DIRECTORY>
    checkpoints: models/checkpoints
    clip: models/text_encoder
    configs: models/Stable-diffusion
    diffusion_models: models/diffusion_models
    vae: models/VAE
    loras: |
        models/Lora
        models/LyCORIS
    upscale_models: |
        models/ESRGAN
        models/RealESRGAN
        models/SwinIR
    embeddings: embeddings
    hypernetworks: models/hypernetworks
    controlnet: models/ControlNet
Simple comparison vs RTX3060
Generation Speed
AMD 5600G, 32GB DDR4, Windows 11
A770 (PCIe 3.0 x4) / RTX3060 (PCIe 4.0 x4, Power limit 130W)
Prompt: "hello, world", 1024x1024, seed 42, t5xxl_fp16
| | q4_0 | q4_1 |
|---|---|---|
| A770 Forge | 86.5s, 3.30s/it | |
| A770 ComfyUI | 80.63s, 3.31s/it | |
| RTX3060 Forge | 107.5s, 4.96s/it | |
| RTX3060 ComfyUI | 91.51s, 4.23s/it | |
The A770 is about 15~20% faster than the RTX3060, which is reasonable performance.
Image Check
The result looks different from the RTX3060, presumably because of differences in computation, but the results from ComfyUI and Forge are identical.
prompt: hello, world, size: 1024x1024, seed: 42
Limitations
- The A770 has the potential to run fp8/q8_0, but generation becomes about 10x slower once it starts using shared GPU memory, and unlike NVIDIA, Intel Arc has no option to disable shared GPU memory.
- However, I could run q5_1 and q6_k (new!), and their quality seems okay to me. Thanks, city96!
- bitsandbytes still doesn't support Intel Arc, so you can't use nf4 models.
- I didn't test LoRA, but it may work.
- Loading the diffusion model and clip (mainly t5xxl) uses more than 20GB of RAM, so with 32GB, watch out for running out of memory. If you have 64GB or more, you can try WSL2 with tcmalloc; it may boost generation performance (see the sketch below).
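A minimal sketch of the tcmalloc idea, assuming ComfyUI is already set up inside an Ubuntu-based WSL2 distro (the package name and library path may differ on other distros; this only swaps the CPU-side allocator):
sudo apt install libgoogle-perftools4
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
python main.py --auto-launch --disable-xformers --bf16-unet --lowvram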
Refs
- https://github.com/comfyanonymous/ComfyUI?tab=readme-ov-file#intel-gpus
- https://github.com/comfyanonymous/ComfyUI/discussions/476
- https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_windows_gpu.md
- https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu