HowTo: Running FLUX.1 [dev] on A770 (Forge, ComfyUI)

System Requirements

  • Windows PC
  • at least 32GB RAM
  • Intel Arc A770

Resources

Installation

  1. Update Arc driver https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html
  2. Install oneAPI https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html
  3. Install Git https://git-scm.com/downloads
  4. Install miniforge https://conda-forge.org/download/
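
After installing, a fresh "Miniforge Prompt" can be used for a quick sanity check that Git and conda are on the PATH (optional, not required for the steps below):

git --version
conda --version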

stable-diffusion-webui-forge

Setup

1. Run "Miniforge Prompt",create env then install torch.

conda create -n forge python==3.11 libuv
conda activate forge
pip install torch==2.1.0.post3 torchvision==0.16.0.post3 torchaudio==2.1.0.post3 intel-extension-for-pytorch==2.1.40+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
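
To confirm the IPEX build of PyTorch can actually see the A770, a quick check like the one below may help (optional; the exact output can vary by driver and IPEX version):

python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.xpu.is_available(), torch.xpu.get_device_name(0))"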

2. Clone Forge into a forge directory (or change the name to whatever you want).

cd <WHERE_TO_DOWNLOAD>
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge forge

3. Until https://github.com/lllyasviel/stable-diffusion-webui-forge/pull/1162 is merged, check the difference and apply it (or simply overwrite backend\nn\flux.py with https://raw.githubusercontent.com/lllyasviel/stable-diffusion-webui-forge/3a8cf833e148f88e37edd17012ffaf3af7480d40/backend/nn/flux.py). This is no longer needed after cc37858.
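
If you take the overwrite route, something like this from your download directory should do it (assumes curl is available, which it is on recent Windows 10/11):

cd forge
curl -L -o backend\nn\flux.py https://raw.githubusercontent.com/lllyasviel/stable-diffusion-webui-forge/3a8cf833e148f88e37edd17012ffaf3af7480d40/backend/nn/flux.py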

4. Place resources:

  1. diffusion model into models/Stable-diffusion
  2. VAE into models/VAE
  3. clip_l and t5xxl into models/text_encoder

5. Modify webui-user.bat:

@echo off
rem IPEX backend, bf16 UNet, low-VRAM mode
set COMMANDLINE_ARGS=--use-ipex --disable-xformers --unet-in-bf16 --always-low-vram
set SKIP_VENV=1
rem activate the conda env and the oneAPI environment, then launch Forge
call %USERPROFILE%\miniforge3\Scripts\activate.bat forge
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
call webui.bat

6. Double-click webui-user.bat in File Explorer and wait for the installation to finish.

  • It may take a long time (for me, it was about 20 minutes): Startup time: 1254.7s (prepare environment: 1231.5s, launcher: 5.1s, import torch: 6.6s, initialize shared: 0.5s, other imports: 3.1s, list SD models: 1.5s, load scripts: 3.1s, create ui: 2.2s, gradio launch: 1.1s).

Test

  1. (Recommended) Go to "Settings", search for "cpu", then change RNG to "CPU".
  2. Set Checkpoint to flux1-dev-Q4_0.gguf, and VAE / Text Encoder to clip_l, ae, and t5xxl_fp16 in the top selectors.
  3. Set the prompt to "hello, world", size to 1024x1024, and seed to 42, then press the "Generate" button.

ComfyUI

Setup

1. Create a conda env, clone ComfyUI, and install the requirements.

conda create -n comfyui python==3.11 libuv
conda activate comfyui
pip install torch==2.1.0.post3 torchvision==0.16.0.post3 torchaudio==2.1.0.post3 intel-extension-for-pytorch==2.1.40+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

cd <WHERE_TO_DOWNLOAD>
git clone https://github.com/comfyanonymous/ComfyUI && cd ComfyUI
pip install -r requirements.txt

cd custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF && cd ..
pip install gguf "numpy<2.0"
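
As a quick optional sanity check, you can confirm gguf is installed and numpy stayed below 2.0 in this env:

python -c "import importlib.metadata as m; print(m.version('gguf'), m.version('numpy'))"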

2. Place resources:

  1. clip_l and t5xxl into models/clip
  2. VAE into models/vae
  3. diffusion models into models/diffusion_models (or models/checkpoints, depending on the model). NOTE: the unet directory is deprecated, so use models/diffusion_models instead.

(Optional) You can also reuse the models from Forge by creating extra_model_paths.yaml. See the Tip section.

3. Create run.bat.

call %USERPROFILE%\miniforge3\Scripts\activate.bat comfyui
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"

python main.py --auto-launch --disable-xformers --bf16-unet --lowvram

Test

  1. Double-click run.bat in File Explorer, then drag & drop a Flux dev workflow image.
  2. Replace the Load Diffusion Model node with a Unet Loader (GGUF) node, select flux1-dev-Q4_0.gguf, then connect it to the Model Sampling Flux node.
  3. Press "Queue" Button.

Tip : Sharing models on Forge and ComfyUI

ComfyUI has a brilliant feature for picking up models from other tools.

You just need to create extra_model_paths.yaml in the root of ComfyUI.
Here's a slightly modified version of the example; I just added clip and diffusion_models.

forge:
    base_path: <YOUR_FORGE_DIRECTORY>

    checkpoints: models/Stable-diffusion
    clip: models/text_encoder
    configs: models/Stable-diffusion
    diffusion_models: models/Stable-diffusion
    vae: models/VAE
    loras: |
         models/Lora
         models/LyCORIS
    upscale_models: |
                  models/ESRGAN
                  models/RealESRGAN
                  models/SwinIR
    embeddings: embeddings
    hypernetworks: models/hypernetworks
    controlnet: models/ControlNet

However, Forge uses one directory for both checkpoints and diffusion_models, while ComfyUI uses separate directories.

You can simply point both checkpoints and diffusion_models at the Stable-diffusion directory, like below.

    checkpoints: models/Stable-diffusion
    diffusion_models: models/Stable-diffusion

But in that case, you will see all models in both the "Load Checkpoint" node and the "Load Diffusion Model" node.

So, I suggest creating separate checkpoints and diffusion_models directories and symlinking them into the Stable-diffusion directory.

cd <YOUR_FORGE_DIRECTORY>\models
mkdir diffusion_models
mkdir checkpoints
cd Stable-diffusion
mklink /d dfs ..\diffusion_models
mklink /d ckpts ..\checkpoints

Then, change the yaml file so checkpoints and diffusion_models point at the new directories.

forge:
    base_path: <YOUR_FORGE_DIRECTORY>

    checkpoints: models/checkpoints
    clip: models/text_encoder
    configs: models/Stable-diffusion
    diffusion_models: models/diffusion_models
    vae: models/VAE
    loras: |
         models/Lora
         models/LyCORIS
    upscale_models: |
                  models/ESRGAN
                  models/RealESRGAN
                  models/SwinIR
    embeddings: embeddings
    hypernetworks: models/hypernetworks
    controlnet: models/ControlNet

Simple comparison vs RTX3060

Generation Speed

AMD 5600G, 32GB DDR4, Windows 11

A770 (PCIe 3.0 x4) / RTX3060 (PCIe 4.0 x4, Power limit 130W)

Prompt: "hello, world", 1024x1024, seed 42, t5xxl_fp16

q4_0 / q4_1

                  Time      Speed
A770, Forge       86.5s     3.30 s/it
A770, ComfyUI     80.63s    3.31 s/it
RTX3060, Forge    107.5s    4.96 s/it
RTX3060, ComfyUI  91.51s    4.23 s/it

The A770 is about 15~20% faster than the RTX3060, which is reasonable performance.

Image Check

The results differ from the RTX3060, probably because of differences in computation, but the results from ComfyUI and Forge are identical.

prompt: hello, world, size: 1024x1024, seed: 42

Limitation

  • The A770 can potentially run fp8/q8_0, but generation becomes roughly 10x slower once it starts using shared GPU memory, and unlike NVIDIA, Intel Arc has no option to disable shared GPU memory.
    • However, I could run q5_1 or q6_k (new!) and their quality seems okay to me. Thanks city96!
  • bitsandbytes still doesn't support Intel Arc, so you can't use nf4 models.
  • I didn't test LoRA, but it may work.
  • Loading the diffusion model and clip (mainly t5xxl) uses more than 20GB, so if you have 32GB of RAM, watch out for running out of memory. If you have 64GB or more, you can try WSL2 and use tcmalloc; it may boost generation performance (a minimal sketch follows this list).
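
A minimal sketch of the WSL2 + tcmalloc idea (untested here; the package name and library path are for Ubuntu and may differ on your distro):

# inside the WSL2 shell, before launching ComfyUI
sudo apt install libgoogle-perftools4
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
python main.py --auto-launch --disable-xformers --bf16-unet --lowvram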

Refs