Nodes Browser
ComfyDeploy: How VLM_nodes works in ComfyUI?
What is VLM_nodes?
Custom Nodes for Vision Language Models (VLM) , Large Language Models (LLM), Image Captioning, Automatic Prompt Generation, Creative and Consistent Prompt Suggestion, Keyword Extraction
How to install it in ComfyDeploy?
Head over to the machine page
- Click on the "Create a new machine" button
- Select the
Edit
build steps - Add a new step -> Custom Node
- Search for
VLM_nodes
and select it - Close the build step dialig and then click on the "Save" button to rebuild the machine
Usage
- For Windows and Linux
cd custom_nodes
git clone https://github.com/gokayfem/ComfyUI_VLM_nodes.git
Acknowledgements
If you get errors related to llama-cpp-python or if it is not using GPU.
I recommend installing it with the right arguments provided in this link llama-cpp-python
VLM Nodes
Utilizes llama-cpp-python
for integration of LLaVa models. You can load and use any VLM with LLaVa models in GGUF format with this nodes.
You need to download the model similar to ggml-model-q4_k.gguf
and it's clip projector similar to mmproj-model-f16.gguf
from this repositories (in the files and versions).
python=>3.9
is necessary.
Put all of the files inside models/LLavacheckpoints
Note that every model's clip projector is different!
Structured Output
Getting structured outputs can be quite challenging through prompt engineering alone.
I've added the Structured Output node to VLM Nodes.
Now, you can obtain your answers reliably.
You can extract entities, numbers, classify prompts with given classes, and generate one specific prompt. These are just a few examples.
You can add additional descriptions to fields and choose the attributes you want it to return.
Image to Music
Utilizes VLMs, LLMs and AudioLDM-2 to make music from images.
Use SaveAudioNode to save the music inside output
folder.
It will automatically download the necessary files into models/LLavacheckpoints/files_for_audioldm2
https://github.com/gokayfem/ComfyUI_VLM_nodes/assets/88277926/2c5bdcde-d637-49ad-b317-14ac0a12f7df
LLM to Music
Utilizes Chat Musician, an open-source LLM that integrates intrinsic musical abilities.
ChatMusician Demo Page
You can try prompts from this demo page.
Download the GGUF file
ChatMusician GGUF Files
ChatMusician.Q5_K_M.gguf or ChatMusician.Q5_K_S.gguf recommended
BIG BIG BIG Warning: It does NOT work perfectly, if you got errors accept the error queue prompt again with the same settings!!
https://github.com/gokayfem/ComfyUI_VLM_nodes/assets/88277926/7f22d4f2-b998-402e-88c8-c382a730d624
InternLM-XComposer2-VL Node
Utilizes AutoGPTQ
for integration of InternLM-XComposer2-VL Model. It will automatically download the necessary files into models/LLavacheckpoints/files_for_internlm
.
This is one of the best models for visual perception.
Important Note : This model is heavy.
Automatic Prompt Generation and Suggestion Nodes
Get Keyword node: It can take LLava outputs and extract keywords from them.
LLava PromptGenerator node: It can create prompts given descriptions or keywords using (input prompt could be Get Keyword or LLava output directly).
Suggester node: It can generate 5 different prompts based on the original prompt using consistent in the options or random prompts using random in the options.
- Works best with LLava 1.5 and 1.6.
Play with the temperature
for creative or consistent results. Higher the temperature more creative are the results.
If you want to dive deep into LLM Settings
Outputs are JSON looking texts, you can see them as a text using JsonToText Node.
You can see any string output with ViewText Node
You can set any string input using SimpleText Node
Utilizes llama-cpp-agents
for getting structured outputs.
LLM Prompt Generation from text nodes
LLM PromptGenerator node:
Qwen 1.8B Stable Diffusion Prompt
IF prompt MKR
This LLM's works best for now for prompt generation.
LLMSampler node: You can chat with any LLM in gguf format, you can use LLava models as an LLM also.
API PromptGenerator node: You can use ChatGPT and DeepSeek API's to create prompts. https://platform.deepseek.com/ gives 10m free tokens.
- ChatGPT-4
- ChatGPT-3.5
- DeepSeek You can use them for simple chat also there is an option in the node.
UForm-Gen2 Qwen Node
UForm-Gen2 is an extremely fast small generative vision-language model primarily designed for Image Captioning and Visual Question Answering.
UForm-Gen2 Qwen
It will automatically download the necessary files into models/LLavacheckpoints/files_for_uform_gen2_qwen
Kosmos-2 Node
Kosmos-2: Grounding Multimodal Large Language Models to the World.
Kosmos-2
It will automatically download the necessary files into models/LLavacheckpoints/files_for_kosmos2
moondream1 and moondream2 Node
This node is designed to work with the Moondream model, a powerful small vision language model built by @vikhyatk using SigLIP, Phi-1.5, and the LLaVa training dataset. The model boasts 1.6 billion parameters and is made available for research purposes only; commercial use is not allowed.
moondream2 is a small vision language model designed to run efficiently on edge devices.
It will automatically download the necessary files into models/LLavacheckpoints/files_for__moondream
and models/LLavacheckpoints/files_for_moondream2
JoyTag Node
@fpgamine's JoyTag is a state of the art AI vision model for tagging images, with a focus on sex positivity and inclusivity.
It uses the Danbooru tagging schema, but works across a wide range of images, from hand drawn to photographic.
It will automatically download the necessary files into models/LLavacheckpoints/files_for_joytagger
Qwen2-VL Node
Utilizes the latest Qwen2-VL series of models, which are state-of-the-art vision language models supporting various resolutions, ratios, and languages. The models excel at:
- Understanding images of various resolutions & ratios
- Complex visual reasoning and decision making
- Multilingual support (English, Chinese, European languages, Japanese, Korean, Arabic, Vietnamese, etc.)
Available models include 2B, 7B, and 72B parameter versions, with standard, AWQ, and GPTQ quantized variants. It will automatically download the necessary files into models/LLavacheckpoints/files_for_qwen2vl
.
Important Note: Larger models (7B, 72B) require significant VRAM. Choose quantized versions (AWQ, GPTQ) for reduced memory usage.