ComfyUI_Gemini_Flash 002

ComfyUI_Gemini_Flash is a updated version of the custom node for ComfyUI that integrates the powerful Gemini 1.5 Flash 002 model from Google. This node allows users to leverage Gemini's capabilities for various AI tasks, including text generation, image analysis, video processing, and audio transcription.

https://github.com/user-attachments/assets/32ba1803-9c19-4c79-bdfc-15c61d05bf9f

Features

Multimodal Input Support: Process text, images, videos, and audio using the Gemini 1.5 Flash model.
Long Context Window: Utilize Gemini 1.5 Flash's 1-million-token context window for processing large inputs.
API Key Management: Securely save and manage your Gemini API key.
Proxy Support: Configure proxy settings for API requests.
Output Control: Adjust max output tokens and temperature for fine-tuned responses.

Installation

Clone this repository to your ComfyUI custom nodes directory:

git clone https://github.com/YourUsername/ComfyUI_Gemini_Flash.git

Install the required dependencies:

pip install google-generativeai pillow torchaudio

Ensure you have a valid Gemini API key from Google AI Studio.

Configuration

Open the config.json file in the ComfyUI_Gemini_Flash directory.
Replace "your key" with your actual Gemini API key.
Optionally, set a proxy if required.

Proxy Configuration

The Gemini Flash 002 node supports the use of a proxy for API requests. This can be useful if you're behind a corporate firewall or need to route your requests through a specific server. To use a proxy:

In the ComfyUI interface, locate the "Gemini Flash 002" node.
Find the "proxy" input field.
Enter your proxy URL in the following format:
```
http://username:password@proxy_host:proxy_port
```
or if no authentication is required:
```
http://proxy_host:proxy_port
```

For example:

With authentication: http://john:pass123@proxy.example.com:8080
Without authentication: http://proxy.example.com:8080

You can also set a default proxy in the config.json file:

{
  "GEMINI_API_KEY": "your_api_key_here",
  "PROXY": "http://proxy.example.com:8080"
}

The proxy set in the node input will override the one in the config file.

Note: Make sure your proxy is compatible with HTTPS traffic, as the Gemini API uses secure connections.

Usage

In ComfyUI, locate the "Gemini Flash 002" node in the "Gemini Flash 002" category.
Connect the appropriate inputs based on your use case (text, image, video, or audio).
Set the prompt and any additional parameters (max output tokens, temperature).
Run the workflow to generate content using Gemini 1.5 Flash.

Input Types and Parameters

Required Inputs:

prompt: (STRING) The main instruction or question for the Gemini model.
input_type: (["text", "image", "video", "audio"]) Specifies the type of input being processed.
api_key: (STRING) Your Gemini API key.
proxy: (STRING) Proxy settings (if needed).

Optional Inputs:

text_input: (STRING) Additional text input for text-based tasks.
image: (IMAGE) Image input for image analysis tasks.
video: (IMAGE) Video input (processed as a sequence of frames).
audio: (AUDIO) Audio input for transcription or analysis.
max_output_tokens: (INT, default: 1000) Controls the length of the generated output.
temperature: (FLOAT, default: 0.4, range: 0.0 to 1.0) Adjusts the randomness of the output.

Versions and Capabilities

New Version (Gemini Flash 002)

Supports text, image, video, and audio inputs.
Improved video handling with frame sampling and resizing to manage payload size.
Better prompts for video analysis, considering movement and changes across frames.
Robust error handling and payload size management.

Old Version (Gemini Flash)

Supported text and image inputs.
Basic video support (treated similarly to images).
Limited audio support.

Input Processing Details

Text

Directly processed by the Gemini model.

Image

Resized to a maximum of 1024x1024 pixels to manage payload size.
Analyzed for content, objects, and visual elements.

Video

Processed as a sequence of frames.
Samples up to 10 frames evenly distributed throughout the video.
Each frame is resized to 512x512 pixels.
The model analyzes changes and movements across frames.

Audio

Converted to mono and resampled to 16kHz if necessary.
Sent to the Gemini model for transcription or analysis.

Output

The node returns a STRING containing the generated content or analysis from the Gemini model.

About Gemini 1.5 Flash

Gemini 1.5 Flash is designed to be the fastest and most cost-efficient model for high-volume tasks. It features a 1-million-token context window, enabling processing of large amounts of text, long videos, or extended audio clips in a single request.

Troubleshooting

If you encounter any issues:

Ensure your API key is correctly set in the config.json file.
Check that your input size is not exceeding the model's limits (especially for video and audio).
Verify that all required dependencies are installed.

Contributing

Contributions to improve ComfyUI_Gemini_Flash are welcome! Please feel free to submit issues or pull requests.

Acknowledgements

This project was made possible thanks to the contributions of the ComfyUI community and the developers of the Gemini model at Google. Special thanks to all the contributors and users for their continuous support and feedback.

License

[Specify your license here]

Nodes Browser

ComfyDeploy: How ComfyUI_Gemini_Flash works in ComfyUI?

What is ComfyUI_Gemini_Flash?

How to install it in ComfyDeploy?