PJ2 in Microsoft. Zero-shot Object detection with captioning in overlayed display

MinwooKim1990/Scouter_PJ

Scouter Project v 1.0

AI-powered object detection system with real-time enhancement and accessibility features

Contributor:

Overview

An AI system inspired by Cyberpunk 2077 that combines several advanced features:

  • Zero-shot object detection with detailed captioning
  • Real-time object enhancement and upscaling
  • Speech-to-text (STT) functionality for accessibility
  • Overlay display system similar to Cyberpunk 2077 Scanner elements
  • Extra information with Bing Search API (Default: Deactivated)
  • STT-to-LLM responses shown on the display

Demo Screenshots

Demo Screenshot 1 Demo Screenshot 2

GUI Demo Screenshots

Demo Screenshot 3 Demo Screenshot 4

STT to LLM Demo Screenshots

Demo Screenshot 5 Demo Screenshot 6

Installation and Setup

  1. Clone the repository
git clone https://github.com/MinwooKim1990/Scouter_PJ.git
cd Scouter_PJ
  2. Install dependencies
pip install -r requirements.txt
  3. Prepare your data
mkdir data

Place your video files in the data folder

  4. Run the application
python system.py --video path/to/your/video.mp4 --bing [YOUR-API-KEY] --llm-provider [google, openai, groq] --llm-api-key [YOUR-API-KEY] --llm-model [MODEL-NAME]

For --video, pass a file path or an integer camera index (0 selects the webcam; default: 0). For --llm-model, pass one of the models offered by the chosen provider.

  5. Check the available arguments
python system.py --help
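
The command-line flags above could be parsed along the following lines. This is a minimal sketch, not the code in system.py: the `video_source` and `build_parser` helpers are hypothetical, and the defaults other than `--video` are assumptions made for illustration.

```python
import argparse


def video_source(value: str):
    """Return an int camera index for digit strings, otherwise the path unchanged."""
    return int(value) if value.isdigit() else value


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that mirrors the documented flags.
    parser = argparse.ArgumentParser(prog="system.py")
    parser.add_argument("--video", type=video_source, default=0,
                        help="video file path, or an integer camera index (0 = webcam)")
    parser.add_argument("--bing", default=None, help="Bing Search API key")
    parser.add_argument("--llm-provider", choices=["google", "openai", "groq"],
                        default=None, help="LLM provider")
    parser.add_argument("--llm-api-key", default=None, help="API key for the LLM provider")
    parser.add_argument("--llm-model", default=None, help="model name for the provider")
    return parser


# Example: a video file plus an LLM provider (the file name is a placeholder).
args = build_parser().parse_args(["--video", "data/clip.mp4", "--llm-provider", "groq"])
print(args.video, args.llm_provider)
```

Converting digit-only values to `int` keeps a single `--video` flag usable for both files and cameras, which matches the "integer 0 for webcam" convention described above.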

Use GUI

  1. Clone the repository
git clone https://github.com/MinwooKim1990/Scouter_PJ.git
cd Scouter_PJ
  2. Install dependencies
pip install -r requirements.txt
  3. Prepare your data
mkdir data

Place your video files in the data folder

  4. Run the application
python tkinterapp.py

How to Use

Basic Controls

Object Detection & Tracking

  • Left Click:
    • Click anywhere on the video to activate zero-shot object detection
    • Click on an object to create a bounding box and start tracking
  • Right Click:
    • Releases the current bounding box and stops tracking

Image Enhancement

  • F Key:
    • Toggles real-time upscaling of the tracked object
    • Default: Disabled

Voice Subtitles

  • T Key:
    • First Press: Starts voice recording for subtitle generation
    • Second Press: Stops recording and processes the subtitle

Image Search

  • S Key:
    • Toggles image search for the detected object
    • Default: Disabled

STT to LLM

  • A Key:
    • First Press: Starts prompt recording for LLM
    • Second Press: Stops prompt recording and processes the LLM output
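
| Provider | Model |
|---|---|
| Google | gemini-2.0-flash-exp |
| Google | gemini-1.5-flash |
| Google | gemini-1.5-flash-8b |
| Google | gemini-1.5-pro |
| OpenAI | gpt-4o-2024-08-06 |
| OpenAI | gpt-4o-mini-2024-07-18 |
| OpenAI | o1-2024-12-17 |
| OpenAI | gpt-3.5-turbo-0125 |
| GROQ | llama-3.3-70b-versatile |
| GROQ | llama-3.2-90b-text-preview |
| GROQ | llama-3.2-11b-text-preview |
| GROQ | gemma2-9b-it |
| GROQ | mixtral-8x7b-32768 |
| Anthropic | claude-3-5-sonnet-20241022 |
| Anthropic | claude-3-opus-20240229 |
| Anthropic | claude-3-5-haiku-20241022 |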
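
One way to guard against typos in a provider or model name is a lookup built from the provider and model list above. This is only a sketch: the model names are copied from that list, while the `SUPPORTED_MODELS` dictionary and the `validate_model` helper are hypothetical and not taken from the project code.

```python
# Model names copied from the provider/model list in this README.
SUPPORTED_MODELS = {
    "google": ["gemini-2.0-flash-exp", "gemini-1.5-flash",
               "gemini-1.5-flash-8b", "gemini-1.5-pro"],
    "openai": ["gpt-4o-2024-08-06", "gpt-4o-mini-2024-07-18",
               "o1-2024-12-17", "gpt-3.5-turbo-0125"],
    "groq": ["llama-3.3-70b-versatile", "llama-3.2-90b-text-preview",
             "llama-3.2-11b-text-preview", "gemma2-9b-it", "mixtral-8x7b-32768"],
    "anthropic": ["claude-3-5-sonnet-20241022", "claude-3-opus-20240229",
                  "claude-3-5-haiku-20241022"],
}


def validate_model(provider: str, model: str) -> str:
    """Return the model name if the provider lists it, otherwise raise ValueError."""
    key = provider.lower()
    if key not in SUPPORTED_MODELS:
        raise ValueError(f"Unknown provider: {provider!r}")
    if model not in SUPPORTED_MODELS[key]:
        options = ", ".join(SUPPORTED_MODELS[key])
        raise ValueError(f"{model!r} is not listed for {provider}; choose one of: {options}")
    return model


print(validate_model("GROQ", "gemma2-9b-it"))
```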

Stop Video

  • Space Key:
    • First Press: Pauses the video; object detection still works while paused
    • Second Press: Resumes playback

Show Instructions

  • Tab Key:
    • First Press: Shows the full instructions (default: hidden)
    • Second Press: Hides the instructions

Quick Reference

| Action | Key/Button | Function |
|---|---|---|
| Select Object | Left Click | Activates detection & tracking |
| Release Tracking | Right Click | Stops tracking the current object |
| Toggle Upscaling | F | Enables/disables real-time upscaling |
| Voice Recording | T | Starts/stops subtitle recording |
| Image Search | S | Enables/disables Bing image search |
| STT to LLM | A | Starts/stops prompt recording for the LLM |
| Play/Stop | Space | Plays/stops the video |
| Show Instructions | Tab | Shows/hides the instructions |
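
The keyboard toggles in the Quick Reference above could be modelled as a small state object. This is a minimal sketch rather than the project's implementation: `ControlState`, `KEY_TO_FIELD`, and `handle_key` are hypothetical names, and it assumes key presses arrive as lowercase integer character codes (the form a call such as OpenCV's `cv2.waitKey` would deliver).

```python
from dataclasses import dataclass


@dataclass
class ControlState:
    """One boolean per keyboard toggle; everything starts switched off."""
    upscaling: bool = False            # F key
    recording_subtitle: bool = False   # T key
    image_search: bool = False         # S key
    recording_prompt: bool = False     # A key
    paused: bool = False               # Space key
    show_instructions: bool = False    # Tab key


# Map integer key codes to the state attribute they flip.
KEY_TO_FIELD = {
    ord("f"): "upscaling",
    ord("t"): "recording_subtitle",
    ord("s"): "image_search",
    ord("a"): "recording_prompt",
    ord(" "): "paused",
    ord("\t"): "show_instructions",
}


def handle_key(state: ControlState, key_code: int) -> bool:
    """Flip the matching toggle; return False if the key is not bound."""
    name = KEY_TO_FIELD.get(key_code)
    if name is None:
        return False
    setattr(state, name, not getattr(state, name))
    return True


state = ControlState()
handle_key(state, ord("f"))   # first press enables upscaling
handle_key(state, ord(" "))   # first press pauses the video
handle_key(state, ord(" "))   # second press resumes playback
print(state.upscaling, state.paused)
```

Keeping every toggle in one place makes the "first press / second press" behaviour uniform across keys.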

Features

  • Zero-shot object detection
  • Real-time object tracking
  • AI-powered image upscaling
  • Speech-to-text subtitle generation
  • Speech-to-LLM output generation
  • Cyberpunk-style overlay display

Technical Specifications

Hardware Requirements

  • GPU: NVIDIA GPU with CUDA support
    • Minimum VRAM: 8GB (24GB for real-time upscaling)
    • Tested on: NVIDIA RTX 4090
  • Memory: 2GB
  • Storage: 2.5GB

Performance Notes

  • Tested on 720p videos
  • Processing speed without real-time upscaling: 30 to 40 FPS
  • Processing speed under the heaviest workload: 10 to 20 FPS
  • Current version is optimized for VRAM efficiency
  • Peak VRAM usage: ~7GB during operation (~21GB with real-time upscaling)
  • Further optimization may be available in future updates
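
To compare your own hardware against the FPS figures above, a rolling frame-rate meter is usually enough. The sketch below is an illustration only: the `FpsMeter` class is hypothetical and is not part of the project, and the way the reported figures were measured is not documented here.

```python
import time
from collections import deque


class FpsMeter:
    """Rolling average of frames per second over the last `window` frames."""

    def __init__(self, window: int = 30):
        self.timestamps = deque(maxlen=window)

    def tick(self, now=None) -> float:
        """Record one processed frame and return the current FPS estimate."""
        self.timestamps.append(time.perf_counter() if now is None else now)
        if len(self.timestamps) < 2:
            return 0.0  # not enough samples yet
        elapsed = self.timestamps[-1] - self.timestamps[0]
        return (len(self.timestamps) - 1) / elapsed if elapsed > 0 else 0.0


meter = FpsMeter(window=30)
for _ in range(5):
    time.sleep(0.01)          # stand-in for processing one frame
    fps = meter.tick()
print(f"{fps:.1f} FPS")
```

Call `tick()` once per processed frame inside the main loop; the optional `now` argument exists only so the meter can be driven with fixed timestamps.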

Models & Attributions

This project uses several open-source models and external APIs:

  • MobileSAM: Zero-shot object detection - Apache-2.0 License
  • Fast-SRGAN: Real-time image upscaling - MIT License
  • OpenAI Whisper: Speech-to-text processing - MIT License
  • Florence 2: Image captioning - MIT License
  • Bing Search API: Image Search
  • Google Gemini API: LLM Response
  • OpenAI API: LLM Response
  • GROQ API: LLM Response

License

This project is licensed under the MIT License since all major components use either MIT or Apache-2.0 licenses. The MIT License is compatible with both and maintains the open-source nature of the utilized models.

This project also uses the external APIs listed under Models & Attributions (Bing Search, Google Gemini, OpenAI, and GROQ). Please ensure compliance with their respective terms of use.

Note

To ensure the security of API keys, store them securely using environment variables or secret management solutions. Do not expose sensitive information in public repositories.
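
As one way to follow this advice, keys can be read from environment variables at startup instead of being written into scripts. This is a sketch under stated assumptions: the variable names `BING_API_KEY` and `LLM_API_KEY` and the `load_api_key` helper are hypothetical, and the project itself does not define them.

```python
import os

# Hypothetical variable names; choose whatever naming suits your environment.
ENV_VARS = {"bing": "BING_API_KEY", "llm": "LLM_API_KEY"}


def load_api_key(name: str, required: bool = True, env=os.environ):
    """Read an API key from the environment without hard-coding it."""
    var = ENV_VARS[name]
    value = env.get(var, "").strip()
    if not value:
        if required:
            raise RuntimeError(f"Set the {var} environment variable before running")
        return None  # optional key that was not provided
    return value


# The Bing key is treated as optional in this example.
bing_key = load_api_key("bing", required=False)
print("Bing search enabled" if bing_key else "Bing search disabled")
```

A key stored this way can then be passed to the documented flag from the shell, for example `python system.py --llm-api-key "$LLM_API_KEY"`.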

See the LICENSE file for details.
