feat: countgd sam2 video support #318

Merged: 15 commits, Dec 13, 2024

Member

@hrnn hrnn commented Dec 5, 2024

Added countgd_sam2_video_tracking tool, to use countgd for object detection and pass the results to sam2 to track the objects in the entire video.
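The detect-then-track flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ODModel`, `fake_detector`, `fake_tracker`, and `detect_then_track` are hypothetical stand-ins, not the functions added in this PR, and the sketch seeds the tracker from the first frame only, which may differ from what the real tool does.

```python
from enum import Enum
from typing import Any, Dict, List

Detection = Dict[str, Any]   # illustrative shape: 'label', 'bbox', 'score'
Frame = List[List[int]]      # stand-in for an image array


class ODModel(str, Enum):
    """Hypothetical enum naming the object detector to use."""
    COUNTGD = "countgd"
    OWLV2 = "owlv2"


def fake_detector(prompt: str, frame: Frame) -> List[Detection]:
    """Stub detector: returns one fixed box per comma-separated category."""
    labels = [part.strip() for part in prompt.split(",") if part.strip()]
    return [
        {"label": label, "bbox": [0.1, 0.1, 0.5, 0.5], "score": 0.9}
        for label in labels
    ]


def fake_tracker(frames: List[Frame], seeds: List[Detection]) -> List[List[Detection]]:
    """Stub tracker: copies the seed detections onto every frame."""
    return [[dict(seed) for seed in seeds] for _ in frames]


# Both entries point at the same stub; a real system would map to real models.
DETECTORS = {ODModel.COUNTGD: fake_detector, ODModel.OWLV2: fake_detector}


def detect_then_track(
    model: ODModel, prompt: str, frames: List[Frame]
) -> List[List[Detection]]:
    """Detect objects on the first frame, then hand them to the tracker."""
    if not frames:
        return []
    seeds = DETECTORS[model](prompt, frames[0])
    return fake_tracker(frames, seeds)


frames = [[[0]], [[0]], [[0]]]  # three dummy 1x1 frames
tracked = detect_then_track(ODModel.COUNTGD, "car, dinosaur", frames)
```

Keeping the detector behind a lookup keyed by the enum means a new detector only needs a new entry, while the tracking step stays unchanged.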

Screenshot 2024-12-05 161441
Link to Colab

@hrnn hrnn self-assigned this Dec 5, 2024
tests/integ/test_tools.py (outdated, resolved)
@hrnn hrnn marked this pull request as ready for review December 6, 2024 16:14
Member

@dillonalaird dillonalaird left a comment

looks good! just some minor comments

OWLV2 = "owlv2"


def od_sam2_video_tracking(
Member

nice

vision_agent/tools/tools.py (resolved)
"""'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text
prompt such as category names or referring expressions. The categories in the text
prompt are separated by commas. It returns a list of bounding boxes, label names,
mask file names and associated probability scores of 1.0.
Member

Minor comment: only florence2 returns probability scores of 1.0; countgd and owlv2 can return regular probability scores. So you can just say "and associated probability scores."

Member Author

Fixed

"""'owlv2_sam2_video_tracking' is a tool that can segment multiple objects given a text
prompt such as category names or referring expressions. The categories in the text
prompt are separated by commas. It returns a list of bounding boxes, label names,
mask file names and associated probability scores of 1.0.
Member

See the comment above on probability scores.

Member Author

Fixed

Comment on lines 2732 to 2753
List[Dict[str, Any]]: A list of dictionaries containing the score, label,
bounding box, and mask of the detected objects with normalized coordinates
(xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
and xmax and ymax are the coordinates of the bottom-right of the bounding box.
The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
the background.

Example
-------
>>> countgd_sam2_video_tracking("car, dinosaur", image)
[
    {
        'score': 1.0,
        'label': 'dinosaur',
        'bbox': [0.1, 0.11, 0.35, 0.4],
        'mask': array([[0, 0, 0, ..., 0, 0, 0],
            [0, 0, 0, ..., 0, 0, 0],
            ...,
            [0, 0, 0, ..., 0, 0, 0],
            [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
    },
]
Member

Use the return values and examples from florence2_sam2_video_tracking. It's actually a list of lists of dictionaries, where each inner list corresponds to one frame.
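To make the nested shape concrete, here is a minimal sketch with made-up data. The keys (`score`, `label`, `bbox`, `mask`) follow the docstring under review; the helper names are hypothetical, and plain nested lists stand in for the numpy masks so the snippet runs without extra dependencies.

```python
from typing import Any, Dict, List

Detection = Dict[str, Any]


def make_mock_result(num_frames: int) -> List[List[Detection]]:
    """Build made-up output: outer list = frames, inner list = detections."""
    result: List[List[Detection]] = []
    for frame_idx in range(num_frames):
        if frame_idx == 1:
            # A frame with no tracked objects is an empty inner list.
            result.append([])
            continue
        # Nested lists stand in for a binary 2D uint8 numpy mask.
        mask = [
            [0, 0, 0, 0],
            [0, 1, 1, 0],
            [0, 1, 1, 0],
            [0, 0, 0, 0],
        ]
        result.append(
            [
                {
                    "score": 0.87,  # arbitrary illustrative value
                    "label": "dinosaur",
                    "bbox": [0.1, 0.11, 0.35, 0.4],
                    "mask": mask,
                }
            ]
        )
    return result


def detections_per_frame(result: List[List[Detection]]) -> List[int]:
    """Count detections in each frame."""
    return [len(frame) for frame in result]


def mask_area(detection: Detection) -> int:
    """Number of foreground pixels in a detection's mask."""
    return sum(sum(row) for row in detection["mask"])


result = make_mock_result(3)
counts = detections_per_frame(result)
print(counts)
```

Indexing is therefore `result[frame_index][detection_index]`, so a caller that expects a flat list of dictionaries would need an extra loop over frames.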

Member Author

Fixed

Comment on lines 2776 to 2798
Returns:
List[Dict[str, Any]]: A list of dictionaries containing the score, label,
bounding box, and mask of the detected objects with normalized coordinates
(xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
and xmax and ymax are the coordinates of the bottom-right of the bounding box.
The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
the background.

Example
-------
>>> countgd_sam2_video_tracking("car, dinosaur", image)
[
    {
        'score': 1.0,
        'label': 'dinosaur',
        'bbox': [0.1, 0.11, 0.35, 0.4],
        'mask': array([[0, 0, 0, ..., 0, 0, 0],
            [0, 0, 0, ..., 0, 0, 0],
            ...,
            [0, 0, 0, ..., 0, 0, 0],
            [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
    },
]
Member

See the comment above on the return values.

Member Author

Fixed

Member

@dillonalaird dillonalaird left a comment

LGTM

@hrnn hrnn merged commit 421dd1e into landing-ai:main Dec 13, 2024
8 checks passed
@hrnn hrnn deleted the feat/countgd_sam2_video branch December 13, 2024 14:50