Video models (take 2) #890

dreadatour · 2025-02-03T18:36:22Z

An alternative approach to implementing video models, based on this comment. Looks much cleaner IMO.

New `VideoFile` model

class VideoFile(File):
    """`DataModel` for reading video files."""

    def get_info(self) -> "Video":
        """Returns video file information."""

    def get_frame(self, frame: int) -> "VideoFrame":
        """
        Returns VideoFrame model for a video frame.

        Args:
            frame (int): Frame number to read.

        Returns:
            VideoFrame: Video frame model.
        """

    def get_frames(
        self,
        start: int = 0,
        end: Optional[int] = None,
        step: int = 1,
    ) -> "Iterator[VideoFrame]":
        """
        Returns VideoFrame models for a range of video frames.

        Args:
            start (int): Frame number to start reading from (default: 0).
            end (Optional[int]): Frame number to stop reading at, non-inclusive
                                 (default: None, read until the end).
            step (int): Step size for reading frames (default: 1).

        Returns:
            Iterator[VideoFrame]: Iterator over video frame models.

        Note:
            If end is not specified, the number of frames is taken from the video file.
        """

    def get_fragment(self, start: float, end: float) -> "VideoFragment":
        """
        Returns VideoFragment model for a video interval.

        Args:
            start (float): Start time in seconds.
            end (float): End time in seconds.

        Returns:
            VideoFragment: Video fragment model.
        """

    def get_fragments(
        self,
        intervals: list[tuple[float, float]],
    ) -> "Iterator[VideoFragment]":
        """
        Returns VideoFragment models for video intervals.

        Args:
            intervals (list[tuple[float, float]]): List of start and end times
                                                   in seconds.

        Returns:
            Iterator[VideoFragment]: Iterator over video fragment models.
        """
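
The `start`/`end`/`step` semantics documented for `get_frames` can be sketched without any video library. This is a minimal illustration, not the PR's implementation: `frame_indices` and its `total_frames` parameter are hypothetical names, and the sketch assumes the frame count has already been read from the video file.

```python
from typing import Iterator, Optional


def frame_indices(
    total_frames: int,
    start: int = 0,
    end: Optional[int] = None,
    step: int = 1,
) -> Iterator[int]:
    """Yield the frame numbers selected by start/end/step (end is exclusive)."""
    if start < 0:
        raise ValueError("start must be a non-negative frame number")
    if step < 1:
        raise ValueError("step must be a positive integer")
    # Assumption: when end is omitted, read until the last frame of the video.
    if end is None:
        end = total_frames
    yield from range(start, end, step)


# Every third frame between frame 2 (inclusive) and frame 8 (exclusive).
selected = list(frame_indices(total_frames=10, start=2, end=8, step=3))
```

Because the helper is a generator, no frame numbers are materialized until the caller iterates, which matches the `Iterator[VideoFrame]` return type in the proposed signature.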

New `VideoFrame` model

One can create a VideoFrame without downloading the video file, since it is a "virtual" frame: the original VideoFile plus a frame number.

If a physical frame image is needed, call the save method, which uploads the frame image to storage and returns a new ImageFile model.

API:

class VideoFrame(VideoFile):
    """`DataModel` for reading video frames."""

    frame: int = Field(default=-1)

    def get_np(self) -> "ndarray":
        """
        Reads video frame from a video file and returns as numpy array.

        Returns:
            ndarray: Video frame.
        """

    def read_bytes(self, format: str = "jpg") -> bytes:
        """
        Reads video frame from a video file and returns as image bytes.

        Args:
            format (str): Image format (default: 'jpg').

        Returns:
            bytes: Video frame image as bytes.
        """

    def save(self, output: str, format: str = "jpg") -> "ImageFile":
        """
        Saves video frame as a new image file. If output is a remote path,
        the image file will be uploaded to the remote storage.

        Args:
            output (str): Output path, can be a local path or a remote path.
            format (str): Image format (default: 'jpg').

        Returns:
            ImageFile: Image file model.
        """
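
The "virtual" frame idea can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the PR's code: `VirtualFrame`, its `read` method, and `fake_decoder` are hypothetical, and the decoder is a stub that stands in for whatever actually reads the video.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class VirtualFrame:
    """Hypothetical stand-in for VideoFrame: a reference, not pixel data."""

    source: str  # path or URI of the original video file
    frame: int  # frame number within that video

    def read(self, decoder: Callable[[str, int], bytes]) -> bytes:
        # Decoding is deferred until the caller asks for the frame data.
        return decoder(self.source, self.frame)


calls = []


def fake_decoder(source: str, frame: int) -> bytes:
    """Stub decoder that records each call instead of opening a video."""
    calls.append((source, frame))
    return f"{source}#{frame}".encode()


# Creating the frames touches no video data at all.
frames = [VirtualFrame("s3://bucket/video.mp4", i) for i in range(0, 300, 100)]
calls_before_read = list(calls)

# Only the frame that is actually read triggers the decoder.
data = frames[1].read(fake_decoder)
```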

New `VideoFragment` model

One can create a VideoFragment without downloading the video file, since it is a "virtual" fragment: the original video file plus start and end timestamps.

If a physical fragment video is needed, call the save method, which uploads the fragment video to storage and returns a new VideoFile model.

API:

class VideoFragment(VideoFile):
    """`DataModel` for reading video fragments."""

    start: float = Field(default=-1.0)
    end: float = Field(default=-1.0)

    def save(self, output: str, format: Optional[str] = None) -> "VideoFile":
        """
        Saves video interval as a new video file. If output is a remote path,
        the video file will be uploaded to the remote storage.

        Args:
            output (str): Output path, can be a local path or a remote path.
            format (Optional[str]): Output format (default: None). If not provided,
                                    the format will be inferred from the video fragment
                                    file extension.

        Returns:
            VideoFile: Video file model for the saved fragment.
        """
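
The format fallback described in the `save` docstring could look roughly like the following. This is a sketch only: `resolve_format` is a hypothetical helper, and it assumes that "the video fragment file extension" means the extension of the fragment's source path.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_format(source_path: str, format: Optional[str] = None) -> str:
    """Return the explicit format if given, else infer it from the extension."""
    if format:
        return format.lower().lstrip(".")
    # PurePosixPath only parses the string; it never touches the filesystem,
    # so it also works for remote-style paths such as "s3://bucket/a.mp4".
    suffix = PurePosixPath(source_path).suffix
    if not suffix:
        raise ValueError(f"cannot infer format from {source_path!r}")
    return suffix[1:].lower()


inferred = resolve_format("s3://bucket/clips/video.MP4")
explicit = resolve_format("video.mkv", format="mp4")
```

Raising on a missing extension, rather than guessing a default, is a design choice of this sketch; the PR does not say what happens in that case.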

New `Video` model

Video file meta information.

class Video(DataModel):
    """`DataModel` for video file meta information."""

    width: int = Field(default=-1)
    height: int = Field(default=-1)
    fps: float = Field(default=-1.0)
    duration: float = Field(default=-1.0)
    frames: int = Field(default=-1)
    format: str = Field(default="")
    codec: str = Field(default="")
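
Frames are addressed by number while fragments are addressed in seconds, so the `fps` field is what links the two. The sketch below assumes a constant frame rate and treats a non-positive `fps` (such as the `-1.0` default) as unknown; both helper names are hypothetical and not part of the PR.

```python
import math


def frame_to_seconds(frame: int, fps: float) -> float:
    """Timestamp of a frame, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps is unknown or invalid")
    return frame / fps


def interval_to_frames(start: float, end: float, fps: float) -> range:
    """Frame numbers whose timestamps fall in [start, end) seconds."""
    if fps <= 0:
        raise ValueError("fps is unknown or invalid")
    if end < start:
        raise ValueError("end must not be earlier than start")
    # A frame n has timestamp n / fps, so n >= start * fps and n < end * fps.
    return range(math.ceil(start * fps), math.ceil(end * fps))


# The second between 1.0 s and 2.0 s of a 25 fps video covers frames 25..49.
second_of_video = interval_to_frames(1.0, 2.0, fps=25.0)
```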


codecov · 2025-02-03T18:43:43Z

Codecov Report

Attention: Patch coverage is 86.02941% with 19 lines in your changes missing coverage. Please review.

Project coverage is 87.75%. Comparing base (7f757b3) to head (0f2e12c).
Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
src/datachain/lib/video.py	75.00%	10 Missing and 7 partials ⚠️
src/datachain/lib/file.py	97.05%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #890      +/-   ##
==========================================
+ Coverage   87.74%   87.75%   +0.01%     
==========================================
  Files         129      130       +1     
  Lines       11462    11595     +133     
  Branches     1545     1563      +18     
==========================================
+ Hits        10057    10175     +118     
- Misses       1017     1025       +8     
- Partials      388      395       +7

Flag	Coverage Δ
datachain	`87.67% <86.02%> (+0.01%)`	⬆️


mattseddon · 2025-02-04T05:17:24Z

src/datachain/lib/file.py

+
+        return video_info(self)
+
+    def get_frame(self, frame: int) -> "VideoFrame":


Minor, but should these be to_ methods to match the DataChain class?

dreadatour · 2025-02-04

> Minor, but should these be to_ methods to match the DataChain class?

Looks reasonable 🤔 However, this is not a direct conversion ("to") but rather extracting part of a file into another file: "get frame from video" reads well to me, while "video to frame" looks odd. What do you think? I don't have a strong opinion on this 🤔

dreadatour and others added 30 commits January 13, 2025 23:48

Add video models + functions

75877d1

Code review update

031b9df

[pre-commit.ci] auto fixes from pre-commit.com hooks

548bbd5

for more information, see https://pre-commit.ci

Code review update

b55149a

Code review update

2cd6d62

Small fixes due to work on usage examples

5892ab9

Examples fixes

f3dc66a

docs(merge): add examples with Func object (#811)

65529f3

fix(tqdm): import tqdm to support jupyter (#812)

b044082

progress: remove unused logging/tqdm lock (#817)

89ee2f0

file: raise error (#820)

67beb9f

README - mistral fix (#821)

60c5848

file: support exporting files as a symlink (#819)

d3b1619

ReferenceFileSystem: use fs.open instead of fs._open (#823)

bcd95b1

Fix list of tuples. Closes #827 (#828)

dbefa5f

Added full outer join (#822)

258454e

* added main logic for outer join
* fixing filters
* removing datasetquery tests and added more datachain unit tests

Added isnone() function (#801)

a1a47b2

Added `isnone()` function

tests: reduce pytorch functional tests' runtime (#834)

5b2f45b

improve runtime of diff unit tests (#831)

14caa08

move functional tests out of unit test suite (#832)

746fd73

* move tests using cloud_test_catalog into func directory
* move tests using tmpfile catalog
* move long running tests that read/write from disk

import Int into test_datachain_merge (fix tests broken on bad merge) (#837)

0fe47dd

[pre-commit.ci] pre-commit autoupdate (#836)

1598c4c

updates: - [github.com/astral-sh/ruff-pre-commit: v0.9.1 → v0.9.2](astral-sh/ruff-pre-commit@v0.9.1...v0.9.2) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

skshetry and others added 22 commits January 28, 2025 01:02

pull _is_file checks to get_listing (#846)

69a4385

use posixpath in upload methods (#855)

7859e16

catch (HfHub)HTTPError in hf-dataset-llm-eval example (#848)

17118d1

Code review updates

cc05da9

Merge branch 'main' into video-models

8d9f6c2

Update video requirements

23514f7

Code review updates

8a8dd64

Merge branch 'main' into video-models

1a04dd0

Merge branch 'main' into video-models

0c95c3d

Code review updates + tests

e55405d

Set up ffmpeg in tests

8e2a673

Set up ffmpeg in tests

9c910ec

Set up ffmpeg in tests

a2b8c9a

Update 'ensure_cached' test

63448d9

Revert 'ensure_cached' test

abe39f5

[pre-commit.ci] auto fixes from pre-commit.com hooks

3b7b829

for more information, see https://pre-commit.ci

Fix tests

55f0478

Fix tests

99b9490

Update video models

c28cd66

Merge branch 'main' into video-models

4098e8b

Update video models

0f2e12c

dreadatour requested review from shcheklein, dmpetrov, mattseddon and a team February 3, 2025 18:36

dreadatour self-assigned this Feb 3, 2025

mattseddon reviewed Feb 4, 2025


dreadatour marked this pull request as draft February 4, 2025 05:32
