
Video models (take 2) #890

Draft · wants to merge 62 commits into main from video-models

Commits (62, all changes shown):
75877d1
Add video models + functions
dreadatour Jan 13, 2025
031b9df
Code review update
dreadatour Jan 14, 2025
548bbd5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 14, 2025
b55149a
Code review update
dreadatour Jan 14, 2025
2cd6d62
Code review update
dreadatour Jan 15, 2025
5892ab9
Small fixes due to work on usage examples
dreadatour Jan 15, 2025
f3dc66a
Examples fixes
dreadatour Jan 20, 2025
65529f3
docs(merge): add examples with Func object (#811)
shcheklein Jan 13, 2025
b044082
fix(tqdm): import tqdm to support jupyter (#812)
shcheklein Jan 13, 2025
2a77047
[pre-commit.ci] pre-commit autoupdate (#815)
pre-commit-ci[bot] Jan 13, 2025
89ee2f0
progress: remove unused logging/tqdm lock (#817)
skshetry Jan 14, 2025
5f522ad
build(deps): bump ultralytics from 8.3.58 to 8.3.61 (#816)
dependabot[bot] Jan 14, 2025
e2f5a3a
Review help/usage for cli commands (#802)
amritghimire Jan 15, 2025
67beb9f
file: raise error (#820)
skshetry Jan 15, 2025
60c5848
README - mistral fix (#821)
dmpetrov Jan 16, 2025
d3b1619
file: support exporting files as a symlink (#819)
skshetry Jan 16, 2025
e31210c
prefetching: remove prefetched item after use in udf (#818)
skshetry Jan 16, 2025
bcd95b1
ReferenceFileSystem: use fs.open instead of fs._open (#823)
skshetry Jan 16, 2025
08edd27
Second iteration of cli command help (#826)
amritghimire Jan 18, 2025
dbefa5f
Fix list of tuples. Closes #827 (#828)
dmpetrov Jan 19, 2025
258454e
Added full outer join (#822)
ilongin Jan 20, 2025
328c1a7
memoize usearch.sqlite_path() (#833)
skshetry Jan 20, 2025
a1a47b2
Added `isnone()` function (#801)
ilongin Jan 20, 2025
5b2f45b
tests: reduce pytorch functional tests' runtime (#834)
skshetry Jan 20, 2025
14caa08
improve runtime of diff unit tests (#831)
mattseddon Jan 20, 2025
746fd73
move functional tests out of unit test suite (#832)
mattseddon Jan 20, 2025
0fe47dd
import Int into test_datachain_merge (fix tests broken on bad merge) …
mattseddon Jan 20, 2025
1598c4c
[pre-commit.ci] pre-commit autoupdate (#836)
pre-commit-ci[bot] Jan 20, 2025
0c3f3b4
build(deps): bump ultralytics from 8.3.61 to 8.3.64 (#839)
dependabot[bot] Jan 21, 2025
bf824af
build(deps): bump mkdocs-material from 9.5.22 to 9.5.50 (#838)
dependabot[bot] Jan 21, 2025
428d865
Revert "build(deps): bump mkdocs-material from 9.5.22 to 9.5.50 (#838…
yathomasi Jan 21, 2025
b7549b1
Add CSV parsing options (#813)
skirdey Jan 21, 2025
8639246
e2e tests: limit name_len_slow to 3, split e2e tests from other tests…
skshetry Jan 21, 2025
3376449
ci: switch trigger from `pull_request_target` to `pull_request` (#843)
skshetry Jan 21, 2025
5b2e437
rename DataChainCache to Cache (#847)
skshetry Jan 21, 2025
213b1d8
feat: add apollo integration, drop reo.dev (#835)
yathomasi Jan 22, 2025
43389f7
append e2e tests coverage instead of overwriting (#851)
mattseddon Jan 22, 2025
5a20c4e
drop unstructured examples (#854)
mattseddon Jan 24, 2025
b72c440
add upload classmethod to File (#850)
mattseddon Jan 24, 2025
55cd044
drop .edatachain support (#853)
skshetry Jan 24, 2025
69a4385
pull _is_file checks to get_listing (#846)
skshetry Jan 24, 2025
7859e16
use posixpath in upload methods (#855)
mattseddon Jan 24, 2025
3f47d12
Handle permission error properly when checking for file (#856)
amritghimire Jan 27, 2025
17118d1
catch (HfHub)HTTPError in hf-dataset-llm-eval example (#848)
mattseddon Jan 27, 2025
cc05da9
Code review updates
dreadatour Jan 27, 2025
8d9f6c2
Merge branch 'main' into video-models
dreadatour Jan 27, 2025
23514f7
Update video requirements
dreadatour Jan 28, 2025
8a8dd64
Code review updates
dreadatour Jan 28, 2025
1a04dd0
Merge branch 'main' into video-models
dreadatour Jan 28, 2025
0c95c3d
Merge branch 'main' into video-models
dreadatour Jan 29, 2025
e55405d
Code review updates + tests
dreadatour Jan 29, 2025
8e2a673
Set up ffmpeg in tests
dreadatour Jan 29, 2025
9c910ec
Set up ffmpeg in tests
dreadatour Jan 29, 2025
a2b8c9a
Set up ffmpeg in tests
dreadatour Jan 29, 2025
63448d9
Update 'ensure_cached' test
dreadatour Jan 29, 2025
abe39f5
Revert 'ensure_cached' test
dreadatour Jan 29, 2025
3b7b829
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 29, 2025
55f0478
Fix tests
dreadatour Jan 30, 2025
99b9490
Fix tests
dreadatour Jan 30, 2025
c28cd66
Update video models
dreadatour Jan 30, 2025
4098e8b
Merge branch 'main' into video-models
dreadatour Feb 3, 2025
0f2e12c
Update video models
dreadatour Feb 3, 2025
3 changes: 3 additions & 0 deletions .github/workflows/tests-studio.yml
@@ -75,6 +75,9 @@ jobs:
path: './backend/datachain'
fetch-depth: 0

- name: Set up FFmpeg
uses: AnimMouse/setup-ffmpeg@v1

- name: Set up Python ${{ matrix.pyv }}
uses: actions/setup-python@v5
with:
3 changes: 3 additions & 0 deletions .github/workflows/tests.yml
@@ -78,6 +78,9 @@ jobs:
fetch-depth: 0
ref: ${{ github.event.pull_request.head.sha || github.ref }}

- name: Set up FFmpeg
uses: AnimMouse/setup-ffmpeg@v1

- name: Set up Python ${{ matrix.pyv }}
uses: actions/setup-python@v5
with:
10 changes: 9 additions & 1 deletion pyproject.toml
@@ -79,8 +79,16 @@ hf = [
"numba>=0.60.0",
"datasets[audio,vision]>=2.21.0"
]
video = [
# Use 'av<14' because of incompatibility with imageio
# See https://github.com/PyAV-Org/PyAV/discussions/1700
"av<14",
"ffmpeg-python",
"imageio[ffmpeg]",
"opencv-python"
]
tests = [
"datachain[torch,remote,vector,hf]",
"datachain[torch,remote,vector,hf,video]",
"pytest>=8,<9",
"pytest-sugar>=0.9.6",
"pytest-cov>=4.1.0",
10 changes: 10 additions & 0 deletions src/datachain/__init__.py
@@ -4,9 +4,14 @@
ArrowRow,
File,
FileError,
Image,
ImageFile,
TarVFile,
TextFile,
Video,
VideoFile,
VideoFragment,
VideoFrame,
)
from datachain.lib.model_store import ModelStore
from datachain.lib.udf import Aggregator, Generator, Mapper
@@ -27,13 +32,18 @@
"File",
"FileError",
"Generator",
"Image",
"ImageFile",
"Mapper",
"ModelStore",
"Session",
"Sys",
"TarVFile",
"TextFile",
"Video",
"VideoFile",
"VideoFragment",
"VideoFrame",
"is_chain_type",
"metrics",
"param",
205 changes: 201 additions & 4 deletions src/datachain/lib/file.py
@@ -17,7 +17,7 @@
from urllib.request import url2pathname

from fsspec.callbacks import DEFAULT_CALLBACK, Callback
from PIL import Image
from PIL import Image as PilImage
from pydantic import Field, field_validator

from datachain.client.fileslice import FileSlice
@@ -27,6 +27,7 @@
from datachain.utils import TIME_ZERO

if TYPE_CHECKING:
from numpy import ndarray
from typing_extensions import Self

from datachain.catalog import Catalog
@@ -40,7 +41,7 @@
# how to create file path when exporting
ExportPlacement = Literal["filename", "etag", "fullpath", "checksum"]

FileType = Literal["binary", "text", "image"]
FileType = Literal["binary", "text", "image", "video"]


class VFileError(DataChainError):
@@ -193,7 +194,7 @@
@classmethod
def upload(
cls, data: bytes, path: str, catalog: Optional["Catalog"] = None
) -> "File":
) -> "Self":
if catalog is None:
from datachain.catalog.loader import get_catalog

@@ -203,6 +204,8 @@

client = catalog.get_client(parent)
file = client.upload(data, name)
if not isinstance(file, cls):
file = cls(**file.model_dump())
file._set_stream(catalog)
return file
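The `upload` change above re-wraps the result into the calling subclass via `model_dump`, so `VideoFile.upload(...)` yields a `VideoFile` rather than a plain `File`. The pattern can be sketched standalone (hypothetical stand-in classes, not the DataChain API; plain dataclasses and `asdict` stand in for pydantic models and `model_dump`):

```python
from dataclasses import dataclass, asdict


@dataclass
class BaseFile:
    path: str
    size: int = 0

    @classmethod
    def upload(cls, data: bytes, path: str) -> "BaseFile":
        # Pretend a client always returns the base type...
        file = BaseFile(path=path, size=len(data))
        # ...then re-wrap into the calling subclass if needed, so the
        # classmethod honors the type it was invoked on.
        if not isinstance(file, cls):
            file = cls(**asdict(file))
        return file


@dataclass
class VideoLikeFile(BaseFile):
    pass


f = VideoLikeFile.upload(b"abc", "clip.mp4")
print(type(f).__name__, f.size)  # VideoLikeFile 3
```

The `isinstance` guard keeps the common case (called on the base class) allocation-free while still promoting the result for subclasses.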

@@ -486,13 +489,205 @@
def read(self):
"""Returns `PIL.Image.Image` object."""
fobj = super().read()
return Image.open(BytesIO(fobj))
return PilImage.open(BytesIO(fobj))

def save(self, destination: str):
"""Writes its content to destination."""
self.read().save(destination)


class Image(DataModel):
"""`DataModel` for image file meta information."""

width: int = Field(default=-1)
height: int = Field(default=-1)
format: str = Field(default="")


class VideoFile(File):
"""`DataModel` for reading video files."""

def get_info(self) -> "Video":
"""Returns video file information."""
from .video import video_info

return video_info(self)

def get_frame(self, frame: int) -> "VideoFrame":
Review thread:

Member: Minor, but should these be to_ methods to match the DataChain class?

Contributor (author): Looks reasonable 🤔 Although it is not a direct conversion ("to"), but rather getting a part of the file into another file. "Get frame from video" reads fine to me, but "video to frame" looks odd. What do you think? I don't have a strict opinion on this 🤔

"""
Returns VideoFrame model for a video frame.

Args:
frame (int): Frame number to read.

Returns:
VideoFrame: Video frame model.
"""
if frame < 0:
raise ValueError("frame must be a non-negative integer")

frame_file = VideoFrame(**self.model_dump(), frame=frame)
frame_file._set_stream(self._catalog)
return frame_file

def get_frames(
self,
start: int = 0,
end: Optional[int] = None,
step: int = 1,
) -> "Iterator[VideoFrame]":
"""
Returns VideoFrame models for video frames.

Args:
start (int): Frame number to start reading from (default: 0).
end (Optional[int]): Frame number to stop reading at, non-inclusive
(default: None, read until the end).
step (int): Step size for reading frames (default: 1).

Returns:
Iterator[VideoFrame]: Iterator of video frame models.

Note:
If end is not specified, the number of frames is taken from the video file.
"""
if start < 0:
raise ValueError("start must be a non-negative integer.")

if end is None:
end = self.get_info().frames

if end < 0:
raise ValueError("end must be a non-negative integer.")
if start > end:
raise ValueError("start must be less than or equal to end.")

if step < 1:
raise ValueError("step must be a positive integer.")

for frame in range(start, end, step):
yield self.get_frame(frame)
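The frame-range validation above boils down to a plain generator over `range(start, end, step)`. A minimal standalone sketch (hypothetical helper, not the library API; `total_frames` stands in for `self.get_info().frames`):

```python
def frame_numbers(start=0, end=None, step=1, total_frames=0):
    """Yield frame indices with the same validation as get_frames."""
    if start < 0:
        raise ValueError("start must be a non-negative integer.")
    if end is None:
        end = total_frames  # fall back to the video's frame count
    if end < 0:
        raise ValueError("end must be a non-negative integer.")
    if start > end:
        raise ValueError("start must be less than or equal to end.")
    if step < 1:
        raise ValueError("step must be a positive integer.")
    # end is non-inclusive, matching range() semantics
    yield from range(start, end, step)


print(list(frame_numbers(0, 10, 3)))  # [0, 3, 6, 9]
```

Because this is a generator, validation only runs once iteration starts, which mirrors how `get_frames` behaves as written.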

def get_fragment(self, start: float, end: float) -> "VideoFragment":
"""
Returns VideoFragment model for a video interval.

Args:
start (float): Start time in seconds.
end (float): End time in seconds.

Returns:
VideoFragment: Video fragment model.
"""
if start < 0 or end < 0 or start >= end:
raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})")

fragment_file = VideoFragment(**self.model_dump(), start=start, end=end)
fragment_file._set_stream(self._catalog)
return fragment_file

def get_fragments(
self,
intervals: list[tuple[float, float]],
) -> "Iterator[VideoFragment]":
"""
Returns VideoFragment models for video intervals.

Args:
intervals (list[tuple[float, float]]): List of start and end times
in seconds.

Returns:
Iterator[VideoFragment]: Iterator of video fragment models.
"""
for start, end in intervals:
yield self.get_fragment(start, end)
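The time-range check in `get_fragment` rejects negative bounds and empty or inverted intervals. A standalone sketch of that validation applied over a list of intervals (hypothetical helper, not the library API):

```python
def validate_intervals(intervals):
    """Reject intervals the same way get_fragment does."""
    for start, end in intervals:
        # Bounds must be non-negative and the interval non-empty.
        if start < 0 or end < 0 or start >= end:
            raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})")
    return intervals


print(validate_intervals([(0.0, 1.5), (2.0, 3.0)]))  # [(0.0, 1.5), (2.0, 3.0)]
```

Note that `start >= end` also rules out zero-length fragments, so `(1.0, 1.0)` is invalid by design.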


class VideoFrame(VideoFile):
"""`DataModel` for reading video frames."""

frame: int = Field(default=-1)

def get_np(self) -> "ndarray":
"""
Reads video frame from a video file and returns as numpy array.

Returns:
ndarray: Video frame.
"""
from .video import video_frame_np

return video_frame_np(self)

def read_bytes(self, format: str = "jpg") -> bytes:
"""
Reads video frame from a video file and returns as image bytes.

Args:
format (str): Image format (default: 'jpg').

Returns:
bytes: Video frame image as bytes.
"""
from .video import video_frame_bytes

return video_frame_bytes(self, format)

def save(self, output: str, format: str = "jpg") -> "ImageFile":
"""
Saves video frame as a new image file. If output is a remote path,
the image file will be uploaded to the remote storage.

Args:
output (str): Output path, can be a local path or a remote path.
format (str): Image format (default: 'jpg').

Returns:
ImageFile: Image file model.
"""
from .video import save_video_frame

return save_video_frame(self, output, format)


class VideoFragment(VideoFile):
"""`DataModel` for reading video fragments."""

start: float = Field(default=-1.0)
end: float = Field(default=-1.0)

def save(self, output: str, format: Optional[str] = None) -> "VideoFile":
"""
Saves video interval as a new video file. If output is a remote path,
the video file will be uploaded to the remote storage.

Args:
output (str): Output path, can be a local path or a remote path.
format (Optional[str]): Output format (default: None). If not provided,
the format will be inferred from the video fragment
file extension.

Returns:
VideoFile: Video fragment model.
"""
from .video import save_video_fragment

return save_video_fragment(self, output, format)


class Video(DataModel):
"""`DataModel` for video file meta information."""

width: int = Field(default=-1)
height: int = Field(default=-1)
fps: float = Field(default=-1.0)
duration: float = Field(default=-1.0)
frames: int = Field(default=-1)
format: str = Field(default="")
codec: str = Field(default="")


class ArrowRow(DataModel):
"""`DataModel` for reading row from Arrow-supported file."""

@@ -528,5 +723,7 @@
file = TextFile
elif type_ == "image":
file = ImageFile # type: ignore[assignment]
elif type_ == "video":
file = VideoFile

Codecov / codecov/patch check warning on line 727 in src/datachain/lib/file.py: added line #L727 was not covered by tests.

return file
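The hunk above extends the `FileType` dispatch with a `video` branch. A self-contained sketch of the mapping (stand-in classes mirroring, not importing, the real ones):

```python
from typing import Literal

FileType = Literal["binary", "text", "image", "video"]


# Hypothetical stand-ins for the real File classes.
class File: ...
class TextFile(File): ...
class ImageFile(File): ...
class VideoFile(File): ...


def get_file_type(type_: FileType = "binary") -> type:
    # "binary" (the default) maps to the base File class.
    file: type = File
    if type_ == "text":
        file = TextFile
    elif type_ == "image":
        file = ImageFile
    elif type_ == "video":
        file = VideoFile
    return file


print(get_file_type("video").__name__)  # VideoFile
```

A dict lookup (`{"text": TextFile, ...}.get(type_, File)`) would be an equivalent design; the if/elif chain simply matches the diff's structure.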
2 changes: 1 addition & 1 deletion src/datachain/lib/hf.py
@@ -20,7 +20,7 @@

except ImportError as exc:
raise ImportError(
"Missing dependencies for huggingface datasets:\n"
"Missing dependencies for huggingface datasets.\n"
"To install run:\n\n"
" pip install 'datachain[hf]'\n"
) from exc
Empty file removed: src/datachain/lib/vfile.py