From 12f78cfc3d3a6d61cb2b33e05f4027fee15606f0 Mon Sep 17 00:00:00 2001
From: Boyuan Zheng <58822425+boyuanzheng010@users.noreply.github.com>
Date: Thu, 21 Dec 2023 23:28:52 -0500
Subject: [PATCH] Add files via upload

---
 index.html | 344 +++++++++++++++++++++++++----------------------------
 1 file changed, 162 insertions(+), 182 deletions(-)

diff --git a/index.html b/index.html
index 7be8c47..47195bc 100644
--- a/index.html
+++ b/index.html
@@ -88,37 +88,36 @@

SeeAct: GPT-4V(ision) is a Generalist Web Agent, if Grounded

-  Boyuan Zheng1,
+  Boyuan Zheng,
-  Boyu Gou2,
+  Boyu Gou,
-  Jihyung Kil2,
+  Jihyung Kil,
-  Huan Sun2,
+  Huan Sun,
-  Yu Su2,
+  Yu Su,
-  1University of Washington,
-  2Google Research
+  The Ohio State University
@@ -166,22 +164,58 @@

SeeAct: GPT-4V(ision) is a Generalist Web Agent, if Grounded
+  SEEACT is a generalist web agent based on GPT-4V. Specifically, given a web-based task (e.g., “Rent a truck with the lowest rate” on a car-rental website), we examine two essential capabilities of GPT-4V as a generalist web agent: (i) Action Generation, which produces an action description at each step (e.g., “Move the cursor over the ‘Find Your Truck’ button and perform a click”) toward completing the task, and (ii) Element Grounding, which identifies the corresponding HTML element (e.g., “[button] Find Your Truck”) on the current webpage.
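A minimal sketch of this two-stage scheme is shown below: one call generates the next action description, and a second call grounds it onto a candidate HTML element. The helper names, prompts, and model identifier are illustrative assumptions, not the released SeeAct implementation.

# Hedged sketch of the two-stage loop: action generation, then grounding.
# Prompts, model name, and helpers are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_action(task: str, screenshot_b64: str, history: list[str]) -> str:
    """Ask GPT-4V for a textual description of the next action."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task}\nPrevious actions: {history}\n"
                         "Describe the next action to take on this webpage."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def ground_action(action: str, candidates: list[str]) -> str:
    """Map the action description onto one of the candidate HTML elements."""
    menu = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates))
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user",
                   "content": f"Action: {action}\nCandidate elements:\n{menu}\n"
                              "Reply with only the number of the matching element."}],
    )
    return candidates[int(response.choices[0].message.content.strip())]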

-  Nerfies turns selfie videos from your phone into free-viewpoint portraits.
+  SeeAct Real-time Demo on Live Website


+  Abstract


+  The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. Websites are designed to be rendered visually for easy consumption by humans. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web.


+  We evaluate SEEACT on the recent MIND2WEB benchmark. In addition to offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents great potential for web agents: it can successfully complete 50% of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4, as well as smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding remains a major challenge: existing LMM grounding strategies such as set-of-mark prompting turn out to be ineffective for web agents, and the best grounding strategy leverages both the HTML text and visuals, yet a substantial gap with oracle grounding remains.
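The live-website tool itself is not shown on this page; the following is a hedged sketch of executing one grounded action on a live page with a browser-automation library such as Playwright (an assumption for illustration; the URL and selector below are hypothetical).

# Hedged sketch: executing one grounded action on a live website with
# Playwright. This is an illustrative assumption, not necessarily the
# tool described in the paper; URL and selector are hypothetical.
from playwright.sync_api import sync_playwright

def run_one_step(url: str, selector: str, screenshot_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=screenshot_path)  # input image for action generation
        page.click(selector)                   # perform the grounded click
        browser.close()

# Example: click a (hypothetical) "Find Your Truck" button.
run_one_step("https://example.com", "text=Find Your Truck", "step.png")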

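For context on set-of-mark prompting, the sketch below overlays numbered marks on a screenshot so the model can refer to elements by index; the bounding boxes are assumed to come from the page's DOM, and the coordinates here are made up.

# Hedged sketch of set-of-mark prompting: draw numbered boxes on a
# screenshot so the model can name an element by its index. Boxes are
# assumed to come from the page's DOM; the coordinates below are made up.
from PIL import Image, ImageDraw

def add_marks(screenshot: str, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    img = Image.open(screenshot).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (left, top, right, bottom) in enumerate(boxes):
        draw.rectangle((left, top, right, bottom), outline="red", width=3)
        draw.text((left + 4, top + 4), str(i), fill="red")
    return img

add_marks("step.png", [(10, 10, 200, 60), (10, 80, 200, 130)]).save("step_marked.png")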
@@ -243,158 +277,104 @@


-  Abstract


-  We present the first method capable of photorealistically reconstructing a non-rigidly deforming scene using photos/videos captured casually from mobile phones.


-  Our approach augments neural radiance fields (NeRF) by optimizing an additional continuous volumetric deformation field that warps each observed point into a canonical 5D NeRF.
-  We observe that these NeRF-like deformation fields are prone to local minima, and propose a coarse-to-fine optimization method for coordinate-based models that allows for more robust optimization.
-  By adapting principles from geometry processing and physical simulation to NeRF-like models, we propose an elastic regularization of the deformation field that further improves robustness.


-  We show that Nerfies can turn casually captured selfie photos/videos into deformable NeRF models that allow for photorealistic renderings of the subject from arbitrary viewpoints, which we dub "nerfies". We evaluate our method by collecting data using a rig with two mobile phones that take time-synchronized photos, yielding train/validation images of the same pose at different viewpoints. We show that our method faithfully reconstructs non-rigidly deforming scenes and reproduces unseen views with high fidelity.


-  Video


-  Visual Effects


-  Using nerfies you can create fun visual effects. This Dolly zoom effect would be impossible without nerfies since it would require going through a wall.


-  Matting


-  As a byproduct of our method, we can also solve the matting problem by ignoring samples that fall outside of a bounding box during rendering.


-  Animation


-  Interpolating states


-  We can also animate the scene by interpolating the deformation latent codes of two input frames. Use the slider here to linearly interpolate between the left frame and the right frame.

-  Interpolate start reference image.

-  Start Frame

-  Interpolation end reference image.

-  End Frame


-  Re-rendering the input video


-  Using Nerfies, you can re-render a video from a novel viewpoint such as a stabilized camera by playing back the training deformations.
