medium blog migration 🔄
vrtnis committed Jan 31, 2025
1 parent 26079f3 commit 648daa9
Showing 16 changed files with 223 additions and 9 deletions.
Binary file added public/images/research/css-pattern2.png
Binary file added public/images/research/css-pattern3.png
Binary file added public/images/research/css-pattern4.png
Binary file not shown.
Binary file added public/images/research/kscale_projections.webp
Binary file not shown.
Binary file added public/images/research/qwen-llama.webp
Binary file not shown.
Binary file added public/images/research/rj-score.webp
Binary file not shown.
Binary file added public/images/research/time-to-t1.webp
Binary file not shown.
Binary file added public/images/research/why-rj.webp
Binary file not shown.
Binary file not shown.
@@ -1,7 +1,7 @@
---
title: "Efficient Vision-Language-Action Models"
description: "Improving inference speed of vision-language-action models for edge devices while preserving encoding power."
date: "29 January 2025"
date: "September 9, 2024"
image: "/images/research/css-pattern1.png"
---

37 changes: 37 additions & 0 deletions src/content/research/introducing-k-scale-labs.mdx
@@ -0,0 +1,37 @@
---
title: "Introducing K-Scale Labs"
description: "Our mission at K-Scale Labs is to move humanity to a Type 1 Kardashev civilization."
date: "August 7, 2024"
image: "/images/research/css-pattern2.png"
---



# Introducing K-Scale Labs

In 1964, Soviet astronomer Nikolai Kardashev proposed a scale for measuring a civilization's level of technological advancement based on energy consumption. A Type 1 civilization on this scale is one which can harness all the energy available on its planet. A Type 2 civilization is one which can harness all of the energy from a star. A Type 3 civilization is one which can harness all the energy from a galaxy.

Our mission at K-Scale Labs is to move humanity to a Type 1 Kardashev civilization within my lifetime. Why is this a good idea? Barring more abstract philosophical conceptualizations of "the good life", harnessing more energy is what makes people's lives better. Famine, poverty, natural disasters, cost of living: most of the problems that people care about stem, at their core, from a collective inability to harness energy for useful outcomes in one way or another. A world where it is practically free to have something done, in other words a world with less scarcity and more abundance, is one in which most of humanity's problems become political rather than technical.

<Image
src="/images/research/kscale_projections.webp"
alt="Kardashev Scale Projection"
width={600}
height={300}
/>

### Kardashev scale projections for Earth


<Image
src="/images/research/time-to-t1.webp"
alt="Time-to-T1 for different growth rates in energy consumption"
width={600}
height={300}
/>

### Time-to-T1 for different growth rates in energy consumption.

As an engineering problem, reaching a Type 1 civilization in my lifetime felt, until very recently, like an impossible task. However, in the context of general-purpose intelligence, it seems much more tractable. In a world with a general-purpose agent for every human, increasing humanity's energy consumption by 15% annually simply means something like 30% of those agents copying themselves once per year.
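
For intuition, here is a minimal TypeScript sketch of the compound-growth calculation behind the curves above. It assumes Carl Sagan's common interpolation of the Kardashev scale, K = (log10(P) - 6) / 10 with P in watts (so Type 1 sits at 10^16 W), and a present-day consumption of roughly 20 TW; both constants are illustrative assumptions, not figures from this post.

```ts
// Time-to-T1 under constant annual growth in energy consumption.
// Assumptions (not from the post): Type 1 at 1e16 W per Sagan's
// interpolation, and present-day consumption of ~2e13 W (~20 TW).
const TYPE_1_WATTS = 1e16;
const CURRENT_WATTS = 2e13;

/** Years until Type 1 if consumption grows by `annualGrowth` per year. */
function yearsToType1(annualGrowth: number): number {
  return Math.log(TYPE_1_WATTS / CURRENT_WATTS) / Math.log(1 + annualGrowth);
}

for (const g of [0.03, 0.05, 0.1, 0.15]) {
  console.log(`${(g * 100).toFixed(0)}% growth -> ~${yearsToType1(g).toFixed(0)} years`);
}
// 3% growth (roughly the historical trend) -> ~210 years
// 15% growth -> ~45 years, i.e. within a single lifetime
```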

Our goal at K-Scale Labs is to make this world possible by designing a platform for general-purpose embodied intelligence and making it freely available for anyone to build on. We believe that working towards this future should be the collective project of all of humanity, rather than the work of a few companies in Silicon Valley; in the same way that beavers build dams and birds build nests, humans build abundance. If you would like to build this future with us, we would be happy to have you.
@@ -0,0 +1,62 @@
---
title: "Wait, Thatā€™s Not an Option: LLMs Robustness with In-correct Multiple-Choice Options"
description: "Exploring Reflective Judgment in language models and their ability to critically evaluate input even in flawed multiple-choice scenarios."
date: "October 14, 2024"
image: "/images/research/css-pattern4.png"
---

# Wait, That's Not an Option


Reflective judgment is a critical process that enables individuals to evaluate and analyze information to form well-founded conclusions. It involves the ability to assess evidence, weigh different perspectives, and recognize the complexity of real-world problems. We present our first results on this topic, shedding some light on the behavior of different models and on potential ways to improve their performance. You can also check out our project website and the code on GitHub.

## What do we measure?

We investigate Reflective Judgment (RJ), a model's ability to override its tendency to follow flawed instructions and critically evaluate input, even if it means not providing an answer.

## Why RJ?

Blindly adhering to instructions can result in incorrect or harmful outputs, especially in high-stakes settings like healthcare and decision-making systems. Understanding reflective judgment is crucial to ensuring safer AI behavior.

<Image
src="/images/research/why-rj.webp"
alt="Why RJ?"
width={600}
height={300}
/>


## How do we measure RJ?

To measure reflective judgment, we create two datasets. The first is the Basic Arithmetic Dataset (BAD), which consists of three levels: easy (single-digit addition problems), medium (two-digit problems), and hard (three-digit problems); every BAD question is posed with only incorrect options. Second, we sample questions from the MMLU dataset across different domains, such as STEM and the Humanities, and similarly pose each question with two incorrect options.
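
To make the setup concrete, here is a minimal sketch of how a BAD-style question with only incorrect options might be generated; the specific offsets and prompt format are illustrative assumptions, not the paper's actual generation code.

```ts
// Generate an addition question whose two options are both wrong,
// so the only "correct" behaviors are reflective ones.
type Level = "easy" | "medium" | "hard";

const DIGITS: Record<Level, number> = { easy: 1, medium: 2, hard: 3 };

function randomOperand(level: Level): number {
  return Math.floor(Math.random() * 10 ** DIGITS[level]);
}

function makeBadQuestion(level: Level) {
  const a = randomOperand(level);
  const b = randomOperand(level);
  const correct = a + b;
  // Offset the true sum in both directions so neither option is correct.
  return { prompt: `What is ${a} + ${b}?`, options: [correct - 1, correct + 1] };
}

console.log(makeBadQuestion("medium"));
// e.g. { prompt: "What is 42 + 17?", options: [58, 60] }
```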

We evaluate how often models correctly identify situations where no valid answer exists, or provide the correct solution even when it is not among the given options; we refer to these as reflective actions. The Reflective Judgment Score for each model is defined as the percentage of all its answers that include a reflective action.
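
The score itself is a simple ratio. Below is a sketch assuming each response has already been labeled; the labeling step (string matching, an LLM judge, or manual review) is the hard part and is not shown here.

```ts
// An answer counts as a "reflective action" if the model flags that no
// valid option exists, or supplies the correct answer despite it not
// being listed. The RJ score is the percentage of such answers.
type Judgment =
  | "picked_incorrect_option"       // followed the flawed instruction
  | "flagged_no_valid_answer"       // reflective action
  | "gave_correct_unlisted_answer"; // reflective action

function reflectiveJudgmentScore(judgments: Judgment[]): number {
  const reflective = judgments.filter((j) => j !== "picked_incorrect_option").length;
  return (100 * reflective) / judgments.length;
}

console.log(
  reflectiveJudgmentScore([
    "picked_incorrect_option",
    "flagged_no_valid_answer",
    "gave_correct_unlisted_answer",
    "picked_incorrect_option",
  ]),
); // 50
```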

## Our Findings:

1. Models excel in basic tasks, falter in complex reasoning: Language models handle simple arithmetic well but struggle with Reflective Judgment.
2. Training impacts critical reasoning: Base models outperform instruction-tuned and aligned variants on reflective tasks, showing that fine-tuning can reduce critical reasoning.
3. Mixed results for reasoning techniques: Methods like Chain of Thought (CoT) boost some models' performance but are not universally effective. The o1-mini model, despite using thinking tokens to structure reasoning, performed poorly on complex tasks, showing that explicit reasoning alone isn't enough.
4. Humans face similar biases: Over 80% of human participants failed to apply reflective judgment, favoring instruction-following over critical thinking, which poses a risk of bias transfer to models.

<Image
src="/images/research/rj-score.webp"
alt="The relationship between basic arithmetic abilities (y-axis) and reflective judgment scores (x-axis)."
width={600}
height={300}
/>

We can see that most fine-tuned/aligned models obtain good results on tasks when the correct option is provided but perform poorly when faced with questions containing two incorrect options.

<Image
src="/images/research/qwen-llama.webp"
alt="Performance of Llama 3.1 models (8B, 70B, 405B) and Qwen2.5 models (7B, 14B, 32B) on simple arithmetic tasks."
width={600}
height={300}
/>

Performance of Llama 3.1 models (8B, 70B, 405B) and Qwen2.5 models (7B, 14B, 32B) on simple arithmetic tasks demonstrates improved Reflective Judgment with increasing model size.

We conducted a parallel experiment with human participants and observed similar patterns: more than 80% struggled to critically evaluate questions that had no correct option, demonstrating shared challenges in judgment. This suggests that human biases might influence models during training, highlighting the need for clearer guidelines to reduce misleading instructions and bias.

The work will be presented at the first workshop on Large Foundation Models for Educational Assessment ([NeurIPS 2024](https://neurips2024edu.github.io/)). See you there!
107 changes: 107 additions & 0 deletions src/content/research/we-are-here-for-the-long-haul.mdx
@@ -0,0 +1,107 @@
---
title: "We Are Here for the Long Haul"
description: "K-Scale Labs' vision for the future of humanoid robotics and AI."
date: "August 8, 2024"
image: "/images/research/css-pattern3.png"
---


# We Are Here for the Long Haul

We believe the world will soon see an exponential increase in the number of useful and affordable humanoids deployed across labs, warehouses, and households worldwide. The price of hardware will soon drop below the cost of your favorite VR headset¹, and the AI software will become much more sophisticated and mature.

We also believe advancements in this field should be publicly accessible, which is why we created K-Scale Labs. The best part of building in the open-source spirit is sharing everything with the community instead of keeping it behind closed doors. With our updates, we want to share what we build and what we learn along the way.

## Laying the foundation

My co-founder Ben likes to say that the main reason GPT-2 was adopted faster and more widely than BERT is that you could immediately see the poetry it generated after training.² We are far from that setup in robotics, and K-Scale's mission is to change that. In the coming weeks we will share updates on our affordable humanoid platform, which will open up new possibilities for roboticists, MLEs, and enthusiasts to easily test new models and skills at home or in the lab.

In 2018, I collected the largest task-oriented dialogue dataset available to the community. 10,000 dialogues felt like more than enough (sic!), but the reality was that the foundation model was still missing. Just like the NLP world then, the robotics world is still searching for that foundation, and it feels like we're making the same mistakes along the way. To reach a world of ubiquitous robots, you still have to make them truly generalizable, and we believe that achieving this is only possible by being ML-first while staying fully hardware-aware.

We recognize the long road ahead in creating truly useful and widely adopted robots, but we're excited to embrace the challenge and to contribute to a world where embodied intelligence is cheap, plentiful, and useful.

## The data challenge: quality and diversity

Collecting the right data at scale is challenging, to put it mildly. There are two main issues with every collection: data quality and data diversity. Having spent a considerable portion of my ML career collecting conversational data, I believe you can't escape these constraints, even with unlimited resources. Recent datasets like OXE, Droid (and many more) are fantastic steps forward, but the problem of quality will only become more challenging. Let's look at some examples from the Droid dataset below:

<iframe
width="600"
height="300"
src="https://www.youtube.com/embed/jvzAASmvicY"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowFullScreen
></iframe>


### The instruction is 'Spread the jeans on the couch'.

<iframe
width="600"
height="300"
src="https://www.youtube.com/embed/i6wX3XxtIU8"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowFullScreen
></iframe>


### The instruction is 'Move the cup to the left and cover it'.

The ambiguity of the annotations is a major issue. Given the challenges of the real world and the lack of a foundation model, any noise in the annotations, instead of helping the model generalize, will cause it to overfit to that noise. We will soon share our first datasets, along with tools for building filter models that can act as a first line of defense against incorrect annotations. Nevertheless, they will never be perfect.

Teleoperation feels like an obvious path, and most major robotics companies are taking it³. However, to use a well-worn analogy, it's only the icing on the cake.

<Image
src="/images/research/yann_lecun_cake_analogy.webp"
alt="Yann Lecun Cake Analogy"
width={600}
height={300}
/>

We are quite skeptical of teleop as a solution, since it distracts from the core of the problem. The harsh reality is that our "a lot" of data isn't really that much, and data diversity will be a major bottleneck if we don't get these robots into real-world environments. And just like in the NLP world, we collect vast amounts of data to train on specific tasks, incentivizing overfitting.


<iframe
width="600"
height="300"
src="https://www.youtube.com/embed/esD0WWW5YoQ"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowFullScreen
></iframe>

## Streamlining development

Dealing with CUDA issues when working on ML models is already frustrating. Adding to that ad-hoc URDF changes, different simulators with their quirks, and optimizing for sim-to-real makes ML robotics development a truly painful experience. That's why we're building tools to quickly go from CAD designs to modeling in your favorite simulator with URDF or XML files. All of this is shared in the K-Scale Onshape library.

## Modeling through simplicity

The UMI and Aloha projects popularized ACT and diffusion architectures. IsaacLab, LeRobot, and many other packages significantly lower the barrier to entry for newcomers to the field. Below, you can see a simple policy trained on a handful of examples, with the model generalizing to a new background.


<iframe
width="600"
height="300"
src="https://www.youtube.com/embed/hkIsv1gtwE0"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowFullScreen
></iframe>


## Conclusions

There will be billions of autonomous, affordable, and helpful robots in the world, enabling us to do more productive and creative work. K-Scale Labs' mission is to bring them to market at an affordable price, with a simple software ecosystem where hardware and software are tightly integrated and driven by a single model.

To fulfill this dream, we will need to work through a seemingly endless list of economic and research challenges. If you feel you can help us tackle them, let us know! We've signed up for quite a long journey.

[1] If you don't believe us, just take a look at developments here or here or here, or at dozens of other great labs.

[2] Even though the downstream performance doesn't differ much between the pre-training tasks, see https://arxiv.org/abs/2205.05131.

[3] See how this works at the Tesla factory.
18 changes: 10 additions & 8 deletions src/pages/research/index.tsx
@@ -2,7 +2,6 @@ import fs from "fs";
import path from "path";
import matter from "gray-matter";
import Link from "next/link";
import Image from "next/image";
import Layout from "../../components/Layout";

const RESEARCH_PATH = path.join(process.cwd(), "src/content/research");
@@ -19,12 +18,18 @@ export async function getStaticProps() {
slug: file.replace(".mdx", ""),
title: data.title || "Untitled",
description: data.description || "No description available.",
// We rely on `data.date` for sorting. Make sure it's a valid date format (e.g. "YYYY-MM-DD" or "September 9, 2024").
date: data.date || "Unknown date",
image: data.image || "gradient.png",
image: data.image || "/images/research/css-pattern1.png",
};
});

return { props: { posts } };
// Sort by oldest date first
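// Caveat (editor's note): entries that fall back to "Unknown date" parse to
// NaN, and a NaN comparator result leaves their sorted position unspecified.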
const sortedPosts = posts.sort((a, b) => {
return new Date(a.date).getTime() - new Date(b.date).getTime();
});

return { props: { posts: sortedPosts } };
}

export default function ResearchIndex({ posts }: { posts: any[] }) {
@@ -44,16 +49,13 @@ export default function ResearchIndex({ posts }: { posts: any[] }) {
{posts.map((post) => (
<Link key={post.slug} href={`/research/${post.slug}`} className="group card-link">
<div className="flex flex-col h-full rounded-lg overflow-hidden shadow-md bg-white dark:bg-gray-900 transition transform hover:-translate-y-1">
{/* Updated Image usage for Next.js 13 */}
{/* Top image background */}
<div className="relative w-full aspect-[16/9] overflow-hidden">
{/* This inner div is made larger (200% height, for example).
Then we can show only the top portion by offsetting or by
limiting the outer container's overflow. */}
<div
className="absolute top-0 left-0 w-full h-[200%]"
style={{
backgroundImage: `url(${post.image})`,
backgroundSize: "100% auto", // or "cover" or however you prefer
backgroundSize: "100% auto",
backgroundRepeat: "no-repeat",
backgroundPosition: "top center",
}}
6 changes: 6 additions & 0 deletions src/styles/globals.css
@@ -467,6 +467,12 @@ main .sponsors {
border-radius: 8px; /* If using SVG make sure that preserveAspectRatio="none" */
}

.mdx-content iframe {
display: block;
margin: 2.5rem auto;
border-radius: 8px;
}

.mdx-content blockquote {
margin: 2rem auto;
padding: 1rem;
