src/content/research/efficient-vision-language-action-models.mdx

---
title: "Introducing K-Scale Labs"
description: "Our mission at K-Scale Labs is to move humanity to a Type 1 Kardashev civilization."
date: "August 7, 2024"
image: "/images/research/css-pattern2.png"
---

# Introducing K-Scale Labs

In 1964, Soviet astronomer Nikolai Kardashev proposed a scale for measuring a civilization's level of technological advancement based on its energy consumption. A Type 1 civilization on this scale is one which can harness all the energy available on its planet. A Type 2 civilization is one which can harness all of the energy from a star. A Type 3 civilization is one which can harness all the energy from a galaxy.

Our mission at K-Scale Labs is to move humanity to a Type 1 Kardashev civilization within my lifetime. Why is this a good idea? Barring more abstract philosophical conceptions of "the good life", harnessing more energy is what makes people's lives better. Famine, poverty, natural disasters, the cost of living: most of the problems that people care about stem, at their core, from a collective inability to harness energy for useful outcomes in one way or another. A world where it is practically free to have something done, in other words a world with less scarcity and more abundance, is one in which most of humanity's problems become political rather than technical.

<Image
  src="/images/research/kscale_projections.webp"
  alt="Kardashev Scale Projection"
  width={600}
  height={300}
/>

### Kardashev scale projections for Earth

<Image
  src="/images/research/time-to-t1.webp"
  alt="Time-to-T1 for different growth rates in energy consumption"
  width={600}
  height={300}
/>

### Time-to-T1 for different growth rates in energy consumption

As an engineering problem, reaching a Type 1 civilization in my lifetime felt, until very recently, like an impossible task. In the context of general-purpose intelligence, however, it seems much more tractable. In a world with a general-purpose agent for every human, increasing humanity's energy consumption by 15% annually simply means something like 30% of those agents copying themselves once per year.
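To make the time horizon concrete, here is a minimal sketch of the growth arithmetic behind the projections above. The constants are assumptions for illustration, not figures from this post: the Type 1 threshold of roughly 1e16 W comes from Sagan's interpolation formula, and today's consumption is taken as roughly 2e13 W (about 20 TW).

```python
import math

# Assumed constants (not from this post):
P_TYPE_1 = 1e16   # watts, ~Type 1 threshold under Sagan's formula
P_TODAY = 2e13    # watts, rough current global power consumption

def kardashev(power_watts: float) -> float:
    """Sagan's continuous Kardashev rating: K = (log10(P) - 6) / 10."""
    return (math.log10(power_watts) - 6) / 10

def years_to_type_1(annual_growth: float, p0: float = P_TODAY) -> float:
    """Years t until p0 * (1 + g)^t reaches the Type 1 threshold."""
    return math.log(P_TYPE_1 / p0) / math.log(1 + annual_growth)

print(round(kardashev(P_TODAY), 2))   # today's rating, about 0.73
print(round(years_to_type_1(0.15)))   # ~44 years at 15% annual growth
print(round(years_to_type_1(0.02)))   # over 300 years at ~2% growth
```

Under these assumptions, sustained 15% annual growth reaches Type 1 within a few decades, while historical ~2% growth takes centuries, which is the gap the agent-copying argument is meant to close.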

Our goal at K-Scale Labs is to make this world possible by designing a platform for general-purpose embodied intelligence and making it freely available for anyone to build on. We believe that working towards this future should be the collective project of all of humanity rather than the work of a few companies in Silicon Valley: in the same way that beavers build dams and birds build nests, humans build abundance. If you would like to build this future with us, we would be happy to have you.

...thats-not-an-option-llms-robustness-with-in-correct-multiple-choice-options.mdx

---
title: "Wait, That's Not an Option: LLM Robustness with Incorrect Multiple-Choice Options"
description: "Exploring Reflective Judgment in language models and their ability to critically evaluate input even in flawed multiple-choice scenarios."
date: "October 14, 2024"
image: "/images/research/css-pattern4.png"
---

# Wait, That's Not an Option

Reflective judgment is a critical process that enables individuals to evaluate and analyze information in order to form well-founded conclusions. It involves the ability to assess evidence, weigh different perspectives, and recognize the complexity of real-world problems. Here we present our first results on this topic, shedding some light on the behavior of different models and on potential ways to improve their performance. You can also see our project website and the GitHub code.

## What do we measure?

We investigate Reflective Judgment (RJ): a model's ability to override its tendency to follow flawed instructions and critically evaluate its input, even if that means not providing an answer.

## Why RJ?

Blindly adhering to instructions can result in incorrect or harmful outputs, especially in high-stakes settings such as healthcare and decision-making systems. Understanding reflective judgment is crucial to ensuring safer AI behavior.

<Image
  src="/images/research/why-rj.webp"
  alt="Why RJ?"
  width={600}
  height={300}
/>

## How do we measure RJ?

To measure reflective judgment, we create two datasets. The first is the Basic Arithmetic Dataset (BAD), which has three difficulty levels: easy (single-digit addition problems), medium (two-digit problems), and hard (three-digit problems). In the BAD dataset, we present each question with only incorrect options. Additionally, we sample questions from the MMLU dataset across different domains, such as STEM and the Humanities, and similarly present them with two incorrect options.

We evaluate how often models correctly identify situations where no valid answer exists, or provide the correct solution even when it is not among the given options; we refer to these as reflective actions. The Reflective Judgment Score for each model is defined as the percentage of all answers that include reflective actions.
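The setup above can be sketched in a few lines. The item construction and the response labels below are assumptions for illustration, not the paper's actual implementation:

```python
def make_bad_item(a: int, b: int) -> dict:
    """Build one toy BAD-style item: an addition question whose two listed
    options are both deliberately incorrect (construction details assumed)."""
    correct = a + b
    return {
        "question": f"What is {a} + {b}?",
        "options": [correct - 1, correct + 2],  # neither option is correct
        "answer": correct,
    }

def rj_score(judgments: list[str]) -> float:
    """Reflective Judgment Score: the percentage of answers that are
    reflective actions, i.e. the model flags that no option is valid or
    states the true answer despite it being absent from the options."""
    reflective = sum(j in ("no_valid_option", "gave_true_answer") for j in judgments)
    return 100.0 * reflective / len(judgments)

item = make_bad_item(7, 5)
assert item["answer"] not in item["options"]

# Four hypothetical model responses to items like the one above:
print(rj_score(["no_valid_option", "picked_option", "gave_true_answer", "picked_option"]))  # 50.0
```

A model that always picks one of the (wrong) listed options scores 0, while a model that consistently rejects the flawed options scores 100.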

## Our Findings

1. Models excel at basic tasks but falter in complex reasoning: language models handle simple arithmetic well but struggle with Reflective Judgment.
2. Training impacts critical reasoning: base models outperform their instruction-tuned and aligned variants on reflective tasks, showing that fine-tuning can reduce critical reasoning.
3. Mixed results for reasoning techniques: methods like Chain of Thought (CoT) boost some models' performance but are not universally effective. The o1-mini model, despite using thinking tokens to structure its reasoning, performed poorly on complex tasks, showing that explicit reasoning alone isn't enough.
4. Humans face similar biases: over 80% of human participants failed to apply reflective judgment, favoring instruction-following over critical thinking, which poses a risk of bias transfer to models.

<Image
  src="/images/research/rj-score.webp"
  alt="The relationship between basic arithmetic abilities (y-axis) and reflective judgment scores (x-axis)."
  width={600}
  height={300}
/>

We can see that most fine-tuned/aligned models obtain good results when the correct option is provided but perform poorly when faced with questions containing two incorrect options.

<Image
  src="/images/research/qwen-llama.webp"
  alt="Performance of Llama 3.1 models (8B, 70B, 405B) and Qwen2.5 models (7B, 14B, 32B) on simple arithmetic tasks."
  width={600}
  height={300}
/>

The performance of Llama 3.1 models (8B, 70B, 405B) and Qwen2.5 models (7B, 14B, 32B) on simple arithmetic tasks demonstrates that Reflective Judgment improves with increasing model size.

We also conducted an experiment on humans, which showed similar patterns: more than 80% struggled to critically evaluate questions without correct options, demonstrating shared challenges in judgment. This suggests that human biases might influence models during training, highlighting the need for clearer guidelines to reduce misleading instructions and bias.

This work will be presented at the first Workshop on Large Foundation Models for Educational Assessment at [NeurIPS 2024](https://neurips2024edu.github.io/). See you there!
---
title: "We Are Here for the Long Haul"
description: "K-Scale Labs' vision for the future of humanoid robotics and AI."
date: "August 8, 2024"
image: "/images/research/css-pattern3.png"
---

# We Are Here for the Long Haul

We believe the world will soon see an exponential increase in the number of useful and affordable humanoids deployed across labs, warehouses, and households worldwide. The price of the hardware will soon drop below the cost of your favorite VR headset¹, and the AI software will become much more sophisticated and mature.

We also believe advancements in this field should be publicly accessible, which is why we created K-Scale Labs. The best part of building in the open-source spirit is sharing everything with the community instead of keeping it behind closed doors. With our updates, we want to share what we build and what we learn along the way.

## Laying the foundation

My co-founder Ben likes to say that the main reason GPT-2 was adopted faster and more widely than BERT is that you could immediately see the poetry it generated after training.² We are far from that setup in robotics, and K-Scale's mission is to change that. In the coming weeks we will share updates on our affordable humanoid platform, which will open up new possibilities for roboticists, MLEs, and enthusiasts, letting them easily test new models and skills at home or in the lab.

In 2018, I collected the largest task-oriented dialogue dataset available to the community. 10,000 dialogues felt like more than enough (sic!), but the reality was that the foundation model was still missing. Just like the NLP world once did, the robotics world is still searching for that foundation, and it feels like we're making the same mistakes along the way. To reach a world of ubiquitous robots, you still have to make them truly generalizable, and we believe that achieving this is only possible by being ML-first while staying fully hardware-aware.

We recognize the long road ahead in creating truly useful and widely adopted robots, but we're excited to embrace the challenge and to contribute to a world where embodied intelligence is cheap, plentiful, and useful.

## The data challenge: quality and diversity

Collecting the right data at scale is challenging, to put it mildly. There are two main issues with every collection effort: data quality and data diversity. Having spent a considerable portion of my ML career collecting conversational data, I believe you can't escape these constraints, even with unlimited resources. Recent datasets like OXE and DROID (and many more) are fantastic steps forward, but the problem of quality will only become more challenging. Let's look at some examples from the DROID dataset below:

<iframe
  width="600"
  height="300"
  src="https://www.youtube.com/embed/jvzAASmvicY"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
></iframe>

### The instruction is "Spread the jeans on the couch".

<iframe
  width="600"
  height="300"
  src="https://www.youtube.com/embed/i6wX3XxtIU8"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
></iframe>

### The instruction is "Move the cup to the left and cover it".

The ambiguity of the annotations is a major issue. Given the challenges of the real world and the lack of a foundation model, any noise in the annotations, instead of helping the model generalize, will cause it to overfit to the noise. We will soon share our first datasets and tools for building filter models that can act as initial checks against incorrect annotations. Nevertheless, they will never be perfect.
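As a rough illustration of what a first-pass annotation check might look like, here is a minimal heuristic sketch. The rule set (minimum length, ambiguous referents like "it", a small verb whitelist) is entirely hypothetical; the actual filter models we plan to share would be learned rather than hand-written:

```python
# Hypothetical heuristic rules for flagging suspicious episode annotations.
AMBIGUOUS_WORDS = {"it", "thing", "stuff", "something", "there"}
KNOWN_VERBS = {"pick", "place", "move", "open", "close",
               "push", "pull", "spread", "cover", "put"}

def annotation_flags(instruction: str) -> list[str]:
    """Return a list of reasons an annotation looks suspicious (empty = passes)."""
    flags = []
    words = instruction.lower().split()
    if len(words) < 3:
        flags.append("too_short")
    if AMBIGUOUS_WORDS & set(words):
        flags.append("ambiguous_referent")
    if not KNOWN_VERBS & set(words):
        flags.append("no_known_verb")
    return flags

# The second DROID example above would already be flagged:
print(annotation_flags("Move the cup to the left and cover it"))  # ['ambiguous_referent']
```

Even such crude rules catch the "cover it" ambiguity shown in the video above, but they say nothing about whether the video actually matches the instruction, which is where learned filters become necessary.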

Teleoperation feels like an obvious path, and most major robotics companies are taking it³. However, to use a well-worn analogy, it's only the icing on the cake.

<Image
  src="/images/research/yann_lecun_cake_analogy.webp"
  alt="Yann LeCun's cake analogy"
  width={600}
  height={300}
/>

We are quite skeptical of teleoperation as a solution, since it distracts from the core of the problem. The harsh reality is that our "a lot" of data isn't really that much, and data diversity will be a major bottleneck if we don't get these robots into real-world environments. And just as in the NLP world, we collect vast amounts of data to train on specific tasks, which incentivizes overfitting.

<iframe
  width="600"
  height="300"
  src="https://www.youtube.com/embed/esD0WWW5YoQ"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
></iframe>

## Streamlining development

Dealing with CUDA issues when working on ML models is frustrating enough. Add ad-hoc URDF changes, different simulators with their own quirks, and optimizing for sim-to-real on top of that, and ML robotics development becomes a truly painful experience. That's why we're building tools to go quickly from CAD designs to modeling in your favorite simulator with URDF or XML files. All of this is shared in the K-Scale Onshape library.

## Modeling through simplicity

The UMI and ALOHA projects popularized ACT and diffusion architectures. IsaacLab, LeRobot, and many other packages significantly lower the barrier to entry for newcomers to the field. Below, you can see a simple policy trained on a handful of examples, with the model generalizing to a new background.

<iframe
  width="600"
  height="300"
  src="https://www.youtube.com/embed/hkIsv1gtwE0"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
></iframe>

## Conclusions

There will be billions of autonomous, affordable, and helpful robots in the world, enabling us to do more productive and creative work. K-Scale Labs' mission is to bring them to market at an affordable price, with a simple software ecosystem in which hardware and software are tightly integrated and driven by a single model.

To fulfill this dream, the number of economic and research challenges we face seems endless. If you feel you can help us tackle them, let us know! We've signed up for quite a long journey.

[1] If you don't believe us, just take a look at the developments here, here, or here, or at dozens of other great labs.

[2] Even though downstream performance doesn't differ much between pre-training tasks; see https://arxiv.org/abs/2205.05131.

[3] See how this works at the Tesla factory.