src/content/research/efficient-vision-language-action-models.mdx

---
title: "Introducing K-Scale Labs"
description: "Our mission at K-Scale Labs is to move humanity to a Type 1 Kardashev civilization."
date: "August 7, 2024"
image: "/images/research/css-pattern2.png"
---

# Introducing K-Scale Labs

In 1964, Soviet astronomer Nikolai Kardashev proposed a scale for measuring a civilization's level of technological advancement based on its energy consumption. A Type 1 civilization on this scale is one which can harness all the energy available on its planet. A Type 2 civilization is one which can harness all of the energy from a star. A Type 3 civilization is one which can harness all the energy from a galaxy.

Our mission at K-Scale Labs is to move humanity to a Type 1 Kardashev civilization within my lifetime. Why is this a good idea? Barring more abstract philosophical conceptions of "the good life", harnessing more energy is what makes people's lives better. Famine, poverty, natural disasters, the cost of living: most of the problems that people care about stem, at their core, from a collective inability to harness energy for useful outcomes in one way or another. A world where it is practically free to have something done, in other words a world with less scarcity and more abundance, is one in which most of humanity's problems become political rather than technical.

<Image
  src="/images/research/kscale_projections.webp"
  alt="Kardashev Scale Projection"
  width={600}
  height={300}
/>

### Kardashev scale projections for Earth

<Image
  src="/images/research/time-to-t1.webp"
  alt="Time-to-T1 for different growth rates in energy consumption"
  width={600}
  height={300}
/>

### Time-to-T1 for different growth rates in energy consumption

As an engineering problem, reaching a Type 1 civilization in my lifetime felt, until very recently, like an impossible task. In the context of general-purpose intelligence, however, it seems much more tractable. In a world with a general-purpose agent for every human, increasing humanity's energy consumption by 15% annually simply means something like 30% of those agents copying themselves once per year.
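To make the time horizon concrete, here is a minimal sketch of the growth arithmetic behind the projections above. The constants are assumptions for illustration, not figures from this post: the Type 1 threshold of roughly 1e16 W comes from Sagan's interpolation formula, and today's consumption is taken as roughly 2e13 W (about 20 TW).

```python
import math

# Assumed constants (not from this post):
P_TYPE_1 = 1e16   # watts, ~Type 1 threshold under Sagan's formula
P_TODAY = 2e13    # watts, rough current global power consumption

def kardashev(power_watts: float) -> float:
    """Sagan's continuous Kardashev rating: K = (log10(P) - 6) / 10."""
    return (math.log10(power_watts) - 6) / 10

def years_to_type_1(annual_growth: float, p0: float = P_TODAY) -> float:
    """Years t until p0 * (1 + g)^t reaches the Type 1 threshold."""
    return math.log(P_TYPE_1 / p0) / math.log(1 + annual_growth)

print(round(kardashev(P_TODAY), 2))   # today's rating, about 0.73
print(round(years_to_type_1(0.15)))   # ~44 years at 15% annual growth
print(round(years_to_type_1(0.02)))   # over 300 years at ~2% growth
```

Under these assumptions, sustained 15% annual growth reaches Type 1 within a few decades, while historical ~2% growth takes centuries, which is the gap the agent-copying argument is meant to close.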

Our goal at K-Scale Labs is to make this world possible by designing a platform for general-purpose embodied intelligence and making it freely available for anyone to build on. We believe that working towards this future should be the collective project of all of humanity rather than the work of a few companies in Silicon Valley: in the same way that beavers build dams and birds build nests, humans build abundance. If you would like to build this future with us, we would be happy to have you.

...thats-not-an-option-llms-robustness-with-in-correct-multiple-choice-options.mdx

---
title: "Wait, That's Not an Option: LLM Robustness with Incorrect Multiple-Choice Options"
description: "Exploring Reflective Judgment in language models and their ability to critically evaluate input even in flawed multiple-choice scenarios."
date: "October 14, 2024"
image: "/images/research/css-pattern4.png"
---

# Wait, That's Not an Option

Reflective judgment is a critical process that enables individuals to evaluate and analyze information in order to form well-founded conclusions. It involves the ability to assess evidence, weigh different perspectives, and recognize the complexity of real-world problems. Here we present our first results on this topic, shedding some light on the behavior of different models and on potential ways to improve their performance. You can also see our project website and the GitHub code.

## What do we measure?

We investigate Reflective Judgment (RJ): a model's ability to override its tendency to follow flawed instructions and critically evaluate its input, even if that means not providing an answer.

## Why RJ?

Blindly adhering to instructions can result in incorrect or harmful outputs, especially in high-stakes settings such as healthcare and decision-making systems. Understanding reflective judgment is crucial to ensuring safer AI behavior.

<Image
  src="/images/research/why-rj.webp"
  alt="Why RJ?"
  width={600}
  height={300}
/>

## How do we measure RJ?

To measure reflective judgment, we create two datasets. The first is the Basic Arithmetic Dataset (BAD), which has three difficulty levels: easy (single-digit addition problems), medium (two-digit problems), and hard (three-digit problems). In the BAD dataset, we present each question with only incorrect options. Additionally, we sample questions from the MMLU dataset across different domains, such as STEM and the Humanities, and similarly present them with two incorrect options.

We evaluate how often models correctly identify situations where no valid answer exists, or provide the correct solution even when it is not among the given options; we refer to these as reflective actions. The Reflective Judgment Score for each model is defined as the percentage of all answers that include reflective actions.
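The setup above can be sketched in a few lines. The item construction and the response labels below are assumptions for illustration, not the paper's actual implementation:

```python
def make_bad_item(a: int, b: int) -> dict:
    """Build one toy BAD-style item: an addition question whose two listed
    options are both deliberately incorrect (construction details assumed)."""
    correct = a + b
    return {
        "question": f"What is {a} + {b}?",
        "options": [correct - 1, correct + 2],  # neither option is correct
        "answer": correct,
    }

def rj_score(judgments: list[str]) -> float:
    """Reflective Judgment Score: the percentage of answers that are
    reflective actions, i.e. the model flags that no option is valid or
    states the true answer despite it being absent from the options."""
    reflective = sum(j in ("no_valid_option", "gave_true_answer") for j in judgments)
    return 100.0 * reflective / len(judgments)

item = make_bad_item(7, 5)
assert item["answer"] not in item["options"]

# Four hypothetical model responses to items like the one above:
print(rj_score(["no_valid_option", "picked_option", "gave_true_answer", "picked_option"]))  # 50.0
```

A model that always picks one of the (wrong) listed options scores 0, while a model that consistently rejects the flawed options scores 100.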

## Our Findings

1. Models excel at basic tasks but falter in complex reasoning: language models handle simple arithmetic well but struggle with Reflective Judgment.
2. Training impacts critical reasoning: base models outperform their instruction-tuned and aligned variants on reflective tasks, showing that fine-tuning can reduce critical reasoning.
3. Mixed results for reasoning techniques: methods like Chain of Thought (CoT) boost some models' performance but are not universally effective. The o1-mini model, despite using thinking tokens to structure its reasoning, performed poorly on complex tasks, showing that explicit reasoning alone isn't enough.
4. Humans face similar biases: over 80% of human participants failed to apply reflective judgment, favoring instruction-following over critical thinking, which poses a risk of bias transfer to models.

<Image
  src="/images/research/rj-score.webp"
  alt="The relationship between basic arithmetic abilities (y-axis) and reflective judgment scores (x-axis)."
  width={600}
  height={300}
/>

We can see that most fine-tuned/aligned models obtain good results when the correct option is provided but perform poorly when faced with questions containing two incorrect options.

<Image
  src="/images/research/qwen-llama.webp"
  alt="Performance of Llama 3.1 models (8B, 70B, 405B) and Qwen2.5 models (7B, 14B, 32B) on simple arithmetic tasks."
  width={600}
  height={300}
/>

The performance of Llama 3.1 models (8B, 70B, 405B) and Qwen2.5 models (7B, 14B, 32B) on simple arithmetic tasks demonstrates that Reflective Judgment improves with increasing model size.

We also conducted an experiment on humans, which showed similar patterns: more than 80% struggled to critically evaluate questions without correct options, demonstrating shared challenges in judgment. This suggests that human biases might influence models during training, highlighting the need for clearer guidelines to reduce misleading instructions and bias.

This work will be presented at the first Workshop on Large Foundation Models for Educational Assessment at [NeurIPS 2024](https://neurips2024edu.github.io/). See you there!
---
title: "We Are Here for the Long Haul"
description: "K-Scale Labs' vision for the future of humanoid robotics and AI."
date: "August 8, 2024"
image: "/images/research/css-pattern3.png"
---

# We Are Here for the Long Haul

We believe the world will soon see an exponential increase in the number of useful and affordable humanoids deployed across labs, warehouses, and households worldwide. The price of the hardware will soon drop below the cost of your favorite VR headset¹, and the AI software will become much more sophisticated and mature.

We also believe advancements in this field should be publicly accessible, which is why we created K-Scale Labs. The best part of building in the open-source spirit is sharing everything with the community instead of keeping it behind closed doors. With our updates, we want to share what we build and what we learn along the way.

## Laying the foundation

My co-founder Ben likes to say that the main reason GPT-2 was adopted faster and more widely than BERT is that you could immediately see the poetry it generated after training.² We are far from that setup in robotics, and K-Scale's mission is to change that. In the coming weeks we will share updates on our affordable humanoid platform, which will open up new possibilities for roboticists, MLEs, and enthusiasts, letting them easily test new models and skills at home or in the lab.

In 2018, I collected the largest task-oriented dialogue dataset available to the community. 10,000 dialogues felt like more than enough (sic!), but the reality was that the foundation model was still missing. Just like the NLP world once did, the robotics world is still searching for that foundation, and it feels like we're making the same mistakes along the way. To reach a world of ubiquitous robots, you still have to make them truly generalizable, and we believe that achieving this is only possible by being ML-first while staying fully hardware-aware.

We recognize the long road ahead in creating truly useful and widely adopted robots, but we're excited to embrace the challenge and to contribute to a world where embodied intelligence is cheap, plentiful, and useful.

## The data challenge: quality and diversity

Collecting the right data at scale is challenging, to put it mildly. There are two main issues with every collection effort: data quality and data diversity. Having spent a considerable portion of my ML career collecting conversational data, I believe you can't escape these constraints, even with unlimited resources. Recent datasets like OXE and DROID (and many more) are fantastic steps forward, but the problem of quality will only become more challenging. Let's look at some examples from the DROID dataset below:

<iframe
  width="600"
  height="300"
  src="https://www.youtube.com/embed/jvzAASmvicY"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
></iframe>

### The instruction is "Spread the jeans on the couch".

<iframe
  width="600"
  height="300"
  src="https://www.youtube.com/embed/i6wX3XxtIU8"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
></iframe>

### The instruction is "Move the cup to the left and cover it".

The ambiguity of the annotations is a major issue. Given the challenges of the real world and the lack of a foundation model, any noise in the annotations, instead of helping the model generalize, will cause it to overfit to the noise. We will soon share our first datasets and tools for building filter models that can act as initial checks against incorrect annotations. Nevertheless, they will never be perfect.
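As a rough illustration of what a first-pass annotation check might look like, here is a minimal heuristic sketch. The rule set (minimum length, ambiguous referents like "it", a small verb whitelist) is entirely hypothetical; the actual filter models we plan to share would be learned rather than hand-written:

```python
# Hypothetical heuristic rules for flagging suspicious episode annotations.
AMBIGUOUS_WORDS = {"it", "thing", "stuff", "something", "there"}
KNOWN_VERBS = {"pick", "place", "move", "open", "close",
               "push", "pull", "spread", "cover", "put"}

def annotation_flags(instruction: str) -> list[str]:
    """Return a list of reasons an annotation looks suspicious (empty = passes)."""
    flags = []
    words = instruction.lower().split()
    if len(words) < 3:
        flags.append("too_short")
    if AMBIGUOUS_WORDS & set(words):
        flags.append("ambiguous_referent")
    if not KNOWN_VERBS & set(words):
        flags.append("no_known_verb")
    return flags

# The second DROID example above would already be flagged:
print(annotation_flags("Move the cup to the left and cover it"))  # ['ambiguous_referent']
```

Even such crude rules catch the "cover it" ambiguity shown in the video above, but they say nothing about whether the video actually matches the instruction, which is where learned filters become necessary.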

Teleoperation feels like an obvious path, and most major robotics companies are taking it³. However, to use a well-worn analogy, it's only the icing on the cake.

<Image
  src="/images/research/yann_lecun_cake_analogy.webp"
  alt="Yann LeCun's cake analogy"
  width={600}
  height={300}
/>

We are quite skeptical of teleoperation as a solution, since it distracts from the core of the problem. The harsh reality is that our "a lot" of data isn't really that much, and data diversity will be a major bottleneck if we don't get these robots into real-world environments. And just as in the NLP world, we collect vast amounts of data to train on specific tasks, which incentivizes overfitting.

<iframe
  width="600"
  height="300"
  src="https://www.youtube.com/embed/esD0WWW5YoQ"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
></iframe>

## Streamlining development

Dealing with CUDA issues when working on ML models is frustrating enough. Add ad-hoc URDF changes, different simulators with their own quirks, and optimizing for sim-to-real on top of that, and ML robotics development becomes a truly painful experience. That's why we're building tools to go quickly from CAD designs to modeling in your favorite simulator with URDF or XML files. All of this is shared in the K-Scale Onshape library.

## Modeling through simplicity

The UMI and ALOHA projects popularized ACT and diffusion architectures. IsaacLab, LeRobot, and many other packages significantly lower the barrier to entry for newcomers to the field. Below, you can see a simple policy trained on a handful of examples, with the model generalizing to a new background.

<iframe
  width="600"
  height="300"
  src="https://www.youtube.com/embed/hkIsv1gtwE0"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
></iframe>

## Conclusions

There will be billions of autonomous, affordable, and helpful robots in the world, enabling us to do more productive and creative work. K-Scale Labs' mission is to bring them to market at an affordable price, with a simple software ecosystem in which hardware and software are tightly integrated and driven by a single model.

To fulfill this dream, the number of economic and research challenges we face seems endless. If you feel you can help us tackle them, let us know! We've signed up for quite a long journey.

[1] If you don't believe us, just take a look at the developments here, here, or here, or at dozens of other great labs.

[2] Even though downstream performance doesn't differ much between pre-training tasks; see https://arxiv.org/abs/2205.05131.

[3] See how this works at the Tesla factory.