Fireworks: maximum context length: 32k - for Llama 405b (should be: 128k context length) #697

Open
aidando73 opened this issue Dec 30, 2024 · 2 comments
System Info

PyTorch version: 2.5.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.0
Libc version: N/A

Python version: 3.10.16 (main, Dec 11 2024, 10:22:29) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-15.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Max

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnxruntime==1.20.1
[pip3] torch==2.5.1
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.5.1                    pypi_0    pypi

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

With Llama 405B on the Fireworks provider, I'm hitting a 400 error saying the maximum context length is 32,767.

Repro:

from llama_stack_client import LlamaStackClient
import os

MODEL_ID = "meta-llama/Llama-3.1-405B-Instruct-FP8"
messages = [{
    "role": "user",
    "content": "What is the weather in San Francisco?" * 6000,
}]

client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")

response = client.inference.chat_completion(
    model_id=MODEL_ID,
    messages=messages,
)

print(response.completion_message.content)

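For context, a rough size estimate of that prompt (the per-repetition token count is an assumption, not a measurement with the Llama tokenizer):

# Assumption: "What is the weather in San Francisco?" is roughly 8 tokens under the Llama tokenizer.
reps = 6000
approx_tokens_per_rep = 8
print(reps * approx_tokens_per_rep)  # ~48,000, in line with the 48034 reported in the error below

So the prompt is over the 32,767 limit Fireworks reports, but well under Llama 3.1's 128k context window.
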
Error logs

identifier='meta-llama/Llama-3.1-405B-Instruct-FP8' provider_resource_id='fireworks/llama-v3p1-405b-instruct' provider_id='fireworks' type='model' metadata={} model_type=<ModelType.llm: 'llm'>
Traceback (most recent call last):
  File "/Users/aidand/dev/llama-stack/llama_stack/distribution/server/server.py", line 187, in endpoint
    return await maybe_await(value)
  File "/Users/aidand/dev/llama-stack/llama_stack/distribution/server/server.py", line 151, in maybe_await
    return await value
  File "/Users/aidand/dev/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/Users/aidand/dev/llama-stack/llama_stack/distribution/routers/routers.py", line 132, in chat_completion
    return await provider.chat_completion(**params)
  File "/Users/aidand/dev/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/Users/aidand/dev/llama-stack/llama_stack/providers/remote/inference/fireworks/fireworks.py", line 207, in chat_completion
    return await self._nonstream_chat_completion(request)
  File "/Users/aidand/dev/llama-stack/llama_stack/providers/remote/inference/fireworks/fireworks.py", line 216, in _nonstream_chat_completion
    r = await self._get_client().completion.acreate(**params)
  File "/Users/aidand/miniconda3/envs/llamastack-fireworks/lib/python3.10/site-packages/fireworks/client/base_completion.py", line 217, in _acreate_non_streaming
    response = await client.post_request_async_non_streaming(
  File "/Users/aidand/miniconda3/envs/llamastack-fireworks/lib/python3.10/site-packages/fireworks/client/api_client.py", line 191, in post_request_async_non_streaming
    await self._async_error_handling(response)
  File "/Users/aidand/miniconda3/envs/llamastack-fireworks/lib/python3.10/site-packages/fireworks/client/api_client.py", line 119, in _async_error_handling
    self._raise_for_status(resp)
  File "/Users/aidand/miniconda3/envs/llamastack-fireworks/lib/python3.10/site-packages/fireworks/client/api_client.py", line 113, in _raise_for_status
    self._raise_for(response.status_code, get_error_message)
  File "/Users/aidand/miniconda3/envs/llamastack-fireworks/lib/python3.10/site-packages/fireworks/client/api_client.py", line 73, in _raise_for
    raise InvalidRequestError(error_message())
fireworks.client.error.InvalidRequestError: {'error': {'object': 'error', 'type': 'invalid_request_error', 'message': 'The prompt is too long: 48034, model maximum context length: 32767'}}

Expected behavior

The request should not fail unless the prompt is over Llama 3.1 405B's 128k context limit.


aidando73 commented Dec 30, 2024

Note that if I change this:

        request = ChatCompletionRequest(
            model=model.provider_resource_id,

to this:

        request = ChatCompletionRequest(
            model="accounts/fireworks/models/llama-v3p1-405b-instruct",

it works.
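For reference, a minimal sketch of what that workaround looks like as a general mapping. The helper name and the prefix handling are my own illustration, assuming provider_resource_id has the form "fireworks/llama-v3p1-405b-instruct" as shown in the log above; this is not the provider's actual fix.

def to_fireworks_model_name(provider_resource_id: str) -> str:
    # Illustrative only: "fireworks/llama-v3p1-405b-instruct"
    # -> "accounts/fireworks/models/llama-v3p1-405b-instruct"
    short_name = provider_resource_id.split("/")[-1]
    return f"accounts/fireworks/models/{short_name}"

request = ChatCompletionRequest(
    model=to_fireworks_model_name(model.provider_resource_id),
    # ... other fields unchanged
)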


aidando73 commented Dec 30, 2024

@benjibc Do you know what the difference is between fireworks/llama-v3p1-405b-instruct and accounts/fireworks/models/llama-v3p3-70b-instruct?
