Fireworks: maximum context length: 32k - for Llama 405b (should be: 128k context length) #697

Open
aidando73 opened this issue Dec 30, 2024 · 2 comments
System Info

PyTorch version: 2.5.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.0
Libc version: N/A

Python version: 3.10.16 (main, Dec 11 2024, 10:22:29) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-15.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Max

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnxruntime==1.20.1
[pip3] torch==2.5.1
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.5.1                    pypi_0    pypi

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

With Llama 405B on the Fireworks provider, I'm hitting a 400 error saying the maximum context length is 32,767.

Repro:

from llama_stack_client import LlamaStackClient
import os

MODEL_ID = "meta-llama/Llama-3.1-405B-Instruct-FP8"
messages = [{
    "role": "user",
    "content": "What is the weather in San Francisco?" * 6000,
}]

client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")

response = client.inference.chat_completion(
    model_id=MODEL_ID,
    messages=messages,
)

print(response.completion_message.content)

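For context, a rough size estimate of that prompt (the per-repetition token count is an assumption, not a measurement with the Llama tokenizer):

# Assumption: "What is the weather in San Francisco?" is roughly 8 tokens under the Llama tokenizer.
reps = 6000
approx_tokens_per_rep = 8
print(reps * approx_tokens_per_rep)  # ~48,000, in line with the 48034 reported in the error below

So the prompt is over the 32,767 limit Fireworks reports, but well under Llama 3.1's 128k context window.
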
Error logs

identifier='meta-llama/Llama-3.1-405B-Instruct-FP8' provider_resource_id='fireworks/llama-v3p1-405b-instruct' provider_id='fireworks' type='model' metadata={} model_type=<ModelType.llm: 'llm'>
Traceback (most recent call last):
  File "/Users/aidand/dev/llama-stack/llama_stack/distribution/server/server.py", line 187, in endpoint
    return await maybe_await(value)
  File "/Users/aidand/dev/llama-stack/llama_stack/distribution/server/server.py", line 151, in maybe_await
    return await value
  File "/Users/aidand/dev/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/Users/aidand/dev/llama-stack/llama_stack/distribution/routers/routers.py", line 132, in chat_completion
    return await provider.chat_completion(**params)
  File "/Users/aidand/dev/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/Users/aidand/dev/llama-stack/llama_stack/providers/remote/inference/fireworks/fireworks.py", line 207, in chat_completion
    return await self._nonstream_chat_completion(request)
  File "/Users/aidand/dev/llama-stack/llama_stack/providers/remote/inference/fireworks/fireworks.py", line 216, in _nonstream_chat_completion
    r = await self._get_client().completion.acreate(**params)
  File "/Users/aidand/miniconda3/envs/llamastack-fireworks/lib/python3.10/site-packages/fireworks/client/base_completion.py", line 217, in _acreate_non_streaming
    response = await client.post_request_async_non_streaming(
  File "/Users/aidand/miniconda3/envs/llamastack-fireworks/lib/python3.10/site-packages/fireworks/client/api_client.py", line 191, in post_request_async_non_streaming
    await self._async_error_handling(response)
  File "/Users/aidand/miniconda3/envs/llamastack-fireworks/lib/python3.10/site-packages/fireworks/client/api_client.py", line 119, in _async_error_handling
    self._raise_for_status(resp)
  File "/Users/aidand/miniconda3/envs/llamastack-fireworks/lib/python3.10/site-packages/fireworks/client/api_client.py", line 113, in _raise_for_status
    self._raise_for(response.status_code, get_error_message)
  File "/Users/aidand/miniconda3/envs/llamastack-fireworks/lib/python3.10/site-packages/fireworks/client/api_client.py", line 73, in _raise_for
    raise InvalidRequestError(error_message())
fireworks.client.error.InvalidRequestError: {'error': {'object': 'error', 'type': 'invalid_request_error', 'message': 'The prompt is too long: 48034, model maximum context length: 32767'}}

Expected behavior

The request should not fail unless the prompt is over Llama 3.1 405B's 128k context limit.


aidando73 commented Dec 30, 2024

Note that if I change this:

        request = ChatCompletionRequest(
            model=model.provider_resource_id,

to this:

        request = ChatCompletionRequest(
            model="accounts/fireworks/models/llama-v3p1-405b-instruct",

it works.
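For reference, a minimal sketch of what that workaround looks like as a general mapping. The helper name and the prefix handling are my own illustration, assuming provider_resource_id has the form "fireworks/llama-v3p1-405b-instruct" as shown in the log above; this is not the provider's actual fix.

def to_fireworks_model_name(provider_resource_id: str) -> str:
    # Illustrative only: "fireworks/llama-v3p1-405b-instruct"
    # -> "accounts/fireworks/models/llama-v3p1-405b-instruct"
    short_name = provider_resource_id.split("/")[-1]
    return f"accounts/fireworks/models/{short_name}"

request = ChatCompletionRequest(
    model=to_fireworks_model_name(model.provider_resource_id),
    # ... other fields unchanged
)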


aidando73 commented Dec 30, 2024

@benjibc Do you know what the difference is between fireworks/llama-v3p1-405b-instruct and accounts/fireworks/models/llama-v3p3-70b-instruct?
