System Info
PyTorch version: 2.5.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 15.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.0
Libc version: N/A
Python version: 3.10.16 (main, Dec 11 2024, 10:22:29) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-15.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M3 Max
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnxruntime==1.20.1
[pip3] torch==2.5.1
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.5.1 pypi_0 pypi
Information
- The official example scripts
- My own modified scripts
🐛 Describe the bug
With the Fireworks Llama 3.1 405B provider, I'm hitting a 400 error saying the maximum context length is 32,767 tokens.
Repro:
from llama_stack_client import LlamaStackClient
import os

MODEL_ID = "meta-llama/Llama-3.1-405B-Instruct-FP8"
messages = [{
    "role": "user",
    "content": "What is the weather in San Francisco?" * 6000,
}]
client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")
response = client.inference.chat_completion(
    model_id=MODEL_ID,
    messages=messages,
)
print(response.completion_message.content)
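
For context, a rough back-of-the-envelope estimate of the repro prompt's size (hypothetical helper, using the common ~4 characters per token heuristic; the exact Llama tokenizer count will differ):

# Rough size check on the repro prompt, not part of the repro itself
prompt = "What is the weather in San Francisco?" * 6000
approx_tokens = len(prompt) // 4  # ~4 chars/token heuristic
print(f"~{approx_tokens} tokens")  # roughly 55k: over 32,767 but well under 128k

So the prompt lands between the 32,767-token cap the provider enforces and the 128k context window the model actually supports, which is exactly the gap this issue is about.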
Error logs
Expected behavior
The request should not fail unless it exceeds the model's 128k context window.
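
Until the provider-side cap is raised, a minimal client-side workaround sketch (assuming the same ~4 chars/token heuristic; MAX_TOKENS is an arbitrary safety margin I'm introducing here, not a llama-stack setting):

MAX_TOKENS = 32_000          # stay a bit under the 32,767 cap Fireworks enforces
MAX_CHARS = MAX_TOKENS * 4   # rough chars-per-token conversion

# Truncate any message body that would blow past the enforced limit
for m in messages:
    if len(m["content"]) > MAX_CHARS:
        m["content"] = m["content"][:MAX_CHARS]

response = client.inference.chat_completion(model_id=MODEL_ID, messages=messages)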