diff --git a/comps/agent/src/README.md b/comps/agent/src/README.md index 8d8360a962..68263e2df3 100644 --- a/comps/agent/src/README.md +++ b/comps/agent/src/README.md @@ -8,7 +8,7 @@ This agent microservice is built on Langchain/Langgraph frameworks. Agents integ We currently support the following types of agents. Please refer to the example config yaml (links in the table in [Section 1.2](#12-llm-engine)) for each agent strategy to see what environment variables need to be set up. -1. ReAct: use `react_langchain` or `react_langgraph` or `react_llama` as strategy. First introduced in this seminal [paper](https://arxiv.org/abs/2210.03629). The ReAct agent engages in "reason-act-observe" cycles to solve problems. Please refer to this [doc](https://python.langchain.com/v0.2/docs/how_to/migrate_agent/) to understand the differences between the langchain and langgraph versions of react agents. See table below to understand the validated LLMs for each react strategy. +1. ReAct: use `react_langchain` or `react_langgraph` or `react_llama` as strategy. First introduced in this seminal [paper](https://arxiv.org/abs/2210.03629). The ReAct agent engages in "reason-act-observe" cycles to solve problems. Please refer to this [doc](https://python.langchain.com/v0.2/docs/how_to/migrate_agent/) to understand the differences between the langchain and langgraph versions of react agents. See table below to understand the validated LLMs for each react strategy. We recommend using `react_llama` as it has the most features enabled, including agent memory, multi-turn conversations and assistants APIs. 2. RAG agent: use `rag_agent` or `rag_agent_llama` strategy. This agent is specifically designed for improving RAG performance. It has the capability to rephrase query, check relevancy of retrieved context, and iterate if context is not relevant. See table below to understand the validated LLMs for each rag agent strategy. 3. Plan and execute: `plan_execute` strategy. This type of agent first makes a step-by-step plan given a user request, and then execute the plan sequentially (or in parallel, to be implemented in future). If the execution results can solve the problem, then the agent will output an answer; otherwise, it will replan and execute again. 4. SQL agent: use `sql_agent_llama` or `sql_agent` strategy. This agent is specifically designed and optimized for answering questions aabout data in SQL databases. Users need to specify `db_name` and `db_path` for the agent to access the SQL database. For more technical details read descriptions [here](integrations/strategy/sqlagent/README.md). @@ -16,26 +16,26 @@ We currently support the following types of agents. Please refer to the example **Note**: 1. Due to the limitations in support for tool calling by TGI and vllm, we have developed subcategories of agent strategies (`rag_agent_llama`, `react_llama` and `sql_agent_llama`) specifically designed for open-source LLMs served with TGI and vllm. -2. For advanced developers who want to implement their own agent strategies, please refer to [Section 5](#5-customize-agent-strategy) below. +2. Currently only `react_llama` agent supports memory and multi-turn conversations. +3. For advanced developers who want to implement their own agent strategies, please refer to [Section 5](#5-customize-agent-strategy) below. ### 1.2 LLM engine -Agents use LLM for reasoning and planning. We support 3 options of LLM engine: +Agents use LLM for reasoning and planning. We support 2 options of LLM engine: -1. 
Open-source LLMs served with TGI. Follow the instructions in [Section 2.2.1](#221-start-agent-microservices-with-tgi). -2. Open-source LLMs served with vllm. Follow the instructions in [Section 2.2.2](#222-start-agent-microservices-with-vllm). -3. OpenAI LLMs via API calls. To use OpenAI llms, specify `llm_engine=openai` and `export OPENAI_API_KEY=` +1. Open-source LLMs served with vllm. Follow the instructions in [Section 2.2](#22-start-agent-microservices-with-vllm). +2. OpenAI LLMs via API calls. To use OpenAI llms, specify `llm_engine=openai` and `export OPENAI_API_KEY=` -| Agent type | `strategy` arg | Validated LLMs (serving SW) | Notes | Example config yaml | -| ---------------- | ----------------- | ------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | -| ReAct | `react_langchain` | [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (tgi-gaudi) (vllm-gaudi) | Only allows tools with one input variable | [react_langchain yaml](../../../tests/agent/react_langchain.yaml) | -| ReAct | `react_langgraph` | GPT-4o-mini, [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (vllm-gaudi), | if using vllm, need to specify `--enable-auto-tool-choice --tool-call-parser ${model_parser}`, refer to vllm docs for more info | [react_langgraph yaml](../../../tests/agent/react_vllm.yaml) | -| ReAct | `react_llama` | [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (tgi-gaudi) (vllm-gaudi) | Recommended for open-source LLMs, supports multiple tools and parallel tool calls. | [react_llama yaml](../../../tests/agent/reactllama.yaml) | -| RAG agent | `rag_agent` | GPT-4o-mini | | [rag_agent yaml](../../../tests/agent/ragagent_openai.yaml) | -| RAG agent | `rag_agent_llama` | [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (tgi-gaudi) (vllm-gaudi) | Recommended for open-source LLMs, only allows 1 tool with input variable to be "query" | [rag_agent_llama yaml](../../../tests/agent/ragagent.yaml) | -| Plan and execute | `plan_execute` | GPT-4o-mini, [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (vllm-gaudi) | use `--guided-decoding-backend lm-format-enforcer` when launching vllm. | [plan_execute yaml](../../../tests/agent/planexec_openai.yaml) | -| SQL agent | `sql_agent_llama` | [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (vllm-gaudi) | database query tool is natively integrated using Langchain's [QuerySQLDataBaseTool](https://python.langchain.com/api_reference/community/tools/langchain_community.tools.sql_database.tool.QuerySQLDatabaseTool.html). User can also register their own tools with this agent. | [sql_agent_llama yaml](../../../tests/agent/sql_agent_llama.yaml) | -| SQL agent | `sql_agent` | GPT-4o-mini | database query tool is natively integrated using Langchain's [QuerySQLDataBaseTool](https://python.langchain.com/api_reference/community/tools/langchain_community.tools.sql_database.tool.QuerySQLDatabaseTool.html). User can also register their own tools with this agent. 
| [sql_agent yaml](../../../tests/agent/sql_agent_openai.yaml) | +| Agent type | `strategy` arg | Validated LLMs (serving SW) | Notes | Example config yaml | +| ---------------- | ----------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | +| ReAct | `react_langchain` | [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (vllm-gaudi) | Only allows tools with one input variable | [react_langchain yaml](../../../tests/agent/react_langchain.yaml) | +| ReAct | `react_langgraph` | GPT-4o-mini, [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (vllm-gaudi) | if using vllm, need to specify `--enable-auto-tool-choice --tool-call-parser ${model_parser}`, refer to vllm docs for more info, only one tool call in each LLM output due to the limitations of llama3.1 modal and vllm tool call parser. | [react_langgraph yaml](../../../tests/agent/react_vllm.yaml) | +| ReAct | `react_llama` | [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), [llama3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)(vllm-gaudi) | Recommended for open-source LLMs, supports multiple tools and parallel tool calls. | [react_llama yaml](../../../tests/agent/reactllama.yaml) | +| RAG agent | `rag_agent` | GPT-4o-mini | | [rag_agent yaml](../../../tests/agent/ragagent_openai.yaml) | +| RAG agent | `rag_agent_llama` | [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), [llama3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) (vllm-gaudi) | Recommended for open-source LLMs, only allows 1 tool with input variable to be "query" | [rag_agent_llama yaml](../../../tests/agent/ragagent.yaml) | +| Plan and execute | `plan_execute` | GPT-4o-mini, [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (vllm-gaudi) | use `--guided-decoding-backend lm-format-enforcer` when launching vllm. | [plan_execute yaml](../../../tests/agent/planexec_openai.yaml) | +| SQL agent | `sql_agent_llama` | [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), [llama3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) (vllm-gaudi) | database query tool is natively integrated using Langchain's [QuerySQLDataBaseTool](https://python.langchain.com/api_reference/community/tools/langchain_community.tools.sql_database.tool.QuerySQLDatabaseTool.html). User can also register their own tools with this agent. | [sql_agent_llama yaml](../../../tests/agent/sql_agent_llama.yaml) | +| SQL agent | `sql_agent` | GPT-4o-mini | database query tool is natively integrated using Langchain's [QuerySQLDataBaseTool](https://python.langchain.com/api_reference/community/tools/langchain_community.tools.sql_database.tool.QuerySQLDatabaseTool.html). User can also register their own tools with this agent. 
| [sql_agent yaml](../../../tests/agent/sql_agent_openai.yaml) |

### 1.3 Tools

@@ -49,42 +49,77 @@ Examples of how to register tools can be found in [Section 4](#-4-provide-your-o

### 1.4 Agent APIs

-1. OpenAI compatible chat completions API
+We support two sets of APIs that are OpenAI compatible:
+
+1. OpenAI compatible chat completions API. Example usage with the Python code below.
+
+```python
+import json
+import requests
+
+url = f"http://{ip_address}:{agent_port}/v1/chat/completions"
+proxies = {"http": ""}
+
+# single-turn, not streaming -> if agent is used as a worker agent (i.e., tool for supervisor agent)
+payload = {"messages": query, "stream": False}
+resp = requests.post(url=url, json=payload, proxies=proxies, stream=False)
+
+# multi-turn, streaming -> to interface with users
+query = {"role": "user", "messages": user_message, "thread_id": thread_id, "stream": stream}
+content = json.dumps(query)
+resp = requests.post(url=url, data=content, proxies=proxies, stream=True)
+for line in resp.iter_lines(decode_unicode=True):
+    print(line)
+```
+
2. OpenAI compatible assistants APIs.

-**Note**: not all keywords are supported yet.
+   See example Python code [here](./test_assistant_api.py). There are 4 steps:

-## 🚀2. Start Agent Microservice
+   Step 1. create an assistant: /v1/assistants
+
+   Step 2. create a thread: /v1/threads
+
+   Step 3. send a message to the thread: /v1/threads/{thread_id}/messages
+
+   Step 4. run the assistant: /v1/threads/{thread_id}/runs
+
+**Note**:
+
+1. Currently only the `react_llama` agent is enabled for the assistants APIs.
+2. Not all keywords of the OpenAI APIs are supported yet.
+
+### 1.5 Agent memory
+
+We currently support two types of memory.
+
+1. `checkpointer`: agent memory stored in RAM, so it is volatile. It contains the agent state within a thread and is used to enable multi-turn conversations between the user and the agent. Both the chat completions API and the assistants APIs support this type of memory.
+2. `store`: agent memory stored in a Redis database, containing the agent states of all threads. Only the assistants APIs support this type of memory. It is also used to enable multi-turn conversations. In the future, we will explore algorithms that take advantage of the information in previous conversations to improve the agent's performance.
+
+**Note**: Currently only the `react_llama` agent supports memory and multi-turn conversations.

-### 2.1 Build Microservices
+#### How to enable agent memory?
+
+Specify `with_memory=true`. If you want to use persistent memory, also specify `memory_type=store` and launch a Redis database using the command below.

```bash
-cd GenAIComps/ # back to GenAIComps/ folder
-docker build -t opea/agent:latest -f comps/agent/src/Dockerfile . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
+# you can change the port from 6379 to another one.
+docker run -d -it -p 6379:6379 --rm --name "test-persistent-redis" --net=host --ipc=host --name redis-vector-db redis/redis-stack:7.2.0-v9
```

-#### 2.2.1 Start Agent microservices with TGI
+Examples of Python code for multi-turn conversations using agent memory:

-```bash
-export ip_address=$(hostname -I | awk '{print $1}')
-export model="meta-llama/Meta-Llama-3.1-70B-Instruct"
-export HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
-export HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
+1. [chat completions API with checkpointer](./test_chat_completion_multiturn.py)
+2. 
[assistants APIs with persistent store memory](./test_assistant_api.py) -# TGI serving on 4 Gaudi2 cards -docker run -d --runtime=habana --name "comps-tgi-gaudi-service" -p 8080:80 -v ./data:/data -e HF_TOKEN=$HF_TOKEN -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:latest --model-id $model --max-input-tokens 8192 --max-total-tokens 16384 --sharded true --num-shard 4 +To run the two examples above, first launch the agent microservice using [this docker compose yaml](../../../tests/agent/reactllama.yaml). -# check status -docker logs comps-tgi-gaudi-service +## 🚀2. Start Agent Microservice -# Agent: react_llama strategy -docker run -d --runtime=runc --name="comps-agent-endpoint" -v $WORKPATH/comps/agent/src/tools:/home/user/comps/agent/src/tools -p 9090:9090 --ipc=host -e HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e model=${model} -e ip_address=${ip_address} -e strategy=react_llama -e llm_endpoint_url=http://${ip_address}:8080 -e llm_engine=tgi -e recursion_limit=15 -e require_human_feedback=false -e tools=/home/user/comps/agent/src/tools/custom_tools.yaml opea/agent:latest +### 2.1 Build docker image for agent microservice -# check status -docker logs comps-agent-endpoint +```bash +cd GenAIComps/ # back to GenAIComps/ folder +docker build -t opea/agent:latest -f comps/agent/src/Dockerfile . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy ``` -#### 2.2.2 Start Agent microservices with vllm +### 2.2 Start Agent microservices with vllm ```bash export ip_address=$(hostname -I | awk '{print $1}') @@ -94,8 +129,9 @@ export HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} export vllm_volume=${YOUR_LOCAL_DIR_FOR_MODELS} # build vLLM image -git clone https://github.com/vllm-project/vllm.git -cd ./vllm +git clone https://github.com/HabanaAI/vllm-fork.git +cd ./vllm-fork +git checkout v0.6.4.post2+Gaudi-1.19.0 docker build -f Dockerfile.hpu -t opea/vllm-gaudi:latest --shm-size=128g . 
--build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy # vllm serving on 4 Gaudi2 cards @@ -105,7 +141,7 @@ docker run -d --runtime=habana --rm --name "comps-vllm-gaudi-service" -p 8080:80 docker logs comps-vllm-gaudi-service # Agent -docker run -d --runtime=runc --name="comps-agent-endpoint" -v $WORKPATH/comps/agent/src/tools:/home/user/comps/agent/src/tools -p 9090:9090 --ipc=host -e HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e model=${model} -e ip_address=${ip_address} -e strategy=react_llama -e llm_endpoint_url=http://${ip_address}:8080 -e llm_engine=vllm -e recursion_limit=15 -e require_human_feedback=false -e tools=/home/user/comps/agent/src/tools/custom_tools.yaml opea/agent:latest +docker run -d --runtime=runc --name="comps-agent-endpoint" -v $WORKPATH/comps/agent/src/tools:/home/user/comps/agent/src/tools -p 9090:9090 --ipc=host -e HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e model=${model} -e ip_address=${ip_address} -e strategy=react_llama -e with_memory=true -e llm_endpoint_url=http://${ip_address}:8080 -e llm_engine=vllm -e recursion_limit=15 -e require_human_feedback=false -e tools=/home/user/comps/agent/src/tools/custom_tools.yaml opea/agent:latest # check status docker logs comps-agent-endpoint @@ -114,7 +150,7 @@ docker logs comps-agent-endpoint > debug mode > > ```bash -> docker run --rm --runtime=runc --name="comps-agent-endpoint" -v ./comps/agent/src/:/home/user/comps/agent/src/ -p 9090:9090 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e model=${model} -e ip_address=${ip_address} -e strategy=react_llama -e llm_endpoint_url=http://${ip_address}:8080 -e llm_engine=vllm -e recursion_limit=15 -e require_human_feedback=false -e tools=/home/user/comps/agent/src/tools/custom_tools.yaml opea/agent:latest +> docker run --rm --runtime=runc --name="comps-agent-endpoint" -v ./comps/agent/src/:/home/user/comps/agent/src/ -p 9090:9090 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e model=${model} -e ip_address=${ip_address} -e strategy=react_llama -e with_memory=true -e llm_endpoint_url=http://${ip_address}:8080 -e llm_engine=vllm -e recursion_limit=15 -e require_human_feedback=false -e tools=/home/user/comps/agent/src/tools/custom_tools.yaml opea/agent:latest > ``` ## 🚀 3. Validate Microservice @@ -123,9 +159,14 @@ Once microservice starts, user can use below script to invoke. ### 3.1 Use chat completions API +For multi-turn conversations, first specify a `thread_id`. + ```bash +export thread_id= curl http://${ip_address}:9090/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{ - "query": "What is OPEA project?" 
+ "messages": "What is OPEA project?", + "thread_id":${thread_id}, + "stream":true }' # expected output @@ -143,12 +184,12 @@ data: [DONE] # step1 create assistant to get `asssistant_id` curl http://${ip_address}:9090/v1/assistants -X POST -H "Content-Type: application/json" -d '{ - "agent_config": {"llm_engine": "tgi", "llm_endpoint_url": "http://${ip_address}:8080", "tools": "/home/user/comps/agent/src/tools/custom_tools.yaml"} + "agent_config": {"llm_engine": "vllm", "llm_endpoint_url": "http://${ip_address}:8080", "tools": "/home/user/comps/agent/src/tools/custom_tools.yaml"} }' ## if want to persist your agent messages, set store config like this: curl http://${ip_address}:9090/v1/assistants -X POST -H "Content-Type: application/json" -d '{ - "agent_config": {"llm_engine": "tgi", "llm_endpoint_url": "http://${ip_address}:8080", "tools": "/home/user/comps/agent/src/tools/custom_tools.yaml","with_store":true, "store_config":{"redis_uri":"redis://${ip_address}:6379"}} + "agent_config": {"llm_engine": "vllm", "llm_endpoint_url": "http://${ip_address}:8080", "tools": "/home/user/comps/agent/src/tools/custom_tools.yaml","with_memory":true, "memory_type":"store", "store_config":{"redis_uri":"redis://${ip_address}:6379"}} }' # step2 create thread to get `thread_id` @@ -157,16 +198,16 @@ curl http://${ip_address}:9090/v1/threads -X POST -H "Content-Type: application/ # step3 create messages -curl http://${ip_address}:9091/v1/threads/{thread_id}/messages -X POST -H "Content-Type: application/json" -d '{"role": "user", "content": "What is OPEA project?"}' +curl http://${ip_address}:9090/v1/threads/{thread_id}/messages -X POST -H "Content-Type: application/json" -d '{"role": "user", "content": "What is OPEA project?"}' -## if agent is set with `with_store`, should add `assistant_id` in the messages for store +## if agent is set with `memory_type`=store, should add `assistant_id` in the messages for store -curl http://${ip_address}:9091/v1/threads/{thread_id}/messages -X POST -H "Content-Type: application/json" -d '{"role": "user", "content": "What is OPEA project?", "assistant_id": "{assistant_id}"}' +curl http://${ip_address}:9090/v1/threads/{thread_id}/messages -X POST -H "Content-Type: application/json" -d '{"role": "user", "content": "What is OPEA project?", "assistant_id": "{assistant_id}"}' # step4 run -curl http://${ip_address}:9091/v1/threads/{thread_id}/runs -X POST -H "Content-Type: application/json" -d '{"assistant_id": "{assistant_id}"}' +curl http://${ip_address}:9090/v1/threads/{thread_id}/runs -X POST -H "Content-Type: application/json" -d '{"assistant_id": "{assistant_id}"}' ``` diff --git a/comps/agent/src/agent.py b/comps/agent/src/agent.py index baeb179d2c..c7c2844eae 100644 --- a/comps/agent/src/agent.py +++ b/comps/agent/src/agent.py @@ -18,7 +18,7 @@ from comps.agent.src.integrations.agent import instantiate_agent from comps.agent.src.integrations.global_var import assistants_global_kv, threads_global_kv from comps.agent.src.integrations.thread import instantiate_thread_memory, thread_completion_callback -from comps.agent.src.integrations.utils import assemble_store_messages, get_args +from comps.agent.src.integrations.utils import assemble_store_messages, get_args, get_latest_human_message_from_store from comps.cores.proto.api_protocol import ( AssistantsObject, ChatCompletionRequest, @@ -40,7 +40,7 @@ logger.info("========initiating agent============") logger.info(f"args: {args}") -agent_inst = instantiate_agent(args, args.strategy, with_memory=args.with_memory) 
+agent_inst = instantiate_agent(args) class AgentCompletionRequest(ChatCompletionRequest): @@ -76,7 +76,7 @@ async def llm_generate(input: AgentCompletionRequest): if isinstance(input.messages, str): messages = input.messages else: - # TODO: need handle multi-turn messages + # last user message messages = input.messages[-1]["content"] # 2. prepare the input for the agent @@ -90,7 +90,6 @@ async def llm_generate(input: AgentCompletionRequest): else: logger.info("-----------NOT STREAMING-------------") response = await agent_inst.non_streaming_run(messages, config) - logger.info("-----------Response-------------") return GeneratedDoc(text=response, prompt=messages) @@ -100,14 +99,14 @@ class RedisConfig(BaseModel): class AgentConfig(BaseModel): stream: Optional[bool] = False - agent_name: Optional[str] = "OPEA_Default_Agent" + agent_name: Optional[str] = "OPEA_Agent" strategy: Optional[str] = "react_llama" - role_description: Optional[str] = "LLM enhanced agent" + role_description: Optional[str] = "AI assistant" tools: Optional[str] = None recursion_limit: Optional[int] = 5 - model: Optional[str] = "meta-llama/Meta-Llama-3-8B-Instruct" - llm_engine: Optional[str] = None + model: Optional[str] = "meta-llama/Llama-3.3-70B-Instruct" + llm_engine: Optional[str] = "vllm" llm_endpoint_url: Optional[str] = None max_new_tokens: Optional[int] = 1024 top_k: Optional[int] = 10 @@ -117,10 +116,14 @@ class AgentConfig(BaseModel): return_full_text: Optional[bool] = False custom_prompt: Optional[str] = None - # short/long term memory - with_memory: Optional[bool] = False - # persistence - with_store: Optional[bool] = False + # # short/long term memory + with_memory: Optional[bool] = True + # agent memory config + # chat_completion api: only supports checkpointer memory + # assistants api: supports checkpointer and store memory + # checkpointer: in-memory checkpointer - MemorySaver() + # store: redis store + memory_type: Optional[str] = "checkpointer" # choices: checkpointer, store store_config: Optional[RedisConfig] = None timeout: Optional[int] = 60 @@ -147,18 +150,17 @@ class CreateAssistant(CreateAssistantsRequest): ) def create_assistants(input: CreateAssistant): # 1. 
initialize the agent - agent_inst = instantiate_agent( - input.agent_config, input.agent_config.strategy, with_memory=input.agent_config.with_memory - ) + print("@@@ Initializing agent with config: ", input.agent_config) + agent_inst = instantiate_agent(input.agent_config) assistant_id = agent_inst.id created_at = int(datetime.now().timestamp()) with assistants_global_kv as g_assistants: g_assistants[assistant_id] = (agent_inst, created_at) logger.info(f"Record assistant inst {assistant_id} in global KV") - if input.agent_config.with_store: + if input.agent_config.memory_type == "store": logger.info("Save Agent Config to database") - agent_inst.with_store = input.agent_config.with_store + # agent_inst.memory_type = input.agent_config.memory_type print(input) global db_client if db_client is None: @@ -172,6 +174,7 @@ def create_assistants(input: CreateAssistant): return AssistantsObject( id=assistant_id, created_at=created_at, + model=input.agent_config.model, ) @@ -211,7 +214,7 @@ def create_messages(thread_id, input: CreateMessagesRequest): if isinstance(input.content, str): query = input.content else: - query = input.content[-1]["text"] + query = input.content[-1]["text"] # content is a list of MessageContent msg_id, created_at = thread_inst.add_query(query) structured_content = MessageContent(text=query) @@ -224,15 +227,18 @@ def create_messages(thread_id, input: CreateMessagesRequest): assistant_id=input.assistant_id, ) - # save messages using assistant_id as key + # save messages using assistant_id_thread_id as key if input.assistant_id is not None: with assistants_global_kv as g_assistants: agent_inst, _ = g_assistants[input.assistant_id] - if agent_inst.with_store: - logger.info(f"Save Agent Messages, assistant_id: {input.assistant_id}, thread_id: {thread_id}") + if agent_inst.memory_type == "store": + logger.info(f"Save Messages, assistant_id: {input.assistant_id}, thread_id: {thread_id}") # if with store, db_client initialized already global db_client - db_client.put(msg_id, message.model_dump_json(), input.assistant_id) + namespace = f"{input.assistant_id}_{thread_id}" + # put(key: str, val: dict, collection: str = DEFAULT_COLLECTION) + db_client.put(msg_id, message.model_dump_json(), namespace) + logger.info(f"@@@ Save message to db: {msg_id}, {message.model_dump_json()}, {namespace}") return message @@ -254,15 +260,24 @@ def create_run(thread_id, input: CreateRunResponse): with assistants_global_kv as g_assistants: agent_inst, _ = g_assistants[assistant_id] - config = {"recursion_limit": args.recursion_limit} + config = { + "recursion_limit": args.recursion_limit, + "configurable": {"session_id": thread_id, "thread_id": thread_id, "user_id": assistant_id}, + } - if agent_inst.with_store: - # assemble multi-turn messages + if agent_inst.memory_type == "store": global db_client - input_query = assemble_store_messages(db_client.get_all(assistant_id)) + namespace = f"{assistant_id}_{thread_id}" + # get the latest human message from store in the namespace + input_query = get_latest_human_message_from_store(db_client, namespace) + print("@@@@ Input_query from store: ", input_query) else: input_query = thread_inst.get_query() + print("@@@@ Input_query from thread_inst: ", input_query) + print("@@@ Agent instance:") + print(agent_inst.id) + print(agent_inst.args) try: return StreamingResponse( thread_completion_callback(agent_inst.stream_generator(input_query, config, thread_id), thread_id), diff --git a/comps/agent/src/integrations/agent.py b/comps/agent/src/integrations/agent.py 
index a7713a29bb..8503ebfd34 100644 --- a/comps/agent/src/integrations/agent.py +++ b/comps/agent/src/integrations/agent.py @@ -1,9 +1,13 @@ # Copyright (C) 2024 Intel Corporation # SPDX-License-Identifier: Apache-2.0 +from .storage.persistence_redis import RedisPersistence from .utils import load_python_prompt -def instantiate_agent(args, strategy="react_langchain", with_memory=False): +def instantiate_agent(args): + strategy = args.strategy + with_memory = args.with_memory + if args.custom_prompt is not None: print(f">>>>>> custom_prompt enabled, {args.custom_prompt}") custom_prompt = load_python_prompt(args.custom_prompt) @@ -22,7 +26,7 @@ def instantiate_agent(args, strategy="react_langchain", with_memory=False): print("Initializing ReAct Agent with LLAMA") from .strategy.react import ReActAgentLlama - return ReActAgentLlama(args, with_memory, custom_prompt=custom_prompt) + return ReActAgentLlama(args, custom_prompt=custom_prompt) elif strategy == "plan_execute": from .strategy.planexec import PlanExecuteAgentWithLangGraph diff --git a/comps/agent/src/integrations/strategy/base_agent.py b/comps/agent/src/integrations/strategy/base_agent.py index 8c0048b879..dd9f070bed 100644 --- a/comps/agent/src/integrations/strategy/base_agent.py +++ b/comps/agent/src/integrations/strategy/base_agent.py @@ -3,6 +3,9 @@ from uuid import uuid4 +from langgraph.checkpoint.memory import MemorySaver + +from ..storage.persistence_redis import RedisPersistence from ..tools import get_tools_descriptions from ..utils import adapt_custom_prompt, setup_chat_model @@ -12,11 +15,25 @@ def __init__(self, args, local_vars=None, **kwargs) -> None: self.llm = setup_chat_model(args) self.tools_descriptions = get_tools_descriptions(args.tools) self.app = None - self.memory = None self.id = f"assistant_{self.__class__.__name__}_{uuid4()}" self.args = args adapt_custom_prompt(local_vars, kwargs.get("custom_prompt")) - print(self.tools_descriptions) + print("Registered tools: ", self.tools_descriptions) + + if args.with_memory: + if args.memory_type == "checkpointer": + self.memory_type = "checkpointer" + self.checkpointer = MemorySaver() + self.store = None + elif args.memory_type == "store": + # print("Using Redis as store: ", args.store_config.redis_uri) + self.store = RedisPersistence(args.store_config.redis_uri) + self.memory_type = "store" + else: + raise ValueError("Invalid memory type!") + else: + self.store = None + self.checkpointer = None @property def is_vllm(self): @@ -60,10 +77,7 @@ async def non_streaming_run(self, query, config): try: async for s in self.app.astream(initial_state, config=config, stream_mode="values"): message = s["messages"][-1] - if isinstance(message, tuple): - print(message) - else: - message.pretty_print() + message.pretty_print() last_message = s["messages"][-1] print("******Response: ", last_message.content) diff --git a/comps/agent/src/integrations/strategy/react/planner.py b/comps/agent/src/integrations/strategy/react/planner.py index d500412171..1189d97a28 100644 --- a/comps/agent/src/integrations/strategy/react/planner.py +++ b/comps/agent/src/integrations/strategy/react/planner.py @@ -11,6 +11,7 @@ from langgraph.prebuilt import create_react_agent from ...global_var import threads_global_kv +from ...storage.persistence_redis import RedisPersistence from ...utils import filter_tools, has_multi_tool_inputs, tool_renderer from ..base_agent import BaseAgent from .prompt import REACT_SYS_MESSAGE, hwchase17_react_prompt @@ -148,7 +149,13 @@ async def non_streaming_run(self, query, 
config): from ...storage.persistence_memory import AgentPersistence, PersistenceConfig from ...utils import setup_chat_model -from .utils import assemble_history, assemble_memory, convert_json_to_tool_call +from .utils import ( + assemble_history, + assemble_memory, + assemble_memory_from_store, + convert_json_to_tool_call, + save_state_to_store, +) class AgentState(TypedDict): @@ -165,7 +172,7 @@ class ReActAgentNodeLlama: A workaround for open-source llm served by TGI-gaudi. """ - def __init__(self, tools, args): + def __init__(self, tools, args, store=None): from .prompt import REACT_AGENT_LLAMA_PROMPT from .utils import ReActLlamaOutputParser @@ -176,22 +183,34 @@ def __init__(self, tools, args): ) llm = setup_chat_model(args) self.tools = tools - self.chain = prompt | llm | output_parser + self.chain = prompt | llm + self.output_parser = output_parser self.with_memory = args.with_memory + self.memory_type = args.memory_type + self.store = store - def __call__(self, state): + def __call__(self, state, config): print("---CALL Agent node---") messages = state["messages"] # assemble a prompt from messages if self.with_memory: - query, history = assemble_memory(messages) - print("@@@ Query: ", history) + if self.memory_type == "checkpointer": + query, history, thread_history = assemble_memory(messages) + elif self.memory_type == "store": + # use thread_id, assistant_id to search memory from store + print("@@@ Load memory from store....") + query, history, thread_history = assemble_memory_from_store(config, self.store) # TODO + else: + raise ValueError("Invalid memory type!") else: query = messages[0].content history = assemble_history(messages) - print("@@@ History: ", history) + thread_history = "" + + print("@@@ Turn History:\n", history) + print("@@@ Thread history:\n", thread_history) tools_used = self.tools if state.get("tool_choice") is not None: @@ -200,31 +219,40 @@ def __call__(self, state): tools_descriptions = tool_renderer(tools_used) print("@@@ Tools description: ", tools_descriptions) - # invoke chain - output = self.chain.invoke({"input": query, "history": history, "tools": tools_descriptions}) + # invoke chain: raw output from llm + response = self.chain.invoke( + {"input": query, "history": history, "tools": tools_descriptions, "thread_history": thread_history} + ) + response = response.content + + # parse tool calls or answers from raw output: result is a list + output = self.output_parser.parse(response) print("@@@ Output from chain: ", output) # convert output to tool calls tool_calls = [] - for res in output: - if "tool" in res: - add_kw_tc, tool_call = convert_json_to_tool_call(res) - # print("Tool call:\n", tool_call) - tool_calls.append(tool_call) - - if tool_calls: - ai_message = AIMessage(content="", additional_kwargs=add_kw_tc, tool_calls=tool_calls) - elif "answer" in output[0]: - ai_message = AIMessage(content=str(output[0]["answer"])) + if output: + for res in output: + if "tool" in res: + tool_call = convert_json_to_tool_call(res) + # print("Tool call:\n", tool_call) + tool_calls.append(tool_call) + + if tool_calls: + ai_message = AIMessage(content=response, tool_calls=tool_calls) + elif "answer" in output[0]: + ai_message = AIMessage(content=str(output[0]["answer"])) else: - ai_message = AIMessage(content=output) + ai_message = AIMessage(content=response) + return {"messages": [ai_message]} class ReActAgentLlama(BaseAgent): - def __init__(self, args, with_memory=False, **kwargs): + def __init__(self, args, **kwargs): super().__init__(args, 
local_vars=globals(), **kwargs) - agent = ReActAgentNodeLlama(tools=self.tools_descriptions, args=args) + + agent = ReActAgentNodeLlama(tools=self.tools_descriptions, args=args, store=self.store) tool_node = ToolNode(self.tools_descriptions) workflow = StateGraph(AgentState) @@ -261,11 +289,12 @@ def __init__(self, args, with_memory=False, **kwargs): workflow.add_edge("tools", "agent") if args.with_memory: - self.persistence = AgentPersistence( - config=PersistenceConfig(checkpointer=args.with_memory, store=args.with_store) - ) - print(self.persistence.checkpointer) - self.app = workflow.compile(checkpointer=self.persistence.checkpointer, store=self.persistence.store) + if args.memory_type == "checkpointer": + self.app = workflow.compile(checkpointer=self.checkpointer) + elif args.memory_type == "store": + self.app = workflow.compile(store=self.store) + else: + raise ValueError("Invalid memory type!") else: self.app = workflow.compile() @@ -281,6 +310,7 @@ def should_continue(self, state: AgentState): return "continue" def prepare_initial_state(self, query): + print("---Prepare initial state---") return {"messages": [HumanMessage(content=query)]} async def stream_generator(self, query, config, thread_id=None): @@ -289,11 +319,15 @@ async def stream_generator(self, query, config, thread_id=None): initial_state["tool_choice"] = config.pop("tool_choice") try: + print("---Start running---") async for event in self.app.astream(initial_state, config=config, stream_mode=["updates"]): + print(event) event_type = event[0] data = event[1] if event_type == "updates": for node_name, node_state in data.items(): + if self.memory_type == "store": + save_state_to_store(node_state, config, self.store) print(f"--- CALL {node_name} node ---\n") for k, v in node_state.items(): if v is not None: @@ -321,17 +355,16 @@ async def stream_generator(self, query, config, thread_id=None): yield str(e) async def non_streaming_run(self, query, config): + # for use as worker agent (tool of supervisor agent) + # only used in chatcompletion api + # chat completion api only supports checkpointer memory initial_state = self.prepare_initial_state(query) if "tool_choice" in config: initial_state["tool_choice"] = config.pop("tool_choice") try: async for s in self.app.astream(initial_state, config=config, stream_mode="values"): message = s["messages"][-1] - if isinstance(message, tuple): - print(message) - else: - message.pretty_print() - + message.pretty_print() last_message = s["messages"][-1] print("******Response: ", last_message.content) return last_message.content diff --git a/comps/agent/src/integrations/strategy/react/prompt.py b/comps/agent/src/integrations/strategy/react/prompt.py index 681c1325f2..d4b0b0d28f 100644 --- a/comps/agent/src/integrations/strategy/react/prompt.py +++ b/comps/agent/src/integrations/strategy/react/prompt.py @@ -17,9 +17,10 @@ 3. Give concise, factual and relevant answers. """ + REACT_AGENT_LLAMA_PROMPT = """\ -You are tasked with answering user questions. -You have the following tools to gather information: +You are a help assistant engaged in multi-turn conversations with users. +You have access to the following tools: {tools} **Procedure:** @@ -48,12 +49,14 @@ * If you did not get the answer at first, do not give up. Reflect on the steps that you have taken and try a different way. Think out of the box. You hard work will be rewarded. * Do not make up tool outputs. 
-======= Your task ======= -Question: {input} +======= Conversations with user in previous turns ======= +{thread_history} +======= End of previous conversations ======= -Execution History: +======= Your execution History in this turn ========= {history} -======================== +======= End of execution history ========== -Now take a deep breath and think step by step to solve the problem. +Now take a deep breath and think step by step to answer user's question in this turn. +USER MESSAGE: {input} """ diff --git a/comps/agent/src/integrations/strategy/react/utils.py b/comps/agent/src/integrations/strategy/react/utils.py index 19e51032bd..89301d8938 100644 --- a/comps/agent/src/integrations/strategy/react/utils.py +++ b/comps/agent/src/integrations/strategy/react/utils.py @@ -2,12 +2,15 @@ # SPDX-License-Identifier: Apache-2.0 import json +import time import uuid +from typing import Any, Dict, List, Literal, Optional, Union from huggingface_hub import ChatCompletionOutputFunctionDefinition, ChatCompletionOutputToolCall from langchain_core.messages import AIMessage, HumanMessage, ToolMessage from langchain_core.messages.tool import ToolCall from langchain_core.output_parsers import BaseOutputParser +from pydantic import BaseModel class ReActLlamaOutputParser(BaseOutputParser): @@ -32,24 +35,15 @@ def parse(self, text: str): if output: return output else: - return text + return None def convert_json_to_tool_call(json_str): tool_name = json_str["tool"] tool_args = json_str["args"] tcid = str(uuid.uuid4()) - add_kw_tc = { - "tool_calls": [ - ChatCompletionOutputToolCall( - function=ChatCompletionOutputFunctionDefinition(arguments=tool_args, name=tool_name, description=None), - id=tcid, - type="function", - ) - ] - } tool_call = ToolCall(name=tool_name, args=tool_args, id=tcid) - return add_kw_tc, tool_call + return tool_call def get_tool_output(messages, id): @@ -86,21 +80,31 @@ def assemble_history(messages): def assemble_memory(messages): """ - messages: Human, AI, TOOL, AI, TOOL, etc. + Assemble memory from messages within this thread (i.e., same thread id) + messages: Human, AI, TOOL, AI, TOOL, etc. in a thread with multi-turn conversations + output: + query - user input of current turn. + conversation_history - history user input and final ai output in previous turns. + query_history - history of tool calls and outputs in current turn. + + How to disect turns: each human message signals the start of a new turn. 
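+
+    Illustrative example (hypothetical message sequence): [Human1, AI1, Human2, AI2 with tool call, Tool2, AI3 answer]
+    -> query = Human2 content; conversation_history = Human1 and AI1; query_history = Human2, AI2's tool call and its output, and AI3.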
""" query = "" query_id = None query_history = "" breaker = "-" * 10 - # get query + # get most recent human input for m in messages[::-1]: if isinstance(m, HumanMessage): query = m.content query_id = m.id + most_recent_human_message_idx = messages.index(m) break - for m in messages: + # get query history in this turn + # start from the most recent human input + for m in messages[most_recent_human_message_idx:]: if isinstance(m, AIMessage): # if there is tool call if hasattr(m, "tool_calls") and len(m.tool_calls) > 0: @@ -115,8 +119,140 @@ def assemble_memory(messages): query_history += f"Assistant Output: {m.content}\n" elif isinstance(m, HumanMessage): - if m.id == query_id: - continue - query_history += f"Human Input: {m.content}\n" + query_history += f"User Input: {m.content}\n" + + # get conversion history of previous turns + conversation_history = "" + for i, m in enumerate(messages[:most_recent_human_message_idx]): + if isinstance(m, HumanMessage): + conversation_history += f"User Input: {m.content}\n" + elif isinstance(m, AIMessage) and isinstance(messages[i + 1], HumanMessage): + conversation_history += f"Assistant Output: {m.content}\n" + return query, query_history, conversation_history + + +class ToolCallObject(BaseModel): + name: str + args: Dict[str, Any] + id: str + + +class StoreMessage(BaseModel): + id: str + object: str = "thread.message" + created_at: float + role: str + content: Optional[str] = None + tool_calls: Optional[List[ToolCallObject]] = None + tool_call_id: Optional[str] = None + + +def convert_to_message_object(message): + if isinstance(message, HumanMessage): + message_object = StoreMessage( + id=message.id, + created_at=time.time(), + role="user", + content=message.content, + ) + elif isinstance(message, AIMessage): + if message.tool_calls: + tool_calls = [] + for tool_call in message.tool_calls: + tool_calls.append( + { + "name": tool_call["name"], + "args": tool_call["args"], + "id": tool_call["id"], + } + ) + else: + tool_calls = None + + message_object = StoreMessage( + id=message.id, + created_at=time.time(), + role="assistant", + content=message.content, + tool_calls=tool_calls, + ) + + elif isinstance(message, ToolMessage): + message_object = StoreMessage( + id=message.id, + created_at=time.time(), + role="tool", + content=message.content, + tool_call_id=message.tool_call_id, + ) + else: + raise ValueError("Invalid message type") + + return message_object + + +def save_state_to_store(state, config, store): + last_message = state["messages"][-1] + + assistant_id = config["configurable"]["user_id"] + thread_id = config["configurable"]["thread_id"] + namespace = f"{assistant_id}_{thread_id}" + + # Create a new memory ID + memory_id = str(uuid.uuid4()) + + # convert message into MessageObject + message_object = convert_to_message_object(last_message) + store.put(memory_id, message_object.model_dump_json(), namespace) + + +def convert_from_message_object(message_object): + if message_object["role"] == "user": + try: + # MessageObject class has a different structure from StoreMessage + message = HumanMessage(content=message_object["content"][0]["text"], id=message_object["id"]) + except: + message = HumanMessage(content=message_object["content"], id=message_object["id"]) + elif message_object["role"] == "assistant": + if message_object["tool_calls"]: + tool_calls = [] + for tool_call in message_object["tool_calls"]: + tool_calls.append(ToolCall(name=tool_call["name"], args=tool_call["args"], id=tool_call["id"])) + message = 
AIMessage(content=message_object["content"], id=message_object["id"], tool_calls=tool_calls) + else: + message = AIMessage(content=message_object["content"], id=message_object["id"]) + elif message_object["role"] == "tool": + message = ToolMessage(content=message_object["content"], tool_call_id=message_object["tool_call_id"]) + else: + raise ValueError("Invalid message role") + return message + + +def assemble_memory_from_store(config, store): + """ + store: RedisPersistence + """ + assistant_id = config["configurable"]["user_id"] + thread_id = config["configurable"]["thread_id"] + namespace = f"{assistant_id}_{thread_id}" + print("@@@Namespace: ", namespace) + + # get all the messages in this thread + saved_all = store.get_all(namespace) + message_objects = [] + messages = [] + for saved in saved_all: + message_object = json.loads(saved_all[saved]) + print("@@@@ Saved memory:\n", message_object) + message_objects.append(message_object) + + message_objects = sorted(message_objects, key=lambda x: x["created_at"]) + + for message_object in message_objects: + message = convert_from_message_object(message_object) + messages.append(message) + + # print("@@@@ All messages:\n", messages) - return query, query_history + query, query_history, conversation_history = assemble_memory(messages) + return query, query_history, conversation_history diff --git a/comps/agent/src/integrations/utils.py b/comps/agent/src/integrations/utils.py index b39b1f603a..77f4e1cadb 100644 --- a/comps/agent/src/integrations/utils.py +++ b/comps/agent/src/integrations/utils.py @@ -133,6 +133,19 @@ def assemble_store_messages(messages): return "\n".join(inputs) +def get_latest_human_message_from_store(store, namespace): + messages = store.get_all(namespace) + human_messages = [] + for mid in messages: + message = json.loads(messages[mid]) + if message["role"] == "user": + human_messages.append(message) + + human_messages = sorted(human_messages, key=lambda x: x["created_at"]) + latest_human_message = human_messages[-1] + return latest_human_message["content"][0]["text"] + + def get_args(): parser = argparse.ArgumentParser() # llm args @@ -156,8 +169,8 @@ def get_args(): parser.add_argument("--repetition_penalty", type=float, default=1.03) parser.add_argument("--return_full_text", type=bool, default=False) parser.add_argument("--custom_prompt", type=str, default=None) - parser.add_argument("--with_memory", type=bool, default=False) - parser.add_argument("--with_store", type=bool, default=False) + parser.add_argument("--with_memory", type=str, default="true") + parser.add_argument("--memory_type", type=str, default="checkpointer", help="choices: checkpointer, store") parser.add_argument("--timeout", type=int, default=60) # for sql agent @@ -179,6 +192,11 @@ def get_args(): else: sys_args.stream = False + if sys_args.with_memory == "true": + sys_args.with_memory = True + else: + sys_args.with_memory = False + if sys_args.use_hints == "true": print("SQL agent will use hints") sys_args.use_hints = True diff --git a/comps/agent/src/test.py b/comps/agent/src/test.py index 1ca089ed0c..2993e466d4 100644 --- a/comps/agent/src/test.py +++ b/comps/agent/src/test.py @@ -155,10 +155,133 @@ def test_ut(args): print(tool) +def run_agent(agent, config, input_message): + initial_state = agent.prepare_initial_state(input_message) + + try: + for s in agent.app.stream(initial_state, config=config, stream_mode="values"): + message = s["messages"][-1] + message.pretty_print() + + last_message = s["messages"][-1] + print("******Response: ", 
last_message.content) + except Exception as e: + print(str(e)) + + +def stream_generator(agent, config, input_message): + from integrations.strategy.react.utils import save_state_to_store + + initial_state = agent.prepare_initial_state(input_message) + # try: + for event in agent.app.stream(initial_state, config=config, stream_mode=["updates"]): + print(event) + event_type = event[0] + data = event[1] + if event_type == "updates": + for node_name, node_state in data.items(): + print(f"@@@ {node_name} : {node_state}") + print(" @@@ Save message to store....") + save_state_to_store(node_state, config, agent.store) + print(f"--- CALL {node_name} node ---\n") + for k, v in node_state.items(): + if v is not None: + print(f"------- {k}, {v} -------\n\n") + if node_name == "agent": + if v[0].content == "": + tool_names = [] + for tool_call in v[0].tool_calls: + tool_names.append(tool_call["name"]) + result = {"tool": tool_names} + else: + result = {"content": [v[0].content.replace("\n\n", "\n")]} + # ui needs this format + print(f"data: {json.dumps(result)}\n\n") + elif node_name == "tools": + full_content = v[0].content + tool_name = v[0].name + result = {"tool": tool_name, "content": [full_content]} + print(f"data: {json.dumps(result)}\n\n") + if not full_content: + continue + + print("data: [DONE]\n\n") + # except Exception as e: + # print(str(e)) + + +import time +from uuid import uuid4 + + +def save_message_to_store(db_client, namespace, input_message): + msg_id = str(uuid4()) + input_object = json.dumps({"role": "user", "content": input_message, "id": msg_id, "created_at": int(time.time())}) + db_client.put(msg_id, input_object, namespace) + + +def test_memory(args): + from integrations.agent import instantiate_agent + + agent = instantiate_agent(args) + print(args) + + assistant_id = "my_assistant" + thread_id = str(uuid4()) + namespace = f"{assistant_id}_{thread_id}" + db_client = agent.store + + config = { + "recursion_limit": 5, + "configurable": {"session_id": thread_id, "thread_id": thread_id, "user_id": assistant_id}, + } + + input_message = "Hi! I'm Bob." + save_message_to_store(db_client, namespace, input_message) + run_agent(agent, config, input_message) + time.sleep(1) + print("============== End of first turn ==============") + + input_message = "What's OPEA project?" + save_message_to_store(db_client, namespace, input_message) + run_agent(agent, config, input_message) + time.sleep(1) + print("============== End of second turn ==============") + + input_message = "what's my name?" + save_message_to_store(db_client, namespace, input_message) + run_agent(agent, config, input_message) + time.sleep(1) + print("============== End of third turn ==============") + + # input_message = "Hi! I'm Bob." + # msg_id = str(uuid4()) + # input_object = json.dumps({"role": "user", "content": input_message, "id": msg_id, "created_at": int(time.time())}) + # db_client.put(msg_id, input_object, namespace) + # stream_generator(agent, config, input_message) + # print("============== End of first turn ==============") + + # time.sleep(1) + # input_message = "What's OPEA project?" + # msg_id = str(uuid4()) + # input_object = json.dumps({"role": "user", "content": input_message, "id": msg_id, "created_at": int(time.time())}) + # db_client.put(msg_id, input_object, namespace) + # stream_generator(agent, config, input_message) + # print("============== End of second turn ==============") + + # time.sleep(1) + # input_message = "what's my name?" 
+ # msg_id = str(uuid4()) + # input_object = json.dumps({"role": "user", "content": input_message, "id": msg_id, "created_at": int(time.time())}) + # db_client.put(msg_id, input_object, namespace) + # stream_generator(agent, config, input_message) + # print("============== End of third turn ==============") + + if __name__ == "__main__": args1, _ = get_args() parser = argparse.ArgumentParser() - parser.add_argument("--strategy", type=str, default="react") + parser.add_argument("--strategy", type=str, default="react_llama") parser.add_argument("--local_test", action="store_true", help="Test with local mode") parser.add_argument("--endpoint_test", action="store_true", help="Test with endpoint mode") parser.add_argument("--assistants_api_test", action="store_true", help="Test with endpoint mode") @@ -175,13 +298,15 @@ def test_ut(args): for key, value in vars(args1).items(): setattr(args, key, value) - if args.local_test: - test_agent_local(args) - elif args.endpoint_test: - test_agent_http(args) - elif args.ut: - test_ut(args) - elif args.assistants_api_test: - test_assistants_http(args) - else: - print("Please specify the test type") + # if args.local_test: + # test_agent_local(args) + # elif args.endpoint_test: + # test_agent_http(args) + # elif args.ut: + # test_ut(args) + # elif args.assistants_api_test: + # test_assistants_http(args) + # else: + # print("Please specify the test type") + + test_memory(args) diff --git a/comps/agent/src/test_assistant_api.py b/comps/agent/src/test_assistant_api.py index 2a1be1883c..37e0b278cb 100644 --- a/comps/agent/src/test_assistant_api.py +++ b/comps/agent/src/test_assistant_api.py @@ -3,12 +3,12 @@ import argparse import json +import time import requests -from integrations.utils import get_args -def test_assistants_http(args): +def test_assistants_http(args, agent_config=None): proxies = {"http": ""} url = f"http://{args.ip_addr}:{args.ext_port}/v1" @@ -33,13 +33,9 @@ def process_request(api, query, is_stream=False): return False # step 1. create assistants - # query = {} + query = { - "agent_config": { - "llm_engine": "tgi", - "llm_endpoint_url": args.llm_endpoint_url, - "tools": "/home/user/tools/custom_tools.yaml", - } + "agent_config": agent_config, } if ret := process_request("assistants", query): @@ -59,99 +55,55 @@ def process_request(api, query, is_stream=False): return # step 3. add messages - if args.query is None: - query = {"role": "user", "content": "How old was Bill Gates when he built Microsoft?"} - else: - query = {"role": "user", "content": args.query} - if ret := process_request(f"threads/{thread_id}/messages", query): - pass - else: - print("Error when add messages !!!!") - return + def add_message_and_run(user_message): + query = {"role": "user", "content": user_message, "assistant_id": assistant_id} + if ret := process_request(f"threads/{thread_id}/messages", query): + pass + else: + print("Error when add messages !!!!") + return - # step 4. 
run - print("You may cancel the running process with cmdline") - print(f"curl {url}/threads/{thread_id}/runs/cancel -X POST -H 'Content-Type: application/json'") + print("You may cancel the running process with cmdline") + print(f"curl {url}/threads/{thread_id}/runs/cancel -X POST -H 'Content-Type: application/json'") - query = {"assistant_id": assistant_id} - process_request(f"threads/{thread_id}/runs", query, is_stream=True) + query = {"assistant_id": assistant_id} + process_request(f"threads/{thread_id}/runs", query, is_stream=True) - # ---------------------------------------- test persistent - # step 1. create assistants + # step 4. First turn + user_message = "Hi! I'm Bob." + add_message_and_run(user_message) + time.sleep(1) - query = { - "agent_config": { - "llm_engine": "tgi", - "llm_endpoint_url": args.llm_endpoint_url, - "tools": "/home/user/comps/agent/src/tools/custom_tools.yaml", - "with_store": True, - "store_config": {"redis_uri": f"redis://{args.ip_addr}:6379"}, - } - } + # step 5. Second turn + user_message = "What is OPEA?" + add_message_and_run(user_message) + time.sleep(1) - if ret := process_request("assistants", query): - assistant_id = ret.get("id") - print("Created Assistant Id: ", assistant_id) - else: - print("Error when creating assistants !!!!") - return - - # step 2. create threads - query = {} - if ret := process_request("threads", query): - thread_id = ret.get("id") - print("Created Thread Id: ", thread_id) - else: - print("Error when creating threads !!!!") - return - - # step 3. add messages - if args.query is None: - query = { - "role": "user", - "content": "How old was Bill Gates when he built Microsoft?", - "assistant_id": assistant_id, - } - else: - query = {"role": "user", "content": args.query, "assistant_id": assistant_id} - if ret := process_request(f"threads/{thread_id}/messages", query): - pass - else: - print("Error when add messages !!!!") - return - - # step 4. run - print("You may cancel the running process with cmdline") - print(f"curl {url}/threads/{thread_id}/runs/cancel -X POST -H 'Content-Type: application/json'") - - query = {"assistant_id": assistant_id} - process_request(f"threads/{thread_id}/runs", query, is_stream=True) + # step 6. Third turn + user_message = "What is my name?" 
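+    # with memory_type "store", the agent should recall the name given in the first turn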
+ add_message_and_run(user_message) + time.sleep(1) if __name__ == "__main__": - args1, _ = get_args() parser = argparse.ArgumentParser() - parser.add_argument("--strategy", type=str, default="react") - parser.add_argument("--local_test", action="store_true", help="Test with local mode") - parser.add_argument("--endpoint_test", action="store_true", help="Test with endpoint mode") - parser.add_argument("--assistants_api_test", action="store_true", help="Test with endpoint mode") - parser.add_argument("--q", type=int, default=0) + parser.add_argument("--strategy", type=str, default="react_llama") parser.add_argument("--ip_addr", type=str, default="127.0.0.1", help="endpoint ip address") parser.add_argument("--ext_port", type=str, default="9090", help="endpoint port") - parser.add_argument("--query", type=str, default=None) - parser.add_argument("--filedir", type=str, default="./", help="test file directory") - parser.add_argument("--filename", type=str, default="query.csv", help="query_list_file") - parser.add_argument("--output", type=str, default="output.csv", help="query_list_file") - parser.add_argument("--ut", action="store_true", help="ut") - parser.add_argument("--llm_endpoint_url", type=str, default="http://localhost:8085", help="tgi/vllm endpoint") + parser.add_argument("--llm_endpoint_url", type=str, default="http://localhost:8086", help="tgi/vllm endpoint") args, _ = parser.parse_known_args() - for key, value in vars(args1).items(): - setattr(args, key, value) + agent_config = { + "strategy": "react_llama", + "stream": True, + "llm_engine": "vllm", + "llm_endpoint_url": args.llm_endpoint_url, + "tools": "/home/user/comps/agent/src/tools/custom_tools.yaml", + "with_memory": True, + "memory_type": "store", + "store_config": {"redis_uri": f"redis://{args.ip_addr}:6379"}, + } - if args.assistants_api_test: - print("test args:", args) - test_assistants_http(args) - else: - print("Please specify the test type") + print("test args:", args) + test_assistants_http(args, agent_config) diff --git a/comps/agent/src/test_chat_completion_multiturn.py b/comps/agent/src/test_chat_completion_multiturn.py new file mode 100644 index 0000000000..94166db380 --- /dev/null +++ b/comps/agent/src/test_chat_completion_multiturn.py @@ -0,0 +1,64 @@ +# Copyright (C) 2025 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import argparse +import json +import uuid + +import requests + + +def process_request(url, query, is_stream=False): + proxies = {"http": ""} + content = json.dumps(query) if query is not None else None + try: + resp = requests.post(url=url, data=content, proxies=proxies, stream=is_stream) + if not is_stream: + ret = resp.json() + print(ret) + else: + for line in resp.iter_lines(decode_unicode=True): + print(line) + ret = None + + resp.raise_for_status() # Raise an exception for unsuccessful HTTP status codes + return ret + except requests.exceptions.RequestException as e: + ret = f"An error occurred:{e}" + print(ret) + return False + + +def add_message_and_run(url, user_message, thread_id, stream=False): + query = {"role": "user", "messages": user_message, "thread_id": thread_id, "stream": stream} + ret = process_request(url, query, is_stream=stream) + print("Response: ", ret) + + +def test_chat_completion_http(args): + url = f"http://{args.ip_addr}:{args.ext_port}/v1/chat/completions" + thread_id = f"{uuid.uuid4()}" + + # first turn + user_message = "Hi! I'm Bob." 
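+    # the same thread_id is reused for all three turns so the agent's checkpointer memory carries the context across turns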
+    add_message_and_run(url, user_message, thread_id, stream=args.stream)
+
+    # second turn
+    user_message = "What's OPEA project?"
+    add_message_and_run(url, user_message, thread_id, stream=args.stream)
+
+    # third turn
+    user_message = "What is my name?"
+    add_message_and_run(url, user_message, thread_id, stream=args.stream)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--ip_addr", type=str, default="127.0.0.1", help="endpoint ip address")
+    parser.add_argument("--ext_port", type=str, default="9090", help="endpoint port")
+    parser.add_argument("--llm_endpoint_url", type=str, default="http://localhost:8086", help="tgi/vllm endpoint")
+    parser.add_argument("--stream", action="store_true", help="streaming mode")
+    args, _ = parser.parse_known_args()
+
+    print(args)
+    test_chat_completion_http(args)
diff --git a/tests/agent/build_vllm_gaudi.sh b/tests/agent/build_vllm_gaudi.sh
new file mode 100644
index 0000000000..bc38fdad3d
--- /dev/null
+++ b/tests/agent/build_vllm_gaudi.sh
@@ -0,0 +1,24 @@
+# Copyright (C) 2025 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+function build_vllm_docker_images() {
+    echo "Building the vllm docker images"
+    cd $WORKDIR
+    echo $WORKPATH
+    if [ ! -d "./vllm-fork" ]; then
+        git clone https://github.com/HabanaAI/vllm-fork.git
+    fi
+    cd ./vllm-fork
+    # git fetch --all
+    # git checkout v0.6.4.post2+Gaudi-1.19.0
+    # sed -i 's/triton/triton==3.1.0/g' requirements-hpu.txt
+    docker build --no-cache -f Dockerfile.hpu -t opea/vllm-gaudi:comps --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
+    if [ $? -ne 0 ]; then
+        echo "opea/vllm-gaudi:comps failed"
+        exit 1
+    else
+        echo "opea/vllm-gaudi:comps successful"
+    fi
+}
+
+build_vllm_docker_images
diff --git a/tests/agent/launch_vllm_gaudi.sh b/tests/agent/launch_vllm_gaudi.sh
new file mode 100644
index 0000000000..737435f14f
--- /dev/null
+++ b/tests/agent/launch_vllm_gaudi.sh
@@ -0,0 +1,30 @@
+# Copyright (C) 2025 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+model="meta-llama/Llama-3.3-70B-Instruct"
+MAX_INPUT=16384
+vllm_port=8086
+vllm_volume=$HF_CACHE_DIR
+echo "token is ${HF_TOKEN}"
+LOG_PATH=$WORKDIR
+
+echo "start vllm gaudi service"
+echo "**************model is $model**************"
+docker run -d --runtime=habana --rm --name "vllm-gaudi-server" -e HABANA_VISIBLE_DEVICES=0,1,2,3 -p $vllm_port:80 -v $vllm_volume:/data -e HF_TOKEN=$HF_TOKEN -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e HF_HOME=/data -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e VLLM_SKIP_WARMUP=true --cap-add=sys_nice --ipc=host opea/vllm-gaudi:comps --model ${model} --host 0.0.0.0 --port 80 --max-seq-len-to-capture $MAX_INPUT --tensor-parallel-size 4
+# sleep 5s
+# echo "Waiting vllm gaudi ready"
+# n=0
+# until [[ "$n" -ge 100 ]] || [[ $ready == true ]]; do
+#     docker logs vllm-gaudi-server &> ${LOG_PATH}/vllm-gaudi-service.log
+#     n=$((n+1))
+#     if grep -q "Uvicorn running on" ${LOG_PATH}/vllm-gaudi-service.log; then
+#         break
+#     fi
+#     if grep -q "No such container" ${LOG_PATH}/vllm-gaudi-service.log; then
+#         echo "container vllm-gaudi-server not found"
+#         exit 1
+#     fi
+#     sleep 5s
+# done
+# sleep 5s
+# echo "Service started successfully"
diff --git a/tests/agent/ragagent.yaml b/tests/agent/ragagent.yaml
index c8bda98feb..5842511e87 100644
--- a/tests/agent/ragagent.yaml
+++ b/tests/agent/ragagent.yaml
@@ -14,7 +14,8 @@ services:
ip_address: ${ip_address} strategy: rag_agent_llama recursion_limit: ${recursion_limit} - llm_engine: tgi + llm_engine: vllm + with_memory: false HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN} llm_endpoint_url: ${LLM_ENDPOINT_URL} timeout: 500 diff --git a/tests/agent/reactllama.yaml b/tests/agent/reactllama.yaml index 62346852ee..4bcf732817 100644 --- a/tests/agent/reactllama.yaml +++ b/tests/agent/reactllama.yaml @@ -13,15 +13,17 @@ services: environment: ip_address: ${ip_address} strategy: react_llama + with_memory: true + memory_type: checkpointer recursion_limit: ${recursion_limit} - llm_engine: tgi + llm_engine: vllm HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN} llm_endpoint_url: ${LLM_ENDPOINT_URL} model: ${LLM_MODEL_ID} temperature: ${temperature} max_new_tokens: ${max_new_tokens} top_k: 10 - stream: false + stream: true tools: /home/user/tools/custom_tools.yaml require_human_feedback: false no_proxy: ${no_proxy} diff --git a/tests/agent/sql_agent_llama.yaml b/tests/agent/sql_agent_llama.yaml index 08fc91fafc..15f4314b8b 100644 --- a/tests/agent/sql_agent_llama.yaml +++ b/tests/agent/sql_agent_llama.yaml @@ -8,17 +8,17 @@ services: volumes: - ${TOOLSET_PATH}:/home/user/tools/ # tools # - ${WORKDIR}/GenAIComps/comps:/home/user/comps # code - - ${WORKDIR}/TAG-Bench/:/home/user/TAG-Bench # SQL database and hints_file + - ${WORKDIR}/GenAIComps/tests/agent/:/home/user/chinook-db/ ports: - "9095:9095" ipc: host environment: ip_address: ${ip_address} strategy: sql_agent_llama + with_memory: false db_name: ${db_name} db_path: ${db_path} use_hints: false - hints_file: /home/user/TAG-Bench/${db_name}_hints.csv recursion_limit: ${recursion_limit} llm_engine: vllm HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN} diff --git a/tests/agent/sql_agent_test/test_sql_agent.sh b/tests/agent/sql_agent_test/test_sql_agent.sh index edbdb01e48..ac2084187a 100644 --- a/tests/agent/sql_agent_test/test_sql_agent.sh +++ b/tests/agent/sql_agent_test/test_sql_agent.sh @@ -23,17 +23,17 @@ export ip_address=$(hostname -I | awk '{print $1}') vllm_port=8086 vllm_volume=${HF_CACHE_DIR} -export model=meta-llama/Meta-Llama-3.1-70B-Instruct +export model=meta-llama/Llama-3.3-70B-Instruct #meta-llama/Meta-Llama-3.1-70B-Instruct export HUGGINGFACEHUB_API_TOKEN=${HF_TOKEN} -export LLM_MODEL_ID="meta-llama/Meta-Llama-3.1-70B-Instruct" +export LLM_MODEL_ID="meta-llama/Llama-3.3-70B-Instruct" #"meta-llama/Meta-Llama-3.1-70B-Instruct" export LLM_ENDPOINT_URL="http://${ip_address}:${vllm_port}" export temperature=0.01 export max_new_tokens=4096 export TOOLSET_PATH=$WORKPATH/comps/agent/src/tools/ # $WORKPATH/tests/agent/sql_agent_test/ echo "TOOLSET_PATH=${TOOLSET_PATH}" export recursion_limit=15 -export db_name=california_schools -export db_path="sqlite:////home/user/TAG-Bench/dev_folder/dev_databases/${db_name}/${db_name}.sqlite" +export db_name=Chinook +export db_path="sqlite:////home/user/chinook-db/Chinook_Sqlite.sqlite" # for using Google search API export GOOGLE_CSE_ID=${GOOGLE_CSE_ID} @@ -57,6 +57,22 @@ function prepare_data() { echo "Data preparation done!" } +function download_chinook_data(){ + echo "Downloading chinook data..." + cd $WORKDIR + git clone https://github.com/lerocha/chinook-database.git + cp chinook-database/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite $WORKDIR/GenAIComps/tests/agent/ +} + +function remove_chinook_data(){ + echo "Removing chinook data..." + cd $WORKDIR + if [ -d "chinook-database" ]; then + rm -rf chinook-database + fi + echo "Chinook data removed!" 
+} + function remove_data() { echo "Removing data..." cd $WORKDIR @@ -102,7 +118,6 @@ function build_vllm_docker_images() { } function start_vllm_service() { - # redis endpoint echo "token is ${HF_TOKEN}" #single card @@ -164,7 +179,7 @@ function run_benchmark() { echo "Preparing data...." -prepare_data +download_chinook_data echo "launching sql_agent_llama service...." start_sql_agent_llama_service @@ -176,4 +191,4 @@ echo "Running test...." run_test echo "Removing data...." -remove_data +remove_chinook_data diff --git a/tests/agent/test.py b/tests/agent/test.py index 2da0f6c8e6..6223bc0819 100644 --- a/tests/agent/test.py +++ b/tests/agent/test.py @@ -20,9 +20,7 @@ def generate_answer_agent_api(url, prompt): def process_request(url, query, is_stream=False): proxies = {"http": ""} - payload = { - "messages": query, - } + payload = {"messages": query, "stream": is_stream} try: resp = requests.post(url=url, json=payload, proxies=proxies, stream=is_stream) @@ -51,7 +49,7 @@ def process_request(url, query, is_stream=False): ip_address = os.getenv("ip_address", "localhost") url = f"http://{ip_address}:9095/v1/chat/completions" if args.test_sql_agent: - prompt = "How many schools have the average score in Math over 560 in the SAT test?" + prompt = "Who has the most albums?" else: prompt = "What is OPEA?" diff --git a/tests/agent/test_agent_langchain_on_intel_hpu.sh b/tests/agent/test_agent_langchain_on_intel_hpu.sh index 2c12354723..ae2fd984a7 100644 --- a/tests/agent/test_agent_langchain_on_intel_hpu.sh +++ b/tests/agent/test_agent_langchain_on_intel_hpu.sh @@ -14,21 +14,24 @@ tgi_port=8085 tgi_volume=$WORKPATH/data vllm_port=8086 -export vllm_volume=$WORKPATH/data -echo "vllm_volume:" -ls $vllm_volume +export HF_CACHE_DIR=/data2/huggingface +echo "HF_CACHE_DIR=$HF_CACHE_DIR" +ls $HF_CACHE_DIR +export vllm_volume=${HF_CACHE_DIR} + export WORKPATH=$WORKPATH export agent_image="opea/agent:comps" export agent_container_name="test-comps-agent-endpoint" -export model=meta-llama/Meta-Llama-3.1-70B-Instruct +export model=meta-llama/Llama-3.3-70B-Instruct #meta-llama/Meta-Llama-3.1-70B-Instruct export HUGGINGFACEHUB_API_TOKEN=${HF_TOKEN} export ip_address=$(hostname -I | awk '{print $1}') export HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -export LLM_MODEL_ID="meta-llama/Meta-Llama-3.1-70B-Instruct" +export LLM_MODEL_ID="meta-llama/Llama-3.3-70B-Instruct" #"meta-llama/Meta-Llama-3.1-70B-Instruct" export LLM_ENDPOINT_URL="http://${ip_address}:${vllm_port}" +echo "LLM_ENDPOINT_URL: $LLM_ENDPOINT_URL" export temperature=0.01 export max_new_tokens=4096 export TOOLSET_PATH=$WORKPATH/comps/agent/src/tools/ @@ -113,7 +116,6 @@ function start_vllm_service() { } function start_vllm_auto_tool_choice_service() { - # redis endpoint echo "token is ${HF_TOKEN}" #single card @@ -164,15 +166,12 @@ function start_vllm_service_70B() { echo "Service started successfully" } -function start_react_agent_service() { - - echo "Starting redis for testing agent persistent" - - docker run -d -it -p 6379:6379 --rm --name "test-persistent-redis" --net=host --ipc=host --name redis-vector-db redis/redis-stack:7.2.0-v9 +agent_start_wait_time=2m +function start_react_agent_service() { echo "Starting react agent microservice" docker compose -f $WORKPATH/tests/agent/react_langchain.yaml up -d - sleep 120s + sleep $agent_start_wait_time docker logs test-comps-agent-endpoint echo "Service started successfully" } @@ -181,16 +180,20 @@ function start_react_agent_service() { function start_react_langgraph_agent_service_openai() { 
echo "Starting react agent microservice" docker compose -f $WORKPATH/tests/agent/react_langgraph_openai.yaml up -d - sleep 120s + sleep $agent_start_wait_time docker logs test-comps-agent-endpoint echo "Service started successfully" } function start_react_llama_agent_service() { - echo "Starting react_langgraph agent microservice" + echo "Starting redis for testing agent persistent" + + docker run -d -it -p 6379:6379 --rm --name "test-persistent-redis" --net=host --ipc=host --name redis-vector-db redis/redis-stack:7.2.0-v9 + + echo "Starting react_llama agent microservice" docker compose -f $WORKPATH/tests/agent/reactllama.yaml up -d - sleep 120s + sleep $agent_start_wait_time docker logs test-comps-agent-endpoint echo "Service started successfully" } @@ -198,7 +201,7 @@ function start_react_llama_agent_service() { function start_react_langgraph_agent_service_vllm() { echo "Starting react_langgraph agent microservice" docker compose -f $WORKPATH/tests/agent/react_vllm.yaml up -d - sleep 120s + sleep $agent_start_wait_time docker logs test-comps-agent-endpoint echo "Service started successfully" } @@ -206,7 +209,7 @@ function start_react_langgraph_agent_service_vllm() { function start_planexec_agent_service_vllm() { echo "Starting planexec agent microservice" docker compose -f $WORKPATH/tests/agent/planexec_vllm.yaml up -d - sleep 120s + sleep $agent_start_wait_time docker logs test-comps-agent-endpoint echo "Service started successfully" } @@ -214,7 +217,7 @@ function start_planexec_agent_service_vllm() { function start_ragagent_agent_service() { echo "Starting rag agent microservice" docker compose -f $WORKPATH/tests/agent/ragagent.yaml up -d - sleep 120s + sleep $agent_start_wait_time docker logs test-comps-agent-endpoint echo "Service started successfully" } @@ -222,7 +225,7 @@ function start_ragagent_agent_service() { function start_ragagent_agent_service_openai() { echo "Starting rag agent microservice" docker compose -f $WORKPATH/tests/agent/ragagent_openai.yaml up -d - sleep 120s + sleep $agent_start_wait_time docker logs test-comps-agent-endpoint echo "Service started successfully" } @@ -230,7 +233,7 @@ function start_ragagent_agent_service_openai() { function start_planexec_agent_service_openai() { echo "Starting plan execute agent microservice" docker compose -f $WORKPATH/tests/agent/planexec_openai.yaml up -d - sleep 120s + sleep $agent_start_wait_time docker logs test-comps-agent-endpoint echo "Service started successfully" } @@ -255,9 +258,6 @@ function validate() { function validate_microservice() { echo "Testing agent service - chat completion API" - # local CONTENT=$(http_proxy="" curl http://${ip_address}:9095/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{ - # "query": "What is OPEA?" 
- # }') CONTENT=$(python3 $WORKPATH/tests/agent/test.py) local EXIT_CODE=$(validate "$CONTENT" "OPEA" "test-agent") echo "$EXIT_CODE" @@ -279,16 +279,16 @@ function validate_microservice() { } -function validate_microservice_streaming() { +function validate_microservice_multi_turn_streaming() { echo "Testing agent service - chat completion API" - CONTENT=$(python3 $WORKPATH/tests/agent/test.py --stream) + CONTENT=$(python3 $WORKPATH/comps/agent/src/test_chat_completion_multiturn.py --ip_addr ${ip_address} --ext_port 9095 --stream true --llm_endpoint_url $LLM_ENDPOINT_URL 2>&1 | tee ${LOG_PATH}/test-agent.log) local EXIT_CODE=$(validate "$CONTENT" "OPEA" "test-agent") echo "$EXIT_CODE" local EXIT_CODE="${EXIT_CODE:0-1}" echo "return value is $EXIT_CODE" if [ "$EXIT_CODE" == "1" ]; then - echo "==================TGI logs ======================" - docker logs test-comps-tgi-gaudi-service + echo "==================vllm logs ======================" + docker logs test-comps-vllm-gaudi-service echo "==================Agent logs ======================" docker logs test-comps-agent-endpoint exit 1 @@ -298,16 +298,16 @@ function validate_microservice_streaming() { function validate_assistant_api() { cd $WORKPATH echo "Testing agent service - assistant api" - local CONTENT=$(python3 comps/agent/src/test_assistant_api.py --ip_addr ${ip_address} --ext_port 9095 --assistants_api_test --query 'What is Intel OPEA project?' --llm_endpoint_url $LLM_ENDPOINT_URL 2>&1 | tee ${LOG_PATH}/test-agent-assistantsapi.log) + local CONTENT=$(python3 $WORKPATH/comps/agent/src/test_assistant_api.py --ip_addr ${ip_address} --ext_port 9095 --llm_endpoint_url $LLM_ENDPOINT_URL) local EXIT_CODE=$(validate "$CONTENT" "OPEA" "test-agent-assistantsapi") echo "$EXIT_CODE" local EXIT_CODE="${EXIT_CODE:0-1}" echo "return value is $EXIT_CODE" if [ "$EXIT_CODE" == "1" ]; then - echo "==================TGI logs ======================" - docker logs comps-tgi-gaudi-service + echo "==================vllm logs ======================" + docker logs test-comps-vllm-gaudi-service echo "==================Agent logs ======================" - docker logs comps-agent-endpoint + docker logs test-comps-agent-endpoint exit 1 fi } @@ -356,7 +356,7 @@ function stop_docker() { function validate_sql_agent(){ cd $WORKPATH/tests/ local CONTENT=$(bash agent/sql_agent_test/test_sql_agent.sh) - local EXIT_CODE=$(validate "$CONTENT" "173" "test-sql-agent") + local EXIT_CODE=$(validate "$CONTENT" "Iron" "test-sql-agent") echo "$EXIT_CODE" local EXIT_CODE="${EXIT_CODE:0-1}" echo "return value is $EXIT_CODE" @@ -376,74 +376,77 @@ function main() { build_docker_images build_vllm_docker_images - # ==================== Tests with 70B model ==================== - # RAG agent, react_llama, react, assistant apis + # # ==================== Tests with 70B model ==================== + # # RAG agent, react_llama, react, assistant apis start_vllm_service_70B # # test rag agent + # chat completion API, no memory, single-turn start_ragagent_agent_service - echo "=============Testing RAG Agent=============" + echo "=============Testing RAG Agent: chat completion, single-turn, not streaming =============" validate_microservice stop_agent_docker echo "=============================================" # # # test react_llama - start_react_llama_agent_service + start_react_llama_agent_service # also starts redis db echo "===========Testing ReAct Llama =============" + # chat completion single-turn not streaming + echo "=============Testing ReAct Llama: chat completion, 
single-turn, not streaming =============" validate_microservice - stop_agent_docker - echo "=============================================" + # multi-turn streaming + echo "=============Testing ReAct Llama: chat completion, multi-turn streaming =============" + validate_microservice_multi_turn_streaming - # # # test react - start_react_agent_service - echo "=============Testing ReAct Langchain=============" - validate_microservice_streaming + # test assistant api multi-turn streaming + echo "=============Testing ReAct Llama: assistant api, multi-turn streaming, persistent =============" validate_assistant_api stop_agent_docker echo "=============================================" - # # test sql agent - echo "=============Testing SQL llama=============" - validate_sql_agent - stop_docker - echo "=============================================" - echo "===========Testing Plan Execute VLLM Llama3.1 =============" - start_vllm_service - start_planexec_agent_service_vllm - validate_microservice - stop_agent_docker - stop_vllm_docker - echo "=============================================" - - echo "===========Testing ReAct Langgraph VLLM llama3.1 =============" - export model_parser=llama3_json - start_vllm_auto_tool_choice_service - start_react_langgraph_agent_service_vllm - validate_microservice + # # # test sql agent + echo "=============Testing SQL llama: chat completion, single-turn, not streaming =============" + validate_sql_agent stop_agent_docker - stop_vllm_docker echo "=============================================" - # # ==================== OpenAI tests ==================== - # start_ragagent_agent_service_openai - # echo "=============Testing RAG Agent OpenAI=============" - # validate_microservice - # stop_agent_docker - # echo "=============================================" - - # start_react_langgraph_agent_service_openai - # echo "===========Testing ReAct Langgraph OpenAI =============" - # validate_microservice - # stop_agent_docker - # echo "=============================================" - - # start_planexec_agent_service_openai - # echo "===========Testing Plan Execute OpenAI =============" - # validate_microservice - # stop_agent_docker + # # echo "===========Testing Plan Execute VLLM Llama3.1 =============" + # # start_vllm_service + # # start_planexec_agent_service_vllm + # # validate_microservice + # # stop_agent_docker + # # stop_vllm_docker + # # echo "=============================================" + + # # echo "===========Testing ReAct Langgraph VLLM llama3.1 =============" + # # export model_parser=llama3_json + # # start_vllm_auto_tool_choice_service + # # start_react_langgraph_agent_service_vllm + # # validate_microservice + # # stop_agent_docker + # # stop_vllm_docker + # # echo "=============================================" + + # # # ==================== OpenAI tests ==================== + # # start_ragagent_agent_service_openai + # # echo "=============Testing RAG Agent OpenAI=============" + # # validate_microservice + # # stop_agent_docker + # # echo "=============================================" + + # # start_react_langgraph_agent_service_openai + # # echo "===========Testing ReAct Langgraph OpenAI =============" + # # validate_microservice + # # stop_agent_docker + # # echo "=============================================" + + # # start_planexec_agent_service_openai + # # echo "===========Testing Plan Execute OpenAI =============" + # # validate_microservice + # # stop_agent_docker stop_docker