
[Issue]: Error executing verb "select" in create_final_entities: "['type', 'description'] not in index" #926

Closed
cco9ktb opened this issue Aug 14, 2024 · 1 comment
Labels: community_support (Issue handled by community members)

Comments
cco9ktb commented Aug 14, 2024

Is there an existing issue for this?

  • I have searched the existing issues
  • I have checked #657 to validate if my issue is covered by community support

Describe the issue

An error occurred when the workflow executed create_final_entities.
Error executing verb "select" in create_final_entities: "['type', 'description'] not in index"

The problem seems to be with the input data. To investigate, I modified select.py to print the columns it receives. The modified select.py:

@verb(name="select", treats_input_tables_as_immutable=True)
def select(
    input: VerbInput,
    columns: list[str],
    **_kwargs: dict,
) -> VerbResult:
    """Select verb implementation."""
    input_table = input.get_input()
    # added for debugging: compare the table's actual columns with the requested ones
    print("Columns that input table have:", input_table.columns)
    print("Columns needed:", columns)
    output = cast(Table, input_table[columns])
    return create_verb_result(output)
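
As a side note, the same debugging can be narrowed to report only the absent columns; a minimal variant would add one line inside the function above, just before the failing select:

    # list only the requested columns that are absent from the input table
    missing = [c for c in columns if c not in input_table.columns]
    print("Columns missing from input:", missing)  # here: ['type', 'description']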

With the prints above, the following console output is obtained:

🚀 Reading settings from ragtest/settings.yaml
Columns that input table have: Index(['level', 'title', 'source_id', 'degree',
'human_readable_id', 'id',
       'graph_embedding', 'cluster'],
      dtype='object')
Columns needed: ['id', 'title', 'type', 'description', 'human_readable_id',
'graph_embedding', 'source_id']
❌ create_final_entities
None
⠦ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (1 filtered) ━ 100% … 0…
└── create_final_entities
❌ Errors occurred during the pipeline run, see logs for more details.
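
In other words, pandas raises this KeyError whenever a column selection asks for labels the frame does not have. A minimal standalone reproduction (toy data, unrelated to the pipeline):

    import pandas as pd

    df = pd.DataFrame({"id": [1], "title": ["x"]})
    df[["id", "title", "type", "description"]]
    # KeyError: "['type', 'description'] not in index"

So the table handed to create_final_entities was built without type and description columns in the first place.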

Steps to reproduce

The error occurs after running python -m graphrag.index --root ./ragtest --resume 20240813-090332

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: mistral-nemo:12b-instruct-2407-q8_0
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 4096
  request_timeout: 600.0
  api_base: http://localhost:11434/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 10 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-embed-text
    api_base: http://localhost:11434/api
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1000
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

Here is indexing-engine.log:

15:42:31,698 graphrag.index.create_pipeline_config INFO skipping workflows 
15:42:31,705 graphrag.index.run INFO Running pipeline
15:42:31,705 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at ragtest/output/20240813-090332/artifacts
15:42:31,706 graphrag.index.input.load_input INFO loading input from root_dir=input
15:42:31,706 graphrag.index.input.load_input INFO using file storage for input
15:42:31,706 graphrag.index.storage.file_pipeline_storage INFO search ragtest/input for files matching .*\.txt$
15:42:31,706 graphrag.index.input.text INFO found text files from input, found [('Three Kingdom.txt', {})]
15:42:31,713 graphrag.index.input.text INFO Found 1 files, loading 1
15:42:31,714 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_base_extracted_entities', 'create_summarized_entities', 'create_base_entity_graph', 'create_final_entities', 'create_final_nodes', 'create_final_communities', 'join_text_units_to_entity_ids', 'create_final_relationships', 'join_text_units_to_relationship_ids', 'create_final_community_reports', 'create_final_text_units', 'create_base_documents', 'create_final_documents']
15:42:31,714 graphrag.index.run INFO Final # of rows loaded: 1
15:42:31,808 graphrag.index.run INFO Running workflow: create_base_text_units...
15:42:31,808 graphrag.index.run INFO Skipping create_base_text_units because it already exists
15:42:31,901 graphrag.index.run INFO Running workflow: create_base_extracted_entities...
15:42:31,902 graphrag.index.run INFO Skipping create_base_extracted_entities because it already exists
15:42:31,995 graphrag.index.run INFO Running workflow: create_summarized_entities...
15:42:31,995 graphrag.index.run INFO Skipping create_summarized_entities because it already exists
15:42:32,87 graphrag.index.run INFO Running workflow: create_base_entity_graph...
15:42:32,87 graphrag.index.run INFO Skipping create_base_entity_graph because it already exists
15:42:32,179 graphrag.index.run INFO Running workflow: create_final_entities...
15:42:32,180 graphrag.index.run INFO dependencies for create_final_entities: ['create_base_entity_graph']
15:42:32,180 graphrag.index.run INFO read table from storage: create_base_entity_graph.parquet
15:42:32,185 datashaper.workflow.workflow INFO executing verb unpack_graph
15:42:32,188 datashaper.workflow.workflow INFO executing verb rename
15:42:32,189 datashaper.workflow.workflow INFO executing verb select
15:42:32,193 datashaper.workflow.workflow ERROR Error executing verb "select" in create_final_entities: "['type', 'description'] not in index"
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/datashaper/engine/verbs/select.py", line 27, in select
    output = cast(Table, input_table[columns])
                         ~~~~~~~~~~~^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py", line 4108, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['type', 'description'] not in index"
15:42:32,195 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "select" in create_final_entities: "['type', 'description'] not in index" details=None
15:42:32,196 graphrag.index.run ERROR error running workflow create_final_entities
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/site-packages/graphrag/index/run.py", line 325, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/datashaper/engine/verbs/select.py", line 27, in select
    output = cast(Table, input_table[columns])
                         ~~~~~~~~~~~^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py", line 4108, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['type', 'description'] not in index"
15:42:32,196 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None

Here is logs.json:

{"type": "error", "data": "Error executing verb \"select\" in create_final_entities: \"['type', 'description'] not in index\"", "stack": "Traceback (most recent call last):\n  File \"/opt/anaconda3/lib/python3.12/site-packages/datashaper/workflow/workflow.py\", line 410, in _execute_verb\n    result = node.verb.func(**verb_args)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/anaconda3/lib/python3.12/site-packages/datashaper/engine/verbs/select.py\", line 27, in select\n    output = cast(Table, input_table[columns])\n                         ~~~~~~~~~~~^^^^^^^^^\n  File \"/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py\", line 4108, in __getitem__\n    indexer = self.columns._get_indexer_strict(key, \"columns\")[1]\n              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/base.py\", line 6200, in _get_indexer_strict\n    self._raise_if_missing(keyarr, indexer, axis_name)\n  File \"/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/base.py\", line 6252, in _raise_if_missing\n    raise KeyError(f\"{not_found} not in index\")\nKeyError: \"['type', 'description'] not in index\"\n", "source": "\"['type', 'description'] not in index\"", "details": null}
{"type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n  File \"/opt/anaconda3/lib/python3.12/site-packages/graphrag/index/run.py\", line 325, in run_pipeline\n    result = await workflow.run(context, callbacks)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/anaconda3/lib/python3.12/site-packages/datashaper/workflow/workflow.py\", line 369, in run\n    timing = await self._execute_verb(node, context, callbacks)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/anaconda3/lib/python3.12/site-packages/datashaper/workflow/workflow.py\", line 410, in _execute_verb\n    result = node.verb.func(**verb_args)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/anaconda3/lib/python3.12/site-packages/datashaper/engine/verbs/select.py\", line 27, in select\n    output = cast(Table, input_table[columns])\n                         ~~~~~~~~~~~^^^^^^^^^\n  File \"/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py\", line 4108, in __getitem__\n    indexer = self.columns._get_indexer_strict(key, \"columns\")[1]\n              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/base.py\", line 6200, in _get_indexer_strict\n    self._raise_if_missing(keyarr, indexer, axis_name)\n  File \"/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/base.py\", line 6252, in _raise_if_missing\n    raise KeyError(f\"{not_found} not in index\")\nKeyError: \"['type', 'description'] not in index\"\n", "source": "\"['type', 'description'] not in index\"", "details": null}

Additional Information

  • GraphRAG Version: 0.3.0
  • Operating System: MacOS 14.6.1
  • Python Version: 3.12
  • Related Issues:
cco9ktb added the triage label (Default label assignment, indicates new issue needs reviewed by a maintainer) Aug 14, 2024
natoverse (Collaborator) commented

It seems like the LLM response is not including the required information for each entity during extraction, so downstream processing of the data frame fails. You might try tuning the prompt to align better with the model you are using.

Linking this to #657 for alternate models.
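
To confirm, you can inspect the intermediate artifact that feeds create_final_entities. A minimal sketch, assuming the GraphRAG 0.3 artifact layout in which create_base_entity_graph.parquet stores the graph as GraphML text in a clustered_graph column:

    import networkx as nx
    import pandas as pd

    # assumption: artifact path and column name follow the GraphRAG 0.3 layout
    df = pd.read_parquet(
        "ragtest/output/20240813-090332/artifacts/create_base_entity_graph.parquet"
    )
    graph = nx.parse_graphml(df["clustered_graph"].iloc[0])

    # count nodes missing the attributes that create_final_entities selects
    missing = [
        n for n, attrs in graph.nodes(data=True)
        if "type" not in attrs or "description" not in attrs
    ]
    print(f"{len(missing)} of {graph.number_of_nodes()} nodes lack type/description")

If every node lacks these attributes, unpack_graph has no per-node type/description values to emit as columns, which matches the select failure above.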

natoverse closed this as not planned (won't fix, can't repro, duplicate, stale) Aug 14, 2024
natoverse added the community_support label and removed the triage label Aug 14, 2024