Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate uniprot info into initial prompt #396

Open
wants to merge 6 commits into
base: development
Choose a base branch
from

Conversation

JuliaS92
Copy link
Collaborator

New:

  • token estimate for uniprot info of all included proteins
  • switch uniprot related instructions depending on including/excluding this information

Note:

  • prompt needs to be specifically updated through a button click

@JuliaS92 JuliaS92 requested review from mschwoer and boopthesnoot and removed request for mschwoer January 21, 2025 10:43
Base automatically changed from chat_optimization to development January 22, 2025 08:35
Comment on lines +221 to +222
dummy_model = LLMIntegration(model_name, api_key="lorem", load_tools=False)
tokens = dummy_model.estimate_tokens(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit hacky ;-)
please refactor such that the estimate_tokens becomes static and you can just call LLMIntegration.estimate_tokens here:

    def estimate_tokens(
       model: str  = None, messages: List[Dict[str, str]], average_chars_per_token: float = 3.6
    ) -> float:
...

)
st.markdown(f"Total tokens: {tokens:.0f}")
with c5:
st.checkbox(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the extra checkbox? Can I not just click "update prompt"? (set state[StateKeys.INTEGRATE_UNIPROT]=True after button click)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need a way to also revert to an initial prompt that does not contain the uniprot information. Uncheck the box, then update.

):
"""Get the initial prompt for the LLM model."""
group1 = parameter_dict["group1"]
group2 = parameter_dict["group2"]
column = parameter_dict["column"]
if uniprot_info:
uniprot_instructions = (
f"We have already retireved relevant information from Uniprot for these proteins:{os.linesep}{os.linesep}{uniprot_info}{os.linesep}{os.linesep}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and other prompts) are using interpunctation quite sparsely. My knowledge may be outdated, but I learned that the more structured a prompt, the better. Shall we add some backticks or quoatation marks?

just an example:

===
Uniprot information for protein "VCL"
- protein name: `Vinculin (or Metavinculin)`
- entryType of this protein is `UniProtKB reviewed (Swiss-Prot)`.
- primaryAccession of this protein is `P18206`.
- secondaryAccessions of this protein is `Q16450, Q5SWX2, Q7Z3B8, Q8IXU7`.
- XXXXX is `Actin filament (F-actin)-binding protein involved in cell-matrix adhesion and cell-cell adhesion. Regulates cell-surface E-cadherin expression and potentiates mechanosensing by the E-cadherin complex. May also play important roles in cell morphology and locomotion`
===

===
Uniprot information for protein "XYZ"

instead of

The protein VCL is called Vinculin (or Metavinculin).
Uniprot information:
- entryType of this protein is UniProtKB reviewed (Swiss-Prot).
- primaryAccession of this protein is P18206.
- secondaryAccessions of this protein is Q16450, Q5SWX2, Q7Z3B8, Q8IXU7.
- Actin filament (F-actin)-binding protein involved in cell-matrix adhesion and cell-cell adhesion. Regulates cell-surface E-cadherin expression and potentiates mechanosensing by the E-cadherin complex. May also play important roles in cell morphology and locomotion

(maybe a different PR)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will consider this when I start the next PR on prompt engineering.

uniprot_instructions = (
"You have the ability to retrieve curated information from Uniprot about these proteins. "
"Please do so for individual proteins if you have little information about a protein or find a protein particularly important in the specific context."
)
return (
f"We've recently identified several proteins that appear to be differently regulated in cells "
f"when comparing {group1} and {group2} in the {column} group. "
f"From our proteomics experiments, we know that the following ones are upregulated: {', '.join(upregulated_genes)}.{os.linesep}{os.linesep}"
f"Here is the list of proteins that are downregulated: {', '.join(downregulated_genes)}.{os.linesep}{os.linesep}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also prompt engineering:

Here is a comma-separated list of proteins that are downregulated: `{', '.join(downregulated_genes)}.{os.linesep}{os.linesep}`

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will consider this when I start the next PR on prompt engineering.

):
"""Get the initial prompt for the LLM model."""
group1 = parameter_dict["group1"]
group2 = parameter_dict["group2"]
column = parameter_dict["column"]
if uniprot_info:
uniprot_instructions = (
f"We have already retireved relevant information from Uniprot for these proteins:{os.linesep}{os.linesep}{uniprot_info}{os.linesep}{os.linesep}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"proteins" or "genes"? :)

@@ -24,6 +24,9 @@ def protein_selector(df: pd.DataFrame, title: str, state_key: str) -> List[str]:
selected_proteins (List[str]): A list of selected proteins.
"""
st.write(title)
if len(df) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we move this check to 06_LLM.py? e.g. right after st.markdown("##### Genes of interest")? I feel here it's a bit hidden

def display_uniprot(
regulated_genes_dict,
feature_to_repr_map,
model_name: str = Models.OLLAMA_31_70B,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't specify a default here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants