-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate uniprot info into initial prompt #396
base: development
Are you sure you want to change the base?
Conversation
dummy_model = LLMIntegration(model_name, api_key="lorem", load_tools=False) | ||
tokens = dummy_model.estimate_tokens( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit hacky ;-)
please refactor such that the estimate_tokens
becomes static and you can just call LLMIntegration.estimate_tokens
here:
def estimate_tokens(
model: str = None, messages: List[Dict[str, str]], average_chars_per_token: float = 3.6
) -> float:
...
) | ||
st.markdown(f"Total tokens: {tokens:.0f}") | ||
with c5: | ||
st.checkbox( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need the extra checkbox? Can I not just click "update prompt"? (set state[StateKeys.INTEGRATE_UNIPROT]=True
after button click)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need a way to also revert to an initial prompt that does not contain the uniprot information. Uncheck the box, then update.
): | ||
"""Get the initial prompt for the LLM model.""" | ||
group1 = parameter_dict["group1"] | ||
group2 = parameter_dict["group2"] | ||
column = parameter_dict["column"] | ||
if uniprot_info: | ||
uniprot_instructions = ( | ||
f"We have already retireved relevant information from Uniprot for these proteins:{os.linesep}{os.linesep}{uniprot_info}{os.linesep}{os.linesep}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and other prompts) are using interpunctation quite sparsely. My knowledge may be outdated, but I learned that the more structured a prompt, the better. Shall we add some backticks or quoatation marks?
just an example:
===
Uniprot information for protein "VCL"
- protein name: `Vinculin (or Metavinculin)`
- entryType of this protein is `UniProtKB reviewed (Swiss-Prot)`.
- primaryAccession of this protein is `P18206`.
- secondaryAccessions of this protein is `Q16450, Q5SWX2, Q7Z3B8, Q8IXU7`.
- XXXXX is `Actin filament (F-actin)-binding protein involved in cell-matrix adhesion and cell-cell adhesion. Regulates cell-surface E-cadherin expression and potentiates mechanosensing by the E-cadherin complex. May also play important roles in cell morphology and locomotion`
===
===
Uniprot information for protein "XYZ"
instead of
The protein VCL is called Vinculin (or Metavinculin).
Uniprot information:
- entryType of this protein is UniProtKB reviewed (Swiss-Prot).
- primaryAccession of this protein is P18206.
- secondaryAccessions of this protein is Q16450, Q5SWX2, Q7Z3B8, Q8IXU7.
- Actin filament (F-actin)-binding protein involved in cell-matrix adhesion and cell-cell adhesion. Regulates cell-surface E-cadherin expression and potentiates mechanosensing by the E-cadherin complex. May also play important roles in cell morphology and locomotion
(maybe a different PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will consider this when I start the next PR on prompt engineering.
uniprot_instructions = ( | ||
"You have the ability to retrieve curated information from Uniprot about these proteins. " | ||
"Please do so for individual proteins if you have little information about a protein or find a protein particularly important in the specific context." | ||
) | ||
return ( | ||
f"We've recently identified several proteins that appear to be differently regulated in cells " | ||
f"when comparing {group1} and {group2} in the {column} group. " | ||
f"From our proteomics experiments, we know that the following ones are upregulated: {', '.join(upregulated_genes)}.{os.linesep}{os.linesep}" | ||
f"Here is the list of proteins that are downregulated: {', '.join(downregulated_genes)}.{os.linesep}{os.linesep}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also prompt engineering:
Here is a comma-separated list of proteins that are downregulated: `{', '.join(downregulated_genes)}.{os.linesep}{os.linesep}`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will consider this when I start the next PR on prompt engineering.
): | ||
"""Get the initial prompt for the LLM model.""" | ||
group1 = parameter_dict["group1"] | ||
group2 = parameter_dict["group2"] | ||
column = parameter_dict["column"] | ||
if uniprot_info: | ||
uniprot_instructions = ( | ||
f"We have already retireved relevant information from Uniprot for these proteins:{os.linesep}{os.linesep}{uniprot_info}{os.linesep}{os.linesep}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"proteins" or "genes"? :)
@@ -24,6 +24,9 @@ def protein_selector(df: pd.DataFrame, title: str, state_key: str) -> List[str]: | |||
selected_proteins (List[str]): A list of selected proteins. | |||
""" | |||
st.write(title) | |||
if len(df) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we move this check to 06_LLM.py
? e.g. right after st.markdown("##### Genes of interest")
? I feel here it's a bit hidden
def display_uniprot( | ||
regulated_genes_dict, | ||
feature_to_repr_map, | ||
model_name: str = Models.OLLAMA_31_70B, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't specify a default here
New:
Note: