Compel to text #93 (Open)
HatmanStack opened this issue Jun 15, 2024 · 3 comments

Comments

HatmanStack commented Jun 15, 2024

I'm playing around a bit with compel and the HF Inference API for long prompts (150+ tokens). One thing the API expects is text as input, so I'm trying to convert cosine similarities between the token and text embeddings back into text. Am I headed in the right direction or is this a waste of time? Code:

import torch
from torch.nn.functional import normalize
from compel import Compel
from transformers import AutoTokenizer, CLIPTextModel

tokenizer = AutoTokenizer.from_pretrained(item.modelID, subfolder="tokenizer")
clip = CLIPTextModel.from_pretrained(item.modelID, subfolder="text_encoder")

compel = Compel(tokenizer=tokenizer, text_encoder=clip)
conditioning = compel.build_conditioning_tensor(prompt)

# CLIP's input (token) embedding matrix: [vocab_size, hidden_dim]
token_embeddings = clip.get_input_embeddings().weight
normalized_token_embeddings = normalize(token_embeddings, dim=1)

# Flatten the conditioning tensor to [num_positions, hidden_dim] so each position
# can be compared against every row of the token embedding matrix
normalized_conditioning = normalize(conditioning.view(-1, normalized_token_embeddings.shape[1]), dim=1)
cosine_similarities = torch.mm(normalized_conditioning, normalized_token_embeddings.t())

# For each position, take the token whose input embedding is most similar
max_similarity_indices = torch.argmax(cosine_similarities, dim=1)

# Convert the token indices back into text
text = tokenizer.batch_decode(max_similarity_indices.tolist(), skip_special_tokens=True)
promptString = " ".join(text)
damian0815 (Owner) commented Jun 20, 2024

Hmm, not sure exactly what you're trying to achieve, but I don't think what you're doing will help - the raw input_embedding matrix isn't useful as-is; it needs to be selectively pushed through the whole CLIP encoder (which is what the token_ids do - they index into the input_embedding matrix).
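
A minimal sketch of that distinction, assuming a stock Hugging Face CLIPTextModel (the checkpoint name and prompt below are placeholders, not from the thread): the conditioning tensor compel returns is the output of the full encoder stack, not rows of the input embedding matrix, so nearest-neighbour lookups against that matrix compare vectors from two different spaces.

import torch
from transformers import AutoTokenizer, CLIPTextModel

model_id = "openai/clip-vit-large-patch14"  # example checkpoint, swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
clip = CLIPTextModel.from_pretrained(model_id).eval()

token_ids = tokenizer("a photograph of an astronaut", return_tensors="pt").input_ids

with torch.no_grad():
    # 1) what the token_ids index into: rows of the raw input embedding matrix
    input_embeds = clip.get_input_embeddings()(token_ids)   # [1, seq_len, 768]
    # 2) what actually conditions generation: those rows pushed through every
    #    transformer layer of the CLIP text encoder
    encoder_out = clip(token_ids).last_hidden_state         # [1, seq_len, 768]

# Same shape, very different vectors, so matching (2) against the matrix
# behind (1) is comparing across spaces.
print(torch.nn.functional.cosine_similarity(
    input_embeds.flatten(1), encoder_out.flatten(1)).item())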

You might find this interesting though: https://github.com/YuxinWenRick/hard-prompts-made-easy . It's a system for simplifying/adjusting prompts by learning more efficient ways of prompting the same thing - e.g. you can convert a 75-token prompt into a 20-token prompt that produces a similar CLIP embedding. Maybe you can use that to optimize your 150-token prompts down to 75.
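
Not the linked repo's API - just an assumed, quick way to sanity-check whether a compressed prompt lands near the original in CLIP space, using the pooled/projected text embedding (model name and the two prompts are placeholders):

import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection

model_id = "openai/clip-vit-large-patch14"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
clip = CLIPTextModelWithProjection.from_pretrained(model_id).eval()

def clip_embed(prompt: str) -> torch.Tensor:
    ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        return clip(ids).text_embeds  # pooled, projected text embedding

long_prompt = "..."   # the original 150-token prompt
short_prompt = "..."  # a candidate compressed prompt
print(torch.nn.functional.cosine_similarity(
    clip_embed(long_prompt), clip_embed(short_prompt)).item())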

HatmanStack (Author) commented:

I was stumbling in the dark. The results were lackluster, just a vague semblance of the original prompt (which is still kind of amazing, tbh). I thought investing more time might give me some type of path forward. Your suggestion intuitively seems like it would get better results, although my brain keeps itching with ideas about sentence structure and weighting words like in Compel - anything to get better results than the garbled mess I was working with. Tokens are fun.
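
For reference, compel's weighting syntax (as I understand it from the README; the prompt here is arbitrary) drops straight into the call already used above:

from compel import Compel

# Reuses the tokenizer / text_encoder objects loaded in the first snippet.
compel = Compel(tokenizer=tokenizer, text_encoder=clip)

# Each trailing "+" upweights the preceding term and each "-" downweights it;
# they can be stacked, e.g. "++".
conditioning = compel.build_conditioning_tensor("a misty++ forest landscape, distant-- mountains")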

damian0815 (Owner) commented:

Right, yeah. Part of the problem is that the CLIP text encoder is basically a black box, and the other part is that the >75-token hack is, well, a hack. In my experience you can get just as good "quality" by tweaking your short prompt (e.g. with a thesaurus website, just try swapping out words for other similar words) as by writing a 150-token prompt.
