
add search command #157

Merged · 9 commits merged into main on Oct 21, 2023
Conversation

granawkins (Member)

This PR implements a /search command using embeddings, along with some follow-ups from #144. To use it, add the --use-embedding flag.

@biobootloader (Member) left a comment

Very cool! I have a lot of thoughts, some of these would be follow up PRs:

  1. The first time I ran it I got Embedding batch 1/13... to Embedding batch 13/13... printed out. The second time it just printed Embedding batch 1/1.... I understand the first call had to embed everything, but why did the second call still have a batch? Maybe the batch algorithm always returns at least one batch that might be empty?

  2. I don't think we are tracking embedding costs with the cost logger. I know they are very low, but it might be good to show what they are, so users know? Relatedly, I worry someone will run this on a massive codebase and it'll actually cost them a lot. Like if they have tons of json / data files checked in to the repo or something. Should we have some warning if the number of batches is actually crazy, and it'll cost more than like $1?

  3. What do we do if a file is too big to embed all at once?

  4. When we split embedding sections from whole files to smaller parts it'll be great to show the code in the search results. Formatting that might be tricky. In the VS Code extension we'd make it easy to jump to those files / sections

@@ -229,7 +229,7 @@ def run_cli():
help="Exclude the file structure/syntax map from the system prompt",
)
parser.add_argument(
-        "--embedding",
+        "--use-embedding",

does --use-embeddings make more sense?

@granawkins (Member, Author)

We talked about these but for the record:

  1. The first time I ran it I got Embedding batch 1/13... to Embedding batch 13/13... printed out. The second time it just printed Embedding batch 1/1.... I understand the first call had to embed everything, but why did the second call still have a batch? Maybe the batch algorithm always returns at least one batch that might be empty?

It's embedding the prompt. So if it's a new prompt (hash not in db), there'll always be at least a batch of 1.
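A minimal sketch of that cache check (hypothetical names — the real implementation hashes into a database, not a dict — but it shows why a never-seen prompt always contributes at least one batch):

```python
import hashlib

def texts_to_embed(
    prompt: str, file_texts: list[str], cache: dict[str, list[float]]
) -> list[str]:
    """Collect texts whose embeddings aren't cached yet.

    The prompt itself is embedded too, so a new prompt always
    contributes at least one text, hence the "batch 1/1" on repeat runs.
    """
    pending = []
    for text in [prompt] + file_texts:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in cache:
            pending.append(text)
    return pending
```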

  2. I don't think we are tracking embedding costs with the cost logger. I know they are very low, but it might be good to show what they are, so users know? Relatedly, I worry someone will run this on a massive codebase and it'll actually cost them a lot. Like if they have tons of json / data files checked in to the repo or something. Should we have some warning if the number of batches is actually crazy, and it'll cost more than like $1?

I've set this up as you said: give option to ignore embeddings if the cost > $1, otherwise display the cost (with 4 decimal places) afterwards.

  3. What do we do if a file is too big to embed all at once?

These are ignored for now, but I'll make sure they're included when we implement file-splitting.

  4. When we split embedding sections from whole files to smaller parts it'll be great to show the code in the search results. Formatting that might be tricky. In the VS Code extension we'd make it easy to jump to those files / sections

Agreed!
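The cost guard described above could look roughly like this (the pricing constant and function names are illustrative, not the project's actual values):

```python
# Hypothetical numbers for illustration; real pricing depends on the model.
EMBEDDING_COST_PER_1K_TOKENS = 0.0001  # dollars
COST_WARNING_THRESHOLD = 1.00  # dollars

def embedding_cost(total_tokens: int) -> float:
    """Estimated dollar cost of embedding total_tokens tokens."""
    return total_tokens / 1000 * EMBEDDING_COST_PER_1K_TOKENS

def should_warn(total_tokens: int) -> bool:
    """True if the user should be asked before embedding proceeds."""
    return embedding_cost(total_tokens) > COST_WARNING_THRESHOLD

def format_cost(cost: float) -> str:
    """Display with 4 decimal places, as described in the reply above."""
    return f"${cost:.4f}"
```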

@jakethekoenig (Member) left a comment

Looks good to me, just a few small things.

One other note: I notice with small searches, like when I searched for just "parser", the __init__ files score very highly. I wonder if we should not embed very small files, say ones under 10 characters?

Other than that the searches I tried seemed to surface pretty relevant files.
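The small-file filter floated above could be a one-line predicate over the candidate files (the 10-character threshold is the number suggested in the review, not a committed value):

```python
MIN_EMBED_CHARS = 10  # threshold suggested in review; illustrative only

def embeddable(files: dict[str, str]) -> dict[str, str]:
    """Drop files whose contents are too short to embed meaningfully,
    e.g. empty __init__.py files that otherwise score highly."""
    return {
        path: text
        for path, text in files.items()
        if len(text.strip()) >= MIN_EMBED_CHARS
    }
```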

await stream.send(str(e), color="red")
return

for i, (feature, score) in enumerate(results):

I think we should 1-index instead of 0-index.


for i, (feature, score) in enumerate(results):
_i = f"{i}: " if i < 10 else f"{i}:"
await stream.send(f"{_i} {score:.3f} | {feature.path}")

You can use :2 to force i to take up a fixed width:

await stream.send(f"{i:2} {score:.3f} | {feature.path}")

I think that'd be a bit cleaner.
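Combining the two suggestions — 1-indexing via `enumerate(..., start=1)` and fixed-width formatting with `:2` — would give output like the following (placeholder data; the real code iterates `(feature, score)` pairs):

```python
results = [("path/to/file.py", 0.912), ("other.py", 0.871)]  # placeholder data

lines = [
    f"{i:2}: {score:.3f} | {path}"
    for i, (path, score) in enumerate(results, start=1)
]
for line in lines:
    print(line)
# prints:
#  1: 0.912 | path/to/file.py
#  2: 0.871 | other.py
```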

await stream.send("\nShow More results? ")
if not await ask_yes_no(default_yes=True):
break
await stream.send("Search complete", color="green")

I think from a UX perspective it's better not to send this message. The user will know the search is over. We don't send a similar message when they ask a general question and the model responds.

@@ -171,6 +171,7 @@ async def get_model_response(self) -> list[FileEdit]:
conversation_history = "\n".join([m["content"] for m in messages_snapshot])
tokens = count_tokens(conversation_history, self.model)
response_buffer = 1000
print()

Should be removed

SEARCH_RESULT_BATCH_SIZE = 10


class SearchCommand(Command, command_name="search"):

Could we add a SearchCommandTest?

) -> list[tuple[CodeFile, float]]:
"""Return the top n features that are most similar to the query."""
if not self.settings.use_embeddings:
raise UserError(

I would prefer if we just sent a red error rather than crashed

@PCSwingle (Member) left a comment

I agree with all of Jake's comments and left one of my own, but looks good to merge after all of that is done! Thanks for adding this!

@granawkins (Member, Author) commented Oct 21, 2023

Thanks for the great feedback, I think I hit everything.

One other note: I notice with small searches, like when I searched for just "parser" the init files score very highly. I wonder if we should not embed very small files. Say ones under 10 characters?

Hmm, the git_root-relative path is included in the embedding. I think returning empty files, as it did, will be useful in some cases. Moving this convo to Slack.

@granawkins granawkins merged commit 60b7458 into main Oct 21, 2023
8 checks passed