
Splitting code into several files #71

Open
gkruger1 opened this issue Sep 8, 2024 · 11 comments
Labels
enhancement (New feature or request), needs discussion (Issues needing discussion and a decision to be made before action can be taken)

Comments

@gkruger1

gkruger1 commented Sep 8, 2024

Hi

I've been using this for a while now (also replied to you on Reddit some time ago) and it's really saving me a lot of time. Would it be possible to add the ability to generate more than one repopack file for a specific codebase? At a certain number of lines of code, the LLMs tend not to be able to see the rest. It would be great if I can specify the number of files (or perhaps number of lines per file) it should split the complete codebase into. Best would be if the code of a specific file is not split when creating the separate files.

Thanks for your work on this.

@yamadashy
Owner

yamadashy commented Sep 9, 2024

Hi, @gkruger1
Thank you for your interesting suggestion! I'm glad to hear that Repopack has been helpful in saving you time.

Your idea of splitting the codebase into multiple files is intriguing. I appreciate you sharing this thought-provoking concept.

Here's an initial outline of how I plan to approach this:

  1. I'll add a new configuration option to specify the maximum number of files per output.
  2. I'll maintain the current file ordering when splitting the content.
  3. I'll start a new output file once the specified file count is reached.
  4. I'll name the output files based on the original output.filePath, using the format <fileName>-<number>.<extension>.

For example, if output.filePath is set to "repopack-output.txt" and the maximum files per output is set to 3, I would generate:

  • repopack-output-1.txt
  • repopack-output-2.txt
  • repopack-output-3.txt

Additionally, to ensure each file is self-contained and fully functional:

  1. I'll make sure each output file contains identical content except for the actual file contents. This includes the file summary, repository structure, and any other metadata.
  2. I'll add information to the summary section of each file, indicating that the output is split and specifying which part of the split this file represents (e.g., "Part 1 of 3").

This approach should provide a straightforward way to split large codebases while maintaining the simplicity and current functionality of Repopack.
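The numbered steps above could be sketched roughly like this. This is a hypothetical TypeScript sketch only; PackedFile, splitByFileCount, and outputName are illustrative names, not Repopack's actual API:

```typescript
// Illustrative types and names, not Repopack's real internals.
interface PackedFile {
  path: string;
  content: string;
}

// Step 2–3: keep the current file ordering and start a new output
// once the per-output file count is reached.
function splitByFileCount(
  files: PackedFile[],
  maxFilesPerOutput: number
): PackedFile[][] {
  const outputs: PackedFile[][] = [];
  for (let i = 0; i < files.length; i += maxFilesPerOutput) {
    outputs.push(files.slice(i, i + maxFilesPerOutput));
  }
  return outputs;
}

// Step 4: derive "repopack-output-1.txt", "repopack-output-2.txt", ...
// from the configured output.filePath.
function outputName(filePath: string, part: number): string {
  const dot = filePath.lastIndexOf(".");
  return dot === -1
    ? `${filePath}-${part}`
    : `${filePath.slice(0, dot)}-${part}${filePath.slice(dot)}`;
}
```

For example, seven packed files with a limit of 3 per output would yield three parts of sizes 3, 3, and 1, named repopack-output-1.txt through repopack-output-3.txt.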

What do you think about this implementation? Do you have any concerns or additional suggestions?

Thank you again for your valuable input and for using Repopack. Your feedback helps me continually improve the tool.

@gkruger1
Author

gkruger1 commented Sep 9, 2024

Hi.

Your suggestion is excellent! Also, thanks for explaining your implementation procedure. You've thought of several things I didn't consider, like placing the file summary, repository structure, and any other metadata in each file, and also indicating "Part 1 of #".

Is it possible to also add an alternative option to split the codebase into multiple files based on a certain number of lines of code but keeping each code file intact? When I split the repopack manually, I sometimes find that the LLM can't see a certain number of lines at the end of the document and I have to then take out that last code file.

Sonnet 3.5 would, for example, give me this feedback:
"...the complete surrounding code context is not fully visible. This partial view of the code makes it difficult to determine if the entire file content is present in the text file you've provided."

Thanks again.

@yamadashy
Owner

Thank you for your detailed feedback and additional suggestion!

After considering your feedback, here's what I'm thinking:

  1. I'll add two new configuration options:

    • maxLinesPerOutput: Maximum number of lines per output file
    • maxFilesPerOutput: Maximum number of output files
  2. The splitting logic will work as follows:

    • First, it will split the content based on maxLinesPerOutput, ensuring that individual files are not split in the middle.
    • If the number of split files exceeds maxFilesPerOutput, it will limit the output to that number of files.
  3. File naming will follow the previously discussed format: <fileName>-<number>.<extension>.

Here's an example of how you might use these options in the configuration:

{
  "output": {
    "maxLinesPerOutput": 10000,
    "maxFilesPerOutput": 100
  }
}
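The two-step splitting logic described above might look roughly like this. Again, a hypothetical sketch with illustrative names (PackedFile, splitByLines, applyOutputCap), not the actual implementation:

```typescript
interface PackedFile {
  path: string;
  content: string;
}

const countLines = (content: string): number => content.split("\n").length;

// Greedy grouping: whole files are added to the current output until
// the next file would push it past maxLinesPerOutput, then a new
// output starts. A single file larger than the limit still gets its
// own output rather than being cut in the middle.
function splitByLines(
  files: PackedFile[],
  maxLinesPerOutput: number
): PackedFile[][] {
  const outputs: PackedFile[][] = [];
  let current: PackedFile[] = [];
  let currentLines = 0;
  for (const file of files) {
    const lines = countLines(file.content);
    if (current.length > 0 && currentLines + lines > maxLinesPerOutput) {
      outputs.push(current);
      current = [];
      currentLines = 0;
    }
    current.push(file);
    currentLines += lines;
  }
  if (current.length > 0) outputs.push(current);
  return outputs;
}

// Second step of the plan: cap the number of output files, dropping
// any parts beyond maxFilesPerOutput.
function applyOutputCap(
  outputs: PackedFile[][],
  maxFilesPerOutput: number
): PackedFile[][] {
  return outputs.slice(0, maxFilesPerOutput);
}
```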

What do you think about this approach? Do you think it would solve the issues you've been facing with manual splitting and LLM processing?

Thank you again for your thoughtful feedback and for using Repopack!

@gkruger1
Author

I apologize for the delay in replying.

Thank you for your explanation. This would definitely solve my problems.

If I understand you correctly, "If the number of split files exceeds maxFilesPerOutput, it will limit the output to that number of files" could mean that some of the code files the user requested get omitted if they set a max files limit without thinking it through?

If this is the case, would it be possible to give a warning if some code will be lost and give the user the option to continue?
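One possible shape for such a warning, assuming the split produces more parts than maxFilesPerOutput allows (hypothetical names, not actual Repopack behaviour):

```typescript
// Returns a warning message when the configured cap would drop output
// parts (and therefore code files), or null when nothing is lost, so
// the CLI could prompt the user before continuing.
function droppedPartsWarning(
  totalParts: number,
  maxFilesPerOutput: number
): string | null {
  if (totalParts <= maxFilesPerOutput) return null;
  const dropped = totalParts - maxFilesPerOutput;
  return (
    `Warning: splitting produced ${totalParts} parts but maxFilesPerOutput ` +
    `is ${maxFilesPerOutput}; ${dropped} part(s) containing code would be ` +
    `omitted. Continue anyway? [y/N]`
  );
}
```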

@UsamaKarim

Isn't the project already divided into files? We already have the ability to pack a specific directory, so if the output grows longer than the LLM's context window, the output can be reduced to a specific directory.

@fridaystreet

fridaystreet commented Oct 8, 2024

I was thinking the same thing, but I was actually thinking it would be better to set the size based on the number of tokens. You already report the number of tokens, but we don't know how many tokens x number of lines of our code generates, so it's still a bit of a guessing game.

We know what size the context window is of each model, so it would be great if it could be split by the maximum context window size.

It should also adhere to the principle as per above, that it won't cut off the end of the last file. So it's up to a maximum of x tokens preserving the complete files.

This would be really awesome
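A rough sketch of that idea, with a placeholder token counter standing in for a real tokenizer (the one behind Repopack's existing token report would be the natural choice; all names here are illustrative):

```typescript
interface PackedFile {
  path: string;
  content: string;
}

// Placeholder only: a real implementation would reuse the tokenizer
// Repopack already uses for its token count report. A crude whitespace
// split keeps this sketch self-contained.
const estimateTokens = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

// Group whole files so each output stays within maxTokens; a file is
// never cut, so an oversized file becomes its own (over-budget) part.
function splitByMaxTokens(
  files: PackedFile[],
  maxTokens: number
): PackedFile[][] {
  const outputs: PackedFile[][] = [];
  let current: PackedFile[] = [];
  let budgetUsed = 0;
  for (const file of files) {
    const tokens = estimateTokens(file.content);
    if (current.length > 0 && budgetUsed + tokens > maxTokens) {
      outputs.push(current);
      current = [];
      budgetUsed = 0;
    }
    current.push(file);
    budgetUsed += tokens;
  }
  if (current.length > 0) outputs.push(current);
  return outputs;
}
```

Setting maxTokens to a model's context window (or the plugin's per-file limit) would then give outputs sized for that model while keeping every file intact.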

@fridaystreet

Hopefully you don't mind, but I created a PR for the split by maxToken size. Well, it was more of an experiment actually.

I used Repopack to generate context for this repo and got Gemini to build the feature. It did a pretty good job for the most part, but got a bit confused towards the end. So I went in and tidied things up a bit.

It's just a draft as I haven't finished checking tests and docs yet, but have a look, and if you think it's up to scratch, great; if not, no stress. I'll use it myself for now as I really need the functionality.

Cheers

#113

@yamadashy
Owner

yamadashy commented Oct 9, 2024

@fridaystreet
Thank you for creating the PR!
I've left a few comments, but I think it's a good implementation that meets the current requirements!

The current requirement is "to split the output file based on some condition," and I agree that splitting by token count seems appropriate for meeting this requirement.
However, I haven't fully grasped the essence of file splitting effects yet, so I would appreciate it if you could verify the effectiveness of this approach.

Thank you again for your work on this. I really appreciate your contribution!

@fridaystreet

@yamadashy, no problem at all; as I said, it was a good opportunity to try and actually build something with Gemini. Thanks for taking a look and taking the time to provide all the feedback comments. A lot of that bloat and over-engineering of variable checking etc. was Gemini lol

I'll go through your comments, make the updates as requested, finalise tests and docs, and get the PR into a final state for further review.

Cheers

@fridaystreet

fridaystreet commented Oct 9, 2024

I'll report back on effectiveness, but what I'm hoping/seeing is that some of the problem seems to be more the VS Code implementations of the likes of Cody and the Gemini plugin. Cody has clear info on max token size per request, which is helpful, but I haven't been able to find the equivalent for the Google plugin yet. They seem to restrict the size of the individual files they will send to well below the full context window available, certainly from Gemini's perspective, and for obvious reasons. So, as I say, my hope is that by experimenting with token-sized outputs I can find the max size it will take for one file, and that if the codebase is split over multiple files of that size, you could potentially get it to pull them all in.

Just a theory at this stage, but at least now I can test it. Worst case, at least we can just build specific include filters across multiple folders and ensure we're using the full number of tokens it can handle with exactly the files we need.

@fridaystreet

fridaystreet commented Oct 10, 2024

Well, not quite what I hoped regarding getting it to read all the split files, but certainly a much improved use of the context you can get access to.

Gemini is a bit of a pain (well, the VS Code plugin anyway). It will read the file even if it's too big and only read what it can, which is fine and I guess by design, but it doesn't tell you it doesn't have the entire file. Cody is streets ahead UX-wise of the Google plugin, as it shows you if the file you're trying to add is too big, so with trial and error you could get the same result as using the max-tokens setting added here, but that would be a headache. Cody has a clear guide on its maximum input tokens, so now it's pretty easy to put together a Repopack file using include statements that has as much specific content in the context as possible.

My main problems were:

  1. Trying to do this on judgement of the included files alone was a massive trial-and-error headache. You can keep asking what the last few lines of the file in the context are, to check whether the end of the Repopack file is being picked up, but as I say that's a bit of a nightmare and time-consuming when working on bigger repos.
  2. If you don't get it right and don't realise it, the AI starts hallucinating big time and just makes up random content and file paths.
  3. Using the built-in current-folder style of context loading the VS Code plugins tend to use means that many times the included/related files aren't in the context unless you go and open them all first.

So here's what I've found so far:

  • The Google Code Assist plugin will only ingest the full output file if it is no more than 15,000 tokens with Gemini 1.5 Pro, which is quite some amount short of its capabilities.
  • Cody will take a file up to 30,000 tokens and pass it to Gemini.
  • While having the summary that explains the file inside the file seems like a good idea at first, both Cody and the Google Code Assist plugin allow a decent number of extra tokens in the message alone. Given this, you can actually increase your codebase coverage in context by completely removing the summary from the file and passing it in front of the message.
  • Taking this one step further, rather than using the generated summary at all (bar possibly the details specific to the codebase, i.e. details about the file splitting and maybe the tree, although personally I haven't seen a huge benefit in either the split details or the tree), it seems to get better comprehension results starting the prompt with something like this:

"The current open file repopack-output.xml was generated using the codebase in the remote repository https://github.com/yamadashy/repopack, go and analyse that repository so you can understand the structure of this file in order to answer my questions going forward"

Maybe something along these lines is all that's needed at the top of the file rather than the full summary?

Will keep you posted on anything else I find.

@yamadashy yamadashy added the enhancement New feature or request label Nov 6, 2024
@yamadashy yamadashy added the needs discussion Issues needing discussion and a decision to be made before action can be taken label Nov 16, 2024