Splitting code into several files #71
Hi
I've been using this for a while now (I also replied to you on Reddit some time ago) and it's really saving me a lot of time. Would it be possible to add the ability to generate more than one Repopack file for a specific codebase? Past a certain number of lines of code, the LLMs tend not to be able to see the rest. It would be great if I could specify the number of files (or perhaps the number of lines per file) the complete codebase should be split into. Ideally, no individual code file would be split across the separate output files.
Thanks for your work on this.
Hi @gkruger1, your idea of splitting the codebase into multiple files is intriguing. I appreciate you sharing this thought-provoking concept. Here's an initial outline of how I plan to approach this:
For example, if output.filePath is set to "repopack-output.txt" and the maximum files per output is set to 3, I would generate numbered parts such as repopack-output-1.txt, repopack-output-2.txt, and repopack-output-3.txt.
Additionally, to ensure each file is self-contained and fully functional, each part would repeat the file summary, the repository structure, and any other metadata, and would indicate its position in the sequence (e.g. "Part 1 of 3").
This approach should provide a straightforward way to split large codebases while maintaining the simplicity and current functionality of Repopack. What do you think about this implementation? Do you have any concerns or additional suggestions? Thank you again for your valuable input and for using Repopack. Your feedback helps me continually improve the tool.
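A minimal sketch of how that numbered-output scheme could work; the `PackedFile` shape, separator format, and helper name here are illustrative assumptions, not Repopack's actual internals:

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Illustrative shape for one packed source file.
interface PackedFile {
  relativePath: string;
  content: string;
}

// Derive "repopack-output-1.txt" ... "repopack-output-N.txt" from the
// configured output.filePath, repeating the shared header (file summary,
// repository structure, metadata) and a "Part X of Y" marker in every part.
function writeSplitOutputs(filePath: string, parts: PackedFile[][], header: string): void {
  const ext = path.extname(filePath);                            // ".txt"
  const base = filePath.slice(0, filePath.length - ext.length);  // "repopack-output"
  parts.forEach((files, i) => {
    const partHeader = `Part ${i + 1} of ${parts.length}\n\n${header}\n\n`;
    const body = files
      .map((f) => `================\nFile: ${f.relativePath}\n================\n${f.content}\n`)
      .join('\n');
    fs.writeFileSync(`${base}-${i + 1}${ext}`, partHeader + body);
  });
}
```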
Hi. Your suggestion is excellent! Also, thanks for explaining your implementation procedure. You've thought of several things I didn't consider, like placing the file summary, repository structure, and any other metadata in each file, and also indicating "Part 1 of #". Would it be possible to also add an alternative option that splits the codebase into multiple files based on a certain number of lines of code, while keeping each code file intact? When I split the Repopack output manually, I sometimes find that the LLM can't see a certain number of lines at the end of the document, and I then have to take out that last code file; Sonnet 3.5, for example, has given me feedback to that effect. Thanks again.
Thank you for your detailed feedback and additional suggestion! After considering your feedback, here's what I'm thinking: two new options, maxLinesPerOutput (split the output once it reaches a given number of lines, without ever cutting an individual code file in half) and maxFilesPerOutput (a cap on the number of split files; if the number of split files exceeds maxFilesPerOutput, it will limit the output to that number of files).
Here's an example of how you might use these options in the configuration:

```json
{
  "output": {
    "maxLinesPerOutput": 10000,
    "maxFilesPerOutput": 100
  }
}
```

What do you think about this approach? Do you think it would solve the issues you've been facing with manual splitting and LLM processing? Thank you again for your thoughtful feedback and for using Repopack!
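For the line-based option, the grouping could be a greedy pass that starts a new part whenever adding the next file would cross maxLinesPerOutput, so no individual file is ever cut. A sketch, under the same assumed `PackedFile` shape as above:

```typescript
interface PackedFile { relativePath: string; content: string; }

// Greedy split: each part holds whole files only. A new part starts when
// adding the next file would exceed maxLinesPerOutput; a single file that
// is larger than the limit still gets its own part rather than being cut.
function groupByLines(files: PackedFile[], maxLinesPerOutput: number): PackedFile[][] {
  const parts: PackedFile[][] = [];
  let current: PackedFile[] = [];
  let lineCount = 0;
  for (const file of files) {
    const fileLines = file.content.split('\n').length;
    if (current.length > 0 && lineCount + fileLines > maxLinesPerOutput) {
      parts.push(current);
      current = [];
      lineCount = 0;
    }
    current.push(file);
    lineCount += fileLines;
  }
  if (current.length > 0) parts.push(current);
  return parts;
}
```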
I apologize for the delay in replying. Thank you for your explanation; this would definitely solve my problems. If I understand you correctly, "If the number of split files exceeds maxFilesPerOutput, it will limit the output to that number of files" could mean that some of the code files the user requested get omitted if they set a max-files limit without thinking things through. If that's the case, would it be possible to warn the user that some code will be lost and give them the option to continue?
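One way that warning could look, sketched with Node's built-in readline/promises module; the function name and wiring are assumptions for illustration:

```typescript
import * as readline from 'node:readline/promises';

// If the maxFilesPerOutput cap would drop content, warn and ask first.
async function confirmTruncation(totalParts: number, maxFilesPerOutput: number): Promise<boolean> {
  if (totalParts <= maxFilesPerOutput) return true;
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  const answer = await rl.question(
    `Warning: splitting produced ${totalParts} parts but maxFilesPerOutput is ` +
      `${maxFilesPerOutput}, so ${totalParts - maxFilesPerOutput} part(s) would be ` +
      `discarded. Continue anyway? (y/N) `,
  );
  rl.close();
  return answer.trim().toLowerCase() === 'y';
}
```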
Isn't the project already divided into files? We already have the ability to pack a specific directory, so if the output grows longer than the LLM's context window, it can be narrowed down to a specific directory.
I was thinking the same thing, but I actually think it would be better to set the size based on the number of tokens. You already report the number of tokens, but we don't know how many tokens x lines of our code generate, so it's still a bit of a guessing game. We know the context window size of each model, so it would be great if the output could be split by the maximum context window size. It should also adhere to the principle above, that it won't cut off the end of the last file. So each part would be up to a maximum of x tokens while preserving complete files. This would be really awesome.
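Swapping the line counter for a token counter gives the behaviour described here. A sketch assuming the js-tiktoken package for counting (OpenAI-style counts; other models' tokenizers differ):

```typescript
import { getEncoding } from 'js-tiktoken';

interface PackedFile { relativePath: string; content: string; }

// cl100k_base approximates OpenAI-style counts; pick per target model.
const enc = getEncoding('cl100k_base');

// Same greedy grouping as a line-based split, but budgeted in tokens, so each
// part fits a known context window while individual files stay intact.
function groupByTokens(files: PackedFile[], maxTokensPerOutput: number): PackedFile[][] {
  const parts: PackedFile[][] = [];
  let current: PackedFile[] = [];
  let tokenCount = 0;
  for (const file of files) {
    const fileTokens = enc.encode(file.content).length;
    if (current.length > 0 && tokenCount + fileTokens > maxTokensPerOutput) {
      parts.push(current);
      current = [];
      tokenCount = 0;
    }
    current.push(file);
    tokenCount += fileTokens;
  }
  if (current.length > 0) parts.push(current);
  return parts;
}
```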
Hopefully you don't mind, but I created a PR for the split-by-max-token-size feature. Well, it was more of an experiment, actually: I used Repopack to generate context for this repo and got Gemini to build the feature. It did a pretty good job for the most part, but got a bit confused towards the end, so I went in and tidied things up a bit. It's just a draft as I haven't finished checking tests and doco yet, but have a look, and if you think it's up to scratch, great; if not, no stress. I'll use it myself for now as I really need the functionality. Cheers
@fridaystreet The current requirement is "to split the output file based on some condition," and I agree that splitting by token count seems appropriate for meeting this requirement. Thank you again for your work on this. I really appreciate your contribution! |
@yamadashy, no problem at all; as I said, it was a good opportunity to try and actually build something with Gemini. Thanks for taking a look and taking the time to provide all the feedback comments. A lot of that bloat and over-engineering of variable checking etc. was Gemini, lol. I'll go through your comments, make the updates as requested, finalise the tests and doco, and get the PR into a final state for further review. Cheers
I'll report back on effectiveness, but what I'm hoping/seeing is that some of the problem seems to be more about the VS Code implementations of the likes of Cody and the Gemini plugin. Cody has clear info on max token size per request, which is helpful, but I haven't been able to find the equivalent for the Google plugin yet. They seem to restrict the size of the individual files they will send to well below the full context window available, certainly from a Gemini perspective, and for obvious reasons. So, as I say, my hope is that by experimenting with token-sized outputs I can find the max size it will take for one file and, if the output is split over multiple files of that size, whether you could get it to pull them all in. Just a theory at this stage, but at least now I can test it. Worst case, at least we can build specific include filters across multiple folders and ensure we're using the full number of tokens it can handle with exactly the files we need.
Well, not quite what I hoped regarding getting it to read all the split files, but certainly a much improved use of the context you can get access to. Gemini is a bit of a pain (well, the VS Code plugin anyway). It will read the file even if it's too big and only take in what it can, which is fine and I guess by design, but it doesn't tell you it doesn't have the entire file. Cody is streets ahead of the Google plugin UX-wise, as it does show you if the file you're trying to add is too big, so with trial and error you could get the same result as using the max-tokens setting added here, but that would be a headache. Cody has a clear guide on its maximum input tokens, so now it's pretty easy to put together a Repopack file using include statements that gets as much specific content into the context as possible. My main problems were:
So here's what I've found so far:
"The current open file repopack-output.xml was generated using the codebase in the remote repository https://github.com/yamadashy/repopack, go and analyse that repository so you can understand the structure of this file in order to answer my questions going forward" Maybe something along these lines is all that's needed in the top of the file rather than the full summary? Will keep you posted on anything else I find. |