- General Questions
- Configuration and Setup
- Usage and Features
- Troubleshooting
- Maintenance and Updates
paperweight is a personal project that automatically retrieves, filters, and summarizes recent academic papers from arXiv based on user-specified categories and preferences. It then sends notifications to the user via email.
The program checks for new papers every time it is run. It compares the current date against the date stored in last_processed_date.txt in the root directory. If this file doesn't exist, it assumes it's the first run and pulls papers from the last seven days.
The last_processed_date.txt file is automatically created and updated by paperweight to keep track of when it last successfully processed papers. This file:
- Is created in the root directory of the project after the first successful run.
- Contains a single date in the format YYYY-MM-DD.
- Is used to determine which papers to fetch on subsequent runs, avoiding duplicate processing.
- Can be safely deleted if you want to reset the last processed date (paperweight will then fetch papers from the last 7 days on the next run).
paperweight uses the following logic to determine which papers to fetch:
- If it's the first run (no last_processed_date.txt file exists), it fetches papers from the last 7 days.
- On subsequent runs, it fetches papers published since the date in last_processed_date.txt.
- The number of papers fetched per category is limited by the max_results setting in your configuration.
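The date logic above can be sketched in a few lines of Python. This is a minimal illustration, not paperweight's actual code: the helper names `fetch_start_date` and `record_successful_run` are hypothetical, and only the file name, the YYYY-MM-DD format, and the 7-day fallback come from this FAQ.

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Optional

DATE_FILE = Path("last_processed_date.txt")  # file name taken from this FAQ
FIRST_RUN_WINDOW_DAYS = 7                     # look-back window for a first run


def fetch_start_date(date_file: Path = DATE_FILE, today: Optional[date] = None) -> date:
    """Return the earliest date to fetch papers from (hypothetical helper)."""
    today = today or date.today()
    if not date_file.exists():
        # First run, or the file was deleted: fall back to the last 7 days.
        return today - timedelta(days=FIRST_RUN_WINDOW_DAYS)
    # The file holds a single date in YYYY-MM-DD format.
    return date.fromisoformat(date_file.read_text().strip())


def record_successful_run(date_file: Path = DATE_FILE, today: Optional[date] = None) -> None:
    """Store the run date after papers were processed successfully (hypothetical helper)."""
    today = today or date.today()
    date_file.write_text(today.isoformat())
```

Deleting the file simply sends the next call down the first-run branch, which matches the reset behavior described above.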
Currently, paperweight only supports arXiv as a source for academic papers. Support for additional sources may be added in future updates.
To set up paperweight:
- Clone the repository and navigate to the project directory.
- Copy config-base.yaml to config.yaml and edit it with your preferences.
- Create a .env file in the project root and add your API keys if using summarization functionality.
- Install the package by running pip install . from the project directory.
- Run the application using the paperweight command.
For more detailed instructions, please refer to the README.md file. For detailed configuration instructions, please see the configuration guide.
Currently, paperweight supports two LLM providers for summarization: OpenAI's GPT and Google's Gemini. This feature is in BETA and may have some limitations. You can specify the provider in the config.yaml file under the analyzer section:
analyzer:
  type: summary
  llm_provider: openai  # or gemini
Make sure you have the appropriate API key set. You can set this in your config.yaml, as an environment variable, or in your .env file. For more details on securely managing API keys, please refer to the environment variables guide.
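As a rough illustration of how a key might be looked up from these places, here is a small Python sketch. Everything specific in it is an assumption: the environment variable names, the `api_key` config field, and the precedence (environment before config.yaml) are hypothetical and are not taken from paperweight. Check the environment variables guide for the real names. A .env file is typically loaded into the process environment first, so it is covered by the environment lookup.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable names, used for illustration only.
ENV_VAR_BY_PROVIDER = {"openai": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}


def resolve_api_key(
    analyzer_cfg: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the API key for the configured provider, or None if none is set."""
    env = os.environ if env is None else env
    provider = analyzer_cfg.get("llm_provider", "openai")
    # Assumed precedence: an environment variable wins over config.yaml.
    return env.get(ENV_VAR_BY_PROVIDER[provider]) or analyzer_cfg.get("api_key")
```

Keeping the key in the environment rather than in config.yaml makes it harder to commit it to version control by accident.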
If you experience any issues with the summarization feature, you can switch to the abstract type in your configuration. We encourage users to report any problems or suggestions related to the BETA features by opening an issue on our GitHub repository.
You can customize paper retrieval and processing by editing the config.yaml file. Key settings include:
- arXiv categories
- Keywords and exclusion keywords
- Scoring weights
- Minimum score threshold
For a detailed explanation of all configuration options, please see the configuration guide.
Relevance scores are calculated based on the presence of keywords, important words, and exclusion keywords in the paper's title, abstract, and content. Papers are then ranked based on these scores. A higher score indicates that the paper is more likely to be relevant to your interests as defined in the configuration.
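To make the idea concrete, the following Python sketch scores and ranks papers by keyword hits. It is an illustration under stated assumptions, not paperweight's actual formula: the `Paper` class, the function names, the simple substring counting, and the default weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Paper:
    title: str
    abstract: str
    content: str = ""


def relevance_score(
    paper: Paper,
    keywords: Sequence[str],
    important_words: Sequence[str],
    exclusion_keywords: Sequence[str],
    keyword_weight: float = 3.0,    # assumed weight
    important_weight: float = 1.0,  # assumed weight
    exclusion_weight: float = 5.0,  # assumed penalty
) -> float:
    """Weighted count of term occurrences across title, abstract, and content."""
    text = f"{paper.title} {paper.abstract} {paper.content}".lower()

    def hits(terms: Sequence[str]) -> int:
        return sum(text.count(term.lower()) for term in terms)

    return (
        keyword_weight * hits(keywords)
        + important_weight * hits(important_words)
        - exclusion_weight * hits(exclusion_keywords)
    )


def rank(papers: Sequence[Paper], min_score: float, **scoring) -> List[Paper]:
    """Drop papers below the threshold and sort the rest, highest score first."""
    scored = [(relevance_score(p, **scoring), p) for p in papers]
    kept = [(s, p) for s, p in scored if s >= min_score]
    return [p for s, p in sorted(kept, key=lambda sp: sp[0], reverse=True)]
```

In this sketch an exclusion keyword subtracts from the score instead of removing the paper outright, so a paper with enough keyword hits can still clear the minimum score threshold.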
Yes, you can use exclusion keywords to make certain papers less likely to be recommended. In your config.yaml, add exclusion keywords under the processor section:
processor:
  exclusion_keywords:
    - keyword1
    - keyword2
Note that this doesn't completely exclude papers with these keywords, but significantly reduces their relevance score. The effectiveness of exclusion keywords depends on their weight relative to other scoring factors.
The --force-refresh argument allows you to ignore the last_processed_date.txt file and fetch papers from the last 7 days. This can be useful if you want to reprocess recent papers or if you've made significant changes to your configuration. Use it like this:
paperweight --force-refresh
Currently, the email format and content are not customizable. This feature may be added in future updates.
To troubleshoot paper download or processing issues:
- Set the logging level to DEBUG in your config.yaml:
logging:
  level: DEBUG
  file: paperweight.log
- Run paperweight again and check the log file for detailed information about each step of the process.
- Look for any error messages or warnings in the log that might indicate the source of the problem.
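For readers curious how a logging section like this typically maps onto Python's standard logging module, here is a minimal sketch. The `configure_logging` helper, the logger name, and the message format are hypothetical. Only the level and file keys come from the configuration shown above.

```python
import logging


def configure_logging(logging_cfg: dict) -> logging.Logger:
    """Build a file logger from the logging section of config.yaml (hypothetical helper)."""
    level_name = str(logging_cfg.get("level", "INFO")).upper()
    level = getattr(logging, level_name, logging.INFO)  # unknown names fall back to INFO

    logger = logging.getLogger("paperweight")
    logger.setLevel(level)

    # Write records to the configured file, defaulting to paperweight.log.
    handler = logging.FileHandler(logging_cfg.get("file", "paperweight.log"))
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With the level set to DEBUG, every record reaches the file, which is why this setting is the first troubleshooting step.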
If you encounter Python dependency issues:
- Ensure you're using Python 3.10 or higher.
- Try creating a new virtual environment and installing paperweight fresh.
- Update your pip and setuptools:
pip install --upgrade pip setuptools
- If you're still having issues, check the project's setup.py file for the list of required packages and versions, and try installing them manually.
If you're not receiving email notifications:
- Check your spam folder and verify your email configuration in config.yaml.
- Ensure your SMTP settings are correct, especially if using Gmail or other providers with specific security requirements.
- Check the log file (default: paperweight.log) for any error messages.
If you continue to have problems, please open an issue on the project's GitHub page.
If you encounter API rate limits:
- Try reducing the max_results value in your config file.
- Run paperweight less frequently.
- Check your API usage on the provider's website.
- Consider using the 'abstract' analyzer type instead of 'summary' to limit external API use.
If paperweight seems to be taking a long time to run, check the log file to confirm it is still processing. Large paper sets or enabled summarization can increase runtime. The program updates the log file as it progresses through the stages of paper retrieval and processing.
If the recommended papers don't match your interests, review the keyword and scoring settings in your configuration file, since these settings determine relevance. Adjust keywords, exclusion keywords, and scoring weights to refine the results. You may need to experiment with different configurations to get the paper selection you want.
As paperweight doesn't have an established distribution pipeline yet, to update:
- Pull the most recent version from the GitHub repository.
- Reinstall the package by running pip install . from the project directory.
Contributions to paperweight are welcome! You can contribute by:
- Submitting issues for bugs or feature requests on the GitHub repository.
- Creating pull requests with bug fixes or new features.
- Improving documentation or writing tests.
Please refer to the project's GitHub page for more information on contributing.
Currently, paperweight does not have a built-in feature to export or save processed paper data. This could be a valuable feature to add in the future.
While paperweight should theoretically work with papers in languages other than English, this functionality has not been extensively tested. The effectiveness may vary depending on the language and the LLM provider used for summarization.
No, paperweight requires an internet connection to fetch papers from arXiv and to use the LLM APIs for summarization (if enabled). It cannot be used in a fully offline environment.