Add PyPI User API #15769
Comments
It would be nice to have this functionality moved forward from the XMLRPC API, both |
I share these concerns 🙂 -- as a related concern, I think it's potentially dangerous for PyPI to bless security companies as "credible" for the purposes of access to user data without reaching a common agreement about how that user data is stored or re-exposed (e.g. not making public blanket statements about new user trustworthiness, since this effectively punishes ecosystem newcomers). But that's just my personal two cents; I have no actual say in this matter 😉 See #13409 for related discussion (that concerns "management" APIs specifically, but ties in with how we might expose authentication).
I would like to note that the information I am interested in is already publicly exposed via the BigQuery dataset, and retained in line with the proposed principles. An API for this would expose no additional information; it would instead serve as a more accessible conduit for mapping out related packages, since otherwise that information has to be gathered through numerous queries on already public data. I have also not asserted anything negative about new users. Rather, across our corpus of over 1.5 million packages scanned and 1400 positive detections, we have noted that fresh user accounts create the majority of the malicious uploads in the ecosystem. It is important for us to filter data down as early as possible in our pipeline to support more intensive code security scanning tools. I am not interested in prejudging users; I am interested in efficiently parsing the roughly 10k lines of code per second that flow through PyPI.
Does the BigQuery dataset currently expose which user publishes each version of a package? I believe it lists |
I think it would be reasonable to expose the same information on our
I don't see how the proposed API would be any different than scraping the current public user/project pages in this regard -- if this is going to happen I assume it is already happening.
No, it doesn't.
This is actually a gap in the JSON API. I don't believe |
What's the problem this feature will solve?
Currently, there is no way to enumerate user information programmatically without BigQuery. This is problematic for security organizations, which may use these characteristics to inform and contextualize automated detection engines. For instance, new accounts are responsible for a disproportionate share of malware uploads. A long-lived account maintaining several packages is, consequently, less likely to upload malware. Additionally, individuals who maintain packages as part of teams or organizations (which may be recursively enumerated by this endpoint) are also statistically less likely to upload malware.
Describe the solution you'd like
I am proposing the following JSON endpoint/response:
The date/time fields could use more specific data types where relevant; I have left them as strings to clearly denote the information each field is meant to convey.
Additional context
I have concerns about the misuse of this feature to scrape PyPI's users, potentially facilitating things like automated social engineering attacks by mapping out relationships between authors. I propose that this feature be restricted to credible security organizations, following the conventions of the anti-malware APIs currently in development.
I would like to take a moment to elaborate how we intend to use this to more effectively secure the Python Package Index:
We (@vipyrsec) have, in the past, been requested to perform additional scans on uploads by a malicious author to canvass the full scope of potentially malicious packages on a given account. This endpoint would facilitate these additional scans, and allow us to treat users, rather than individual packages, as potentially malicious. The benefit is twofold: the odds of a given user staging an undetected payload are significantly reduced, as we are able to run more computationally expensive static code analysis tools on these packages, and we can do this programmatically to inform our reporting in anticipation of such a request.
We often use account age in conjunction with less clear-cut malicious behavior. An account that has historically maintained good standing and has contributed multiple non-malicious packages over some interval is less likely to have staged malicious code, and thus requires a less critical examination. A new user uploading "azure-data-interactions" is much more alarming than a long-standing package maintainer uploading the same package.
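To make the account-age reasoning above concrete, here is a minimal sketch of how a scanner might pick a scan tier from user data. Everything in it is an assumption for illustration: the `UserSummary` fields, the function names, and the 90-day / 3-package thresholds are hypothetical and are not taken from the proposed response or from any existing PyPI API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class UserSummary:
    """Hypothetical subset of fields a user endpoint might return."""
    username: str
    created: str        # ISO 8601 timestamp, left as a string as in the proposal
    package_count: int  # number of packages the account maintains


def account_age_days(user: UserSummary, now: datetime) -> int:
    # Parse the string timestamp into a timezone-aware datetime.
    created = datetime.fromisoformat(user.created)
    return (now - created).days


def scan_tier(
    user: UserSummary,
    now: datetime,
    min_age_days: int = 90,   # assumed threshold, not from the issue
    min_packages: int = 3,    # assumed threshold, not from the issue
) -> str:
    # Established accounts get the standard scan; everything else is
    # routed to the more computationally expensive analysis.
    established = (
        account_age_days(user, now) >= min_age_days
        and user.package_count >= min_packages
    )
    return "standard" if established else "intensive"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
new_user = UserSummary("new-user", "2024-05-25T00:00:00+00:00", 1)
old_user = UserSummary("old-user", "2019-01-01T00:00:00+00:00", 12)
print(scan_tier(new_user, now), scan_tier(old_user, now))
```

The tier only decides how much analysis a package receives; it does not label a user as malicious, which keeps the heuristic in line with the concern about prejudging newcomers.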