Add PyPI User API #15769

Open
import-pandas-as-numpy opened this issue Apr 12, 2024 · 6 comments

@import-pandas-as-numpy

What's the problem this feature will solve?

Currently, there is no way to enumerate user information programmatically without BigQuery. This is problematic for security organizations, which may use these characteristics to inform and contextualize automated detection engines. For instance, new users account for a statistically significant proportion of malware uploads. A long-lived account maintaining several packages is, consequently, less likely to upload malware. Additionally, individuals who maintain packages as part of teams or organizations (which may be recursively enumerated by this endpoint) are also statistically less likely to upload malware.

Describe the solution you'd like
I am proposing the following JSON endpoint/response:

GET /user/import-pandas-as-numpy/json
{
  "username": "import-pandas-as-numpy",
  "name": "Rem",
  "joined": "May 1, 2023",
  "packages": [
    {
      "package": "safepull",
      "latest_version": "v2.0.0",
      "last_released": "Apr 1, 2024"
    },
    {
      "package": "foo",
      "latest_version": "v1.1.1",
      "last_released": "Apr 11, 2024"
    }
  ]
}

The date/time fields could of course use more specific data types where relevant; I have left them as human-readable strings to make clear what information each field is meant to convey.
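
For illustration only, a consumer of the proposed response might parse it roughly as follows. This is a sketch under assumptions: the endpoint does not exist today, the field names and the "May 1, 2023" date style are taken from the example above, and `parse_user`, `UserSummary`, and `PackageSummary` are hypothetical names.

```python
import json
from dataclasses import dataclass
from datetime import date, datetime

# Matches strings such as "May 1, 2023" from the proposed example.
# Note: %b is locale-dependent; this assumes English month abbreviations.
DATE_FORMAT = "%b %d, %Y"


def _parse_date(text: str) -> date:
    return datetime.strptime(text, DATE_FORMAT).date()


@dataclass
class PackageSummary:
    package: str
    latest_version: str
    last_released: date


@dataclass
class UserSummary:
    username: str
    name: str
    joined: date
    packages: list[PackageSummary]


def parse_user(payload: str) -> UserSummary:
    """Parse the *proposed* (hypothetical) /user/<name>/json response body."""
    data = json.loads(payload)
    packages = [
        PackageSummary(
            package=item["package"],
            latest_version=item["latest_version"],
            last_released=_parse_date(item["last_released"]),
        )
        for item in data.get("packages", [])
    ]
    return UserSummary(
        username=data["username"],
        name=data["name"],
        joined=_parse_date(data["joined"]),
        packages=packages,
    )
```

If the endpoint used ISO 8601 timestamps instead of display strings, `_parse_date` would be the only piece that needs to change.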

Additional context
I have concerns about the misuse of this feature to scrape PyPI's users, potentially facilitating things like automated social engineering attacks by mapping out relationships between authors. I propose that this feature be restricted to credible security organizations following conventions with currently in-development anti-malware APIs.

I would like to take a moment to elaborate on how we intend to use this to more effectively secure the Python Package Index:

  • We (@vipyrsec) have, in the past, been requested to perform additional scans on uploads by a malicious author to canvass the full scope of potentially malicious packages on a given account. This would facilitate these additional scans, and allow us to treat users, rather than individual packages, as potentially malicious. The benefit is twofold: the odds of a given user staging an undetected payload are significantly reduced, as we are able to use more computationally expensive static code analysis tools on these packages, and we can do this programmatically to inform our reporting in anticipation of this request.

  • We often use account age in conjunction with less clear-cut malicious behavior. An account that has historically maintained good standing and has contributed multiple non-malicious packages over some interval is less likely to have staged malicious code, and thus requires less critical examination. A new user uploading "azure-data-interactions" is much more alarming than a long-standing package maintainer uploading this package.
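
As a rough sketch of the account-age signal described in the bullets above, a scanner might route uploads to a more expensive analysis tier along these lines. The function names and the thresholds (90 days, 2 packages) are purely illustrative assumptions, not values anyone in this thread has stated, and the result is only a prioritization hint, not a verdict on the user.

```python
from datetime import date


def account_age_days(joined: date, today: date) -> int:
    """Number of whole days between the join date and today."""
    return (today - joined).days


def needs_deeper_scan(
    joined: date,
    package_count: int,
    today: date,
    min_age_days: int = 90,   # hypothetical threshold
    min_packages: int = 2,    # hypothetical threshold
) -> bool:
    """Return True when an upload should go to the more expensive scan tier.

    An account is treated as lower priority only if it is both
    long-lived and already maintains several packages.
    """
    is_new = account_age_days(joined, today) < min_age_days
    has_few_packages = package_count < min_packages
    return is_new or has_few_packages
```

Passing `today` explicitly keeps the function deterministic, which makes it straightforward to test against historical uploads.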

import-pandas-as-numpy added the "feature request" and "requires triaging" labels on Apr 12, 2024
@matt-phylum

It would be nice to have this functionality moved forward from the XMLRPC API, both package_roles(package_name) and user_packages(user), with the roles included.
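
For reference, the two XML-RPC methods mentioned above can be called with the standard library roughly as follows. This is a sketch based on my understanding of the documented return shapes (`user_packages` yielding `[role, package_name]` pairs and `package_roles` yielding `[role, user]` pairs); the helper names are hypothetical, and the XML-RPC API is deprecated and rate limited, so treat it as illustrative rather than something to build on.

```python
import xmlrpc.client

PYPI_XMLRPC_URL = "https://pypi.org/pypi"


def packages_for_user(client, username: str) -> dict[str, list[str]]:
    """Map each package name to the roles `username` holds on it."""
    result: dict[str, list[str]] = {}
    for role, package in client.user_packages(username):
        result.setdefault(package, []).append(role)
    return result


def users_for_package(client, package_name: str) -> dict[str, list[str]]:
    """Map each user to the roles they hold on `package_name`."""
    result: dict[str, list[str]] = {}
    for role, user in client.package_roles(package_name):
        result.setdefault(user, []).append(role)
    return result


# Example usage (performs a network request, so it is left commented out):
# client = xmlrpc.client.ServerProxy(PYPI_XMLRPC_URL)
# print(packages_for_user(client, "import-pandas-as-numpy"))
```

Taking the client as a parameter means the helpers can be exercised with a stub object instead of the live service.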

@woodruffw
Member

I have concerns about the misuse of this feature to scrape PyPI's users, potentially facilitating things like automated social engineering attacks by mapping out relationships between authors. I propose that this feature be restricted to credible security organizations following conventions with currently in-development anti-malware APIs.

I share these concerns 🙂 -- as a related concern, I think it's potentially dangerous for PyPI to bless security companies as "credible" for the purposes of access to user data without reaching a common agreement about how that user data is stored or re-exposed (e.g. not making public blanket statements about new user trustworthiness, since this effectively punishes ecosystem newcomers). But that's my personal 0.02c here; I have no actual say in this matter 😉

See #13409 for related discussion (that concerns "management" APIs specifically, but ties in with how we might expose authentication).

@import-pandas-as-numpy
Author

I have concerns about the misuse of this feature to scrape PyPI's users, potentially facilitating things like automated social engineering attacks by mapping out relationships between authors. I propose that this feature be restricted to credible security organizations following conventions with currently in-development anti-malware APIs.

I share these concerns 🙂 -- as a related concern, I think it's potentially dangerous for PyPI to bless security companies as "credible" for the purposes of access to user data without reaching a common agreement about how that user data is stored or re-exposed (e.g. not making public blanket statements about new user trustworthiness, since this effectively punishes ecosystem newcomers). But that's my personal 0.02c here; I have no actual say in this matter 😉

See #13409 for related discussion (that concerns "management" APIs specifically, but ties in with how we might expose authentication).

I would like to note that the information that I am interested in is publicly exposed via the BigQuery dataset, and retained in line with the proposed principles. An API for this would expose no additional information; it would instead serve as a more accessible conduit for mapping out related packages, since otherwise that information needs to be gathered through numerous queries on already public data.

I've also not ascribed any negative actions to new users; rather, over our corpus of more than 1.5 million packages scanned and 1,400 positive detections, we have noted that fresh user accounts often create the majority of the malicious uploads in the ecosystem. It is important for us to filter data down as early as possible in our pipeline to support more intensive code security scanning tools.

I am generally uninterested in any sort of prejudgement of users; my interest is in efficiently parsing the ~10k lines of code per second that PyPI generally sustains in terms of data flow.

@woodruffw
Member

I would like to note that the information that I am interested in is publicly exposed via the BigQuery dataset, and retained in line with the proposed principles

Does the BigQuery dataset currently expose which user publishes each version of a package? I believe it lists maintainer and maintainer_email from the project metadata, but not the PyPI username itself. I also believe it doesn't report the user's join date. The latter is public information, but the former would be a new piece of user activity to store and expose. I think there's probably a reasonable argument for exposing it, but that's just to highlight that it isn't exactly the same as the current distribution_metadata table.

@di
Member

di commented Jul 1, 2024

I think it would be reasonable to expose the same information on our /user/ pages via a JSON API. #2914 seems related as well.

I have concerns about the misuse of this feature to scrape PyPI's users, potentially facilitating things like automated social engineering attacks by mapping out relationships between authors.

I don't see how the proposed API would be any different than scraping the current public user/project pages in this regard -- if this is going to happen I assume it is already happening.

Does the BigQuery dataset currently expose which user publishes each version of a package?

No, it doesn't.

miketheman removed the "requires triaging" label on Oct 11, 2024
@warsaw
Contributor

warsaw commented Oct 25, 2024

I believe it lists maintainer and maintainer_email from the project metadata, but not the PyPI username itself.

This is actually a gap in the JSON API. I don't believe GET /pypi/<project>/json currently exposes the user information at all. author / author_email / maintainer / maintainer_email are not the same thing: they come from the latest release's metadata, if I'm reading the code in warehouse/legacy/api/json.py correctly, and those metadata fields could be spoofed. The user information is, however, available in the web UI for a project (and linked to the user URL), so you really just have to scrape the web page to get it, which isn't very convenient.
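
To make the distinction concrete, here is a minimal sketch of what can be read from the existing project JSON response today. The helper names are hypothetical; the four fields come from the `info` object of the response and are self-declared release metadata, so they say nothing about which PyPI account performed the upload.

```python
import json
from urllib.request import urlopen

# Self-declared metadata fields from the latest release; these are NOT
# PyPI usernames and can be set to arbitrary values by the uploader.
METADATA_FIELDS = ("author", "author_email", "maintainer", "maintainer_email")


def declared_contacts(project_json: dict) -> dict:
    """Extract the self-declared contact fields from a project JSON payload."""
    info = project_json.get("info", {})
    return {field: info.get(field) for field in METADATA_FIELDS}


def fetch_project_json(project: str) -> dict:
    """Fetch GET /pypi/<project>/json (performs a network request)."""
    with urlopen(f"https://pypi.org/pypi/{project}/json") as response:
        return json.load(response)


# Example usage (network access required, so left commented out):
# print(declared_contacts(fetch_project_json("safepull")))
```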
