
HTTP 500 errors during indexing #328

Open
skitt opened this issue Sep 12, 2024 · 6 comments

skitt commented Sep 12, 2024

See https://elixir.bootlin.com/linux/v6.10.9/source/block/blk-core.c:

[Screenshot of the 500 error]

This happens on all versions of the kernel I’ve tried, back to https://elixir.bootlin.com/linux/v2.6.39.4/source/block/blk-core.c at least.

fstachura (Collaborator) commented:
Hello,

I can't reproduce this. Does this happen when you click on an identifier? Or on a source page?

skitt (Author) commented Sep 12, 2024

Wow, it’s working now... Presumably a transient issue then. It happened whenever I tried accessing blk-core.c in any way, including through the links in the initial post.

skitt closed this as not planned Sep 12, 2024
tleb (Member) commented Sep 12, 2024

The logs have traces:

2024-09-12 13:13:04 [FALCON] [ERROR] GET /linux/v2.6.39.4/source/block/blk-core.c => Traceback (most recent call last):
  File "falcon/app.py", line 375, in falcon.app.App.__call__
  File "/usr/local/elixir/elixir/web.py", line 103, in on_get
    resp.status, resp.text = generate_source_page(req.context, query, project, version, path)
  File "/usr/local/elixir/elixir/web.py", line 412, in generate_source_page
    'code': generate_source(q, project, version, path),
  File "/usr/local/elixir/elixir/web.py", line 313, in generate_source
    code = q.query('file', version, path)
  File "/usr/local/elixir/elixir/query.py", line 191, in query
    (lib.compatibleFamily(self.db.defs.get(tok2).get_families(), family) or
  File "/usr/local/elixir/elixir/data.py", line 167, in get
    p = self.ctype(p)
  File "/usr/local/elixir/elixir/data.py", line 60, in __init__
    self.data, self.families = data.split(b'#')
ValueError: too many values to unpack (expected 2)

This is somewhat known, but we have (1) no tracking issue and (2) no fix currently. I haven't dug deeper, but it is surprising that the error can resolve itself over time: we seem to read malformed data from the database, yet the error eventually stops appearing.
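The failure mode in that last frame is easy to reproduce in isolation: `data.split(b'#')` unpacked into exactly two names raises a ValueError as soon as the raw record contains more (or fewer) than one `#`, which is consistent with a torn or partially written value. A minimal sketch; the byte payloads are invented for illustration:

```python
# Mimics data.py line 60: self.data, self.families = data.split(b'#')
def parse(raw: bytes):
    data, families = raw.split(b'#')
    return data, families

# A well-formed record unpacks cleanly.
assert parse(b'payload#C') == (b'payload', b'C')

# A record with an extra '#', or a truncated one with none,
# raises the same ValueError seen in the logs.
for bad in (b'payload#C#K', b'payload'):
    try:
        parse(bad)
    except ValueError as e:
        print(bad, '->', e)
```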

tleb reopened this Sep 12, 2024
fstachura (Collaborator) commented:
I did some research on this and it's not clear to me if Elixir is handling database concurrency correctly. Berkeley DB has this concept of "products" with solutions for different concurrency/availability requirements. See: https://docs.oracle.com/cd/E17276_01/html/programmer_reference/intro_products.html

| | Berkeley DB Data Store | Berkeley DB Concurrent Data Store |
| --- | --- | --- |
| Description | Provides indexed, single-reader/single-writer embedded data storage | Adds simple locking with multiple-reader/single-writer capabilities |
| Provides concurrent read-write access | No | Yes |

It seems to me that Elixir currently does not use the Concurrent Data Store.
The following documents suggest that a Berkeley DB environment has to be used for it to work, but AFAIK Elixir just opens the database directly, without an environment:

  • Chapter 10. Berkeley DB Concurrent Data Store Applications
  • Chapter 9. The Berkeley DB Environment

But also, from the products document:

The Berkeley DB Data Store product is an embeddable, high-performance data store. This product supports multiple concurrent threads of control, including multiple processes and multiple threads of control within a process. However, Berkeley DB Data Store does not support locking, and hence does not guarantee correct behavior if more than one thread of control is updating the database at a time.

So maybe these are just issues with how databases are shared between threads in the update script.

If neither of those is the problem, then perhaps the values in the database are not updated atomically. What I mean is that the update script may sometimes write a temporary value into the defs/refs databases that does not have the correct format. Maybe the key is initially empty, and that's why the split-with-unpack fails.

@tleb You probably know more about the update script than I do.

tl;dr, three things to try/investigate:

  • Concurrent Data Store
  • How databases are shared in update.py, also see DB_THREAD flag
  • Correctness of single key-level updates
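On that last point, a cheap mitigation (independent of fixing the root cause) would be to parse defensively instead of unpacking blindly. A hypothetical sketch, assuming records have the form `payload#families` with the family list after the last `#`; whether a `#` can legitimately occur inside the payload depends on the encoder in data.py, which I have not checked:

```python
def parse_record(raw: bytes) -> tuple[bytes, bytes]:
    """Split a defs/refs record into (data, families).

    rsplit with maxsplit=1 means a stray '#' inside the payload
    cannot trigger 'too many values to unpack', and a record with
    no separator at all is reported explicitly instead of crashing
    deep inside the query path.
    """
    if b'#' not in raw:
        raise ValueError('malformed record (no separator): %r' % raw)
    data, families = raw.rsplit(b'#', 1)
    return data, families
```

If the payload can never contain `#`, the maxsplit guard is just belt-and-braces; the explicit error message is still useful for tracking down which keys hold malformed values.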

tleb (Member) commented Sep 24, 2024

Hm, a somewhat simpler approach would be to avoid any kind of concurrent access. The current production server can hold a duplicate of any project's data: the biggest is Linux at 31G, and the prod server has 39G left. We could also scale up its disk; that is an option.

That way we avoid any potential concurrency issue.

That has a few consequences:

  • [-] It raises the bar for hosting an instance.
  • [+] It gives us a path for indexing on another server.
  • [-] We need to handle that side-copy. We must copy the data, then do the indexing work, then update the data used by the HTTP server (using symlinks, I guess). The final step must be done atomically, or we lose the benefit of not requiring synchronisation.
  • [+] It lets us switch to another (non-concurrent) database if we ever desire that.

Question:

  • What do you think?
  • What part of the code should implement that? Maybe utils/index-all-repositories? Or should it be done by update.py?

$ du -hc linux/data/* | sort -hr
31G	total
15G	linux/data/references.db
12G	linux/data/versions.db
4.0G	linux/data/definitions.db
196M	linux/data/hashes.db
149M	linux/data/blobs.db
113M	linux/data/doccomments.db
102M	linux/data/filenames.db
34M	linux/data/compatibledts.db
5.2M	linux/data/compatibledts_docs.db
8.0K	linux/data/variables.db

tleb changed the title from "500 error on block/blk-core.c in the kernel" to "HTTP 500 errors during indexing" Sep 24, 2024
tleb (Member) commented Nov 8, 2024

The same sort of HTTP 500 trace still appears on Oct 28, Oct 29, Nov 1, Nov 2 and Nov 4. All of those dates are after the deploy of 0b8d735 (Oct 11), so the shared flag is not enough.

I maintain that the simpler approach is to do indexing on the side and update symlinks. Prod server storage is even less of an issue now that I've run some git gc; we have 69G available.

tleb added the bug label Nov 8, 2024
3 participants