doc: add initial announcement
Signed-off-by: Fengguang Wu <[email protected]>
Fengguang Wu committed Nov 10, 2017
1 parent 8c0cf51 commit 5d8db2f
Showing 1 changed file with 177 additions and 0 deletions: doc/announce.md

0-day kernel build/boot testing farm
====================================

(June 2012 MSR by Fengguang Wu <[email protected]>)

the problem
-----------

The Linux kernel has a vibrant community and fast development cycles, which is
excellent. On the other hand, the large changesets carry bugs and regressions.
Judging by the pains that I, as a typical kernel developer, run into in daily
hacking, there are a lot of improvements to be desired.

Build errors are often regarded as trivial. However, we obviously lack an
effective way to prevent many of them from leaking into Linus' tree, not to
mention the linux-next tree, where they hurt many -mm developers.

According to Geert's "Build regressions/improvements in v3.4" report, there are
~100 known build bugs shipped with the official Linux 3.4 release. The numbers
are somewhat exaggerated because they include build failures for many
less-cared-for archs, but the fact still struck me.

The attached xfs.png and drm.png show my initial build status for two typical
dev trees. Each red 'c' character indicates one commit that won't build for one
kconfig. A line full of 'c' indicates a build bug inherited from the base tree
(i.e. Linus' tree); a run of 'c' characters means a build error was introduced
and fixed _some time_ later, which is a problem for bisection.
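
To make the picture concrete, here is a minimal sketch of how such a
commit-by-kconfig status grid could be rendered. The data layout and the sample
commits/kconfigs are illustrative assumptions, not the farm's actual format:

```python
# Minimal sketch: render a build-status grid, one row per kconfig, one
# column per commit (oldest to newest). 'c' marks a failed build, '.' a
# clean one. The sample data below is made up for illustration.
results = {
    # results[kconfig][commit] is True when the build succeeded
    "x86_64-defconfig":  {"a1b2c3": True, "d4e5f6": False, "789abc": False},
    "i386-randconfig-1": {"a1b2c3": True, "d4e5f6": True,  "789abc": True},
}
commits = ["a1b2c3", "d4e5f6", "789abc"]

for kconfig, per_commit in results.items():
    row = "".join("." if per_commit[c] else "c" for c in commits)
    print(f"{row}  {kconfig}")
```

A row that is all 'c' points at a failure inherited from the base tree, while a
run of 'c' followed by '.' marks an error introduced and later fixed within the
branch.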

Runtime oopses are more challenging. As you may discover on LKML, lots of the
bug reports are simply ignored, because it's often really hard to track down
user-reported problems. Hard-to-reproduce bugs are virtually unfixable; bugs in
old kernels get little attention from upstream developers; regressions not
bisected down to one particular commit can kill quite a few brain cells, and
there is the question "who is to blame for^W^Wown this bug?". To be frank, the
only way to guarantee the prompt fix of a bug is to explicitly tell the
developer: hi, your XXX commit triggered this YYY bug.

It boils down to one question: how can we make sure every regression is
caught, root-caused and fixed in a timely and easy fashion? There is lots of
work to do in each development stage, and the part of the problem I'm trying
to attack is: quality assurance in the very early development stage, as soon as
new commits are pushed to public git trees.

0-day kernel build test farm
----------------------------

In order to effectively improve Linux kernel quality and fuel its R&D cycles,
I'm setting up this 0-day kernel build test farm, with the following highlights:

0. 0 efforts to use
1. 1-hour response time (aka. 0-day)
2. "brute-force" commit-by-commit tests
3. auto test all branches in all developers' git trees
4. automated error notification to the right developer

### 0 efforts to use

We need to encourage, but NOT rely on, the developers' self-discipline to do
tests on their own. I noticed that even the most seasoned maintainers who
manage their own professional build tests may act carelessly at times and push
untested commits publicly. IMHO this is human nature that we need to face
rather than blame. Then there are the more typical developers who only build
and run their kernels for one config and one piece of hardware. We have to
accept that not everyone will bother, or have the time/resources, to carry out
thorough tests.

So the most effective way to quickly improve Linux quality would be to run a
test farm that works 7x24 on all the new commits. I'm not trying to sell shiny
test tools to the kernel developers (at least, that's not the No. 1 goal), but
rather to take on the effort of setting up and maintaining one test farm and
making it perform well.

The kernel developers are delighted to find that, all of a sudden, they are
backed by a professional build testing system. The responses have mostly been
positive, and the few negative ones did help improve the system.

### 1-hour response time (aka. 0-day)

This is indeed a very important and achievable target. It creates an excellent
user experience and makes developers feel at home, because they can hardly do
better even when kicking off tests on their own machines. It makes Intel look
good, professional and powerful, and brings Intel very close to the community.

Quite a few developers (including myself) overuse linux-next as their catch-all
testbed... even for silly build errors. linux-next is re-assembled and tested
on a daily basis, and I'm trying to outrace it and get errors reported and
fixed before the linux-next merge.

### auto test all branches in all developers' git trees

There are nice tools to help developers do in-house tests, and there are
well-established build farms that work daily on the linux-next tree. However,
there is still one big gap lying in between: the various dev branches inside
the various git trees call for more third-party testing.

Our test farm will auto-grab all newly created or updated branches and make
sure every new piece of work is properly tested, hopefully before it is merged
into linux-next as well as the non-rebasing Linus/tip/net etc. upstream trees.
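
A minimal sketch of the branch-grabbing step, assuming a plain `git ls-remote`
poll; the tree list, the state file and the hand-off to the build queue are
illustrative assumptions, not the farm's actual implementation:

```python
# Minimal sketch: poll public git trees and queue new or updated branch heads.
import json
import subprocess
from pathlib import Path

TREES = {  # hypothetical tree list; the real farm monitors many more
    "linus": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git",
}
STATE = Path("branch-heads.json")  # remembered heads from the previous poll

def remote_heads(url):
    """Return {branch: commit} for all heads of a remote tree."""
    out = subprocess.run(["git", "ls-remote", "--heads", url],
                         capture_output=True, text=True, check=True).stdout
    heads = {}
    for line in out.splitlines():
        sha, ref = line.split("\t")
        heads[ref.removeprefix("refs/heads/")] = sha
    return heads

def poll():
    known = json.loads(STATE.read_text()) if STATE.exists() else {}
    for name, url in TREES.items():
        for branch, sha in remote_heads(url).items():
            key = f"{name}/{branch}"
            if known.get(key) != sha:          # new branch or new head
                print(f"queue for testing: {key} -> {sha}")
                known[key] = sha
    STATE.write_text(json.dumps(known, indent=2))

if __name__ == "__main__":
    poll()
```

Run from cron every few minutes, a poll like this catches both brand-new
branches and fast-forwarded or rebased ones.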

### "brute-force" commit-by-commit tests

It's a common expectation that developers do bisectability tests, however
there has been no way to *ensure* this. Perhaps it was deemed impossible for
some central server(s) to carry out bisectability tests for all the 10000+
commits merged in one Linux release. However, my experiments show that, by
taking advantage of some optimizations, it only requires one single 2-socket
SandyBridge server to do basic build tests for each and every commit. Adding
more servers will further improve the test coverage and response time.

The key observation is that if it takes half an hour to build the 1st commit
from scratch, the following 10 commits (as incremental changes) typically take
only another half hour to compile. In that sense, these are not really
'brute-force' compilations. Considering the guarantees of bisectability and the
ability to find the right developer to notify, the cost is well justified.
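
A minimal sketch of the commit-by-commit build loop, assuming a shared object
directory so that follow-up builds are incremental; the paths, commit range and
error handling are simplified assumptions, not the farm's actual scripts:

```python
# Minimal sketch: build every commit in a range, oldest first, reusing the
# object tree so later commits compile incrementally. Paths and the commit
# range are illustrative.
import subprocess

def git(args, **kw):
    return subprocess.run(["git", "-C", "linux"] + args, text=True, **kw)

def commits_in_range(base, head):
    out = git(["rev-list", "--reverse", f"{base}..{head}"],
              capture_output=True, check=True).stdout
    return out.split()

def build(commit, jobs=32):
    git(["checkout", "-q", commit], check=True)
    # No 'make clean' between commits: the reused object tree is what makes
    # each follow-up build incremental and the whole loop affordable.
    result = subprocess.run(["make", f"-j{jobs}", "O=../obj"], cwd="linux",
                            capture_output=True, text=True)
    return result.returncode == 0, result.stderr

for commit in commits_in_range("v3.4", "origin/master"):
    ok, errors = build(commit)
    if not ok:
        print(f"build failed at {commit}:\n{errors}")
```

Because the loop walks the commits oldest-first, the first failing commit in an
otherwise clean run is exactly the commit a bisect would have landed on.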

### automated error notification to the right developer

Compile errors are, after all, trivial. They are best suited for automation,
which helps guarantee the response time: once human checks are involved, the
added delays become unpredictable. It also helps reduce long-term maintenance
cost.
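
A minimal sketch of how a build failure could be routed back to the author of
the offending commit; the report wording and the delivery step (printed here
rather than mailed) are illustrative assumptions:

```python
# Minimal sketch: compose a notification for the author of the commit whose
# build broke. Report format and delivery are illustrative, not the farm's
# actual reports.
import subprocess

def commit_author_and_subject(commit, repo="linux"):
    out = subprocess.run(
        ["git", "-C", repo, "log", "-1", "--format=%an <%ae>%n%s", commit],
        capture_output=True, text=True, check=True).stdout.splitlines()
    return out[0], out[1]            # "Name <email>", one-line subject

def build_report(commit, kconfig, errors):
    author, subject = commit_author_and_subject(commit)
    body = (f"Hi,\n\n"
            f"your commit {commit[:12]} (\"{subject}\") breaks the build\n"
            f"for {kconfig}:\n\n{errors}\n")
    return author, f"[build failure] {subject}", body

# Usage sketch:
# to, subject, body = build_report(bad_sha, "x86_64-randconfig-1", make_stderr)
# print(f"To: {to}\nSubject: {subject}\n\n{body}")
```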

current status
--------------

We are running two 2-socket SandyBridge compile servers. They build 300-400
commits per day, testing 30 kconfigs for each commit, which adds up to ~10000
kernels per day.

We are "routinely" catching 1-2 new build error(s) on each working day. New
build warnings and sparse check warnings are also discovered on a daily basis.
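
One way to tell which warnings are genuinely new is to collect the warning
lines for a commit and its parent and take the set difference; how the logs are
produced (e.g. a W=1 or sparse build) is assumed here, and the sketch only
shows the comparison:

```python
# Minimal sketch: report only warnings that are new relative to the parent
# commit. The build logs are assumed to come from the same kconfig built at
# the parent and the child commit.
import re

def warning_set(build_log):
    """Collect (file, message) pairs; line numbers are dropped so unrelated
    churn above a warning does not make an old warning look new."""
    warnings = set()
    for line in build_log.splitlines():
        m = re.match(r"(\S+?):\d+(?::\d+)?: (warning: .*)", line)
        if m:
            warnings.add((m.group(1), m.group(2)))
    return warnings

def new_warnings(parent_log, child_log):
    return warning_set(child_log) - warning_set(parent_log)

# Usage: for path, msg in sorted(new_warnings(base_log, head_log)): print(path, msg)
```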

Most of the built kernels will be boot tested. The supporting hardware is a set
of less powerful boxes, each running 4-12 KVM instances, each of which can boot
test a kernel in about 1 minute. Once booted, some heavier tests on memory
management, I/O and the trinity fuzzer are selectively executed. This system
has proved to be good at catching runtime errors (a minimal sketch of one such
KVM boot run appears further below). For example, here is the list of bug
reports I sent:

    11372 N F Jun 22 Cc LKML ( 200:0) &-&->Re: boot hang on commit "PM / ACPI: Fix suspend/resume regression caused by cpuidle cleanup."
    11995 N F Jun 23 Cc LKML ( 101:0) BUG: tracer_alloc_buffers returned with preemption imbalance
    12141 N F Jun 24 Cc LKML ( 39:0) boot hang on CONFIG_FB_VGA16
    12142 F Jun 24 Cc LKML ( 77:0) vfs/for-next: NULL pointer dereference in sysfs_dentry_delete()
    606 F Jun 25 To Joern Engel ( 71:0) NULL dereference in logfs_get_wblocks()
    13017 N F Jun 26 Cc LKML ( 106:0) BUG: No init found on NFSROOT
    13019 F Jun 27 Cc LKML ( 90:0) `-> BUG: held lock freed!

    534 F Jul 03 Cc LKML ( 44:0) genirq: Flags mismatch irq 4. 00000000 (serial) vs. 00000000 (lirc_sir)
    539 F Jul 03 Cc LKML (7640:2) [mac80211-next:for-john] WARNING: at /c/kernel-tests/net/net/wireless/core.c:471 wiphy_register+0
    606 r F Jul 06 Cc LKML ( 351:1) general protection fault on ttm_init()
    626 F Jul 08 Cc LKML (3047:2) WARNING: __GFP_FS allocations with IRQs disabled (kmemcheck_alloc_shadow)
    645 r F Jul 09 Cc LKML (3324:2) rcu_dyntick and suspicious RCU usage
    659 F Jul 10 Cc LKML (5418:2) [kgdb:kgdb-next] KGDB: BP remove failed: ffffffff81026ed0
    662 F Jul 10 Cc LKML (5019:2) [Staging/speakup] BUG: spinlock trylock failure on UP on CPU#0, trinity-child0/484
    663 F Jul 10 Cc LKML (2999:2) linux-next: Early crashed kernel on CONFIG_SLOB
    664 F Jul 10 Cc LKML (3068:2) Kernel boot hangs on commit "switch fput to task_work_add"
    665 F Jul 10 To LKML (3643:2) isdnloop: stack-protector: Kernel stack is corrupted in: ffffffff81e5b55b
    666 F Jul 10 Cc LKML (4748:2) ftrace_ops_list_func() triggered WARNING: at kernel/lockdep.c:3506
    667 F Jul 11 Cc LKML (2769:2) WARNING: at drivers/misc/kgdbts.c:813 run_simple_test()

The pile of bug reports around July 10 are some aged bugs found by the newly
set up randconfig boot tests. Besides these, I didn't send out two
machine-specific bug reports, which we may need to resolve ourselves.
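
The boot tests described above run each freshly built kernel inside a KVM guest
and watch its serial console. Here is a minimal sketch of one such boot run;
the QEMU options, kernel command line, timeout and console markers are
illustrative assumptions:

```python
# Minimal sketch: boot a freshly built kernel under QEMU/KVM and grep the
# serial console for an oops. Assumes the guest initrd powers the machine
# off after running its tests; a timeout is treated as a hang.
import subprocess

def boot_test(bzimage, initrd, timeout=120):
    cmd = [
        "qemu-system-x86_64", "-enable-kvm", "-m", "1024",
        "-nographic", "-no-reboot",
        "-kernel", bzimage, "-initrd", initrd,
        "-append", "console=ttyS0 panic=10 root=/dev/ram0",
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
        console = proc.stdout
    except subprocess.TimeoutExpired:
        return "hang", ""            # neither oopsed nor powered off in time
    if any(marker in console for marker in ("Kernel panic", "BUG:", "Oops")):
        return "oops", console
    return "ok", console

# Usage: status, log = boot_test("obj/arch/x86/boot/bzImage", "initrd.img")
```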

It's been a hard time for me to bring these tests up. However, it seems to be
paying off. The initial number of bugs they exposed indicates they will be
effective in catching new regressions in the future.

summary
-------

Hopefully this will be a valuable long-term project for the Linux community as
well as Intel. We are probably the best candidate to run these tests, not only
because hardware is cheap for Intel, but also because we are in the unique
position of having all the bleeding-edge hardware on which to test-run the new
kernels, and are actually the most willing to make sure they fit well with each
other.

Thanks,
Fengguang
