Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running time improvement of the bitset_len method #13352

Closed
dcoudert opened this issue Aug 8, 2012 · 59 comments
Closed

Running time improvement of the bitset_len method #13352

dcoudert opened this issue Aug 8, 2012 · 59 comments

Comments

@dcoudert
Copy link
Contributor

dcoudert commented Aug 8, 2012

This patch improves the running time of the bitset_len method by using mpn_popcount().

Before:

sage: B = Bitset('10'*10000)
sage: len(B)
10000
sage: %timeit len(B)
1000 loops, best of 3: 245 us per loop

After:

sage: B = Bitset('10'*10000)
sage: len(B)
10000
sage: %timeit len(B)
100000 loops, best of 3: 2.52 us per loop

This patch also reorganizes the bitset files, including deleting the sage/misc/bitset_pxd.pxi file.

APPLY:

CC: @nathanncohen @nexttime

Component: misc

Author: David Coudert, Jeroen Demeyer

Reviewer: David Coudert

Merged: sage-5.13.rc0

Issue created by migration from https://trac.sagemath.org/ticket/13352

@nexttime
Copy link
Mannequin

nexttime mannequin commented Aug 9, 2012

comment:2

I'd rather "implement" the functions in C (where you can use the C preprocessor and __builtin_popcount() if available) and write a thin Cython wrapper.

GCC e.g. chooses efficient machine implementations (probably a single instruction), depending on the (configured or specified) target and perhaps also further compiler flags.

Note that GMP has functions like mpz_popcount() for mpz_ts as well.


Not sure what the parallel refers to ... ;-)

@dcoudert
Copy link
Contributor Author

dcoudert commented Aug 9, 2012

trials not working

@dcoudert

This comment has been minimized.

@dcoudert
Copy link
Contributor Author

dcoudert commented Aug 9, 2012

comment:3

Attachment: trac_13352-bitcount-in-c.patch.gz

Replying to @nexttime:

I'd rather "implement" the functions in C (where you can use the C preprocessor and __builtin_popcount() if available) and write a thin Cython wrapper.

GCC e.g. chooses efficient machine implementations (probably a single instruction), depending on the (configured or specified) target and perhaps also further compiler flags.

As reported in https://groups.google.com/forum/#!topic/sage-devel/GnudTUwbGG4, I tried to implement the function in C with __builtin_popcount(). I was able to compile successfully, but then sage was not working anymore.
See attached patch attachment: trac_13352-bitcount-in-c.patch.

Note that GMP has functions like mpz_popcount() for mpz_ts as well.

It could be another solution, but I have not been able yet to find how to access the functions used for int behind?

Not sure what the parallel refers to ... ;-)

In fact, the 4 lines n = __COUNT32__(n, 0) can be executed in parallel and gcc is sometimes able to optimize the sequence of instructions. For instance, all methods have similar running times on my macbook air, but parallel methods are slower on my old PC (system still in 32-bits...).

@dcoudert

This comment has been minimized.

@nexttime
Copy link
Mannequin

nexttime mannequin commented Aug 9, 2012

comment:5

Replying to @dcoudert:

Replying to @nexttime:

Not sure what the parallel refers to ... ;-)

In fact, the 4 lines n = __COUNT32__(n, 0) can be executed in parallel

Well, obviously not, since you have the data dependence on n. ;-)

Btw., you should use UINT32_C(1) instead of 0x1u etc., and e.g. UINT32_MAX instead of (<uint32_t>-1) (Cython code) etc.

The types of the shift parameters (of __TWO??__()) are slightly overdim'ed...

@dcoudert
Copy link
Contributor Author

dcoudert commented Aug 9, 2012

comment:6

I'm able to import UINT32_MAX, but not UINT32_C.

  • with from libc.stdint cimport UINT32_C: it is apparently not imported
  • with from libc.stdint import UINT32_C: it compiles, but sage crashes with weird messages

Any special trick?

@jdemeyer
Copy link

comment:7

Attachment: trac_13352_bitset_len.patch.gz

@jdemeyer jdemeyer modified the milestones: sage-5.11, sage-5.12 Aug 13, 2013
@dcoudert
Copy link
Contributor Author

Attachment: trac_13352_bitset_len_v2.patch.gz

@dcoudert
Copy link
Contributor Author

comment:8

Hello,

Since I have now understood how to use __builtin_popcountl (not obvious when you don't know), I have uploaded a new version of the patch.

before:

sage: B = Bitset('00'*10000+'1')
sage: len(B)
1
sage: %timeit len(B)
1000000 loops, best of 3: 268 ns per loop
sage: B = Bitset('10'*10000)
sage: len(B)
10000
sage: %timeit len(B)
1000 loops, best of 3: 245 us per loop
sage: B = Bitset('10'*1000000)
sage: len(B)
1000000
sage: %timeit len(B)
10 loops, best of 3: 24.1 ms per loop

after:

sage: B = Bitset('00'*10000+'1')
sage: len(B)
1
sage: %timeit len(B)
1000000 loops, best of 3: 275 ns per loop
sage: B = Bitset('10'*10000)
sage: len(B)
10000
sage: %timeit len(B)
100000 loops, best of 3: 2.52 us per loop
sage: B = Bitset('10'*1000000)
sage: len(B)
1000000
sage: %timeit len(B)
1000 loops, best of 3: 236 us per loop

I'm wondering if I should add a wrapper in case the __builtin_popcountl method is not available. This is not difficult, but I don't know which environnement variables should be tested.

Thanks.

@dcoudert

This comment has been minimized.

@dcoudert

This comment has been minimized.

@dcoudert
Copy link
Contributor Author

comment:9

I have now implemented the function in C in file popcount.h with a small wrapper. The result is a bit slower than without wrapper, but still way faster than previous method.

sage: B = Bitset('00'*10000+'1')
sage: len(B)
1
sage: %timeit len(B)
1000000 loops, best of 3: 361 ns per loop
sage: B = Bitset('10'*10000)
sage: %timeit len(B)
100000 loops, best of 3: 3.03 us per loop
sage: B = Bitset('10'*1000000)
sage: len(B)
1000000
sage: %timeit len(B)
1000 loops, best of 3: 287 us per loop

I have used the _WIN32 flag, but I'm not sure it's the good one. So if one knows the best flag to use to detect when the __builtin_popcount method is not available, please let me know.

@dcoudert
Copy link
Contributor Author

comment:10

Hello,

after browsing the web, I came up with this new version. The current version tests is the cpu has builtin support. If yes, we use builtin version. Else, we use hakmem methods.

We don't need all the methods for bitset_len but it could be useful for other usage.

Let me know if it is OK.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Sep 17, 2013

comment:11

HMmmmmmmmm.... I'm trying to understand the code and it's much more complicated than what I actually read. Funny, looks like you can have as much fun with GCC than with Python, and decide how stuff is to be run at run time instead of compile time :-P

This being said, if I understand well what "ifunc" does I do not see how the popcount instruction will be used when available. To me it looks like you should make sure that the file is compiled with the -mpopcnt GCC flag (so that __builtin_popcnt is always replaced by the popcount instruction even if it does not exist), and that this ifunc trick means to avoid using this instruction if it is detected that, at runtime, it is not available.

This being said, it is a very nice trick O_o

Nathann

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Sep 17, 2013

comment:12

I would also say that those 2 blocks

__attribute__ ((__target__ ("popcnt"))) 
 static unsigned int builtin_popcount(unsigned int w){ 
 return (sizeof(unsigned int)==4? __builtin_popcount(w) : __builtin_popcountl(w)); 
 } 

are mistakes. Whatever happens, regardless of the architecture, __builtin_popcount works on int type, and __builtin_popcountl works on long types.

Also, perhaps you should credit Cristian Rodríguez somewhere :-P

Nathann

@dcoudert
Copy link
Contributor Author

comment:13

Attachment: trac_13352_bitset_len_v3.patch.gz

Right. I have simplified the code and add credits to Cristian Rodriguez.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Sep 19, 2013

comment:14

Here's a slightly cleaner version which does not use ifunc but only G++'s multiversionning. C's looking very much like Python these days :-PPPP

Nice links :

Nathann

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Sep 19, 2013

Attachment: a.c.gz

@dcoudert
Copy link
Contributor Author

comment:15

Unfortunately, I'm unable to use this for the patch because the "default" keyword is not recognized.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Sep 20, 2013

comment:16

Hmmmmmm... Well, it seems that this feature is only available in gcc 4.8+. I asked Volker for his advice on this ticket (or rather on the .c file I uploaded), and he told me that we should check for both gcc 4.8+ and x86 (ifdef __i386__).

Besides, he saw on that link (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041, though I can't find out where on the page) that GCC was about to patch that, and improve the generic version of __builtin_gcc. If that's the case, perhaps we should just use theirs, waiting for the patch to be merged with a future gcc release. Otherwise we can use the MIT HAKMEM version indeed.

I'll try to send a version with both codes today. And we'll remove the old one in a couple of years :-P

Nathann

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Sep 20, 2013

comment:17

Hey I don't get it... In your patch you also use a attribute/target with popcount. You can't compile my code yet you can compile yours ? And that means that they added this attribute/target feature before adding the "default" keyword ? O_o

Nathann

@dcoudert
Copy link
Contributor Author

comment:18

Apparently yes. Have you tried to modify the patch and compile on your own computer?

@dcoudert
Copy link
Contributor Author

comment:31

That would be great if you have the good trick to use that method. I tried some time ago to use mpz_popcount and failed.

@jdemeyer
Copy link

comment:32

I'm also merging bitset_pxd.pxi and bitset.pxd (I see no reason why these would be two separate files) and moving the doctests to bitset.pyx.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Dec 10, 2013

comment:33

Hmmmmmmmmmm O_O

You are on a dangerous road, Jeroen. The road to many broken dependencies in the graph/ folder :-PPPPPP

Nathann

@jdemeyer
Copy link

comment:34

Replying to @nathanncohen:

Hmmmmmmmmmm O_O

You are on a dangerous road, Jeroen. The road to many broken dependencies in the graph/ folder :-PPPPPP

You mean because of bitset_pxd.pxi? I could simply leave that file with a one-liner

from sage.misc.bitset cimport *

if you prefer.

@jdemeyer
Copy link

Changed author from David Coudert to David Coudert, Jeroen Demeyer

@jdemeyer

This comment has been minimized.

@jdemeyer
Copy link

comment:36

Replying to @dcoudert:

That would be great if you have the good trick to use that method. I tried some time ago to use mpz_popcount and failed.

The trick is to use mpn_popcount instead of mpz_popcount.

@jdemeyer
Copy link

comment:37

Speed is much better too.

With attachment: trac_13352_bitset_len_v3.patch​:

sage: B = Bitset('10'*10000)
sage: timeit('len(B)',repeat=100,number=10000)
10000 loops, best of 100: 1.24 µs per loop
sage: B = Bitset('00'*10000)
sage: timeit('len(B)',repeat=100,number=10000)
10000 loops, best of 100: 254 ns per loop

With attachment: trac_13352_bitset_len_v4.patch​:

sage: B = Bitset('10'*10000)
sage: timeit('len(B)',repeat=100,number=10000)
10000 loops, best of 100: 168 ns per loop
sage: B = Bitset('00'*10000)
sage:  timeit('len(B)',repeat=100,number=10000)
10000 loops, best of 100: 168 ns per loop

@jdemeyer
Copy link

comment:38

Similar improvements could be made to other bitset functions (in particular to bitset_next() and the like). But that would be for a new ticket...

@jdemeyer
Copy link

comment:39

Strangely, many places were importing bitset without actually using bitsets... so I removed those redundant imports.

Even stranger were two files sage/combinat/debruijn_sequence.pxd and sage/matroids/unpickling.pxd containing just one line

include 'sage/misc/bitset_pxd.pxi' 

which was in both cases unused.

@jdemeyer
Copy link

Attachment: trac_13352_bitset_len_v4.patch.gz

@dcoudert
Copy link
Contributor Author

comment:40

Excellent! The patch is working perfectly (install, doctests on sage/ , etc.). The simplification of the hamming weight method is also very good.

Thank you very much.

Should we let Nathann conclude on the status of this ticket?

@jdemeyer
Copy link

comment:41

See also #10093 for another bitset-related ticket ready for review.

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Dec 10, 2013

comment:42

Should we let Nathann conclude on the status of this ticket?

No sorry, please deal with this without me, I'm already juggling with 1000 patches right now O_o

Nathann

@dcoudert
Copy link
Contributor Author

comment:43

For me the patch proposed by Jeroen is OK (and I can install #10093 on top of it), so I set its status to positive.

@jdemeyer
Copy link

Reviewer: David Coudert

@jdemeyer
Copy link

comment:45

This is ridiculous and shows how fast mpn_popcount() is:

sage: B = FrozenBitset(capacity=10^5)
sage: timeit('B.isempty()', repeat=100, number=10000)
10000 loops, best of 100: 980 ns per loop
sage: timeit('len(B)', repeat=100, number=10000)
10000 loops, best of 100: 573 ns per loop

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Dec 10, 2013

comment:46

Hmmmmmm... Would you happen to have the popcount instruction on the computer on which you try this ?

Nathann

@jdemeyer
Copy link

comment:47

Replying to @nathanncohen:

Hmmmmmm... Would you happen to have the popcount instruction on the computer on which you try this ?

I think so, yes.

@jdemeyer
Copy link

Merged: sage-5.13.rc0

@jasongrout

This comment has been minimized.

@dcoudert

This comment has been minimized.

@dcoudert
Copy link
Contributor Author

comment:50

I changed the description since we use mpn_popcount and not ___builtin_popcount

@jdemeyer
Copy link

jdemeyer commented Sep 5, 2014

comment:51

Follow-up: #16937

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants