Valgrind reports memory errors on CORAL machines #2881

Closed
dongahn opened this issue Apr 1, 2020 · 8 comments

@dongahn
Member

dongahn commented Apr 1, 2020

Without TCE:

checking consistency of all components of python development environment... yes
checking whether /usr/bin/python3 version is >= 3.6... yes
checking for /usr/bin/python3 version... 3.6
checking for /usr/bin/python3 platform... linux
checking for /usr/bin/python3 script directory... ${prefix}/lib/python3.6/site-packages
checking for /usr/bin/python3 extension module directory... ${exec_prefix}/lib/python3.6/site-packages
checking for cffi.__version_info__ >= (1,1) in python module cffi... no
configure: error: could not find python module cffi, version 1.1+ required

With TCE:

checking consistency of all components of python development environment... yes
checking whether /usr/tcetmp/bin/python3 version is >= 3.7... yes
checking for /usr/tcetmp/bin/python3 version... 3.7
checking for /usr/tcetmp/bin/python3 platform... linux
checking for /usr/tcetmp/bin/python3 script directory... ${prefix}/lib/python3.7/site-packages
checking for /usr/tcetmp/bin/python3 extension module directory... ${exec_prefix}/lib/python3.7/site-packages
checking for cffi.__version_info__ >= (1,1) in python module cffi... yes
checking for StrictVersion(six.__version__) >= StrictVersion('1.9.0') in python module six... yes
checking for StrictVersion(yaml.__version__) >= StrictVersion ('3.10.0') in python module yaml... yes
checking for StrictVersion(jsonschema.__version__) >= StrictVersion ('2.3.0') in python module jsonschema... no
configure: error: could not find python module jsonschema, version 2.3.0+ required

I have to build a version for CORAL users, so pip --user isn't really an option. But even for testing, after running pip install --user --upgrade jsonschema==2.3.0, the "with TCE" configure still fails.

Any ideas what's going on?
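For anyone debugging the same failure, the configure probes above can be repeated by hand under a given interpreter to see which module it actually finds. This is a sketch, not flux-core's configure logic: the module list and minimum versions are copied from the output above, and `version_at_least` is a simplified numeric stand-in for the `StrictVersion` comparison configure uses. It reads each module's `__version__` attribute, as the configure checks do for six, yaml, and jsonschema.

```python
"""Re-run the Python module checks from flux-core's configure output by hand."""
import importlib
import re
import sys

# (import name, minimum version) copied from the configure output above
REQUIREMENTS = [
    ("cffi", "1.1"),
    ("six", "1.9.0"),
    ("yaml", "3.10.0"),
    ("jsonschema", "2.3.0"),
]


def version_tuple(version):
    """Keep the leading numeric run: '2.6.0.post1' -> (2, 6, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def version_at_least(version, minimum):
    """Compare component by component as integers, padding with zeros,
    so that '3.13' >= '3.10.0' holds (a plain string compare gets this wrong)."""
    have, need = version_tuple(version), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check(name, minimum):
    """Return (ok, detail) for one module under the running interpreter."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return False, "not importable"
    version = getattr(module, "__version__", None)
    if version is None:
        return False, "no __version__ attribute"
    return version_at_least(version, minimum), version


if __name__ == "__main__":
    print(f"interpreter: {sys.executable} ({sys.version.split()[0]})")
    for name, minimum in REQUIREMENTS:
        ok, detail = check(name, minimum)
        print(f"  {name} >= {minimum}: {'yes' if ok else 'no'} ({detail})")
```

Running it as `/usr/tcetmp/bin/python3 check_modules.py` (the file name is arbitrary) reports what that interpreter sees, which is what matters for configure, rather than what the `pip` on PATH installed.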

@dongahn
Member Author

dongahn commented Apr 1, 2020

Looking back at my installations, it appears I was able to build a version on Jan 27, 2020:

configure:14670: checking whether /usr/bin/python version is >= 2.7
configure:14681: /usr/bin/python -c import sys
# split strings by '.' and convert to numeric. Append some zeros
# because we need at least 4 digits for the hex conversion.
# map returns an iterator in Python 3.0 and a list in 2.x
minver = list(map(int, '2.7'.split('.'))) + [0, 0, 0]
minverhex = 0
# xrange is not present in Python 3.0 and range returns an iterator
for i in list(range(0, 4)): minverhex = (minverhex << 8) + minver[i]
sys.exit(sys.hexversion < minverhex)
configure:14684: $? = 0
configure:14686: result: yes
configure:14779: checking for /usr/bin/python version
configure:14786: result: 2.7
configure:14798: checking for /usr/bin/python platform
configure:14805: result: linux2
configure:14831: checking for /usr/bin/python script directory
configure:14866: result: ${prefix}/lib/python2.7/site-packages
configure:14875: checking for /usr/bin/python extension module directory
configure:14910: result: ${exec_prefix}/lib64/python2.7/site-packages
configure:14939: checking for cffi.__version_info__ >= (1,1) in python module cffi
configure:14963: result: yes
configure:14974: checking for StrictVersion(six.__version__) >= StrictVersion('1.9.0') in python module six
configure:14998: result: yes
configure:15009: checking for StrictVersion(yaml.__version__) >= StrictVersion ('3.10.0') in python module yaml
configure:15033: result: yes
configure:15044: checking for StrictVersion(jsonschema.__version__) >= StrictVersion ('2.3.0') in python module jsonschema
configure:15068: result: yes
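The probe logged at configure:14681 is the interpreter version gate: it packs the minimum version into the same byte layout as `sys.hexversion` (major, minor, micro, release level/serial) and exits nonzero if the interpreter is older. A readable equivalent, written here as a sketch with hypothetical function names, makes the comparison explicit:

```python
import sys


def min_version_hex(minimum):
    """Pack 'X.Y[.Z]' into 0xXXYYZZ00, as the configure probe does:
    pad with zeros and shift in the first four components one byte each."""
    parts = [int(p) for p in minimum.split(".")] + [0, 0, 0]
    packed = 0
    for byte in parts[:4]:
        packed = (packed << 8) + byte
    return packed


def interpreter_at_least(minimum, hexversion=sys.hexversion):
    """True when the probe would exit 0, i.e. configure prints 'yes'."""
    return not hexversion < min_version_hex(minimum)


if __name__ == "__main__":
    print(hex(min_version_hex("3.6")), interpreter_at_least("3.6"))
```

With a minimum of "2.7" the threshold is 0x02070000, so any 2.7.x interpreter passes; raising the minimum to 3.6 is enough to reject /usr/bin/python on these systems.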

@SteVwonder
Member

> it appears I was able to build a version on 2020 Jan 27

We are now enforcing python 3.6+ as a hard requirement on master.

To get it working locally, try: pip3 install --user --upgrade jsonschema==2.3.0

The default pip binary on TOSS/BLUEOS is for python2.7:

herbein1@lassen708 ~
❯ which pip
/usr/bin/pip
                                                                                                                                                  
herbein1@lassen708 ~
❯ pip --version
pip 8.1.2 from /usr/lib/python2.7/site-packages (python 2.7)
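One way to avoid the pip/interpreter mismatch is to run pip through the interpreter configure will use (`<python> -m pip`), so packages land in that interpreter's own site-packages. The helper below only builds the command, it does not run it; the function name is hypothetical, and `--user` is kept only because it matches the local-testing suggestion above (it is not an option for the CORAL user build).

```python
import shlex
import sys


def pip_install_user(python, requirement):
    """Build a pip command bound to *python*.

    `<python> -m pip` installs into that interpreter's site-packages, unlike a
    bare `pip` on PATH, which here belongs to python 2.7.
    """
    return [python, "-m", "pip", "install", "--user", "--upgrade", requirement]


if __name__ == "__main__":
    # Defaults to the interpreter running this script; pass another path
    # (for example /usr/tcetmp/bin/python3) as the first argument.
    python = sys.argv[1] if len(sys.argv) > 1 else sys.executable
    cmd = pip_install_user(python, "jsonschema==2.3.0")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```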

On the #flux slack, @grondo mentioned that jsonschema and cffi are installed at the system level for python3.6 on TOSS (I also just double-checked, and they still are). Maybe that didn't happen for BLUEOS? Getting them installed there would be the long-term solution.

As a medium-term solution: it looks like the python3 in /usr/tce has cffi but not jsonschema. Maybe we can get Greg Lee to install the jsonschema package there for us, and you can configure against the /usr/tce python? I'm not sure what the /usr/tce policies are, though.
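To decide which interpreter to configure against, each candidate can be probed for the required modules in its own process. This is a sketch: the candidate paths are the two that appear in this thread, and the function name is hypothetical. Assuming flux-core's configure locates Python with automake's `AM_PATH_PYTHON` (the "script directory" lines above suggest so), the chosen interpreter can then be passed as `PYTHON=/usr/tcetmp/bin/python3 ./configure`.

```python
import os
import subprocess
import sys

# Interpreters named in this thread; pass other paths as arguments instead.
CANDIDATES = ["/usr/bin/python3", "/usr/tcetmp/bin/python3"]
MODULES = ["cffi", "six", "yaml", "jsonschema"]


def missing_modules(python, modules):
    """Return the subset of *modules* that *python* cannot import.

    Each import runs in a fresh process of that interpreter, so the result
    reflects its sys.path, not the path of the interpreter running this script.
    """
    missing = []
    for name in modules:
        result = subprocess.run(
            [python, "-c", "import " + name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for python in sys.argv[1:] or CANDIDATES:
        if not os.access(python, os.X_OK):
            print(f"{python}: not found")
            continue
        missing = missing_modules(python, MODULES)
        print(f"{python}: " + ("all present" if not missing else "missing " + ", ".join(missing)))
```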

@dongahn
Member Author

dongahn commented Apr 1, 2020

OK. I sent an email to Greg about the /usr/tce python3. Is the system python3.6 Py's responsibility, or does someone else maintain it?

@dongahn
Member Author

dongahn commented Apr 1, 2020

@lee218llnl installed a version of jsonschema and flux-core now builds! Thanks Greg.

make check fails some tests, though:

FAIL: t5000-valgrind.t 1 - valgrind reports no new errors on 2 broker run
ERROR: t5000-valgrind.t - exited with status 1
FAIL: t9001-pymod.t 1 - load pymod with echo.py
FAIL: t9001-pymod.t 2 - pymod echo.py function works
FAIL: t9001-pymod.t 3 - unload pymod
ERROR: t9001-pymod.t - exited with status 1

@dongahn
Member Author

dongahn commented Apr 1, 2020

After module load python/3.7.2, the failures boil down to a single case:

FAIL: t5000-valgrind.t 1 - valgrind reports no new errors on 2 broker run
ERROR: t5000-valgrind.t - exited with status 1

Maybe there are memory issues (or benign errors that aren't filtered out) unique to CORAL.

@dongahn
Member Author

dongahn commented Apr 1, 2020

Here is the output from valgrind. It seems this is a system issue, since there are uses of uninitialized parameters?
valgrind.out.zip

dongahn changed the title from "python: latest flux doesn't build on CORAL machines" to "Valgrind reports memory errors on CORAL machines" on Apr 1, 2020
@SteVwonder
Member

> Seems this is a system issue as there are uses of uninitialized parameters?

Huh. Interesting. Here is the offending epoll_modify function in libev. Unless we are passing a NULL fd into libev, I think this may be a bug in libev. It does not appear to be this false positive in valgrind.
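One hypothesis worth checking, which this thread does not establish: a syscall argument can contain uninitialised bytes even when every field was assigned, if the struct has padding. CORAL systems such as lassen are IBM POWER9 (ppc64le), and glibc declares `struct epoll_event` packed only on x86_64; elsewhere a 32-bit `events` followed by a 64-bit `data` union leaves a hole. If libev fills the fields of a stack `struct epoll_event` without zeroing it first, valgrind's `epoll_ctl` check could flag those bytes. The ctypes sketch below models that unpacked layout (the class and helper names are hypothetical) and counts the hole.

```python
import ctypes


class EpollEventLike(ctypes.Structure):
    """Hypothetical mirror of an *unpacked* struct epoll_event:
    a 32-bit event mask followed by a 64-bit data union (modelled as uint64)."""
    _fields_ = [("events", ctypes.c_uint32), ("data", ctypes.c_uint64)]


def padding_between_fields(struct_type):
    """Count bytes lying between the end of one field and the start of the next."""
    fields = struct_type._fields_
    gaps = 0
    for (name, ftype), (next_name, _) in zip(fields, fields[1:]):
        end = getattr(struct_type, name).offset + ctypes.sizeof(ftype)
        gaps += getattr(struct_type, next_name).offset - end
    return gaps


if __name__ == "__main__":
    print("sizeof:", ctypes.sizeof(EpollEventLike))
    print("data offset:", EpollEventLike.data.offset)
    print("padding between fields:", padding_between_fields(EpollEventLike))
```

On 64-bit Linux targets where `uint64_t` is 8-byte aligned (x86_64 and ppc64le included), this layout should show a 4-byte gap. ctypes zero-fills its instances, so the gap is harmless here; in C, the usual remedy would be to `memset` or brace-initialize the struct before filling it.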

@dongahn
Member Author

dongahn commented Aug 31, 2020

Duplicate of #3093

dongahn marked this as a duplicate of #3093 on Aug 31, 2020
dongahn closed this as completed on Aug 31, 2020

2 participants