-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GEODE-9753: Solve PDX serialization coredump #891
Conversation
af45bbf
to
007e822
Compare
- While serializing a PdxSerializable object, there is a possible race condition which might cause the client to crash. This race-condition happens whenever the cluster is restarted during the serialization process and if on-client-disconnect-clear-pdxType-Ids is set to true, meaning the PdxTypeRegistry will be cleaned up if the connection towards the cluster is lost. - This issue has been solved by using the previously fetched local PDX type. - In order to properly test this solution, PdxRemoteWriterFactory has been added, so the race-condition can be force at test-time. - A new IT has been added to test that the solution is working fine. - make_unique was needed inside cppcache/src, so it was moved to utils/cxx_extensions.hpp
007e822
to
b010059
Compare
I am reviewing all of my pending PRs, and this has been awating review for quite some time. Maybe @pdxcodemonkey, @pivotal-jbarrett @albertogpz @mkevo @jvarenina or anyone else could take a look at it and provide some feedback? |
return PdxRemoteWriter(output, className, pdxTypeRegistry); | ||
} | ||
}; | ||
auto prw = cacheImpl->getPdxRemoteWriterFactory().create( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a slightly longer, meaningful name? like remoteWriter
for instance?
namespace apache { | ||
namespace geode { | ||
namespace client { | ||
|
||
class PdxRemoteWriter : public PdxLocalWriter { | ||
class APACHE_GEODE_EXPORT PdxRemoteWriter : public PdxLocalWriter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't export this, we can't change the ABI and anyway we don't want any more publicly-accessible things than we already have(!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exported for testing purposes. Will try to test it the way @pivotal-jbarrett, which would avoid exporting symbols.
namespace geode { | ||
namespace client { | ||
|
||
class APACHE_GEODE_EXPORT PdxRemoteWriterFactoryImpl : public virtual PdxRemoteWriterFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, please don't export.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exported for testing purposes. Will try to test it the way @pivotal-jbarrett, which would avoid exporting symbols.
return PdxRemoteWriter(output, className, pdxTypeRegistry); | ||
} | ||
}; | ||
auto prw = cacheImpl->getPdxRemoteWriterFactory().create( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this extracted into an pure virtual interface for the sole purpose of integration testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I've seen you proposed an alternative way of testing this. I'll look into it.
Do you mean this issue is solved in this PR or in a previous and this PR just adds more testing? |
cachePerfStats.incPdxSerialization( | ||
pdxLen + 1 + 2 * 4); // pdxLen + 93 DSID + len + typeID | ||
} | ||
uint8_t* stPos = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto
cachePerfStats.incPdxSerialization( | ||
pdxLen + 1 + 2 * 4); // pdxLen + 93 DSID + len + typeID | ||
} | ||
uint8_t* stPos = const_cast<uint8_t*>(output.getBuffer()) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto
} | ||
initialize(); | ||
|
||
m_pdxClassName = pdxType->getPdxClassName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If pdxType
is nullable then this line will throw.
m_preserveDataIdx(0), | ||
m_currentDataIdx(-1), | ||
m_remoteTolocalMapLength(0) { | ||
m_preserveData = nullptr; | ||
if (m_pdxType != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If pdxType
is not nullable then this check is unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thing is I did leave that because I thought what if pdxType passed were nullptr, but that's not the way it's intended to be used, so I guess it would be best to force a coredump if pdxType is nullptr, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, if it is nullptr
then line 63 is going to fault anyway. Also, if passed here as nulltpr
but later use excepted m_pdxType
to not be nullptr
... boom.
std::shared_ptr<PdxTypeRegistry> pdxTypeRegistry); | ||
|
||
PdxRemoteWriter(DataOutput& output, std::string pdxClassName, | ||
PdxRemoteWriter(DataOutput& output, std::shared_ptr<PdxType> pdxType, | ||
std::shared_ptr<PdxRemotePreservedData> preservedData, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It this really feels like an extension of the other constructor with the optional preservedData
parameters to maybe re-order the parameters to look like that.
|
||
PdxRemoteWriterFactoryImpl::~PdxRemoteWriterFactoryImpl() = default; | ||
|
||
std::unique_ptr<PdxRemoteWriter> PdxRemoteWriterFactoryImpl::create( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It really seems like you could test this at unit level by mocking PdxTypeRegistery
and PdxType
. Make this a member function back from where it came and test it a the unit level. This avoids the creation of this factory, the exporting of internal symbols, and the interdiction of artificial behavior into the production implementation, via overriding this implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your approach seems promising. I'll look into it and see if it's feasible.
Ok, I think I found the fix but I a not sure I understand it. How does preserving the |
The issue I am referring to here is the one I described in the bullet above. |
Closing this PR as this issue is already addressed in GEODE-10276 |
condition which might cause the client to crash.
This race-condition happens whenever the cluster is restarted during
the serialization process and if
on-client-disconnect-clear-pdxType-Ids is set to true, meaning the
PdxTypeRegistry will be cleaned up if the connection towards the
cluster is lost.
type.
so the race-condition can be force at test-time.
utils/cxx_extensions.hpp