Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Towards splitting coordinates in separate buffers and files #1454

Merged
merged 1 commit into from
Dec 10, 2019

Conversation

stavrospapadopoulos
Copy link
Member

Towards addressing #93. Split writer algorithms such that the dimension coordinates are split in different buffers and handled separately.

Copy link
Contributor

@joe-maley joe-maley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a partial review of the header files and small changes in .cc files.

@@ -675,6 +675,20 @@ void expand_mbr(T* mbr, const T* coords, unsigned int dim_num) {
}
}

template <class T>
void expand_mbr(T* mbr, const std::vector<T*>& coords, uint64_t pos) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll only leave one of these comments because it's a little pedantic, but we should value-const function parameters in the definitions (but not declarations) whenever possible for const correctness. So in this example, I would do T* const mbr and const uint64_t pos.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mbr is modified by the function, perhaps I should move it to the end of the list (that's the convention we've been following). Good point on pos although every other place in the code must be changed similarly. We can start now :).

uint64_t original_buffer_var_size_;

/** Constructor. */
CoordBuffer() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's changed in recent years, but list-initializing these class members has been classically faster.

* Constructor.
*
* @param domain The array domain.
* @param buff The coordinate buffers, one per dimension, containing the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename buff to coord_buffs to match the declaration.

*/
GlobalCmp2(const Domain* domain, const std::vector<const void*>& coord_buffs)
: domain_(domain)
, coord_buffs_(coord_buffs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unsafe because coord_buffs is a temporary bound to a class member reference (coord_buffs_). The temporary coord_buffs is only guaranteed to persist for the lifetime of this constructor. In other words: coord_buffs_ will NOT extend the lifetime of coord_buffs. When coord_buffs_ is dereferenced outside of the constructor, for example: within GlobalCmp2::operator(), it may be invalid.

From the C++03 standard:

The second context is when a reference is bound to a temporary. The temporary to which the reference is bound or the temporary that is the complete object to a subobject of which the temporary is bound persists for the lifetime of the reference except as specified below. A temporary bound to a reference member in a constructor’s ctor-initializer (§12.6.2 [class.base.init]) persists until the constructor exits.

So we must ensure that the original coord_buffs outlives this GlobalCmp2.
I think the right thing to do here is to change coord_buffs_ from a const-ref to a const raw ptr. The contract is the same, but if the object is freed before we access it, we'll get a segfault instead of a garbage object.

Or, we could copy/move it, but I don't recommend that.

* - +1 if the first coordinates succeed the second on the cell order
*/
template <class T>
static int cell_order_cmp(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for making this static? Each cpp file that includes this header will get its own text copy of this routine. We should make this non-static to avoid binary bloat.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to be able to store it in cell_order_cmp_func_, this function needs to be static, otherwise the compiler complains.

* - +1 if the first coordinates succeed the second on the tile order
*/
template <class T>
static int tile_order_cmp(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove static

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the above.

* - d: The dimension index to compare on.
*/
std::vector<int (*)(
const Dimension* dim,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about typedef or using = either the vector its element template? Just an easy way to re-use it with the tile_order_cmp_func_ declaration and any other place we may need to reuse it.

@@ -880,6 +1037,29 @@ class Writer {
const std::string& attribute,
FragmentMetadata* frag_meta,
const std::vector<Tile>& tiles) const;

// TODO: remove
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is intended to be checked in, could you expand this comment to explain why it should be removed in a future patch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be removed in my next PR over the next couple of days, but I will add a comment.

Copy link
Contributor

@joe-maley joe-maley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, just a few nitpick comments that you can address or not. Thanks!

@@ -438,15 +446,16 @@ Status Writer::check_buffer_sizes() const {

auto cell_num = array_schema_->domain()->cell_num(subarray_);
uint64_t expected_cell_num = 0;
for (const auto& attr : attributes_) {
for (const auto& it : attr_buffers_) {
auto attr = it.first;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const-ref here to avoid unnecessary string copy

}
coord_buffer_sizes_.clear();
coord_buffers_.clear();
coord_buffers_alloced_ = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is coord_buffers_alloced_ necessary? Couldn't we determine if we allocated the buffers just by checking if coord_buffers_ is or is not empty?

Does this instead indicate ownership of the buffers? Is there a scenario where coord_buffers_ is non-empty but this path wouldn't be responsible for freeing the buffers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is. In the very next PR, the user will be able to set the coordinate buffers separately for each dimension (whereas currently, they must zip all coordinates in a single buffer). We need to have a way to differentiate, which is accomplished by checking that member.

case Datatype::UINT16:
return compute_coords_metadata<uint16_t>(tiles, meta);
case Datatype::INT32:
return compute_coords_metadata<int>(tiles, meta);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extreme nitpick: let's use int32_t here instead of int. same for unsigned->uint32 below.

for (const auto& it : attr_buffers_) {
auto attr = it.first;
if (global_write_state_->attr_cells_written_[attr] != cell_num) {
std::cout << global_write_state_->attr_cells_written_[attr] << "\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a debug statement? we should remove this

std::string ret;
auto dim_num = array_schema_->dim_num();

ret = "(";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this will be called frequently, consider a stringstream here to avoid a bunch of string copies.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it isn't, but I will change anyway.

…on coordinates are split in different buffers and handled separately.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants