Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12730: [MATLAB] Update featherreadmex and featherwritemex to build against latest Arrow C++ APIs #10305

Closed
wants to merge 16 commits into from

Conversation

sgilmore10
Copy link
Member

Overview

  • The MEX functions featherreadmex and featherwritemex fail to build against the latest Arrow C++ APIs. These changes allow them to successfully build.
  • These changes require CMake version 3.20 or later in order to access the latest functionality exposed by FindMatlab.cmake. We noticed that some Arrow project components, such as Gandiva, require newer versions of CMake than the core Arrow C++ libraries. If version 3.20 is too new, we're happy to find an alternative.
  • We couldn't find a way to read and write a table description for feather V1 files using the latest APIs. It looks like support for reading and writing descriptions was modified in pull request ARROW-5510: [C++][Python][R][GLib] Implement Feather "V2" using Arrow IPC file format #6694. For now, we've removed support for table descriptions.

Testing

  • Built featherreadmex and featherwritemex on Windows 10 with Visual Studio 2019
  • Built featherreadmex and featherwritemex on macOS Big Sur (11.2.3) with GNU Make 3.81
  • Built featherreadmex and featherwritemex on Debian 10 with GNU Make GNU 4.2.1
  • Ran all tests in tfeather and tfeathermex on all platforms in MATLAB R2021a

Future Directions

  • We did not detect the build failures due to the lack of CI integration. We hope to add CI support soon and will follow up with a mailing list discussion to talk through the details.
  • These changes are temporary to allow us to have a clean slate to start developing the MATLAB Interface to Apache Arrow.
  • Eventually we would like to support the full ranges of data types for feather V1 and feather V2.
  • In order to modernize the code, we plan to migrate to the C++ MEX and MATLAB Data Array APIs.
  • We are going to follow up with another pull request to update the README.md to provide more detailed platform-specific development instructions.
  • The MATLAB based build system inside of the build_support folder is out of date. We are not sure if we want to maintain a separate MATLAB based build system along side the CMake based one. We will follow up on this in the future via the mailing list or Jira.

We acknowledge there is a lot of information in this pull request. In the future, we will work in smaller increments. We felt a bigger pull request was necessary to get back to a working state.

Thanks,
Sarah

@github-actions
Copy link

@sgilmore10
Copy link
Member Author

I'm not too familiar with Travis CI, but I believe the build failure failure on ARM may be unrelated to my changes and/or a pre-existing build issue. It looks like other failed jobs have experienced a similar issue on ARM as well. If there's anything I need to do, please let me know.

@lidavidm
Copy link
Member

Sorry for the long delay - I kicked off the other CI pipelines here.

If you rebase, that'll let you re-run the ARM build - note it's been somewhat flaky but should have gotten better recently (we've switched to a newer machine type, IIRC).

Also, did you have someone in mind to review this? If not we can find someone.

@sgilmore10
Copy link
Member Author

Thanks for looking at this @lidavidm! I rebased against the master branch.

Ideally, we would like to get reviewers outside of MathWorks. I'm not sure what the precedent is for new language bindings in terms of finding the right reviewers. We understand it takes a fair bit of time and effort to review pull requests, so let us know if there's anything we can do to help.

Best,
Sarah

@lidavidm
Copy link
Member

Hey, sorry for the delay (again) - I can try to review this once I get a chance, but you may also want to poke the mailing list ([email protected]) to see if someone else is interested.

@sgilmore10
Copy link
Member Author

No worries! I just messaged the mailing list to see if anyone would be interested in reviewing this.

Best,
Sarah

@@ -15,7 +15,8 @@
# specific language governing permissions and limitations
# under the License.

cmake_minimum_required(VERSION 3.2)
cmake_minimum_required(VERSION 3.20)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit high, is it necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wanted to use 3.20 because earlier versions of the FindMatlab.cmake module had a few bugs we ran into when trying to build our mex functions. I'm open to other approaches, but the hope was to build on the existing work of the cmake community to avoid reinventing the wheel. We also want to share improvements upstream where appropriate.


// TableReader expects a RandomAccessFile.
std::shared_ptr<io::RandomAccessFile> random_access_file(readable_file);
std::shared_ptr<io::RandomAccessFile> random_access_file{
maybe_readable_file.ValueOrDie()};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use ARROW_ASSIGN_OR_RAISE here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

RETURN_NOT_OK(maybe_reader);

// Set the internal reader_ object.
(*feather_reader)->reader_ = maybe_reader.ValueOrDie();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARROW_ASSIGN_OR_RAISE?

matlab/src/feather_reader.cc Outdated Show resolved Hide resolved
arrow::Status status = reader_->Read(&table);
if (!status.ok()) {
mexErrMsgIdAndTxt("MATLAB:arrow:FeatherReader::FailedToReadTable",
"Failed to read arrow::Table from Feather file.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to append the Status message? You might consider having a utility to adorn a status with additional Matlab-specific error message to make this pattern simpler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I'll update the error to include the status message.

arrow::Result<std::shared_ptr<ResizableBuffer>> maybe_buffer =
arrow::AllocateResizableBuffer(internal::BitPackedLength(num_rows_));
RETURN_NOT_OK(maybe_buffer);
std::shared_ptr<ResizableBuffer> validity_bitmap = maybe_buffer.ValueOrDie();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARROW_ASSIGN_OR_RAISE?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

schema_builder.Finish();
RETURN_NOT_OK(table_schema_result);

std::shared_ptr<arrow::Schema> table_schema = table_schema_result.ValueOrDie();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARROW_ASSIGN_OR_RAISE?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

#include <matrix.h>

#include <memory>
#include <string>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to move these here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. See my comment below. I'll move them back to the top.

#include <mex.h>

#include <string>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reason for this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to reorganize the include statements in a way that made sense to me, but I'd rather go for consistency with the rest of the code base. I'll move the stdlib includes up top.

#include <mex.h>

#include <string>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move the include statement to the top.

@sgilmore10
Copy link
Member Author

Thanks for the feedback @wesm! I'll make some changes and push to the pr once I'm done.

Best,
Sarah

sgilmore10 added a commit to mathworks/arrow that referenced this pull request Jun 23, 2021
@sgilmore10
Copy link
Member Author

Just pushed changes based on the code review feedback.

@kou
Copy link
Member

kou commented Jun 23, 2021

Sorry. This is conflicted caused by #10571. Could you rebase on the master?

sgilmore10 added a commit to mathworks/arrow that referenced this pull request Jun 23, 2021
@sgilmore10
Copy link
Member Author

Sorry about that. Just rebased on master.

@kou
Copy link
Member

kou commented Jun 23, 2021

Thanks. Don't worry. It's not a your problem.

LINK_TO
arrow_shared)

# Ensure the MEX binaries are placed in the src directory on all platforms
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just wondering. Why do we need to place the binaries in the source directory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to execute the MEX function, it has to be discoverable on the MATLAB search path. We also have MATLAB code (featherread.m and featherwrite.m) that we also need to add to the MATLAB search path. Since we need to add the source directory to the path anyway, I thought it makes sense to put the MEX files there as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
I understand. ( https://github.com/apache/arrow/blob/master/matlab/test/tfeathermex.m#L26 is the code that adds the source directory to the MATLAB search path.)

Generally, we should not change anything files in the source directory with out-of-source build. Can we also add the build directory to the MATLAB search path? Is it difficult?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add the build folder to the MATLAB search path. However, this will make harder to run our unit tests automatically because we would require the user to explicitly tell us where their build files are located every time.

Additionally, as the MATLAB interface grows, we will have many MEX files. In order to keep things organized. We see two approaches:

  1. We can encode the relationship of MEX files to MATLAB classes via the MEX file name itself. For example, arrow_array_new.mexw64.

or

  1. The other option is to encode the relationship between the two via the MATLAB packaging mechanism. For example, the folder structure +matlab/+array/+mex exposes a package called matlab.array.mex to MATLAB. If we choose this option, we can reuse common names, such as "new", and keep the names short.

Option 2 seems more scalable, but if you're experience tells us this will lead to issues, we can revisit adding the build folder to the path.

Best,
Sarah

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for describing this.
Could you try one of them (or both)? If we can find a better approach, we can choose it.
It can be done in this pull request or a follow up task.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can definitely investigate both approaches and see which one is preferable. I created a jira task to look into this in a future pull request.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did not detect the build failures due to the lack of CI integration. We hope to add CI support soon and will follow up with a mailing list discussion to talk through the details.

I hope that this is the next task. :-)

LINK_TO
arrow_shared)

# Ensure the MEX binaries are placed in the src directory on all platforms
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
I understand. ( https://github.com/apache/arrow/blob/master/matlab/test/tfeathermex.m#L26 is the code that adds the source directory to the MATLAB search path.)

Generally, we should not change anything files in the source directory with out-of-source build. Can we also add the build directory to the MATLAB search path? Is it difficult?

matlab/src/feather_reader.cc Outdated Show resolved Hide resolved
std::static_pointer_cast<ArrowArrayType>(column);

// Get a raw pointer to the Arrow array data.
const MatlabType* source = integer_array->raw_values();
const MatlabType* source =
reinterpret_cast<const MatlabType*>(arrow_numeric_array->values()->data());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work with sliced array?

What is the problem of raw_values()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the documentation, it looks like the values() method doesn't account for slice offsets. I think we modified this line when we were trying to get the code to compile again, but didn't check to see if this change was really necessary. I just tried using raw_values() instead and it works. I'll undo this change.

matlab/src/feather_reader.cc Outdated Show resolved Hide resolved
Comment on lines 188 to 190
// TableReader expects a RandomAccessFile.
std::shared_ptr<io::RandomAccessFile> random_access_file(readable_file);
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<io::RandomAccessFile> random_access_file,
maybe_readable_file);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this cast?

I think that arrow::ipc::feather::Reader::Open() can accept arrow::io::ReadableFile because arrow::io::ReadableFile is a subclass of arrow::io::RandomAccessFile.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. We don't need this cast.

matlab/src/feather_writer.cc Outdated Show resolved Hide resolved
matlab/src/feather_writer.cc Outdated Show resolved Hide resolved
matlab/src/feather_writer.cc Outdated Show resolved Hide resolved
matlab/src/feather_writer.cc Outdated Show resolved Hide resolved
matlab/src/feather_writer.cc Outdated Show resolved Hide resolved
@sgilmore10
Copy link
Member Author

Hi @kou,

Regarding CI, we're currently working on adding GoogleTest support in this branch here if you're interested. Once this pull request is accepted, we can submit a pull request for this branch. That branch requires the changes here to build, which is why we haven't opened pull request yet.

Regarding MATLAB CI, we're actively working on this and will open a pull request as soon as possible.

Best,
Sarah

@kou
Copy link
Member

kou commented Jun 25, 2021

Thanks!

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants