apacheGH-37770: [MATLAB] Add CSV TableReader and TableWriter MATLAB classes (apache#37773)

### Rationale for this change

To enable initial CSV I/O support, this PR adds `arrow.io.csv.TableReader` and `arrow.io.csv.TableWriter` MATLAB classes to the MATLAB interface.

### What changes are included in this PR?

1. Added a new `arrow.io.csv.TableReader` class
2. Added a new `arrow.io.csv.TableWriter` class

**Example**
```matlab
>> matlabTableWrite = array2table(rand(3))

matlabTableWrite =

  3×3 table

     Var1        Var2       Var3  
    _______    ________    _______

    0.91131    0.091595    0.24594
    0.51315     0.27368    0.62119
    0.42942     0.88665    0.49501

>> arrowTableWrite = arrow.table(matlabTableWrite)

arrowTableWrite = 

Var1: double
Var2: double
Var3: double
----
Var1:
  [
    [
      0.9113083542736461,
      0.5131490075412158,
      0.42942202968065213
    ]
  ]
Var2:
  [
    [
      0.09159480217154525,
      0.27367730380496647,
      0.8866478145458545
    ]
  ]
Var3:
  [
    [
      0.2459443412735529,
      0.6211893868708748,
      0.49500739584280073
    ]
  ]

>> writer = arrow.io.csv.TableWriter("example.csv")

writer = 

  TableWriter with properties:

    Filename: "example.csv"

>> writer.write(arrowTableWrite)

>> reader = arrow.io.csv.TableReader("example.csv")

reader = 

  TableReader with properties:

    Filename: "example.csv"

>> arrowTableRead = reader.read()

arrowTableRead = 

Var1: double
Var2: double
Var3: double
----
Var1:
  [
    [
      0.9113083542736461,
      0.5131490075412158,
      0.42942202968065213
    ]
  ]
Var2:
  [
    [
      0.09159480217154525,
      0.27367730380496647,
      0.8866478145458545
    ]
  ]
Var3:
  [
    [
      0.2459443412735529,
      0.6211893868708748,
      0.49500739584280073
    ]
  ]

>> matlabTableRead = table(arrowTableRead)

matlabTableRead =

  3×3 table

     Var1        Var2       Var3  
    _______    ________    _______

    0.91131    0.091595    0.24594
    0.51315     0.27368    0.62119
    0.42942     0.88665    0.49501

>> isequal(arrowTableRead, arrowTableWrite)

ans =

  logical

   1

>> isequal(matlabTableRead, matlabTableWrite)

ans =

  logical

   1
```

### Are these changes tested?

Yes.

1. Added new CSV I/O tests including `test/arrow/io/csv/tRoundTrip.m` and `test/arrow/io/csv/tError.m`.
2. Both of these test classes inherit from a `CSVTest` superclass.

### Are there any user-facing changes?

Yes.

1. Users can now read and write CSV files using `arrow.io.csv.TableReader` and `arrow.io.csv.TableWriter`.

### Future Directions

1. Expose [options](https://github.com/apache/arrow/blob/main/cpp/src/arrow/csv/options.h) for controlling CSV reading and writing in MATLAB.
2. Add more read/write tests for null value handling and other datatypes beyond numeric and string values.
3. Add a `RecordBatchReader` and `RecordBatchWriter` for CSV.
4. Add support for more I/O formats like Parquet, JSON, ORC, Arrow IPC, etc.

### Notes

1. Thank you @sgilmore10 for your help with this pull request!
2. I chose to add both the `TableReader` and `TableWriter` in one pull request because it simplified testing. My apologies for the slightly lengthy pull request.
* Closes: apache#37770

Lead-authored-by: Kevin Gurney <[email protected]>
Co-authored-by: Sarah Gilmore <[email protected]>
Signed-off-by: Kevin Gurney <[email protected]>
kevingurney and sgilmore10 authored Sep 20, 2023
1 parent e068b7f commit 2b34e37
Showing 13 changed files with 606 additions and 4 deletions.
5 changes: 3 additions & 2 deletions matlab/CMakeLists.txt
@@ -34,8 +34,9 @@ function(build_arrow)

 set(ARROW_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/arrow_ep-prefix")
 set(ARROW_BINARY_DIR "${CMAKE_CURRENT_BINARY_DIR}/arrow_ep-build")
-set(ARROW_CMAKE_ARGS "-DCMAKE_INSTALL_PREFIX=${ARROW_PREFIX}"
-"-DCMAKE_INSTALL_LIBDIR=lib" "-DARROW_BUILD_STATIC=OFF")
+set(ARROW_CMAKE_ARGS
+"-DCMAKE_INSTALL_PREFIX=${ARROW_PREFIX}" "-DCMAKE_INSTALL_LIBDIR=lib"
+"-DARROW_BUILD_STATIC=OFF" "-DARROW_CSV=ON")

 add_library(arrow_shared SHARED IMPORTED)
 set(ARROW_LIBRARY_TARGET arrow_shared)
3 changes: 3 additions & 0 deletions matlab/src/cpp/arrow/matlab/error/error.h
@@ -182,6 +182,9 @@ namespace arrow::matlab::error {
 static const char* TABLE_INVALID_NUMERIC_COLUMN_INDEX = "arrow:tabular:table:InvalidNumericColumnIndex";
 static const char* FAILED_TO_OPEN_FILE_FOR_WRITE = "arrow:io:FailedToOpenFileForWrite";
 static const char* FAILED_TO_OPEN_FILE_FOR_READ = "arrow:io:FailedToOpenFileForRead";
+static const char* CSV_FAILED_TO_WRITE_TABLE = "arrow:io:csv:FailedToWriteTable";
+static const char* CSV_FAILED_TO_CREATE_TABLE_READER = "arrow:io:csv:FailedToCreateTableReader";
+static const char* CSV_FAILED_TO_READ_TABLE = "arrow:io:csv:FailedToReadTable";
 static const char* FEATHER_FAILED_TO_WRITE_TABLE = "arrow:io:feather:FailedToWriteTable";
 static const char* TABLE_FROM_RECORD_BATCH = "arrow:table:FromRecordBatch";
 static const char* FEATHER_FAILED_TO_CREATE_READER = "arrow:io:feather:FailedToCreateReader";
93 changes: 93 additions & 0 deletions matlab/src/cpp/arrow/matlab/io/csv/proxy/table_reader.cc
@@ -0,0 +1,93 @@
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include "libmexclass/proxy/ProxyManager.h"

#include "arrow/matlab/error/error.h"
#include "arrow/matlab/io/csv/proxy/table_reader.h"
#include "arrow/matlab/tabular/proxy/table.h"

#include "arrow/util/utf8.h"

#include "arrow/result.h"

#include "arrow/io/file.h"
#include "arrow/io/interfaces.h"
#include "arrow/csv/reader.h"
#include "arrow/table.h"

namespace arrow::matlab::io::csv::proxy {

TableReader::TableReader(const std::string& filename) : filename{filename} {
REGISTER_METHOD(TableReader, read);
REGISTER_METHOD(TableReader, getFilename);
}

libmexclass::proxy::MakeResult TableReader::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
namespace mda = ::matlab::data;
using TableReaderProxy = arrow::matlab::io::csv::proxy::TableReader;

mda::StructArray args = constructor_arguments[0];
const mda::StringArray filename_utf16_mda = args[0]["Filename"];
const auto filename_utf16 = std::u16string(filename_utf16_mda[0]);
MATLAB_ASSIGN_OR_ERROR(const auto filename, arrow::util::UTF16StringToUTF8(filename_utf16), error::UNICODE_CONVERSION_ERROR_ID);

return std::make_shared<TableReaderProxy>(filename);
}

void TableReader::read(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
using namespace libmexclass::proxy;
namespace csv = ::arrow::csv;
using TableProxy = arrow::matlab::tabular::proxy::Table;

mda::ArrayFactory factory;

// Create a file input stream.
MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto source, arrow::io::ReadableFile::Open(filename, arrow::default_memory_pool()), context, error::FAILED_TO_OPEN_FILE_FOR_READ);

const ::arrow::io::IOContext io_context;
const auto read_options = csv::ReadOptions::Defaults();
const auto parse_options = csv::ParseOptions::Defaults();
const auto convert_options = csv::ConvertOptions::Defaults();

// Create a TableReader from the file input stream.
MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto table_reader,
csv::TableReader::Make(io_context, source, read_options, parse_options, convert_options),
context,
error::CSV_FAILED_TO_CREATE_TABLE_READER);

// Read a Table from the file.
MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto table, table_reader->Read(), context, error::CSV_FAILED_TO_READ_TABLE);

auto table_proxy = std::make_shared<TableProxy>(table);
const auto table_proxy_id = ProxyManager::manageProxy(table_proxy);

const auto table_proxy_id_mda = factory.createScalar(table_proxy_id);

context.outputs[0] = table_proxy_id_mda;
}

void TableReader::getFilename(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
mda::ArrayFactory factory;

MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto filename_utf16, arrow::util::UTF8StringToUTF16(filename), context, error::UNICODE_CONVERSION_ERROR_ID);
auto filename_utf16_mda = factory.createScalar(filename_utf16);
context.outputs[0] = filename_utf16_mda;
}

}
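
Note on the error handling in `table_reader.cc`: each `MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT` call appears to follow an assign-or-return-early pattern, where a failing result records an error ID in the method context and the method returns without producing an output. The following is a minimal standalone sketch of that pattern, not the macro's actual definition. `SimpleResult`, `SketchContext`, `SKETCH_ASSIGN_OR_ERROR_WITH_CONTEXT`, `open_file_sketch`, and `read_sketch` are hypothetical names introduced only for illustration, and unlike the real macro the sketch takes a bare identifier rather than a declaration such as `const auto table`.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for arrow::Result<T>: holds a value or an error message.
template <typename T>
class SimpleResult {
 public:
  static SimpleResult Ok(T value) {
    SimpleResult r;
    r.value_ = std::move(value);
    return r;
  }
  static SimpleResult Fail(std::string message) {
    SimpleResult r;
    r.message_ = std::move(message);
    return r;
  }
  bool ok() const { return value_.has_value(); }
  const T& value() const {
    assert(ok());  // callers must check ok() before reading the value
    return *value_;
  }
  const std::string& message() const { return message_; }

 private:
  std::optional<T> value_;
  std::string message_;
};

// Hypothetical stand-in for the proxy method context.
struct SketchContext {
  std::string error_id;
  std::string error_message;
  std::string output;
};

// On failure: record the error ID and message, then return from the caller.
// On success: bind the unwrapped value to `lhs`.
#define SKETCH_ASSIGN_OR_ERROR_WITH_CONTEXT(lhs, expr, context, id) \
  auto lhs##_result = (expr);                                       \
  if (!lhs##_result.ok()) {                                         \
    (context).error_id = (id);                                      \
    (context).error_message = lhs##_result.message();               \
    return;                                                         \
  }                                                                 \
  auto lhs = lhs##_result.value()

// Pretend file open: fails only for an empty filename (illustrative rule).
SimpleResult<std::string> open_file_sketch(const std::string& filename) {
  if (filename.empty()) {
    return SimpleResult<std::string>::Fail("empty filename");
  }
  return SimpleResult<std::string>::Ok("contents of " + filename);
}

// Mirrors the shape of TableReader::read: open, then publish an output.
void read_sketch(const std::string& filename, SketchContext& context) {
  SKETCH_ASSIGN_OR_ERROR_WITH_CONTEXT(source, open_file_sketch(filename), context,
                                      "arrow:io:FailedToOpenFileForRead");
  context.output = source;
}
```

Calling `read_sketch("", ctx)` leaves `ctx.output` empty and sets `ctx.error_id`, which is the behavior the MATLAB side would then surface as an error with that identifier.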
38 changes: 38 additions & 0 deletions matlab/src/cpp/arrow/matlab/io/csv/proxy/table_reader.h
@@ -0,0 +1,38 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include "libmexclass/proxy/Proxy.h"

namespace arrow::matlab::io::csv::proxy {

class TableReader : public libmexclass::proxy::Proxy {
public:
TableReader(const std::string& filename);
~TableReader() {}
static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments);

protected:
void read(libmexclass::proxy::method::Context& context);
void getFilename(libmexclass::proxy::method::Context& context);

private:
const std::string filename;
};

}
86 changes: 86 additions & 0 deletions matlab/src/cpp/arrow/matlab/io/csv/proxy/table_writer.cc
@@ -0,0 +1,86 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include "arrow/matlab/io/csv/proxy/table_writer.h"
#include "arrow/matlab/tabular/proxy/table.h"
#include "arrow/matlab/error/error.h"

#include "arrow/result.h"
#include "arrow/table.h"
#include "arrow/util/utf8.h"

#include "arrow/io/file.h"
#include "arrow/csv/writer.h"
#include "arrow/csv/options.h"

#include "libmexclass/proxy/ProxyManager.h"

namespace arrow::matlab::io::csv::proxy {

TableWriter::TableWriter(const std::string& filename) : filename{filename} {
REGISTER_METHOD(TableWriter, getFilename);
REGISTER_METHOD(TableWriter, write);
}

libmexclass::proxy::MakeResult TableWriter::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
namespace mda = ::matlab::data;
mda::StructArray opts = constructor_arguments[0];
const mda::StringArray filename_mda = opts[0]["Filename"];
using TableWriterProxy = ::arrow::matlab::io::csv::proxy::TableWriter;

const auto filename_utf16 = std::u16string(filename_mda[0]);
MATLAB_ASSIGN_OR_ERROR(const auto filename_utf8,
arrow::util::UTF16StringToUTF8(filename_utf16),
error::UNICODE_CONVERSION_ERROR_ID);

return std::make_shared<TableWriterProxy>(filename_utf8);
}

void TableWriter::getFilename(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto utf16_filename,
arrow::util::UTF8StringToUTF16(filename),
context,
error::UNICODE_CONVERSION_ERROR_ID);
mda::ArrayFactory factory;
auto str_mda = factory.createScalar(utf16_filename);
context.outputs[0] = str_mda;
}

void TableWriter::write(libmexclass::proxy::method::Context& context) {
namespace csv = ::arrow::csv;
namespace mda = ::matlab::data;
using TableProxy = ::arrow::matlab::tabular::proxy::Table;

mda::StructArray opts = context.inputs[0];
const mda::TypedArray<uint64_t> table_proxy_id_mda = opts[0]["TableProxyID"];
const uint64_t table_proxy_id = table_proxy_id_mda[0];

auto proxy = libmexclass::proxy::ProxyManager::getProxy(table_proxy_id);
auto table_proxy = std::static_pointer_cast<TableProxy>(proxy);
auto table = table_proxy->unwrap();

MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto output_stream,
arrow::io::FileOutputStream::Open(filename),
context,
error::FAILED_TO_OPEN_FILE_FOR_WRITE);
const auto options = csv::WriteOptions::Defaults();
MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(csv::WriteCSV(*table, options, output_stream.get()),
context,
error::CSV_FAILED_TO_WRITE_TABLE);
}
}
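
Note on the filename handling: both proxies receive the filename from MATLAB as UTF-16 and convert it to UTF-8 with `arrow::util::UTF16StringToUTF8` before storing it, then convert back in `getFilename`. As a rough illustration of what the UTF-16 to UTF-8 step involves, here is a hand-written sketch. It is not Arrow's implementation; `utf16_to_utf8_sketch` is a hypothetical helper, and returning `std::nullopt` for unpaired surrogates is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Converts UTF-16 to UTF-8. Returns std::nullopt on an unpaired surrogate.
std::optional<std::string> utf16_to_utf8_sketch(const std::u16string& in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    std::uint32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      if (i + 1 >= in.size()) return std::nullopt;
      const std::uint32_t low = in[i + 1];
      if (low < 0xDC00 || low > 0xDFFF) return std::nullopt;
      cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
      ++i;  // consume the low surrogate as well
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      return std::nullopt;  // low surrogate without a preceding high surrogate
    }
    assert(cp <= 0x10FFFF);  // largest code point a surrogate pair can encode

    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));  // 1-byte sequence (ASCII)
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));  // 2-byte sequence
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));  // 3-byte sequence
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));  // 4-byte sequence
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  }
  return out;
}
```

An ASCII-only filename such as `example.csv` passes through byte for byte; the conversion only changes the representation of names containing non-ASCII characters.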
38 changes: 38 additions & 0 deletions matlab/src/cpp/arrow/matlab/io/csv/proxy/table_writer.h
@@ -0,0 +1,38 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include "libmexclass/proxy/Proxy.h"

namespace arrow::matlab::io::csv::proxy {

class TableWriter : public libmexclass::proxy::Proxy {
public:
TableWriter(const std::string& filename);
~TableWriter() {}
static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments);

protected:
void getFilename(libmexclass::proxy::method::Context& context);
void write(libmexclass::proxy::method::Context& context);

private:
const std::string filename;
};

}
4 changes: 4 additions & 0 deletions matlab/src/cpp/arrow/matlab/proxy/factory.cc
@@ -37,6 +37,8 @@
 #include "arrow/matlab/type/proxy/field.h"
 #include "arrow/matlab/io/feather/proxy/writer.h"
 #include "arrow/matlab/io/feather/proxy/reader.h"
+#include "arrow/matlab/io/csv/proxy/table_writer.h"
+#include "arrow/matlab/io/csv/proxy/table_reader.h"

 #include "factory.h"

@@ -85,6 +87,8 @@ libmexclass::proxy::MakeResult Factory::make_proxy(const ClassName& class_name,
 REGISTER_PROXY(arrow.type.proxy.StructType , arrow::matlab::type::proxy::StructType);
 REGISTER_PROXY(arrow.io.feather.proxy.Writer , arrow::matlab::io::feather::proxy::Writer);
 REGISTER_PROXY(arrow.io.feather.proxy.Reader , arrow::matlab::io::feather::proxy::Reader);
+REGISTER_PROXY(arrow.io.csv.proxy.TableWriter , arrow::matlab::io::csv::proxy::TableWriter);
+REGISTER_PROXY(arrow.io.csv.proxy.TableReader , arrow::matlab::io::csv::proxy::TableReader);

 return libmexclass::error::Error{error::UNKNOWN_PROXY_ERROR_ID, "Did not find matching C++ proxy for " + class_name};
 };
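
Note on the factory change: the two new `REGISTER_PROXY` lines tie the MATLAB-side names `arrow.io.csv.proxy.TableWriter` and `arrow.io.csv.proxy.TableReader` to their C++ proxy classes, and an unmatched name falls through to the `UNKNOWN_PROXY_ERROR_ID` error. One way to picture that name-to-class dispatch is a lookup table of constructor functions. The sketch below is an assumption-laden illustration, not how `REGISTER_PROXY` is actually defined; `SketchProxy`, `SketchTableReader`, `SketchTableWriter`, `SketchMaker`, `sketch_registry`, and `sketch_make_proxy` are hypothetical names, and returning `nullptr` stands in for the real error result.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for the proxy base class and the two CSV proxies.
struct SketchProxy {
  virtual ~SketchProxy() = default;
  virtual std::string kind() const = 0;
};
struct SketchTableReader : SketchProxy {
  std::string kind() const override { return "reader"; }
};
struct SketchTableWriter : SketchProxy {
  std::string kind() const override { return "writer"; }
};

using SketchMaker = std::function<std::shared_ptr<SketchProxy>()>;

// Maps a MATLAB-side proxy name to a function that constructs the C++ proxy.
const std::map<std::string, SketchMaker>& sketch_registry() {
  static const std::map<std::string, SketchMaker> makers = {
      {"arrow.io.csv.proxy.TableReader",
       []() -> std::shared_ptr<SketchProxy> {
         return std::make_shared<SketchTableReader>();
       }},
      {"arrow.io.csv.proxy.TableWriter",
       []() -> std::shared_ptr<SketchProxy> {
         return std::make_shared<SketchTableWriter>();
       }},
  };
  return makers;
}

// Returns nullptr for an unknown name; the real factory returns an error instead.
std::shared_ptr<SketchProxy> sketch_make_proxy(const std::string& class_name) {
  const auto& makers = sketch_registry();
  const auto it = makers.find(class_name);
  if (it == makers.end()) {
    return nullptr;
  }
  assert(static_cast<bool>(it->second));  // every registered maker must be callable
  return it->second();
}
```

With this shape, adding a new I/O format only requires a new entry in the table, which matches how small the `factory.cc` change in this commit is.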
51 changes: 51 additions & 0 deletions matlab/src/matlab/+arrow/+io/+csv/TableReader.m
@@ -0,0 +1,51 @@
%TABLEREADER Reads tabular data from a CSV file into an arrow.tabular.Table.

% Licensed to the Apache Software Foundation (ASF) under one or more
% contributor license agreements. See the NOTICE file distributed with
% this work for additional information regarding copyright ownership.
% The ASF licenses this file to you under the Apache License, Version
% 2.0 (the "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
%
% http://www.apache.org/licenses/LICENSE-2.0
%
% Unless required by applicable law or agreed to in writing, software
% distributed under the License is distributed on an "AS IS" BASIS,
% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
% implied. See the License for the specific language governing
% permissions and limitations under the License.

classdef TableReader

    properties (GetAccess=public, SetAccess=private, Hidden)
        Proxy
    end

    properties (Dependent, SetAccess=private, GetAccess=public)
        Filename
    end

    methods

        function obj = TableReader(filename)
            arguments
                filename (1, 1) string {mustBeNonmissing, mustBeNonzeroLengthText}
            end

            args = struct(Filename=filename);
            obj.Proxy = arrow.internal.proxy.create("arrow.io.csv.proxy.TableReader", args);
        end

        function table = read(obj)
            tableProxyID = obj.Proxy.read();
            proxy = libmexclass.proxy.Proxy(Name="arrow.tabular.proxy.Table", ID=tableProxyID);
            table = arrow.tabular.Table(proxy);
        end

        function filename = get.Filename(obj)
            filename = obj.Proxy.getFilename();
        end

    end

end