Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow writers refactor #153

Merged
merged 14 commits into from
Oct 27, 2023
Merged

Arrow writers refactor #153

merged 14 commits into from
Oct 27, 2023

Conversation

JesseMckinzie
Copy link
Member

  • Refactor process dataset functions to be overloaded to separate csv/buffer functionality from Arrow functionality
  • Remove passing status to get_arrow_table and check for nullptr instead
  • Remove low level functionality from nyxus main to process functions and the environment

@sweep-ai
Copy link

sweep-ai bot commented Oct 17, 2023

Apply Sweep Rules to your PR?

  • Apply: Leftover TODOs in the code should be handled.
  • Apply: All new business logic should have corresponding unit tests in the tests/ directory.
  • Apply: Any clearly inefficient or repeated code should be optimized or refactored.


bool use_arrow = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"use_arrow" will be defined as a class member no matter if we have Apache functionality enabled or disabled. Can we have no reference to Arrow if USE_ARROW is not defined ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have no references to the Arrow variables then we need ifdef statements throughout the code whenever the Arrow functionality is used, which we want to remove. We have to choose either hiding the Arrow variables or removing the ifdef statements (unless you have an idea for another work around)

@@ -32,7 +32,8 @@ namespace Nyxus

bool scanFilePairParallel(const std::string& intens_fpath, const std::string& label_fpath, int num_fastloader_threads, int num_sensemaker_threads, int filepair_index, int tot_num_filepairs);
std::string getPureFname(const std::string& fpath);
int processDataset(const std::vector<std::string>& intensFiles, const std::vector<std::string>& labelFiles, int numFastloaderThreads, int numSensemakerThreads, int numReduceThreads, int min_online_roi_size, bool save2csv, bool arrow_output, const std::string& csvOutputDir);
int processDataset(const std::vector<std::string>& intensFiles, const std::vector<std::string>& labelFiles, int numFastloaderThreads, int numSensemakerThreads, int numReduceThreads, int min_online_roi_size, bool save2csv, const std::string& csvOutputDir);
int processDataset(const std::vector<std::string>& intensFiles, const std::vector<std::string>& labelFiles, int numFastloaderThreads, int numSensemakerThreads, int numReduceThreads, int min_online_roi_size, const std::string& outputDir);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to guess what function fits a CSV, memory buffer, or Apache output usage. Could we have argument "save2csv" of enum type (values "save2csv", "save2buffer", "save2arrow", "save2parquet", etc) rather than bool or at least document these functions with a comment block ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use an enum to choose save type and added in empty classes to replace ifdef statements, as discussed with Samee. Let me know if you agree with this approach

theEnvironment.useCsv,
theEnvironment.output_dir);

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awkward code, to my taste. Could we have a single call and use a destination type parameter or a more elegant solution? (We are in no rush with this refactor.)

@@ -93,6 +101,8 @@ int main (int argc, char** argv)
case 3: // Memory error
std::cout << std::endl << "Memory error" << std::endl;
break;
case 4:
std::cout << std::endl << "Arrow not enabled" << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's worth adding a couple of words how to enable, reinstall, or reset something? If Arrow libraries aren't in place, the executable/library wouldn't even run, so a user may be clueless why Arrow may still not be enabled.

@@ -173,22 +173,36 @@ py::tuple featurize_directory_imp (
// We're good to extract features. Reset the feature results cache
theResultsCache.clear();

auto arrow_output = !pandas_output;
theEnvironment.use_arrow = !pandas_output;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My traditional question here: are we supporting other Apache formats - "feather" and "parquet"? If yes, then "theEnvironment.use_arrow" should be called "theEnvironment.use_apache" ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I propose "use_apache_writers" to make it clear that the variable controls writing apache formats

@sameeul sameeul self-requested a review October 25, 2023 20:26
}
}
else if (theEnvironment.useCsv) {return SaveOption::saveCSV;}
else {return SaveOption::saveBuffer;}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in CLI mode, I don't this SaveOption::saveBuffer is supported. It should be either arrow or csv, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, both "ARROW" and "ARROWIPC" should set saveOption to saveArrowIPC. Or may be get rid of "ARROW" option all together and be more precise.

#include <string>
#include <memory>

#include "output_writers.h"
#include "helpers/helpers.h"
#include "globals.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we need this here.

@sameeul sameeul merged commit b09350e into PolusAI:main Oct 27, 2023
18 checks passed
@sameeul sameeul mentioned this pull request Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants