Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of undeclared crate or module parquet when compiling without --feature=parquet flag #8250

Closed
jayzhan211 opened this issue Nov 17, 2023 · 11 comments · Fixed by #8497
Closed
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@jayzhan211
Copy link
Contributor

jayzhan211 commented Nov 17, 2023

Describe the bug

I believe this breaks after #7745, but it has been unsolved for while, so I file an issue to track it. Adding #[cfg(feature = "parquet")] is a workaround for now, but if want to run single test inside tests we need to add another flag for it.

To Reproduce

Run tests in datafusion/common/src/scalar.rs like this:

$ cargo test -p datafusion-common
   Compiling datafusion-common v33.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion/common)
error[E0433]: failed to resolve: use of undeclared crate or module `parquet`
   --> datafusion/common/src/file_options/mod.rs:302:9
    |
302 |     use parquet::{
    |         ^^^^^^^ use of undeclared crate or module `parquet`

error[E0432]: unresolved import `super::parquet_writer`
   --> datafusion/common/src/file_options/mod.rs:316:17
    |
316 |     use super::{parquet_writer::ParquetWriterOptions, StatementOptions};
    |                 ^^^^^^^^^^^^^^ could not find `parquet_writer` in `super`

Expected behavior

I hope we can have a way to just run the test without adding flag like before. Either all the test at once or single test.

Additional context

No response

@jayzhan211 jayzhan211 added the bug Something isn't working label Nov 17, 2023
@alamb
Copy link
Contributor

alamb commented Nov 17, 2023

The workaround is to add --features=parquet to the cargo command

It would be great to update the tests in datafusion_common that require parquet to feature flaged (aka add#[cfg(feature = "parquet")])

It would be nice to fix this

@alamb alamb added the good first issue Good for newcomers label Nov 17, 2023
@alamb
Copy link
Contributor

alamb commented Nov 17, 2023

I think this would be a good first issue -- it is a nice software engineering exercise to try and make the tests easier to use and doesn't require deep knowledge of DataFusion, just Rust knowledge

Perhaps someone could start with #7934 and extend that basic idea

@alamb alamb added the help wanted Extra attention is needed label Dec 1, 2023
@alamb alamb changed the title Use of undeclared crate or module parquet Use of undeclared crate or module parquet when compiling without --feature=parquet flag Dec 1, 2023
@Dennis40816
Copy link
Contributor

Hello, can I give it a try?

@jayzhan211
Copy link
Contributor Author

Hello, can I give it a try?

Sure

@alamb
Copy link
Contributor

alamb commented Dec 3, 2023

Thank you very much @Dennis40816 🙏

@Dennis40816
Copy link
Contributor

@alamb After a brief review of the code in PR #7934, I would like to confirm my understanding and seek clarification if I have any misconceptions about the program structure or Rust.

From my perspective, this issue should address two main tasks:

  1. Explicitly mark tests and related references that depend on parquet with the #[cfg(feature = "parquet")] attribute. This ensures that when these tests are run, parquet is automatically added to the features. For example:

    // in "datafusion/common/src/file_options/mod.rs"
    #[cfg(test)]
    mod tests {
        use std::collections::HashMap;
        
        // should add cfg flag here         <--------------
        #[cfg(feature = "parquet")]        
        use parquet::{
            basic::{Compression, Encoding, ZstdLevel},
            file::properties::{EnabledStatistics, WriterVersion},
            schema::types::ColumnPath,
        };
        // ...
    
        #[test]
        // also add cfg flag here           <--------------
        #[cfg(feature = "parquet")]
        fn test_writeroptions_parquet_from_statement_options() -> Result<()> {
            let mut option_map: HashMap<String, String> = HashMap::new();
            // ...
        }
        // ...
    }
  2. Following Minor: Reduce more #cfg(feature = "parquet") in tests #7934, we should also implement a stub to trigger runtime errors when users forget to add #[cfg(feature = "parquet")] to their test cases, rather than encountering compile-time errors.

    // in "datafusion/common/src/file_options/mod.rs"
    #[test]
    fn test_writeroptions_parquet_column_specific() -> Result<()> {
      // ...
    
      // implemented in #7934
      let parquet_options = ParquetWriterOptions::try_from((&config, &options))?;
      // should we also implement `writer_options` for `ParquetWriterOptions` in the stub?
      let properties = parquet_options.writer_options();
    
      // ...
    }

However, regarding the second point, I am uncertain about its feasibility. The concern arises if users employ new parquet features, which would necessitate corresponding updates to the stub. Would extensive use of parquet features potentially make the stub challenging to maintain?

As a Rust beginner, I hope you won't mind if I ask a seemingly naive question.

@alamb
Copy link
Contributor

alamb commented Dec 11, 2023

As a Rust beginner, I hope you won't mind if I ask a seemingly naive question.

Not at all -- we are all learning together and appreciate your contributions ❤️

I think the core issue is that many tests in the datafusion crates rely on ParquetExec even when they are testing features not related to parquet (for example, EnforceDistribution): https://github.com/apache/arrow-datafusion/blob/ff65dee3ff4318da13f5f89bafddf446ffbf8803/datafusion/core/src/physical_optimizer/enforce_distribution.rs#L1766-L1784

Previously the ParquetExec was tightly integrated (and there was no parquet feature) so this wasn't a problem.

I think the longer term solution is to rewrite tests that don't actually depend on Parquet to use something else (a stub ExecutionPlan perhaps). However, this would be a lot of work

Thus, I suggest we proceed in two phases:

Phase 1: fix compilation errors as you suggested

  1. "Explicitly mark tests and related references that depend on parquet with the #[cfg(feature = "parquet")] attribute. " as you suggest above

  2. Add a job to CI that ensures the tests can run without parquet

Add a command like cargo check --tests --no-default-features -p datafusion-common to the CI checks, perhaps here https://github.com/apache/arrow-datafusion/blob/ff65dee3ff4318da13f5f89bafddf446ffbf8803/.github/workflows/rust.yml#L68 ) to make sure they all compile correctly

This ensures that when these tests are run, parquet is automatically added to the features.

What it actually does is skip the tests when parquet is not enabled

Phase 2: Rewrite tests / code over time

The idea would be to remove #[cfg(feature = "parquet")] as much as possible in the rest of the codebase by refactoring / rewriting / moving tests

@alamb
Copy link
Contributor

alamb commented Dec 11, 2023

Following #7934, we should also implement a stub to trigger runtime errors when users forget to add #[cfg(feature = "parquet")] to their test cases, rather than encountering compile-time errors.

I recommend we don't do the runtime error -- and instead ensure people don't forget #[cfg(feature = "parquet")] by adding a CI cargo check command instead

@Dennis40816
Copy link
Contributor

Thanks for your clearly suggestions. I'll follow the steps, and then we can look at the results :)

@alamb
Copy link
Contributor

alamb commented Dec 11, 2023

Thanks! To be clear, I don't think we should do Phase 2 as part of this ticket. I suggest we complete phase 1 (make the tests work) and then we can file a follow on ticket that tracks phase 2

@Dennis40816
Copy link
Contributor

Sure!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants