
[Improvement-13476][*] Improve DS load hdfs configuration automatically by using environment variable HADOOP_CONF_DIR #13478

Closed
wants to merge 8 commits

Conversation

@xxjingcd (Contributor) commented Feb 1, 2023

Purpose of the pull request

Improve DS so that it loads the HDFS configuration automatically from the common environment variables HADOOP_CONF_DIR or HADOOP_HOME, which are usually already configured in the OS.

This provides a more convenient option. If the environment variable does not exist or points to an invalid configuration, DS will load the configuration as before.
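
A minimal sketch of the idea (an illustration only, not the actual HdfsStorageOperator change; the class and method names here are hypothetical): resolve the conf dir from HADOOP_CONF_DIR, then from HADOOP_HOME, and fall back to the classpath otherwise.

    import java.io.File;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class HadoopConfExample {

        // Hypothetical sketch: resolve the Hadoop conf dir from the common
        // environment variables, falling back to the classpath as before.
        public static Configuration loadHadoopConfiguration() {
            Configuration conf = new Configuration(false); // add resources explicitly
            String confDir = System.getenv("HADOOP_CONF_DIR");
            if (confDir == null && System.getenv("HADOOP_HOME") != null) {
                confDir = System.getenv("HADOOP_HOME") + "/etc/hadoop";
            }
            if (confDir != null && new File(confDir, "core-site.xml").exists()) {
                // The environment points to a valid conf dir: load the site files from it.
                conf.addResource(new Path(new File(confDir, "core-site.xml").toURI()));
                conf.addResource(new Path(new File(confDir, "hdfs-site.xml").toURI()));
            } else {
                // Env var missing or invalid: load from the classpath, as DS did before.
                conf.addResource("core-site.xml");
                conf.addResource("hdfs-site.xml");
            }
            return conf;
        }
    }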

This closes #13476

Brief change log

  • HdfsStorageOperator
  • dolphinscheduler_env.sh: add tips for the environment variables HADOOP_HOME and HADOOP_CONF_DIR

Verify this pull request

This change added tests and can be verified as follows:

  • HdfsStorageOperatorTest

@SbloodyS SbloodyS added the improvement, backend, and 3.2.0 labels Feb 1, 2023
@SbloodyS SbloodyS added this to the 3.2.0 milestone Feb 1, 2023
@ruanwenjun (Member)

Please use mvn spotless:apply to fix the code style.

@Radeity (Member) left a comment

I think we'd better avoid using the system env; maybe we can add hadoop.conf.dir to our property file.

In addition, I think there's nothing wrong with the current way, which moves configuration files into the classpath. I just wonder, do we really have to make this change?

@zhongjiajie (Member)

please send the details to [email protected] @xxjingcd

@xxjingcd (Contributor, Author) commented Feb 2, 2023

In addition, I think there's nothing wrong with the current way, which moves configuration files into the classpath. I just wonder, do we really have to make this change?

It is not convenient enough. You have to copy hdfs-site.xml and core-site.xml to the classpath manually, and it brings maintenance costs: users have to remember to copy them again when the configuration files change.

I think we'd better avoid using the system env; maybe we can add hadoop.conf.dir to our property file.

It is a choice for users; you can still move the configuration files into the classpath. This only takes effect if you specify an environment variable.
And the environment variables are usually already configured in the operating system; that means users can load the correct configuration without doing anything.
It is a general approach.

@codecov-commenter commented Feb 2, 2023

Codecov Report

Merging #13478 (a5ac4bb) into dev (80da35e) will increase coverage by 0.01%.
The diff coverage is 65.62%.

@@             Coverage Diff              @@
##                dev   #13478      +/-   ##
============================================
+ Coverage     39.62%   39.63%   +0.01%     
- Complexity     4354     4364      +10     
============================================
  Files          1097     1097              
  Lines         41150    41184      +34     
  Branches       4712     4721       +9     
============================================
+ Hits          16304    16323      +19     
- Misses        23036    23045       +9     
- Partials       1810     1816       +6     
Impacted Files Coverage Δ
...duler/plugin/storage/hdfs/HdfsStorageOperator.java 24.60% <65.62%> (+3.91%) ⬆️
...erver/master/processor/queue/TaskEventService.java 75.00% <0.00%> (-5.36%) ⬇️
...hinscheduler/plugin/task/flink/FlinkArgsUtils.java 63.90% <0.00%> (-1.21%) ⬇️
...r/plugin/registry/zookeeper/ZookeeperRegistry.java 50.00% <0.00%> (-0.81%) ⬇️
...nscheduler/service/process/ProcessServiceImpl.java 30.22% <0.00%> (-0.03%) ⬇️
...he/dolphinscheduler/common/thread/ThreadUtils.java 0.00% <0.00%> (ø)
...er/master/event/TaskWaitTaskGroupStateHandler.java 0.00% <0.00%> (ø)
...olphinscheduler/plugin/task/api/TaskConstants.java 60.00% <0.00%> (ø)
...inscheduler/plugin/task/flink/FlinkDeployMode.java 100.00% <0.00%> (ø)
...duler/plugin/task/api/AbstractCommandExecutor.java 0.00% <0.00%> (ø)
... and 7 more

@xxjingcd (Contributor, Author) commented Feb 2, 2023

Please use mvn spotless:apply to fix the code style.

I've already fixed it.

@xxjingcd xxjingcd requested review from Radeity and removed request for caishunfeng, zhongjiajie, SbloodyS, ruanwenjun and EricGao888 February 2, 2023 15:17
@xxjingcd (Contributor, Author) commented Feb 2, 2023

I think we'd better avoid using the system env; maybe we can add hadoop.conf.dir to our property file.

In addition, I think there's nothing wrong with the current way, which moves configuration files into the classpath. I just wonder, do we really have to make this change?

The environment variables HADOOP_CONF_DIR and HADOOP_HOME are very common in most components, e.g. HBase, Spark, Flink, and so on.

@Radeity (Member) commented Feb 2, 2023

The environment variables HADOOP_CONF_DIR and HADOOP_HOME are very common in most components, e.g. HBase, Spark, Flink, and so on.

You are right, HADOOP_CONF_DIR is a common environment variable. HOWEVER, I think adding it to the properties is more general, especially when deploying in k8s.

@xxjingcd (Contributor, Author) commented Feb 3, 2023

please send the details to [email protected] @xxjingcd

What details? Is it a mistake?

@xxjingcd (Contributor, Author) commented Feb 3, 2023

The environment variables HADOOP_CONF_DIR and HADOOP_HOME are very common in most components, e.g. HBase, Spark, Flink, and so on.

You are right, HADOOP_CONF_DIR is a common environment variable. HOWEVER, I think adding it to the properties is more general, especially when deploying in k8s.

In practice, users usually maintain a set of common global variables, which almost always includes HADOOP_CONF_DIR and HADOOP_HOME.
In a word, using environment variables makes DS faster and easier to deploy. That is the reason why this PR uses environment variables.

Actually, all the components I know of use environment variables for configuration. For example, here is a source code fragment from Flink’s HadoopUtils:

    public static String[] possibleHadoopConfPaths(
            org.apache.flink.configuration.Configuration flinkConfiguration) {
        String[] possiblePaths = new String[4];
        possiblePaths[0] = flinkConfiguration.getString(ConfigConstants.PATH_HADOOP_CONFIG, null);
        possiblePaths[1] = System.getenv("HADOOP_CONF_DIR");

        if (System.getenv("HADOOP_HOME") != null) {
            possiblePaths[2] = System.getenv("HADOOP_HOME") + "/conf";
            possiblePaths[3] = System.getenv("HADOOP_HOME") + "/etc/hadoop"; // hadoop 2.2
        }
        return Arrays.stream(possiblePaths).filter(Objects::nonNull).toArray(String[]::new);
    }
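
For illustration, a caller could then take the first candidate path that actually exists on disk (a hypothetical usage sketch, not Flink code; flinkConfiguration is the same parameter as above):

        // Hypothetical usage: pick the first candidate that is a real directory.
        String hadoopConfDir =
                Arrays.stream(possibleHadoopConfPaths(flinkConfiguration))
                        .filter(path -> new java.io.File(path).isDirectory())
                        .findFirst()
                        .orElse(null);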

At last, we should not put all configurations in the properties file, especially when the configuration is already contained in common environment variables. This makes DS faster to deploy.
DS has an env config file, dolphinscheduler_env.sh, similar to spark-env.sh. If users want to overwrite global variables, they can just configure the dolphinscheduler_env.sh file.

There is not much difference between putting it in dolphinscheduler_env.sh and putting it in the properties file.

I know very little about k8s. Maybe you can set env in the Dockerfile or configure dolphinscheduler_env.sh, similar to how you configure properties files.

@EricGao888 (Member)

please send the details to [email protected] @xxjingcd

What details? Is it a mistake?

@xxjingcd I think the details @zhongjiajie mentions here may refer to #13460

@Radeity (Member) commented Feb 3, 2023

I know very little about k8s. Maybe you can set env in the Dockerfile or configure dolphinscheduler_env.sh, similar to how you configure properties files.

@xxjingcd Sorry for my mistake; we have already created environment variables with a ConfigMap in k8s deploy mode, including the configuration HADOOP_CONF_DIR.

Is this convenient for users to configure HDFS?

In this PR, I think you don't have to change that part of the code; it's up to users whether to override the hdfs configuration or not.

Whether it is convenient enough to configure the environment variable HADOOP_CONF_DIR.

For me, there's no problem; please update the docs in the resource center configuration page :)

Also, you don't have to add the environment configuration to dolphinscheduler_env.sh (refer to #13028).

Basically LGTM, cc @zhongjiajie @EricGao888

@xxjingcd (Contributor, Author) commented Feb 3, 2023

@Radeity Thanks for your code review work.

In this PR, I think you don't have to change that part of the code; it's up to users whether to override the hdfs configuration or not.

OK, I'll remove the TODO (about the question).

Also, you don't have to add the environment configuration to dolphinscheduler_env.sh (refer to #13028).

The purpose of PR #13028 is to remove the export of dolphinscheduler_env.sh in the task's initialization, so it deleted the environment configuration used in the user's task. But it made a mistake: JAVA_HOME should not be deleted, because JAVA_HOME is used in dolphinscheduler-daemon.sh.
After this PR, dolphinscheduler_env.sh is only used by the DS components (used in dolphinscheduler-daemon.sh) and DS internally, not by the user's task. And after PR #13317, the user's tasks can specify the env file config by shell.env_source_list. That means the environment files used in DS and in the user's tasks are separated now. And I have run some tasks to verify it.
So I think it's no problem to do that. Or maybe I don't get what you mean?

For me, there's no problem; please update the docs in the resource center configuration page :)

OK, I'll sync the configuration.

@Radeity (Member) commented Feb 3, 2023

So I think it's no problem to do that. Or maybe I don't get what you mean?

We'd better only add DS's internal env configuration, which we don't have now, and let the user/tenant manage external env configurations by themselves. $JAVA_HOME belongs to the external configurations; we just have to declare them in the doc. That's my personal understanding of that PR; I have to further make sure. cc @ruanwenjun

@xxjingcd (Contributor, Author) commented Feb 4, 2023

So I think it's no problem to do that. Or maybe I don't get what you mean?

We'd better only add DS's internal env configuration, which we don't have now, and let the user/tenant manage external env configurations by themselves. $JAVA_HOME belongs to the external configurations; we just have to declare them in the doc. That's my personal understanding of that PR; I have to further make sure. cc @ruanwenjun

OK, I will remove $JAVA_HOME to keep this PR's purpose simple.

@sonarcloud (bot) commented Feb 6, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0)
Vulnerabilities: A (0)
Security Hotspots: A (0)
Code Smells: A (0)

Coverage: 70.4%
Duplication: 0.0%

@github-actions (bot) commented Jun 7, 2023

This pull request has been automatically marked as stale because it has not had recent activity for 120 days. It will be closed in 7 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Jun 7, 2023
@github-actions (bot)

This pull request has been closed because it has not had recent activity. You can reopen it if you want to continue your work, and anyone who is interested in it is encouraged to continue work on this pull request.

@github-actions github-actions bot closed this Jun 15, 2023
Labels: 3.2.0, backend, document, improvement, Stale
Projects: None yet
7 participants