
Using HDFS to backup and restore notebook #1600

Closed
wants to merge 7 commits into master

Conversation


@aspen01 aspen01 commented Nov 4, 2016

What is this PR for?

This PR supports using HDFS to backup and restore notebook.
It is similar to #1479.
However, this PR just uses the WebHDFS REST API, so we don't need to care about Hadoop library dependencies.
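For illustration, the WebHDFS request URLs involved can be built with plain string handling and issued by any HTTP client, with no Hadoop client jars. A minimal sketch (the class and method names here are hypothetical, not from this PR; `GETFILESTATUS` and `user.name` are standard WebHDFS REST parameters):

```java
public class WebHdfsUrl {
    // Build a WebHDFS GETFILESTATUS request URL.
    // baseUrl is the WebHDFS endpoint without a trailing slash,
    // e.g. "http://localhost:50070/webhdfs/v1".
    static String statusUrl(String baseUrl, String user, String path) {
        return baseUrl + path + "?op=GETFILESTATUS&user.name=" + user;
    }

    public static void main(String[] args) {
        // Uses the default endpoint and user from this PR's description.
        System.out.println(statusUrl("http://localhost:50070/webhdfs/v1",
                "hdfs", "/tmp/note.json"));
        // -> http://localhost:50070/webhdfs/v1/tmp/note.json?op=GETFILESTATUS&user.name=hdfs
    }
}
```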

What type of PR is it?

Improvement

Todos

  • Task

What is the Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-1515

How should this be tested?

Set the variables in zeppelin-site.xml

  • zeppelin.notebook.storage : org.apache.zeppelin.notebook.repo.HDFSNotebookRepo
  • hdfs.url : Hadoop WebHDFS URL. default: http://localhost:50070/webhdfs/v1/
  • hdfs.user : HDFS user. default: hdfs
  • hdfs.maxlength : Maximum number of lines of results fetched. default: 100000
  • hdfs.notebook.dir : notebook location directory in HDFS. default: /tmp
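Put together, the properties above would look like this in zeppelin-site.xml (the values shown are the documented defaults; adjust them for your cluster):

```xml
<property>
  <name>zeppelin.notebook.storage</name>
  <value>org.apache.zeppelin.notebook.repo.HDFSNotebookRepo</value>
</property>
<property>
  <name>hdfs.url</name>
  <value>http://localhost:50070/webhdfs/v1/</value>
</property>
<property>
  <name>hdfs.user</name>
  <value>hdfs</value>
</property>
<property>
  <name>hdfs.maxlength</name>
  <value>100000</value>
</property>
<property>
  <name>hdfs.notebook.dir</name>
  <value>/tmp</value>
</property>
```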

After the Zeppelin daemon starts, check the notebook directories in hdfs.notebook.dir.

Screenshots (if appropriate)

Questions:

  • Do the license files need to be updated? No
  • Are there breaking changes for older versions? No
  • Does this need documentation? Yes

@bzz
Member

bzz commented Nov 24, 2016

@aspen01 thank you for the contribution!

Could you please double-check that

As for copyright notices, could you please help us understand whether your contribution contains any third-party works, or whether it is all code developed for the ASF?

That would determine whether the existing copyright notice in the files should be moved to the root NOTICE file or not. Thanks!

public synchronized void save(Note note, AuthenticationInfo subject) throws IOException {
super.save(note, subject);
if (this.enableWebHDFS)
uploadNoteListToHDFS(note);
Contributor

@khalidhuseynov khalidhuseynov Nov 28, 2016


what do you think about:

  • whether it makes sense to pass the user from subject instead of getting the user from the environment inside HDFSCommand
  • from the function name it seems like the whole list is uploaded instead of a single note

Author


  • If this feature saved notebooks to HDFS directly, I would agree. But for now the feature just backs up the notebook to HDFS, so I tried to treat the two functions (connecting to HDFS and saving the notebook) separately.
  • I'll fix it. Thanks.

String notebookStatus = this.hdfsCmd.runCommand(this.hdfsCmd.getFileStatus, notebook, null);
fileStatus = gson.fromJson(notebookStatus, OneFileStatus.class);
} catch (Exception e) {
logger.warn("Warning: ", e);
Contributor


maybe add some description here

}
}
catch (Exception e) {
logger.error("Exception: ", e);
Contributor


a bit more description would help here as well

Author


I'll add more description. Thanks.

@@ -0,0 +1,241 @@
/**
* Copyright 2015 NAVER Corp. All rights Reserved.


Is this file licensed for use in an Apache project?

Author


It is the wrong license. I'll remove it.

Member


Please make sure all the code is distributed under one of the ASF category-A licenses.

@khalidhuseynov
Contributor

@aspen01 I wonder if there has been any progress on this one?

</property>

<property>
<name>hdfs.url</name>
Contributor


I think it is better to add the prefix zeppelin.notebook, because hdfs is used in many places. Adding a prefix can help users understand that this is for notebook storage.

Author


HDFSCommand already uses the hdfs.url property, so I used the same property name.
If adding a prefix helps users understand, I'll add the prefix.

@@ -42,6 +42,18 @@
</dependency>

<dependency>
<groupId>${project.groupId}</groupId>
Contributor


I don't think it is a good idea to make zengine depend on one specific interpreter.

Author


I agree with you, but to use the HDFSCommand functions I had to add the dependency.
Could you give me advice on how to solve this?

Member


Indeed, Zengine should not have such a dependency.

@aspen01 if HDFSCommand has to be shared:

  • one option is to move it to zeppelin-interpreter, as both file and zeppelin-zengine already depend on it
  • another option to consider would be to extract a new maven sub-module out of zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/repo/, move HDFSCommand there, and make other modules depend on it.

The second option is more involved and deserves a separate issue with discussion. The first one looks more feasible.

@aspen01
Author

aspen01 commented Dec 19, 2016

@bzz Thank you for the advice. But the build fails even though I fixed the license and checkstyle issues.

@bzz
Member

bzz commented Dec 20, 2016

I see, thank you for taking care! This looks like something failing on the TravisCI side though.

Could you please rebase this branch on the latest master and force-push it here again? This will trigger CI with the latest code/fixes from master.

@placeybordeaux

Looks like you're missing a hotfix commit:

1c7d8fb

@bzz
Member

bzz commented Dec 22, 2016

Thank you @aspen01, @placeybordeaux

In order to be merged, this branch has to be rebased on top of master, which itself already includes the CI fixes, i.e. 1c7d8fb.

This branch must not include any commits other than the ones that implement "HDFS to backup and restore notebook" - edae7bc and d94eb16 need to be removed.

return "/" + newUrl.replaceAll("^hdfs://", "").split("/", 2)[1];
}
else
return newUrl;
Member


Please make sure all the code follows the project style guide.
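For illustration, here is the snippet above restyled with braces on both branches, which is what most Java style checks (including Zeppelin's checkstyle) expect. The wrapper class, method name, and the `startsWith` guard are hypothetical, since the original condition is outside the quoted diff hunk:

```java
public class HdfsUrlNormalizer {
    // Strip a leading hdfs://host:port authority, keeping the absolute path;
    // URLs without the hdfs:// scheme are returned unchanged.
    static String normalize(String newUrl) {
        if (newUrl.startsWith("hdfs://")) {
            return "/" + newUrl.replaceAll("^hdfs://", "").split("/", 2)[1];
        } else {
            return newUrl;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("hdfs://namenode:8020/user/zeppelin/notebook"));
        // -> /user/zeppelin/notebook
    }
}
```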

@aspen01
Author

aspen01 commented Dec 22, 2016

@placeybordeaux Thank you.
@bzz Do you mean I have to merge these commits to master not branch-0.6?

@aspen01 aspen01 changed the base branch from branch-0.6 to master December 28, 2016 09:56
@aspen01
Author

aspen01 commented Dec 28, 2016

I rebased this branch on the latest master and force-pushed it.

@placeybordeaux

Not sure why the CI hasn't kicked in, but I tried compiling this locally and I am getting some test failures:

Results :

Failed tests:
  NotebookTest.testAbortParagraphStatusOnInterpreterRestart:760 expected:<ABORT> but was:<RUNNING>

Tests in error:
  HeliumApplicationFactoryTest.testUnloadOnInterpreterUnbind:232 » ClassCast jav...
  HeliumApplicationFactoryTest.testLoadRunUnloadApplication:148 » ClassCast java...
  HeliumApplicationFactoryTest.testUnloadOnInterpreterRestart:299 » ClassCast ja...
  HeliumApplicationFactoryTest.testUnloadOnParagraphRemove:193 » ClassCast java....

Tests run: 149, Failures: 1, Errors: 4, Skipped: 0

Not sure if these are related to the PR at all.

@aspen01
Author

aspen01 commented Jan 11, 2017

@placeybordeaux How did you build a package?

$ mvn clean package -Pspark-1.6 -Phadoop-2.6 -Pyarn -Ppyspark -Psparkr -Pscala-2.10

...


Results :

Tests run: 66, Failures: 0, Errors: 0, Skipped: 0

...

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 19:23 min
[INFO] Finished at: 2017-01-11T23:00:58+09:00
[INFO] Final Memory: 160M/815M
[INFO] ------------------------------------------------------------------------

@placeybordeaux

I built with mvn clean package, but got the same results with mvn clean package -Pspark-1.6 -Phadoop-2.6 -Pyarn -Ppyspark -Psparkr -Pscala-2.10. It's likely this is just a problem on my side.

@keithchambers

Is this ready to be merged?

@EronWright

Seems overcomplicated for HDFSNotebookRepo to extend VFSNotebookRepo, and to be synchronizing the local FS with the remote FS. Just implement NotebookRepo directly.

@aspen01
Author

aspen01 commented May 10, 2017

@EronWright Thank you for your feedback. When I first developed this feature, I assumed that HDFS would not be used as the default storage, because a remote FS failure could affect Zeppelin usage. So I extended VFSNotebookRepo to keep the default storage on the local FS, with HDFS as backup storage for failover. This keeps the implementation simple, but it may have created complexity.
I'm using this feature, but I am also considering third-party storage such as HBase, because the latency of HDFS affects the latency of Zeppelin notebooks.

@asfgit asfgit closed this in c38a0a0 May 9, 2018
asfgit pushed a commit that referenced this pull request May 9, 2018
close #83
close #86
close #125
close #133
close #139
close #146
close #193
close #203
close #246
close #262
close #264
close #273
close #291
close #299
close #320
close #347
close #389
close #413
close #423
close #543
close #560
close #658
close #670
close #728
close #765
close #777
close #782
close #783
close #812
close #822
close #841
close #843
close #878
close #884
close #918
close #989
close #1076
close #1135
close #1187
close #1231
close #1304
close #1316
close #1361
close #1385
close #1390
close #1414
close #1422
close #1425
close #1447
close #1458
close #1466
close #1485
close #1492
close #1495
close #1497
close #1536
close #1545
close #1561
close #1577
close #1600
close #1603
close #1678
close #1695
close #1739
close #1748
close #1765
close #1767
close #1776
close #1783
close #1799
7 participants