Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement of paddle.v2.data.fetcher #1405

Closed

Conversation

reyoung
Copy link
Collaborator

@reyoung reyoung commented Feb 21, 2017

@reyoung reyoung requested review from wangkuiyi, helinwang, wen-bo-yang and qingqing01 and removed request for helinwang February 21, 2017 10:46
@reyoung reyoung force-pushed the feature/python.data.fetcher branch from 505575e to 6dea954 Compare February 21, 2017 10:47

def __init__(self,
url,
filename,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为何不能直接从url最后一段算出filename?e.g., url: foo.com/abc.tar.gz -> filename: abc.tar.gz

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里有以下几种情况。

  1. 最简单的是通过URL就能判定了。这种情况,文件名就是URL的最后一节
  2. URL并没有表示文件名是什么,而是使用HTTP的response header表示的。具体标准为 rfc6266。这种情况下,需要发送HTTP请求,才能获得response,进而推断文件名。

于是,如果需要获得文件名,最保险的做法就是发送HTTP请求,解析response。但是这样,对于用户没有联网的时候,使用已经缓存好的Package就比较困难了(没联网文件名无法精确的推断出来=>无法校验本地文件的完整性)。

另外,感觉起来,毕竟这个库是Paddle内部使用的接口,让调用者多传一个文件名并不算复杂,也可以简化代码实现上的逻辑。

Copy link
Contributor

@helinwang helinwang Feb 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

明白了,确实需要这个参数。能否filename作为optional参数?(好像大多情况,包括你写的test case, filename都可以通过URL判定。)

"""
extension = ".gz"

def open(self, path=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

除了tar.gz后缀名以外的open函数,path都没有使用到。要不要考虑把path放在__init__里面作为optional参数?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tar Gz包可以打开不止一个文件,如果传到构造函数里的话,open函数就没法传入不同的路径,多次调用了。常见用法类似于

fetcher = Fetcher("something.tar.gz")

with fetcher.open("file_a") as file_a, fetcher.open("file_b") as file_b:
    ...
    pass 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

明白了,为了让open简洁(除了tar.gz其他情况都没用到path的参数,所以感觉去掉了更简洁),能否考虑以下实现?

fetcher_a = Fetcher("something.tar.gz", path="file_a") # synchronous download
fetcher_b = Fetcher("something.tar.gz", path="file_b") # cached, no download
with fetcher_a.open() as file_a, fetcher_b.open() as file_b:
    # ...
    pass

return True


class IFileOpener(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.gz和.tar.gz文件的操作应该是read的时候解码,而不是在open的时候解码。所以这里应该定义的是一个reader,而不是一个opener interface。

实际上Go语言的标准库就有一个 io.Reader interface,并且很多不同的 packages里实现了各种reader,比如

  1. os.File 就实现了 io.Reader interface,相当于你这里的 NormalOpener 的作用;
  2. bufio.Reader 实现了 io.Reader,其中 bufio.NewReader(r io.Reader) io.Reader 函数接收一个reader,返回一个buffered reader,从这个buffered reader里读,可以减少对r.Read的调用次数,提高吞吐量。
  3. gzip.Reader 也实现了io.Reader。gzip.NewReader(io.Reader) io.Reader 也是接受一个reader,且返回一个新的reader,这个新的reader的Read函数以便读数据,一遍解码。
  4. cvs.Reader 也实现了io.Reader。cvs.Reader的Read函数会返回点子表格里的一条record。
  5. tar.Reader 实现了 io.Reader,而且不仅实现了其中的Read函数,而且加了一个 Next 函数。每次调用Next就切换到tarball里的下一个文件,然后调用Read就是返回这个文件里的内容了。

所以如果我们有一个 example.cvs.gz 文件,不需要解压,就可以通过创建下面这reader来读取其中的记录:

r = cvs.NewReader(
        gzip.NewReader(
            bufio.NewReader(
                os.Open("example.cvs.gz"))))
for l, e := r.Read(); e != nil; l, e = r.Read() {
    fmt.Println(l)
}

我建议这个PR的设计参考Go的io.Reader重新设计

Copy link
Collaborator Author

@reyoung reyoung Feb 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.gz和.tar.gz文件的操作应该是read的时候解码,而不是在open的时候解码。

基本同意。

但是,不太同意改这个Opener,因为:

  • 这个Opener并没有实现Reader,而是实现了Reader的组合构造过程
  • 之所以实现Opener的open函数,是因为Python里面构造大多数基于文件的Reader,都叫做 XXX.open,例如gzip.open

所以:

  • 使用Opener.open这个名字,描述Reader的组合构造过程是可行的。
  • Opener.open返回的就是一个Reader。解码过程(对于GZip)在这个Reader里面做。

另外,即使这个Opener设计的有问题,也仅限于这一个文件。因为这个Opener不会expose到这个文件之外。所以,不会被其他开发者和用户来使用。

@wangkuiyi wangkuiyi mentioned this pull request Feb 22, 2017

out_fn = os.path.join(self.__cached_dir__, self.__filename__)
if os.path.exists(out_fn):
if __is_file_matched__(out_fn, md5, chunk_size):
Copy link
Contributor

@gongweibao gongweibao Feb 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

用第一个chunk做文件完整性的判断是有问题的。如果用户下载数据中断,再次执行会出错。
tmp=>real会好点

@reyoung
Copy link
Collaborator Author

reyoung commented Feb 23, 2017

Closed becasue we use NLTK to do this.

@reyoung reyoung closed this Feb 23, 2017
wangxicoding pushed a commit to wangxicoding/Paddle that referenced this pull request Dec 9, 2021
* Optmize the cost for fp16 init in FT.

* Remove reserve_data argument in FT

* Remove unused import
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants