[PATCH] Indexing on Hadoop distributed file system [LUCENE-532] #1610
Comments
Igor Bolotin (migrated from JIRA) Two patch files are attached.
Doug Cutting (@cutting) (migrated from JIRA) Instead of changing the value to -1 we should not write a size value in the header at all. We can change the format number and use that to determine where to read the size. Does that make sense? Also, please submit patches as a single 'svn diff' from the top of the lucene tree. Thanks!
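For illustration only (this is not the committed patch): the reader-side branch Doug describes might look roughly like the sketch below, where `FORMAT_SIZE_AT_END` is a hypothetical format constant and newer-format files are assumed to store the size in their last 8 bytes.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

class TermCountReaderSketch {
  // Hypothetical constant: formats at or below this value store the term
  // count in the last 8 bytes of the file instead of in the header.
  static final int FORMAT_SIZE_AT_END = -3;

  static long readTermCount(IndexInput in) throws IOException {
    int format = in.readInt();                 // format marker at the start of the file
    if (format <= FORMAT_SIZE_AT_END) {        // newer format: count lives at the tail
      long afterHeader = in.getFilePointer();
      in.seek(in.length() - 8);                // seeking while *reading* works fine on DFS
      long size = in.readLong();
      in.seek(afterHeader);                    // resume right after the header
      return size;
    } else {
      return in.readLong();                    // old format: count right after the header
    }
  }
}
```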
Igor Bolotin (migrated from JIRA) Attached is a new patch that uses the format number to determine where to read the size, as discussed.
Otis Gospodnetic (@otisg) (migrated from JIRA) This actually looks like a good patch that doesn't break any tests. I'll commit it in the coming days, as it looks like it should be backwards compatible... except CFS won't be supported unless somebody patches that, too (I tried quickly and soon got unit tests to fail :( ).
Chris (migrated from JIRA) Don't mean to resurrect old issues, but we're having the same problem here indexing to DFS and I've applied the patch and it works for us. Wondering if I'm missing something, or if this is being addressed somewhere else in trunk that I haven't found.
Otis Gospodnetic (@otisg) (migrated from JIRA) I'm hesitant to commit without the CFS support. It looks like more and more people are using CFS indexes.
Michael McCandless (@mikemccand) (migrated from JIRA) I think this is the same issue as LUCENE-532 (I just marked that one as a dup). But there was one difference: does HDFS allow writing to the same file (eg "segments") more than once? I thought it did not because it's "write once"? Do we need to not do that (write to the same file more than once) to work with HDFS (lock-less gets us closer)?
Michael McCandless (@mikemccand) (migrated from JIRA) Sorry, I meant "dup of #1779" above.
Michael McCandless (@mikemccand) (migrated from JIRA) Also: I like the idea of never doing "seek" when writing. The less functionality we rely on from the filesystem, the more portable Lucene will be. Since Lucene is so wonderfully simple, never using "seek" during write is in fact very feasible. I think to do this we need to change the CFS file format, so that the offsets are stored at the end of the file. We actually can't pre-compute where the offsets will be because we can't make assumptions about how the file position changes when bytes are written: this is implementation specific. For example, if the Directory implementation does on-the-fly compression, then the file position will not be the number of bytes written. So I think we have to write at the end of the file. Any opinions or other suggestions?
Andrzej Bialecki (@sigram) (migrated from JIRA) Hadoop cannot (yet) change file position when writing. All files are write-once, i.e. once they are closed they are pretty much immutable. They are also append-only - writing uses a subclass of OutputStream.
Michael McCandless (@mikemccand) (migrated from JIRA) Alas, in trying to change the CFS format so that file offsets are stored at the end of the file, when implementing the corresponding changes to CompoundFileReader, I discovered that this approach isn't viable. I had been thinking the reader would look at the file length, subtract numEntry*sizeof(long), seek to there, and then read the offsets (longs). The problem is: we can't know sizeof(long) since this is dependent on the actual storage implementation, ie, for the same reasoning above. Ie we can't assume a byte = 1 file position, always. So, then, the only solution I can think of (to avoid seek during write) would be to write to a separate file, for each *.cfs file, that contains the file offsets corresponding to the cfs file. Eg, if we have _1.cfs we would also have _1.cfsx which holds the file offsets. This is sort of costly if we care about # files (it doubles the number of files in the simple case of a bunch of segments w/ no deletes/separate norms). Yonik had actually mentioned in #1779 that fixing CFS writing to not use seek was not very important, ie, it would be OK to not use compound files with HDFS as the store. Does anyone see a better approach?
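To make the separate-offsets-file proposal concrete, here is a minimal sketch (this alternative was never committed; the class name and the `copy` helper are hypothetical, and the pre-4.0 Directory API without IOContext is assumed). The `.cfs` data file is written strictly append-only, while each entry's name and starting offset go into a companion `.cfsx` file.

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

class AppendOnlyCompoundWriterSketch {
  // Write sub-files into "<name>.cfs" sequentially and record each entry's
  // name and starting offset in "<name>.cfsx", so no seek is needed on write.
  static void write(Directory dir, String name, List<String> files) throws IOException {
    IndexOutput data = dir.createOutput(name + ".cfs");
    IndexOutput index = dir.createOutput(name + ".cfsx");
    try {
      index.writeVInt(files.size());
      for (String file : files) {
        index.writeString(file);
        index.writeLong(data.getFilePointer());  // offset within the data file
        copy(dir, file, data);                   // append the sub-file's bytes
      }
    } finally {
      data.close();
      index.close();
    }
  }

  // Simple byte copy from an existing file into the compound data stream.
  private static void copy(Directory dir, String file, IndexOutput out) throws IOException {
    IndexInput in = dir.openInput(file);
    try {
      byte[] buf = new byte[4096];
      long remaining = in.length();
      while (remaining > 0) {
        int chunk = (int) Math.min(buf.length, remaining);
        in.readBytes(buf, 0, chunk);
        out.writeBytes(buf, 0, chunk);
        remaining -= chunk;
      }
    } finally {
      in.close();
    }
  }
}
```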
Kevin Oliver (migrated from JIRA) Here are some diffs on how to remove seeks from CompoundFileWriter (this is against an older version of Lucene, 1.4.2 I think, but the general idea is the same). There's also a test.
Michael McCandless (@mikemccand) (migrated from JIRA) Thank you for the patch & unit test! This is actually the same approach that I started with. But I ruled it out because it assumes that one byte written always advances the file position by exactly one. Meaning, one can imagine a Directory implementation that's doing some transformation (compression, say) as the bytes are written. And so the only value you should ever pass to "seek()" is a value you previously got back from "getFilePointer()". However, on looking into this question further ... I do see that there are already places in Lucene that rely on this assumption. Maybe it is OK to make the definition of Directory.seek() stricter, by requiring that file positions correspond to bytes written. I think this is the open question. Does anyone have any input to help decide? Lucene currently makes this assumption, albeit in a fairly contained way.
Grant Ingersoll (@gsingers) (migrated from JIRA) Anyone have a follow-up on this? Seems like Hadoop-based indexing would be a nice feature. It sounds like there was a lot of support for this, but it was never committed. Is this still an issue?
Michael Busch (migrated from JIRA) I think #1858 (move all file headers to the segments file) would solve this issue nicely. Then there would no longer be a need to call seek() in CFSWriter and TermInfosWriter. I'd love to work on 783, but I'm not sure time permits in the near future.
Ning Li (migrated from JIRA) Does the use of seek and write in ChecksumIndexOutput make it less likely that Lucene will support fully sequential writes (i.e. no seek on write)? ChecksumIndexOutput is currently used by SegmentInfos.
Shai Erera (@shaie) (migrated from JIRA) I see some progress in that direction was made under #3449, but I'm not sure if this Codec is a generic one (i.e. can support any file we write today) or tailored for StandardTermDict. It'd be great if Lucene could support append-only filesystems!
Robert Muir (@rmuir) (migrated from JIRA) This is fixed by #3449; just set your codec to AppendingCodec.
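As a usage illustration only, wiring the append-only codec into an IndexWriter might look like the sketch below. The package name, the no-arg AppendingCodec constructor, and the IndexWriterConfig.setCodec call follow the 4.x-era API and are assumptions here; check the API of your Lucene version.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.appending.AppendingCodec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

class AppendingCodecExample {
  // Open a writer that never seeks while writing, suitable for append-only
  // filesystems such as HDFS. dfsDir is any Directory backed by such a FS.
  static IndexWriter openWriter(Directory dfsDir) throws IOException {
    IndexWriterConfig conf =
        new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    conf.setCodec(new AppendingCodec());   // the codec referenced in the comment above
    return new IndexWriter(dfsDir, conf);
  }
}
```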
In my current project we needed a way to create very large Lucene indexes on the Hadoop distributed file system (DFS). When we tried to do it directly on DFS using the Nutch FsDirectory class, we immediately found that indexing fails because the DfsIndexOutput.seek() method throws UnsupportedOperationException. The reason for this behavior is clear: DFS does not support random updates, so the seek() method can't be supported (at least not easily).
Well, if we can't support random updates, the question is: do we really need them? A search of the Lucene code revealed two places that call the IndexOutput.seek() method: one in TermInfosWriter and another in CompoundFileWriter. As we weren't planning to use CompoundFileWriter, the only place that concerned us was TermInfosWriter.
TermInfosWriter uses IndexOutput.seek() in its close() method to write the total number of terms back into the beginning of the file. It was very simple to change the file format a little and write the number of terms into the last 8 bytes of the file instead of into the beginning. The only other place that needs to change for this to work is the SegmentTermEnum constructor, which must read this value at position = file length - 8.
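For illustration only (this is not the attached patch; `output` and `size` stand in for the writer's actual fields), the writer-side change amounts to appending the count on close instead of seeking back to the header, roughly as sketched below.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

class TailTermCountWriterSketch {
  private final IndexOutput output;  // stream the term dictionary is written to
  private long size;                 // running count of terms written

  TailTermCountWriterSketch(IndexOutput output) {
    this.output = output;
  }

  void addTerm(/* term data elided */) throws IOException {
    // ... write the term entry sequentially ...
    size++;
  }

  // Original approach: seek back to the header and overwrite the count there.
  // Seek-free variant: append the count so it occupies the last 8 bytes,
  // where the reader (SegmentTermEnum) picks it up at (file length - 8).
  void close() throws IOException {
    output.writeLong(size);
    output.close();
  }
}
```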
With this format hack, we were able to use FsDirectory to write an index directly to DFS without any problems. We still don't index directly to DFS for performance reasons, but at least we can build small local indexes and merge them into the main index on DFS without copying the big main index back and forth.
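A minimal sketch of that build-locally-then-merge workflow, assuming the 3.0.x-era IndexWriter API (the exact constructor and addIndexes variant differ across Lucene versions, and `dfsDir` stands for any Directory backed by DFS, such as Nutch's FsDirectory):

```java
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

class MergeLocalIntoDfsSketch {
  // Merge a small index built on local disk into the main index living on DFS,
  // so the large index never has to be copied back and forth.
  static void merge(Directory dfsDir, File localIndexPath) throws IOException {
    Directory local = FSDirectory.open(localIndexPath);
    IndexWriter writer = new IndexWriter(dfsDir,
        new StandardAnalyzer(Version.LUCENE_30),
        false,                                      // append to the existing main index
        IndexWriter.MaxFieldLength.UNLIMITED);
    try {
      writer.addIndexesNoOptimize(new Directory[] { local });  // merge without a full rewrite
    } finally {
      writer.close();
      local.close();
    }
  }
}
```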
Migrated from LUCENE-532 by Igor Bolotin, 3 votes, resolved Nov 20 2011
Attachments: cfs-patch.txt, indexOnDFS.patch, SegmentTermEnum.patch, TermInfosWriter.patch