From 6ebf959502cc0ad125ca0cf88a0c28071a4fa70e Mon Sep 17 00:00:00 2001
From: Tomoko Uchida
Date: Sun, 9 May 2021 08:45:24 +0900
Subject: [PATCH] reorganize termvectors format description (javadocs). (#130)
---
 .../lucene90/Lucene90TermVectorsFormat.java | 115 ++++++++++--------
 1 file changed, 61 insertions(+), 54 deletions(-)

diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90TermVectorsFormat.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90TermVectorsFormat.java
index 0142f5461e86..e19168ff95da 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90TermVectorsFormat.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90TermVectorsFormat.java
@@ -56,15 +56,20 @@
    • VectorMeta (.tvm) --> <Header>, PackedIntsVersion, ChunkSize, ChunkIndexMetadata, ChunkCount, DirtyChunkCount, DirtyDocsCount, Footer
    • Header --> {@link CodecUtil#writeIndexHeader IndexHeader}
-   • PackedIntsVersion --> {@link PackedInts#VERSION_CURRENT} as a {@link DataOutput#writeVInt VInt}
-   • ChunkSize is the number of bytes of terms to accumulate before flushing, as a {@link DataOutput#writeVInt VInt}
-   • ChunkCount is not known in advance and is the number of chunks necessary to store all documents of the segment
-   • DirtyChunkCount --> the number of prematurely flushed chunks in the .tvd file
+   • PackedIntsVersion, ChunkSize --> {@link DataOutput#writeVInt VInt}
+   • ChunkCount, DirtyChunkCount, DirtyDocsCount --> {@link DataOutput#writeVLong VLong}
+   • ChunkIndexMetadata --> {@link FieldsIndexWriter}
    • Footer --> {@link CodecUtil#writeFooter CodecFooter}

+   Notes:
+   • PackedIntsVersion is {@link PackedInts#VERSION_CURRENT}
+   • ChunkSize is the number of bytes of terms to accumulate before flushing
+   • ChunkCount is not known in advance and is the number of chunks necessary to store all documents of the segment
+   • DirtyChunkCount is the number of prematurely flushed chunks in the .tvd file
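As context for the VInt references in this hunk, the variable-length integer scheme used by {@link DataOutput#writeVInt} can be sketched outside Lucene like this (a minimal standalone sketch, not Lucene's actual implementation; the class and method names are illustrative): 7 data bits are stored per byte, with the high bit set on every byte except the last, so small values such as ChunkSize take a single byte.

```java
import java.io.ByteArrayOutputStream;

// Standalone sketch of a VInt encoder: 7 data bits per byte, high bit
// set on all bytes except the last one.
public class VIntSketch {
  static byte[] writeVInt(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Treat the int as unsigned; emit the low 7 bits per byte.
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80); // continuation bit set
      value >>>= 7;
    }
    out.write(value); // final byte, continuation bit clear
    return out.toByteArray();
  }

  public static void main(String[] args) {
    System.out.println(writeVInt(5).length);     // 1 byte (fits in 7 bits)
    System.out.println(writeVInt(16384).length); // 3 bytes (needs 15 bits)
  }
}
```

Values up to 127 take one byte, up to 16383 two bytes, and so on; this is why the format favors small deltas and counts.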
    A vector data file (extension .tvd). This file stores terms, frequencies, positions, offsets and payloads for every document. Upon writing a new segment, it
@@ -80,76 +85,78 @@ FieldNumOffs >, < Flags >, < NumTerms >, < TermLengths >, < TermFreqs >, < Positions >, < StartOffsets >, < Lengths >, < PayloadLengths >, < TermAndPayloads >
-   • DocBase is the ID of the first doc of the chunk as a {@link DataOutput#writeVInt VInt}
-   • ChunkDocs is the number of documents in the chunk
    • NumFields --> DocNumFields^ChunkDocs
-   • DocNumFields is the number of fields for each doc, written as a {@link DataOutput#writeVInt VInt} if ChunkDocs==1 and as a {@link PackedInts} array otherwise
-   • FieldNums --> FieldNumDelta^TotalDistinctFields, a delta-encoded list of the sorted unique field numbers present in the chunk
-   • FieldNumOffs --> FieldNumOff^TotalFields, as a {@link PackedInts} array
-   • FieldNumOff is the offset of the field number in FieldNums
-   • TotalFields is the total number of fields (sum of the values of NumFields)
+   • FieldNums --> FieldNumDelta^TotalDistinctFields
+   • FieldNumOffs --> FieldNumOff^TotalFields
    • Flags --> Bit < FieldFlags >
-   • Bit is a single bit which when true means that fields have the same options for every document in the chunk
    • FieldFlags --> if Bit==1: Flag^TotalDistinctFields else Flag^TotalFields
-   • Flag: a 3-bit int whose three bits indicate whether the field has positions, offsets, and payloads, respectively
    • NumTerms --> FieldNumTerms^TotalFields
-   • FieldNumTerms: the number of terms for each field, using {@link BlockPackedWriter blocks of 64 packed ints}
    • TermLengths --> PrefixLength^TotalTerms SuffixLength^TotalTerms
-   • TotalTerms: total number of terms (sum of NumTerms)
-   • PrefixLength: 0 for the first term of a field, the common prefix with the previous term otherwise, using {@link BlockPackedWriter blocks of 64 packed ints}
-   • SuffixLength: length of the term minus PrefixLength for every term, using {@link BlockPackedWriter blocks of 64 packed ints}
    • TermFreqs --> TermFreqMinus1^TotalTerms
-   • TermFreqMinus1: (frequency - 1) for each term, using {@link BlockPackedWriter blocks of 64 packed ints}
    • Positions --> PositionDelta^TotalPositions
-   • TotalPositions is the sum of frequencies of terms of all fields that have positions
-   • PositionDelta: the absolute position for the first position of a term, and the difference with the previous position for following positions, using {@link BlockPackedWriter blocks of 64 packed ints}
    • StartOffsets --> (AvgCharsPerTerm^TotalDistinctFields) StartOffsetDelta^TotalOffsets
-   • TotalOffsets is the sum of frequencies of terms of all fields that have offsets
-   • AvgCharsPerTerm: average number of chars per term, encoded as a float on 4 bytes. They are not present if no field has both positions and offsets enabled.
-   • StartOffsetDelta: (startOffset - previousStartOffset - AvgCharsPerTerm * PositionDelta). previousStartOffset is 0 for the first offset and AvgCharsPerTerm is 0 if the field has no positions, using {@link BlockPackedWriter blocks of 64 packed ints}
    • Lengths --> LengthMinusTermLength^TotalOffsets
-   • LengthMinusTermLength: (endOffset - startOffset - termLength), using {@link BlockPackedWriter blocks of 64 packed ints}
    • PayloadLengths --> PayloadLength^TotalPayloads
-   • TotalPayloads is the sum of frequencies of terms of all fields that have payloads
-   • PayloadLength is the payload length, encoded using {@link BlockPackedWriter blocks of 64 packed ints}
    • TermAndPayloads --> LZ4-compressed representation of < FieldTermsAndPayLoads >^TotalFields
    • FieldTermsAndPayLoads --> Terms (Payloads)
-   • Terms: term bytes
-   • Payloads: payload bytes (if the field has payloads)
+   • DocBase, ChunkDocs, DocNumFields (with ChunkDocs==1) --> {@link DataOutput#writeVInt VInt}
+   • AvgCharsPerTerm --> {@link DataOutput#writeInt Int}
+   • DocNumFields (with ChunkDocs>1), FieldNumOffs --> {@link PackedInts} array
+   • FieldNumTerms, PrefixLength, SuffixLength, TermFreqMinus1, PositionDelta, StartOffsetDelta, LengthMinusTermLength, PayloadLength --> {@link BlockPackedWriter blocks of 64 packed ints}
    • Footer --> {@link CodecUtil#writeFooter CodecFooter}

+   Notes:
+   • DocBase is the ID of the first doc of the chunk
+   • ChunkDocs is the number of documents in the chunk
+   • DocNumFields is the number of fields for each doc
+   • FieldNums is a delta-encoded list of the sorted unique field numbers present in the chunk
+   • FieldNumOff is the offset of the field number in FieldNums
+   • TotalFields is the total number of fields (sum of the values of NumFields)
+   • Bit is a single bit which when true means that fields have the same options for every document in the chunk
+   • FieldNumTerms is the number of terms for each field
+   • TotalTerms is the total number of terms (sum of NumTerms)
+   • PrefixLength is 0 for the first term of a field, the common prefix with the previous term otherwise
+   • SuffixLength is the length of the term minus PrefixLength for every term
+   • TermFreqMinus1 is (frequency - 1) for each term
+   • TotalPositions is the sum of frequencies of terms of all fields that have positions
+   • PositionDelta is the absolute position for the first position of a term, and the difference with the previous position for following positions
+   • TotalOffsets is the sum of frequencies of terms of all fields that have offsets
+   • AvgCharsPerTerm is the average number of chars per term, encoded as a float on 4 bytes; it is not present if no field has both positions and offsets enabled
+   • StartOffsetDelta is (startOffset - previousStartOffset - AvgCharsPerTerm * PositionDelta); previousStartOffset is 0 for the first offset, and AvgCharsPerTerm is 0 if the field has no positions
+   • LengthMinusTermLength is (endOffset - startOffset - termLength)
+   • TotalPayloads is the sum of frequencies of terms of all fields that have payloads
+   • Terms is the term bytes
+   • Payloads is the payload bytes (if the field has payloads)
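The TermLengths prefix/suffix scheme described in this hunk can be sketched in isolation (illustrative standalone code, not Lucene's writer; the class and method names are invented for the example): because the terms of a field are sorted, each term shares a prefix with its predecessor, and only the prefix length plus the remaining suffix length need to be stored.

```java
import java.util.List;

// Sketch of per-field term compression: for each term, compute the length
// of the common prefix with the previous term and the length of the rest.
public class TermLengthsSketch {
  // Returns {prefixLength, suffixLength}; prev == null for the first term
  // of a field, which therefore gets PrefixLength == 0.
  static int[] prefixSuffix(String prev, String term) {
    int prefix = 0;
    if (prev != null) {
      int max = Math.min(prev.length(), term.length());
      while (prefix < max && prev.charAt(prefix) == term.charAt(prefix)) {
        prefix++;
      }
    }
    return new int[] {prefix, term.length() - prefix};
  }

  public static void main(String[] args) {
    // Sorted terms of one field, as they would appear within a chunk.
    List<String> terms = List.of("lucene", "lucid", "search");
    String prev = null;
    for (String t : terms) {
      int[] ps = prefixSuffix(prev, t);
      System.out.println(t + " -> prefix=" + ps[0] + " suffix=" + ps[1]);
      prev = t;
    }
  }
}
```

In the real format both length streams are then written with blocks of 64 packed ints, which is why storing two small numbers per term pays off.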
    An index file (extension .tvx).
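Returning to the .tvd offset encoding above: the StartOffsetDelta formula can be checked with a small round-trip sketch (illustrative names and simplified rounding; not Lucene's implementation). The idea is that start offsets grow roughly linearly with position at AvgCharsPerTerm characters per step, so only the residual of that prediction is stored.

```java
// Sketch of the StartOffsetDelta transform: predict each start offset from
// the previous one plus AvgCharsPerTerm * PositionDelta, store the residual.
public class StartOffsetDeltaSketch {
  static float encode(int startOffset, int previousStartOffset,
                      float avgCharsPerTerm, int positionDelta) {
    return startOffset - previousStartOffset - avgCharsPerTerm * positionDelta;
  }

  static int decode(float delta, int previousStartOffset,
                    float avgCharsPerTerm, int positionDelta) {
    return Math.round(delta + previousStartOffset + avgCharsPerTerm * positionDelta);
  }

  public static void main(String[] args) {
    // Two consecutive occurrences one position apart, in a field averaging
    // 6 chars per term: the stored residual is 0, the prediction is exact.
    float avg = 6.0f;
    float delta = encode(13, 7, avg, 1); // 13 - 7 - 6*1 = 0
    System.out.println(delta);
    System.out.println(decode(delta, 7, avg, 1)); // recovers 13
  }
}
```

When the prediction is good, the residuals cluster near zero and compress well in blocks of packed ints; a field with no positions uses AvgCharsPerTerm == 0, reducing this to a plain previous-offset delta.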