Reconcile terminology and method naming to 'used/unused segments'; Rename MetadataSegmentManager to MetadataSegmentsManager #7306

leventov · 2019-03-20T21:00:29Z

Description

There are several related sets of changes:

Reconcile terminology and method naming to 'used/unused segments'

This naming stems from the used=0/1 column in the database table. Other names that used to be used in the codebase are

"database segments"
"database available segments"
"available segments"
"enabled/disabled segments"
"delete/remove/kill segments", in the meaning of "mark segments as unused"

You can see how the third one, "available segments" has emerged, but the result is totally misleading and confusing with segments which are available aka served by some data nodes.

Don't use terms 'enable/disable data source'

"Enable data source" actually means "try to mark as 'used' all segments in the database that belong to a data source". "Disable data source" actually means "try to mark as unused all segments in the database that belong to a data source". Enable/disable terminology might make users think that Druid maintains a separate toggle with the notion of an individual segment's used flag, which is false. Disabling a data source and then enabling it again loses information about the subset of segments that may have been unused in a data source.

Similarly, uses terms 'Used/unused data source' instead of 'enabled/disabled data source' to differentiate between data sources to which at least one used segment belongs and to which none used segments belong.

Rename `MetadataSegmentManager` to `SegmentsMetadataManager`

To make less aliasing and confusion with SegmentsManager, a class used on Historical nodes. Also renamed residual occurrences in variable names of "DatabaseSegmentManager", which is the former name of MetadataSegmentManager.

Rename `DruidCoordinatorHelper` to `CoordinatorDuty`

Also, renamed the implementations:

DruidCoordinatorBalancer -> BalanceSegments
DruidCoordinatorSegmentCompactor -> CompactSegments
DruidCoordinatorLogger -> EmitClusterStatsAndMetrics
DruidCoordinatorSegmentKiller -> KillUnusedSegments
DruidCoordinatorCleanupPendingSegments -> KillStalePendingSegments
DruidCoordinatorSegmentInfoLoader -> LogUsedSegments
DruidCoordinatorCleanupOvershadowed -> MarkAsUnusedOvershadowedSegments
DruidCoordinatorCleanupUnneeded -> UnloadUnusedSegments
DruidCoordinatorRuleRunner -> RunRules

Also, renamed CoordinatorIndexingServiceHelper annotation to CoordinatorIndexingServiceDuty.

Rename `IndexerSQLMetadataStorageCoordinator`'s `get` methods to `retrieve`

Because they all access the metadata store, i. e. not cheap.

getUsedSegmentsForInterval -> retrieveUsedSegmentsForInterval
getAllUsedSegments -> retrieveAllUsedSegments
getUsedSegmentsAndCreatedDates -> retrieveUsedSegmentsAndCreatedDates
getUsedSegmentsForIntervals -> retrieveUsedSegmentsForIntervals
getveUnusedSegmentsForInterval -> retrieveUnusedSegmentsForInterval
getDataSourceMetadata -> retrieveDataSourceMetadata

As well as a related method UsedSegmentLister.getUsedSegmentsForIntervals -> retrieveUsedSegmentsForIntervals.

Rename `SegmentsAndMetadata` class to `SegmentsAndCommitMetadata`

This is done to increase the difference from SegmentsMetadataManager (former MetadataSegmentManager).

Rename several indexing service task-related classes

KillTask -> KillUnusedSegmentsTask
ClientQuery -> ClientTaskQuery
ClientKillQuery -> ClientKillUnusedSegmentsTaskQuery
ClientCompactQuery -> ClientCompactionTaskQuery

Refactoring of `DruidCoordinator`

Refactored DruidCoordinator, removed unnecessary CoordinatorRunnable hierarchy.

Future work

Rename MetadataRuleManager to RulesMetadataManager to align with SegmentsMetadataManager.
Actually make POST /druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId} to mark as used only non-overshadowed segments?

Key files changed in this PR are datasource-view.tsx and DruidCoordinator.

…n't use terms 'enable/disable data source'; Rename MetadataSegmentManager to MetadataSegments; Make REST API methods which mark segments as used/unused to return server error instead of an empty response in case of error

…exerMetadataStorageCoordinator instead of hacking with Intervals.ETERNITY

gianm · 2019-03-26T21:34:15Z

Reconcile terminology and method naming to 'used/unused segments'

This looks related and somewhat conflicting with the terminology discussion on #7233, where initial confusion over the meaning of is_published led to a suggestion of creating a new term "active". What do you think about the suggestion on #7233 (comment)?

I think if we go with where 7233 is going, the term for "used segments" would be "published segments". Fwiw, one reason I like this term because it lends itself well to being used as verbs in the following way:

A task would first push a segment to deep storage, and then publish it to the metadata store.
When a segment is dropped it is unpublished from the metadata store by setting used = false.

Don't use terms 'enable/disable data source'

Yeah, I definitely also believe these terms are not great, and are confusing for the reasons you mention. Here the "publish" and "unpublish" terms are a bit awkward though. It doesn't sound right to say "publish all segments in a datasource" instead of "enable a datasource". "Mark used" or "activate" sounds better. I'm not sure how to reconcile this with the fact that I do think "publish" sounds good & intuitive as a verb in the case where a task is initially writing a segment record to the metadata store.

Maybe this is how:

Tasks "publish" segments when they insert records into the metadata store with used = true (upon creation of the segment)
Later on if we drop a segment, we do that by "marking it unused" or, potentially, "deactivating" it.
If we re-enable a segment, we do that by "marking it used", or, potentially, "activating" it.

leventov · 2019-03-26T22:40:59Z

I think it makes sense to keep terminology between virtual Druid's SQL table sys.segments and metadata store (and therefore the remaining of the codebase) separate.

Can used be renamed? Or not, because it's a column in the database that needs to remain compatible?

gianm · 2019-03-27T22:51:26Z

I think it makes sense to keep terminology between virtual Druid's SQL table sys.segments and metadata store (and therefore the remaining of the codebase) separate.

I don't think we need to be dogmatic about it. But if possible, it benefits everyone (users & developers alike) to have consistent terminology across the entire system.

Can used be renamed? Or not, because it's a column in the database that needs to remain compatible?

Do you mean in the metadata store? I don't think it could be, since we don't have a way to automatically migrate already-existing metadata stores that use the name used. But we could decide we want to call the concept something else in general, and just keep a note that for legacy reasons, the metadata store uses the old name.

leventov · 2019-03-27T23:07:30Z

But if possible, it benefits everyone (users & developers alike) to have consistent terminology across the entire system.

I'm not sure that a column in sys.segments and the internal codebase concept (currently "used segments") are the same thing. Calling them the same may add more confusion and make navigation in the codebase harder than if the names are different. For example, if sys.segments concept is called "published" and somebody sees "published" in code, they know it's about the virtual SQL table, not the table in the metadata store.

gianm · 2019-03-27T23:14:09Z

I'm not sure that a column in sys.segments and the internal codebase concept (currently "used segments") are the same thing.

I think it's best if they are the same thing, since the point of system tables like sys.segments is to help cluster operators introspect into the state of the system using a familiar SQL style API. (So they don't have to go digging through metadata store, ZK, various Druid APIs to get a picture of what's going on.)

If it turns out that for some reason, they can't be the same thing, then it makes sense to use two different names. But if they could be the same thing then it's probably best to do that.

…atorDynamicConfig parameters

leventov · 2019-03-28T21:47:14Z

@gianm I didn't question that "published in sys.segments" and "used in metadata store" is the same thing, logically. I did question whether making this same logical thing called the same would actually make positive, rather than negative impact on the codebase. However, now I also think it's higher probability that it will have positive impact on the codebase. So I will support the rename.

It doesn't seem to me that this PR is conflicting with the rename. I think it's supportive to the rename because in this PR I did some nasty work at finding and sweeping old names and variants of naming that were looking completely unrelated. Now for the rename, somebody needs to make just two full-text searches in the codebase on Github (I did many more when preparing this PR):

"used segment"
"unused segment"

Then visit all result files that look relevant manually and rename "used/unused segments" to "publish/published/unpublished". (Note that this search needs to be performed after this PR is merged, because Github indexes only master codebase for full-text search.)

For this reason, I tag this PR Development Blocker for #7233.

Maybe this is how:

Tasks "publish" segments when they insert records into the metadata store with used = true (upon creation of the segment)

Later on if we drop a segment, we do that by "marking it unused" or, potentially, "deactivating" it.

If we re-enable a segment, we do that by "marking it used", or, potentially, "activating" it.

Oh no, I think we must go all-in with publish/unpublish/published/unpublished terminology. So we should call these actions "mark (as) published", however awkward it sounds, and "unpublish" (I think "unpublish all segments in a data source" sounds fine), or, alternatively, "mark (as) unpublished".

…tency with CompactionTask; ClientCompactQueryTuningConfig to ClientCompactionTaskQueryTuningConfig

leventov · 2019-03-28T22:31:21Z

For reviewers: in the latest commits I've made some more clarification renames:

KillTask -> KillUnusedSegmentsTask
TaskMonitor.killTask() -> cancelTask() (see also related Rename Overlord task "shutdown" to "cancel" #7361)
IndexingServiceClient.killTask() -> cancelTask()
IndexingServiceClient.killSegments() -> killUnusedSegments()
ClientQuery -> ClientTaskQuery
ClientKillQuery -> ClientKillUnusedSegmentsTaskQuery
ClientCompactQuery -> ClientCompactionTaskQuery
ClientCompactQueryTuningConfig -> ClientCompactionTaskQueryTuningConfig

CoordinatorDynamicConfig's properties:

millisToWaitBeforeDeleting -> millisLagSinceCoordinatorBecomesLeaderBeforeCanMarkAsUnusedOvershadowedSegments
killAllDataSources -> killUnusedSegmentsInAllDataSources
specificDataSourcesToKill -> specificDataSourcesToKillUnusedSegmentsIn
protectedPendingSegmentDatasources -> dataSourcesToNotKillStalePendingSegmentsIn

I kept the old names of JSON properties because aliases (for backward compatibility) are only supported in Jackson 2.9, see #7152.

DruidCoordinatorHelpers:

DruidCoordinatorCleanupPendingSegments -> DruidCoordinatorKillStalePendingSegments
DruidCoordinatorCleanupUnusedSegments -> DruidCoordinatorUnloadUnusedSegments
DruidCoordinatorCleanupOvershadowed -> DruidCoordinatorMarkAsUnusedOvershadowedSegments
DruidCoordinatorSegmentKiller -> DruidCoordinatorUnusedSegmentsKiller

(Next steps will be: remove "DruidCoordinator" prefixes and make all class names noun phrases or verb phrases, but not a mix.)

egor-ryashin · 2019-03-26T23:43:34Z

server/src/main/java/org/apache/druid/metadata/SqlMetadataSegments.java

@@ -75,9 +76,9 @@
 /**
 */
 @ManageLifecycle
-public class SQLMetadataSegmentManager implements MetadataSegmentManager
+public class SqlMetadataSegments implements MetadataSegments


SegmentsStorage ? ...Segments make me think about a factory class, while this one also makes update and delete operations.

Also, it's more like segmentsMetadata we work with here, not ...Segments directly, meaning, can we name it SegmentsMetadataStorage ?

Could you explain the factory association? I don't see a link.

To me, "Storage" doesn't contribute to meaning, as well as "Manager" which I removed.

SegmentsSqlMetadata may be a better name than SqlMetadataSegments though.

Renamed the interface to SegmentsMetadata and the implementation class to SqlSegmentsMetadata respectively.

I would like to see classes that contain DB API calls - to identify themselves in naming the consistent way across the project. So do you propose to start replacing Manager to Sql or it will be a single stray class like that?

I don't replace Manager with Sql - SQL was already in the name. I just removed Manager.

Got it. But what about I would like to see classes that contain DB API calls - to identify themselves in naming the consistent way across the project.

I think Sql is that identifier.

egor-ryashin · 2019-03-28T19:10:10Z

server/src/main/java/org/apache/druid/metadata/MetadataSegments.java

+   * This method's name starts with "prepare" for the same reason as {@link
+   * #prepareImmutableDataSourceWithUsedSegments}.
+   */
+  Collection<ImmutableDruidDataSource> prepareImmutableDataSourcesWithAllUsedSegments();


I wonder if All is necessary?

Do you think is unnecessary?

All makes sense in a specific use case, ie If there are two methods like:

getAll getRandomSample

otherwise All is just redundant.

This rename is already done and merged in #7653.

Hmm, it's confusing to see it here then, I spent time reading through those changes.

egor-ryashin · 2019-03-28T19:11:46Z

server/src/main/java/org/apache/druid/metadata/MetadataSegments.java

+
+  void stop();
+
+  boolean tryMarkAsUsedAllSegmentsInDataSource(String dataSource);


I wonder if All is necessary?

egor-ryashin · 2019-03-28T19:16:01Z

server/src/main/java/org/apache/druid/metadata/MetadataSegments.java

+   * This method is not removed because {@link org.apache.druid.server.http.DataSourcesResource#removeSegment}
+   * uses it and if it migrates to {@link #tryMarkSegmentAsUnused(SegmentId)} the performance will be worse.
+   */
+  boolean tryMarkSegmentAsUnused(String dataSource, String segmentId);


I wonder if As is necessary?

This question is missed, I guess.

This rename is already done and merged in #7653. This PR doesn't rename methods in MetadataSegmentManager anymore, it only renames the class itself.

server/src/main/java/org/apache/druid/metadata/MetadataSegments.java

egor-ryashin · 2019-03-28T21:41:32Z

server/src/main/java/org/apache/druid/server/coordinator/CoordinatorDynamicConfig.java

@@ -48,21 +49,21 @@
 {
  public static final String CONFIG_KEY = "coordinator.config";

-  private final long millisToWaitBeforeDeleting;
+  private final long millisLagSinceCoordinatorBecomesLeaderBeforeCanMarkAsUnusedOvershadowedSegments;


For me it's much easier/naturally to find a description of the method/variable then to read so long name without spaces between words.

Do you think it's easier for everyone, for example, people who are not Druid developers themselves?

I also think that name is too long, suggest initialDropOvershadowedWaitMillis instead, with additional clarification in docs

Or markOvershadowedAsUnusedInitialWaitMillis

I don't like the loss of meaning or even confusion in the "initial" adjective. Also, other configuration parameter names in this class don't drop the "Segments" part, except for replicantLifetime and replicationThrottleLimit which are bad names anyway. I also wanted to keep the "MarkAsUnused" part together, because it appears in different places in the code. What about leadingTimeMillisBeforeCanMarkAsUnusedOvershadowedSegments?

server/src/main/java/org/apache/druid/server/coordinator/CoordinatorDynamicConfig.java

egor-ryashin · 2019-03-28T22:05:52Z

server/src/main/java/org/apache/druid/server/coordinator/CoordinatorDynamicConfig.java

@@ -299,7 +314,8 @@ public boolean equals(Object o)

    CoordinatorDynamicConfig that = (CoordinatorDynamicConfig) o;

-    if (millisToWaitBeforeDeleting != that.millisToWaitBeforeDeleting) {
+    if (millisLagSinceCoordinatorBecomesLeaderBeforeCanMarkAsUnusedOvershadowedSegments !=
+        that.millisLagSinceCoordinatorBecomesLeaderBeforeCanMarkAsUnusedOvershadowedSegments) {


Whoa, it takes me a deal of concentration just to check those two variables represent the same property. Now I know when the bigger is not better regarding the variable naming.

Do you have some better names in mind?

millisLagBeforeMarkingUnusedOvershadowed

This property is already renamed to leadingTimeMillisBeforeCanMarkAsUnusedOvershadowedSegments in #7653.

egor-ryashin · 2019-03-28T23:40:01Z

docs/content/configuration/index.md

@@ -1023,7 +1023,7 @@ Worker select strategies control how Druid assigns tasks to middleManagers.

 ###### Equal Distribution

-Tasks are assigned to the middleManager with the most available capacity at the time the task begins running. This is
+Tasks are assigned to the middleManager with the most spare capacity at the time the task begins running. This is


"capacity" implies the maximum limit
with max free task slots ?

I don't know.

Let's do Tasks are assigned to the middleManager with the most free slots at the time

Also, let's write Middle Manager

Applied the first suggestion. I didn't replace the term "middleManager" (one word) with "Middle Manager" because the first version is used throughout the document in many places. I'm also not sure that the separate writing is better for searching in the docs. If you feel that the replacement would be an improvement, please open an issue.

egor-ryashin · 2019-03-28T23:50:44Z

docs/content/operations/metrics.md

@@ -205,8 +205,8 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina
 |`segment/loadQueue/failed`|Number of segments that failed to load.|server.|0|
 |`segment/loadQueue/count`|Number of segments to load.|server.|Varies.|
 |`segment/dropQueue/count`|Number of segments to drop.|server.|Varies.|
-|`segment/size`|Size in bytes of available segments.|dataSource.|Varies.|
-|`segment/count`|Number of available segments.|dataSource.|< max|
+|`segment/size`|Sum of sizes of all used segments belonging to a data source. Emitted only for data sources to which at least one used segment belongs.|dataSource.|Varies.|


Total size of used segments ?

I don't understand what do you mean by this question.

Let's do Sum of sizes of all used segments -> Total size of used segments

egor-ryashin · 2019-04-04T19:17:08Z

core/src/main/java/org/apache/druid/timeline/VersionedIntervalTimeline.java

+  public Collection<ObjectType> iterateAllObjects()
+  {
+    return CollectionUtils.createLazyCollectionFromStream(
+        () -> allTimelineEntries


allTimelineEntries makes me think there's another collection with a subset of "all", but it isn't.

It follows the naming of "allTimelineEntries". I think the logic behind this naming is that it includes overshadowed objects. I've extended the Javadoc of iterateAllObjects() to make it clearer.

core/src/main/java/org/apache/druid/timeline/VersionedIntervalTimeline.java

leventov · 2019-11-21T18:15:09Z

@pjain1 could you please take another look? I didn't test UI yet though because I have problems preparing a docker image though

lgtm-com · 2019-12-08T12:08:46Z

This pull request introduces 1 alert when merging 6da0b9d into 441515c - view on LGTM.com

new alerts:

1 for Useless comparison test

jihoonson

@leventov, I apologize for the very delayed review and thank you for your patience. Also thank you for adding a bunch of missing javadocs and comments. I left a couple of comments.

It doesn't seem to me that this PR is conflicting with the rename. I think it's supportive to the rename because in this PR I did some nasty work at finding and sweeping old names and variants of naming that were looking completely unrelated. Now for the rename, somebody needs to make just two full-text searches in the codebase on Github (I did many more when preparing this PR):

"used segment"
"unused segment"

Then visit all result files that look relevant manually and rename "used/unused segments" to "publish/published/unpublished". (Note that this search needs to be performed after this PR is merged, because Github indexes only master codebase for full-text search.)

I'm wondering you're planning to rename "used/unused segments" to "published/unpublished segments" in the near future. If not, I think this PR should change those names properly at least for the UI. If you do, I promise to review your follow-up PR as soon as possible.

jihoonson · 2020-01-22T00:52:06Z

core/src/main/java/org/apache/druid/metadata/MetadataStorageConnector.java

@@ -36,7 +37,7 @@ Void insertOrUpdate(
      byte[] value
  );

-  byte[] lookup(
+  @Nullable byte[] lookup(


Would you add a javadoc saying when it can be null? Perhaps it would be

/** * Returns the value of the valueColumn when there is only one row matched to the given key. * This method returns null if there is no such row and throws an error if there are more than one rows. */

Thanks for suggestion, applied

jihoonson · 2020-01-22T00:55:37Z

core/src/main/java/org/apache/druid/segment/SegmentUtils.java

   */
-  public static String commaSeparateIdentifiers(final Collection<DataSegment> segments)
+  public static Object commaSeparatedIdentifiers(final Collection<DataSegment> segments)


Hmm, the method name doesn't look valid anymore. Maybe lazyCollectionOfSegmentIds?
Also why not having Collection<SegmentId> as the return type instead of Object?

The purpose is to make this method unusable for anything except logging. To obtain a collection of SegmentIds, people should usually use Stream API to create a non-lazy collection.

jihoonson · 2020-01-22T01:08:28Z

core/src/main/java/org/apache/druid/timeline/partition/OvershadowableManager.java

@@ -89,22 +89,22 @@

  // (start partitionId, end partitionId) -> minorVersion -> atomicUpdateGroup
  private final TreeMap<RootPartitionRange, Short2ObjectSortedMap<AtomicUpdateGroup<T>>> standbyGroups;
-  private final TreeMap<RootPartitionRange, Short2ObjectSortedMap<AtomicUpdateGroup<T>>> visibleGroup;
+  private final TreeMap<RootPartitionRange, Short2ObjectSortedMap<AtomicUpdateGroup<T>>> visibleGroups;


Hmm, I intentionally used the singular form here to say that there is only one visibleGroup in any case. Perhaps the code should be refactored to be more clear.

I renamed to visibleGroupPerRange, does it make more sense? Also added a comment as per #8788

It sounds good to me. Thanks for adding a comment.

jihoonson · 2020-01-22T01:38:11Z

docs/design/coordinator.md

+"used" segments (that is, the segments that *should* be loaded in the cluster) and the loading rules.
+
+Before any unassigned segments are serviced by Historical processes, the Historical processes for each tier are first
+sorted in terms of capacity, with least capacity servers having the highest priority. Unassigned segments are always


Before any unassigned segments are serviced by Historical processes, the Historical processes for each tier are first
sorted in terms of capacity, with least capacity servers having the highest priority.

It seems like this statement was written in 2013.. I don't think this is true anymore. Maybe better to mention druid.coordinator.balancer.strategy and link https://github.com/apache/druid/blob/master/docs/configuration/index.md#coordinator-operation.

I think coordinator documentation should better be updated at once, There is #7201 tracking this

Sounds good to me.

jihoonson · 2020-01-22T01:39:32Z

docs/design/coordinator.md

@@ -51,7 +64,12 @@ Segments can be automatically loaded and dropped from the cluster based on a set

 ### Cleaning up segments

-Each run, the Druid coordinator compares the list of available database segments in the database with the current segments in the cluster. Segments that are not in the database but are still being served in the cluster are flagged and appended to a removal list. Segments that are overshadowed (their versions are too old and their data has been replaced by newer segments) are also dropped.
+On each run, the Druid Coordinator compares the set of used segments in the database with the segments served by some
+nodes in the cluster. Coordinator sends requests to Historical nodes to unload unused segments or segments the are


Suggest some nodes -> historical processes.

changed to Historical nodes because this term is also used in the section twice.

jihoonson · 2020-01-22T05:34:20Z

server/src/main/java/org/apache/druid/indexing/overlord/IndexerMetadataStorageCoordinator.java

+   * given data source from the metadata store.
+   *
+   * Unlike other similar methods in this interface, this method doesn't accept a {@link Segments} "visibility"
+   * parameter. This methods returns overshadowed segments, as if {@link Segments#INCLUDING_OVERSHADOWED} was passed.


I think it's more clear to say "The returned collection may include overshadowed segments and their created_dates, as if {@link Segments#INCLUDING_OVERSHADOWED} was passed. It's the responsibility of the caller to filter out overshadowed ones if needed`.

Thanks, changed.

jihoonson · 2020-01-22T05:36:32Z

server/src/main/java/org/apache/druid/indexing/overlord/IndexerMetadataStorageCoordinator.java

+   * Delete pending segments created in the given interval belonging to the given data source from the pending segments
+   * table. The {@code created_date} field of the pending segments table is checked to find segments to be deleted.
+   *
+   * Note that the semantic of the interval (for `created_date`s) is different from the semantic of the interva


typo: interva -> interval

Thanks, fixed

jihoonson · 2020-01-22T05:37:42Z

server/src/main/java/org/apache/druid/indexing/overlord/IndexerMetadataStorageCoordinator.java

+
+  /**
+   * Delete all pending segments belonging to the given data source from the pending segments table.
+   * The {@code created_date} field of the pending segments table is checked to find segments to be deleted.


This method doesn't seem to check the created_date field.

Why? Looks it does:

druid/server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java

Lines 894 to 910 in 6da0b9d

@Override

public int deletePendingSegmentsCreatedInInterval(String dataSource, Interval deleteInterval)

{

return connector.getDBI().inTransaction(

(handle, status) -> handle

.createStatement(

StringUtils.format(

"DELETE FROM %s WHERE datasource = :dataSource AND created_date >= :start AND created_date < :end",

dbTables.getPendingSegmentsTable()

)

)

.bind("dataSource", dataSource)

.bind("start", deleteInterval.getStart().toString())

.bind("end", deleteInterval.getEnd().toString())

.execute()

);

}

This javadoc is for a different method: int deletePendingSegments(String dataSource);

I see now, fixed. Thanks.

jihoonson · 2020-01-22T05:45:40Z

server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java

@@ -908,7 +1009,7 @@ private boolean segmentExists(final Handle handle, final DataSegment segment)
   * Read dataSource metadata. Returns null if there is no metadata.
   */
  @Override
-  public DataSourceMetadata getDataSourceMetadata(final String dataSource)
+  public @Nullable DataSourceMetadata retrieveDataSourceMetadata(final String dataSource)


I think it's more popular to put annotations before the line where the method access modifier is (just like @Override). Is this intended?

I prefer putting @Nullable before the type because semantically it characterizes the returned type, not the method (there are also modifiers and other annotations in between).

jihoonson · 2020-01-22T05:54:02Z

server/src/main/java/org/apache/druid/server/coordinator/duty/CoordinatorDuty.java

 import org.apache.druid.server.coordinator.DruidCoordinatorRuntimeParams;

 import javax.annotation.Nullable;

 /**
 *
 */
-public interface DruidCoordinatorHelper
+public interface CoordinatorDuty


I think duty is more clear to me than helper.

…ments

lgtm-com · 2020-01-22T13:54:42Z

This pull request introduces 1 alert when merging 10587c1 into d541cbe - view on LGTM.com

new alerts:

1 for Useless comparison test

leventov · 2020-01-22T14:38:00Z

@jihoonson thank you very much for taking time to review this PR.

I'm wondering you're planning to rename "used/unused segments" to "published/unpublished segments" in the near future. If not, I think this PR should change those names properly at least for the UI. If you do, I promise to review your follow-up PR as soon as possible.

I don't plan to make a follow-up rename. I don't want to make partial rename in the UI because it will create an inconsistency between UI and official documentation. Then, if docs are changed, there is an inconsistency between docs and code. And so on.

So I think there should be a one-time rename. In the part of my comment which you cited, I explained why I think that this PR in its current state is at least not a move in the wrong direction, in terms of rename, because it will greatly simplify the future rename, whoever decided to do it.

jihoonson · 2020-01-23T02:03:32Z

I don't plan to make a follow-up rename. I don't want to make partial rename in the UI because it will create an inconsistency between UI and official documentation. Then, if docs are changed, there is an inconsistency between docs and code. And so on.

So I think there should be a one-time rename. In the part of my comment which you cited, I explained why I think that this PR in its current state is at least not a move in the wrong direction, in terms of rename, because it will greatly simplify the future rename, whoever decided to do it.

@leventov thanks, it makes sense to me.

vogievetsky · 2020-01-23T08:39:26Z

web-console/src/views/datasource-view/datasource-view.tsx

-  dropReloadDatasource?: string;
-  dropReloadAction: 'drop' | 'reload';
-  dropReloadInterval: string;
+  dataSourceToMarkSegmentsByIntervalIn?: string;


This is a great rename, it is good to have the UI aligned with the docs. Could you please make sure that we treat "datasource" as one word I have been consistently using "datasource" for the Druid table thing and "data source" for the source of data like S3 or HDFS (in the data loader).

So this should be datasourceToMarkSegmentsByIntervalIn

vogievetsky · 2020-01-23T08:43:36Z

web-console/src/views/datasource-view/datasource-view.tsx

-        confirmButtonText="Drop data"
-        successText="Data drop request acknowledged, next time the coordinator runs data will be dropped"
-        failText="Could not drop data"
+        confirmButtonText="Mark as unused all segments in data source"


This is way too long to be a button. Maybe just call it "Mark as unused" and make sure to specify what is going on in the text of the dialog (which you are mostly doing already)

I shortened confirm button texts but just removed "in data source" parts, IMO even shorter names are too ambiguous.

lgtm-com · 2020-01-23T19:07:12Z

This pull request introduces 1 alert when merging 2ad6dd5 into 83ddc8d - view on LGTM.com

new alerts:

1 for Useless comparison test

jihoonson

The latest changes LGTM. @vogievetsky do you have more comments? @leventov sorry to say, but would you mind fixing the conflict one more time?

vogievetsky · 2020-01-24T01:34:09Z

Web console changes LGTM 👍

leventov · 2020-01-26T11:28:13Z

An interesting lucky occasion: the conflict was in the specific file and very close to the line where I actually inserted a bug, which was not caught by tests. (Raised #9257)

leventov · 2020-01-26T11:28:55Z

@jihoonson @vogievetsky thanks for reviews!

lgtm-com · 2020-01-26T12:03:30Z

This pull request introduces 1 alert when merging 6745f4a into 19b427e - view on LGTM.com

new alerts:

1 for Useless comparison test

jihoonson · 2020-01-27T19:24:21Z

The line where LGTM complains about is to verify two static variables have the same value which should have been the same variable but stored separately because of the module dependency. So, the LGTM failure doesn't look legit in this case. I'm not sure whether we can skip the LGTM inspection for some particular lines. I'm merging this PR since it has been open for too long time. We can address this issue later in a follow-up PR.

leventov added Design Review Compatibility Refactoring labels Mar 20, 2019

leventov requested a review from egor-ryashin March 20, 2019 21:00

leventov added 6 commits March 20, 2019 18:02

Merge remote-tracking branch 'upstream/master' into used-segments

e44e199

Fix brace

25de70d

Import order

fe13d9c

Rename withKillDataSourceWhitelist to withSpecificDataSourcesToKill

050bd3b

Fix tests

562f5b4

Fix tests by adding proper methods without interval parameters to Ind…

a6138da

…exerMetadataStorageCoordinator instead of hacking with Intervals.ETERNITY

leventov added 2 commits March 28, 2019 14:03

More aligned names of DruidCoordinatorHelpers, rename several Coordin…

cb6fe7f

…atorDynamicConfig parameters

Merge remote-tracking branch 'upstream/master' into used-segments

164f783

leventov added the Development Blocker label Mar 28, 2019

leventov added 2 commits March 28, 2019 19:19

Rename ClientCompactTaskQuery to ClientCompactionTaskQuery for consis…

d959bb0

…tency with CompactionTask; ClientCompactQueryTuningConfig to ClientCompactionTaskQueryTuningConfig

More variable and method renames

01c4494

egor-ryashin requested changes Mar 28, 2019

View reviewed changes

egor-ryashin requested changes Mar 29, 2019

View reviewed changes

leventov added 4 commits April 1, 2019 14:57

Rename MetadataSegments to SegmentsMetadata

06ec389

Merge remote-tracking branch 'upstream/master' into used-segments

09fab95

Javadoc update

f4774e1

Simplify SegmentsMetadata.getUnusedSegmentIntervals(), more javadocs

8d5200d

egor-ryashin requested changes Apr 5, 2019

View reviewed changes

Merge remote-tracking branch 'upstream/master' into used-segments

6da0b9d

jihoonson reviewed Jan 22, 2020

View reviewed changes

leventov mentioned this pull request Jan 22, 2020

Coordinator documentation is stale #7201

Open

leventov added 2 commits January 22, 2020 13:29

Rename SegmentsMetadata to SegmentsMetadataManager, address other com…

d0bf20a

…ments

Merge remote-tracking branch 'upstream/master' into used-segments

10587c1

vogievetsky requested changes Jan 23, 2020

View reviewed changes

Address more comments

2ad6dd5

jihoonson reviewed Jan 23, 2020

View reviewed changes

vogievetsky approved these changes Jan 24, 2020

View reviewed changes

jihoonson approved these changes Jan 24, 2020

View reviewed changes

Fix a bug in DataSourceOptimizer

6745f4a

leventov mentioned this pull request Jan 26, 2020

Improve test coverage of DataSourceOptimizer #9257

Open

leventov changed the title ~~Reconcile terminology and method naming to 'used/unused segments'; Rename MetadataSegmentManager to MetadataSegments~~ Reconcile terminology and method naming to 'used/unused segments'; Rename MetadataSegmentManager to MetadataSegmentsManager Jan 26, 2020

jihoonson merged commit b9186f8 into apache:master Jan 27, 2020

clintropolis mentioned this pull request Mar 1, 2020

OpenJDK historical segment hand-off fails with Illegal Field Name error #8813

Closed

jihoonson mentioned this pull request Mar 25, 2020

Eclipse Java Compiler issues fixed #8818

Closed

jihoonson added this to the 0.18.0 milestone Mar 26, 2020

jihoonson mentioned this pull request Apr 9, 2020

[Draft] 0.18.0 release notes #9652

Closed

gianm mentioned this pull request Apr 23, 2020

Fix removeDataSource() and removeSegment() of SQLMetadataSegmentManager #5001

Merged


		void stop();

		boolean tryMarkAsUsedAllSegmentsInDataSource(String dataSource);

	@Override
	public int deletePendingSegmentsCreatedInInterval(String dataSource, Interval deleteInterval)
	{
	return connector.getDBI().inTransaction(
	(handle, status) -> handle
	.createStatement(
	StringUtils.format(
	"DELETE FROM %s WHERE datasource = :dataSource AND created_date >= :start AND created_date < :end",
	dbTables.getPendingSegmentsTable()
	)
	)
	.bind("dataSource", dataSource)
	.bind("start", deleteInterval.getStart().toString())
	.bind("end", deleteInterval.getEnd().toString())
	.execute()
	);
	}

Reconcile terminology and method naming to 'used/unused segments'; Rename MetadataSegmentManager to MetadataSegmentsManager #7306

Reconcile terminology and method naming to 'used/unused segments'; Rename MetadataSegmentManager to MetadataSegmentsManager #7306

Conversation

leventov commented Mar 20, 2019 • edited Loading

Description

Reconcile terminology and method naming to 'used/unused segments'

Don't use terms 'enable/disable data source'

Rename MetadataSegmentManager to SegmentsMetadataManager

Rename DruidCoordinatorHelper to CoordinatorDuty

Rename IndexerSQLMetadataStorageCoordinator's get* methods to retrieve*

Rename SegmentsAndMetadata class to SegmentsAndCommitMetadata

Rename several indexing service task-related classes

Refactoring of DruidCoordinator

Future work

gianm commented Mar 26, 2019

leventov commented Mar 26, 2019

gianm commented Mar 27, 2019

leventov commented Mar 27, 2019 • edited Loading

gianm commented Mar 27, 2019

leventov commented Mar 28, 2019

leventov commented Mar 28, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

egor-ryashin Jul 31, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

egor-ryashin Jul 31, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jon-wei Apr 9, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

leventov commented Nov 21, 2019

lgtm-com bot commented Dec 8, 2019

jihoonson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

leventov Jan 22, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

leventov commented Mar 20, 2019 •

edited

Loading

Rename `MetadataSegmentManager` to `SegmentsMetadataManager`

Rename `DruidCoordinatorHelper` to `CoordinatorDuty`

Rename `IndexerSQLMetadataStorageCoordinator`'s `get` methods to `retrieve`

Rename `SegmentsAndMetadata` class to `SegmentsAndCommitMetadata`

Refactoring of `DruidCoordinator`

leventov commented Mar 27, 2019 •

edited

Loading

leventov commented Mar 28, 2019 •

edited

Loading

egor-ryashin Jul 31, 2019 •

edited

Loading

egor-ryashin Jul 31, 2019 •

edited

Loading

jon-wei Apr 9, 2019 •

edited

Loading

leventov Jan 22, 2020 •

edited

Loading