[DAE-63] Handling exception when adding duplicate partitions #37

Merged
merged 8 commits into from
Jan 27, 2021
Changes from 2 commits
8 changes: 6 additions & 2 deletions hive_metastore_client/hive_metastore_client.py
@@ -115,7 +115,7 @@ def drop_columns_from_table(

def add_partitions_to_table(
self, db_name: str, table_name: str, partition_list: List[Partition]
- ) -> None:
+ ) -> bool:
"""
Add partitions to a table.

@@ -131,7 +131,11 @@ def add_partitions_to_table(
table_partition_keys=table.partitionKeys,
)

- self.add_partitions(partition_list_with_correct_location)
+ try:
+     self.add_partitions(partition_list_with_correct_location)
+     return True
+ except AlreadyExistsException:
+     return False

no worries on returning boolean, but I think that it might get confusing 🤔
I mean, a "False" return implies, in my opinion, that the partition addition was unsuccessful.

makes sense?

@LucasMMota LucasMMota Jan 21, 2021

Our thinking is this: the method adds partitions to the table. If it returns True, the operation was successful; if it returns False, the operation was not completed. That is the case here. If a partition does not yet exist for that table, the method adds it and returns True. If the partition already exists, the method returns False, indicating that the operation was not completed because the partition had already been added. As you said, the partition addition was unsuccessful, since a partition cannot be added twice.
Also, as a third case, any other exception that occurs will be raised.
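
A minimal sketch of the three cases described above. It uses stand-ins only: `FakeMetastoreClient` and the local `AlreadyExistsException` class are hypothetical and replace the real Thrift-backed client, so this illustrates the return contract rather than the library's actual implementation.

```python
class AlreadyExistsException(Exception):
    """Stand-in for the Thrift-generated exception (assumption)."""


class FakeMetastoreClient:
    """Hypothetical in-memory client, used only to show the return contract."""

    def __init__(self):
        self._partitions = set()

    def add_partitions(self, partitions):
        # Mimics the native call: refuses to add a partition twice.
        duplicates = self._partitions.intersection(partitions)
        if duplicates:
            raise AlreadyExistsException(f"already exists: {sorted(duplicates)}")
        self._partitions.update(partitions)

    def add_partitions_to_table(self, partitions):
        # Same try/except shape as in the diff under review.
        try:
            self.add_partitions(partitions)
            return True
        except AlreadyExistsException:
            return False


client = FakeMetastoreClient()

# Case 1: the partition is new, so it is added.
first_call = client.add_partitions_to_table(["dt=2021-01-21"])

# Case 2: the partition already exists, so the duplicate is swallowed.
second_call = client.add_partitions_to_table(["dt=2021-01-21"])

# Case 3: any other error (here a TypeError from a None argument) propagates.
try:
    client.add_partitions_to_table(None)
    third_case = "no error"
except TypeError:
    third_case = "propagated"
```

Only `AlreadyExistsException` is caught, so the boolean covers exactly one failure mode and everything else still surfaces to the caller.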

yeah, got it! but let me bring another point of view: if the "operation was not complete", as you said, what stops me from adding another exception that returns False? do you agree it can get confusing over time?

moreover, when you call this method, the expected behavior is that the partition will be available for use, right?
so re-adding it shouldn't cause any errors or misbehavior, since the operation is exactly the same.

anyway, maybe separating this into methods like add_partitions_if_not_exists (that throws an exception in case it does) and another add_or_replace_partitions could make it more explicit

> yeah, got it! but let me bring another point of view: if the "operation was not complete", as you said, what stops me from adding another exception that returns False? do you agree it can get confusing over time?

Yes, if more exceptions were added to this method (each returning False), it would get messy. I think that shouldn't be done. We want to guarantee that if we call it twice, it adds the partition the first time and does nothing the second time, instead of throwing an error after the first call. That is why I added this except: to keep adding a partition (duplicate or not) "clean" and free of errors.
If other errors are raised, they should propagate rather than be caught by the try block; catching them would silently mask them and go against the method's objective.

> anyway, maybe separating this into methods like add_partitions_if_not_exists (that throws an exception in case it does) and another add_or_replace_partitions could make it more explicit

I'm not sure I completely understood your suggestion about these two methods.
If we throw an exception in add_partitions_if_not_exists, wouldn't the user of the client need to handle it?
As for add_or_replace_partitions, I'm confused about what you mean by replacing, because I see only two operations on partitions: add or remove. I didn't understand what replacing a partition would mean.

Another option would be to call the native add_partitions and handle the AlreadyExistsException on the user side. This was one of my options at first, but I decided to keep things cleaner for lib users by encapsulating this on the client side.
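
For comparison, a rough sketch of what the user-side alternative could look like. `NativeClientStub`, `load_partitions`, and the local `AlreadyExistsException` are hypothetical names introduced for illustration; the real native call goes through the Thrift client and is not reproduced here.

```python
class AlreadyExistsException(Exception):
    """Stand-in for the Thrift-generated exception (assumption)."""


class NativeClientStub:
    """Hypothetical stub exposing only a native-style add_partitions."""

    def __init__(self):
        self.stored = []

    def add_partitions(self, partitions):
        # Reject the whole call if any partition is already stored.
        for partition in partitions:
            if partition in self.stored:
                raise AlreadyExistsException(partition)
        self.stored.extend(partitions)


def load_partitions(client, partitions):
    """Caller-side handling: each user of the library repeats this try/except."""
    try:
        client.add_partitions(partitions)
    except AlreadyExistsException:
        # The caller decides that a duplicate is not an error.
        return "skipped"
    return "added"


stub = NativeClientStub()
outcomes = [load_partitions(stub, ["dt=2021-01-22"]) for _ in range(2)]
```

The trade-off is the one raised in the comment: this keeps the client thin, but every caller has to import the exception and repeat the same handling.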

@LucasMMota LucasMMota Jan 22, 2021

Done: 90f7b99


def create_database_if_not_exists(self, database: Database) -> bool:
"""
47 changes: 43 additions & 4 deletions tests/unit/hive_metastore_client/test_hive_metastore_client.py
@@ -164,20 +164,59 @@ def test_add_partitions_to_table(
# arrange
db_name = "database_name"
table_name = "table_name"
- formatted_partitions_location = ["abc"]

mocked_table = Mock()
mocked_get_table.return_value = mocked_table

mocked_partition_list = [Mock()]
+ formatted_partitions_location = ["abc"]
mocked__format_partitions.return_value = formatted_partitions_location

# act
returned_value = hive_metastore_client.add_partitions_to_table(
db_name=db_name, table_name=table_name, partition_list=mocked_partition_list
)

# assert
assert returned_value
mocked_get_table.assert_called_once_with(dbname=db_name, tbl_name=table_name)
mocked__format_partitions.assert_called_once_with(
partition_list=mocked_partition_list,
table_storage_descriptor=mocked_table.sd,
table_partition_keys=mocked_table.partitionKeys,
)
mocked_add_partitions.assert_called_once_with(formatted_partitions_location)

@mock.patch.object(HiveMetastoreClient, "get_table")
@mock.patch.object(HiveMetastoreClient, "_format_partitions_location")
@mock.patch.object(HiveMetastoreClient, "add_partitions")
def test_add_partitions_to_table_with_duplicated_partitions(
self,
mocked_add_partitions,
mocked__format_partitions,
mocked_get_table,
hive_metastore_client,
):
# arrange
db_name = "database_name"
table_name = "table_name"

mocked_table = Mock()
mocked_table.sd = ""
mocked_table.partitionKeys = ""
mocked_get_table.return_value = mocked_table

mocked_partition_list = [Mock()]
formatted_partitions_location = ["abc"]
mocked__format_partitions.return_value = formatted_partitions_location

mocked_add_partitions.side_effect = AlreadyExistsException()

# act
- hive_metastore_client.add_partitions_to_table(
+ returned_value = hive_metastore_client.add_partitions_to_table(
db_name=db_name, table_name=table_name, partition_list=mocked_partition_list
)

# assert
assert not returned_value
mocked_get_table.assert_called_once_with(dbname=db_name, tbl_name=table_name)
mocked__format_partitions.assert_called_once_with(
partition_list=mocked_partition_list,
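
The duplicate-partition test above relies on `side_effect` to make the mocked `add_partitions` raise. A self-contained sketch of that pattern follows, under the assumption that a reduced `ClientUnderTest` class and a local `AlreadyExistsException` can stand in for the real client and the Thrift exception; neither name comes from the repository.

```python
from unittest import mock


class AlreadyExistsException(Exception):
    """Stand-in for the Thrift-generated exception (assumption)."""


class ClientUnderTest:
    """Hypothetical reduced client; only the wrapper logic from the diff is kept."""

    def add_partitions(self, partitions):
        raise NotImplementedError("replaced by a mock in the test")

    def add_partitions_to_table(self, partitions):
        try:
            self.add_partitions(partitions)
            return True
        except AlreadyExistsException:
            return False


def run_duplicate_case():
    client = ClientUnderTest()
    with mock.patch.object(ClientUnderTest, "add_partitions") as mocked_add:
        # Assigning an exception instance to side_effect makes the mock raise it.
        mocked_add.side_effect = AlreadyExistsException()
        returned_value = client.add_partitions_to_table(["abc"])
        # The wrapper must still have attempted the native call exactly once.
        mocked_add.assert_called_once_with(["abc"])
    return returned_value


duplicate_result = run_duplicate_case()
```

Asserting both the return value and the call to the mock checks that the duplicate is swallowed only after the native call was actually attempted.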