-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce issue-based artifact-network creation, new tests and fix current tests + minor bug fixes #244
Conversation
Introduce edge construction into the 'get.artifact.network.issue' function. Connect 'add_link' and 'referenced_by' issue-events by an edge. This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
Add new 'add_link' and 'referenced_by' issue events to the testing data to allow for new tests. Add a test for the construction of an issue-based artifact-network with 'issues.only.comments' turned on and off. This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
For some issue-events the 'event.info.1' column is to be interpreted as a reference to an issue. In this case it is more consistent to reformat the column entry into the <issue-[issue.source]-[event.info.1]> format. This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
To be consistent with bipartite networks it is necessary to rename the vertex attribute 'IssueEvent' to 'Issue' in multi-networks. This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
Modify 'split.data.time.based' to be able to split by activity-based bins. Rename the function to 'split.data.by.time.or.bins'. Introduce wrapper functions 'split.data.by.bins.vector' and 'split.data.time.based' to call 'split.data.by.time.or.bins'. Add 'include.duplicate.ids' parameter in 'split.get.bins.activity.based' to obtain bins covering all data elements from 'df' by which the split is being performed, regardless of the elements ids uniqueness. In 'split.data.activity.based', after calculating the bins to place data elements into, replace the time-based splitting by 'split.data.by.bins.vector'. Time-based splitting is incorrect for the case that the date of the last element in a bin is the same as the date of the first element of the next bin. Adjust calculation of 'offset.end' in 'split.data.activity.based' to fix a bug where because of a short last window the end offset would cross the border of the last window, overlapping into the second last. Because of this overlap the last sliding windows would not be calculated as expected. This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MaLoefUDS for your work on this PR and fixing all the bugs you encountered during adjusting all the test data. Your changes look good to me, I only have a couple of questions regarding some implementation details, some of which need further investigations prior to discussing it - see my comments below.
Notice: I did not have a look at the adjusted tests yet, and I also will need another pass on the changes to the splitting functionality to be able to completely understand all of them...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As promised, I had a closer look at the changed splitting functionality, and I also started to have a look at the tests (these were the two parts that were still missing from my last review.)
For the splitting functionality, I have a couple of questions that I would like to discuss in our next meeting to get a better understanding of what's going on there. Please have a look at my individual comments (most of them address just comments, but we should discuss them though). In general, the new functionality looks good to me, but I would like to make sure that I understand everything correctly and that we consider special corner cases correctly.
For the tests, I decided to skip reviewing the tests for now as I identified an unexpected change in the test data - we should discuss this in our next meeting before I will have a closer look at the tests themselves. Again, see my detailed comments below.
tests/codeface-data/results/testing/test_feature/feature/issues-github.list
Show resolved
Hide resolved
tests/codeface-data/results/testing/test_feature/feature/issues-github.list
Outdated
Show resolved
Hide resolved
util-split.R
Outdated
@@ -214,7 +270,7 @@ split.data.time.based = function(project.data, time.period = "3 months", bins = | |||
|
|||
## add splitting information to project configuration | |||
project.conf.new$set.splitting.info( | |||
type = "time-based", | |||
type = if (split.by.time) "time-based" else "activity-based", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not related to this PR, but still interesting): Do we set the type for the splitting info also in other functions? Maybe it would be an idea to introduce string constants for the types, if there are just these two. (Does not need to be done in this PR, we can open another issue for that in take of this later. So just having a look whether these types appear also somewhere else be enough for now....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are only set in one or two place (except in tests where expected results are built). I dont think its worth creating constants for them yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Then let's ignore that for now.
This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
Instead of creating only undirected issue-based artifact-networks, we now take the directedness information out of the network config. Edges are already created in a way that they can be interpreted as directed edges from the issue referencing to the referenced issue. Replace for loop in edge creation by more efficient mclapply and fix some minor formatting inconsistencies. This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
Rework the algorithm to create sliding windows in activity-based splitting. Instead of cutting off half a range many elements at the end before building sliding windows (which creates a lot of edge cases), build sliding windows with every element up to the last one. Then remove the last incomplete range. The contents of the last incomplete range will be fully included in the second last range and therefore redundant. Sometimes the last incomplete range is a regular range. Previously the last range always had to be a regular range. This means that removing the last incomplete range requires updating the tests. Additionally fix and improve documentation of splitting methods and fix minor spelling bugs. This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
This renaming makes sense as the method only splits dataframes. Rename 'split.data.by.bins.vector' to 'split.data.by.bins' as it is more readable and easier to understand. This works towards se-sic#239. Signed-off-by: Maximilian Löffler <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. I still haven't managed to look at the tests. But I have had a look at the new changes of the main implementation. In general, the new changes look good. I only spotted a few comments that are still hard to understand and one implementation detail which I don't understand, making it hard for me to judge its validity. See my individual comments below.
I will have a look at the tests in the next iteration.
There is now only one point open (excluding the NEWS.md). @hechtlC What is your opinion on adding input validation for the |
Signed-off-by: Maximilian Löffler <[email protected]>
@MaLoefUDS There is still one other point open: "When we take the directed option from the network configuration, we also should have a test for the directed issue artifact network." As far as I can see, the tests in tests/test-networks-artifact.R only test undirected networks. Edit: In an earlier version I wrote "packrat" instead of "patrick", sorry. |
Check type of the 'bins' parameter and its components in the wrapper functions 'split.data.by.time' and 'split.data.by.bins'. Move 'split.data.by.time.or.bins' into a new category for internal helper functions to discourage direct invocation. Signed-off-by: Maximilian Löffler <[email protected]>
Signed-off-by: Maximilian Löffler <[email protected]>
a87782e
to
1f62944
Compare
(I just found out that I used paste0 is not what I want here so I needed to amend once more 💀) |
Check if the 'bins' parameter of the 'split.data.by.bins' actually contains 'bins' component. Use 'get.date.from.string' instead of accessing 'lubridate::ymd_hms' directly, to encapsulate date conversion. Allow 'vector' component of 'bins' to be of any subclass of 'numeric' instead of explicitly 'numeric'. Disallow lists that contain elements that are not representing a date in 'split.data.time.based' as they do not comply with the expected format of bins for 'split.data.by.time.or.bins'. Signed-off-by: Maximilian Löffler <[email protected]>
cbc80e1
to
6e1beeb
Compare
Add tests that call 'split.data.time.based' and 'split.data.by.bins' with various malformed 'bins' parameters and expect failure. Signed-off-by: Maximilian Löffler <[email protected]>
Thanks @MaLoefUDS for addressing my last comment. Now, everything I've spotted seems to be fixed now. So, given that I don't spot anything problematic in the changed tests (but we'll see), the only thing missing is the update of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found some time to have a look at the tests on a sample basis.
In general, the tests (at which I had a look at) look good. I spotted a few minor inconsistencies and a few changes that we should double-check and document.
In addition, I spotted one inconsistency in the test data 🙈
Moreover, after everything is fixed in the test data, we need to copy
test_feature/feature/issues-github.list
to test_proximity/proximity/issues-github.list
, and the same also for issues-jira.list
, as our basic assumption is that the issue data for feature and proximity are the same.
tests/codeface-data/results/testing/test_feature/feature/issues-jira.list
Outdated
Show resolved
Hide resolved
Reverse order of reference from Jira issue 332 and Jira issue 328, since previously, Jira issue 332 was referenced before its creation. Copy GitHub and Jira issue data from 'test_feature/feature' also to 'test_proximity/proximity' to keep the data consistent. Adjust all effected tests to comply with the changed testing data. Signed-off-by: Maximilian Löffler <[email protected]>
Signed-off-by: Maximilian Löffler <[email protected]>
Signed-off-by: Maximilian Löffler <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the NEWS.md. There is one major issue: You have added the changes to the v4.3 version, which has already been released. This is not a good idea, we will not change a past release. So, please move all your changes to a new section and just call it ## unversioned
. We will add a version number when the next release will be scheduled. Also make sure that we have two empty lines when the version section ends, to be consistent with the previous versions.
I agree with the commits you have referenced, but sometimes I would like to add another commit in addition, if a change is not fully implemented in one commit (and here I only speak of real implementation commits that affect the behavior, not only tiny changes). I added suggestions for these commits below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And one more comment: Please use backticks around function names and strings such as referenced_by
or IssueEvent
instead of ', such that it is properly highlighted by markdown.
Also use backticks instead of single ticks for proper markdown highlighting. Improve changelog messages by clarifing and properly focusing on the important changes as well as adding more relevant commit hashes. Signed-off-by: Maximilian Löffler <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, everything seems to be fine with this PR.
Many thanks @MaLoefUDS for your commitment and endurance in this PR. This was a really tough one, also due to plenty of necessary changes to the test data and splitting tests. But we also spotted bugs along this way and improved the splitting functionality substantially, which is a good result for a long PR that was supposed to "just" introduce issue-based artifact networks 😄
* Adapt the sliding-window data-splitting approach from PR se-sic#244 to network splitting, especially the removal of the last (redundant) range. * Fix an edge-case bug in said algorithm, where the wrong last split is removed. This could happen when the subject to split can be exactly divided by the desired size of a split. * Fix a faulty test, that only became apparent after the adjustments to the sliding-window algorithm in network-splitting Signed-off-by: Maximilian Löffler <[email protected]>
* Adapt the sliding-window data-splitting approach from PR se-sic#244 to network splitting, especially the removal of the last (redundant) range. * Fix an edge-case bug in said algorithm, where the wrong last split is removed. This could happen when the subject to split can be exactly divided by the desired size of a split. * Fix a faulty test, that only became apparent after the adjustments to the sliding-window algorithm in network-splitting Signed-off-by: Maximilian Löffler <[email protected]>
* Adapt the sliding-window data-splitting approach from PR se-sic#244 to network splitting, especially the removal of the last (redundant) range. * Fix an edge-case bug in said algorithm, where the wrong last split is removed. This could happen when the subject to split can be exactly divided by the desired size of a split. * Fix a faulty test, that only became apparent after the adjustments to the sliding-window algorithm in network-splitting Signed-off-by: Maximilian Löffler <[email protected]>
* Adapt the sliding-window data-splitting approach from PR se-sic#244 to network splitting, especially the removal of the last (redundant) range. * Fix an edge-case bug in said algorithm, where the wrong last split is removed. This could happen when the subject to split can be exactly divided by the desired size of a split. * Fix a faulty test, that only became apparent after the adjustments to the sliding-window algorithm in network-splitting Signed-off-by: Maximilian Löffler <[email protected]>
Prerequisites
showcase.R
with respect to my changes.dev
.Changelog
Added
split.data.by.bins
function (not to be confused with a previously existing function that had the same name and was renamed in this context), which splits data based on given activity-based binsChanged/Improved
add_link
andreferenced_by
issue events which connect issues to form edges in issue-based artifact-networksbins
parameter insplit.data.time.based
andsplit.data.by.bins
split.data.by.bins
intosplit.dataframe.by.bins
as this it what it doesFixed
event.info.1
column of issue data according to the <issue-%source-%id> format, if the content of theevent.info.1
field references another issuesplit.data.time.based
insidesplit.data.activity.based
to split data into the previously derived bins, when elements close to bin borders share the same timestamps. It is fixed by replacingsplit.data.time.based
bysplit.data.by.bins
IssueEvent
toIssue
in multi-networks, to be consistent with bipartite-networks