Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further improvements to splitting module #187

Closed
3 of 4 tasks
bockthom opened this issue Dec 1, 2020 · 4 comments
Closed
3 of 4 tasks

Further improvements to splitting module #187

bockthom opened this issue Dec 1, 2020 · 4 comments

Comments

@bockthom
Copy link
Collaborator

bockthom commented Dec 1, 2020

Here, we collect further ideas from PR #184 on how to improve the splitting module. Thanks to @clhunsen for your suggestions!

  1. Add utility function to obtain the bin labels (bins.date) from ranges: construct.bin.labels.from.ranges get.bin.dates.from.ranges [edited on 2024-02-29]
  2. Add bins attribute to list of split networks also in functions split.network.time.based.by.ranges and split.networks.by.bins. To be able to do that, we need functions like construct.bin.labels.from.ranges (see above) to extract the bins from the ranges. (See also the current occurrences of attr(nets, "bins") = bins.date). After adding this attribute in the mentioned functions, we do not need to set this attribute in the function split.networks.time.based anymore.
  3. When working on network-based splitting anyway (as in the previous item), we should check whether the splitting for networks works correctly with respect to the corner cases on sliding-window splitting that we have fixed in PR Introduce issue-based artifact-network creation, new tests and fix current tests + minor bug fixes #244.
  4. If we construct ranges in split.networks.time.based after Line 660 and after Line 668 (both using construct.ranges(bins.date, sliding.window = ...)), we could omit the if-else cascade in Lines 682ff. This would decrease the complexity of the functionality below.

    coronet/util-split.R

    Lines 660 to 691 in 78e99a1

    if (sliding.window) {
    ranges = construct.overlapping.ranges(start = min(dates), end = max(dates),
    time.period = time.period, overlap = 0.5, raw = FALSE,
    include.end.date = TRUE)
    bins.info = construct.overlapping.ranges(start = min(dates), end = max(dates),
    time.period = time.period, overlap = 0.5, raw = TRUE,
    include.end.date = TRUE)
    bins.date = sort(unname(unique(get.date.from.unix.timestamp(unlist(bins.info)))))
    } else {
    bins.info = split.get.bins.time.based(dates, time.period, number.windows)
    bins.date = get.date.from.string(bins.info[["bins"]])
    }
    } else {
    ## specific bins are given, do not use sliding windows
    sliding.window = FALSE
    ## set the bins to use
    bins.date = bins
    }
    ## split all networks to the extracted bins
    networks.split = lapply(networks, function(net) {
    if (sliding.window) {
    nets = split.network.time.based.by.ranges(network = net, ranges = ranges,
    remove.isolates = remove.isolates)
    attr(nets, "bins") = bins.date
    } else {
    nets = split.network.time.based(network = net, bins = bins.date, sliding.window = sliding.window,
    remove.isolates = remove.isolates)
    }
    return(nets)
    })

    (We don't want to implement the fourth suggestions, since it would add one additional, but unnecessary indirection, as split.network.time.based.by.ranges would in turn call split.network.time.based, and this would be one more indirection in the else case than what the current implementation does. Therefore, I have crossed out the last suggestion.) [edited on 2023-11-15]

All three suggestions should be easy to implement and should, therefore, easily improve the code of the splitting module.

@bockthom bockthom added this to the v3.8 milestone Dec 1, 2020
@bockthom bockthom modified the milestones: v3.8, v3.9 Mar 8, 2021
@bockthom bockthom modified the milestones: v4.1, v4.2 Mar 24, 2022
@bockthom bockthom modified the milestones: v4.2, v4.3 Oct 24, 2022
@bockthom bockthom modified the milestones: v4.3, v4.4 Apr 17, 2023
@maxloeffler
Copy link
Contributor

I have worked on the first two points and have a finished commit here: maxloeffler@a18a932. I will not open a pull request just now though, as we discussed, to not interfere with this #244. I will look into the corner cases (point three) from now on :)

@bockthom
Copy link
Collaborator Author

bockthom commented Dec 6, 2023

I have worked on the first two points and have a finished commit here: MaLoefUDS@a18a932.

Thanks! I had a quick look at it, and your commit looks good to me (given that the functionality did not change, what we will hopefully see in the tests - did you already check whether the tests for the different kinds of splitting do also test for the bins attribute?).
I just added a comment regarding the placement new helper function.

@maxloeffler
Copy link
Contributor

I manually stepped through the code and checked that the corner cases we discovered in my last PR do not apply to activity-based network splitting, as they did to activity-based data splitting.

This has mainly the reason, that the id.column of activity-based network splitting is not obtained from some metric of the network but artificially assigned in the form of a monotonic counter. Therefore we do not run into issues of multiple elements having the same id and resulting corner cases.

There was also an issue that was caused by the offset.end calculation and how it was in the following in activity-based data splitting. The offset.end calculation seems to be identical in network splitting, but I was not able to find problems with it this time, maybe because it it used differently. E.g., the check where we check if the last range should be dropped is different to both the version we had before in activity-based data splitting, as well as to the one we have right now (Should I update it to be consistent to the one data splitting we have right now?).

@maxloeffler
Copy link
Contributor

Also about your last question regarding tests. In the last PR I added tests, that test for the validity of the bins parameter in data splitting. By "did you already check whether the tests for the different kinds of splitting do also test for the bins attribute?", do you mean I should add similar tests for network splitting? Or do you mean something else?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants