Feature/graph analysis metrics #42
Conversation
Awesome, nice work! Great to have the full metrics functionality for habitats as well!
My questions are mainly about using it: how to ensure the `GeoGraph` and its habitats have all the attributes necessary to compute all component metrics. Otherwise, happy to merge as is! 🚀
barrier_classes: Optional[List[Union[str, int]]] = None,
max_travel_distance: float = 0.0,
Nice to have this now 🚀
Does this actually go into the computation yet, or is this just for future-proofing?
It's just for future-proofing at the moment.
self, calc_polygons: bool = True, add_distance_edges: bool = False
) -> ComponentGeoGraph:
Good to be able to turn off the calc polygons 👍
self.components = comp_geograph
return comp_geograph

def get_metric(self, name: str) -> metrics.Metric:
Just for my understanding: so if I want to get all available metrics I could just use the line below?
all_metrics_of_graph = [graph.get_metric(name) for name in metrics.METRICS_DICT]
The problem is that `avg_component_isolation` doesn't work unless you have edges with distance between the components in the `ComponentsGeoGraph` of the `GeoGraph` you call the metric on, and `avg_component_area` doesn't work when the `ComponentsGeoGraph` doesn't have a dataframe with polygons, which is the default for the main `GeoGraph`. So if you do the above then you will get an error - any ideas on how to resolve this?
Maybe a property like `available_metrics` or so would be nice? This way a user could check what possible arguments are allowed.
One thing you can do is

all_metrics = []
for name in metrics.METRICS_DICT:
    try:
        all_metrics.append(graph.get_metric(name))
    except ValueError:
        continue

which would not give any errors and should return all the metrics it is possible to calculate.
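As a self-contained illustration of this filtering pattern, here is a runnable sketch with a mock metric registry. The tiny functions and dict below are stand-ins for the discussion's `metrics.METRICS_DICT` and `get_metric`, not the real GeoGraph API:

```python
def _num_patches(graph):
    # Trivial stand-in metric: always computable.
    return len(graph["patches"])

def _avg_component_area(graph):
    # Stand-in for a metric whose prerequisites may be missing.
    if "component_df" not in graph:
        raise ValueError("components have no dataframe with polygons")
    return sum(graph["component_df"]) / len(graph["component_df"])

# Mock of metrics.METRICS_DICT: metric name -> calculation function.
METRICS_DICT = {"num_patches": _num_patches, "avg_component_area": _avg_component_area}

def get_metric(graph, name):
    return METRICS_DICT[name](graph)

graph = {"patches": [1, 2, 3]}  # no component dataframe available
all_metrics = []
for name in METRICS_DICT:
    try:
        all_metrics.append(get_metric(graph, name))
    except ValueError:
        continue  # skip metrics whose prerequisites are missing

print(all_metrics)  # only num_patches succeeds
```

The `try`/`except ValueError` swallows exactly the errors raised for missing prerequisites, so the loop returns every metric that is computable with the graph's current attributes.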
I think given that the computational cost to create some of these metrics scales so poorly, we should keep the GeoGraph defaults scalable (i.e. not compute all the component edges, df, etc.). The most important immediate thing will be to add good warnings (and docstrings with examples) that let the user know how to add the relevant info to compute a more sophisticated metric.
In the next step, we could let the metrics themselves trigger the computation of these data structures. So e.g. if you want to compute `avg_component_area`, it calls a `GeoGraph` method `get_components(dataframe=True)`, which automatically computes the components with a df if the current component attribute doesn't have this yet. This would make it easier for the user, but maintain default scalability.
EDIT: this method would be in addition to the regular `components` attribute.
EDIT: this method could be an extension of the existing `get_graph_components` method, so that we always call that method to get components, sometimes with them being newly computed and sometimes with the existing `components` attribute.
We can certainly make it so that the metrics functions trigger the computation of the needed data structures, but that would then make it impractical to use the code above iterating over all metrics, because it would trigger some incredibly expensive computation unless your graph is small. I personally slightly prefer allowing users to easily get all metrics that can be calculated with the defaults. Maybe I should add a `get_all_metrics` function which gets all the metrics that are valid for a geograph - then additional ones can be called optionally.
Hm yeah I think you're right, it would be nice for the user to have an easy way to get the simple metrics. Maybe we could have `get_standard_metrics` (for computationally inexpensive) and `get_extended_metrics` (for all metrics including expensive) functions?
Generally, I think triggering these computations is what a user would expect if they try to get a complicated metric. But you're absolutely right, it shouldn't be the default anywhere (as by calling that iteration above).
EDIT: thought about it more - I am somewhat indifferent between throwing an error when the graph is not fully computed yet, or a warning that it may take long, when trying to compute a complicated metric. In case of the error it should be clear from the error how the user can fix it (with the example). Otherwise, up to you.
Alright, in b196db1 I added a `STANDARD_METRICS` list and an `ALL_METRICS` list, so you can just iterate over the standard metrics list to get all of the cheap metrics. Then I made the metrics functions compute the needed data structures automatically, with a warning.
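A minimal sketch of how such a split could look. The list contents and helper below are illustrative stand-ins, not the actual code from b196db1:

```python
# Illustrative metric name lists; the real STANDARD_METRICS / ALL_METRICS
# live in src/models/metrics.py and their contents may differ.
STANDARD_METRICS = ["num_patches", "avg_patch_area", "total_area"]
EXPENSIVE_METRICS = ["avg_component_area", "avg_component_isolation"]
ALL_METRICS = STANDARD_METRICS + EXPENSIVE_METRICS

def get_standard_metrics(graph, get_metric):
    """Compute only the cheap metrics; safe to call on large graphs."""
    return {name: get_metric(graph, name) for name in STANDARD_METRICS}

# Trivial stand-in get_metric that just counts patches for every metric name:
values = get_standard_metrics({"patches": [1, 2]}, lambda g, name: len(g["patches"]))
print(sorted(values))
```

Iterating `STANDARD_METRICS` keeps the default path cheap, while the expensive names remain available via `ALL_METRICS` for users who have built the extra data structures.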
def save_habitat(self, save_path: pathlib.Path) -> None:
    """
For my understanding: how is this different from the main save method of the `GeoGraph` class?
I guess the only difference is the `data = { ... }` part?
This also saves all the metadata for the habitat, such as the list of valid classes & barrier classes, the `max_travel_distance`, etc. And the loading function loads all that stuff as well.
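A rough sketch of the kind of metadata payload being described. The field names and the JSON format here are guesses based on the discussion, not the actual save format of `save_habitat` (which also stores the graph and dataframe):

```python
import json
import pathlib
import tempfile

# Hypothetical habitat metadata; field names are illustrative only.
metadata = {
    "name": "woodland",
    "valid_classes": ["oak", "birch"],
    "barrier_classes": ["road"],
    "max_travel_distance": 25.0,
}

save_path = pathlib.Path(tempfile.mkdtemp()) / "habitat_metadata.json"
save_path.write_text(json.dumps(metadata))

# The matching load function restores the metadata alongside the graph data.
loaded = json.loads(save_path.read_text())
print(loaded["max_travel_distance"])  # 25.0
```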
if TYPE_CHECKING:
    from src.models import geograph
Nice, didn't know there was a variable like this. Should start to use this as well!
self.graph = nx.empty_graph(len(self.components_list))
self.metrics: Dict[str, Metric] = {}
For my understanding: this would then just be a number of nodes without any additional info?
Yeah, I thought it best to create the graph so that users can at least add edges and node attributes if they want.
Yep, sounds good.
src/models/metrics.py
comp_geograph = geo_graph.components
if not comp_geograph.has_df:
Question for usage: does a `GeoGraph` instance by default have this attribute, with a `ComponentGeoGraph` with df? Or how can I make sure that this metric can be computed for a GeoGraph?
By default in the `GeoGraph` init we call `self.components = self.get_graph_components(calc_polygons=False)`, which means the resulting `ComponentsGeoGraph` does not have a df, so this metric will not work there. You can calculate the df and store it in `components` with `geograph.components = geograph.get_graph_components(calc_polygons=True)`. However, by default the `HabitatGeoGraph` does get the components along with the component df in its init function, so this metric will work fine for those.
src/models/metrics.py
def _avg_component_isolation(geo_graph: geograph.GeoGraph) -> Metric:
    """Calculate the average distance to the next-nearest component."""
    comp_geograph = geo_graph.components
Great that we can compute this, at least for smaller habitats.
Again a usage question: how do I make sure that a `GeoGraph` instance has all the relevant attributes to be able to compute this?
Also: how hard would it be to extend this to just regular patch isolation as well?
You can do `geograph.components = geograph.get_graph_components(calc_polygons=True, add_distance_edges=True)` to calculate the distance edges and make this metric work, but as warned this will be very expensive for graphs with more than about 100 components.
For patch isolation I think the approach I've been taking to calculate distances between everything is too inefficient, so I think you would have to change the idea of the metric from being distance to next-nearest patch to being average distance between neighbouring patches.
Also: should we set this to zero if there is only a single component? I think this would make sense, as we would want this metric to go down as more components become connected, and in the limit (i.e. all components are connected) it should be at its lowest, 0.
Good idea, done in f5b80c0
Thanks for the info, I think the best quick addition for this PR will be to add this fix
geograph.components = geograph.get_graph_components(calc_polygons=True, add_distance_edges=True)
to the warning and docstring of the two component metric functions. Then as a next step (probably after this PR) we could extend the get components function as I suggested above.
(Again beyond the scope of this PR) About patch and component isolation: I didn't give this extensive thought, but I think we can use the existing graph info to make this more efficient:
- only computing distances where there are edges (because if there is an edge, the minimal-distance patch must have such an edge)
- slowly searching for nearby patches (like creating a larger max_travel_distance habitat) by adding a double max_travel_distance buffer and then adding the edges with distance to the graph, and doing this in increments of the max_travel_distance, until we find an edge.
Together this should still be more efficient in most cases than the complete distance computation at the moment.
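The incremental-buffer idea above can be sketched roughly as follows, with plain points standing in for patch polygons. All names are illustrative; `step` plays the role of the max_travel_distance buffer increment:

```python
import math

def nearest_within_increments(origin, others, step, max_radius):
    """Find the nearest point by growing the search radius in increments
    of `step`, stopping as soon as any candidate falls inside the radius.
    This mirrors buffering by max_travel_distance repeatedly instead of
    computing all pairwise distances up front. Once any candidate is within
    radius r, the closest of those candidates is the global nearest, since
    every point outside the radius is farther than r."""
    radius = step
    while radius <= max_radius:
        candidates = [
            (math.dist(origin, p), p) for p in others if math.dist(origin, p) <= radius
        ]
        if candidates:
            return min(candidates)  # (distance, point) of the nearest hit
        radius += step
    return None  # nothing within max_radius

dist, point = nearest_within_increments(
    (0, 0), [(5, 0), (1, 1), (10, 10)], step=2.0, max_radius=20.0
)
print(point)  # (1, 1)
```

For well-clustered patches this stops after a few cheap increments, rather than touching every pair.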
Looks good to me as well! Mainly added a few questions for my understanding. Great job @herbiebradley 🚀
from numpy import ndarray
from src.models.polygon_utils import (
    connect_with_interior_bulk,
    connect_with_interior_or_edge_bulk,
    connect_with_interior_or_edge_or_corner_bulk,
)

if TYPE_CHECKING:
Nice hack! 😉
from tqdm import tqdm

pd.options.mode.chained_assignment = None  # default='warn'
What's that? 😉
This is because in the `add_habitat` function, we create a new habitat by slicing the main df to get the rows which correspond to the habitat. Further operations on the habitat df then give a Pandas warning about chained assignment, which is explained in the first answer here: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas. For our use case we don't care about it, so I disabled the warning.
src/models/geograph.py
def __eq__(self, o: object) -> bool:
    if not isinstance(o, GeoGraph):
        return False
    return nx.is_isomorphic(self.graph, o.graph)
Cool! Just out of curiosity, do you know how complex such an operation is?
Yeah, I just realised it's exponential in the number of vertices - for some reason that function is still very fast when comparing graphs which have quite a few differences, but impossibly slow for graphs which are the same. I've changed it to `nx.fast_could_be_isomorphic`, which takes about half a second to compare even graphs which are equal. It's only an approximation of course, but I think it will be quite accurate for the differences usually found between GeoGraphs.
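For intuition, a cheap "could be isomorphic" check compares inexpensive graph invariants instead of searching for an actual node mapping. This stdlib-only sketch illustrates the principle with the degree sequence (it is not networkx's implementation, which also compares further invariants):

```python
from collections import Counter

def could_be_isomorphic(adj_a, adj_b):
    """Cheap necessary condition for isomorphism: equal degree sequences.
    Graphs are given as adjacency dicts {node: set_of_neighbours}.
    True only means 'could be'; False definitively rules isomorphism out."""
    degrees_a = Counter(len(nbrs) for nbrs in adj_a.values())
    degrees_b = Counter(len(nbrs) for nbrs in adj_b.values())
    return degrees_a == degrees_b

# A triangle and a 3-node path have different degree sequences:
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(could_be_isomorphic(triangle, path))  # False
```

Computing and comparing degree counts is linear in graph size, which is why the approximate check stays fast even when the exact isomorphism search would blow up.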
if final_index is None:
    final_index = self.df.last_valid_index()
    final_index = self.df.last_valid_index() + 1
Oops, that would have been bad - good catch!
""" | ||
if not set(class_list).issubset(self.df["class_label"].unique()): | ||
raise ValueError("`class_list` must only contain valid class names.") | ||
self.df.loc[self.df["class_label"].isin(class_list), "class_label"] = new_name |
How do we deal with potential touching polygons once the classes are merged? Should we also merge them into the respective superpolygons?
I should be able to find these cases and call the very convenient `merge_nodes` function to deal with it 😄 - I'll add it to the PR tomorrow probably.
This is taking more time than I thought it would, so I'll create an issue and add it in a future PR.
def _load_from_dataframe(
    self,
    df: gpd.GeoDataFrame,
    tolerance: float = 0.0,
What's the best practice in this case, actually? Is it to keep redundant arguments around (e.g. the way you did, or with kwargs), or is it to remove them? I don't know, but would be curious to learn.
I think it's to create the redundant arguments so that if people accidentally copy a function call from the superclass and apply it to the child class, it all still works. See python/mypy#1237 (comment)
https://en.wikipedia.org/wiki/Liskov_substitution_principle
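A minimal sketch of that convention with hypothetical classes (not the real GeoGraph hierarchy): the subclass keeps the superclass's `tolerance` parameter even though it ignores it, so calls written against the superclass signature still work on the subclass:

```python
class Base:
    def load(self, df, tolerance: float = 0.0):
        """Pretend to use tolerance, e.g. to simplify geometries."""
        return f"loaded with tolerance={tolerance}"

class Child(Base):
    def load(self, df, tolerance: float = 0.0):
        """Keeps the `tolerance` parameter for signature compatibility
        (Liskov substitution), even though it is unused here."""
        del tolerance  # accepted but deliberately ignored
        return "loaded without simplification"

# A call written against the Base signature works on either class:
for obj in (Base(), Child()):
    print(obj.load(df=None, tolerance=0.5))
```

Dropping the parameter in `Child` would instead raise a `TypeError` for any caller that passes `tolerance`, which is exactly the substitution breakage the principle warns about.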
# Using this list and iterating through it is slightly faster than
# iterating through df due to the dataframe overhead
Did you benchmark against `df.geometry.values` (which gives you a geometry array object that has references to the underlying C objects)? How does this compare?
(Feel free to address this after the report and just open an issue - I think before the report submission performance is not a priority)
Yeah I was thinking about this, I'll test it out before I merge the PR.
@Croydon-Brixton I did some tests and it turns out making the list is slower than just getting the `.values`, but accessing a random element in the list is ~25x faster than accessing a random element in the geometry array (which is built on top of numpy, I think). So we can stick to lists; this post might explain why they are better for accessing elements: https://stackoverflow.com/questions/29281680/numpy-individual-element-access-slower-than-for-lists
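The access-cost effect can be reproduced with a quick `timeit` sketch. The stdlib `array` module is used here as a rough stand-in for a typed array backed by a raw buffer (the comparison in the PR was against the GeoPandas geometry array, and the exact ratio will differ):

```python
import array
import timeit

values = list(range(100_000))
typed = array.array("l", values)  # typed buffer of C longs

list_time = timeit.timeit(lambda: values[54_321], number=100_000)
array_time = timeit.timeit(lambda: typed[54_321], number=100_000)

# Each access to `typed` must construct a Python int from the raw buffer,
# while the list just returns a reference to an already-existing object.
print(f"list: {list_time:.4f}s  array: {array_time:.4f}s")
```

This is the general trade-off: typed arrays are cheaper to build and store, while plain lists are cheaper for repeated single-element access because elements are already Python objects.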
class Metric:
    """Class to represent a metric for a GeoGraph, with associated metadata."""

    value: Any  # No good Numpy type hints
Sadly 😢 I think people are working on it though.
desc="Calculating edge weights",
total=len(self.graph.edges),
):
attrs["distance"] = geom[u].distance(geom[v])
That loop and distance calculation is probably the very slow computation you mentioned right?
Yeah that's that one
All points dealt with, time to merge! 🥳
This PR adds metrics to GeoGraph by defining a Metric dataclass. A Metric object is then returned when a metric is calculated; each metric calculation function accepts only a GeoGraph, so that you can calculate any metric by calling the `get_metric` method on any GeoGraph and passing only the name of the metric. The PR also changes Habitats and Components to be sub-classes of GeoGraph, so you can now have recursively nested HabitatGeoGraphs, each with a ComponentGeoGraph, and you can calculate metrics on all of them!

Other features:
- A `merge_classes` function, which can be used to avoid some extra lines when combining class labels in the dataframe.
- An `apply_to_habitats` function, which allows you to apply a GeoGraph method, like `merge_nodes`, to every HabitatGeoGraph attached to a GeoGraph in one go, assuming each application uses the same arguments.
- `add_habitat`.
- An updated `get_graph_components` function, which now supports optionally calculating the union polygons for each component (by default this is not done for a GeoGraph, but is done for a HabitatGeoGraph, since habitats are often small enough that this is fairly cheap).
- An `avg_component_isolation` metric.