Consider normalizing by total population of the country or region #30

Open
ksperling opened this issue Mar 30, 2020 · 28 comments

Comments

@ksperling

While the log vs log plot is great for comparing the shape of the trajectory of individual countries, wouldn't it make sense to be able to normalize by the total population, so it becomes possible to compare between countries more accurately?

@amirmazmi

Support this. Also, perhaps by country size (land area) as a rough measure of distance between population.

I would be happy to help get the data for the above.

Good work on this. Will be useful for future outbreaks.

@carbonox-infernox

carbonox-infernox commented Apr 1, 2020

The number of new cases is already being divided by a population - that of those infected. The total population of a country only controls the maximum infections that could occur, not the rate of infection. This is already evident by the graph; Countries follow the same trajectory regardless of vast differences in population.

@ksperling
Author

Well, the current scaling of new cases / total cases normalizes the angle of the curve, but not the overall scale.

So if you compare the curves of China and South Korea for example, the current scaling makes it look like the South Korean curve dipped down "sooner" (yes, I realize the X-axis is not time) than China's, after about 6k cases vs 74k. In terms of comparing the impact of containment measures on the curve shape this is misleading though, because China has about 27 times the population of South Korea. If you instead look at it as (new cases per 1M population) / (total cases per 1M population), the curve for China dips at about 53 cases/1M compared to 117 cases/1M for South Korea.

Essentially the ask is to have the option of having the X-axis represent the proportion of the overall population that is infected, while retaining ability to compare trajectories.
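
The per-million figures quoted above can be reproduced with a minimal sketch. The population values (roughly 1,400 million for China and 51.3 million for South Korea) are rounded assumptions, not numbers from this thread, and `cases_per_million` is a hypothetical helper name.

```python
def cases_per_million(cases: float, population: float) -> float:
    """Scale an absolute case count to cases per one million inhabitants."""
    return cases / population * 1_000_000

# Assumed, rounded populations (not taken from the thread).
POPULATION = {"China": 1_400_000_000, "South Korea": 51_300_000}

# Approximate total case counts at which each curve dips, as quoted above.
DIP_CASES = {"China": 74_000, "South Korea": 6_000}

dip_per_million = {
    country: cases_per_million(DIP_CASES[country], POPULATION[country])
    for country in DIP_CASES
}

for country, value in dip_per_million.items():
    print(f"{country}: about {value:.0f} cases per 1M at the dip")
```

With these assumed populations the results round to the 53 and 117 cases per 1M mentioned in the comment.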

@ksperling
Author

Put another way, I think an X axis of absolute case numbers is meaningful as long as cases are essentially localized -- if there's a handful of cases, it doesn't matter how large the uninfected population surrounding those cases is. But once you get to a point where infections are essentially everywhere within a country, the numbers start to scale with the size of the overall population, and looking at it as cases per population starts making more sense.

@corlaez

corlaez commented Apr 2, 2020

I support this request as a toggleable option.

The main reason for normalizing by total population is that it gives you an estimate of the real impact on the country. A million cases in a country of 2 million people does not have the same impact as a million cases in a country of 300 million.

I think it is a cheap transformation, doesn't need any extra data and would give a better way to compare apples to apples.

@maciej-ludwinski

+1
It would allow us to see if different countries reach the same saturation of infected cases in the population before the infection rates start dropping.

@aatishb
Owner

aatishb commented Apr 2, 2020

I'm not yet convinced that we should add per-capita numbers, although I am hearing this requested a lot (especially in my email & mentions). Overall I'm inclined to agree with John Burn-Murdoch on this. Laying out my thinking on this for now.

As others have pointed out, essentially this would change the ordering (but not the shape) of the graph so that smaller-population countries would shift further up the graph and larger-population countries would move further down.
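
One way to see why only the position changes and not the shape: on log-log axes, dividing both total and new cases by the same population moves every point by the same offset. The sketch below uses a made-up trajectory and a hypothetical population; none of the numbers come from the project's data.

```python
import math

def to_log_point(total: float, new: float) -> tuple:
    """Coordinates of one (total cases, new cases) point on log-log axes."""
    return (math.log10(total), math.log10(new))

def log_shift(total: float, new: float, population: float) -> tuple:
    """Displacement of a point on log-log axes after dividing both values by population."""
    x0, y0 = to_log_point(total, new)
    x1, y1 = to_log_point(total / population, new / population)
    return (x1 - x0, y1 - y0)

# Made-up trajectory of (total cases, new weekly cases); illustrative only.
trajectory = [(100, 50), (1_000, 400), (10_000, 2_500), (50_000, 3_000)]
population = 5_000_000  # hypothetical country size

shifts = [log_shift(t, n, population) for t, n in trajectory]
for dx, dy in shifts:
    print(f"dx = {dx:.4f}, dy = {dy:.4f}")
```

Every point moves by -log10(population) on both axes, so the whole curve slides along the diagonal, and by a different amount for each country.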

Some of my concerns:

  • By dividing by population, we are visually driving down the numbers in large population countries. This risks giving the impression that there is less to worry about at first, when in fact a large population suggests that there is a lot of room for the disease to spread. I want to be careful to avoid this risk.

  • Some large population countries will not have the resources to count cases or deaths accurately. For example, see this New Yorker article. If the numbers are already biased low, dividing by population may compound this. Calculating per-capita numbers may implicitly assume that the numbers are just as accurate in all countries. In reality the numbers are likely underestimated in complicated country-dependent (and possibly population dependent) ways.

  • The per-capita view perhaps makes sense when the disease has spread so far that it is being limited by the population size. For early stage outbreaks, the disease is not yet limited by the population size, and the number of deaths or cases is probably weakly correlated with population (see e.g. here). The thing is, not all countries are at population-scale levels of infection, so when we shift to per-capita we may be comparing apples to oranges.

  • Another way of saying this: if we switch to per-capita, we are boosting up the early case/death numbers in a low population country, and de-emphasizing the early numbers in a high population country. In reality at early stage the cases are growing at a roughly similar rate in low & high population countries, so it's weird to me that we're emphasizing one over the other.

  • Per-capita is also a bit abstract. I think we may lose some of the immediacy of this graph, and its relevance to what people are used to hearing and thinking about. Most people don't think in per-capita (to be fair, they probably don't think in log scales either!). 1000 deaths in India is easier to grasp than 0.000001 deaths per-capita.

  • I'm also not completely comfortable expressing deaths per capita, as this de-emphasizes individual deaths in high population countries as compared to those in low population countries. I feel that we should have a really good reason for such a decision.

Some alternate workarounds:

  • It is currently possible to select small population countries individually and compare them to each other.

  • We could consider adding a geographical region selector to compare say Europe to North America, South America, South Asia, East Asia, etc. (as some have suggested)

There are clearly situations when it makes sense to ask per-capita questions, but I'm not yet convinced this is one of them. As of now, I'm not sure the benefits of this view outweigh the costs.

I'm open to changing my thinking on this down the road. I'm also totally open to anyone who wants to fork this repo and create their own parallel one with per capita statistics.

@maciej-ludwinski

@aatishb First of all, thanks for the great project and your dedication to continuing work on it!

I feel that there is no one graph that can show everything. Each approach is a different point of view, a different piece of the puzzle. Absolute numbers, per capita, per country area, and per test are all showing different sides of the same story. That's why I'd love to see all those options as switches – perhaps with additional info on which view accents which issue.

I know there are a lot of these graphs out there, but what's unique to this project (as far as I know), is not putting time on an axis, which gives a unique view – and I believe, adding those options would only deepen that insight.

On the other hand, these are just my opinions, I'm no mathematician, so I might be wrong and in the end, I will value your opinion over my own.

Thanks again!

@aatishb aatishb mentioned this issue Apr 2, 2020
@ksperling
Author

@aatishb Thanks for the thoughtful response, those are good points.

I guess personally I've been looking at this from the point of view of a small country (I live in New Zealand), and am wondering if we're really doing well and if interventions actually happened "early", or if we just have comparatively low numbers because we're a small country.

The lack of correlation between deaths and population you linked is interesting. Intuitively it seems like there are a number of factors that should scale with population size though -- e.g. the number of "imported" cases should correlate with the number of people travelling abroad, which should be broadly proportional to total population (of course there are other factors too).

@n1vux

n1vux commented Apr 7, 2020

Thank you for reminding us that phase-space plots are useful, we don't have to graph versus Date.

TL;DR

  1. From all I've read, I largely agree with your and John B-M's analysis re per-capita scaled or not; and it's quite inappropriate for your lovely phase-space chart; if it has any use, it is elsewhere.
  2. People who want a "phase space" non-TS graph "like" this one, but with a per-capita scale, should check out Upshot's non-timeseries plot. It's also a phase-space plot, but +daily% vs total cases per mille ‰ (so scaled); thus it answers a different, overlapping set of questions. There's room for both on the web.

If the goal isn't to answer the question as to whether ROK turned the corner "better" than PRC, but "have we turned yet," this construction is the best I've seen yet.
(A year or three from now, retrospective research models with more complete data will be used to answer "what worked best" and "what was counterproductive". We already know more testing is better, more case tracking is better, everything earlier is better. We don't need to gamify this now, do we?)

For those who want to see a population-scaled version of a phase-space plot, there is the other phase-space plot (which publicly launched the same day!) over at NYT Upshot. Instead of the classic X=total_cases(i) vs X'=delta_cases(i) phase space used here, it uses Daily Growth Rate % (100 * delta_cases(i) / total_cases(i) %) versus Cases per mille (1000 * total_cases(i) / population ‰).

(I haven't seen per mille aka the per thousand sign used in a while and am happy for it!)

Their choice of Growth Rate is interesting but perhaps problematic for communications. The subtle differences among 7%, 9%, 12%, 15%, 19%, 26% daily Growth Rates are the differences between doubling in 10, 8, 6, 5, 4, 3 days (respectively), rather less subtle and important. The latter is a figure of merit that doesn't need a logarithmic scale to see the important distinctions; if we want either a non-time-series plot or even a time-series plot that doesn't need that problematically delightful log scale that confounds the innumerate, Doubling Days shouts the exponential nature of the epidemic (with neither confusing subtle steep tangents w/o log scale nor confusing log scales with). And who wants to plot Percentage on semi-log? (Which opens up another can of worms!)

I am happily using BOTH your and their phase space charts to track my state's counties reports. Yours (X vs X') makes the Weekend Effect re new cases reported on Sunday much more obvious (in Mass. counties data) than it is on their Growth % vs Pop affected ‰ design. A sharp dip and Monday reversal looks like a head-fake towards safety.
Theirs (GR % vs cases per ‰) is advertised as being good for comparing metropolises, the natural basins of inter-accessible population (barring isolation/distancing mitigation) and health infrastructure, the two resources that may hit exhaustion, not countries with multiple population centers. They don't claim it should work with ITALY vs USA; in fact they recommend just the opposite. They compare LOMBARDY (as a proxy for the dense cities of the north) vs NYC. In Mass., counties are more like Lombardy than standard MSAs (and alas the big MSA crosses county lines), but (aside from one or two city DPH daily reports) it's the best sub-regional breakdown I can get, and it looks good: I can see counties are following the same trajectory on a lagged schedule with your X vs X' design. With the NYT/Upshot design I get a long decay tail with a bit of a weekly wave. So I might recode my version to use Days to Double instead of Growth Rate (same variable but inverted: Ydd = log(2)/log(1+Ygr), for whichever log you like).
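
The conversion from daily growth rate to doubling time mentioned above, Ydd = log(2)/log(1+Ygr), can be checked with a short sketch. It assumes a constant daily growth rate expressed as a fraction; `doubling_days` is a hypothetical helper name.

```python
import math

def doubling_days(daily_growth_rate: float) -> float:
    """Days to double for a constant daily growth rate given as a fraction (0.07 = 7%)."""
    return math.log(2) / math.log(1 + daily_growth_rate)

# The growth rates listed in the comment, in percent per day.
rates_percent = [7, 9, 12, 15, 19, 26]
doubling = {rate: doubling_days(rate / 100) for rate in rates_percent}

for rate, days in doubling.items():
    print(f"{rate}% per day doubles in {days:.1f} days")
```

Rounded to whole days, these match the 10, 8, 6, 5, 4 and 3 days quoted earlier in the comment.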

While I have worries about data completeness, your lovely web demo reassures that this X vs X' phase-space diagram finds the signal in even the presumably incomplete data from China and is clearly signalling turn achieved in Spain and Italy now, good! Getting a signal out of noisy data is effectively statistical power. That's good.

Thank you again. (And Henry too for letting us know.)

@KenInLV

KenInLV commented Apr 7, 2020

I agree that total numbers are important, should be primary, and should remain the default.

As MinutePhysics expertly explains, your chart was designed to show a specific thing really well — right now, which countries are still on the exponential growth line, and which have dropped off?

That is something I as a layman can intuitively understand, and your chart does a great job of highlighting it.

As I look at the countries climbing up and to the right, knowing they're vastly different population sizes, I have a new, different question — right now, how "severe" or "saturated" is each country?

That's why I disagree with you about whether per-capita is too abstract. I think it would give us a clear view into a question that seems intuitive and self-evident when I look at the chart.

Here's something that might be revealed by looking at the data this way. We know that confirmed / reported cases are a fraction of total infections, but we don't know how small a fraction. Perhaps very small. If we start to see a trend in several countries where they seem to gradually taper off once they reach a certain per-capita death rate, that could tell us something we might not know otherwise.

And even if not, at minimum it will counter the opinion that "most people are already infected." If true, we'd see it start to taper off. But, nope, still on the rise.

So I'd like to see it as an option.

Or, your proposal of listing large geographical regions I think would also work well.

@gruenix

gruenix commented Apr 15, 2020

Thx a lot for providing these graphics, I think they allow a unique way to look at the development.

Two suggestions

  1. consider starting the drawing of the graph at day #0 e.g. day when 50 infections had happened. Currently I assume the animation represents date, so the offsets in development in various countries remain, by taking that away one could compare the speed and progress of development of various countries

  2. the Per capita #1 request :-)
    I would consider a per X inhabitants figure helpful simply to compare the performance of the measures a country has taken... I see your points, but e.g. Italy and Germany (my home) follow a rather similar line, despite the fact that Italy is far worse off, as they have only ¾ of Germany’s population...
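
The first suggestion (starting every country at a common "day 0") could look roughly like the sketch below. The threshold of 50 follows the comment; the country names and case counts are made up, and `align_to_threshold` is a hypothetical helper, not part of the project.

```python
def align_to_threshold(cumulative, threshold=50):
    """Drop leading days until the cumulative count first reaches the threshold.

    The returned list starts at "day 0", the first day with at least
    `threshold` cases. Returns an empty list if the threshold is never reached.
    """
    for day, count in enumerate(cumulative):
        if count >= threshold:
            return cumulative[day:]
    return []

# Made-up cumulative case counts for two hypothetical countries.
series = {
    "Country A": [3, 10, 48, 60, 130, 300],
    "Country B": [0, 0, 0, 5, 20, 55, 140],
}

aligned = {name: align_to_threshold(counts) for name, counts in series.items()}
for name, counts in aligned.items():
    print(name, counts)
```

After alignment, index 0 in each list is that country's own day 0, so the animation would step through days since the threshold rather than calendar dates.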

Again thx for your work kindest regards Jo

@srscott

srscott commented Apr 16, 2020

Agree with gruenix above. Both options would be helpful in visualizing what is happening.

  1. relatively how fast containment (or some other factor) may be working in each country to bend the curve.

  2. how big the problem really is for each country.

All of this of course considers whether or not the data is good, and that might be a stretch, especially with regard to China. Thanks for your work!

@gruenix

gruenix commented Apr 17, 2020

To illustrate the per inhabitant request, if I want to keep the US in the picture (and who can afford NOT to look at Trump and the outcome) I can’t really look at smaller EU countries anymore.
I would really consider the per inhabitant figures as a tool to judge the implemented measures...

@iviney

iviney commented Apr 19, 2020

Hello, this is a really nice graph animation, thanks! But I do think an option to show cases per capita (or per million people) would be very instructive. For example, I'm in New Zealand and there's a lot of discussion about whether we're doing better than Australia, who have a less severe lockdown policy. If we could view this on a per capita basis I think it would show that both countries are doing very similarly, but as it stands you can't really tell because Australia has 5x as many people.

@gruenix

gruenix commented Apr 19, 2020

> Hello, this is a really nice graph animation, thanks! But I do think an option to show cases per capita (or per million people) would be very instructive. For example, I'm in New Zealand and there's a lot of discussion about whether we're doing better than Australia, who have a less severe lockdown policy. If we could view this on a per capita basis I think it would show that both countries are doing very similarly, but as it stands you can't really tell because Australia has 5x as many people.

Hi, we’ve all tried it but he has his reasons... Unfortunately I’m not able to fork and adapt it, as it would be reasonably easy to get the number of inhabitants together. My programming skills don’t even deserve the name and mainly ended with my Commodore C20 back then :-). And these days all I do is script FileMaker or so...

@ledahulevogyre

@aatishb I don't understand your concern that per-capita (or per 100 000) over-emphasizes data from small countries over big ones. I think it's actually the opposite. You over-emphasize data from big countries when you don't. Borders are totally subjective.

You compare apples to oranges when choosing to compare absolute values of USA with Belgium, but also when comparing absolute values from N-Y with Europe. It's only when you divide per capita that you can compare objectively.

(when comparing pace instead of values, per-capita is useless, of course)

@gruenix

gruenix commented Apr 27, 2020

> @aatishb I don't understand your concern that per-capita (or per 100 000) over-emphasizes data from small countries over big ones. I think it's actually the opposite. You over-emphasize data from big countries when you don't. Borders are totally subjective.
>
> You compare apples to oranges when choosing to compare absolute values of USA with Belgium, but also when comparing absolute values from N-Y with Europe. It's only when you divide per capita that you can compare objectively.
>
> (when comparing pace instead of values, per-capita is useless, of course)

Agreed !

US and Belgium might be obvious for everyone; it’s worse when comparing e.g. Germany to France or the UK, where a smaller difference in population may not be so obvious. But it still seriously distorts the picture. Even I have to look up the absolute population of some of our neighbors (and my own country) and need to keep reminding myself.

And I'm not sure about the pace; even the "case density" changing over time would be interesting.

@thomasrebele

There's a pull-request by @jwosty: #25. Looks quite good to me. You can try it here. Thanks for all the work on this great visualization!

@gruenix

gruenix commented May 1, 2020

Thanks to all for this and to @jwosty for the per capita addition
Q: @jwosty can you display the correct labels on the axes? The unit is missing in the per capita graphs.
Q: This is probably an issue in the underlying data, but how can the graphs for the overall # of cases for Spain and France go down? It’s slightly more possible in the per capita graph, but in the absolute one it shouldn’t be possible at all, right?

@amirmazmi

From the source below. Seems like an issue with the underlying data.

COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv

date confirmed cases
4/16/20 184,948
4/17/20 190,839
4/18/20 191,726
4/19/20 198,674
4/20/20 200,210
4/21/20 204,178
4/22/20 208,389
4/23/20 213,024 <--
4/24/20 202,990
4/25/20 205,905
4/26/20 207,634
4/27/20 209,465
4/28/20 210,773
4/29/20 212,917
4/30/20 213,435
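
A quick way to spot this kind of revision is to scan the cumulative series for any day that is lower than the day before. The sketch below uses the figures from the table above; `find_decreases` is a hypothetical helper name.

```python
# Cumulative confirmed cases copied from the table above (US-style M/D/YY dates).
rows = [
    ("4/16/20", 184_948), ("4/17/20", 190_839), ("4/18/20", 191_726),
    ("4/19/20", 198_674), ("4/20/20", 200_210), ("4/21/20", 204_178),
    ("4/22/20", 208_389), ("4/23/20", 213_024), ("4/24/20", 202_990),
    ("4/25/20", 205_905), ("4/26/20", 207_634), ("4/27/20", 209_465),
    ("4/28/20", 210_773), ("4/29/20", 212_917), ("4/30/20", 213_435),
]

def find_decreases(series):
    """Return (date, change) for each day whose cumulative count is below the previous day's."""
    drops = []
    for (_, previous), (date, current) in zip(series, series[1:]):
        if current < previous:
            drops.append((date, current - previous))
    return drops

drops = find_decreases(rows)
for date, change in drops:
    print(f"{date}: cumulative total changed by {change:,}")
```

A cumulative count can only fall if earlier figures were revised, so any entry returned here points at a correction in the source data rather than at the plotting code.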

@midhenry

midhenry commented May 3, 2020

91-DIVOC simply provides the option to view the data as absolute or population-normalized values.
I don't see why that would be so hard and it would certainly eliminate the endless argument here over which way is best. People can supply their own caveats.

BTW, the graph is the best idea ever.

@iviney

iviney commented May 8, 2020

@jwosty: Thanks for the per-capita view! I think there's an error in the scale, though. For example, on 7 May New Zealand has 1,490 confirmed cases and 11 weekly cases. On the per-capita graph this is shown as "Total Confirmed Cases per 100,000: 30,848.548452552597". Apart from the excessive number of decimal places :-) the number isn't correct. NZ has a population of about 4.8m, so the number of cases per 100,000 is 1490/4800000*100000 = 31. In other words, the graph is showing the number of cases per 100 million people, not per 100,000.

PS it's only in the US that "5/7/20" means 7 May; everywhere else in the world it means 5 July!
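
The scale check in this comment can be written out as a small sketch. The population of 4.8 million is the approximate figure used in the comment, and `per_100k` is a hypothetical helper name, not a function from the pull request.

```python
def per_100k(cases: float, population: float) -> float:
    """Cases per 100,000 inhabitants."""
    return cases / population * 100_000

nz_cases = 1_490
nz_population = 4_800_000  # approximate, as in the comment

expected = per_100k(nz_cases, nz_population)
displayed = 30_848.548452552597  # value shown on the per-capita graph

# How many times larger the displayed value is than the expected one.
ratio = displayed / expected

print(f"expected: {expected:.1f} per 100,000")
print(f"displayed value is about {ratio:.0f} times larger")
```

The ratio comes out close to 1,000, which is consistent with the graph showing cases per 100 million rather than per 100,000. Formatting with `:.1f` also takes care of the excessive decimal places.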

@mm0hgw

mm0hgw commented May 11, 2020

https://github.com/mm0hgw/electoral-analysis/blob/dev/epidemic/out/charts/New.vs.Active.2.png
My fix was to swap out total cases for active cases on the bottom scale.

@wrhite

wrhite commented May 30, 2020

Thanks very much to aatish. This is my go-to source for understanding what's going on. Please could we have an EU28 button to select all the countries in the European Union plus the UK. It would save me a lot of selecting tickboxes!

@rpkoller
Contributor

@wrhite that would be a different issue: grouping all European countries for auto-selection. The current issue is about normalizing the infection count by total population. I am not sure whether a request for auto-selection has been brought up before. Best to search the issues and, if it is not already there, file a separate feature request.

@scstraus

scstraus commented Jul 10, 2020

Is this one actually working? It would be nice to change the axis labels to what they really are rather than absolute values. Also, when I hover it shows things like 113,000 new cases per week per 100,000 people (!). Unless the whole population is getting it, plus some getting it twice in the same week.

http://raw.githack.com/jwosty/covidtrends/per-capita/index.html

@eguy54

eguy54 commented Jul 18, 2020

I love what you put together here, but I agree the per capita would be a great addition. I think it's a useful feature to compare the outbreak / response across different geographic areas. For me personally, I find it useful in understanding how "safe" it is to step outside my door on a given day. I agree the aggregate numbers are better for many inquiries, but it would be great to have the option to normalize these metrics by population.

If you do roll out this feature, I'd suggest updating the line widths to the population size -- so you don't lose that context either.

Here is a sample of something I put together over the last few weeks (inspired largely by your work!) to give you an idea of what the data would look like on a per capita basis:
COVID-19

Thanks again for your work here.
