feat: Correct and document crimea.json #648

Merged Dec 21, 2024 (8 commits)

Conversation

dsmedia
Collaborator

@dsmedia dsmedia commented Dec 18, 2024

Resolves #594

Tasks

  • Replace crimea.json with version from stdlib
  • Add description, sources, and column descriptions for crimea.json to _data/datapackage_additions.toml

Notes

  • New json file includes an additional field for army_size, consistent with stdlib and Bostock version
  • Dataset corrected to match the relevant table printed in 1859 Nightingale book
  • Row count and date values unchanged
  • JSON formatted to match original crimea.json dataset
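The "row count and date values unchanged" note can be checked mechanically. The following is a minimal sketch, not the check used in this PR: the helper name `same_rows_and_dates` and the sample records (including their date format and values) are hypothetical, and only the `date` and `army_size` field names come from this thread.

```python
def same_rows_and_dates(old_rows, new_rows, date_key="date"):
    """Return True when both datasets list the same date values in the same order.

    Equal lists necessarily have equal length, so this also covers row count.
    """
    old_dates = [row[date_key] for row in old_rows]
    new_dates = [row[date_key] for row in new_rows]
    return old_dates == new_dates


# Hypothetical records for illustration only; not values from crimea.json.
old = [
    {"date": "1900-01-01", "wounds": 0},
    {"date": "1900-02-01", "wounds": 1},
]
new = [
    {"date": "1900-01-01", "wounds": 0, "army_size": 100},
    {"date": "1900-02-01", "wounds": 2, "army_size": 120},
]

unchanged = same_rows_and_dates(old, new)
truncated = same_rows_and_dates(old, new[:1])
print(unchanged, truncated)
```

For the real files, each list would come from `json.load` on the old and new versions of the dataset.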

@dsmedia
Collaborator Author

dsmedia commented Dec 18, 2024

Notes on Mortality Rate Data and Existing Visualizations

1. Mortality Rate Data Derivation

The mortality rate columns in Nightingale's 1859 publication can be derived from the raw death counts and the army size, so including them in crimea.json would be redundant. The formula is:

Annual Mortality Rate = (Deaths × 1000 × 12) / Army Size
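To make the scaling explicit: multiplying monthly deaths by 12 annualizes them, and multiplying by 1000 expresses the result per 1,000 troops. A small dependency-free sketch of the same arithmetic follows. The function name `annual_mortality_rate` and the input values are illustrative assumptions, not taken from the dataset file.

```python
def annual_mortality_rate(deaths: int, army_size: int) -> float:
    """Annualized deaths per 1,000 troops, rounded to one decimal place."""
    if army_size <= 0:
        raise ValueError("army_size must be positive")
    # x12 turns a monthly count into an annual one; x1000 gives a per-1,000 rate.
    return round(deaths * 1000 * 12 / army_size, 1)


# Illustrative inputs (assumed for this example).
rate = annual_mortality_rate(deaths=5, army_size=8571)
print(rate)  # 7.0
```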

Verification

Here's a Python implementation that reproduces the original mortality rates:

import pandas as pd

def transform_mortality_data(deaths_df):
    """Return annualized mortality rates per 1,000 troops for each cause of death."""
    df = deaths_df.copy()  # leave the caller's DataFrame unmodified
    dates = pd.to_datetime(df['date'])  # parse the date column once
    # Monthly deaths x 12 annualizes; x 1000 expresses the rate per 1,000 troops.
    for source, target in [('disease', 'Diseases'), ('wounds', 'Wounds'), ('other', 'Other')]:
        df[target] = (df[source] * 1000 * 12) / df['army_size']
    df['Month'] = dates.dt.strftime('%B')
    df['Year'] = dates.dt.year.astype('int64')
    return df[['Month', 'Year', 'Diseases', 'Wounds', 'Other']].round(1)

Sample output matches the original data:

April 1854: Diseases=1.4, Wounds=0.0, Other=7.0
December 1855: Diseases=25.3, Wounds=5.0, Other=7.8
March 1856: Diseases=3.9, Wounds=0.0, Other=9.1

2. Related Visualization Work

There's an existing Vega implementation by @avatorl that uses the transformed mortality rate data. See a 2022 blog post and repository for details. This may be a good candidate for a Vega example @domoritz mentioned in #594.

@dsmedia dsmedia changed the title from "feat(DRAFT): Correct and document crimea.json" to "feat: Correct and document crimea.json" Dec 21, 2024
@dsmedia dsmedia marked this pull request as ready for review December 21, 2024 03:48
@dsmedia
Collaborator Author

dsmedia commented Dec 21, 2024

@dangotbanned Please note the diff shows very slight modifications to data filesizes in datapackage.json after I ran build_datapackage.py. If I can fix this (with a local configuration setting?), please let me know.

@dsmedia
Collaborator Author

dsmedia commented Dec 21, 2024

@dangotbanned Also, when I tried to wrap the TOML multi-line strings below (the field description and the source title) at 80-100 characters per line, the markdown table generated by build_datapackage.py came out incorrectly formatted, so I kept each as one long line. Is that intended?

[[resources.schema.fields]]
name        = "wounds"
description = """Deaths from "Wounds and Injuries" which comprised: Luxatio (dislocation), Sub-Luxatio (partial dislocation), Vulnus Sclopitorum (gunshot wounds), Vulnus Incisum (incised wounds), Contusio (bruising), Fractura (fractures), Ambustio (burns) and Concussio-Cerebri (brain concussion)"""
[[resources.sources]]
title = """
Nightingale, Florence. A contribution to the sanitary history of the British army during the late war with Russia. London : John W. Parker and Son, 1859. Table II. Table showing the Estimated Average Monthly Strength of the Army; and the Deaths and Annual Rate of Mortality per 1,000 in each month, from April 1854, to March 1856 (inclusive), in the Hospitals of the Army in the East
"""

@dangotbanned
Member

@dangotbanned Please note the diff shows very slight modifications to data filesizes in datapackage.json after I ran build_datapackage.py. If I can fix this (with a local configuration setting?), please let me know.

Interesting 🤔

My first thought would be maybe os.stat_result.st_size differs across platforms?
Since the sizes you have are all smaller, I'm going to guess:

import sys

# sys.platform reports "linux" on Linux; "posix" is a value of os.name instead
sys.platform in {"darwin", "linux"} # @dsmedia
sys.platform == "win32"             # @dangotbanned

@dangotbanned
Member

dangotbanned commented Dec 21, 2024

#648 (comment)

@dsmedia I'll take a look at this today.

For this one, it could be the leading """\n before the text actually starts?

[[resources.sources]]
title = """
Nightingale, Florence. A contribution to the sanitary history of the British army during the late war with Russia. London : John W. Parker and Son, 1859. Table II. Table showing the Estimated Average Monthly Strength of the Army; and the Deaths and Annual Rate of Mortality per 1,000 in each month, from April 1854, to March 1856 (inclusive), in the Hospitals of the Army in the East
"""

Updated

@dsmedia this will fix it if you re-run build_datapackage.py

I'm holding off on doing that locally, since I don't wanna revert all the Resource.bytes changes - until I know the cause of (#648 (comment))

dangotbanned added a commit that referenced this pull request Dec 21, 2024
Member

@dangotbanned dangotbanned left a comment


Thanks for working on this @dsmedia!

@dangotbanned dangotbanned merged commit 369b462 into vega:main Dec 21, 2024
2 checks passed
dangotbanned added a commit to vega/altair that referenced this pull request Dec 21, 2024
Changes from vega/vega-datasets#648

Currently pinned on `main` until `v3.0.0` introduces `datapackage.json`
https://github.com/vega/vega-datasets/tree/main
Development

Successfully merging this pull request may close these issues.

Unclear provenance of crimea.json dataset ('Nightingale's Rose')
3 participants