---
title: "Using Large Language Models to generate democracy scores: An experiment"
author:
- name: "[Xavier Márquez](https://people.wgtn.ac.nz/xavier.marquez)"
affiliations:
- name: "Victoria University of Wellington"
format: html
editor: visual
date: "Wed 12 Apr 2023"
date-modified: "`r date()`"
toc: true
message: false
warning: false
---
Like many people, I've been playing recently with ChatGPT and other large language models, and I've found them very useful for many tasks. Inspired in part by a [blog post](https://tedunderwood.com/2023/03/19/using-gpt-4-to-measure-the-passage-of-time-in-fiction/) by Ted Underwood on using GPT4 to measure the passage of time in fiction, I decided to use [OpenAI's ChatGPT](https://chat.openai.com/chat) (version 3.5-turbo) and [Anthropic's Claude](https://www.anthropic.com/index/introducing-claude) (to which I recently got access) to see if I could replicate a democracy index.
This turned out to be surprisingly simple and cheap. I was able to produce thousands of country-year observations in a few hours, complete with justifications, at a very low cost (about \$1.59 of usage for the OpenAI API for 1600 country-years; Claude is currently free for research uses). Though by no means perfect (and in some cases completely wrong), the scores they produce are a plausible measure of democracy, and can be usefully evaluated partly because the models provide justifications for their judgments. Moreover, the procedure described here can be extended to many other forms of data generation for political science.
The two models I used (ChatGPT aka GPT3.5 Turbo and Claude) produced different judgments for some cases (as human coders would), and sometimes their justifications weren't accurate, but overall I was impressed with the quality of their work. ChatGPT seemed to produce more accurate judgments (as measured by its correlation with the Varieties of Democracy Index and Freedom House data and its lower root mean square error), and the OpenAI API is more feature-rich, but given that Claude is free for research purposes, it was relatively easy to use it to experiment and produce many more country-year observations (about 10,000, though that took about 12 hours given my current rate limits). I don't have access to GPT4, so I can't tell whether it would be better for this task, but my sense is that even GPT3.5 is very good already.
I provide replicable code in [this repository](https://github.com/xmarquez/GPTDemocracyIndex) using the [`targets`](https://books.ropensci.org/targets/) package (though the repo is not stable - I'm still experimenting with this, and things are liable to change!), but follow along for an explanation and an evaluation of the results. If you want to skip the technical details of the R code, just scroll down to the section on evaluation.
## Setup
You will need a few packages and an API key from OpenAI (and an API key for Claude as well if you want to use that model, but I don't provide instructions for it here - check [the repo](https://github.com/xmarquez/GPTDemocracyIndex) for details).
```{r setup}
library(tidyverse)
library(democracyData) # Use remotes::install_github("xmarquez/democracyData")
library(openai)
```
```{r}
#| include: false
library(targets)
tar_load(fh_data)
theme_set(theme_bw())
```
Follow the instructions in the [openai](https://irudnyts.github.io/openai/) package to save your OpenAI API key in the appropriate environment variable.
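For example, the {openai} package reads the key from the `OPENAI_API_KEY` environment variable, so (as a minimal sketch) you can either add it to your `.Renviron` file or set it for the current session:

``` r
# Set the key for the current session only; for something more permanent,
# add a line like OPENAI_API_KEY=sk-... to your .Renviron file instead.
Sys.setenv(OPENAI_API_KEY = "sk-...") # replace with your own key
```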
## Sample and Methods
The first thing to do is to set up the sample of countries that we want to generate democracy scores for. Here I use a sample from Freedom House, because it is fairly comprehensive for years after 1972, and I wanted to check how the generated scores compare to Freedom House. They are also more likely to contain country-years for which the models have information and which can be more easily checked by me due to their relative recency (doing historical research to establish the level of democracy in the more distant past is not easy!). And the dataset is not so large that it would take a very long time to code (even by ChatGPT, given my rate limitations). But in principle we could use any country-year panel data.
```{r}
#| eval: false
fh_data <- democracyData::download_fh(verbose = FALSE) |>
filter(year < 2021)
```
```{r}
fh_panel <- fh_data |>
select(fh_country, year) |>
filter(year %% 5 == 0, year < 2021)
```
Note that since ChatGPT has a knowledge cutoff in 2021, it does not make sense to ask it to generate democracy scores for later years. To keep costs low and leave room for experimentation, I'm also only calculating scores for about 20% of all country-years - here, simply the years that end in 0 and 5. (A full run of all `r nrow(fh_data)` country-years only costs about \$8, but I did some trial-and-error testing and experimentation, and with a single API key coding `r nrow(fh_panel)` country-years already takes about 6 hours for the GPT 3.5 API, and I'm impatient).
We also need to craft a good prompt. The [OpenAI API documentation for the ChatGPT models](https://platform.openai.com/docs/guides/chat/introduction) indicates that a prompt should have at least one "system" message and one "user" message. The system message gives the model an identity to emulate (though OpenAI also notes that GPT3.5 does not always pay particular attention to the system message); the user message contains your instructions.
For the system message, I want ChatGPT to emulate a political scientist as much as possible; my system message is:
> "You are a political scientist with deep area knowledge of \[country X\]"
For the user message, I want to anchor ChatGPT's answer to a particular conception of democracy; I use here V-Dem's prompt for the question of [whether a country is an electoral democracy](https://www.v-dem.net/documents/24/codebook_v13.pdf#page=45):
> The electoral principle of democracy seeks to embody the core value of making rulers responsive to citizens, achieved through electoral competition for the electorate's approval under circumstances when suffrage is extensive; political and civil society organizations can operate freely; elections are clean and not marred by fraud or systematic irregularities; and elections affect the composition of the chief executive of the country. In between elections, there is freedom of expression and an independent media capable of presenting alternative views on matters of political relevance. (V-Dem Codebook v. 13, p. 45)
Note we could use a very different conception of democracy! (In fact, I'm still working on some scores using a different prompt based on the [Democracy/Dictatorship](https://xmarquez.github.io/democracyData/reference/pacl.html) dataset, which seems so far to produce somewhat better results). In any case, the full user message gives ChatGPT instructions for how to use this conception to answer the question of how democratic a country is:
> "You are investigating the extent to which the ideal of electoral democracy in \[country X\] was achieved in \[year Y\]
>
> The electoral principle of democracy seeks to embody the core value of making rulers responsive to citizens, achieved through electoral competition for the electorate's approval under circumstances when suffrage is extensive; political and civil society organizations can operate freely; elections are clean and not marred by fraud or systematic irregularities; and elections affect the composition of the chief executive of the country. In between elections, there is freedom of expression and an independent media capable of presenting alternative views on matters of political relevance.
>
> To what extent was the ideal of electoral democracy in its fullest sense achieved in \[country X\] in \[year Y\]?
>
> Use only knowledge relevant to \[country X\] in \[year Y\] and answer in a scale of 0 to 10, with 0 representing "not at all" and 10 representing "completely", and add a measure of your confidence from 0 to 1, with 0 representing "not confident at all", and 1 representing "very confident". Use this template to answer:
>
> \[country X\] in \[year Y\]: number from 0-10, confidence: number from 0-1
We can generate all the prompts very quickly in a format appropriate for use with the {openai} package with this function call:
``` r
create_prompt <- function(fh_data) {
prompts <- purrr::map2(fh_data$fh_country, fh_data$year,
~list(
list(
"role" = "system",
"content" = stringr::str_c(
"You are a political scientist with deep area",
" knowledge of ", .x, ". ")
),
list(
"role" = "user",
"content" = stringr::str_c(
"You are investigating the extent to which the ",
"ideal of electoral democracy in ", .x,
" was achieved in ", .y, ". \n\nThe electoral ",
"principle of democracy seeks to embody the ",
"core value of making rulers responsive to ",
"citizens, achieved through electoral competition",
" for the electorate’s approval under ",
"circumstances when suffrage is extensive; ",
"political and civil society organizations can",
" operate freely; elections are clean and not ",
"marred by fraud or systematic irregularities; ",
"and elections affect the composition of the ",
"chief executive of the country. In between ",
"elections, there is freedom of expression and ",
"an independent media capable of presenting ",
" alternative views on matters of political ",
" relevance. \n\nTo what extent was the ideal ",
"of electoral democracy in its fullest sense ",
"achieved in ", .x, " in ", .y, "? \n\nUse only",
" knowledge relevant to ", .x, " in ", .y,
" and answer in a scale of 0 to 10, with 0 ",
"representing \"not at all\" and 10 representing",
" \"completely\", and add a measure of your ",
" confidence from 0 to 1, with 0 representing ",
"\"not confident at all\", and 1 representing ",
"\"very confident\". Use this template to answer:",
" \n", .x," in ", .y, ": number from 0-10, ",
"confidence: number from 0-1")
)
)
)
prompts
}
prompts <- create_prompt(fh_panel)
```
We then send these prompts to the API and wait for results, using this function:
``` r
submit_openai <- function(prompt, temperature = 0.2, n = 1) {
res <- openai::create_chat_completion(model = "gpt-3.5-turbo",
messages = prompt,
temperature = temperature,
n = n)
  Sys.sleep(7) # pause briefly between requests to stay under the API rate limit
res
}
# This takes a few hours!
openai_completions <- prompts |>
purrr::map(submit_openai)
```
Note that there are several parameters you can set when generating chat completions (see [here](https://platform.openai.com/docs/api-reference/chat/create) for a full description), including the number of distinct responses generated per prompt (n), and the temperature parameter. Lower temperatures make the outputs "more deterministic" (though I don't think the outputs are ever fully deterministic), and are recommended for analytic tasks; I set it at 0.2, somewhat arbitrarily.
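Purely as an illustration (this is not part of the pipeline above), you could request several completions for a single prompt at a higher temperature to get a feel for how much the scores move around; note that the exact shape of the returned object depends on how the {openai} package parses the API response.

``` r
# Illustrative only: three completions for the first prompt at a higher
# temperature, to see how variable the generated scores are.
res <- openai::create_chat_completion(model = "gpt-3.5-turbo",
                                      messages = prompts[[1]],
                                      temperature = 0.7,
                                      n = 3)

# The generated text lives under `choices` in the parsed response
# (e.g. res$choices$message.content, depending on the package version).
res$choices
```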
I used the same prompt for Claude (the Anthropic model), but ran it twice - once with a smaller sample, and once with the full `r fh_data |> filter(year < 2021) |> nrow()` country-year observations in the Freedom House panel up to 2020.
My actual code wraps all of this in a [`targets`](https://books.ropensci.org/targets/) pipeline (see [here](https://github.com/xmarquez/GPTDemocracyIndex) for details) so that if I got disconnected or whatever I would not have to pay again to get already-submitted prompts recalculated. After OpenAI and Claude were done, I did some data wrangling to get their responses into the right tabular form (the actual code is not shown here - check out the [repo](https://github.com/xmarquez/GPTDemocracyIndex) for all the gory details, but a rough sketch follows), and voilà! Your democracy index is ready.
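The sketch below gives a flavor of that wrangling; it is illustrative rather than the actual repo code, and it assumes the completions come back in the same order as the rows of `fh_panel` and that the generated text can be pulled out of each response's `choices` element. It simply parses the "country in year: score, confidence: x" template with regular expressions.

``` r
# Illustrative sketch only (the real wrangling is in the repo): parse the
# "country in year: score, confidence: x" template out of each completion.
parse_completion <- function(res) {
  # Depending on the {openai} version, the text may instead be at
  # res$choices[[1]]$message$content.
  text <- res$choices$message.content[[1]]
  tibble::tibble(
    score = as.numeric(stringr::str_match(text, ":\\s*(\\d+(?:\\.\\d+)?)")[, 2]),
    confidence = as.numeric(
      stringr::str_match(text, "confidence:\\s*([01](?:\\.\\d+)?)")[, 2]),
    justification = text
  )
}

completions_tibble <- fh_panel |>
  dplyr::bind_cols(purrr::map_df(openai_completions, parse_completion))
```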
Here's a random sample of the generated OpenAI scores:
```{r}
#| echo: false
#| label: tbl-openai-random-sample
#| tbl-cap: "Random sample of GPT3.5's generated democracy data"
tar_load(completions_tibble)
set.seed(14)
completions_tibble |>
slice_sample(n = 3) |>
kableExtra::kbl(col.names = c("Country", "Year", "OpenAI's GPT3.5 score",
"Confidence", "Justification")) |>
kableExtra::kable_classic(lightable_options = "striped")
```
And here's a random sample from Claude:
```{r}
#| echo: false
#| label: tbl-claude-random-sample
#| tbl-cap: "Random sample of Claude's generated democracy data"
tar_load(claude_tibble2)
claude_tibble2 |>
slice_sample(n = 3) |>
kableExtra::kbl(col.names = c("Country", "Year", "Anthropic's Claude score",
"Confidence", "Justification")) |>
kableExtra::kable_classic(lightable_options = "striped")
```
## Are these democracy scores any good?
The short answer is that it depends on what you mean by "good", and what you are looking for in democracy scores. I wouldn't (yet) use them in research.
But they are about as highly correlated with the major democracy scores as those scores are among themselves (if on the lower side). For example, @fig-correlations shows that all the AI-generated democracy scores are correlated at an above-median level with the V-Dem Polyarchy index, suggesting that the actual prompt (which uses the definition of electoral democracy behind the Polyarchy index) mattered. They are also well correlated with the Economist Intelligence Unit's index, but somewhat less well correlated with Freedom House (though still better than the median correlation).
```{r}
#| echo: false
#| fig-cap: "Correlations between the scores generated by ChatGPT and Claude with major democracy measures"
#| label: fig-correlations
#| fig-width: 12
#| fig-height: 12
library(targets)
library(corrr)
tar_load(all_dem)
measures <- sort(c("v2x_polyarchy", "v2x_api", "vanhanen_democratization",
"polity2", "fh_total_reversed", "eiu", "bti_democracy",
"openai_score", "claude_score", "claude_score_2",
"wgi",
"uds_2014_mean", "pacl_update", "lexical_index"))
reduced_data <- all_dem |>
filter(measure %in% measures) |>
mutate(measure = case_when(measure == "v2x_polyarchy" ~
"V-Dem Polyarchy Index",
measure == "v2x_api" ~
"V-Dem Additive Polyarchy Index",
measure == "vanhanen_democratization" ~
"Vanhanen",
measure == "polity2" ~
"Polity2 from Polity 5",
measure == "fh_total_reversed" ~
"Freedom House PR + CL, reversed",
measure == "eiu" ~
"Economist Intelligence Unit Index",
measure == "bti_democracy" ~
"Bertelsmann Transformation Index",
measure == "openai_score" ~
"OpenAI GPT3.5 Turbo scores",
measure == "claude_score" ~
"Anthropic Claude small sample",
measure == "claude_score_2" ~
"Anthropic Claude large sample",
measure == "wgi" ~
"WGI Voice and Accountability Index",
measure == "uds_2014_mean" ~
"UDS scores, 2014 edition",
measure == "pacl_update" ~
"Updated DD measure",
measure == "lexical_index" ~
"Lexical Index of Democracy, v. 6.4",
TRUE ~ measure))
correlations <- reduced_data |>
pivot_wider(id_cols = extended_country_name:year,
names_from = measure,
values_from = value_rescaled) |>
select(-extended_country_name:-year) |>
corrr::correlate()
median_corr <- median(
correlations |>
select(-term) |>
as.matrix(), na.rm = TRUE) |>
round(2)
correlations |>
rearrange(method = "HC") |>
shave() %>%
rplot(print_cor = TRUE) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_fixed() +
scale_color_gradient2(midpoint = median_corr) +
labs(title = "Pairwise correlations among selected measures of democracy",
subtitle = str_glue("Median r is {median_corr}.",
" Higher than median correlations in blue."))
```
A simple hierarchical cluster analysis (@fig-cluster) confirms that the AI-generated scores (especially GPT 3.5's scores) are closest in their judgments to the Varieties of Democracy "additive Polyarchy index" (a variant of the Polyarchy index).
```{r}
#| label: fig-cluster
#| fig-cap: "Hierarchical cluster analysis"
#| echo: false
library(factoextra)
dist_matrix <- reduced_data |>
pivot_wider(id_cols = extended_country_name:year,
names_from = measure,
values_from = value_rescaled) |>
select(-extended_country_name:-year) |>
as.matrix() |>
t() |>
dist()
clusters <- hclust(dist_matrix)
h <- clusters$height
clusters <- dendextend::ladderize(as.dendrogram(clusters))
fviz_dend(clusters, k = 6, k_colors = viridis::viridis(16),
horiz = TRUE, rect = TRUE, labels_track_height = 10)
```
These correlations vary across time and space, however. In @fig-correlations-year-openai and @fig-correlations-year-claude, we see that these tend to be higher in more recent years and lower in the 1970s and 1980s (though for Claude there's a bit of a U shape).
```{r}
#| echo: false
#| fig-cap: "Average correlation between GPT3.5's scores and the V-Dem additive Polyarchy index, per year"
#| label: fig-correlations-year-openai
wide_data <- all_dem |>
filter(measure %in% c("openai_score",
"v2x_api")) |>
pivot_wider(id_cols = extended_country_name:year,
names_from = measure,
values_from = value_rescaled)
correlations <- wide_data |>
select(-extended_country_name:-year) |>
split(wide_data$year) |>
map_df(~correlate(.) |>
stretch(na.rm = TRUE, remove.dups = TRUE),
.id = "year")
correlations |>
mutate(year = as.numeric(year)) |>
ggplot(aes(x = year, y = r)) +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth()
```
```{r}
#| echo: false
#| fig-cap: "Average correlation between Claude's scores and the V-Dem additive Polyarchy index, per year"
#| label: fig-correlations-year-claude
wide_data <- all_dem |>
filter(measure %in% c("claude_score_2",
"v2x_api")) |>
pivot_wider(id_cols = extended_country_name:year,
names_from = measure,
values_from = value_rescaled)
correlations <- wide_data |>
select(-extended_country_name:-year) |>
split(wide_data$year) |>
map_df(~correlate(.) |>
stretch(na.rm = TRUE, remove.dups = TRUE),
.id = "year")
correlations |>
mutate(year = as.numeric(year)) |>
ggplot(aes(x = year, y = r)) +
stat_summary(fun.data = mean_cl_normal) +
geom_smooth()
```
We can also see where the AI-generated scores produce idiosyncratic judgments. Below, I calculate which countries have the highest and the lowest root mean squared errors (RMSE) between their scaled (0-1) AI scores and the V-Dem Additive Polyarchy Index (@tbl-worst-cases-for-openai).
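In symbols, the quantity computed in the chunk below for each country $c$ is

$$
\mathrm{RMSE}_c = \sqrt{\frac{1}{T_c} \sum_{t=1}^{T_c} \left( v_{c,t} - s_{c,t} \right)^2},
$$

where $v_{c,t}$ is the V-Dem Additive Polyarchy Index (`v2x_api`) for country $c$ in year $t$, $s_{c,t}$ is the AI score rescaled to 0-1, and $T_c$ is the number of coded years for that country.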
```{r}
#| label: tbl-worst-cases-for-openai
#| tbl-cap: "Most difficult cases for OpenAI's GPT3.5 (largest RMSE from V-Dem's Additive Polyarchy Index)"
#| echo: false
tar_load(combined_ai_scores)
tar_load(vdem_simple) # V-Dem comparison data (assumed to be a target in the pipeline)
rmse <- combined_ai_scores |>
filter(measure %in% c("openai_score", "claude_score_2")) |>
left_join(vdem_simple) |>
mutate(scaled_value = scales::rescale(value, to = c(0,1))) |>
group_by(extended_country_name, measure) |>
summarise(rmse = sqrt(mean((v2x_api - scaled_value)^2)))
worst_openai <- rmse |>
filter(measure == "openai_score") |>
arrange(desc(rmse)) |>
head(10)
best_openai <- rmse |>
filter(measure == "openai_score") |>
arrange(rmse) |>
head(10)
worst_openai |>
kableExtra::kbl(col.names = c("Country", "Measure", "RMSE")) |>
kableExtra::kable_classic(lightable_options = "striped")
```
The countries that GPT3.5 struggles with are also countries that other measures struggle with (@fig-hardest-openai):
```{r}
#| echo: false
#| fig-cap: "Most difficult cases for OpenAI's GPT 3.5"
#| fig-height: 10
#| label: fig-hardest-openai
all_dem |>
filter(extended_country_name %in% worst_openai$extended_country_name,
measure %in% measures,
!str_detect(measure, "claude_score")) |>
mutate(openai_score = fct_collapse(measure,
`OpenAI GPT 3.5` = "openai_score",
`VDem Additive Polyarchy Index` =
c("v2x_api"),
other_level = "Other measures")) |>
ggplot(aes(x = year, y = value_rescaled,
color = fct_rev(openai_score),
group = measure,
alpha = fct_rev(openai_score))) +
geom_point() +
geom_path() +
facet_wrap(~extended_country_name, ncol = 2) +
labs(alpha = "", color = "", y = "Scaled value (0-1)") +
scale_color_viridis_d()
```
By contrast, GPT3.5 does especially well with established democracies, even small ones (@tbl-best-cases-for-openai):
```{r}
#| label: tbl-best-cases-for-openai
#| tbl-cap: "Best cases for OpenAI's GPT3.5"
#| echo: false
best_openai |>
kableExtra::kbl(col.names = c("Country", "Measure", "RMSE")) |>
kableExtra::kable_classic(lightable_options = "striped")
```
```{r}
#| echo: false
#| fig-cap: "Easiest cases for OpenAI's GPT 3.5"
#| fig-height: 10
#| label: fig-easiest-openai
all_dem |>
filter(extended_country_name %in% best_openai$extended_country_name,
measure %in% measures,
!str_detect(measure, "claude_score")) |>
mutate(openai_score = fct_collapse(measure,
`OpenAI GPT 3.5` = "openai_score",
`VDem Additive Polyarchy Index` =
c("v2x_api"),
other_level = "Other measures")) |>
ggplot(aes(x = year, y = value_rescaled,
color = fct_rev(openai_score),
group = measure,
alpha = fct_rev(openai_score))) +
geom_point() +
geom_path() +
facet_wrap(~extended_country_name, ncol = 2) +
labs(alpha = "", color = "", y = "Scaled value (0-1)") +
scale_color_viridis_d()
```
Indeed, it's clear that both models tend to struggle with countries with "hybrid" institutions (apparently democratic but with significant problems), which can be difficult for human coders too, as is evident from @fig-vdem-rmse. But GPT 3.5 struggles less!
```{r}
#| echo: false
#| fig-cap: "Relationship between avg. V-Dem Additive Polity Index and RMSE for OpenAI's GPT3.5 and Anthropic's Claude"
#| label: fig-vdem-rmse
combined_ai_scores |>
filter(measure %in% c("openai_score", "claude_score_2")) |>
mutate(measure = case_when(measure == "openai_score" ~ "GPT 3.5",
TRUE ~ "Claude (large sample)")) |>
left_join(vdem_simple) |>
mutate(scaled_value = scales::rescale(value, to = c(0,1))) |>
group_by(extended_country_name, measure) |>
summarise(rmse = sqrt(mean((v2x_api - scaled_value)^2)),
avg_vdem = mean(v2x_api)) |>
ggplot(aes(x = avg_vdem, y = rmse, color = measure)) +
geom_point() +
geom_smooth() +
labs(x = "Average V-Dem Additive Polyarchy Index Score",
y = "Average RMSE of AI score for country")
```
In any case, Claude struggles with an entirely different set of countries from GPT 3.5 (@tbl-worst-cases-for-claude).
```{r}
#| label: tbl-worst-cases-for-claude
#| tbl-cap: "Most difficult cases for Anthropic's Claude (largest RMSE from V-Dem's Additive Polyarchy Index)"
#| echo: false
worst_claude <- rmse |>
filter(measure == "claude_score_2") |>
arrange(desc(rmse)) |>
head(10)
best_claude <- rmse |>
filter(measure == "claude_score_2") |>
arrange(rmse) |>
head(10)
worst_claude |>
kableExtra::kbl(col.names = c("Country", "Measure", "RMSE")) |>
kableExtra::kable_classic(lightable_options = "striped")
```
These are not quite as difficult for human coders to classify as the countries GPT3.5 struggles with (except perhaps for Bosnia-Herzegovina); it seems that Claude sometimes has a tendency to output a middle score when it encounters some evidence that not all is perfect with democracy (@fig-most-difficult-claude). Moreover, it seems Claude still has difficulty detecting transitions (perhaps another reason it struggled with the 1990s, @fig-correlations-year-claude).
```{r}
#| echo: false
#| fig-cap: "Most difficult cases for Anthropic's Claude"
#| fig-height: 10
#| label: fig-most-difficult-claude
all_dem |>
filter(extended_country_name %in% worst_claude$extended_country_name,
measure %in% measures,
!str_detect(measure, "openai_score")) |>
mutate(score_type = fct_collapse(measure,
                     `Anthropic's Claude` = "claude_score_2",
                     `VDem Additive Polyarchy Index` =
                       c("v2x_api"),
                     other_level = "Other measures")) |>
  ggplot(aes(x = year, y = value_rescaled,
             color = fct_rev(score_type),
             group = measure,
             alpha = fct_rev(score_type))) +
geom_point() +
geom_path() +
facet_wrap(~extended_country_name, ncol = 2) +
labs(alpha = "", color = "", y = "Scaled value (0-1)") +
scale_color_viridis_d()
```
By contrast, here are Claude's best cases - most of them highly undemocratic and showing few changes, except for Somalia (@fig-easiest-claude).
```{r}
#| echo: false
#| fig-cap: "Easiest cases for Anthropic's Claude"
#| fig-height: 10
#| label: fig-easiest-claude
all_dem |>
filter(extended_country_name %in% best_claude$extended_country_name,
measure %in% measures,
!str_detect(measure, "openai_score")) |>
mutate(score_type = fct_collapse(measure,
                     `Anthropic's Claude` = "claude_score_2",
                     `VDem Additive Polyarchy Index` =
                       c("v2x_api"),
                     other_level = "Other measures")) |>
  ggplot(aes(x = year, y = value_rescaled,
             color = fct_rev(score_type),
             group = measure,
             alpha = fct_rev(score_type))) +
geom_point() +
geom_path() +
facet_wrap(~extended_country_name, ncol = 2) +
labs(alpha = "", color = "", y = "Scaled value (0-1)") +
scale_color_viridis_d()
```
```{r}
#| echo: false
rmse_summary <- combined_ai_scores |>
filter(measure %in% c("openai_score", "claude_score_2")) |>
left_join(vdem_simple) |>
mutate(scaled_value = scales::rescale(value, to = c(0,1))) |>
group_by(measure) |>
summarise(rmse = round(sqrt(mean((v2x_api - scaled_value)^2, na.rm = TRUE)), 2))
```
Overall, GPT3.5 has the edge in accuracy over Claude, with an overall RMSE relative to the V-Dem Additive Polyarchy Index of `r rmse_summary$rmse[rmse_summary$measure == "openai_score"]` vs. `r rmse_summary$rmse[rmse_summary$measure == "claude_score_2"]`.
## How good are AI Justifications?
There is also the question of the justifications they offer. I can't evaluate them all - I don't have appropriate knowledge - but I can spot-check the justifications offered for countries that I know something about, such as Venezuela. @tbl-venezuela-justifications-openai shows the Venezuela justifications from GPT 3.5, and @tbl-venezuela-justifications-claude shows the Venezuela justifications from Claude.
```{r}
#| echo: false
#| label: tbl-venezuela-justifications-openai
#| tbl-cap: "Justifications for Venezuela by GPT3.5"
completions_tibble |>
filter(fh_country == "Venezuela") |>
kableExtra::kbl(col.names = c("Country", "Year", "OpenAI's GPT3.5 score",
"Confidence", "Justification")) |>
kableExtra::kable_classic(lightable_options = "striped")
```
```{r}
#| echo: false
#| label: tbl-venezuela-justifications-claude
#| tbl-cap: "Justifications for Venezuela by Claude"
tar_load(claude_tibble)
claude_tibble |>
filter(fh_country == "Venezuela") |>
kableExtra::kbl(col.names = c("Country", "Year", "OpenAI's GPT3.5 score",
"Confidence", "Justification")) |>
kableExtra::kable_classic(lightable_options = "striped")
```
There is a mix of accurate and inaccurate info in these justifications. GPT3.5 hallucinates Venezuelan election dates; presidential and congressional elections happened in [1973](https://en.wikipedia.org/wiki/1973_Venezuelan_general_election "1973 Venezuelan general election"), [1978](https://en.wikipedia.org/wiki/1978_Venezuelan_general_election "1978 Venezuelan general election"), [1983](https://en.wikipedia.org/wiki/1983_Venezuelan_general_election "1983 Venezuelan general election"), [1988](https://en.wikipedia.org/wiki/1988_Venezuelan_general_election "1988 Venezuelan general election"), [1993](https://en.wikipedia.org/wiki/1993_Venezuelan_general_election "1993 Venezuelan general election"), [1998](https://en.wikipedia.org/wiki/1998_Venezuelan_parliamentary_election "1998 Venezuelan parliamentary election"), [2000](https://en.wikipedia.org/wiki/2000_Venezuelan_general_election "2000 Venezuelan general election"), [2005](https://en.wikipedia.org/wiki/2005_Venezuelan_parliamentary_election "2005 Venezuelan parliamentary election"), [2010](https://en.wikipedia.org/wiki/2010_Venezuelan_parliamentary_election "2010 Venezuelan parliamentary election"), [2015](https://en.wikipedia.org/wiki/2015_Venezuelan_parliamentary_election "2015 Venezuelan parliamentary election"), and [2020](https://en.wikipedia.org/wiki/2020_Venezuelan_parliamentary_election "2020 Venezuelan parliamentary election"), but GPT 3.5 asserts they happened in 1975 (no such election), 1980 (again, no such election), 1985 (no election), and 1995 (no election). But the election of 1973 did result in Carlos Andrés Pérez winning the presidency, and though the score is a bit harsh the overall assessment for 1975 does not strike me as altogether implausible. The 1980 justification is a bit more vague, but again (aside from the fact that the election was held in 1978, not 1980) not altogether implausible. GPT3.5 also weirdly says that "In 1990, Venezuela held its first democratic presidential election in 16 years", which is silly (the election happened in 1988; there had been lots of elections in the preceding 16 years), though it does get the transfer of power from Lusinchi to Carlos Andrés Pérez correct (for 1988) and the overall assessment seems reasonable. The 1998 election was won by Chávez, who ran as the incumbent in 2000; the assessment seems reasonably accurate (but the score is high relative to earlier years with similar concerns). 2005, 2010, 2015, and 2020 seem reasonably ok justifications, with specific detail that is basically correct.
Claude's justifications are briefer and less confident. Like GPT 3.5, it also hallucinates some election dates (e.g., for 1975 and 1985) but gets other dates correct (it knows there were elections in 1978, and uses that information to talk about the situation in Venezuela in 1980). Like GPT3.5, it makes the silly claim that "Venezuela held its first democratic elections in decades in 1990" -- there must be something in the training data for both models that makes them think that - but worse, it repeats the claim for 1995, embellishing it with a weirder claim about decades of military rule ("Venezuela held its first democratic elections after decades of military rule in 1993 and 1995"). Claude is also significantly (and often without good reason) harsher than GPT 3.5 in its scoring.
If the Venezuela case is indicative of the quality of the justifications, the models still have a ways to go. (Perhaps GPT4 would do better; I'm itching to test it once I get access). I've spot-checked a few other justifications and they are often very convincing, but I imagine problems like hallucinating the dates of elections may persist for a while.
I asked both models to give estimates of their uncertainty. Claude was more likely to express genuine uncertainty (including very low confidence scores) and to say it didn't know; indeed, for a few country-years (San Marino, for example) it simply refused to answer the prompt, saying it didn't have enough knowledge. By contrast, GPT3.5 never gave a confidence score lower than 0.6, and it almost always reported a confidence of at least 0.7. In any case, neither model's confidence was clearly related to its RMSE from the V-Dem Additive Polyarchy Index (@fig-confidence), though GPT3.5 still seems slightly better calibrated (its trend line declines, with higher confidence for countries with lower RMSE). Nevertheless, Claude does show an interesting pattern: the range of its RMSE is larger when it expresses lower confidence, which seems right (its judgments being more random when it is less confident).
```{r}
#| echo: false
#| label: fig-confidence
#| fig-cap: "Average confidence expressed by the model for the scores of a country vs. RMSE for each country"
combined_ai_scores |>
filter(measure %in% c("openai_score", "claude_score_2")) |>
mutate(measure = case_when(measure == "openai_score" ~ "GPT 3.5",
TRUE ~ "Claude (large sample)")) |>
left_join(vdem_simple) |>
mutate(scaled_value = scales::rescale(value, to = c(0,1))) |>
group_by(extended_country_name, measure) |>
summarise(rmse = sqrt(mean((v2x_api - scaled_value)^2)),
avg_confidence = mean(confidence)) |>
ggplot(aes(x = avg_confidence, y = rmse, color = measure)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Average confidence expressed by the model",
y = "Average RMSE of AI score for country")
```
I also used Claude to code countries twice (once for a smaller sample, once for a larger sample). While it mostly coded countries similarly, in a few cases (136 country-years out of 1800 common country-years) it came up with new scores. The scores never changed much (almost always just 1 point out of 10); the largest change was 2 points, for 3 country-years. (Perhaps a higher temperature setting would lead to more such changes). In this sense Claude is an unusually consistent coder; even the three country-years below are ambiguous, and I could have gone either way, though Claude does seem rather harsh in its judgments.
```{r}
#| echo: false
#| label: tbl-claude-max-changes
#| tbl-cap: "Most change between the small and large run of Claude"
claude_tibble |>
mutate(sample = "small") |>
bind_rows(claude_tibble2 |>
mutate(sample = "large")) |>
pivot_wider(id_cols = c("fh_country", "year"),
names_from = sample, values_from = c(score, justification)) |>
filter(!is.na(score_small), score_small != score_large) |>
filter(abs(score_small - score_large) == max(abs(score_small - score_large))) |>
kableExtra::kbl(col.names = c("Country", "Year", "Claude score (small sample)",
"Claude score (large sample)",
"Justification (small sample)",
"Justification (large sample)")) |>
kableExtra::kable_classic(lightable_options = "striped")
```
## Further Uses
The V-Dem project already exists and publishes annually updated estimates of democracy scores, complete with component indicators and all index subcomponents. While using ChatGPT to replicate these scores does not add much value right now, it seems clear to me that over time it will become more and more feasible to use these models to generate political science data. Given the pace of change in this field, I wouldn't bet against these models becoming more and more important for producing at least quick-and-dirty data, especially in a hybrid model where experts review the AI's judgments. For example, I am currently working on generating data that extends the [Geddes, Wright, and Frantz](https://sites.psu.edu/dictators/) classification of non-democratic regimes; it would be cool if this worked well. And I'm also experimenting with other prompts, including one based on the [DD dataset](https://xmarquez.github.io/democracyData/reference/pacl.html), which seems to work relatively well for producing dichotomous measures of democracy. Indeed, one could imagine coding many of the indicators of the V-Dem project in this way, and then comparing the AI judgments to expert judgments.
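As a purely hypothetical illustration (this is not the actual DD-based prompt I've been experimenting with), a dichotomous prompt could reuse the structure of `create_prompt` above, with a user message built around the DD coding rule that democracies choose the executive and legislature through contested elections:

``` r
# Hypothetical sketch only: a DD-style user message asking for a dichotomous
# classification, not the prompt used for the scores discussed in this post.
create_dd_prompt <- function(country, year) {
  list(
    list(
      "role" = "system",
      "content" = stringr::str_c(
        "You are a political scientist with deep area knowledge of ",
        country, ". ")
    ),
    list(
      "role" = "user",
      "content" = stringr::str_c(
        "A country is classified as a democracy if its chief executive and ",
        "legislature are chosen through contested elections that an ",
        "opposition has a real chance of winning. Was ", country, " in ",
        year, " a democracy in this sense? Use this template to answer: \n",
        country, " in ", year,
        ": democracy or dictatorship, confidence: number from 0-1")
    )
  )
}
```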
Of course, this may all be "test-set leakage". But the work of generating tabular time-series data from historical sources is basically summarization, and that is a task that very large models trained on **ALL OF THE TEXT** are likely to be very good at; I *want* them to memorize what scholars say about elections and put it in nice tabular form for me!
## Data and Code
I've made the final scores and justifications generated by both models (GPT3.5 and Claude) available [here](https://github.com/xmarquez/GPTDemocracyIndex/blob/master/ai_scores.csv) for people interested in analyzing them further. The [repo](https://github.com/xmarquez/GPTDemocracyIndex/) is still a bit in flux, but it contains all the code used to generate this post.