-
Notifications
You must be signed in to change notification settings - Fork 0
/
carsaccidentproject.Rmd
575 lines (434 loc) · 18.2 KB
/
carsaccidentproject.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
Ahmed Mahmoud Hanafy
========================================================
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages that you end up using in your analysis in this code
# chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.
# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
library(ggplot2)
```
```{r echo=FALSE, Load_the_Data}
# Load the Data
df<-read.csv('C:\\Users\\MasterO\\carsaccidents\\nassCDS.csv')
#the data is found in that link
#https://vincentarelbundock.github.io/Rdatasets/csv/DAAG/nassCDS.csv
#this is a Documentation taht could de found in the coming link
#https://vincentarelbundock.github.io/Rdatasets/doc/DAAG/nassCDS.html
#nassCDS R Documentation
#Airbag and other influences on accident fatalities
#Description
#US data, for 1997-2002, from police-reported car crashes in which there is a
#harmful event (people or property), and from which at least one vehicle was
#towed. Data are restricted to front-seat occupants, include only a subset of
#the variables recorded, and are restricted in other ways also.
#Usage
#nassCDS
#Format
#A data frame with 26217 observations on the following 15 variables.
#dvcat
#ordered factor with levels (estimated impact speeds):
# 1-9km/h, 10-24, 25-39, 40-54, 55+
#weight
#Observation weights, albeit of uncertain accuracy, designed to account for
#varying sampling probabilities.
#dead
#factor with levels alive dead
#airbag
#a factor with levels none airbag
#seatbelt
#a factor with levels none belted
#frontal
#a numeric vector; 0 = non-frontal, 1=frontal impact
#sex
#a factor with levels f m
#ageOFocc
#age of occupant in years
#yearacc
#year of accident
#yearVeh
#Year of model of vehicle; a numeric vector
#abcat
#Did one or more (driver or passenger) airbag(s) deploy? This factor has levels
#deploy nodeploy unavail
#occRole
#a factor with levels driver pass
#deploy
#a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or
#more bags deployed.
#injSeverity
#a numeric vector; 0:none, 1:possible injury, 2:no incapacity, 3:incapacity,
# 4:killed; 5:unknown, 6:prior death
#caseid
#character, created by pasting together the populations sampling unit, the
#case number, and the vehicle number. Within each year, use this to uniquely
#identify the vehicle.
#Details
#Data collection used a multi-stage probabilistic sampling scheme. The
#observation weight, called national inflation factor (national) in the data
#from NASS, is the inverse of an estimate of the selection probability. These
#data include a subset of the variables from the NASS dataset. Variables that
#are coded here as factors are coded as numeric values in that dataset.
#Source
#http://www.stat.colostate.edu/~meyer/airbags.htm\ ftp://ftp.nhtsa.dot.gov/nass/
#See also\ http://www.maths.anu.edu.au/~johnm/datasets/airbags
#References
#Meyer, M.C. and Finney, T. (2005): Who wants airbags?. Chance 18:3-16.
#Farmer, C.H. 2006. Another look at Meyer and Finney's ‘Who wants airbags?’
#. Chance 19:15-22.
#Meyer, M.C. 2006. Commentary on "Another look at Meyer and Finney's
#‘Who wants airbags?’. Chance 19:23-24.
#For analyses based on the alternative FARS (Fatal Accident Recording System)
#data, and associated commentary, see:
#Cummings, P; McKnight, B, 2010. Accounting for vehicle, crash, and occupant
#characteristics in traffic crash studies. Injury Prevention 16: 363-366.
#[The relatively definitive analyses in this paper use a matched cohort design,
#Olson, CM; Cummings, P, Rivara, FP, 2006. Association of first- and
#second-generation air bags with front occupant death in car crashes: a matched
#cohort study. Am J Epidemiol 164:161-169. [The relatively definitive analyses
#in this paper use a matched cohort design, using data taken from the FARS
#(Fatal Accident Recording System) database.]
#Braver, ER; Shardell, M; Teoh, ER, 2010. How have changes in air bag designs
#affected frontal crash mortality? Ann Epidemiol 20:499-510.
#The web page http://www-fars.nhtsa.dot.gov/Main/index.aspx has a menu-based
#interface into the FARS (Fatality Analysis Recording System) data. The FARS
#database aims to include every accident in which there was at least one fatality.
#Examples
#data(nassCDS)
#xtabs(weight ~ dead + airbag, data=nassCDS)
#xtabs(weight ~ dead + airbag + seatbelt + dvcat, data=nassCDS)
#tab <- xtabs(weight ~ dead + abcat, data=nassCDS,
# subset=dvcat=="25-39"&frontal==0)[, c(3,1,2)]
#round(tab[2, ]/apply(tab,2,sum)*100,2)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, some_modifications}
# i would like to add a dummy variable of death to make it easy getting death
#rate by categories
df$death <- as.numeric(df$dead == 'dead')
#adding a factor for injuries
df$injdegree <- cut(df$injSeverity,breaks = c(-1,0,1,2,3,4,5,6),
labels = c('none','possible injury','no incapacity',
'incapacity','killed','unknown','prior death'))
#adding a factor for direction of impact
df$direction <- cut(df$frontal,breaks = c(-1,0,1),
labels = c('non-frontal', 'frontal impact'))
#adding a dummy variable for seatbelt
df$belted <- as.numeric(df$seatbelt == 'belted')
```
# introduction about the data:
the data is about Airbag and other influences on car accident fatalities and
injuries like weather the impact was frontal or not the seatbelt was belted or
not and the speed of the impact among others
# operations performed on the data
- creating a dummy variable for death making it easy to get rate of death
- adding factor variable for direction of impact easier to deal with than
frontal int variable in the data
- adding factor variable for injury degree easier to deal with than injseverity
int variable in the data
- adding a dummy variable for seatbelt
# Univariate Plots Section
## estimated impact speeds distribution
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots}
#ploting a histogram for spead
ggplot(data = df , aes(dvcat))+
geom_histogram(stat = 'count')
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#speed summary
summary(df$dvcat)
```
sound strange the most accidents occurred in the range 10-24 and the faster
above it the less accidents
## gender count
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting a bar chart for sex
ggplot(data = df , aes(sex))+
geom_histogram(stat = 'count')
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#sex summary
summary(df$sex)
```
sound that males have more accidents than females
## age distribution
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting a histogram for age
ggplot(data = df , aes(ageOFocc))+
geom_histogram(binwidth = 1)+
scale_x_continuous(breaks = seq(16,97,2))+
theme(axis.text.x = element_text(angle = 45))
```
sound that the older the less accidents with the peak between 16 and 24
## year of car model
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting a histogram for year of car model
ggplot(data = df , aes(yearVeh))+
geom_histogram(binwidth = 1)+
scale_x_continuous(breaks = seq(1950,2005,5))
```
as the data is for years between 1997 and 2002 it is normal to see more models
near that period not very old
## injury degree count
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting a bar chart for injuries count except for nulls and prior death
ggplot(data =subset(df,injdegree %in%c("none","possible injury","no incapacity",
"incapacity","killed")) ,
aes(injdegree))+
geom_histogram(stat = 'count')
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#injdegree summary
summary(df$injdegree)
```
incapacity injury is the mood after it come none injury then possible injury
then no incapacity injuries and the least is killed
# Univariate Analysis
### the structure of the dataset
- the data consists of 20 variables and 26217 observations
### the distributions of some variables
- the most accidents occurred in the range 10-24 and the fasterabove it the less
- males have more accidents than females
- the older the less accidents with the peak between 16 and 24
- the data is for years between 1997 and 2002 it is normal to see more models
near that period not very old
- incapacity injury is the mood after it come none injury then possible injury
then no incapacity injuries and the least is killed
### to do next
looking at those factors and airbag and there effect on injuries and death
# Bivariate Plots Section
## death rate by speed
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots}
#ploting speed by death rate by taking death mean as the default
ggplot(data = df ,aes(dvcat,death))+
geom_bar(stat = 'summary')
```
sound that the higher the speed the higher the death rate
## injury by speed
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting a scatter of speed by injuries except for nulls and prior death
ggplot(data =subset(df,injdegree %in%c("none","possible injury","no incapacity",
"incapacity","killed")) ,
aes(dvcat,injdegree))+
geom_point(position = 'jitter',alpha = .1)
```
sound that the higher the speed the worse the injury
## death rate by airbag catigory
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting airbag catigory by death rate by taking death mean as the default
ggplot(data = df ,aes(abcat,death))+
geom_bar(stat = 'summary')
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting airbag by death rate by taking death mean as the default
ggplot(data = df ,aes(airbag,death))+
geom_bar(stat = 'summary')
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting whether airbag was deployed or not by death rate by taking death mean
#as the default
ggplot(data = df ,aes(deploy,death))+
geom_bar(stat = 'summary')
```
sound that airbags are doing there job as death rate is lower with airbags
## death rate by seatbelt
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting seatbelt by death rate by taking death mean as the default
ggplot(data = df ,aes(seatbelt,death))+
geom_bar(stat = 'summary')+
scale_y_continuous(breaks = seq(0,.1,.01))
```
sound that seatbelts are doing there job better than airbags as death rate is
lower with belted seatbelts
## gender by death rate and seatbelt using
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting sex by death rate by taking death mean as the default
ggplot(data = df ,aes(sex,death))+
geom_bar(stat = 'summary')
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#finding the rate of death for each sex
by(df$death,df$sex,mean)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting sex by the rate of belting seat belt
ggplot(data = df ,aes(sex,belted))+
geom_bar(stat = 'summary')
```
males have higher death rate and males use seatbelt less
## death rate by age
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting age by death rate by taking death mean as the default
ggplot(data = df ,aes(ageOFocc,death))+
geom_bar(stat = 'summary')
```
looks like the higher the age the higher death rate
## death rate by model year
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting year of model of the car by death rate by taking death mean as the
#default
ggplot(data = df ,aes(yearVeh,death))+
geom_bar(stat = 'summary')
```
death rate for modern car are less than old ones in general
## drivers compared to passenger death rate and seatbelt using
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting whether driver or passenger by death rate by taking death mean as the
#default
ggplot(data = df ,aes(occRole,death))+
geom_bar(stat = 'summary')
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting whether driver or passenger by seatbelt belting rate
ggplot(data = df ,aes(occRole,belted))+
geom_bar(stat = 'summary')
```
death rate for passenger is higher than drivers and they use seatbelt less too
## direction of impact by death rate
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting direction of impact by death rate by taking death mean as the default
ggplot(data = df ,aes(direction,death))+
geom_bar(stat = 'summary')
```
frontal impact has less death rate
## injury by airbag
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting a scatter plot for airbag by injuries except nulls and prior death
ggplot(data =subset(df,injdegree %in%c("none","possible injury","no incapacity",
"incapacity","killed")) ,
aes(airbag,injdegree))+
geom_point(position = 'jitter',alpha = .1)
```
not very clear
# Bivariate Analysis
- the higher the speed the higher the death rate
- the higher the speed the worse the injury
- airbags are doing there job
- seatbelts are doing there job better than airbags
- males have higher death rate and males use seatbelt less
- the higher the age the higher death rate
- death rate for modern car are less than old ones in general
- death rate for passenger is higher than drivers and they use seatbelt less too
- frontal impact has less death rate than non-frontal ones
# Multivariate Plots Section
## death rate by airbag catigory faceted by speed
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_Plots}
#a bar plot for airbag by death rate faceted by speed
ggplot(data = df ,aes(airbag,death))+
geom_bar(stat = 'summary')+
facet_wrap(.~dvcat,nrow = 1)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#a bar plot for deploy by death rate faceted by speed
ggplot(data = df ,aes(deploy,death))+
geom_bar(stat = 'summary')+
facet_wrap(.~dvcat,nrow = 1)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#a bar plot for airbag category by death rate faceted by speed
ggplot(data = df ,aes(abcat,death))+
geom_bar(stat = 'summary')+
facet_wrap(.~dvcat,nrow = 1)+
theme(axis.text.x = element_text(angle = 45))
```
there is a problem with airbags at low spead and the death rate is higher
## airbag by count faceted by injury
```{r echo=FALSE, message=FALSE, warning=FALSE}
#a histogram for airbag faceted by injdegree
ggplot(data =subset(df,injdegree %in%c("none","possible injury","no incapacity",
"incapacity","killed")) ,
aes(airbag))+
geom_histogram(stat = 'count')+
facet_wrap(.~injdegree,nrow = 1)
```
very interesting more injuries happen with airbags than without them but death
is more with out them
## injury by speed faceted by airbag
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting a scatter of speed by injuries except for nulls and prior death
#facetted by airbag category
ggplot(data =subset(df,injdegree %in%c("none","possible injury","no incapacity",
"incapacity","killed")) ,
aes(dvcat,injdegree))+
geom_point(position = 'jitter',alpha = .1)+
facet_wrap(.~abcat)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
#ploting a scatter of speed by injuries except for nulls and prior death colored
#by airbag category
ggplot(data =subset(df,injdegree %in%c("none","possible injury","no incapacity",
"incapacity","killed")) ,
aes(dvcat,injdegree))+
geom_point(position = 'jitter',alpha = .1,aes(color = abcat))
```
it seems that airbags are working in general
## death rate by seatbelt faceted by speed
```{r echo=FALSE, message=FALSE, warning=FALSE}
#a bar plot for seatbelt by death rate faceted by speed
ggplot(data = df ,aes(seatbelt,death))+
geom_bar(stat = 'summary')+
facet_wrap(.~dvcat ,nrow = 1)
```
it seems that seatbelts are doing well across all speeds
# Multivariate Analysis
- there is a problem with airbags at low spead and the death rate is higher
- it seems that airbags are working in general
- it seems that seatbelts are doing well across all speeds
- the effect of seatbelt on fatality rate looks larger than airbags in general
- very interesting more injuries happen with airbags than without them but death
is more with out them
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_One}
# a histogram for speed
ggplot(data = df , aes(dvcat))+
geom_histogram(stat = 'count',color = 'blue', fill = 'dodgerblue4')+
ggtitle('estimated impact speeds count')+
xlab('estimated impact speed')+
scale_y_continuous(breaks = seq(0,13000,1000))
```
### Description One
sound strange the most accidents occurred in the range 10-24 and the faster
above it the less accidents
### Plot Two
```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Two}
#a bar plot for airbag by death rate faceted by speed
ggplot(data = df ,aes(airbag,death ,fill = airbag))+
geom_bar(stat = 'summary')+
facet_wrap(.~dvcat,nrow = 1)+
ggtitle('death rate by airbag faceted by speed')+
ylab('death rate')+
scale_y_continuous(breaks = seq(0,.3,.02))
```
### Description Two
- it seems that death rate is lower in presence of airbags except for low speed
impact which sound strange
- it is clear that although more accidents occure at low speed the death rate
for high speed is higher
### Plot Three
```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Three}
#a bar plot for seatbelt by death rate faceted by speed
ggplot(data = df ,aes(seatbelt,death , color = seatbelt , fill = seatbelt))+
geom_bar(stat = 'summary')+
facet_wrap(.~dvcat ,nrow = 1)+
ggtitle('death rate by seatbelt faceted by speed')+
ylab('death rate')+
scale_y_continuous(breaks = seq(0,.35,.02))
```
### Description Three
it seems that death rate is well lower when seatbelts are belted
------
# Reflection
- the data set was good easy to work with
- there is not enough quantitative variables to work with
- death rate at low speed was strange as it was higher in presence of airbag
and that need more investigation
- more work could be done on injuries and its degrees may be by extracting
dummy from it and see its rates related to other factors
- making dummy variables made it easy to compare rates
- the shortage in quantitative variable make it hard to make scatterblot
- because the is alot of catigorical variable the bar blot is the best to use