Cyclists' Safety Guide in NYC - Part One
Perform descriptive analytics to describe the dangerous ridership of cyclists in NYC.
- Where is the deadliest place for cyclist in NYC?? And Why??
- Big picture of collision incidents on the streets.
- Delve into the crashes that have injuries/fatalities of cyclists
- Observe how many percentage of cyclists suffer injuries and fatalities over all of victims
- Comparison of change rate between bike counts and injuries/fatalities of cyclists over these years
- Breaking down which borough have highest casualties over these years
- Breaking down at which month/day/time casualties concentrate
- Spatial distribution of cyclists-involved crashes
- Vehicles that normally collide with Bicycles
- What factors contribute to Brooklyn's uncontested role as the most lethal area to bike.
import pandas as pd
from IPython.display import display, HTML
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt
import datapane as dp
import altair as alt
import numpy as np
from datetime import date, time
All the datasets came from NYC Open datawebsite
df_collision_crashes = pd.read_csv('Motor_Vehicle_Collisions_-_Crashes.csv',low_memory=False)
display(df_collision_crashes.head())
display(df_collision_crashes.shape)
display(df_collision_crashes.info(show_counts=True))
There are no any na value in CRASH DATE
df_collision_crashes['CRASH DATE'].isna().sum()
Let's transform CRASH DATA column into datetime format in order to easily manipulate data.
df_collision_crashes['CRASH DATE'] = pd.to_datetime(df_collision_crashes['CRASH DATE'])
df_collision_crashes.head()
Add 'year' column
df_collision_crashes['year'] = df_collision_crashes['CRASH DATE'].dt.year
df_collision_crashes.head()
From the plot below, it seems like crash incidents increase over time, and there was a drastic surge between 2012 to 2013. Unsurprisingly, crash incidents suddenly decrease, it might be due to the COVID-19 pandemic.
ax = sns.countplot(data=df_collision_crashes, x='year')
ax.set(ylabel='crash counts')
Read another datasets
#collapse_show
df_bike_counts = pd.read_csv('Bicycle_Counts.csv',low_memory=False)
display(df_bike_counts.head())
display(df_bike_counts.shape)
df_bike_counters = pd.read_csv('Bicycle_Counters.csv',low_memory=False)
display(df_bike_counters.head())
display(df_bike_counters.shape)
Given overall number of cyclists also increase, it makes sense that crash incidents increased as well. However, it is interesting to observer that the number of overall cyclists did not decrease proportionally between 2019 to 2020 after the breakout of COVID-19, it is a point worth another investigation.
df_bike_counts['date'] = pd.to_datetime(df_bike_counts['date'])
df_bike_counts['year'] = df_bike_counts['date'].dt.year
df_bike_totalcounts = df_bike_counts.groupby(by=['year']).sum().reset_index()[['counts', 'year']]
df_bike_totalcounts['year'] = df_bike_totalcounts['year'].map(str)
ax = sns.barplot(data=df_bike_totalcounts, x='year', y='counts')
ax.set(ylabel='Bike counts')
ax = sns.barplot(data=df_bike_totalcounts, x='year', y='counts')
ax.set(ylabel='Bike counts')
df_bike_counts.head()
casualties_plus_year_cols = ['NUMBER OF CYCLIST INJURED',
'NUMBER OF CYCLIST KILLED',
'NUMBER OF PERSONS INJURED',
'NUMBER OF PERSONS KILLED',
'NUMBER OF PEDESTRIANS INJURED',
'NUMBER OF PEDESTRIANS KILLED',
'NUMBER OF MOTORIST INJURED',
'NUMBER OF MOTORIST KILLED'
]
df_casualties = df_collision_crashes.groupby(by=[df_collision_crashes.year]).sum()[casualties_plus_year_cols]
df_casualties['year'] = df_casualties.index.map(str)
df_casualties = df_casualties.reset_index(drop=True)
df_casualties['NUMBER of TOTAL CASUALTIES'] = df_casualties[['NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED']].sum(axis=1)
df_casualties['NUMBER of TOTAL CYCLISTS CASUALTIES'] = df_casualties[['NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED']].sum(axis=1)
df_casualties['NUMBER of TOTAL PEDESTRIANS CASUALTIES'] = df_casualties[['NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF PEDESTRIANS KILLED']].sum(axis=1)
df_casualties['NUMBER of TOTAL MOTORIST CASUALTIES'] = df_casualties[['NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED']].sum(axis=1)
# df_casualties.drop(columns=casualties_plus_year_cols[2:], inplace=True)
df_casualties.head()
df_bike_totalcounts.head()
# df_casualties merge with df_bike_totalcounts
df_casualties = df_casualties.merge(df_bike_totalcounts, on='year', how='right')
df_casualties
2019 is clearly the deadliest year for cyclists. However, given growing number of injured cyclists, it is still not clear why there is drastic change of fatal accidents in both 2018 and 2019. And even though 2020 had less fatal incidents than 2019, more injuries actually arose in 2020.
f, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 8))
sns.barplot(data=df_casualties, x='year', y='NUMBER OF CYCLIST INJURED', ax=axes[0])
sns.barplot(data=df_casualties, x='year', y='NUMBER OF CYCLIST KILLED', ax=axes[1])
plt.show()
# ax.set(ylabel='Bike counts')
The figure below show how many percentage of people are cyclists among all the injuries and fatalities.
from pdb import set_trace
def annotate_percentage(axis, df):
for i, p in enumerate(axis.patches):
if i > 9 and i < 20:
percentage = '{:.1f}%'.format(100 * df['NUMBER of TOTAL CYCLISTS CASUALTIES'][i - 10] / df['NUMBER of TOTAL CASUALTIES'][i - 10])
x = p.get_width()
y = p.get_y() + p.get_height() - 0.2
axis.annotate(percentage, (x, y), size = 200)
else:
continue
sns.set_theme(style="whitegrid")
# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(130, 80))
# Plot the total crashes
sns.set_color_codes("pastel")
sns.barplot(x="NUMBER of TOTAL CASUALTIES", y="year", data=df_casualties,
label="Total Casualties", color="b")
# Plot the crashes where cyclists were victims
sns.set_color_codes("muted")
sns.barplot(x="NUMBER of TOTAL CYCLISTS CASUALTIES", y="year", data=df_casualties,
label="Total Cyclists Casualties", color="b")
annotate_percentage(ax, df_casualties)
# Add a legend and informative axis label
ax.legend(ncol=2, loc="upper right", frameon=True, fontsize='120')
ax.set(xlim=(0, 65000), ylabel="years",
xlabel="Number of Casualties")
ax.set_ylabel("years", fontsize=200)
ax.set_xlabel("Number of Casualties", fontsize=200)
ax.tick_params(axis='x', labelsize=80)
ax.tick_params(axis='y', labelsize=80)
sns.despine(left=True, bottom=True)
In the plot below, we ignore data from 2012 and 2021 because this dataset is not yet completely collected at the time of data collection.
Conclusion: Casualties trend roughly follow the change of bike counts all around new york city. Essentially, the more new yorkers biking on the street more casualties of cyclists produced. And the most peculiar thing is that the number of cyclists casualties still rise even when the bike counts decrease in 2019. That is how the issue of cyclists safety in new york city get noticed by news coverage and city government even consider the idea of licensing of biking in the city.
plt.figure(figsize=(15, 10))
sns.set_theme(style="whitegrid")
ax = sns.barplot(data=df_casualties[(df_casualties.year != '2012') & (df_casualties.year != '2021')], x="year", y='counts', color='g')
ax.set_ylabel('Bike Counts', color='g', fontsize=20)
ax.set_xlabel('year', fontsize=20)
ax2 = plt.twinx()
sns.lineplot(data=df_casualties[(df_casualties.year != '2012') & (df_casualties.year != '2021')], x="year", y='NUMBER of TOTAL CYCLISTS CASUALTIES', color='r', ax=ax2)
ax2.set_ylabel('Number of Cyclists Casualties', color='r', fontsize=20)
ax2.spines['right'].set_color('r')
ax2.spines['right'].set_linewidth(3)
ax2.spines['left'].set_color('g')
ax2.spines['left'].set_linewidth(3)
# sns.despine(left=True, bottom=True)
print(f'There are roughly {(100 * df_collision_crashes.BOROUGH.isna().sum() / len(df_collision_crashes.BOROUGH)):.2f}% missing instances in BOUROUGH column')
display(df_collision_crashes.groupby(by=['BOROUGH']).sum()[casualties_plus_year_cols].style.set_caption("Sum of Casualties by Bourough from 2012 to 2021").set_table_styles([{
'selector': 'caption',
'props': [
('font-size', '16px')
]
}]))
print('======================================================================')
display(df_collision_crashes.groupby(by=['BOROUGH', 'year']).sum()[casualties_plus_year_cols].style.set_caption("Sum of Casualties by Bourough and Years(2012 to 2021)").set_table_styles([{
'selector': 'caption',
'props': [
('font-size', '16px')
]
}]))
# df_cyclists_borough_yearly = df_collision_crashes.groupby(by=['BOROUGH', 'year']).sum()[casualties_plus_year_cols].reset_index()[['NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED', 'year', 'BOROUGH']]
df_cyclists_borough_yearly = df_collision_crashes.groupby(by=['BOROUGH', 'CRASH DATE']).sum()[casualties_plus_year_cols].reset_index()[['NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED', 'CRASH DATE', 'BOROUGH']]
df_cyclists_borough_yearly['Total Casualties of Cyclists'] = df_cyclists_borough_yearly[['NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED']].sum(axis=1)
df_cyclists_borough_yearly.head()
Overall, BROOKLYN has the highest casualities
alt.data_transformers.disable_max_rows()
plot1 = alt.Chart(df_cyclists_borough_yearly).mark_bar().encode(
x='year(CRASH DATE):O',
y='sum(Total Casualties of Cyclists):Q',
color='BOROUGH:N',
).facet(
facet='BOROUGH:N',
columns=3
)
dp.Report(
dp.Plot(plot1),
).upload(name="Plot1")
HTML('<iframe src="https://datapane.com/u/jason0/reports/M7bQOm3/plot1/embed/" width="60%" height="540px" style="border: none;">IFrame not supported</iframe>')
plot2 = alt.Chart(df_cyclists_borough_yearly).mark_bar().encode(
x='year(CRASH DATE):O',
y='sum(NUMBER OF CYCLIST KILLED):Q',
color='BOROUGH:N',
).facet(
facet='BOROUGH:N',
columns=3
)
dp.Report(
dp.Plot(plot2),
).upload(name="Plot2")
HTML('<iframe src="https://datapane.com/u/jason0/reports/q34OzGA/plot2/embed/" width="60%" height="540px" style="border: none;">IFrame not supported</iframe>')
In conclusion, Brooklyn take up a significant proportion of casualties both in injuries and deaths. All of boroughs slightly show peaks between 2014 to 2017 and 2019 to 2020.
death_cols = ['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'LATITUDE',
'LONGITUDE', 'LOCATION', 'NUMBER OF CYCLIST INJURED',
'NUMBER OF CYCLIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5',
'year', 'ZIP CODE']
df_casualties_with_time = df_collision_crashes[df_collision_crashes[['NUMBER OF CYCLIST KILLED', 'NUMBER OF CYCLIST INJURED']].sum(axis=1) != 0][death_cols]#.reset_index(drop=True)
df_casualties_with_time['CRASH DATE'] = pd.to_datetime(df_casualties_with_time[['CRASH DATE', 'CRASH TIME']].apply(lambda x: ' '.join(x.values.astype(str)), axis="columns"))
df_casualties_with_time_location = df_casualties_with_time.copy()
df_casualties_with_time.drop(columns=['CRASH TIME', 'LOCATION', 'LATITUDE', 'LONGITUDE'], inplace=True)
df_casualties_with_time.head()
Month versus Year distribution of Injured Cyclists
First, analyze the most dangerous borough to cycling, that is Brooklyn.
plot3 = alt.Chart(df_casualties_with_time).mark_circle().encode(
alt.X('month(CRASH DATE):O'),
alt.Y('year(CRASH DATE):O'),
size=alt.Size('sum(NUMBER OF CYCLIST INJURED):Q', legend=alt.Legend(title='Sum of Injured Cyclists')),
color='BOROUGH:N',
# column='BOROUGH:N'
facet=alt.Facet('BOROUGH:N', columns=2)
).properties(
title='Month versus Year distribution of Injured Cyclists')
dp.Report(
dp.Plot(plot3),
).upload(name="Plot3")
HTML('<iframe src="https://datapane.com/u/jason0/reports/dA68E47/plot3/embed/" width="60%" height="540px" style="border: none;">IFrame not supported</iframe>')
Month versus Year distribution of Killed Cyclists
plot4 = alt.Chart(df_casualties_with_time).mark_circle().encode(
alt.X('month(CRASH DATE):O'),
alt.Y('year(CRASH DATE):O'),
size=alt.Size('sum(NUMBER OF CYCLIST KILLED):Q', legend=alt.Legend(title='Sum of Killed Cyclists')),
color='BOROUGH:N',
).facet(
facet='BOROUGH:N',
columns=2,
).properties(
title='Month versus Year distribution of Killed Cyclists')
dp.Report(
dp.Plot(plot4),
).upload(name="Plot4")
HTML('<iframe src="https://datapane.com/u/jason0/reports/vAqwOm7/plot4/embed/" width="50%" height="450px" style="border: none;">IFrame not supported</iframe>')
Time versus Day distribution of Injured Cyclists
plot5 = alt.Chart(df_casualties_with_time).mark_circle().encode(
alt.X('hours(CRASH DATE):O'),
alt.Y('day(CRASH DATE):O'),
size=alt.Size('sum(NUMBER OF CYCLIST INJURED):Q', legend=alt.Legend(title='Sum of Injured Cyclists')),
color='BOROUGH:N',
).facet(
facet='BOROUGH:N',
columns=2,
).properties(
title='Time versus Day distribution of Injured Cyclists')
dp.Report(
dp.Plot(plot5),
).upload(name="Plot5")
HTML('<iframe src="https://datapane.com/u/jason0/reports/9AxprGA/plot5/embed/" width="80%" height="540px" style="border: none;">IFrame not supported</iframe>')
Time versus Day distribution of Killed Cyclists
plot6 = alt.Chart(df_casualties_with_time).mark_circle().encode(
alt.X('hours(CRASH DATE):O'),
alt.Y('day(CRASH DATE):O'),
size=alt.Size('sum(NUMBER OF CYCLIST KILLED):Q', legend=alt.Legend(title='Sum of Killed Cyclists')),
color='BOROUGH:N',
).facet(
facet='BOROUGH:N',
columns=2,
).properties(
title='Time versus Day distribution of Killed Cyclists')
dp.Report(
dp.Plot(plot6),
).upload(name="Plot6")
HTML('<iframe src="https://datapane.com/u/jason0/reports/wAw2ylk/plot6/embed/" width="80%" height="540px" style="border: none;">IFrame not supported</iframe>')
print(f'Totoally, there are {(100 * df_casualties_with_time_location[~((df_casualties_with_time_location.LONGITUDE.between(-74.259, -73.59) & (df_casualties_with_time_location.LATITUDE.between(40.49, 40.93))))].shape[0] / df_casualties_with_time_location.shape[0]):.2f} % rows of data are anomalous(null or unusual value) Location data')
We define location column to be False when LOGITUDE and LATITUDE information is not available.
df_casualties_with_time_location['location'] = False
# Limit all of location infos to legitimate range of New York City
df_casualties_with_time_location.loc[(df_casualties_with_time_location.LONGITUDE.between(-74.259, -73.59) & df_casualties_with_time_location.LATITUDE.between(40.49, 40.93)), 'location'] = True
df_casualties_with_time_location.location.value_counts()
To quickly examine the distribution of these small part of unspecified LOCATION datasets, we plot 4 charts below with respect to time(year, month, day, hours)/Borough/casualties), and we observe:
- In terms of time-related columns, these small parts of dataset did not look like to be skewing into any part of values.
- For those datapoints which are null value in BOROUGH column, they incline to have unspecified LOCATION as well.
base = alt.Chart(df_casualties_with_time_location)
# xscale = alt.Scale(domain=(4.0, 8.0))
# yscale = alt.Scale(domain=(1.9, 4.55))
area_args = {'opacity': .3, 'interpolate': 'step'}
points = base.mark_circle().encode(
alt.X('year:O'),
alt.Y('hours(CRASH DATE):T'),
color='location'
)
top_hist = base.mark_area(**area_args).encode(
alt.X('year:O',
# when using bins, the axis scale is set through
# the bin extent, so we do not specify the scale here
# (which would be ignored anyway)
stack=None,
title=''
),
alt.Y('count()', stack=None, title=''),
alt.Color('location:N'),
).properties(height=100)
right_hist = base.mark_area(**area_args).encode(
alt.Y('hours(CRASH DATE):T',
stack=None,
title='',
),
alt.X('count()', stack=None, title=''),
alt.Color('location:N'),
).properties(width=100)
plot7 = (top_hist & (points | right_hist))
dp.Report(
dp.Plot(plot7),
).upload(name="Plot7")
HTML('<iframe src="https://datapane.com/u/jason0/reports/8AVqaEk/plot7/embed/" width="50%" height="540px" style="border: none;">IFrame not supported</iframe>')
base = alt.Chart(df_casualties_with_time_location)
# xscale = alt.Scale(domain=(4.0, 8.0))
# yscale = alt.Scale(domain=(1.9, 4.55))
area_args = {'opacity': .3, 'interpolate': 'step'}
points = base.mark_circle().encode(
alt.X('month(CRASH DATE):O'),
alt.Y('day(CRASH DATE):O'),
color='location'
)
top_hist = base.mark_area(**area_args).encode(
alt.X('month(CRASH DATE):O',
# when using bins, the axis scale is set through
# the bin extent, so we do not specify the scale here
# (which would be ignored anyway)
stack=None,
title=''
),
alt.Y('count()', stack=None, title=''),
alt.Color('location:N'),
).properties(height=100)
right_hist = base.mark_area(**area_args).encode(
alt.Y('day(CRASH DATE):O',
stack=None,
title='',
),
alt.X('count()', stack=None, title=''),
alt.Color('location:N'),
).properties(width=100)
plot8 = (top_hist & (points | right_hist))
dp.Report(
dp.Plot(plot8),
).upload(name="Plot8")
HTML('<iframe src="https://datapane.com/u/jason0/reports/dA9b0O3/plot8/embed/" width="50%" height="540px" style="border: none;">IFrame not supported</iframe>')
base = alt.Chart(df_casualties_with_time_location)
# xscale = alt.Scale(domain=(4.0, 8.0))
# yscale = alt.Scale(domain=(1.9, 4.55))
area_args = {'opacity': .3, 'interpolate': 'step'}
points = base.mark_circle().encode(
alt.X('NUMBER OF CYCLIST KILLED:N'),
alt.Y('BOROUGH:N'),
color='location'
)
top_hist = base.mark_area(**area_args).encode(
alt.X('BOROUGH:N',
# when using bins, the axis scale is set through
# the bin extent, so we do not specify the scale here
# (which would be ignored anyway)
stack=None,
title=''
),
alt.Y('count()', stack=None, title=''),
alt.Color('location:N'),
).properties(height=100)
right_hist = base.mark_area(**area_args).encode(
alt.Y('NUMBER OF CYCLIST KILLED:N',
stack=None,
title='',
),
alt.X('count()', stack=None, title=''),
alt.Color('location:N'),
).properties(width=100)
plot9 = (top_hist & (points | right_hist))
dp.Report(
dp.Plot(plot9),
).upload(name="Plot9")
HTML('<iframe src="https://datapane.com/u/jason0/reports/nkKzMy7/plot9/embed/" width="50%" height="540px" style="border: none;">IFrame not supported</iframe>')
base = alt.Chart(df_casualties_with_time_location)
# xscale = alt.Scale(domain=(4.0, 8.0))
# yscale = alt.Scale(domain=(1.9, 4.55))
area_args = {'opacity': .3, 'interpolate': 'step'}
points = base.mark_circle().encode(
alt.X('NUMBER OF CYCLIST INJURED:O'),
alt.Y('BOROUGH:N'),
color='location'
)
top_hist = base.mark_area(**area_args).encode(
alt.X('NUMBER OF CYCLIST INJURED:O',
# when using bins, the axis scale is set through
# the bin extent, so we do not specify the scale here
# (which would be ignored anyway)
stack=None,
title=''
),
alt.Y('count()', stack=None, title=''),
alt.Color('location:N'),
).properties(height=100)
right_hist = base.mark_area(**area_args).encode(
alt.Y('BOROUGH:N',
stack=None,
title='',
),
alt.X('count()', stack=None, title=''),
alt.Color('location:N'),
).properties(width=100)
plot10 = (top_hist & (points | right_hist))
dp.Report(
dp.Plot(plot10),
).upload(name="Plot10")
HTML('<iframe src="https://datapane.com/u/jason0/reports/P3XKWMA/plot10/embed/" width="50%" height="540px" style="border: none;">IFrame not supported</iframe>')
df_casualties_with_time_location = df_casualties_with_time_location[(df_casualties_with_time_location.LONGITUDE.between(-74.259, -73.59) & (df_casualties_with_time_location.LATITUDE.between(40.49, 40.93)))]
Here, we ignore 7.51 % of datapoints from those unspecified location datapoints in order to visualize spacial distribution of cyclists crashes in new york city below.
The plot below enable the adjustable visualzation of yearly crashes in five boroughs of NYC. Note: It is recommended to interact with plots below online by clicking buttom left button due to performace reason
boroughs = alt.topo_feature('https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=GeoJSON', 'boroughs')
background = alt.Chart(boroughs).mark_geoshape(
fill='lightgray',
stroke='white',
strokeWidth=2
).encode(
color=alt.value('#eee'),
).properties(
width=1000,
height=800
)
points = alt.Chart(df_casualties_with_time_location[df_casualties_with_time_location.location == True]).mark_circle(
size=10,
color='steelblue'
).encode(
longitude='LONGITUDE:Q',
latitude='LATITUDE:Q',
color='BOROUGH:N',
tooltip=['CRASH TIME', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED', 'VEHICLE TYPE CODE 1',
'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4',
'VEHICLE TYPE CODE 5', 'CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2',
'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4',
'CONTRIBUTING FACTOR VEHICLE 5']
# A Slider filter
year_slider = alt.binding_range(min=2012, max=2021, step=1)
slider_selection = alt.selection_single(bind=year_slider, fields=['year'], name="Yearly Crashes")
filter_year = points.add_selection(
slider_selection
).transform_filter(
slider_selection
).properties(title="Slider Filtering")
(background + filter_year)# & bars
HTML('<iframe src="https://datapane.com/u/jason0/reports/d7dXOZ7/plot11/embed/" width="100%" height="540px" style="border: none;">IFrame not supported</iframe>')
boroughs = alt.topo_feature('https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=GeoJSON', 'boroughs')
background = alt.Chart(boroughs).mark_geoshape(
fill='lightgray',
stroke='white',
strokeWidth=2
).encode(
color=alt.value('#eee'),
).properties(
width=1000,
height=800
)
points = alt.Chart(df_casualties_with_time_location[df_casualties_with_time_location.location == True]).mark_circle(
size=20,
color='steelblue'
).encode(
longitude='LONGITUDE:Q',
latitude='LATITUDE:Q',
color='BOROUGH:N',
tooltip=['CRASH TIME', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED', 'VEHICLE TYPE CODE 1',
'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4',
'VEHICLE TYPE CODE 5', 'CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2',
'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4',
'CONTRIBUTING FACTOR VEHICLE 5']
).transform_filter(
(alt.datum['NUMBER OF CYCLIST KILLED'] != 0)
)
# A Slider filter
year_slider = alt.binding_range(min=2012, max=2021, step=1)
slider_selection = alt.selection_single(bind=year_slider, fields=['year'], name="Yearly Crashes")
filter_year = points.add_selection(
slider_selection
).transform_filter(
slider_selection
).properties(title="Slider Filtering")
plot12 = background + filter_year
dp.Report(
dp.Plot(plot12),
).upload(name="Plot12")
HTML('<iframe src="https://datapane.com/u/jason0/reports/MA1pPYk/plot12/embed/" width="100%" height="540px" style="border: none;">IFrame not supported</iframe>')
df_casualties_with_time.groupby(by=['BOROUGH', 'CONTRIBUTING FACTOR VEHICLE 1', 'year']).sum().sort_values(by=['NUMBER OF CYCLIST INJURED'], ascending=False).loc['BROOKLYN']
In these charts below, we can see what kinds of vehicles are responsible for injuries or deaths in most crash events. Noticeably, it also have shown that vehicles casuing more lethal occurences are not necessarily the same as dangerous(cause more injuries than death) ones. Based on different borough, more bulky or speedy vehicles are actually triggering more deaths.
Specifically, among these vehicles, Passenger Vehicle, Sedan, Station Wagon/SUV are main culprits causing lots of casualties. Additionally, crashes with bus and box truck give rise to higher death deaths.
df_casualties_with_time.head()
brooklyn = alt.Chart(
df_casualties_with_time,
).mark_bar().encode(
x=alt.X('VEHICLE TYPE CODE 1:N', sort='-y'),
y=alt.Y('sum(NUMBER OF CYCLIST INJURED):Q'),
color=alt.Color('sum(NUMBER OF CYCLIST KILLED):Q')
).transform_filter(
(alt.datum.BOROUGH == 'BROOKLYN')# & (alt.datum.year < 2019)
).properties(
title='Brookyn')
bronx = alt.Chart(
df_casualties_with_time,
).mark_bar().encode(
x=alt.X('VEHICLE TYPE CODE 1:N', sort='-y'),
y=alt.Y('sum(NUMBER OF CYCLIST INJURED):Q'),
color=alt.Color('sum(NUMBER OF CYCLIST KILLED):Q')
).transform_filter(
(alt.datum.BOROUGH == 'BRONX')# & (alt.datum.year < 2019)
).properties(
title='Bronx')
manhattan = alt.Chart(
df_casualties_with_time,
).mark_bar().encode(
x=alt.X('VEHICLE TYPE CODE 1:N', sort='-y'),
y=alt.Y('sum(NUMBER OF CYCLIST INJURED):Q'),
color=alt.Color('sum(NUMBER OF CYCLIST KILLED):Q')
).transform_filter(
(alt.datum.BOROUGH == 'MANHATTAN')# & (alt.datum.year < 2019)
).properties(
title='Manhattan')
queens = alt.Chart(
df_casualties_with_time,
).mark_bar().encode(
x=alt.X('VEHICLE TYPE CODE 1:N', sort='-y'),
y=alt.Y('sum(NUMBER OF CYCLIST INJURED):Q'),
color=alt.Color('sum(NUMBER OF CYCLIST KILLED):Q')
).transform_filter(
(alt.datum.BOROUGH == 'QUEENS')# & (alt.datum.year < 2019)
).properties(
title='Queens')
staten = alt.Chart(
df_casualties_with_time,
).mark_bar().encode(
x=alt.X('VEHICLE TYPE CODE 1:N', sort='-y'),
y=alt.Y('sum(NUMBER OF CYCLIST INJURED):Q'),
color=alt.Color('sum(NUMBER OF CYCLIST KILLED):Q', legend=alt.Legend(title='NUMBER OF CYCLIST KILLED'))
).transform_filter(
(alt.datum.BOROUGH == 'STATEN ISLAND')# & (alt.datum.year < 2019)
).properties(
title='Statne Island')
dp.Report(
dp.Page(
title='Brooklyn',
blocks=["### Plot", brooklyn]
),
dp.Page(
title='Bronx',
blocks=["### Plot", bronx]
),
dp.Page(
title='Manhattan',
blocks=["### Plot", manhattan]
),
dp.Page(
title='Queens',
blocks=["### Plot", queens]
),
dp.Page(
title='Staten',
blocks=["### Plot", staten]
),
).upload(name="Plot13")
HTML('<iframe src="https://datapane.com/u/jason0/reports/Y3YVnM7/plot13/embed/" width="100%" height="540px" style="border: none;">IFrame not supported</iframe>')
We also investigate contributing factors in cylists involved collisions. Common factors include:
- Driver Inattention/Distraction
- Failure to Yield Right-of-Way
- Pedestrian/Bicyclist/Other Pedestrian Error/Confusion
Unfortunately, the majority of contributing factors from collisions data are Unspecified.
brooklyn = alt.Chart(
df_casualties_with_time,
).mark_bar().encode(
x=alt.X('CONTRIBUTING FACTOR VEHICLE 1:N', sort='-y'),
y=alt.Y('sum(NUMBER OF CYCLIST INJURED):Q'),
color=alt.Color('sum(NUMBER OF CYCLIST KILLED):Q')
).transform_filter(
(alt.datum.BOROUGH == 'BROOKLYN')# & (alt.datum.year < 2019)
).properties(
title='Brookyn')
bronx = alt.Chart(
df_casualties_with_time,
).mark_bar().encode(
x=alt.X('CONTRIBUTING FACTOR VEHICLE 1:N', sort='-y'),
y=alt.Y('sum(NUMBER OF CYCLIST INJURED):Q'),
color=alt.Color('sum(NUMBER OF CYCLIST KILLED):Q')
).transform_filter(
(alt.datum.BOROUGH == 'BRONX')# & (alt.datum.year < 2019)
).properties(
title='Bronx')
manhattan = alt.Chart(
df_casualties_with_time,
).mark_bar().encode(
x=alt.X('CONTRIBUTING FACTOR VEHICLE 1:N', sort='-y'),
y=alt.Y('sum(NUMBER OF CYCLIST INJURED):Q'),
color=alt.Color('sum(NUMBER OF CYCLIST KILLED):Q')
).transform_filter(
(alt.datum.BOROUGH == 'MANHATTAN')# & (alt.datum.year < 2019)
).properties(
title='Manhattan')
queens = alt.Chart(
df_casualties_with_time,
).mark_bar().encode(
x=alt.X('CONTRIBUTING FACTOR VEHICLE 1:N', sort='-y'),
y=alt.Y('sum(NUMBER OF CYCLIST INJURED):Q'),
color=alt.Color('sum(NUMBER OF CYCLIST KILLED):Q')
).transform_filter(
(alt.datum.BOROUGH == 'QUEENS')# & (alt.datum.year < 2019)
).properties(
title='Queens')
staten = alt.Chart(
df_casualties_with_time,
).mark_bar().encode(
x=alt.X('CONTRIBUTING FACTOR VEHICLE 1:N', sort='-y'),
y=alt.Y('sum(NUMBER OF CYCLIST INJURED):Q'),
color=alt.Color('sum(NUMBER OF CYCLIST KILLED):Q', legend=alt.Legend(title='NUMBER OF CYCLIST KILLED'))
).transform_filter(
(alt.datum.BOROUGH == 'STATEN ISLAND')# & (alt.datum.year < 2019)
).properties(
title='Statne Island')
dp.Report(
dp.Page(
title='Brooklyn',
blocks=["### Plot", brooklyn]
),
dp.Page(
title='Bronx',
blocks=["### Plot", bronx]
),
dp.Page(
title='Manhattan',
blocks=["### Plot", manhattan]
),
dp.Page(
title='Queens',
blocks=["### Plot", queens]
),
dp.Page(
title='Staten',
blocks=["### Plot", staten]
),
).upload(name="Plot14")
HTML('<iframe src="https://datapane.com/u/jason0/reports/E7ylGg3/plot14/embed/" width="100%" height="540px" style="border: none;">IFrame not supported</iframe>')
We currently have already known what sorts of vehicles and main contributing factors were held accountable for vehemently high injuries/fatalities.
There are many aspects to consider reasons why brooklyn has much higher injuries/fatalities.
- street traffic violation record
- street design(road type)
- traffic sign design
- etc..
In this branch, we focus on traffic violation record.
df_summons = pd.read_csv('NYPD_B_Summons__Historic_.csv')
df_summons.head()
In this dataset, we have only violation records in 2020, so we only explore this dataset in 2020 from the perspective of moving violation summons.
df_violatecode = pd.read_excel('violationcodes.xlsx')
df_violatecode.head()
df_violatecode['ADJ Code'] = df_violatecode['ADJ Code'].astype('str')
# Check if ADJ Code column has any int type value left
print(len([c for c in df_violatecode['ADJ Code'] if type(c) == int]))
# Check if VIOLATION_CODE column in df_summons still are str dtypes
print(len([c for c in df_summons if type(c) == int]))
viocode_dict = df_violatecode.set_index('ADJ Code').DESCRIPTION.to_dict()
# Fix data value error
viocode_dict['37524A'] = viocode_dict['37524AB']
viocode_dict['408E5'] = viocode_dict['408E5 ']
del viocode_dict['408E5 ']
# Create a new column in df_summons with created dictionary.
df_summons['DESCRIPTION'] = df_summons.VIOLATION_CODE.map(viocode_dict)
df_summons['can be issued to a bicyclist?'] = df_summons.VIOLATION_CODE.map(df_violatecode.set_index('ADJ Code')['can be issued to a bicyclist?'].to_dict())
display(df_summons.shape)
By and large, Brooklyn has highest violation record from both CAR/SUV and Truck/BUS which also account for annual death of cyclists in other four boroughs. PS: There is no such borough called New York, it is a problem from datasets. And those locations actually could came from anywhere in New York city.
plt.figure(figsize=(20, 10))
sns.set_theme(style="whitegrid")
g = sns.countplot(
data=df_summons,
y='CITY_NM', hue='VEH_CATEGORY',
)
Interactive bar plots below lists top 20 traffic violation conducts from highest to lowest, viewer can also play around radio buttons to see how changes vary depends on your chosen conditions. In conlusion, although Queens had highest over speeding viloations, brooklyn on the other hand had higher number of records from a variety of violation conducts such as speeding/disobeyed traffic device/operate vehicle while using electronic device/failt to yield cyclists and pedastrian that can explain why the contributing factor Driver Inattention/Distraction of crashes in brooklyn dominate other factors and is significant all over other boroughs.
df_summons_top20_tickets = df_summons.loc[df_summons.VIOLATION_CODE.isin(df_summons.VIOLATION_CODE.value_counts()[:20].index)][['CITY_NM', 'DESCRIPTION']]
df_summons_top20_tickets.head()
alt.data_transformers.enable('default')
alt.data_transformers.disable_max_rows()
bars = alt.Chart(
df_summons_top20_tickets
).mark_bar().encode(
x=alt.X('DESCRIPTION:N', sort='-y'),
y=alt.Y('count():Q'),
color=alt.Color('count():Q'),
tooltip=['DESCRIPTION']
)
boroughs = ['BRONX', 'MANHATTAN', 'BROOKLYN', 'QUEENS']#, 'STATEN ISLAND', 'NEW YORK']
borough_radio = alt.binding_radio(options=boroughs)
borough_select = alt.selection_single(fields=['CITY_NM'], bind=borough_radio, name="Boroughs", init={'CITY_NM': 'BRONX'})
filter_borough = bars.add_selection(
borough_select
).transform_filter(
borough_select
).properties(title='Violation Code Bar change by Boroughs')
dp.Report(
dp.Plot(filter_borough),
).upload(name="Plot17")
HTML('<iframe src="https://datapane.com/u/jason0/reports/87NGoVk/plot17/embed/" width="100%" height="540px" style="border: none;">IFrame not supported</iframe>')
From the perspective of moving violation from streets in NYC, it is not so hard to imagine that Brooklyn with such high numbers of violations consistently reflect its hostile environment for cyclists and pedastrians every year. Particularly, moving violations of bicycles was not higher than Manhattan and Queens, however, moving violations of CAR/SUVs in Brooklyn is the highest among boroughs.
Altogether we can safely say Brooklyn is a relatively dangerous place to ride a bike, and we also get a slight idea why brookln always pose potential threats to cyclists. With all that being said, in this analysis we still have not found other factors contribute to Brooklyn's high crashes rate such as street design, traffic sign maintaince, etc..
It wil be intersting to continue dig into this topics, all the insights gotten from analysis would have a potential application for effective city planing. Please wait for future analysis.