This is the concluding part of my article devoted to a statistical analysis of police shootings and criminality among the white and the black population of the United States. In the first part, we talked about the research background, goals, assumptions, and source data; in the second part, we investigated the national use-of-force and crime data and tracked their connection with race.
Let's recall the intermediate inferences that we were able to make from the available data for 2000 — 2018:
- White police victims outnumber black victims in absolute figures.
- Use of lethal force results in an average of 5.9 deaths per one million Blacks and 2.3 deaths per one million Whites (the Black per capita rate is 2.6 times greater).
- Year-to-year scatter in Black lethal force fatalities is nearly twice the scatter in White fatalities.
- White fatalities grow continuously from year to year (by 0.1 — 0.2 per million on average), while Black fatalities rolled back to their 2009 level after climaxing in 2011 — 2013.
- Whites commit twice as many offenses as Blacks in absolute numbers, but three times fewer in per capita numbers (per 1 million population within that race).
- Criminality among Whites grows more or less steadily over the entire period of investigation (doubled over 19 years). Criminality among Blacks also grows, but by fits and starts; over the entire period, however, the growth factor is also 2, as with Whites.
- Fatal encounters with law enforcement are connected with criminality (number of offenses committed). The correlation though differs between the two races: for Whites, it is almost perfect, for Blacks — far from perfect.
- Lethal force victims grow 'in reply to' criminality growth, generally with a few years' lag (this is more conspicuous in the Black data).
- White offenders tend to meet death from the police a little more frequently than Black offenders.
Today, as I promised, we'll be looking at the geographical distribution of these data across the states, which ought to either confirm or confute the previous conclusions.
However, before we take up geography, let's make a step back and see what happens if we analyze only the most violent offenses instead of 'All Offenses' as the source data for criminality. Many of my readers have pointed out in their comments that this would have been more proper, since 'All Offenses' incorporate those which should not (in practice) be associated with aggressive behavior provoking police shooting, such as petty larceny or selling drugs. I cannot whole-heartedly agree with this reasoning because, as I see it, any offense can arouse or heighten attention from the law enforcement, which in turn may wind up sadly… Still, let's just be curious enough to check!
Assault and Murder Instead of All Offenses
We just need to change one line of code where we form the crime dataset. Replace this line
df_crimes1 = df_crimes1.loc[df_crimes1['Offense'] == 'All Offenses']
with this:
df_crimes1 = df_crimes1.loc[df_crimes1['Offense'].str.contains('Assault|Murder')]
Our new filter now lets through only offenses connected with assault (simple and aggravated) and murder / non-negligent homicide (negligent / justifiable homicide / manslaughter cases are not included).
We leave the rest of the code as it was.
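To see what the new filter does, here is a minimal sketch on a toy Series. The offense labels below are hypothetical stand-ins (the real FBI labels may be worded differently); the point is only that `|` in the pattern works as a regex 'or', so a row is kept if it mentions either word.

```python
import pandas as pd

# Hypothetical offense labels, for illustration only
offenses = pd.Series([
    'All Offenses',
    'Aggravated Assault',
    'Simple Assault',
    'Murder and Nonnegligent Homicide',
    'Manslaughter by Negligence',
    'Larceny - Theft',
])

# '|' is regex alternation: keep labels containing 'Assault' OR 'Murder'
mask = offenses.str.contains('Assault|Murder')
violent = offenses[mask].tolist()
print(violent)
```

Note that the match is case-sensitive by default, so the pattern relies on the labels being capitalized as shown.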
The number of crimes per 1 million population within each race now looks as follows:
We can see that, though the scale (Y-axis) is much lower, the shape of the curves is almost identical to the All Offenses ones we saw previously.
The criminality vs. lethal force victims curves for both races:
And the correlation matrix:
 | White_promln_cr | White_promln_uof | Black_promln_cr | Black_promln_uof |
---|---|---|---|---|
White_promln_cr | 1.000000 | 0.684757 | 0.986622 | 0.729674 |
White_promln_uof | 0.684757 | 1.000000 | 0.614132 | 0.795486 |
Black_promln_cr | 0.986622 | 0.614132 | 1.000000 | 0.680893 |
Black_promln_uof | 0.729674 | 0.795486 | 0.680893 | 1.000000 |
The correlation between criminality and lethal force fatalities is weaker this time (0.68 for both races, against 0.88 and 0.72 for All Offenses). But the silver lining here is the fact that the correlation coefficients for Whites and Blacks are almost equal, which gives reason to say there is some constant correlation between crime and police shootings / victims (regardless of race).
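For readers who want to check what the matrix cells mean, the sketch below computes a Pearson coefficient twice: with `DataFrame.corr` and from the textbook definition. The numbers are made up for illustration and are not the article's data; the column names are placeholders as well.

```python
import numpy as np
import pandas as pd

# Made-up per-million values, NOT the article's data
df = pd.DataFrame({
    'crime_promln': [100.0, 120.0, 150.0, 170.0, 210.0],
    'uof_promln':   [2.0, 2.1, 2.9, 3.0, 3.8],
})

# correlation matrix as produced by pandas
corr_matrix = df.corr(method='pearson')
r_pandas = corr_matrix.at['crime_promln', 'uof_promln']

# the same coefficient from the definition:
# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx = df['crime_promln'] - df['crime_promln'].mean()
dy = df['uof_promln'] - df['uof_promln'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_pandas, r_manual)
```

The matrix is symmetric with ones on the diagonal, which is why each coefficient appears twice in the table above.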
Now for our 'DIY' index — the ratio of lethal force deaths to the number of crimes (both per capita):
The difference here is even more apparent. The inference is the same: White criminals are more likely to get killed by the police than Black criminals.
The summary is that all our prior conclusions hold true.
Well, down to geography lessons now! :)
Source Data
To investigate criminality in individual states, I used different source endpoints in the FBI database:
- State level UCR Estimated Crime Data Endpoint — without race classification (the resulting CSV can be downloaded from here)
- State level Arrest Demographic Count By Offense Endpoint — with race classification (the resulting CSV can be downloaded from here)
Unfortunately, I didn't manage to get complete data on committed offenses with the offense state, year and offender race, much as I tried. The returned results had large gaps, for example, some states were totally omitted. But the alternative data on arrests is quite sufficient for our humble research.
The first dataset contains crime counts for all 51 'states' (the 50 states plus the District of Columbia) from 1991 to 2018, for the following offense categories:
- violent crime (murder, rape, robbery and aggravated assault)
- homicide (all types, including negligent / justifiable)
- rape legacy (using outdated metrics — before 2013)
- rape revised (using updated metrics — from 2013 on)
- robbery
- aggravated assault
- property crime
- burglary
- larceny
- motor vehicle theft
- arson
For our purposes, we'll be using the 'violent crime' category, in keeping with the rest of the research.
The second dataset features the number of arrests for the 51 states from 2000 to 2018, with details on the arrested persons' race (refer to the previous part for the race categories). Since the arrest dataset uses a different offense classification and doesn't provide the combined 'violent crime' category, the requests and retrieved results are for the four constituent offenses — murder / non-negligent manslaughter, robbery, rape, and aggravated assault.
Crime Distribution (No Racial Factor)
First, we'll look at the distribution of violent crimes across the states regardless of the offenders' race:
import pandas as pd, numpy as np
CRIME_STATES_FILE = ROOT_FOLDER + '\\crimes_by_state.csv'
df_crime_states = pd.read_csv(CRIME_STATES_FILE, sep=';', header=0,
usecols=['year', 'state_abbr', 'population', 'violent_crime'])
The resulting dataset:
 | year | state_abbr | population | violent_crime |
---|---|---|---|---|
0 | 2016 | AL | 4860545 | 25878 |
1 | 1996 | AL | 4273000 | 24159 |
2 | 1997 | AL | 4319000 | 24379 |
3 | 1998 | AL | 4352000 | 22286 |
4 | 1999 | AL | 4369862 | 21421 |
... | ... | ... | ... | ... |
1423 | 2000 | DC | 572059 | 8626 |
1424 | 2001 | DC | 573822 | 9195 |
1425 | 2002 | DC | 569157 | 9322 |
1426 | 2003 | DC | 557620 | 9061 |
1427 | 2016 | DC | 684336 | 8236 |
1428 rows × 4 columns
Adding the full state names (the list of states we already used in our research — CSV) and optimizing / sorting the data:
df_crime_states = df_crime_states.merge(df_state_names, on='state_abbr')
df_crime_states.dropna(inplace=True)
df_crime_states.sort_values(by=['year', 'state_abbr'], inplace=True)
Since the dataset already has population values, let's calculate the number of crimes per million people:
# parentheses let the expression continue on the next line
df_crime_states['crime_promln'] = (df_crime_states['violent_crime'] * 1e6 /
                                   df_crime_states['population'])
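As a quick sanity check of the per-million formula, the sketch below applies it to two 2018 rows (Virginia and Wyoming) copied from the merged table shown later in this article. Only the tiny DataFrame itself is constructed for the example.

```python
import pandas as pd

# Two 2018 rows copied from the merged crime / UOF table in this article
df = pd.DataFrame({
    'state_name': ['Virginia', 'Wyoming'],
    'population': [8517685, 577737],
    'violent_crime': [17032, 1226],
})

# crimes per one million residents
df['crime_promln'] = df['violent_crime'] * 1e6 / df['population']
print(df)
```

The results should agree with the crime_promln values listed for these two states in that table (about 1999.6 and 2122.1).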
Finally, we'll turn the data into a table spanning the 2000 — 2018 period transposing the state names and dropping the redundant columns:
df_crime_states_agg = df_crime_states.groupby(['state_name',
'year'])['violent_crime'].sum().unstack(level=1).T
df_crime_states_agg.fillna(0, inplace=True)
df_crime_states_agg = df_crime_states_agg.astype('uint32').loc[2000:2018, :]
The resulting table contains 19 rows (year observations from 2000 through 2018) and 51 columns (by the number of states).
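If the groupby / unstack / transpose chain looks opaque, here is the same reshaping on a toy table. The state names 'A' and 'B' and all counts are made up; only the sequence of calls mirrors the code above.

```python
import pandas as pd

# Toy long-format data (made-up counts)
long_df = pd.DataFrame({
    'state_name': ['A', 'A', 'B', 'B'],
    'year': [2000, 2001, 2000, 2001],
    'violent_crime': [10, 12, 7, 9],
})

# sum per (state, year), move 'year' into columns, then transpose:
# the result has one row per year and one column per state
wide = (long_df.groupby(['state_name', 'year'])['violent_crime']
        .sum().unstack(level=1).T)
print(wide)
```

With years as rows, slicing by `.loc[2000:2018, :]` and calling `describe()` per state both become one-liners, which is presumably why the table is arranged this way.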
Let's display the top 10 states for the average number of crimes:
df_crime_states_agg_top10 = df_crime_states_agg.describe().T.nlargest(10, 'mean').astype('uint32')
state_name | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
California | 19 | 181514 | 19425 | 153763 | 165508 | 178597 | 193022 | 212867 |
Texas | 19 | 117614 | 6522 | 104734 | 113212 | 121091 | 122084 | 126018 |
Florida | 19 | 110104 | 18542 | 81980 | 92809 | 113541 | 127488 | 131878 |
New York | 19 | 81618 | 9548 | 68495 | 75549 | 77563 | 85376 | 105111 |
Illinois | 19 | 62866 | 10445 | 47775 | 54039 | 64185 | 69937 | 81196 |
Michigan | 19 | 49273 | 5029 | 41712 | 44900 | 49737 | 54035 | 56981 |
Pennsylvania | 19 | 46941 | 5066 | 39192 | 41607 | 48188 | 51021 | 55028 |
Tennessee | 19 | 41951 | 2432 | 38063 | 40321 | 41562 | 43358 | 46482 |
Georgia | 19 | 40228 | 3327 | 34355 | 38283 | 39435 | 41495 | 47353 |
North Carolina | 19 | 37936 | 3193 | 32718 | 34706 | 38243 | 40258 | 43125 |
We'll also make it more graphic with a box plot:
df_crime_states_top10 = df_crime_states_agg.loc[:, df_crime_states_agg_top10.index]
plt = df_crime_states_top10.plot.box(figsize=(12, 10))
plt.set_ylabel('Violent crime count (2000 - 2018)')
The 'Hollywood' state easily beats the other nine. The 'prizewinners' are California, Texas and Florida, all three in the South, the regular settings for most Hollywood crime blockbusters.
You can also see that criminality has changed considerably over the observed period in some states (California, Florida and Illinois), whereas in others (like Georgia) it has remained almost constant.
I tend to think the crime rates are in some way connected with population :) Let's see the top 10 states by population in 2018:
df_crime_states_2018 = df_crime_states.loc[df_crime_states['year'] == 2018]
plt = df_crime_states_2018.nlargest(10, 'population'). sort_values(by='population').plot.barh(x='state_name',
y='population', legend=False, figsize=(10,5))
plt.set_xlabel('2018 Population')
plt.set_ylabel('')
Same old mugs here :) Let's check the correlation between crimes and population:
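df_corr = df_crime_states[df_crime_states['year'] >= 2000]
# average only the two numeric columns we need (recent pandas versions raise an error when asked to average text columns)
df_corr = df_corr.groupby('state_name')[['population', 'violent_crime']].mean()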
df_corr.corr(method='pearson').at['population', 'violent_crime']
The calculated Pearson correlation coefficient is 0.98. Q.E.D.
But the per capita crime counts give a strikingly different picture:
plt = df_crime_states_2018.nlargest(10, 'crime_promln'). sort_values(by='crime_promln').plot.barh(x='state_name',
y='crime_promln', legend=False, figsize=(10,5))
plt.set_xlabel('Number of violent crimes per 1 mln. population (2018)')
plt.set_ylabel('')
There's a pretty kettle of fish! The leaders by per capita crimes are the least populated states: the District of Columbia (with the US capital) and Alaska (both home to some 700+ thousand people as of 2018), as well as one medium-populated state, New Mexico, with 2 mln. people. Only one state from our previous top list is featured here: Tennessee, which gives this state a less-than-desirable reputation.
We will then display these results on the US map. To do this, we need the folium library:
import folium
First, the 2018 absolute crime counts:
FOLIUM_URL = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
FOLIUM_US_MAP = f'{FOLIUM_URL}/us-states.json'
m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
geo_data=FOLIUM_US_MAP,
name='choropleth',
data=df_crime_states_2018,
columns=['state_abbr', 'violent_crime'],
key_on='feature.id',
fill_color='YlOrRd',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Violent crimes in 2018',
bins=df_crime_states_2018['violent_crime'].quantile(
list(np.linspace(0.0, 1.0, 5))).to_list(),
reset=True
).add_to(m)
folium.LayerControl().add_to(m)
m
The same in per capita values (per 1 million):
m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
geo_data=FOLIUM_US_MAP,
name='choropleth',
data=df_crime_states_2018,
columns=['state_abbr', 'crime_promln'],
key_on='feature.id',
fill_color='YlOrRd',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Violent crimes in 2018 (per 1 mln. population)',
bins=df_crime_states_2018['crime_promln'].quantile(
list(np.linspace(0.0, 1.0, 5))).to_list(),
reset=True
).add_to(m)
folium.LayerControl().add_to(m)
m
In the first case, as we can see, crimes are more or less evenly distributed in the North to South direction. In the second case, it's mostly the Southern states plus DC and Alaska that make the trend.
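A note on the `bins` argument used in both maps: it is built from quantiles, so each color class holds roughly the same number of states. The sketch below shows the calculation on a made-up Series; the values are placeholders, not crime counts.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a per-state column
values = pd.Series([10, 20, 30, 40, 50])

# five evenly spaced quantile levels: 0, 0.25, 0.5, 0.75, 1
levels = list(np.linspace(0.0, 1.0, 5))

# the corresponding values become the edges of four color classes
bins = values.quantile(levels).to_list()
print(levels)
print(bins)
```

Five edges define four classes (quartiles). Evenly spaced bins would instead put most states into the lowest class when a few large states dominate the range, as they do in the absolute counts.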
Lethal Force Fatalities Across States (No Racial Factor)
We are now going to look at lethal force used in individual states across the country.
To prepare the dataset, we'll complement the UOF (Use Of Force) data we used previously with the full state names, group the cases by state, and restrict the observations to the years 2000 through 2018:
df_fenc_agg_states = df_fenc.merge(df_state_names, how='inner',
left_on='State', right_on='state_abbr')
df_fenc_agg_states.fillna(0, inplace=True)
df_fenc_agg_states = df_fenc_agg_states.rename(columns={'state_name_x': 'State Name'})
df_fenc_agg_states = df_fenc_agg_states.loc[:, ['Year', 'Race', 'State',
'State Name', 'Cause', 'UOF']]
df_fenc_agg_states = df_fenc_agg_states.groupby(['Year', 'State Name', 'State'])['UOF'].count().unstack(level=0)
df_fenc_agg_states.fillna(0, inplace=True)
df_fenc_agg_states = df_fenc_agg_states.astype('uint16').loc[:, :2018]
df_fenc_agg_states = df_fenc_agg_states.reset_index()
Top 10 states for police victims in 2018:
df_fenc_agg_states_2018 = df_fenc_agg_states.loc[:, ['State Name', 2018]]
plt = df_fenc_agg_states_2018.nlargest(10, 2018).sort_values(2018).plot.barh(
x='State Name', y=2018, legend=False, figsize=(10,5))
plt.set_xlabel('Number of UOF victims in 2018')
plt.set_ylabel('')
Let's also review the data for the entire period as a box plot:
fenc_top10 = df_fenc_agg_states.loc[df_fenc_agg_states['State Name']. isin(df_fenc_agg_states_2018.nlargest(10, 2018)['State Name'])]
fenc_top10 = fenc_top10.T
fenc_top10.columns = fenc_top10.loc['State Name', :]
fenc_top10 = fenc_top10.reset_index().loc[2:, :].set_index('Year')
df_sorted = fenc_top10.mean().sort_values(ascending=False)
fenc_top10 = fenc_top10.loc[:, df_sorted.index]
plt = fenc_top10.plot.box(figsize=(12, 6))
plt.set_ylabel('Number of UOF victims (2000 - 2018)')
Yep! The same 'unholy trio' of California, Texas and Florida, with their other two Southern sidekicks — Arizona and Georgia. The leaders again show large scatter indicative of year-to-year changes.
Connection Between Lethal Force Fatalities and Crimes
As in the previous part of this research, we are investigating the possible connection between criminality and deaths at the hands of law enforcement. We'll start without the racial factor, to see if such a connection exists in principle and how it varies from state to state.
First, we must merge the UOF and (violent) crime datasets, setting the observation period to 2000 through 2018:
# add full state names
df_fenc_crime_states = df_fenc.merge(df_state_names, how='inner',
left_on='State', right_on='state_abbr')
# rename some columns
df_fenc_crime_states = df_fenc_crime_states.rename(columns={'Year': 'year',
'state_name_x': 'state_name'})
# truncate period to 2000-2018
df_fenc_crime_states = df_fenc_crime_states[df_fenc_crime_states['year'].between(2000,
2018)]
# group by year and state
df_fenc_crime_states = df_fenc_crime_states.groupby(['year', 'state_name'])['UOF']. count().reset_index()
# join with crime data
df_fenc_crime_states = df_fenc_crime_states.merge(df_crime_states[df_crime_states['year']. between(2000, 2018)], how='outer', on=['year', 'state_name'])
# set missing data to zero
df_fenc_crime_states.fillna({'UOF': 0}, inplace=True)
# unify data types
df_fenc_crime_states = df_fenc_crime_states.astype({'year': 'uint16', 'UOF': 'uint16',
'population': 'uint32', 'violent_crime': 'uint32'})
# sort data
df_fenc_crime_states = df_fenc_crime_states.sort_values(by=['year', 'state_name'])
Resulting dataset
 | year | state_name | UOF | state_abbr | population | violent_crime | crime_promln |
---|---|---|---|---|---|---|---|
0 | 2000 | Alabama | 7 | AL | 4447100 | 21620 | 4861.595197 |
1 | 2000 | Alaska | 2 | AK | 626932 | 3554 | 5668.876369 |
2 | 2000 | Arizona | 11 | AZ | 5130632 | 27281 | 5317.278651 |
3 | 2000 | Arkansas | 4 | AR | 2673400 | 11904 | 4452.756789 |
4 | 2000 | California | 97 | CA | 33871648 | 210531 | 6215.552311 |
... | ... | ... | ... | ... | ... | ... | ... |
907 | 2018 | Virginia | 18 | VA | 8517685 | 17032 | 1999.604353 |
908 | 2018 | Washington | 24 | WA | 7535591 | 23472 | 3114.818732 |
909 | 2018 | West Virginia | 7 | WV | 1805832 | 5236 | 2899.494527 |
910 | 2018 | Wisconsin | 10 | WI | 5813568 | 17176 | 2954.467893 |
911 | 2018 | Wyoming | 4 | WY | 577737 | 1226 | 2122.072846 |
As you will remember, the UOF column contains the number of deaths from encounters with law enforcement officers (who I sometimes call here just 'the police', but who include, of course, other agencies such as the FBI) where lethal force was used intentionally.
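The outer join followed by `fillna({'UOF': 0})` in the merge code above matters for state-years that have crime records but no recorded fatalities. Here is a toy sketch of that step; states 'A' and 'B' and all numbers are made up.

```python
import pandas as pd

# Toy inputs (made-up): state 'B' has crime data but no UOF record
uof = pd.DataFrame({'year': [2018], 'state_name': ['A'], 'UOF': [3]})
crime = pd.DataFrame({'year': [2018, 2018],
                      'state_name': ['A', 'B'],
                      'violent_crime': [100, 50]})

# the outer join keeps rows found in either table; 'B' gets NaN in 'UOF'
merged = uof.merge(crime, how='outer', on=['year', 'state_name'])

# a missing UOF count means no fatalities were recorded, so set it to zero
merged = merged.fillna({'UOF': 0}).astype({'UOF': 'uint16'})
merged = merged.sort_values(by=['year', 'state_name']).reset_index(drop=True)
print(merged)
```

An inner join would silently drop state 'B' here, which would bias the averages toward states and years with at least one fatality.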
We will also make a separate dataset with year-average values:
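# average only the two numeric columns we need (recent pandas versions raise an error when asked to average text columns)
df_fenc_crime_states_agg = df_fenc_crime_states.groupby('state_name')[['UOF', 'violent_crime']].mean()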
Now let's look at the year averages for crimes and lethal force fatalities for all the 51 states on one plot:
plt = df_fenc_crime_states_agg['violent_crime'].plot.bar(legend=True, figsize=(15,5))
plt.set_ylabel('Number of violent crimes (year average)')
plt2 = df_fenc_crime_states_agg['UOF'].plot(secondary_y=True, style='g', legend=True)
plt2.set_ylabel('Number of UOF victims (year average)', rotation=90)
plt2.set_xlabel('')
plt.set_xlabel('')
plt.set_xticklabels(df_fenc_crime_states_agg.index, rotation='vertical')
Looking closely at this combined chart, one can see the following:
- the connection between crime and use of force is plainly trackable: the green UOF curve tends to repeat the shape of the crime bars
- the more criminal states (such as Florida, Illinois, Michigan, New York and Texas) evince proportionately less use of force compared to the less criminal states
Let's also make a scatterplot:
plt = df_fenc_crime_states_agg.plot.scatter(x='violent_crime', y='UOF')
plt.set_xlabel('Number of violent crimes (year average)')
plt.set_ylabel('Number of UOF victims (year average)')
Here it becomes conspicuous that the ratio between crime and use of lethal force is affected by the crime rate. Speaking crudely, in states with the number of violent crimes below 75k the number of police victims grows more slowly; whereas in the states with the crime count above 75k this growth is quite steep. This latter group includes, as we can see, only four states. Let's look them 'in the face':
df_fenc_crime_states_agg[df_fenc_crime_states_agg['violent_crime'] > 75000]
state_name | UOF | violent_crime |
---|---|---|
California | 133.263158 | 181514.578947 |
Florida | 54.578947 | 110104.315789 |
New York | 19.157895 | 81618.052632 |
Texas | 64.368421 | 117614.631579 |
Will you be surprised? We've got the same 'four horsemen of the Apocalypse': California, Florida, Texas and New York.
Correspondingly, let's calculate the correlation coefficients between our data for three cases:
- states with the year average crime count up to 75,000
- states with the year average crime count above 75,000
- all the states
For the first case:
df_fenc_crime_states_agg[df_fenc_crime_states_agg['violent_crime'] <= 75000].corr(method='pearson').at['UOF', 'violent_crime']
This gives a correlation coefficient of 0.839. This is a strong correlation, although it falls short of 0.9 because of the scatter across the 47 states.
For the second case:
df_fenc_crime_states_agg[df_fenc_crime_states_agg['violent_crime'] > 75000].corr(method='pearson').at['UOF', 'violent_crime']
— we get 0.999 — an ideal correlation!
For the last case (all states):
df_fenc_crime_states_agg.corr(method='pearson').at['UOF', 'violent_crime']
— the correlation is estimated at 0.935. This overall correlation may be considered very good.
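The second case is easy to re-check by hand, since the four year-average rows are printed in the table above. The sketch below simply retypes those values into a small DataFrame (the name `big4` is mine) and repeats the same `corr` call.

```python
import pandas as pd

# Year-average values copied from the four-state table above
big4 = pd.DataFrame(
    {
        'UOF': [133.263158, 54.578947, 19.157895, 64.368421],
        'violent_crime': [181514.578947, 110104.315789,
                          81618.052632, 117614.631579],
    },
    index=['California', 'Florida', 'New York', 'Texas'],
)

# Pearson correlation between average fatalities and average crime counts
r_big4 = big4.corr(method='pearson').at['UOF', 'violent_crime']
print(round(r_big4, 3))
```

The result should land very close to the 0.999 reported above. Keep in mind that a coefficient computed from only four points is a fragile estimate: one more state in this group could shift it noticeably.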
Let's now look at the geographical distribution of our 'offender shootdown' index (the term is coined here for brevity). As before, we divide the number of lethal force fatalities by the number of crimes:
# parentheses let the expression continue on the next line
df_fenc_crime_states_agg['uof_by_crime'] = (df_fenc_crime_states_agg['UOF'] /
                                            df_fenc_crime_states_agg['violent_crime'])
plt = df_fenc_crime_states_agg.loc[:, 'uof_by_crime'].sort_values(ascending=False).plot.bar(figsize=(15,5))
plt.set_xlabel('')
plt.set_ylabel('Ratio of UOF victims to number of violent crimes')
It is interesting to observe that our erstwhile leaders have shifted toward the center or even the rightmost end of the chart, which must mean that the most criminal states don't have the most 'bloodthirsty' police (towards real or potential offenders).
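To get a feel for the size of this index, here is a small sketch that computes it for the four high-crime states, using the year-average values from the table printed earlier. Scaling the ratio to 'per 1,000 violent crimes' is my own choice for readability and is not part of the original calculation.

```python
import pandas as pd

# Year-average values copied from the four-state table earlier in the article
big4 = pd.DataFrame(
    {
        'UOF': [133.263158, 54.578947, 19.157895, 64.368421],
        'violent_crime': [181514.578947, 110104.315789,
                          81618.052632, 117614.631579],
    },
    index=['California', 'Florida', 'New York', 'Texas'],
)

# lethal force fatalities per 1,000 violent crimes (scaling is an assumption)
big4['uof_per_1000_crimes'] = big4['UOF'] * 1000 / big4['violent_crime']
ranking = big4['uof_per_1000_crimes'].sort_values(ascending=False)
print(ranking)
```

Even within this group the index differs roughly threefold between the first and the last state, so absolute crime counts alone say little about how often encounters turn lethal.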
Intermediate conclusions:
- The number of violent crimes is directly proportionate to population (good call, Captain Obvious!)
- The most populated states (California, Florida, Texas and New York) are also the most criminal, in absolute values.
- In per capita values, Southern states are more criminal than Northern states, with the exception of Alaska and the District of Columbia.
- Lethal force deaths are correlated to criminality with an average coefficient of 0.93 across all the states. This correlation reaches almost unity (strictly linear) for the most criminal states and only 0.84 for the rest.
Racial Factor in Criminality and Lethal Force Fatalities Across States
Having shown that crime rates are connected with police victim rates, let's add the racial factor and see what difference it makes. As I explained above, we'll be using the arrest data for this purpose, since it is the most complete and covers the main offenses for all the states. There is, of course, no state or country where one could equate the number of committed crimes to the number of arrests; yet these parameters are closely related. As such, arrest data will serve well enough for our statistical analysis. And, as we already agreed, only violent offenses (murder, rape, robbery, aggravated assault) will be taken into account.
Let's load the source data from the CSV file and routinely add the full state names:
ARRESTS_FILE = ROOT_FOLDER + '\\arrests_by_state_race.csv'
# arrests of Blacks and Whites only
df_arrests = pd.read_csv(ARRESTS_FILE, sep=';', header=0,
usecols=['data_year', 'state', 'white', 'black'])
# sum the four offenses and group by states
df_arrests = df_arrests.groupby(['data_year', 'state']).sum().reset_index()
# add state names
df_arrests = df_arrests.merge(df_state_names, left_on='state', right_on='state_abbr')
# rename / remove columns
df_arrests = df_arrests.rename(columns={'data_year': 'year'}).drop(columns='state_abbr')
# peek at the result
df_arrests.head()
 | year | state | black | white | state_name |
---|---|---|---|---|---|
0 | 2000 | AK | 140 | 613 | Alaska |
1 | 2001 | AK | 139 | 718 | Alaska |
2 | 2002 | AK | 143 | 677 | Alaska |
3 | 2003 | AK | 173 | 801 | Alaska |
4 | 2004 | AK | 163 | 765 | Alaska |
We'll also create a dataframe with year average values:
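# average only the arrest counts (recent pandas versions raise an error when asked to average text columns such as 'state')
df_arrests_agg = df_arrests.groupby('state_name')[['black', 'white']].mean()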
Arrests of Whites and Blacks in 51 states (year average counts)
state_name | black | white |
---|---|---|
Alabama | 2805.842105 | 1757.315789 |
Alaska | 221.894737 | 844.157895 |
Arizona | 1378.368421 | 7007.157895 |
Arkansas | 2387.894737 | 2303.789474 |
California | 26668.368421 | 87252.315789 |
Colorado | 1268.210526 | 5157.368421 |
Connecticut | 2097.631579 | 2981.210526 |
Delaware | 1356.894737 | 1048.578947 |
District of Columbia | 111.111111 | 4.944444 |
Florida | 12.000000 | 7.000000 |
Georgia | 8262.842105 | 3502.894737 |
Hawaii | 81.052632 | 368.736842 |
Idaho | 44.000000 | 1362.263158 |
Illinois | 5699.842105 | 1841.894737 |
Indiana | 3553.368421 | 5192.263158 |
Iowa | 1104.421053 | 3039.473684 |
Kansas | 522.315789 | 1501.315789 |
Kentucky | 1476.894737 | 1906.052632 |
Louisiana | 5928.789474 | 3414.263158 |
Maine | 63.736842 | 699.526316 |
Maryland | 7189.105263 | 4010.684211 |
Massachusetts | 3407.157895 | 7319.684211 |
Michigan | 7628.157895 | 6304.157895 |
Minnesota | 2231.210526 | 2645.736842 |
Mississippi | 1462.210526 | 474.368421 |
Missouri | 5777.473684 | 5703.368421 |
Montana | 27.684211 | 673.684211 |
Nebraska | 591.421053 | 1058.526316 |
Nevada | 1956.421053 | 3817.210526 |
New Hampshire | 68.368421 | 640.789474 |
New Jersey | 6424.157895 | 6043.789474 |
New Mexico | 234.421053 | 2809.368421 |
New York | 8394.526316 | 8734.947368 |
North Carolina | 10527.947368 | 7412.947368 |
North Dakota | 61.263158 | 277.052632 |
Ohio | 4063.947368 | 4071.368421 |
Oklahoma | 1625.105263 | 3353.000000 |
Oregon | 445.105263 | 3373.368421 |
Pennsylvania | 11974.157895 | 11039.473684 |
Rhode Island | 275.684211 | 699.210526 |
South Carolina | 5578.526316 | 3615.421053 |
South Dakota | 67.105263 | 349.368421 |
Tennessee | 6799.894737 | 8462.526316 |
Texas | 10547.631579 | 22062.684211 |
Utah | 167.105263 | 1748.894737 |
Vermont | 43.526316 | 439.210526 |
Virginia | 4100.421053 | 3060.263158 |
Washington | 1688.947368 | 6012.105263 |
West Virginia | 271.263158 | 1528.315789 |
Wisconsin | 3440.055556 | 4107.722222 |
Wyoming | 27.263158 | 506.947368 |
Looking at this table, one can't overlook some oddities. In some states the arrest counts reach hundreds and thousands, while in others — only dozens or fewer. That's the case with Florida, one of the most populated states: it counts only 19 arrests per year (12 Blacks and 7 Whites). Surely, some data is missing here; let's check:
df_arrests[df_arrests['state'] == 'FL']
And indeed we see that data for Florida is available only for 2017. Well, we'll have to put up with this, I suppose. All the other states have complete data. But the ten / hundred-fold difference should be accounted for by population. Let's add population-by-race data and have a look.
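Rather than spotting such gaps by eye, one could count the distinct years available per state. The sketch below does this on a toy arrests table with made-up counts; the column names follow the `df_arrests` dataframe above, and 'FL' is deliberately given a single year.

```python
import pandas as pd

# Toy arrests table (made-up counts); 'FL' deliberately has one year only
df = pd.DataFrame({
    'year':  [2016, 2017, 2018, 2017],
    'state': ['AK', 'AK', 'AK', 'FL'],
    'black': [1, 2, 3, 12],
    'white': [4, 5, 6, 7],
})

# number of distinct years with data for each state
years_per_state = df.groupby('state')['year'].nunique()

# states that have fewer years than the dataset covers overall
expected_years = df['year'].nunique()
incomplete = years_per_state[years_per_state < expected_years]
print(incomplete)
```

Running the same check on the real dataframe would list every state with missing years at once.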
The population data was taken from the US Census Bureau website (which is for some reason not accessible in Russia). You can download the prepared CSV file with 2010 — 2019 data from here.
Unfortunately, no state population data exist for prior periods (2000 — 2009). We have therefore to narrow down our observation period to 9 years (from 2010 through 2018) for this part of the research.
POP_STATES_FILES = ROOT_FOLDER + '\\us_pop_states_race_2010-2019.csv'
df_pop_states = pd.read_csv(POP_STATES_FILES, sep=';', header=0)
# the source CSV has a specific format, so some trickery is required :)
df_pop_states = df_pop_states.melt('state_name', var_name='r_year', value_name='pop')
df_pop_states['race'] = df_pop_states['r_year'].str[0]
df_pop_states['year'] = df_pop_states['r_year'].str[2:].astype('uint16')
df_pop_states.drop(columns='r_year', inplace=True)
df_pop_states = df_pop_states[df_pop_states['year'].between(2000, 2018)]
df_pop_states = df_pop_states.groupby(['state_name', 'year', 'race']).sum(). unstack().reset_index()
df_pop_states.columns = ['state_name', 'year', 'black_pop', 'white_pop']
White and Black population across states
state_name | year | black_pop | white_pop |
---|---|---|---|
Alabama | 2010 | 5044936 | 13462236 |
Alabama | 2011 | 5067912 | 13477008 |
Alabama | 2012 | 5102512 | 13484256 |
Alabama | 2013 | 5137360 | 13488812 |
Alabama | 2014 | 5162316 | 13493432 |
... | ... | ... | ... |
Wyoming | 2014 | 31392 | 2167008 |
Wyoming | 2015 | 29568 | 2177740 |
Wyoming | 2016 | 29304 | 2170700 |
Wyoming | 2017 | 29444 | 2148128 |
Wyoming | 2018 | 29604 | 2139896 |
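The 'trickery' in the population code is easier to follow on a toy table. The sketch assumes the wide CSV has one column per race/year pair named like 'B_2010' and 'W_2010'; that layout is my guess from the string slicing in the code above, not something I can confirm, and the states 'X', 'Y' and all numbers are made up.

```python
import pandas as pd

# Assumed wide layout (hypothetical): one column per race/year pair
wide = pd.DataFrame({
    'state_name': ['X', 'Y'],
    'B_2010': [100, 200], 'W_2010': [900, 800],
    'B_2011': [110, 210], 'W_2011': [910, 810],
})

# wide -> long: one row per (state, race/year column)
long_df = wide.melt('state_name', var_name='r_year', value_name='pop')

# split 'B_2010' into race letter 'B' and year 2010
long_df['race'] = long_df['r_year'].str[0]
long_df['year'] = long_df['r_year'].str[2:].astype('uint16')
long_df = long_df.drop(columns='r_year')

# long -> one row per (state, year) with a population column per race
tidy = (long_df.groupby(['state_name', 'year', 'race'])['pop']
        .sum().unstack().reset_index())
tidy.columns = ['state_name', 'year', 'black_pop', 'white_pop']
print(tidy)
```

Assigning the column names by position works only because `unstack` sorts the race letters alphabetically ('B' before 'W'), so it is worth double-checking the order if the source file uses different codes.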
Merging this data with the arrests dataset, we can calculate the per-million arrest counts:
df_arrests_2010_2018 = df_arrests.merge(df_pop_states, how='inner',
on=['year', 'state_name'])
# parentheses let each expression continue on the next line
df_arrests_2010_2018['white_arrests_promln'] = (df_arrests_2010_2018['white'] * 1e6 /
                                                df_arrests_2010_2018['white_pop'])
df_arrests_2010_2018['black_arrests_promln'] = (df_arrests_2010_2018['black'] * 1e6 /
                                                df_arrests_2010_2018['black_pop'])
And again let's calculate the year averages:
df_arrests_2010_2018_agg = df_arrests_2010_2018.groupby(
['state_name', 'state']).mean().drop(columns='year').reset_index()
df_arrests_2010_2018_agg = df_arrests_2010_2018_agg.set_index('state_name')
Combined arrest dataset with absolute and per-million counts
state_name | state | black | white | black_pop | white_pop | white_arrests_promln | black_arrests_promln |
---|---|---|---|---|---|---|---|
Alabama | AL | 1682.000000 | 1342.000000 | 5.152399e+06 | 1.349158e+07 | 99.424741 | 324.055203 |
Alaska | AK | 255.000000 | 870.555556 | 1.069489e+05 | 1.957445e+06 | 445.199704 | 2390.243876 |
Arizona | AZ | 1635.555556 | 6852.000000 | 1.279172e+06 | 2.260403e+07 | 302.923002 | 1267.000192 |
Arkansas | AR | 1960.666667 | 2466.000000 | 1.855574e+06 | 9.465137e+06 | 260.459917 | 1055.854934 |
California | CA | 24381.666667 | 79477.000000 | 1.007921e+07 | 1.128020e+08 | 704.731408 | 2419.234376 |
Colorado | CO | 1377.222222 | 5171.555556 | 9.508173e+05 | 1.882940e+07 | 274.209456 | 1439.257054 |
Connecticut | CT | 1823.777778 | 2295.333333 | 1.643690e+06 | 1.165681e+07 | 196.712775 | 1114.811569 |
Delaware | DE | 1318.000000 | 914.111111 | 8.354622e+05 | 2.635794e+06 | 347.374980 | 1582.395733 |
District of Columbia | DC | 139.222222 | 4.777778 | 1.288488e+06 | 1.154416e+06 | 4.112547 | 108.101938 |
Florida | FL | 12.000000 | 7.000000 | 1.415383e+07 | 6.498292e+07 | 0.107721 | 0.847827 |
Georgia | GA | 8137.222222 | 4271.444444 | 1.279378e+07 | 2.500293e+07 | 170.939250 | 639.869143 |
Hawaii | HI | 81.333333 | 383.777778 | 1.124298e+05 | 1.453712e+06 | 264.353469 | 725.477589 |
Idaho | ID | 51.888889 | 1373.777778 | 5.288222e+04 | 6.154316e+06 | 223.151878 | 978.205026 |
Illinois | IL | 4216.000000 | 1284.222222 | 7.554687e+06 | 3.980927e+07 | 32.199075 | 557.493894 |
Indiana | IN | 2924.444444 | 5186.111111 | 2.522917e+06 | 2.267508e+07 | 228.699515 | 1155.168768 |
Iowa | IA | 1181.000000 | 2999.222222 | 4.305640e+05 | 1.141794e+07 | 262.666753 | 2760.038539 |
Kansas | KS | 539.555556 | 1512.111111 | 7.116182e+05 | 1.006714e+07 | 150.232160 | 758.851182 |
Kentucky | KY | 1443.888889 | 2173.666667 | 1.442174e+06 | 1.558094e+07 | 139.526970 | 1001.433470 |
Louisiana | LA | 5917.000000 | 3255.333333 | 6.021228e+06 | 1.174245e+07 | 277.277874 | 981.334817 |
Maine | ME | 78.000000 | 678.000000 | 7.667733e+04 | 5.059062e+06 | 134.024032 | 1019.061684 |
Maryland | MD | 6460.444444 | 3325.444444 | 7.229037e+06 | 1.426036e+07 | 233.317775 | 893.942720 |
Massachusetts | MA | 3349.555556 | 6895.111111 | 2.249232e+06 | 2.226671e+07 | 309.745910 | 1505.096888 |
Michigan | MI | 6302.444444 | 5647.444444 | 5.645176e+06 | 3.170670e+07 | 178.111684 | 1116.364030 |
Minnesota | MN | 2570.000000 | 2686.777778 | 1.311818e+06 | 1.867259e+07 | 143.902882 | 1986.464052 |
Mississippi | MS | 1251.000000 | 418.777778 | 4.478208e+06 | 7.122651e+06 | 58.753686 | 279.574565 |
Missouri | MO | 4588.333333 | 5146.111111 | 2.854060e+06 | 2.023871e+07 | 254.292323 | 1608.303611 |
Montana | MT | 34.222222 | 788.333333 | 2.210444e+04 | 3.660813e+06 | 214.944902 | 1525.795754 |
Nebraska | NE | 618.888889 | 1154.888889 | 3.701520e+05 | 6.709768e+06 | 172.269972 | 1687.725359 |
Nevada | NV | 2450.000000 | 4480.333333 | 1.052192e+06 | 8.647157e+06 | 517.401564 | 2316.374085 |
New Hampshire | NH | 89.777778 | 784.777778 | 7.873600e+04 | 5.012056e+06 | 156.580888 | 1141.127571 |
New Jersey | NJ | 5429.555556 | 4971.888889 | 5.241910e+06 | 2.595141e+07 | 191.427955 | 1037.217679 |
New Mexico | NM | 260.111111 | 3136.000000 | 2.053876e+05 | 6.905377e+06 | 454.129135 | 1268.115549 |
New York | NY | 6035.777778 | 6600.222222 | 1.373077e+07 | 5.534157e+07 | 119.253616 | 439.581451 |
North Carolina | NC | 9549.000000 | 6759.333333 | 8.804027e+06 | 2.844145e+07 | 238.320077 | 1088.968561 |
North Dakota | ND | 100.666667 | 386.222222 | 6.583289e+04 | 2.583206e+06 | 149.190455 | 1536.987272 |
Ohio | OH | 3632.888889 | 3733.333333 | 5.879375e+06 | 3.844592e+07 | 97.107129 | 617.699379 |
Oklahoma | OK | 1577.333333 | 3049.000000 | 1.189604e+06 | 1.160567e+07 | 262.904593 | 1326.463864 |
Oregon | OR | 375.444444 | 3125.000000 | 3.292284e+05 | 1.402225e+07 | 222.819615 | 1148.158169 |
Pennsylvania | PA | 11227.000000 | 10652.111111 | 5.945100e+06 | 4.232445e+07 | 251.598838 | 1893.415475 |
Rhode Island | RI | 274.888889 | 595.000000 | 3.275551e+05 | 3.592825e+06 | 165.605635 | 837.932682 |
South Carolina | SC | 4703.222222 | 3094.111111 | 5.365012e+06 | 1.324712e+07 | 234.287821 | 877.892998 |
South Dakota | SD | 103.777778 | 448.333333 | 6.154533e+04 | 2.903489e+06 | 153.995184 | 1641.137012 |
Tennessee | TN | 7603.000000 | 9068.666667 | 4.460808e+06 | 2.070126e+07 | 438.486812 | 1708.022356 |
Texas | TX | 10821.666667 | 21122.111111 | 1.345661e+07 | 8.628389e+07 | 245.051258 | 803.917061 |
Utah | UT | 193.222222 | 1797.333333 | 1.558876e+05 | 1.079659e+07 | 166.431266 | 1240.117890 |
Vermont | VT | 54.222222 | 520.555556 | 3.017111e+04 | 2.376143e+06 | 219.129918 | 1785.111547 |
Virginia | VA | 4059.555556 | 3071.222222 | 6.544598e+06 | 2.340732e+07 | 131.178648 | 620.504151 |
Washington | WA | 1791.777778 | 5870.444444 | 1.147000e+06 | 2.289368e+07 | 256.632241 | 1566.862244 |
West Virginia | WV | 294.111111 | 1648.666667 | 2.597649e+05 | 6.908718e+06 | 238.517207 | 1132.059057 |
Wisconsin | WI | 3525.333333 | 4046.222222 | 1.516534e+06 | 2.018658e+07 | 200.441064 | 2325.622492 |
Wyoming | WY | 28.777778 | 464.555556 | 2.856356e+04 | 2.151349e+06 | 216.004646 | 1005.725503 |
Let's visualize this stuff.
1. Absolute arrest counts
# 'ax' rather than 'plt', so the conventional matplotlib.pyplot alias is not shadowed
ax = (df_arrests_2010_2018_agg[['white', 'black']]
      .sort_index(ascending=False)
      .plot.barh(color=['g', 'olive'], figsize=(10, 20)))
ax.set_ylabel('')
ax.set_xlabel('Year-average arrest count (2010-2018)')
2. Arrest counts per million population (for each race)
# 'ax' rather than 'plt', so the conventional matplotlib.pyplot alias is not shadowed
ax = (df_arrests_2010_2018_agg[['white_arrests_promln', 'black_arrests_promln']]
      .sort_index(ascending=False)
      .plot.barh(color=['g', 'olive'], figsize=(10, 20)))
ax.set_ylabel('')
ax.set_xlabel('Year-average arrest count per 1 mln. within race (2010-2018)')
What can we infer from this data?
First of all, we see that the number of arrests scales with population, and this holds for both races.
Secondly, Whites get arrested somewhat more often than Blacks in absolute figures. 'Somewhat', because this rule isn't universal across the states (exceptions are North Carolina, Georgia, Louisiana, etc.); at the same time, the difference is slight in most states, except for a few (like California, Texas, Colorado, Massachusetts and some others).
Last but not least, in per capita terms Blacks get arrested much more often in every state.
Let's back up these observations with numbers.
Difference between the average White and Black arrest counts:
df_arrests_2010_2018['white'].mean() / df_arrests_2010_2018['black'].mean()
This gives 1.56. That is, over the nine years observed, about one and a half times as many Whites as Blacks were arrested on average.
Then in per capita values:
(df_arrests_2010_2018['white_arrests_promln'].mean() /
 df_arrests_2010_2018['black_arrests_promln'].mean())
Here the ratio is 0.183. That is, a Black person is on average about 5.5 times more likely to get arrested than a White person.
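The step from 0.183 to "about 5.5 times" is just the reciprocal of the White-to-Black ratio. A quick check, using only the rounded figure quoted in the text:

```python
# White-to-Black ratio of per capita arrest rates, rounded as quoted in the text
white_to_black = 0.183

# The reciprocal is the Black-to-White ratio
black_to_white = 1 / white_to_black

print(round(black_to_white, 2))
```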
Thus, the previous conclusion of higher criminality among Blacks (compared to Whites) is confirmed by the arrest data for all the states of the USA.
To understand how race and criminality are connected with lethal force victims, let's merge the two datasets.
First, we prepare the use-of-force data with the victims' race details:
# attach full state names to the incident records
df_fenc_agg_states1 = df_fenc.merge(df_state_names, how='inner',
                                    left_on='State', right_on='state_abbr')
df_fenc_agg_states1.fillna(0, inplace=True)
df_fenc_agg_states1 = df_fenc_agg_states1.rename(columns={
    'state_name_x': 'state_name', 'Year': 'year'})
# keep 2000-2018 and only the columns needed for counting
df_fenc_agg_states1 = df_fenc_agg_states1.loc[
    df_fenc_agg_states1['year'].between(2000, 2018),
    ['year', 'Race', 'state_name', 'UOF']]
# count cases per year, state and race, then turn races into columns
df_fenc_agg_states1 = (df_fenc_agg_states1
                       .groupby(['year', 'state_name', 'Race'])['UOF']
                       .count().unstack().reset_index())
df_fenc_agg_states1 = df_fenc_agg_states1.rename(columns={
    'Black': 'black_uof', 'White': 'white_uof'})
df_fenc_agg_states1 = df_fenc_agg_states1.fillna(0).astype({
    'black_uof': 'uint32', 'white_uof': 'uint32'})
Resulting UOF dataset
Race | year | state_name | black_uof | white_uof |
---|---|---|---|---|
0 | 2000 | Alabama | 4 | 3 |
1 | 2000 | Alaska | 0 | 2 |
2 | 2000 | Arizona | 0 | 11 |
3 | 2000 | Arkansas | 1 | 3 |
4 | 2000 | California | 19 | 78 |
... | ... | ... | ... | ... |
907 | 2018 | Virginia | 11 | 7 |
908 | 2018 | Washington | 0 | 24 |
909 | 2018 | West Virginia | 2 | 5 |
910 | 2018 | Wisconsin | 3 | 7 |
911 | 2018 | Wyoming | 0 | 4 |
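To see what the groupby / count / unstack chain does, here is a minimal sketch on a hypothetical four-row table. The records and the 'UOF' values are made up for illustration and are not taken from the real dataset; only the column names follow the code above.

```python
import pandas as pd

# Hypothetical incident records, one row per fatal encounter (values are made up)
records = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001],
    'state_name': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'Race': ['Black', 'White', 'White', 'White'],
    'UOF': ['Gunshot', 'Gunshot', 'Taser', 'Gunshot'],
})

counts = (
    records.groupby(['year', 'state_name', 'Race'])['UOF']
    .count()           # number of records per (year, state, race)
    .unstack()         # 'Race' values become columns; absent combinations become NaN
    .reset_index()
    .rename(columns={'Black': 'black_uof', 'White': 'white_uof'})
    .fillna(0)         # no recorded cases means a count of zero
    .astype({'black_uof': 'uint32', 'white_uof': 'uint32'})
)
print(counts)
```

Each row of the result is one state in one year, with a separate count column per race, which is the same layout as the table above.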
Then we merge it with the arrest data:
df_arrests_fenc = df_arrests.merge(df_fenc_agg_states1,
on=['state_name', 'year'])
df_arrests_fenc = df_arrests_fenc.rename(columns={
'white': 'white_arrests', 'black': 'black_arrests'})
Example data for 2017
 | year | state | black_arrests | white_arrests | state_name | black_uof | white_uof |
---|---|---|---|---|---|---|---|
15 | 2017 | AK | 266 | 859 | Alaska | 2 | 3 |
34 | 2017 | AL | 3098 | 2509 | Alabama | 7 | 17 |
53 | 2017 | AR | 2092 | 2674 | Arkansas | 6 | 7 |
72 | 2017 | AZ | 2431 | 7829 | Arizona | 6 | 43 |
91 | 2017 | CA | 24937 | 80367 | California | 25 | 137 |
110 | 2017 | CO | 1781 | 6079 | Colorado | 2 | 27 |
127 | 2017 | CT | 1687 | 2114 | Connecticut | 1 | 5 |
140 | 2017 | DE | 1198 | 782 | Delaware | 4 | 3 |
159 | 2017 | GA | 7747 | 4171 | Georgia | 15 | 21 |
173 | 2017 | HI | 88 | 419 | Hawaii | 0 | 1 |
192 | 2017 | IA | 1400 | 3524 | Iowa | 1 | 5 |
210 | 2017 | ID | 61 | 1423 | Idaho | 0 | 6 |
229 | 2017 | IL | 2847 | 947 | Illinois | 13 | 11 |
248 | 2017 | IN | 3565 | 4300 | Indiana | 9 | 13 |
267 | 2017 | KS | 585 | 1651 | Kansas | 3 | 10 |
286 | 2017 | KY | 1481 | 2035 | Kentucky | 1 | 18 |
305 | 2017 | LA | 5875 | 2284 | Louisiana | 13 | 5 |
324 | 2017 | MA | 2953 | 6089 | Massachusetts | 1 | 4 |
343 | 2017 | MD | 6662 | 3371 | Maryland | 8 | 5 |
361 | 2017 | ME | 89 | 675 | Maine | 1 | 8 |
380 | 2017 | MI | 6149 | 5459 | Michigan | 6 | 7 |
399 | 2017 | MN | 2513 | 2681 | Minnesota | 1 | 7 |
418 | 2017 | MO | 4571 | 5007 | Missouri | 13 | 20 |
437 | 2017 | MS | 1266 | 409 | Mississippi | 7 | 10 |
455 | 2017 | MT | 50 | 915 | Montana | 0 | 3 |
474 | 2017 | NC | 8177 | 5576 | North Carolina | 9 | 14 |
501 | 2017 | NE | 80 | 578 | Nebraska | 0 | 1 |
516 | 2017 | NH | 113 | 817 | New Hampshire | 0 | 3 |
535 | 2017 | NJ | 4859 | 4136 | New Jersey | 9 | 6 |
554 | 2017 | NM | 205 | 2094 | New Mexico | 0 | 20 |
573 | 2017 | NV | 2695 | 4657 | Nevada | 3 | 12 |
592 | 2017 | NY | 5923 | 6633 | New York | 7 | 9 |
611 | 2017 | OH | 4472 | 3882 | Ohio | 11 | 23 |
630 | 2017 | OK | 1638 | 2872 | Oklahoma | 3 | 20 |
649 | 2017 | OR | 453 | 3222 | Oregon | 2 | 9 |
668 | 2017 | PA | 10123 | 10191 | Pennsylvania | 7 | 17 |
681 | 2017 | RI | 315 | 633 | Rhode Island | 0 | 1 |
700 | 2017 | SC | 4645 | 2964 | South Carolina | 3 | 10 |
712 | 2017 | SD | 124 | 537 | South Dakota | 0 | 2 |
731 | 2017 | TN | 6654 | 8496 | Tennessee | 4 | 24 |
750 | 2017 | TX | 11493 | 20911 | Texas | 18 | 56 |
769 | 2017 | UT | 199 | 1964 | Utah | 1 | 5 |
788 | 2017 | VA | 4283 | 3247 | Virginia | 8 | 17 |
804 | 2017 | VT | 75 | 626 | Vermont | 0 | 1 |
823 | 2017 | WA | 1890 | 5804 | Washington | 8 | 27 |
842 | 2017 | WV | 350 | 1705 | West Virginia | 1 | 10 |
856 | 2017 | WY | 36 | 549 | Wyoming | 0 | 1 |
872 | 2017 | DC | 135 | 8 | District of Columbia | 1 | 1 |
890 | 2017 | WI | 3604 | 4106 | Wisconsin | 6 | 15 |
892 | 2017 | FL | 12 | 7 | Florida | 19 | 43 |
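Note that `merge` defaults to an inner join, so any state-year present in only one of the two tables silently drops out of the result. The sketch below shows one way to check for that. It uses miniature tables: the Alabama and Alaska figures are copied from the 2017 rows above, while leaving Hawaii out of the UOF table is purely an assumption made to demonstrate the effect.

```python
import pandas as pd

# Miniature arrest table (three 2017 rows from the example above)
arrests = pd.DataFrame({
    'state_name': ['Alabama', 'Alaska', 'Hawaii'],
    'year': [2017, 2017, 2017],
    'white': [2509, 859, 419],
    'black': [3098, 266, 88],
})
# Miniature UOF table; Hawaii is deliberately left out (hypothetical gap)
uof = pd.DataFrame({
    'state_name': ['Alabama', 'Alaska'],
    'year': [2017, 2017],
    'white_uof': [17, 3],
    'black_uof': [7, 2],
})

# Default inner join: the Hawaii row disappears without any warning
inner = arrests.merge(uof, on=['state_name', 'year'])

# Outer join with indicator=True adds a '_merge' column marking unmatched rows
check = arrests.merge(uof, on=['state_name', 'year'], how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['state_name', 'year', '_merge']])
```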
OK, time to calculate the correlation coefficients between arrests and lethal force fatalities, as we did before:
df_corr = df_arrests_fenc.loc[:, ['white_arrests', 'black_arrests',
'white_uof', 'black_uof']].corr(method='pearson').iloc[:2, 2:]
df_corr.style.background_gradient(cmap='PuBu')
 | white_uof | black_uof |
---|---|---|
white_arrests | 0.872766 | 0.622167 |
black_arrests | 0.702350 | 0.766852 |
Again, the correlations are quite good: 0.87 for Whites and 0.77 for Blacks. Curiously, these values are very close to those we obtained for All Offenses in the previous part of the article (0.88 for Whites and 0.72 for Blacks).
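For readers unfamiliar with the `.corr(...).iloc[:2, 2:]` idiom: `corr` returns the full 4x4 matrix, and the slice keeps only the block that pairs arrests (rows) with fatalities (columns). A small sketch on five rows copied from the 2017 table; with so few points the coefficients will not match the ones above, so only the slicing is of interest here.

```python
import pandas as pd

# Five state rows copied from the 2017 example table
sample = pd.DataFrame({
    'white_arrests': [2509, 7829, 80367, 4171, 20911],
    'black_arrests': [3098, 2431, 24937, 7747, 11493],
    'white_uof': [17, 43, 137, 21, 56],
    'black_uof': [7, 6, 25, 15, 18],
}, index=['Alabama', 'Arizona', 'California', 'Georgia', 'Texas'])

full = sample.corr(method='pearson')  # 4x4 symmetric matrix, ones on the diagonal
block = full.iloc[:2, 2:]             # rows: the two arrest columns; columns: the two UOF columns
print(block)
```

The values to read are on the diagonal of this block (white arrests vs. white fatalities, black arrests vs. black fatalities).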
What about our 'offender shootdown' index? Let's check:
# parentheses let each expression continue onto the next line
df_arrests_fenc['white_uof_by_arr'] = (df_arrests_fenc['white_uof'] /
                                       df_arrests_fenc['white_arrests'])
df_arrests_fenc['black_uof_by_arr'] = (df_arrests_fenc['black_uof'] /
                                       df_arrests_fenc['black_arrests'])
df_arrests_fenc.replace([np.inf, -np.inf], np.nan, inplace=True)
df_arrests_fenc.fillna({'white_uof_by_arr': 0, 'black_uof_by_arr': 0}, inplace=True)
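The last two lines deal with division by zero: when the arrest count is 0, pandas yields `inf` (or `NaN` for 0/0) instead of raising an error. A toy illustration with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up counts: a normal row, a row with 1 victim and 0 arrests, and an all-zero row
toy = pd.DataFrame({
    'black_uof': [2, 1, 0],
    'black_arrests': [100, 0, 0],
})

# 2/100 -> 0.02, 1/0 -> inf, 0/0 -> NaN
toy['black_uof_by_arr'] = toy['black_uof'] / toy['black_arrests']

# Turn infinities into NaN, then fill the NaN values in the ratio column with 0
toy = toy.replace([np.inf, -np.inf], np.nan)
toy = toy.fillna({'black_uof_by_arr': 0})
print(toy)
```

Rows with zero arrests therefore end up with an index of 0 rather than being excluded, so they still enter any later average.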
To see how this index is distributed geographically, let's take the 2018 data point:
# 'ax' rather than 'plt', so the conventional matplotlib.pyplot alias is not shadowed
ax = (df_arrests_fenc.loc[df_arrests_fenc['year'] == 2018,
                          ['state_name', 'white_uof_by_arr', 'black_uof_by_arr']]
      .sort_values(by='state_name', ascending=False)
      .plot.barh(x='state_name', color=['g', 'olive'], figsize=(10, 20)))
ax.set_ylabel('')
ax.set_xlabel('Ratio of UOF victims to violent crimes (2018)')
The index for Whites is greater in most states, with some exceptions (Utah, West Virginia, Kansas, Idaho, and the District of Columbia).
Let's compare the values for Whites and Blacks averaged for all the states:
# mean over every row of the merged data (all states and years), so no year in the label
ax = (df_arrests_fenc.loc[:, ['white_uof_by_arr', 'black_uof_by_arr']]
      .mean().plot.bar(color=['g', 'olive']))
ax.set_ylabel('Ratio of UOF victims to violent crimes')
ax.set_xticklabels(['White', 'Black'], rotation=0)
The index is 2.5 times greater for Whites than for Blacks. If this index really means something, a White criminal is on average 2.5 times more likely to be killed by the police than a Black criminal. Of course, the index varies a lot from state to state: in Idaho, for example, a Black criminal is twice as likely as a White one to become a law enforcement victim, whereas in Mississippi a Black criminal is four times less likely.
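As a worked example of how the index is read, take the Mississippi row from the 2017 table above. This is a single state in a single year, so it only illustrates the arithmetic and is not the averaged figure discussed in the text.

```python
# Figures copied from the Mississippi row of the 2017 example table
white_uof, white_arrests = 10, 409
black_uof, black_arrests = 7, 1266

# Victims per arrest, for each race
white_index = white_uof / white_arrests
black_index = black_uof / black_arrests

# How many times greater the White index is than the Black one
ratio = white_index / black_index
print(f"white: {white_index:.4f}, black: {black_index:.4f}, white/black: {ratio:.1f}")
```

For this row the White index comes out roughly 4.4 times the Black one, in line with the "four times less likely" remark about Mississippi.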
Well, that's it really. Time to summarize our research.
Conclusions
- In the US, criminality is a function of population. The most 'criminal' states, the ones we are used to seeing in movies or reading about, are simply the most populous. In per capita crime rates, the top positions are taken by some quite unexpected states like Alaska, the District of Columbia (Washington, D.C.) and New Mexico.
- Southern states are on average more criminal than Northern states (in per capita crime values).
- Per capita crimes and arrests are unevenly distributed among the US white and black populations: black persons commit 3 times more crimes and are arrested 5 times more often than white persons.
- A black person is on average 2.5 times more likely to get killed in an encounter with law enforcement than a white person.
- Lethal force fatalities correlate well with criminality: the higher the crime rate, the more people get killed by the police. This correlation holds true for most states and for both races, although it is somewhat more pronounced among the white population. This is also confirmed by the difference in the victim-to-crime ratio between the races: white criminals are more likely to get killed by the police.
As a final word, I'd like to say thanks to my readers for their valuable comments and advice.
P.S. In a future (separate) article I am planning to continue analyzing crime and its connection with race in the US. We can first look into hate crimes and then discuss encounters between law enforcement and offenders from the reverse point of view, investigating line-of-duty fatalities among US police officers. I'd appreciate it if you let me know in the comments whether this subject is of interest.