Categorical Clustering of Pittsburgh Car Accidents Using K-Modes

26 min read · May 7, 2021

Cluster analysis, or clustering, refers to the segmentation of items or observations into groups based on their similarity to one another. This is typically accomplished using some form of iterative algorithm that repeatedly assigns and reassigns items to different groups in order to maximize similarity within the clusters while keeping the clusters distinct from one another. For numerical data, similarity between items or observations within a dataset can be evaluated using Euclidean distance between points or some other form of distance measurement, allowing for points that are closer together in space to be included in the same group.

A variety of algorithms are available for clustering numerical data, including k-means, Gaussian mixture models, DBSCAN, and many others. This article, however, will focus on clustering a dataset using categorical features only. To do this, we will be using the k-modes algorithm.

Dataset Background

For this analysis, we will be using data collected for over 42,000 car accidents reported to the police in the City of Pittsburgh, Pennsylvania, between 2010 and 2019. The dataset includes several categorical and numerical features for each crash, providing information about the roadway, crash conditions, the vehicles involved, and anonymized data about the passengers. The raw dataset used for this analysis is available through the Western Pennsylvania Regional Data Center and can be found at the following location: https://data.wprdc.org/dataset/allegheny-county-crash-data.

A data dictionary explaining the meaning behind each variable name and the associated categorical values can be found here. These variables will be referred to occasionally throughout the article.

A repository containing the code used for this analysis can be found here.

Analysis Outline

This analysis will consist of two parts:

Categorical Clustering — first, we will use a clustering algorithm (k-modes) to separate the dataset into groups of crashes with similar characteristics by using the categorical features from our dataset.
Cluster Exploration — next, we will perform an exploratory data analysis based on the clusters that we have assigned. We will use categorical features from each cluster to develop a narrative description of each crash category.

Unsupervised vs. Supervised Learning

Before we begin evaluating clustering algorithms, we should first identify the type of problem we are trying to solve. During the first stage of our analysis, we will attempt to group crash observations into clusters based on their similarity to one another. No crash, however, has been assigned to a specific group or class before this analysis.

We will be using a clustering algorithm to assign each crash to a group, but we do not have a ground-truth set of labels for our data. This type of problem is known as an unsupervised learning problem in machine learning. Grouping will be based on the internal relationship between individual data points within our dataset. This type of problem differs from supervised learning problems such as classification or regression, for which observations used for model training contain a ground-truth, or “correct,” label at the problem onset.

Part 1: Categorical Clustering

Dataset Preparation

The raw dataset contains 190 features for over 121,000 car crashes occurring in Allegheny County from 2004–2019. We will begin our analysis by filtering the dataset to only include accidents occurring in the City of Pittsburgh during the years we are interested in (2010–2019). A map illustrating the locations of all accidents that meet our filtering criteria is shown below. The boundaries shown on this map represent municipal borders within Allegheny County, the county that contains the City of Pittsburgh. The thick black boundary with gray shading represents the City of Pittsburgh. The “hole” in the southern portion of the City represents Mt. Oliver, a separate municipality that is fully surrounded by the City.

It is worth noting that the dataset contains some accidents that are mislabeled as occurring within the City limits but whose geographic coordinates lie outside of the City boundaries. While there are several potential reasons for these discrepancies, the impacted number of crash occurrences is relatively small. These crash occurrences have been kept in the dataset as part of this analysis, but they warrant further investigation as part of a more thorough analysis.

Next, we will identify features that should be removed prior to analysis. First, we will remove all features that contain only one unique value since these features will not provide useful information for clustering. An additional set of features to remove was identified by going through the data dictionary for the dataset and identifying features that do not provide useful information for clustering, that provide duplicate information, or that include too large a number of potential classes to be practical for a clustering algorithm to manage in a reasonable amount of time. Examples of features that were dropped include street names, roadway segments and offsets, route numbers, and others. Unique Crash Record Number (CRN) values have been assigned to each accident, and while these values will be maintained to keep track of each observation, they will not be used when performing clustering.

A subset of numerical features was also identified for removal from the dataset prior to clustering. Clustering using mixed datatypes (categorical and numerical) can be accomplished using the k-prototypes algorithm, but this article is focused on clustering using categorical data only.

Missing values were imputed prior to cluster analysis. If a categorical variable contained an “unknown” class, this class was used for all imputed values. Otherwise, the most commonly observed value in the dataset for that variable was imputed.

Most categorical data was already present in numerical form. Categorical features containing strings were label encoded using LabelEncoder from scikit-learn prior to clustering. Code used for data loading and preprocessing is shown below:
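A minimal sketch of these preprocessing steps, not the article's original code, might look like the following. It assumes the crash data has already been loaded into a pandas DataFrame. The function name, the `unknown_codes` mapping, and the toy column names other than CRN are hypothetical, and `pd.factorize` with `sort=True` stands in for scikit-learn's LabelEncoder so that the example only depends on pandas.

```python
import pandas as pd

def prepare_categorical(df, id_col="CRN", unknown_codes=None):
    """Drop constant columns, impute missing values, and integer-encode strings."""
    unknown_codes = unknown_codes or {}  # hypothetical map: column -> "unknown" code
    out = df.copy()

    # 1. Drop features with a single unique value; they cannot separate clusters.
    constant = [c for c in out.columns
                if c != id_col and out[c].nunique(dropna=False) <= 1]
    out = out.drop(columns=constant)

    for col in out.columns:
        if col == id_col:
            continue
        # 2. Impute with the variable's "unknown" class if it has one, else its mode.
        if out[col].isna().any():
            fill = unknown_codes.get(col, out[col].mode(dropna=True).iloc[0])
            out[col] = out[col].fillna(fill)
        # 3. Encode non-numeric (string) features as integer codes.
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = pd.factorize(out[col], sort=True)[0]
    return out

# Made-up rows for illustration only.
raw = pd.DataFrame({
    "CRN": [1, 2, 3, 4],
    "STATE": ["PA", "PA", "PA", "PA"],   # constant, so it is dropped
    "WEATHER": [1.0, None, 9.0, 1.0],    # 9 plays the role of "unknown"
    "ROAD": ["wet", "dry", None, "dry"], # no "unknown" class, so the mode is used
})
clean = prepare_categorical(raw, unknown_codes={"WEATHER": 9})
print(clean)
```

The CRN column is carried through untouched so that each observation can still be traced after clustering.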

k-Modes Clustering

Several standard clustering algorithms work by trying to minimize the distance between points within a cluster while maximizing the distance between points of different clusters. To do so requires measurement of the distance between individual datapoints. While this may be done simply with numerical data, it does not apply to non-numerical, or categorical, data. As a result, the approaches used by numerical clustering algorithms do not directly translate to categorical data.

The k-modes algorithm was developed for clustering categorical data. The algorithm is based on k-means, a popular numerical clustering algorithm that we will explore in a future analysis, and it was initially proposed in the paper “Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values” by Zhexue Huang. A Python implementation (kmodes) was developed based on the approach described in this paper.

While k-means operates by comparing calculated distances between points, k-modes uses a matching dissimilarity measure to compare categorical features instead. The matching dissimilarity between a pair of observations is the total number of features whose categories do not match. Each cluster is represented by its modes (the most frequent category for each feature) rather than by the means used in standard k-means, and the algorithm seeks to minimize the total dissimilarity between observations and the modes of their assigned clusters. As a result, observations that share several categorical characteristics are placed in the same group.
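To make the two ideas above concrete, here is a small pure-Python sketch of the matching dissimilarity and of a cluster mode. The feature values are made up for illustration, and the function names are not taken from the kmodes package.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Count the features on which two observations disagree."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(rows):
    """Return the most frequent value of each feature across a cluster's rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# Hypothetical crashes described by (collision type, road condition, lighting).
crash_a = ("angle", "dry", "daylight")
crash_b = ("angle", "wet", "dark")
crash_c = ("rear-end", "dry", "daylight")

print(matching_dissimilarity(crash_a, crash_b))   # 2 of the 3 features differ
print(cluster_mode([crash_a, crash_b, crash_c]))  # ('angle', 'dry', 'daylight')
```

Note that the mode of a cluster does not have to coincide with any single observation, since it is assembled one feature at a time.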

The k-modes algorithm has a couple of disadvantages which it shares with the k-means algorithm. Cluster formation is highly dependent on initialization conditions, and different clusters may be formed depending on the initial centroids that are selected. There are a number of different approaches to selecting the initial centroid points. In the kmodes package, they may be randomly selected observations (the “random” option), or they may be built by sampling categories for each feature according to their frequency in the dataset and then choosing the observations closest to those combinations. The latter method is known as the “Huang” initialization and is based on the approach used in the paper that originally proposed the k-modes algorithm. For our analysis, we will compare results using each of these two initialization methods.

Example code used to compare different numbers of clusters using the Huang initialization and the resulting figure demonstrating the relationship between cost and the number of clusters are shown below:

This procedure was repeated using random initialization, and the resulting plot of cost vs. number of clusters is shown below:

We can see each plot demonstrates an elbow near n=6, suggesting that six clusters may be a good choice. This elbow is most clearly visible in the plot generated using the Huang initialization, so we will use this initialization. Using six clusters (n_clusters = 6) and Huang initialization, we will assign each observed accident to a cluster.
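The cost-versus-number-of-clusters comparison can be illustrated with a deliberately simplified, pure-Python k-modes loop. This is a sketch under stated assumptions, not the kmodes package and not the code used for the article: it initializes from randomly chosen distinct observations rather than the Huang method, the toy rows are made up, and all function names are hypothetical. The cost is the total matching dissimilarity between each observation and the mode of its assigned cluster.

```python
import random
from collections import Counter

def dissim(a, b):
    """Matching dissimilarity: number of features that differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(rows, modes):
    """Label each row with the index of its least-dissimilar mode."""
    return [min(range(len(modes)), key=lambda j: dissim(r, modes[j])) for r in rows]

def k_modes_cost(rows, k, seed, n_iter=20):
    """Run one simplified k-modes fit and return its final cost."""
    rng = random.Random(seed)
    modes = rng.sample(sorted(set(rows)), k)  # k distinct observations as initial modes
    for _ in range(n_iter):
        labels = assign(rows, modes)
        new_modes = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            # Per-feature most frequent value; keep the old mode if the cluster is empty.
            new_modes.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
                if members else modes[j]
            )
        if new_modes == modes:  # converged
            break
        modes = new_modes
    labels = assign(rows, modes)
    return sum(dissim(r, modes[lab]) for r, lab in zip(rows, labels))

def elbow_costs(rows, k_values, n_init=5):
    """Lowest cost over several random starts for each candidate k."""
    return {k: min(k_modes_cost(rows, k, seed) for seed in range(n_init))
            for k in k_values}

# Three made-up crash profiles, ten copies each.
rows = ([("angle", "dry", "day")] * 10
        + [("rear-end", "wet", "day")] * 10
        + [("fixed object", "ice", "night")] * 10)
costs = elbow_costs(rows, k_values=[1, 2, 3])
print(costs)
```

On this toy data the cost falls to zero at three clusters because there are only three distinct profiles. On real data the cost keeps decreasing as clusters are added, and the elbow is the point after which the improvement levels off.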

Part 2: Cluster Exploration

Using the clusters assigned during Part 1 of this analysis, we can now group our dataset to look for shared accident characteristics among observations within each cluster. Based on these characteristics, we can assign broad descriptions to each cluster. To do this, we will compare the relative frequency of each value for each categorical variable in each cluster. Percentages included in the cluster descriptions below refer to the in-cluster frequency of specific categorical observations. As a source of comparison, we can also calculate the out-of-cluster frequency for all observations not included in the cluster being evaluated.

Plots of each cluster shown on a map containing municipality borders are also included in the sections that follow. The geographical distribution of each categorical cluster of accidents is highly informative, and this information can be used to identify intersections or roadway segments that appear to be “hot-spots” for specific categories of accidents. Standard numerical clustering algorithms may be used to further explore each cluster and identify locations with high accident density. This analysis is beyond the scope of this article but will be explored in a future evaluation.

Chi-square Test

We will need a test statistic to evaluate which categorical variables are most strongly dependent upon the cluster assigned by k-modes. This will give us an idea of which categorical features most strongly differentiate the clusters from one another. This can be accomplished using the Chi-square test. The null hypothesis of this test is that there is no significant relationship between two variables (in this case, between the assigned cluster and each of our other categorical features), while the alternative hypothesis is that a significant relationship exists.

To complete a Chi-square test, a contingency table must first be produced that lists frequencies of occurrence for each class within the categorical variable being evaluated, grouped based on the second categorical variable (in our case, this is the cluster assigned by the k-modes algorithm). An example contingency table for our assigned cluster (KMODE_CLUSTER) and the month of the crash (CRASH_MONTH) is shown below:

The following equation is used to calculate the Chi-square test statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

In this equation, the chi-square value (χ²) is calculated by squaring the difference between the observed count (O_i) and the expected count (E_i) and dividing by the expected count for each of the k cells of the contingency table. The sum of these values is then calculated, producing χ². The expected count is determined based on the assumption that the null hypothesis is true and the two variables are fully independent.

To obtain a p-value for our hypothesis test, the degrees of freedom are calculated by determining the number of classes associated with each variable being compared, subtracting one from each of those quantities, and taking the product of the two values. This is represented by the following equation:

$$\mathrm{df} = (r - 1)(c - 1)$$

Here, r represents the number of rows of the contingency table (number of categories associated with first variable), while c represents the number of columns of the contingency table (number of categories associated with the second variable).
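As a worked example of the statistic and degrees of freedom described above, the short sketch below computes both by hand for a small, made-up table of observed counts. The expected count for each cell is taken as the row total times the column total divided by the grand total, which is what independence of the two variables implies. The function name is hypothetical, and no continuity correction is applied.

```python
def chi_square(table):
    """Return (statistic, degrees_of_freedom, expected) for observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof, expected

# Made-up counts: rows are two classes of a feature, columns are two clusters.
observed = [[30, 10],
            [10, 30]]
stat, dof, expected = chi_square(observed)
print(stat, dof)  # 20.0 1
```

Every expected count here is 20, so each of the four cells contributes (10² / 20) = 5 to the statistic.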

In addition to the degrees of freedom, we also need to select a value for alpha, which represents the significance level at which we choose to reject the null hypothesis. If the p-value falls below alpha, we reject the null hypothesis; otherwise, we fail to reject it. For this analysis, we will choose a conservative alpha value of 0.01.

The p-value is determined from the test statistic and the calculated degrees of freedom, and there are several calculators and tables that can be used for this purpose. We will be using chi2_contingency from the scipy.stats module to perform our analysis. This function takes the contingency table as input and returns the test statistic, the p-value, the degrees of freedom, and the expected frequencies. The p-value is then compared with our selected alpha value.

The code below goes through each variable (except for the crash CRN) and performs a Chi-square test to determine if that variable is strongly related to the assigned cluster.
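A hedged sketch of such a loop, assuming the clustered data is held in a pandas DataFrame, is given here. It builds a contingency table for each feature with `pd.crosstab`, passes it to `scipy.stats.chi2_contingency`, and collects the results. The function name and the toy columns DEPENDENT and UNRELATED are hypothetical; only CRN and KMODE_CLUSTER come from the article.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_by_feature(df, cluster_col="KMODE_CLUSTER", id_col="CRN"):
    """Test every feature against the cluster label; return one row per feature."""
    records = []
    for col in df.columns:
        if col in (cluster_col, id_col):
            continue
        table = pd.crosstab(df[col], df[cluster_col])
        chi2, p_value, dof, _ = chi2_contingency(table.to_numpy())
        records.append({"feature": col, "chi2": chi2, "p_value": p_value, "dof": dof})
    return pd.DataFrame(records).sort_values("p_value", ascending=False)

ALPHA = 0.01

# Made-up data: one feature mirrors the cluster label, the other ignores it.
toy = pd.DataFrame({
    "CRN": range(12),
    "KMODE_CLUSTER": [0, 0, 1, 1, 2, 2] * 2,
    "DEPENDENT": [0, 0, 1, 1, 2, 2] * 2,
    "UNRELATED": [0, 1, 0, 1, 0, 1] * 2,
})
results = chi_square_by_feature(toy)
not_significant = results[results["p_value"] > ALPHA]
print(not_significant["feature"].tolist())
```

Features that end up in `not_significant` are the ones for which the null hypothesis of independence from the cluster assignment cannot be rejected at the chosen alpha.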

While we expect the relationship between many variables and the assigned clusters to be strong since the k-modes algorithm used these variables to create the clusters, this analysis will help us determine which variables were least useful and may safely be ignored when comparing the clusters to one another. To determine these values, we will filter the results of the Chi-square test to only show variables whose p-value exceeds our alpha value of 0.01:

These features are essentially independent of cluster assignment and do not provide useful information differentiating the clusters from one another.

Comparison Plots and Tables

For binary categorical variables, a comparison heatmap was created that identifies the frequency of occurrence for each feature and compares these values across each of the six clusters. This heatmap is shown below:

In addition to calculating the prevalence of positive responses within each cluster for each binary variable, we also calculated the prevalence of positive responses that can be found for all observations from the dataset that were not included in the cluster. Comparing the prevalence of positive responses between observations within the cluster and out-of-cluster observations gives us an idea of whether the cluster differs from the rest of the dataset and, if so, to what degree.
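The in-cluster versus out-of-cluster comparison can be sketched as follows, assuming the binary variables are coded as 0/1 columns in a pandas DataFrame. The function name, the ALCOHOL_RELATED column name, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def prevalence_in_vs_out(df, flag_cols, cluster_col="KMODE_CLUSTER"):
    """Percent of positive responses inside each cluster vs. in all other clusters."""
    rows = []
    for cluster in sorted(df[cluster_col].unique()):
        inside = df[cluster_col] == cluster
        for col in flag_cols:
            rows.append({
                "cluster": cluster,
                "feature": col,
                "in_cluster_pct": 100 * df.loc[inside, col].mean(),
                "out_of_cluster_pct": 100 * df.loc[~inside, col].mean(),
            })
    return pd.DataFrame(rows)

# Made-up data: eight crashes split evenly between two clusters.
toy = pd.DataFrame({
    "KMODE_CLUSTER": [0, 0, 0, 0, 1, 1, 1, 1],
    "ALCOHOL_RELATED": [1, 0, 0, 0, 1, 1, 1, 0],
})
summary = prevalence_in_vs_out(toy, flag_cols=["ALCOHOL_RELATED"])
print(summary)
```

A large gap between the two percentages for a feature suggests that the feature helps distinguish that cluster from the rest of the dataset.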

Categorical variables with more than two possible classes were also compared using both visualizations and calculations. Count plots were generated for crash hour, day of week, and month for each cluster to compare time-based trends. An example plot including the number of accidents for each hour of the day grouped by cluster is shown below:

For variables such as the type of collision (COLLISION_TYPE) or the condition of the roadway (ROAD_CONDITION), the percentage occurrence of each class was calculated for each cluster and these values were compared between the clusters.
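One way to compute these per-cluster percentages, shown here as a sketch with made-up data, is a row-normalized cross-tabulation in pandas. String labels are used for readability, whereas the real dataset stores these classes as codes.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "KMODE_CLUSTER": [0, 0, 0, 0, 1, 1, 1, 1],
    "COLLISION_TYPE": ["angle", "angle", "rear-end", "fixed object",
                       "rear-end", "rear-end", "rear-end", "angle"],
})

# normalize="index" divides each row by its total, so every cluster sums to 100%.
pct = pd.crosstab(toy["KMODE_CLUSTER"], toy["COLLISION_TYPE"], normalize="index") * 100
print(pct.round(1))
```

Reading across a row gives the mix of collision types within one cluster, and reading down a column shows how the share of one collision type varies between clusters.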

Cluster Descriptions

A summary of the characteristics associated with each cluster is included in the list below. More detailed descriptions of each cluster are included in the sections that follow.

Cluster 0 (Local Road Daytime Impairment / Inclement Weather) — Largest cluster. Accidents primarily occur on local roads and are generally confined to the roadway. These accidents primarily occur during the day and peak at 5 PM, suggesting commuter traffic. No strong seasonal pattern. Mostly angle collisions, but with significant numbers of rear-end and fixed object collisions as well. High occurrences of head-on and pedestrian collisions. Highest prevalence of accidents associated with distracted driving or driver fatigue. Second highest prevalence of factors related to impairment / alcohol / drugs. Second highest prevalence of factors related to inclement weather. Lowest likelihood of minor or moderate injuries.
Cluster 1 (Local Road Aggressive Driving / Lack of Clearance) — Smallest cluster. Accidents primarily occur on local roads between 7 AM and 7 PM with a peak at 4 PM. No strong seasonal pattern. Generally, higher prevalence of accidents Monday-Friday than on weekends. Accidents primarily confined to roadway. High prevalence of accidents associated with a lack of vehicle clearance. High occurrence of accidents that involve running a stop sign. Mostly angle collisions, with occasional fixed object and rear-end collisions. Highest prevalence of aggressive driving. Low association with inclement weather conditions or impairment-related factors. Lowest prevalence of major injuries.
Cluster 2 (Large Road Nighttime Impairment / Inclement Weather) — Accidents primarily occur on larger roads (freeways and arterial roadways). High prevalence of accidents on roadway shoulder or outside of the trafficway. Highest occurrences of accidents on curved roads and accidents involving speeding. Strong seasonal pattern, with peak accidents during winter months. Primarily occur during the nighttime (6 PM — 6 AM) with a peak at 2 AM. Accident prevalence is highest on Saturday-Sunday and lower on weekdays. Highest prevalence of collisions with fixed objects. Highest occurrence of alcohol / drug / impairment features. Highest prevalence of accidents citing snow / ice / wet roads as a factor.
Cluster 3 (Large Road Rear-End / Tailgating / Speeding — Injury-Causing) — Accidents primarily occur on interstates / freeways and larger arterial roadways. Accidents primarily confined to roadway. Typical commuter traffic pattern with a peak at 4 PM. No strong observable seasonality. Higher prevalence on weekdays than on the weekend, with a strong peak on Fridays. Highest prevalence of rear-end collisions, accidents involving tailgating, and accidents associated with speeding. High prevalence of aggressive driving. Low occurrence of features related to impairment or inclement weather. Second highest prevalence of minor, moderate, or major injuries.
Cluster 4 (Pedestrian / Motorcycle / Bicycle — Injury-Causing) — More even distribution between local and state roads, but relatively low occurrence on freeways. Accidents are mostly confined to the roadway. No strong seasonality. Typical daytime hour pattern with a peak at 3 PM. Occurrence remains significant until 11 PM. Secondary hourly peak visible during morning rush hour. Slightly higher prevalence on weekdays than on weekends with a peak on Thursday. Highest prevalence of accidents involving an unbelted passenger or driver. Highest prevalence of collisions involving pedestrians, motorcycles, or bicycles. Mostly rear-end or angle collisions. Moderate prevalence of impairment / alcohol-related factors, but not a primary defining characteristic of the cluster. Highest prevalence of minor, moderate, and major injuries.
Cluster 5 (Local Road Intersection / Running a Red-Light / Wet Roads) — Accidents primarily occur on local roads, almost entirely at intersections. Typical daytime traffic pattern with a peak at 5 PM. Higher accident prevalence on Friday-Saturday. Highest occurrence of accidents that involve running a red light. Primarily angle or rear-end collisions. Second highest prevalence of accidents with pedestrians. Low prevalence of impairment / drug / alcohol-related factors. High prevalence of accidents due to wet roadway conditions, but low prevalence of accidents related to snow/ice. Low prevalence of injuries (minor, moderate, or major).

Cluster 0 (Local Road Daytime Impairment / Inclement Weather):

This cluster includes 9,814 accidents (23.0% of the dataset), making it the largest cluster. A map of accidents included in this cluster is shown below:

Road & Location: This cluster is composed of accidents occurring primarily on local roads (89.2% in-cluster vs. 61.8% out-of-cluster). Most accidents in this cluster occurred within the roadway (63.6%), with occasional occurrences outside of the trafficway (9.2%) or roadside (9.1%). This cluster also contained the highest prevalence of accidents in parking lanes (10%).

Seasonality & Hourly Distribution: The peak month for observed accidents included in this cluster is May. Very little seasonality is identifiable, but a slight increase in accidents during December — January is observable. Accidents grouped in this category typically occur during daylight hours between 7 AM and 6 PM, with a peak at 5 PM and secondary peaks at 8 AM and 3 PM. This pattern suggests these accidents are primarily composed of drivers commuting to and from work or school.

Accident Characteristics / Collision Types: This cluster contained the highest prevalence of accidents associated with vehicle failure (7.4% in-cluster vs. 4.6% out-of-cluster). Accidents in this cluster also had the highest prevalence of collisions involving a crossed median (2.2% in-cluster vs. 0.7% out-of-cluster) and collisions with parked vehicles (34.5% in-cluster vs. 5.2% out-of-cluster). The high prevalence of collisions with parked vehicles is predictable given the high prevalence of accidents occurring in parking lanes.

Angle collisions are the most prevalent type of collision within this cluster (25%). However, accidents in this cluster also include high prevalences of rear-end collisions (17.5%), collisions with fixed objects (16.8%) and same-direction sideswipes (16.8%). This cluster also includes the highest prevalence of head-on collisions (7.1%) and the second highest prevalence of collisions involving pedestrians (5.5%).

Driver Characteristics & Behavior: Accidents in this cluster had the lowest prevalence of aggressive driving as a causal factor (23.1% in-cluster vs. 54.8% out-of-cluster). This cluster included the highest prevalence of accidents involving distracted driving (14% in-cluster vs. 10.4% out-of-cluster) and driver fatigue (1.9% in-cluster vs. 0.9% out-of-cluster).

Impairment (Drugs & Alcohol): This cluster contains the second highest observed prevalences of accidents associated with impairment (10.1%), drug use (2.1%), and alcohol (8.8%). Only Cluster 2 contained a higher prevalence of accidents associated with impairment or other features related to drugs and alcohol. Accidents in this cluster were also more likely to be related to illegal drugs (1.3% in-cluster vs. 0.7% out-of-cluster) than any of the other clusters.

Weather & Road Conditions: While all of the clusters contain a fairly even balance of accidents occurring under wet or dry conditions, only two clusters (Clusters 0 and 2) include a prevalence of accidents involving snow or ice above 2%. While Cluster 2 contains the highest prevalence of accidents related to snow/ice, Cluster 0 contains the second highest prevalence (7.7%).

Injuries & Damage: Accidents included in this cluster were the least likely to result in minor or moderate injuries. Only 8.4% of accidents in Cluster 0 resulted in a minor injury compared to 19% in the remainder of the dataset, and only 5.5% of accidents in Cluster 0 resulted in a moderate injury compared to 10.4% out-of-cluster. This cluster also includes the highest prevalence of fire damage to vehicles (1.3% in-cluster vs. 0.3% out-of-cluster).

Cluster 1 (Local Road Aggressive Driving / Lack of Clearance):

This cluster includes 5,329 accidents (12.5% of the dataset), making it the smallest cluster. A map of accidents included in this cluster is shown below:

Road & Location: Like Cluster 0, accidents in this cluster occurred predominantly on local roads (95.5% in-cluster vs. 64.2% out-of-cluster). A majority of accidents included in this group occurred within the roadway (87.7%), with infrequent occurrences outside the trafficway (5.3%).

Seasonality & Hourly Distribution: The peak month for accidents included in this cluster is November. A strong seasonal pattern is not identifiable, but accident rates for this cluster increase slightly in October and November. Accidents grouped in this category typically occur between 7 AM and 7 PM, with a peak at 4 PM. The lowest accident prevalence is observed from 12–6 AM. Like Cluster 0, this cluster also includes a typical daily commuter pattern.

Accident Characteristics / Collision Types: This cluster contained the highest prevalence of accidents associated with school buses (1.3% in-cluster vs. 0.9% out-of-cluster). This cluster also includes the lowest observed prevalence for accidents associated with vehicle failure (3.5% in-cluster vs. 5.5% out-of-cluster). A very high percentage of accidents in this cluster were associated with a lack of vehicle clearance (31.8% in-cluster vs. 2.1% out-of-cluster), indicating that this is a strong defining feature of this cluster. This cluster contains significantly more accidents associated with running stop signs than any of the other clusters (11.3% in-cluster vs. 0.2% out-of-cluster).

Angle collisions are the most prevalent type of collision within this cluster by a wide margin (60.7%). Less frequently observed collision types included collisions with fixed objects (9.8%), rear-end collisions (9.4%), and head-on collisions (6.0%).

Driver Behavior: Accidents within this cluster contained the highest observed prevalence of aggressive driving (69.8% in-cluster vs. 44.3% out-of-cluster). Low prevalences were observed for accidents associated with fatigue/falling asleep (0.5% in-cluster vs. 1.2% out-of-cluster), distracted driving (5.9% in-cluster vs. 12.0% out-of-cluster), and cell phone usage (0.6% in-cluster vs. 1.1% out-of-cluster).

Impairment (Drugs & Alcohol): Accidents in this cluster had the lowest observed prevalence of association with impairment (4.3% in-cluster vs. 9.6% out-of-cluster), drugs (1.0% in-cluster vs. 1.7% out-of-cluster), and alcohol (3.8% in-cluster vs. 8.9% out-of-cluster).

Weather & Road Conditions: A majority of the accidents included in this cluster occurred during dry road conditions (69.9%), while 23.3% occurred on wet roadways. Snow/ice conditions were observed in <5% of accidents within this cluster.

Injuries & Damage: Accidents included in this cluster were the least likely to result in major injuries (1.1% in-cluster vs. 2.1% out-of-cluster). This cluster also included the lowest prevalence of accidents involving overturned vehicles (0.5% in-cluster vs. 1.6% out-of-cluster).

Cluster 2 (Large Road Nighttime Impairment / Inclement Weather):

This cluster includes 6,700 accidents (15.7% of the dataset). A map of accidents included in this cluster is shown below:

Road & Location: Accidents in this cluster occurred predominantly on larger state-operated roads (70.1% in-cluster vs. 42.4% out-of-cluster). This cluster also includes accidents on local roads, but with a lower prevalence than the remainder of the dataset (37.5% in-cluster vs. 73.8% out-of-cluster). The largest share of accidents included in this group occurred along the roadway shoulder (29.2%), with frequent occurrences outside the trafficway (23.6%) or off of the trafficway but in a vehicle area (21.1%). This cluster also contained the highest prevalence of accidents associated with curved roads (36.4% in-cluster vs. 9.5% out-of-cluster).

Seasonality & Hourly Distribution: The peak month for observed accidents included in this cluster is January. Accident rates for this cluster follow a strong seasonal pattern, with increased numbers of accidents occurring between October and March with a dip in accident rates from April to September. Accidents grouped in this category typically occur during the night, between 6 PM and 6 AM. Values increase steadily beginning at 11 PM with a peak at 2 AM.

Accident Characteristics / Collision Types: The vast majority of collisions associated with this cluster were with a fixed object (82.7%). The second highest observed prevalence was for rear-end collisions, with only 6.1% of accidents included in this cluster reporting this type of collision. This cluster also contains the highest prevalence of accidents that involve a single vehicle running off of the roadway (89.3% in-cluster vs. 15.8% out-of-cluster).

Driver Behavior: This cluster contained the highest observed prevalence of accidents associated with the driver speeding (6.4% in-cluster vs. 3.2% out-of-cluster) or making an error associated with a curve in the road (7.0% in-cluster vs. 1.3% out-of-cluster).

Impairment (Drugs & Alcohol): This cluster had the highest observed prevalence of accidents associated with impairment (19.5% in-cluster vs. 7.0% out-of-cluster), alcohol (18.3% in-cluster vs. 6.4% out-of-cluster), and drugs (2.4% in-cluster vs. 1.5% out-of-cluster). High prevalences of alcohol usage and other features associated with impairment are characteristics that distinguish this cluster from the others.

Weather & Road Conditions: A majority of the accidents included in this cluster occurred during nighttime or during periods of poor illumination (78.4% in-cluster vs. 28.9% out-of-cluster). This cluster also included the highest prevalence of accidents associated with wet roads (22.2% in-cluster vs. 16.7% out-of-cluster), ice (9.5% in-cluster vs. 2.9% out-of-cluster), and snow/slush (7.2% in-cluster vs. 2.7% out-of-cluster).

Injuries & Damage: Accidents included in this cluster were more likely to require notification of a highway maintenance crew than any other cluster (5.7% in-cluster vs. 1.2% out-of-cluster).

Cluster 3 (Large Road Rear-End / Tailgating / Speeding — Injury-Causing):

This cluster includes 6,795 accidents (15.9% of the dataset). A map of accidents included in this cluster is shown below:

Road & Location: This cluster contains the highest prevalence of accidents occurring on larger state roads (89.5% in-cluster vs. 38.7% out-of-cluster), particularly interstates (40.0% in-cluster vs. 5.6% out-of-cluster). As a result, this cluster contains the lowest percentage of accidents on local roads (11.2% in-cluster vs. 78.8% out-of-cluster). This cluster also contains the highest prevalence of accidents occurring in work zones (3.8% in-cluster vs. 1.2% out-of-cluster).

A very high percentage of accidents included in this group occurred within the roadway (92.1%). Additionally, this cluster has the highest prevalence of accidents occurring at a non-intersection (i.e. along a linear segment of roadway) (93.7% in-cluster vs. 38.1% out-of-cluster).

Seasonality & Hourly Distribution: Accident rates for this cluster are fairly steady from month to month, with a slight increase from July to November and a large peak in October. Most accidents within this cluster occur between 7 AM and 6 PM, with a peak at 4 PM.

Accident Characteristics / Collision Types: This cluster included the lowest prevalence of accidents involving a single vehicle running off of the roadway (7.5% in-cluster vs. 31.1% out-of-cluster).

This cluster contained the highest prevalence of rear-end accidents (75.3% in-cluster vs. 16.7% out-of-cluster) and the lowest prevalence of collisions with fixed objects (4.3% in-cluster vs. 24.6% out-of-cluster).

Driver Behavior: This cluster had the highest prevalence of accidents involving tailgating (18.1% in-cluster vs. 2.6% out-of-cluster), as well as the highest percentage of accidents associated with speeding, either by the driver or others (30.1% in-cluster vs. 11.3% out-of-cluster). This cluster also includes the second-highest prevalence of accidents associated with aggressive driving (64.0% in-cluster vs. 44.3% out-of-cluster).

Impairment (Drugs & Alcohol): This cluster had a relatively low prevalence of accidents associated with impairment (6.0% in-cluster vs. 9.5% out-of-cluster), alcohol (5.3% in-cluster vs. 8.8% out-of-cluster), or drugs (1.5% in-cluster vs. 1.7% out-of-cluster), indicating that the presence of impairment factors is not a primary characteristic of this cluster.

Weather & Road Conditions: This cluster had the highest prevalence of dry road conditions (71.4%). Wet road conditions were observed in approximately 22.7% of accidents. Accidents characterized by icy or snowy road conditions had a prevalence of <2% within this cluster, indicating that inclement weather is not a defining characteristic of this group. Also, this cluster contained the lowest prevalence of accidents associated with dark conditions or poor illumination (22.3% in-cluster vs. 39.4% out-of-cluster).

Injuries & Damage: Accidents within this cluster had the second-highest prevalence of minor injuries (27.2% in-cluster vs. 14.6% out-of-cluster), moderate injuries (14.3% in-cluster vs. 8.3% out-of-cluster), and major injuries (2.5% in-cluster vs. 1.9% out-of-cluster).
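The in-cluster vs. out-of-cluster figures quoted throughout these summaries compare how often a feature appears inside one cluster against how often it appears in all of the remaining accidents. The following is a minimal sketch of that comparison using only the Python standard library. The record layout, the "cluster" key, and the "REAR_END" flag name are hypothetical stand-ins, and the toy records are made up for illustration rather than drawn from the crash dataset.

```python
def prevalence_in_vs_out(records, cluster_id, flag):
    """Return (in_cluster_pct, out_of_cluster_pct) for a 0/1 flag."""
    inside = [r[flag] for r in records if r["cluster"] == cluster_id]
    outside = [r[flag] for r in records if r["cluster"] != cluster_id]

    def pct(values):
        # Share of records where the flag is set, as a percentage.
        return 100.0 * sum(values) / len(values) if values else float("nan")

    return pct(inside), pct(outside)


# Toy records (hypothetical keys, made-up values).
records = [
    {"cluster": 3, "REAR_END": 1},
    {"cluster": 3, "REAR_END": 1},
    {"cluster": 3, "REAR_END": 1},
    {"cluster": 3, "REAR_END": 0},
    {"cluster": 4, "REAR_END": 1},
    {"cluster": 4, "REAR_END": 0},
    {"cluster": 5, "REAR_END": 0},
    {"cluster": 5, "REAR_END": 0},
]

in_pct, out_pct = prevalence_in_vs_out(records, 3, "REAR_END")
print(f"{in_pct:.1f}% in-cluster vs. {out_pct:.1f}% out-of-cluster")
```

A large gap between the two percentages is what marks a feature as characteristic of a cluster; similar values suggest the feature does little to distinguish it.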

Cluster 4 (Pedestrian / Motorcycle / Bicycle — Injury-Causing):

This cluster includes 6,145 accidents (14.4% of the dataset). A map of accidents included in this cluster is shown below:

Road & Location: This cluster contains a fairly even distribution of accidents on smaller local roads and larger state-owned roads, with relatively few accidents occurring on interstates (5.3% in-cluster vs. 5.6% out-of-cluster). Most accidents within this cluster were confined to the roadway (87.6%), with occasional occurrences outside of the trafficway (4.0%), on the roadside (3.1%), or on the roadway shoulder (2.9%).

Seasonality & Hourly Distribution: Accident prevalence in this cluster peaks in January, but no strong seasonal pattern is otherwise identifiable. Prevalence is higher Monday through Friday, with a peak on Thursday. Within the day, prevalence is highest from 6 AM to 11 PM: accidents increase from 6 AM to 8 AM, decline briefly until 10 AM, and then increase steadily to a peak at 3 PM.
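The hourly patterns described for each cluster come down to counting accidents by hour of day and finding where the counts peak. Below is a small sketch of that tally, assuming each accident carries an integer hour from 0 to 23. The `hours` list is made-up toy input, not data from the crash dataset, and the function name is our own.

```python
from collections import Counter


def hourly_counts(hours):
    """Return a 24-element list of accident counts, indexed by hour (0-23)."""
    counts = Counter(hours)
    return [counts.get(h, 0) for h in range(24)]


# Toy input: hour of day for a handful of hypothetical accidents.
hours = [7, 8, 8, 15, 15, 15, 16, 23]

counts = hourly_counts(hours)
# The earliest hour wins if several hours tie for the maximum.
peak_hour = max(range(24), key=lambda h: counts[h])
print(f"Peak hour: {peak_hour}:00 with {counts[peak_hour]} accidents")
```

The same tally works for months or days of the week by swapping in the corresponding field and range.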

Accident Characteristics / Collision Types: This cluster contains the highest prevalence of accidents involving at least one person not wearing a seat belt (12.6% in-cluster vs. 10.6% out-of-cluster). This cluster also contains the highest prevalence of accidents involving at least one pedestrian (12.3% in-cluster vs. 6.6% out-of-cluster), at least one motorcycle (4.5% in-cluster vs. 1.7% out-of-cluster), or at least one bicycle (2.8% in-cluster vs. 1.0% out-of-cluster).

Rear-end collisions are the most common type of accident within this cluster (30.1%), followed by angle collisions (29.2%), collisions with pedestrians (12.3%), and collisions with fixed objects (10.4%).

Driver Behavior: This cluster contains the second highest prevalence of accidents related to tailgating (6.5% in-cluster vs. 4.8% out-of-cluster) as well as the third-highest prevalence of accidents associated with speeding (10.3%). This cluster also has the second highest prevalence of accidents that involve running a red light (6.4% in-cluster vs. 4.0% out-of-cluster).

Impairment (Drugs & Alcohol): This cluster had the third highest prevalence of accidents associated with both impairment (7.1%) and alcohol usage (6.9%). However, the prevalence of factors associated with impairment within this cluster was moderate relative to the other accident groupings.

Weather & Road Conditions: The prevalences of accidents associated with wet roads (16.6% in-cluster vs. 17.7% out-of-cluster), icy roads (2.2% in-cluster vs. 4.2% out-of-cluster), or snow/slush (2.0% in-cluster vs. 3.6% out-of-cluster) are low within this cluster. Inclement weather conditions do not appear to be a defining characteristic of this cluster.

Injuries & Damage: This cluster contains the highest prevalence of accidents resulting in minor injuries (33.1% in-cluster vs. 13.8% out-of-cluster), moderate injuries (17.4% in-cluster vs. 7.9% out-of-cluster), and major injuries (3.7% in-cluster vs. 1.7% out-of-cluster). Accidents within this cluster also have the lowest prevalence of requiring a damaged vehicle to be towed (75.9% in-cluster vs. 90.8% out-of-cluster).

Cluster 5 (Local Road Intersection / Running a Red-Light / Wet Roads):

This cluster includes 7,899 accidents (18.5% of the dataset). A map of accidents included in this cluster is shown below:

Road & Location: Accidents within this cluster occur primarily on smaller local roads (90.8% in-cluster vs. 62.9% out-of-cluster). This cluster contains the lowest prevalence of accidents occurring on interstates (0.5% in-cluster vs. 13.5% out-of-cluster) as well as the lowest prevalence of accidents on curved roads (3.9% in-cluster vs. 14.6% out-of-cluster). A vast majority (93.2%) of accidents within this cluster occurred within the roadway, with occasional occurrences outside of the trafficway (3.1%) or roadside (1.4%). This cluster also contains the highest prevalence of accidents occurring at intersections (99.9% in-cluster vs. 42.4% out-of-cluster).

Seasonality & Hourly Distribution: The peak accident prevalence for this cluster occurs during the month of December. Accident prevalence is highest on Friday and Saturday, with a peak on Saturday. During the day, accident prevalence is highest between 7 AM and 7 PM, with a peak at 5 PM. The lowest observed accident prevalence occurs between 3 AM and 5 AM.

Accident Characteristics / Collision Types: This cluster contains the second highest observed prevalence of angle collision accidents (48.9%). Rear-end collisions were the second most common type of collision observed within this cluster (19.0%), followed by collisions with pedestrians (10.1%), and head-on collisions (6.0%). The observed prevalence of collisions with pedestrians was the second highest among all clusters.

Driver Behavior: This cluster contains the highest prevalence of accidents that involve running a red light (17.7% in-cluster vs. 1.3% out-of-cluster) and the lowest prevalence of accidents that involve running a stop sign (0.1% in-cluster vs. 1.9% out-of-cluster). Additionally, this cluster had the lowest prevalence of accidents involving speeding (6.3% in-cluster vs. 16.1% out-of-cluster), accidents associated with driver error related to a curve in the roadway (0.6% in-cluster vs. 2.6% out-of-cluster), accidents related to driver fatigue (0.5% in-cluster vs. 1.3% out-of-cluster), and accidents where at least one passenger was not wearing a seat belt (9.6% in-cluster vs. 10.8% out-of-cluster).

Impairment (Drugs & Alcohol): Accidents within this cluster had the lowest prevalence of association with drugs (1.0% in-cluster vs. 1.8% out-of-cluster). The cluster contains the second-lowest prevalence of accidents involving impairment (5.8% in-cluster vs. 9.7% out-of-cluster) and the third-lowest prevalence of accidents involving alcohol (5.7% in-cluster vs. 8.8% out-of-cluster). These results indicate that features related to impairment are not a primary characteristic of this cluster.

Weather & Road Conditions: The prevalence of accidents associated with wet roads within this cluster (19.4%) was the second highest. This cluster had the lowest prevalence of accidents associated with icy roads (1.6% in-cluster vs. 4.2% out-of-cluster).

Injuries & Damage: This cluster contains the second-lowest prevalence of accidents resulting in minor injuries (14.0% in-cluster vs. 14.6% out-of-cluster) and major injuries (1.4% in-cluster vs. 1.9% out-of-cluster). Accidents within this cluster have the lowest prevalence of requiring notification of highway maintenance (0.7% in-cluster vs. 90.8% out-of-cluster), as well as the lowest prevalence of overturned vehicles (0.5% in-cluster vs. 1.5% out-of-cluster).
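Statements such as "highest" or "second-lowest prevalence" in the summaries above come from ordering the clusters by the in-cluster prevalence of a single feature. Here is a hedged sketch of that ranking. The cluster labels and percentages are toy values chosen for illustration and do not correspond to any figures reported in this article.

```python
def rank_clusters(prevalence_by_cluster):
    """Return cluster labels ordered from highest to lowest prevalence."""
    return sorted(prevalence_by_cluster, key=prevalence_by_cluster.get, reverse=True)


# Toy in-cluster prevalences (percent) for one hypothetical feature.
toy_prevalence = {"A": 12.0, "B": 30.5, "C": 7.25, "D": 18.0}

ranking = rank_clusters(toy_prevalence)
for position, label in enumerate(ranking, start=1):
    print(f"{position}. cluster {label}: {toy_prevalence[label]}%")
```

The first entry of the ranking is the cluster with the highest prevalence and the last entry is the one with the lowest.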

Further Analysis

There are numerous opportunities for future studies and analyses based on the raw dataset used to prepare this article. The dataset contains several numerical features that may also be used to refine clustering. Additionally, alternative clustering algorithms for categorical data may be evaluated.
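One commonly cited way to bring numerical features into a k-modes style analysis is the k-prototypes approach, which combines squared Euclidean distance on numeric features with a weighted count of mismatched categories. The sketch below shows only that mixed dissimilarity measure, not a full clustering run, and it is not part of the analysis in this article. The feature names, the toy values, and the weight `gamma=0.5` are assumptions for illustration. Numeric features would normally be scaled first so that no single one dominates the distance.

```python
def mixed_dissimilarity(a, b, numeric_keys, categorical_keys, gamma=1.0):
    """Squared Euclidean distance on numeric features plus gamma times
    the number of categorical features whose values differ."""
    numeric_part = sum((a[k] - b[k]) ** 2 for k in numeric_keys)
    categorical_part = sum(a[k] != b[k] for k in categorical_keys)
    return numeric_part + gamma * categorical_part


# Two toy accidents with hypothetical, pre-scaled numeric features.
crash_a = {"vehicles_scaled": 0.5, "speed_limit_scaled": 1.0,
           "road_type": "local", "collision": "rear_end"}
crash_b = {"vehicles_scaled": 0.0, "speed_limit_scaled": 1.0,
           "road_type": "state", "collision": "rear_end"}

distance = mixed_dissimilarity(
    crash_a, crash_b,
    numeric_keys=["vehicles_scaled", "speed_limit_scaled"],
    categorical_keys=["road_type", "collision"],
    gamma=0.5,  # assumed weight balancing numeric and categorical parts
)
print(distance)
```

The weight `gamma` controls how much a categorical mismatch counts relative to numeric distance, so its value would need to be tuned for the dataset at hand.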

The clusters assigned during this evaluation may be examined in greater detail to determine the location of geographical “hot-spots.” This information may prove useful for identifying specific locations that pose an elevated risk for a specific category of accident. These areas may be difficult to identify when looking at all accidents lumped together. Traditional clustering methods such as k-means or Gaussian mixture models may be explored for identification of these locations.
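As a rough illustration of the hot-spot idea, the sketch below runs a plain k-means (Lloyd's algorithm) on the coordinates of accidents from a single cluster. It is a minimal standard-library version with caller-supplied starting centroids, not the method used in this article. The coordinates are made-up toy points in the general vicinity of Pittsburgh. Treating latitude and longitude as planar coordinates is a simplifying assumption that is only reasonable over a small area.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm on 2-D points; returns the final centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its group.
        # An empty group keeps its previous centroid.
        updated = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:  # assignments have stopped changing
            break
        centroids = updated
    return centroids


# Toy (latitude, longitude) pairs forming two loose groups.
points = [
    (40.44, -80.00), (40.45, -80.01), (40.44, -80.01),
    (40.46, -79.92), (40.47, -79.93), (40.46, -79.93),
]

centroids = kmeans(points, centroids=[points[0], points[3]])
labels = [nearest(p, centroids) for p in points]
print(centroids)
print(labels)
```

Each resulting centroid marks a candidate hot-spot, and the number of points assigned to it gives a rough sense of how concentrated accidents are around that location.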

Thanks for Reading!

Please give this article a clap if you found it useful or interesting!