Finding Analogous Cases with K-Nearest Neighbor
How to use scikit-learn's NearestNeighbors() to find matter analogues in tabular legal data
Introduction
Partner: What other cases like this has the firm handled in the past?
Associate: Uhm... (starts thinking of search terms, cancels evening plans)
In an earlier post, I provided a brief introduction to the NearestNeighbors()
class from Python's scikit-learn library. In this post, I demonstrate one way this algorithm can be applied to legal data in tabular form. The dataset I use comprises matter profiles of 10,000 (fake) cases. I show how NearestNeighbors()
can be used to query the dataset to find cases that are similar to each other.
For a brief primer on KNN and further resources, please see my first post on this topic.
Applying NearestNeighbors to Legal Data
Importing Dependencies and Loading the Dataset
We begin with importing our dependencies and loading the dataset. For this demo, we only need the pandas library and the NearestNeighbors()
class from scikit-learn.
The dataset is a CSV file hosted on the web (in my GitHub repository).
# Import dependencies
import pandas as pd
from sklearn.neighbors import NearestNeighbors
# Load the dataset and remove one unnecessary column that was created when this DataFrame was converted into a CSV file
url = 'https://raw.githubusercontent.com/litkm/KNN-Project/main/cases_dataset.csv'
df = pd.read_csv(url)
df = df.drop('Unnamed: 0', axis=1)
# Print the first 5 rows from the DataFrame
df.head()
This is a dataset of matter profiles:
- Each row represents one matter.
- Each column indicates a feature relating to these matters.
- There are ten different lawyers, eight matter types, five jurisdictions, eight industries, and three client roles.
- The columns labelled Tort, Contract, Restitution, and Statute relate to causes of action. A "1" indicates the type of cause of action applies to a given matter, whereas a "0" indicates it does not.
- The columns labelled Injunction, Jurisdiction Motion, Motion to Strike, and Summary Judgment relate to motions. A "1" indicates this type of motion occurred in a given matter, whereas a "0" indicates it did not.
- The final column, Claim Amount, specifies the dollar amount of damages the plaintiff claimed.
By way of example, let's take a closer look at the second matter in the dataset.
# Print data at row 1 of the DataFrame
display(df.iloc[1])
This entry indicates:
- Lawyer Beta is the lead lawyer.
- This is a competition class action filed in B.C. relating to the retail industry.
- The client is the plaintiff.
- The claim involves causes of action in tort, contract, restitution, and statute.
- There was no injunction.
- The matter involved motions relating to jurisdiction and striking a pleading, but not summary judgment.
- The plaintiff claims ~$16 million in damages.
Per the below, this dataset contains 10,000 entries, i.e. matters.
# Print summary of the DataFrame
df.info()
You can view the dataset in "raw" form at this link.
The content of this dataset is entirely fake. I generated it randomly. If you would like to create your own datasets for experimenting with KNN, you can find the code I used in my GitHub repository for this project.
Normalizing the Dataset
As is, the dataset is not in a form the KNN algorithm can process. We must make certain changes to it. First, we must scale the numerical features so they use a standard range. Second, we must convert the categorical features, i.e. the non-numerical features (e.g. the various lawyer names under the Lawyer column), into numbers. These two processes are forms of normalizing a dataset.
Normalizing the numerical features means rescaling them, as necessary, so that each spans the same range. If this step is neglected, features with large magnitudes can disproportionately influence the distance calculations the algorithm performs. Once scaled, each feature contributes to the outcome on an equal footing.
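To see why this matters, here is a minimal sketch using two features for hypothetical matters: a binary cause-of-action flag and an unscaled dollar amount (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical matters: [Tort (0/1), Claim Amount in dollars]
case_a = np.array([1, 16_000_000])  # tort claim, $16M
case_b = np.array([0, 16_000_001])  # no tort claim, $16,000,001
case_c = np.array([1, 15_000_000])  # tort claim, $15M

# Without scaling, the dollar amounts swamp the legal feature:
# case_b appears far "closer" to case_a than case_c does, even though
# case_b differs on the cause of action.
print(np.linalg.norm(case_a - case_b))  # ~1.41
print(np.linalg.norm(case_a - case_c))  # 1000000.0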
In this instance, all of the numerical features are either "0" or "1", with the exception of Claim Amount. A sensible range for this dataset, therefore, is 0 to 1. Accordingly, we must scale the Claim Amount column so that its values fall within this range.
This can be done using a technique called "min-max scaling", which involves applying the following formula:
- y = (x - min) / (max - min)
That is, for each number in the Claim Amount column (x), we subtract from it the lowest Claim Amount (min), and we divide this result by the range between the highest Claim Amount (max) and the lowest Claim Amount (min). The final result is a value (y), which is a number scaled between 0 and 1.
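As a quick sanity check, here is the formula applied to three hypothetical claim amounts:

```python
# Min-max scaling on hypothetical claim amounts
amounts = [5_000_000, 16_000_000, 25_000_000]
lo, hi = min(amounts), max(amounts)
scaled = [(x - lo) / (hi - lo) for x in amounts]
print(scaled)  # [0.0, 0.55, 1.0]
```

The lowest amount maps to 0, the highest to 1, and everything else lands proportionally in between.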
The code to do this is detailed below. We simply apply the min-max formula to every value in the Claim Amount column. Pandas has handy methods to identify the lowest and highest values in a DataFrame column (min()
and max()
).
# Copy the dataset into a new DataFrame. We will refer to the preprocessed DataFrame later because it is easier to read
dataset = df.copy()
# Apply min-max scaling to Claim Amount; that is, y = (x - min) / (max - min)
column = 'Claim Amount'
dataset[column] = (dataset[column] - dataset[column].min()) / (dataset[column].max() - dataset[column].min())
# Print the first five rows of the dataset to verify the technique worked
dataset.head()
When we review the Claim Amount column, we see each value is now a number between 0 and 1. If we were to review all 10,000 matters in this dataset, we would find the same result.
Next, we must convert the categorical features into numbers. For this dataset, we'll use a technique called "one-hot encoding". With this approach, we create a new column for each distinct value under a given categorical column. Then, either a "1" or a "0" is assigned under each of these new columns, depending on whether the value applies to a given matter. If it applies, a "1" is used; if not, then a "0".
For instance, under Client Role, there are three possible values: Plaintiff, Defendant, or Third Party. When we apply one-hot encoding to this column, we replace the generic Client Role column with three new columns: Plaintiff, Defendant, and Third Party. If the client in a given matter is a plaintiff, then a "1" is assigned under the Plaintiff column, and the Defendant and Third Party columns each receive a "0" for that particular matter.
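Here is a minimal, self-contained sketch of this transformation on a hypothetical Client Role column, using pandas' built-in one-hot encoder. Note that pandas prefixes each new column with the original column name, and the dtype=int argument forces "0"/"1" output rather than True/False:

```python
import pandas as pd

# Hypothetical matters with only a Client Role column
toy = pd.DataFrame({'Client Role': ['Plaintiff', 'Defendant', 'Third Party', 'Plaintiff']})

# One-hot encode: one new 0/1 column per distinct value
encoded = pd.get_dummies(toy, columns=['Client Role'], dtype=int)
print(encoded)
```

The first row receives a "1" under Client Role_Plaintiff and a "0" under the other two columns.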
Conveniently, Pandas has a method called get_dummies()
, which can be used to apply one-hot encoding to a DataFrame containing categorical features. Below, we use this method on the columns containing categorical features in our dataset. We then print out the first matter in the dataset to verify the technique has worked.
# One-hot encode the categorical columns
dataset = pd.get_dummies(dataset, columns=['Lawyer', 'Matter Type', 'Jurisdiction', 'Industry', 'Client Role'])
# Print the first matter in the dataset to verify the encoding was successful
display(dataset.iloc[0])
Per the above, we can see each categorical feature has become a separate column, and whether it applies is indicated with a "1" or a "0".
Our dataset is now ready to be fed into the KNN algorithm.
Creating the KNN Model
Next, we create our KNN model and load the dataset into it. As described in my first post on KNN, we do not need to code the algorithm from scratch. Rather, we can use the NearestNeighbors()
class from Python's scikit-learn library. Easy, peasy!
# Initialize a NearestNeighbors() object and assign it to the variable 'model'
model = NearestNeighbors(n_neighbors=4)
# Run the dataset through the KNN algorithm
model.fit(dataset)
When we create a model of the data using KNN, the algorithm plots each matter from the dataset in a multi-dimensional space. The algorithm calculates the location of each matter based on its features. Since this dataset (once normalized) has 43 features, the algorithm plots the matters in a 43-dimensional space!
To find matter analogues, the algorithm calculates the distance between a specified matter and those nearest to it in this 43-dimensional space. The smaller the distance between the specified matter and another, the more similar the two matters are.
When we create our model per the code above, we configure the algorithm to return the four nearest neighbors to our specified matter. The first "nearest neighbor" is always the specified matter itself. We'll ignore this. Instead, we'll be interested in the three other matters - i.e. the three nearest neighbors.
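This behaviour is easy to verify on a toy example, using hypothetical two-dimensional points rather than the matter data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four hypothetical points; point 0 and point 1 are close together
points = np.array([[0.0, 0.0],
                   [0.1, 0.0],
                   [0.9, 0.9],
                   [0.5, 0.5]])

nn = NearestNeighbors(n_neighbors=2)
nn.fit(points)

# Query with a point that is already in the dataset
distances, indices = nn.kneighbors([points[0]])
print(indices)    # [[0 1]] - the first "neighbor" is the query point itself
print(distances)  # [[0.  0.1]] - at distance zero
```

The same pattern holds for the matter dataset: the first result of any query for a matter already in the dataset is that matter, at distance zero.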
Now we can use our model to find matter analogues in our dataset of 10,000 cases!
Querying the Dataset
Let's look at a case. The below matter profile is at row 257 of the dataset.
# Print row 257 of the DataFrame
display(df.iloc[257])
In the next line of code, we use the algorithm to query the dataset to identify the cases most similar to this one.
# Print the four nearest neighbors to row 257 in the DataFrame
print(model.kneighbors([dataset.iloc[257]]))
This code returns two lists. The first list contains the distance measurements between our specified matter and its four nearest neighbors. The second list provides the row index from the dataset for these nearest neighbors.
As noted above, the first nearest neighbor is the specified matter itself. We'll ignore it. Let's examine the second matter, located at row 2582 of our dataset. Notably, this matter has a distance measurement less than 1. We can anticipate matters 257 and 2582 will be very similar.
# Print row 2582 of the DataFrame
display(df.iloc[2582])
Indeed, when we compare these matters, we see they are virtually identical. They differ only in the amount of money the plaintiff is claiming.
The next nearest neighbor is matter 8283.
# Print row 8283 of the DataFrame
display(df.iloc[8283])
This matter differs from our target matter in only three respects: Jurisdiction, Claim Amount, and the presence of a summary judgment motion.
Let's look at the last of the nearest neighbors, matter 2373.
# Print row 2373 of the DataFrame
display(df.iloc[2373])
This matter also differs from our target matter in only three respects: Lawyer, Client Role, and Claim Amount.
Conclusion
Although there are 10,000 cases in this dataset, using the KNN algorithm we can find accurate matter analogues within seconds!
That said, we need the right dataset. KNN can only take data in a structured form, and structured data is often in short supply.
Final Thoughts
Have you experimented with using KNN on legal data? Please reach out!
Appendix
For ease of review, all of the code from this post is set out below.
# Import dependencies
import pandas as pd
from sklearn.neighbors import NearestNeighbors
# Load the dataset and remove one unnecessary column that was created when this DataFrame was converted into a CSV file
url = 'https://raw.githubusercontent.com/litkm/KNN-Project/main/cases_dataset.csv'
df = pd.read_csv(url)
df = df.drop('Unnamed: 0', axis=1)
# Print the first 5 rows from the DataFrame
df.head()
# Print data at row 1 of the DataFrame
display(df.iloc[1])
# Print summary of the DataFrame
df.info()
# Copy the dataset into a new DataFrame. We will refer to the preprocessed DataFrame later because it is easier to read
dataset = df.copy()
# Apply min-max scaling to Claim Amount; that is, y = (x - min) / (max - min)
column = 'Claim Amount'
dataset[column] = (dataset[column] - dataset[column].min()) / (dataset[column].max() - dataset[column].min())
# Print the first five rows of the dataset to verify the technique worked
dataset.head()
# One-hot encode the categorical columns
dataset = pd.get_dummies(dataset, columns=['Lawyer', 'Matter Type', 'Jurisdiction', 'Industry', 'Client Role'])
# Print the first matter in the dataset to verify the encoding was successful
display(dataset.iloc[0])
# Initialize a NearestNeighbors() object and assign it to the variable 'model'
model = NearestNeighbors(n_neighbors=4)
# Run the dataset through the KNN algorithm
model.fit(dataset)
# Print row 257 of the DataFrame
display(df.iloc[257])
# Print the four nearest neighbors to row 257 in the DataFrame
print(model.kneighbors([dataset.iloc[257]]))
# Print row 2582 of the DataFrame
display(df.iloc[2582])
# Print row 8283 of the DataFrame
display(df.iloc[8283])
# Print row 2373 of the DataFrame
display(df.iloc[2373])