Finding Analogous Cases with K-Nearest Neighbor
How to use scikit-learn's NearestNeighbors() to find matter analogues in tabular legal data
Introduction
Partner: What other cases like this has the firm handled in the past?
Associate: Uhm... (starts thinking of search terms, cancels evening plans)
In an earlier post, I provided a brief introduction to the NearestNeighbors()
class from Python's scikit-learn library. In this post, I demonstrate one way this algorithm can be applied to legal data in tabular form. The dataset I use comprises matter profiles of 10,000 (fake) cases. I show how NearestNeighbors()
can be used to query the dataset to find cases that are similar to each other.
For a brief primer on KNN and further resources, please see my first post on this topic.
Applying NearestNeighbors to Legal Data
Importing Dependencies and Loading the Dataset
We begin with importing our dependencies and loading the dataset. For this demo, we only need the pandas library and the NearestNeighbors()
class from scikit-learn.
The dataset is a CSV file hosted on the web (in my GitHub repository).
# Import dependencies
import pandas as pd
from sklearn.neighbors import NearestNeighbors
# Load the dataset and remove one unnecessary column that was created when this DataFrame was converted into a CSV file
url = 'https://raw.githubusercontent.com/litkm/KNN-Project/main/cases_dataset.csv'
df = pd.read_csv(url)
df = df.drop('Unnamed: 0', axis=1)
# Print the first 5 rows from the DataFrame
df.head()
This is a dataset of matter profiles:
- Each row represents one matter.
- Each column indicates a feature relating to these matters.
- There are ten different lawyers, eight matter types, five jurisdictions, eight industries, and three client roles.
- The columns labelled Tort, Contract, Restitution, and Statute relate to causes of action. A "1" indicates the type of cause of action applies to a given matter, whereas a "0" indicates it does not.
- The columns labelled Injunction, Jurisdiction Motion, Motion to Strike, and Summary Judgment relate to motions. A "1" indicates this type of motion occurred in a given matter, whereas a "0" indicates it did not.
- The final column, Claim Amount, specifies the dollar amount of damages the plaintiff claimed.
By way of example, let's take a closer look at the second matter in the dataset.
# Print data at row 1 of the DataFrame
display(df.iloc[1])
This entry indicates:
- Lawyer Beta is the lead lawyer.
- This is a competition class action filed in B.C. relating to the retail industry.
- The client is the plaintiff.
- The claim involves causes of action in tort, contract, restitution, and statute.
- There was no injunction.
- The matter involved motions relating to jurisdiction and striking a pleading, but not summary judgment.
- The plaintiff claims ~$16 million in damages.
Per the below, this dataset contains 10,000 entries, i.e. matters.
# Print summary of the DataFrame
df.info()
You can view the dataset in "raw" form at this link.
The content of this dataset is entirely fake. I generated it randomly. If you would like to create your own datasets for experimenting with KNN, you can find the code I used in my GitHub repository for this project.
Normalizing the Dataset
As is, the dataset is not in a form the KNN algorithm can process. We must make certain changes to it. First, we must scale the numerical features so they use a standard range. Second, we must convert the categorical features, i.e. the non-numerical features (e.g. the various lawyer names under the Lawyer column), into numbers. These two processes are forms of normalizing a dataset.
Normalizing the numerical features means rescaling them, as necessary, so that each spans the same range. If this step is neglected, features with large magnitudes can disproportionately influence the distance calculations the algorithm performs. Once scaled, each feature contributes to the outcome on an equal footing.
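To see why this matters, here is a minimal sketch using two features for hypothetical matters: a binary cause-of-action flag and an unscaled dollar amount (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical matters: [Tort (0/1), Claim Amount in dollars]
case_a = np.array([1, 16_000_000])  # tort claim, $16M
case_b = np.array([0, 16_000_001])  # no tort claim, $16,000,001
case_c = np.array([1, 15_000_000])  # tort claim, $15M

# Without scaling, the dollar amounts swamp the legal feature:
# case_b appears far "closer" to case_a than case_c does, even though
# case_b differs on the cause of action.
print(np.linalg.norm(case_a - case_b))  # ~1.41
print(np.linalg.norm(case_a - case_c))  # 1000000.0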
In this instance, all of the numerical features are either "0" or "1", with the exception of Claim Amount. A sensible range for this dataset, therefore, is 0 to 1. Accordingly, we must scale the Claim Amount column so that its values fall within this range.
This can be done using a technique called "min-max scaling", which involves applying the following formula:
- y = (x - min) / (max - min)
That is, for each number in the Claim Amount column (x), we subtract from it the lowest Claim Amount (min), and we divide this result by the range between the highest Claim Amount (max) and the lowest Claim Amount (min). The final result is a value (y), which is a number scaled between 0 and 1.
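As a quick sanity check, here is the formula applied to three hypothetical claim amounts:

```python
# Min-max scaling on hypothetical claim amounts
amounts = [5_000_000, 16_000_000, 25_000_000]
lo, hi = min(amounts), max(amounts)
scaled = [(x - lo) / (hi - lo) for x in amounts]
print(scaled)  # [0.0, 0.55, 1.0]
```

The lowest amount maps to 0, the highest to 1, and everything else lands proportionally in between.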
The code to do this is detailed below. We simply apply the min-max formula to every value in the Claim Amount column. Pandas has handy methods to identify the lowest and highest values in a DataFrame column (min()
and max()
).
# Copy the dataset into a new DataFrame. We will refer to the preprocessed DataFrame later because it is easier to read
dataset = df.copy()
# Apply min-max scaling to Claim Amount; that is, y = (x - min) / (max - min)
column = 'Claim Amount'
dataset[column] = (dataset[column] - dataset[column].min()) / (dataset[column].max() - dataset[column].min())
# Print the first five rows of the dataset to verify the technique worked
dataset.head()
When we review the Claim Amount column, we see each value is now a number between 0 and 1. If we were to review all 10,000 matters in this dataset, we would find the same result.
Next, we must convert the categorical features into numbers. For this dataset, we'll use a technique called "one-hot encoding". With this approach, we create a new column for each distinct value under a given categorical column. Then, either a "1" or a "0" is assigned under each of these new columns, depending on whether the value applies to a given matter. If it applies, a "1" is used; if not, then a "0".
For instance, under Client Role, there are three possible values: Plaintiff, Defendant, or Third Party. When we apply one-hot encoding to this column, we replace the generic Client Role column with three new columns: Plaintiff, Defendant, and Third Party. If the client in a given matter is a plaintiff, then a "1" is assigned under the Plaintiff column, and the Defendant and Third Party columns each receive a "0" for that particular matter.
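Here is a minimal, self-contained sketch of this transformation on a hypothetical Client Role column, using pandas' built-in one-hot encoder. Note that pandas prefixes each new column with the original column name, and the dtype=int argument forces "0"/"1" output rather than True/False:

```python
import pandas as pd

# Hypothetical matters with only a Client Role column
toy = pd.DataFrame({'Client Role': ['Plaintiff', 'Defendant', 'Third Party', 'Plaintiff']})

# One-hot encode: one new 0/1 column per distinct value
encoded = pd.get_dummies(toy, columns=['Client Role'], dtype=int)
print(encoded)
```

The first row receives a "1" under Client Role_Plaintiff and a "0" under the other two columns.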
Conveniently, Pandas has a method called get_dummies()
, which can be used to apply one-hot encoding to a DataFrame containing categorical features. Below, we use this method on the columns containing categorical features in our dataset. We then print out the first matter in the dataset to verify the technique has worked.
# One-hot encode the categorical columns
dataset = pd.get_dummies(dataset, columns=['Lawyer', 'Matter Type', 'Jurisdiction', 'Industry', 'Client Role'])
# Print the first matter in the dataset to verify the encoding was successful
display(dataset.iloc[0])
Per the above, we can see each categorical feature has become a separate column, and whether it applies is indicated with a "1" or a "0".
Our dataset is now ready to be fed into the KNN algorithm.
Creating the KNN Model
Next, we create our KNN model and load the dataset into it. As described in my first post on KNN, we do not need to code the algorithm from scratch. Rather, we can use the NearestNeighbors()
class from Python's scikit-learn library. Easy, peasy!
# Initialize a NearestNeighbors() object and assign it to the variable 'model'
model = NearestNeighbors(n_neighbors=4)
# Run the dataset through the KNN algorithm
model.fit(dataset)
When we create a model of the data using KNN, the algorithm plots each matter from the dataset in a multi-dimensional space. The algorithm calculates the location of each matter based on its features. Since this dataset (once normalized) has 43 features, the algorithm plots the matters in a 43-dimensional space!
To find matter analogues, the algorithm calculates the distance between a specified matter and those nearest to it in this 43-dimensional space. The smaller the distance between the specified matter and another, the more similar the two matters are.
When we create our model per the code above, we configure the algorithm to return the four nearest neighbors to our specified matter. The first "nearest neighbor" is always the specified matter itself. We'll ignore this. Instead, we'll be interested in the three other matters - i.e. the three nearest neighbors.
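This behaviour is easy to verify on a toy example, using hypothetical two-dimensional points rather than the matter data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four hypothetical points; point 0 and point 1 are close together
points = np.array([[0.0, 0.0],
                   [0.1, 0.0],
                   [0.9, 0.9],
                   [0.5, 0.5]])

nn = NearestNeighbors(n_neighbors=2)
nn.fit(points)

# Query with a point that is already in the dataset
distances, indices = nn.kneighbors([points[0]])
print(indices)    # [[0 1]] - the first "neighbor" is the query point itself
print(distances)  # [[0.  0.1]] - at distance zero
```

The same pattern holds for the matter dataset: the first result of any query for a matter already in the dataset is that matter, at distance zero.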
Now we can use our model to find matter analogues in our dataset of 10,000 cases!
Querying the Dataset
Let's look at a case. The below matter profile is at row 257 of the dataset.
# Print row 257 of the DataFrame
display(df.iloc[257])
In the next line of code, we use the algorithm to query the dataset to identify the cases most similar to this one.
# Print the four nearest neighbors to row 257 in the DataFrame
print(model.kneighbors([dataset.iloc[257]]))
This code returns two lists. The first list contains the distance measurements between our specified matter and its four nearest neighbors. The second list provides the row index from the dataset for these nearest neighbors.
As noted above, the first nearest neighbor is the specified matter itself. We'll ignore it. Let's examine the second matter, located at row 2582 of our dataset. Notably, this matter has a distance measurement less than 1. We can anticipate matters 257 and 2582 will be very similar.
# Print row 2582 of the DataFrame
display(df.iloc[2582])
Indeed, when we compare these matters, we see they are virtually identical. They differ only in the amount of money the plaintiff is claiming.
The next nearest neighbor is matter 8283.
# Print row 8283 of the DataFrame
display(df.iloc[8283])
This matter differs from our target matter in only three respects: Jurisdiction, Claim Amount, and the presence of a summary judgment motion.
Let's look at the last of the nearest neighbors, matter 2373.
# Print row 2373 of the DataFrame
display(df.iloc[2373])
This matter also differs from our target matter in only three respects: Lawyer, Client Role, and Claim Amount.
Conclusion
Although there are 10,000 cases in this dataset, using the KNN algorithm we can find accurate matter analogues within seconds!
That said, we need the right dataset. KNN can only take data in a structured form, and structured data is often in short supply.
Final Thoughts
Have you experimented with using KNN on legal data? Please reach out!
Appendix
For ease of review, all of the code from this post is set out below.
# Import dependencies
import pandas as pd
from sklearn.neighbors import NearestNeighbors
# Load the dataset and remove one unnecessary column that was created when this DataFrame was converted into a CSV file
url = 'https://raw.githubusercontent.com/litkm/KNN-Project/main/cases_dataset.csv'
df = pd.read_csv(url)
df = df.drop('Unnamed: 0', axis=1)
# Print the first 5 rows from the DataFrame
df.head()
# Print data at row 1 of the DataFrame
display(df.iloc[1])
# Print summary of the DataFrame
df.info()
# Copy the dataset into a new DataFrame. We will refer to the preprocessed DataFrame later because it is easier to read
dataset = df.copy()
# Apply min-max scaling to Claim Amount; that is, y = (x - min) / (max - min)
column = 'Claim Amount'
dataset[column] = (dataset[column] - dataset[column].min()) / (dataset[column].max() - dataset[column].min())
# Print the first five rows of the dataset to verify the technique worked
dataset.head()
# One-hot encode the categorical columns
dataset = pd.get_dummies(dataset, columns=['Lawyer', 'Matter Type', 'Jurisdiction', 'Industry', 'Client Role'])
# Print the first matter in the dataset to verify the encoding was successful
display(dataset.iloc[0])
# Initialize a NearestNeighbors() object and assign it to the variable 'model'
model = NearestNeighbors(n_neighbors=4)
# Run the dataset through the KNN algorithm
model.fit(dataset)
# Print row 257 of the DataFrame
display(df.iloc[257])
# Print the four nearest neighbors to row 257 in the DataFrame
print(model.kneighbors([dataset.iloc[257]]))
# Print row 2582 of the DataFrame
display(df.iloc[2582])
# Print row 8283 of the DataFrame
display(df.iloc[8283])
# Print row 2373 of the DataFrame
display(df.iloc[2373])