Introduction

In Prof. Wolfgang Alschner's fantastic course, Data Science for Lawyers, Lesson 8 uses machine learning to predict Justice Brennan's voting record. One of the key questions is: if a machine learning model "studies" information about thousands of Justice Brennan's cases, how well can it predict the way he voted on other cases?

The lesson reviews several different machine learning algorithms.¹ This exercise inspired me to apply a different machine learning approach to the same dataset: deep learning. In this post, I detail the results of this effort.

Justice Brennan

Per Wikipedia, Justice Brennan (1906–1997) was an American lawyer and jurist who served as an Associate Justice of the Supreme Court of the United States (SCOTUS) from 1956 to 1990. He was the seventh-longest-serving justice in Supreme Court history, and was known for being a leader of the Court's liberal wing.

The length of Justice Brennan's tenure is key for present purposes. Since he sat on SCOTUS for so long, he has a lengthy voting record. This is important for "teaching" a machine learning model effectively (the more data, the better).

The Dataset

The dataset is available online at the course website. The data is from The Supreme Court Database. In this database, court decisions are coded for a variety of variables relating to the identification, chronology, background, substance, and outcome of each case.

The dataset is a simple CSV file. Click to view it in a "raw" format.

Below, the first five entries of the dataset are printed out.

Decoded, the first entry indicates:

The case was heard in 1956 (term)
The petitioner (appellant) was a "bank, savings and loan, credit union, investment company" (petitioner)
The respondent was an "agent, fiduciary, trustee, or executor" (respondent)
The court assumed jurisdiction on the basis of a writ of certiorari (jurisdiction)
The case originated from the Pennsylvania Western U.S. District Court (caseOrigin)
The U.S. Court of Appeals, Third Circuit, was the source of the decision SCOTUS reviewed (caseSource)
SCOTUS granted the writ of certiorari in order "to resolve important or significant question" (certReason)
The subject matter of the controversy related to "cruel and unusual punishment, death penalty (cf. extra legal jury influence, death penalty)" (issue)
The preceding variable was categorized as relating to federalism (issueArea)
Lastly, Justice Brennan voted with the majority (vote)

Below, additional information from the dataset is set out.

For present purposes, the most important information shown here is that the dataset contains 4,746 entries. In other words, there is information regarding 4,746 cases, including whether Justice Brennan voted with the majority or the minority of the SCOTUS panel in each one.

The Deep Learning Model

Machine learning is a subfield of computer science. The basic objective is to program computers to learn so that they can perform tasks for which they were not explicitly programmed.²

There are many approaches to machine learning, of which deep learning is only one. This approach is based on artificial neural networks, which are a kind of algorithm loosely modelled on neurons in the human brain.³

The deep learning model I used for this project is based on a model from a lesson in the Codecademy course, Build Deep Learning Models with TensorFlow. The model is written in the Python programming language and uses Keras, the high-level deep learning API that ships with Google's TensorFlow framework.

My aim was to assemble a model that takes the Brennan dataset as an input, and outputs accurate predictions regarding his voting.

Critically, the model is not pre-programmed with any particular patterns, rules, or guidelines specific to Justice Brennan and the way he voted on SCOTUS. Rather, the model applies the deep learning algorithm to process the dataset and develop, independently, its own "understanding" of his voting history. Based on this understanding, it makes predictions.

Forgive me for glossing over a lot of details. But, in simple terms, this is how the model in this project works:

It randomly apportions the dataset into two subsets: one for training, and another for testing (70% for training, 30% for testing).
Then it looks at each case in the training dataset, one-by-one.
With every case, it considers each of the variables (petitioner, respondent, etc.), and then predicts whether Justice Brennan voted with the majority or the minority in that particular case.
The model then checks the final column of the dataset: was the prediction correct or not?
It then recalibrates its weighting of each variable based on whether it made a correct or incorrect prediction. When a model works well, this recalibration results in incrementally better (more accurate) predictions.
Once the model has reviewed each of the cases in the training dataset, it then tests itself, case-by-case again, against the second (testing) dataset. This is an important way to check that the model is not merely memorizing the training dataset, as opposed to calibrating its predictive process so that it can generalize and make accurate predictions about new cases (the test dataset).
After the model cycles through both the training and the testing datasets, it repeats the process. Models will do this cycle many times (one hundred, in this instance, though it can be many more). Ideally, the predictive accuracy of the model increases with each cycle until it plateaus at the limit of its predictive potential.
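
The steps above can be sketched in plain Python. This is a simplified stand-in, not the Keras model used in this project: it swaps the neural network for a single-layer logistic model, and it runs on a small synthetic dataset rather than the Brennan CSV. The function names and every setting other than the 70/30 split and the one hundred cycles are illustrative assumptions.

```python
import math
import random


def split_dataset(rows, train_fraction=0.7, seed=0):
    """Randomly apportion rows into training and testing subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def predict_probability(weights, bias, features):
    """Estimated probability of a majority vote (label 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(weights, bias, rows):
    """Fraction of rows whose predicted label matches the recorded label."""
    correct = sum(
        (predict_probability(weights, bias, features) >= 0.5) == (label == 1)
        for features, label in rows
    )
    return correct / len(rows)


def train(train_rows, test_rows, epochs=100, learning_rate=0.1):
    """Train case by case, then test on held-out cases after every cycle."""
    n_features = len(train_rows[0][0])
    weights, bias = [0.0] * n_features, 0.0
    history = []
    for _ in range(epochs):
        for features, label in train_rows:
            # Predict, compare with the recorded label, and nudge the weights.
            error = predict_probability(weights, bias, features) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
        history.append(accuracy(weights, bias, test_rows))
    return weights, bias, history


# Synthetic stand-in data (NOT the Brennan dataset): two made-up features,
# with label 1 whenever their sum is positive.
rng = random.Random(42)
rows = []
for _ in range(500):
    features = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    rows.append((features, 1 if sum(features) > 0 else 0))

train_rows, test_rows = split_dataset(rows)
weights, bias, history = train(train_rows, test_rows, epochs=100)
print(f"test accuracy after first cycle: {history[0]:.2f}")
print(f"test accuracy after last cycle:  {history[-1]:.2f}")
```

The `history` list records the test accuracy after each cycle, which is the number one would watch for the plateau described above. The real dataset's categorical variables would also need to be converted to numeric inputs first, a step this sketch leaves out.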

The Results

So how did the model do?

Not too badly. It learned to predict whether Justice Brennan voted with the majority or the minority of the SCOTUS panel with slightly less than 80% accuracy.

If you are interested in the code and the output it produces, click here.

The Real World

Of course, in the "real world" we do not have the benefit of a judge's entire voting record. Rather, we would have a record of previous votes, from which we would want to predict future votes.

This is the approach taken in Prof. Alschner's lesson. The dataset is not randomly split. Rather, the machine learning algorithm is trained on the voting record from 1956 until 1979, and then tested against the record from the 1980s. That is, the pre-1980 voting record is used to predict the votes from the 1980s.
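
A chronological split of this kind can be sketched as follows. The `term` and `vote` column names come from the dataset described earlier; the function name and the handful of sample rows are made up for illustration and are not taken from the lesson's code or from the dataset.

```python
def split_by_term(rows, first_test_term=1980):
    """Train on cases before first_test_term, test on cases from that term on."""
    train_rows = [row for row in rows if row["term"] < first_test_term]
    test_rows = [row for row in rows if row["term"] >= first_test_term]
    return train_rows, test_rows


# Made-up sample rows, for illustration only (not taken from the dataset).
sample_rows = [
    {"term": 1956, "vote": 1},
    {"term": 1968, "vote": 1},
    {"term": 1979, "vote": 0},
    {"term": 1980, "vote": 1},
    {"term": 1989, "vote": 0},
]

train_rows, test_rows = split_by_term(sample_rows)
print(len(train_rows), "training cases;", len(test_rows), "testing cases")
```

Unlike a random split, no shuffling happens here: every test case comes from a later term than every training case, so the test measures how well the past predicts the future.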

How did my deep learning model do when the dataset was likewise apportioned? Performance dropped. At best, the model achieved about 69% accuracy.

I tweeted about this and Prof. Alschner kindly commented, noting that the pre-1980 variables seem to miss something important for predicting the post-1980 voting.

This made me do some digging that I should have done at the outset (in my defence, I am a hobbyist who did not know any better at the time!).

Below is a graph showing Justice Brennan's votes over the entire dataset (on the x-axis, 0 means a minority vote and 1 means a majority vote):

As you can see, Justice Brennan voted with the majority slightly less than 80% of the time over the course of his entire SCOTUS tenure.

When we look at his record from the 1950s until the end of the 1970s, we see that he voted with the majority slightly more than 80% of the time:

Finally, let's look at his voting record for the 1980s only:

We now see a big drop! He voted with the majority less than 70% of the time. This may explain why the model's prediction accuracy dropped when the dataset was split to train on the pre-1980s data and test on the 1980s data.
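
One way to put these percentages in context is a majority-class baseline: a "model" that always predicts a majority vote is correct exactly as often as Justice Brennan actually voted with the majority. Below is a minimal sketch, using made-up labels rather than the real voting record; `baseline_accuracy` is a hypothetical helper name.

```python
def baseline_accuracy(votes, always_predict=1):
    """Accuracy of always predicting the same outcome (1 = majority vote)."""
    return sum(1 for vote in votes if vote == always_predict) / len(votes)


# Made-up labels for illustration: 7 majority votes out of 10.
illustrative_votes = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]
print(baseline_accuracy(illustrative_votes))  # 0.7
```

Applied to the 1980s record, where he voted with the majority less than 70% of the time, such a baseline would score just under 70%. That is close to the roughly 69% the model achieved on the chronological split, which suggests the model may have learned little beyond the base rate in that setting.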

Concluding Thoughts

Some final thoughts:

Sometimes a simple approach will yield valuable insight. Just graphing Justice Brennan's voting history provides enough information to make some reasonably accurate predictions. No fancy machine learning algorithms required.
That said, I still consider this experiment a success. The model appears to work, for one thing. At the outset, it "knows" nothing about Justice Brennan's voting record; it is simply programmed to process the dataset in a certain way. On completion, the model provides reasonably accurate predictions - better than a coin toss, in any event!
What about using these techniques in legal practice? Litigation analytics is already "a thing", especially in the US. In Canada, unfortunately, we suffer from a deficit of publicly available litigation data, so progress is much slower. For example, we do not have an equivalent to the Supreme Court Database, discussed above, for data about the Supreme Court of Canada.

Thanks for reading! Did I make a mistake? Does something not make sense? Please reach out.

1. These algorithms are naive Bayes, support vector machines, and k-nearest neighbors.

2. Andrew Trask, Grokking Deep Learning (Shelter Island, NY: Manning Publications, 2019), p. 11.

3. Ibid., p. 10.