
I. Introduction

Legal tech is moving fast—like, blink-and-you’ll-miss-it fast—and Large Language Models (LLMs) are at the heart of it all. In just the past few weeks, some of the biggest players in AI—OpenAI, Anthropic, and Google—have dropped new models, each claiming to push the envelope on what’s possible with AI.

But which one actually delivers when it comes to tasks that legal professionals care about, like contract summarization, review, or extracting key details from documents?

At SpotDraft, our approach to AI is model-agnostic. We focus on selecting the best model for each specific job, rather than being tied to a single vendor. This flexibility ensures that we leverage the strongest tool for every unique legal task. In this article, we’ll walk you through how these models perform on three critical legal tasks—contract review, summarization, and party information extraction. By benchmarking their performance, we’ll explain why we chose these models for specific jobs and how they stack up against each other.

II. Benchmark Overview

We focused on three critical legal tasks where LLMs can provide significant value: contract review, contract summarization, and contract party information extraction.

A. Contract review

Legal contract review is a time-consuming but critical process for any organization. It’s all about identifying risks and ensuring compliance by comparing contracts against organizational playbooks and individual guidelines. LLMs can help speed up reviews while ensuring no detail is missed. 

In this task, we tested how effectively LLMs can flag potential risks and inconsistencies, providing an automated layer to streamline legal reviews. Two main components of a review are:

  1. Comparing risks in a contract against playbooks: The ability of LLMs to evaluate entire contracts against pre-set playbooks, mimicking how a lawyer would assess legal risks, is crucial.
  2. Generating playbooks from a Standard Contract: LLMs were tasked with generating playbooks based on standard/template contracts, extracting key concepts to form a standard against which other contracts can be evaluated.

Dataset for legal/contract review

We developed a "golden dataset" to benchmark LLM performance across legal/contract review scenarios. This dataset consisted of six contract types: NDAs, DPAs, MSAs, Consulting Agreements, SaaS Agreements, and Employment Agreements. In total, there were 45 unique contracts spread across these six types, with each contract tested against a set of guidelines*.

*Legal functions generally maintain a contract review playbook that all lawyers follow when reviewing and negotiating third-party contracts. These playbooks or guides lay out the rules and guidelines that are acceptable to the organization.

Golden dataset:

To properly evaluate how well the LLMs performed on contract review, we used a carefully curated “golden dataset.” This dataset ensures consistency and provides a solid foundation for benchmarking model performance.

  • Contract types: 6 contract types (NDAs, DPAs, MSAs, Consulting Agreements, SaaS Agreements, and Employment Agreements)
  • Total unique contracts: 45 contracts
  • Guidelines: 60 unique guidelines (an average of 10 guidelines per contract type)
  • Total samples: 300 samples; each sample is a combination of (contract, guideline, expected result), e.g., (SpotDraft NDA for SaaS, “The term should be less than 5 years”, Met/Not Met). One such sample is sketched in code below.
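For readers who think in code, here is a minimal sketch of how one such sample could be represented; the class and field names are hypothetical and not part of our actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class ReviewSample:
    """One golden-dataset entry for contract review."""
    contract_text: str   # full text of the contract under review
    contract_type: str   # e.g. "NDA", "DPA", "MSA"
    guideline: str       # e.g. "The term should be less than 5 years"
    expected: str        # ground-truth verdict: "Met" or "Not Met"

# Illustrative sample (values are made up for the example)
sample = ReviewSample(
    contract_text="... full text of a SpotDraft SaaS NDA ...",
    contract_type="NDA",
    guideline="The term should be less than 5 years",
    expected="Met",
)
```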

B. Contract summarization

Contract summarization gives legal teams quick access to key contract metadata without requiring a full read-through. Contract metadata refers to essential information embedded within a contract, such as dates, parties, obligations, payment terms, and renewal conditions. Successful contract summarization depends on two key factors:

  1. Extraction of key metadata: We assessed how well different LLMs could summarize entire documents and extract the relevant data.
  2. Description generation: The role of the LLM here is to generate detailed, contextually accurate descriptions for the contract metadata we are searching for.

Dataset for contract summarization

As with the contract review task, we developed a golden dataset for contract summarization. This one was curated to be representative of different kinds, lengths, and qualities of contracts. In total, 300 fields were to be populated across 45 different contracts.

Golden dataset:

  • Contract types: 4 contract types (NDAs, MSAs, SaaS Agreements, and Employment Agreements)
  • Total unique contracts: 45 contracts
  • Metadata fields: 40 unique fields (an average of 10 metadata fields per contract type)
  • Total samples: 300 samples; each sample is a combination of (contract, metadata field, expected value of the field), e.g., (SpotDraft NDA, “Expiry date”, 31st December 2024)

Here, the metadata field is a combination of the field label and a brief description outlining the meaning of the field in the context of the contract, as sketched below.
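As a rough illustration, a summarization sample and its metadata field could be modeled like this (types and names are hypothetical, not our production schema):

```python
from dataclasses import dataclass

@dataclass
class MetadataField:
    """A field label plus a brief description of what it means in the contract."""
    label: str        # e.g. "Expiry date"
    description: str  # e.g. "The date on which the agreement ceases to be in force"

@dataclass
class SummarizationSample:
    """One golden-dataset entry for contract summarization."""
    contract_text: str
    field: MetadataField
    expected_value: str  # e.g. "31st December 2024"
```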

C. Party information extraction

Parties to an agreement are a fundamental property of the contract. For this tightly scoped task, we evaluated the AI’s ability to identify these parties, their aliases, and the roles they play in the contract.

Dataset for party information extraction

Since parties are a fundamental part of every contract, we curated a dataset using a wide variety of contracts. In total, this dataset included over 500 contracts across 20 different contract types. Unlike other tasks, this one is not open-ended; the golden dataset is much simpler, consisting only of the contract and the expected list of parties.

Golden dataset:

  • Contract types: 20 contract types (Sales agreements, Affiliate agreements, Distributor Agreements, etc.)
  • Total unique contracts: 510 contracts
  • Total samples: 510 samples; each sample is a combination of (contract, expected list of parties), e.g., (SpotDraft NDA, [SpotDraft, Acme])

With the tasks clearly defined, we can now evaluate how the top LLMs—OpenAI, Anthropic, and Google—perform on these benchmarks. Each model has its strengths, but how do they stack up when it comes to handling the complexities of legal language?

III. Models under evaluation

A. OpenAI Models

  • GPT-4
  • GPT-4 Turbo
  • GPT-4o
  • o1-mini
  • o1-preview
  • GPT-4o mini

B. Anthropic Models

  • Claude 3.5 Sonnet
  • Claude 3 Sonnet

C. Google Models

  • Gemini 1.5 Flash
  • Gemini 1.5 Pro

Each model was tested across the legal tasks above, with outputs scored as correct or incorrect against the golden datasets. The models demonstrated varying abilities to handle complex legal language and contract analysis.

Now that we’ve introduced the models, let’s look at how they were tested. Our evaluation process involved multiple benchmarks to see how well each LLM handled the tasks discussed in Section II.

IV. The evaluation process

Our evaluation process involved testing each LLM across different scenarios in contract review, contract summarization, and party information extraction, using the datasets discussed in Section II. Each model’s output was compared against a ground-truth set of annotations.

After selecting the model best suited to each function, we refine its outputs with post-processing, optimizing them for legal work. This ensures that our AI is both accurate and reliable.

Contract review

  1. Comparing risks in a contract against playbooks

This task can be likened to instructing a junior lawyer to identify and flag potential risks in a contract based on an organization's playbook, much like a lawyer would use reference material to ensure a contract's compliance.

Each sample in the golden dataset is a combination of three elements: the contract, the guideline, and the expected result. The task of the LLM is to check if the contract complies with a specific guideline (e.g., "The term should be less than 5 years"). Based on the contract and the given guideline, the LLM evaluates whether the contract meets the conditions, producing an answer of "Met" or "Not Met." By comparing the LLM’s output to the expected result, we can measure how accurately the model identifies the correct outcome.
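Conceptually, the scoring loop is simple. The sketch below shows the idea, with `ask_llm` standing in for whatever client call produces the model's verdict; it is an illustrative harness, not our actual evaluation code.

```python
def evaluate_review(samples, ask_llm) -> float:
    """Accuracy over the golden dataset: the fraction of (contract, guideline)
    pairs where the model's verdict matches the expected "Met" / "Not Met" label.

    `ask_llm(contract_text, guideline)` is a placeholder for the actual model call.
    """
    correct = sum(
        1
        for s in samples
        if ask_llm(s.contract_text, s.guideline).strip().lower()
        == s.expected.strip().lower()
    )
    return correct / len(samples)
```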

Let’s dive into the results. In the contract review task, accuracy and recall are key factors. We want models that can identify potential risks with minimal error. Here's how the models performed:

LLM choice: For this task, GPT-4 was selected due to its exceptional reasoning ability, high recall, and accuracy in interpreting legal language and guidelines. While there have been claims that models like GPT-4 Turbo and GPT-4o outperform GPT-4, our evaluations show that GPT-4 still delivers the highest recall, making it the most reliable choice for identifying risks in contracts.

In sensitive tasks like contract review, our priority is ensuring that no potential risks are overlooked. We are willing to accept a few misclassifications, but the cost of missing a risk far outweighs the cost of miscategorizing a non-risk. Rather than aiming to replace lawyers, we focus on assisting them by surfacing all possible risks, ensuring they have the most comprehensive insight into a contract.

  2. Generating playbooks from a golden/ideal contract

In this task, LLMs were evaluated on their ability to reverse-engineer an ideal or standard contract into a comprehensive playbook. The goal was to extract key conditions and clauses that could serve as a benchmark for evaluating future contracts. This process required the LLMs to identify and codify the essential elements that make a contract ideal, forming the foundation of an AI-driven playbook.
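Conceptually, this reduces to prompting a model with the standard contract and asking for specific, testable guidelines. Here is a simplified sketch; the prompt wording and the `complete` helper are illustrative, not the exact ones we use.

```python
PLAYBOOK_PROMPT = """You are reviewing a standard/template contract.
Derive a review playbook from it: a numbered list of specific, testable guidelines
(term lengths, survival periods, scope of confidential information, and so on)
that other contracts of the same type should be checked against.
Avoid generic rules that would apply to almost any contract.

Standard contract:
{contract_text}
"""

def generate_playbook(contract_text: str, complete) -> str:
    """`complete(prompt)` is a placeholder for a call to the chosen model (e.g. o1-mini)."""
    return complete(PLAYBOOK_PROMPT.format(contract_text=contract_text))
```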

LLM choice: o1-mini was selected for guide generation tasks, particularly those that demand high-level reasoning and deeper insights.

Example: Consider the following two examples where we attempted to generate a playbook using GPT-4 and o1-mini. The differences in the output highlight the varying levels of specificity and value that each model provides.

Examples of rules generated by GPT-4:

The agreement should be a Mutual Non-Disclosure Agreement.
The agreement should be made between multiple parties.
The agreement should be effective from a specified date.
The agreement should define confidential information.


These guidelines are quite generic, and lawyers typically wouldn’t include such rules in their playbooks as they provide little value.

Examples of guidelines generated by o1-mini:

The non-disclosure obligations should survive for three (3) years from the date of disclosure.
Confidential Information should include technology, processes, products, specifications, inventions or designs, trade secrets, and commercially sensitive information.
Trade secret protection obligations should remain in effect as long as the information qualifies as a trade secret.
The agreement should explicitly state that no license is granted under any intellectual property rights.


In contrast, these guidelines are clear and specific, and capture the nuances of the standard contract they were derived from. They would be useful additions to a playbook and serve as effective criteria for reviewing contracts.

o1-mini excelled in scenarios that required deeper reasoning and more nuanced understanding. It demonstrated an impressive ability to generate comprehensive guidelines from contracts, showcasing its strength in handling tasks that require a more thoughtful approach.

To further refine this process, we provided the LLMs with a set of best practices for writing guidelines. o1-mini showed a significant edge in comprehending and applying these practices, consistently generating guidelines that were more detailed and thorough compared to other models.

Contract summarization

1. Extraction of key metadata

When comparing different LLMs for this task, we looked at two routes. The first was to pass smaller segments of the contract into an LLM with a limited context window, so the model considers only a fixed number of surrounding words when generating a response. The second was to pass full contracts into an LLM, allowing it to consider the entire document for better overall context.
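To make the first route concrete, here is a rough sketch of the kind of overlapping segmentation it implies; the window and overlap sizes are illustrative defaults, not our production settings.

```python
def split_into_segments(text: str, max_words: int = 800, overlap: int = 100) -> list[str]:
    """Split a contract into overlapping word windows so that a model with a
    smaller context window can process it piece by piece."""
    words = text.split()
    step = max_words - overlap
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return segments
```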

GPT-4 proved capable of handling smaller contract segments with speed and accuracy. However, when dealing with long contracts, segmenting can hurt the accuracy of data extraction, as important information may span beyond the retrieved segments of the contract.

LLM choice: Gemini 1.5 Flash, for its high speed, accuracy, and ability to process large contract contexts, making it the most cost-effective and suitable model for this task. While GPT-4 excelled in smaller, focused contexts, Gemini 1.5 Flash demonstrated a superior ability to process entire contracts without breaking them into parts. This allows for more contextual and accurate metadata extraction across large documents while maintaining speed and cost-effectiveness.

Notable observation:

With very large contexts, more powerful models like Gemini 1.5 Pro and GPT-4o seem to over-index on certain keywords from either the contract text or the field descriptions, and end up with either the wrong answer or no answer due to a lack of confidence.

For example, when asked to find the Employment Position in an employment contract, the reasoning that Gemini 1.5 Pro gave was as follows:

The Employment Agreement states that the Executive shall serve the Bank as Chief Executive Officer, and as Chairman of the Board of Directors.

Because it found two positions rather than one, it set the answer to null.

Gemini 1.5 Pro also tends to give the right explanation for its answers but not the answer itself. This explains the unusually low accuracy for such a powerful model in this long-context use case.

Example:

Consider this case where we evaluated GPT-4 and Gemini 1.5 Flash to extract the expiration date defined within a contract. 

The expiration date in this contract was defined using the effective date and the term of the contract, where the effective date is the last signed date and the term is in the terms and termination section in the middle of the contract. 

With a model like GPT-4, which has a limited context window, only parts of this information were captured. For example, it missed the connection between the effective date (located near the end of the contract) and the term (buried in the middle), leading to incomplete or inaccurate metadata extraction.

In contrast, Gemini 1.5 Flash can process the entire contract, capturing all relevant details and ensuring higher accuracy. Here’s an example of the reasoning Gemini 1.5 Flash used to extract the expiration date correctly:

"The agreement is valid for 1 year from the effective date, unless terminated earlier. The effective date is the same as the execution date. The execution date is the date the last signing party signs the agreement. The last signing party is Susan Patel, who signed on 2024.05.20. Therefore, the expiration date is one year after 2024-05-20, which is 2025-05-20.”

2. Description Generation

Description generation focuses on creating accurate descriptions for the contract metadata fields we want to extract, which helps define exactly what information should be pulled from a contract.

The role of the LLM here is to generate detailed, contextually accurate descriptions based on the contract’s contents. These descriptions serve as prompts, enabling other LLMs to accurately extract the relevant metadata. Description creation is possibly the simplest use case that any LLM could perform with confidence, so the deciding factor was speed. We therefore used GPT-4 Turbo, which takes an average of 5 seconds to create a description, either on its own or with user-supplied context.

LLM choice: GPT-4 Turbo.

Examples of descriptions generated by GPT-4 Turbo:

1. The Pilot Fee is the fee charged for the initial trial or testing period of a Software as a Service (SaaS) product, as specified in the SaaS Order Form. It is typically presented in the currency format and reflects the cost associated with the pilot phase of the service.
2. The subscriber contact name field captures the name of the primary contact for the subscriber as specified in the SaaS Order Form. The name should be formatted as "Last name, first name" to ensure clarity and consistency.
3. This is the date on which the SaaS Order Form will expire. It is determined by taking the effective date, adding the term length in years, and then subtracting one day. If the effective date is not specified, the expiration date defaults to one day less than the last signed date of the contract.


Party Information Extraction

Extracting Relevant Party Information

Extracting accurate party information means parsing a contract and returning exactly the parties that are present in it. This is a highly focused problem where the scope for deriving or inferring information is limited, unlike the other tasks. Here, the ideal outcome is high recall and high precision: all of the parties to the contract are identified (without missing any), and only entities that are actual parties are identified (without miscategorizing a non-party as one). We use precision and recall scores, along with their harmonic mean, the F1 score.
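As a quick refresher on how those scores relate, here is a minimal sketch of the per-contract calculation; the party names are illustrative.

```python
def party_scores(predicted: set[str], expected: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one contract's extracted party list."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: the model returns one extra entity that is not actually a party.
print(party_scores({"SpotDraft", "Acme", "Acme Bank"}, {"SpotDraft", "Acme"}))
# -> (0.666..., 1.0, 0.8)
```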

All of the models were pretty close here. GPT models boasted the highest performance, while Gemini was the quickest. GPT-4o mini performed with the highest recall and precision scores while being relatively quick and cheap.

LLM choice: GPT-4o mini

Finding the best fit 

After putting the latest LLMs from OpenAI, Anthropic, and Google through their paces on key legal tasks, it’s clear that no single model reigns supreme across the board. Each has its strengths, depending on the specific task you’re aiming to automate or streamline in the legal field.

  • For contract review, where the stakes are high and thoroughness is key, OpenAI’s GPT-4 emerged as the leader, excelling in risk identification and guideline comparison. It’s your go-to when you need precision and depth.
  • In contract summarization, especially when dealing with large, complex documents, Gemini 1.5 Flash impressed with its ability to process entire contracts and extract metadata efficiently. If speed and cost-effectiveness matter, this is the model to consider.
  • For party information extraction, GPT-4o mini outperformed others with its balance of accuracy and speed, making it the top choice when identifying the parties involved in a contract.

Each model shines in its own right, depending on the specific task at hand. Our model-agnostic approach ensures that we select the right tool for the job, prioritizing precision, speed, and cost-effectiveness based on the unique needs of legal teams. As AI technology continues to evolve, staying flexible in choosing models will be crucial in maintaining a competitive edge in legal tech.
