Extracting Text from Our Collection of PACER Documents

Michael Lissner

September 26, 2016

We're getting ready to launch a brand new search engine for PACER content. When it launches, one of the big features it will have is full-text search for the millions of documents that people have submitted using our RECAP system. To our knowledge, this will be the first free system for searching PACER content in this way, allowing you to look up documents by any word they might contain.

The big problem with this goal? We have about a million PDFs that consist only of images. Some of these are actually quite beautiful:

A beautiful handwritten motion. It goes on like this for 46 pages.

But others are hideous:

An 84-page log from 1957. It's come a long way just to appear on this blog today.

But no matter how a document looks, we want to extract the text so that we can make it searchable. This is done using a technique called Optical Character Recognition (OCR), which looks at the pixels on each page of each document and tries to work out which letters they form. As you might expect, this can take a while when you're processing millions of documents averaging 9.1 pages each.

About a month ago we started working on this using two very powerful computers with 40 CPU cores between them. They have been hard at work extracting text from these documents ever since. For example, this is what one of them looks like right now:

A system manager showing lots of cores fully pegged.

This shows 24 CPUs each at 100% utilization.
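
For the curious, here is a rough sketch of how this kind of work can be spread across many cores. It is not our production code: the Tesseract command line tool, the function names, and the default of 24 workers are all assumptions made for illustration, and the step that turns each PDF page into an image is left out.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ocr_with_tesseract(image_path):
    """Run the Tesseract CLI on one page image and return the recognized text.

    Assumption: the `tesseract` binary is installed and on the PATH. The post
    does not say which OCR engine is actually used.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],  # "stdout" prints text instead of writing a file
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_pages(image_paths, ocr_page=ocr_with_tesseract, workers=24):
    """OCR many page images concurrently and return {path: text}.

    Threads are enough here because each task mostly waits on an external
    process, so the operating system can schedule those processes on
    separate cores.
    """
    image_paths = list(image_paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(ocr_page, image_paths))  # map preserves input order
    return dict(zip(image_paths, texts))
```

Passing the OCR function in as a parameter makes it easy to swap in a different engine, or a stand-in function when trying the sketch out without Tesseract installed.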

As of today, we have only about 100,000 more documents to go, and we expect we'll be done in about another ten days.
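
To put those numbers in perspective, here is a back-of-the-envelope calculation. It assumes the remaining documents also average 9.1 pages and that all 40 cores stay busy the whole time, neither of which is guaranteed; the function name is made up for this example.

```python
def remaining_throughput(docs_left, days_left, pages_per_doc, cores):
    """Estimate the daily pace implied by a backlog and a deadline."""
    docs_per_day = docs_left / days_left
    pages_per_day = docs_per_day * pages_per_doc
    # Total core-seconds available per day, divided by pages handled per day.
    seconds_per_page_per_core = 86_400 * cores / pages_per_day
    return docs_per_day, pages_per_day, seconds_per_page_per_core


docs, pages, secs = remaining_throughput(100_000, 10, 9.1, 40)
print(f"{docs:,.0f} documents/day, {pages:,.0f} pages/day, "
      f"about {secs:.0f} s per page per core")
```

Under those assumptions, that works out to about 10,000 documents and 91,000 pages a day, or roughly 38 seconds of one core's time for each page.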

In the meantime, we're working on the software side of this project, and we will be launching this feature soon!

Tagged: ocr, pacer

