Extracting Text from Our Collection of PACER Documents
We're getting ready to launch a brand new search engine for PACER content. One of its biggest features will be full-text search of the millions of documents that people have submitted using our RECAP system. To our knowledge, this will be the first free system for searching PACER content in this way, allowing you to look up documents by any word they contain.
The big problem with this goal? We have about a million PDFs that consist only of images. Some of these are actually quite beautiful:
But others are hideous:
But no matter how a document looks, we want to extract its text so that we can make it searchable. This is done with a technique called Optical Character Recognition (OCR), which examines the pixels on each page of each document and tries to work out which letters they form. As you might expect, this can take a while when you're processing millions of documents averaging 9.1 pages each.
About a month ago we started working on this using two very powerful computers, which together have 40 CPU cores. Both have been running flat out ever since, extracting text from these documents. For example, this is what one of these computers looks like right now:
As of today, we have only about 100,000 more documents to go, and we expect we'll be done in about another ten days.
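For a rough sense of what those numbers mean per page, here is a back-of-the-envelope calculation in Python. It is only a sketch: it assumes all 40 cores stay busy around the clock and that the remaining documents match the 9.1-page average, neither of which is guaranteed. The variable names are ours, chosen for illustration.

```python
# Approximate figures quoted in this post.
remaining_docs = 100_000    # documents still waiting for OCR
days_remaining = 10         # expected days until the backlog is done
avg_pages_per_doc = 9.1     # average pages per document
cpu_cores = 40              # cores across both computers

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: every core works 24 hours a day on OCR and nothing else.
docs_per_day = remaining_docs / days_remaining
pages_per_day = docs_per_day * avg_pages_per_doc
pages_per_core_per_day = pages_per_day / cpu_cores
seconds_per_page_per_core = SECONDS_PER_DAY / pages_per_core_per_day

print(f"Documents per day:          {docs_per_day:,.0f}")
print(f"Pages per day:              {pages_per_day:,.0f}")
print(f"Pages per core per day:     {pages_per_core_per_day:,.0f}")
print(f"Seconds per page, per core: {seconds_per_page_per_core:.1f}")
```

Under those assumptions, each core spends roughly 38 seconds on every page, which is why a backlog of this size takes weeks even on powerful hardware.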
In the meantime, we're working on the software side of this project, and we will be launching this feature soon!