Problem

A month ago, we became aware of a way to harvest legal notifications from a government website (Link Here). The web server allows simple requests to be crafted to download PDF documents related to court proceedings. After a few hours, we had over 25,000 PDF documents available to analyze. Now the question becomes: what is the best way to do this?

If you said big-data analysis or machine learning, you are overcomplicating the process. For us, those technologies make sense when the input exceeds 1 TB of data. In this case, 25,000+ PDFs add up to less than 4 GB, since the PDFs contain only text.

We did a quick proof of concept to determine the best way to extract all the text from the documents. Needless to say, it took fewer than 10 lines of code, which we will share in this post once 100% of the sample has been processed. We did, however, notice some limitations with this process and felt we should explain why it did not work 100% of the time. This data source has two types of PDF files:

  1. a document converted to PDF from a text-based format
  2. a printout of a document scanned as an image.

The first type is extremely simple to analyze. Tools like pdf2ps (PDF to PostScript) extract all the text quickly.
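
As an illustration of how little code that takes, here is a minimal sketch using Ghostscript's pdf2ps and its companion ps2ascii; the file-name pattern and the extracted-text directory are our own placeholders, not the script referenced above.

    #!/bin/dash
    # Sketch only: convert each PDF to PostScript, then dump the plain text.
    mkdir -p extracted-text
    for f in Document-ID-*; do
        pdf2ps "$f" "$f.ps"                          # PDF -> PostScript
        ps2ascii "$f.ps" > "extracted-text/$f.txt"   # PostScript -> plain text
    done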

The scanned documents, however, are more troublesome because of:

  1. Quality of the scanned document
  2. Alignment of paper during scan

It’s hard to believe an official government communication would be skewed and delivered this way, but often the clerks responsible for this process can read it themselves and thus aren’t terribly concerned with the quality of the document after it’s ingested into the document distribution system. Because of this, we employed a method to convert the PDF documents into high-quality images, align them, and then extract the text using Tesseract.
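
A rough sketch of that image path follows. The tool choices (poppler's pdftoppm, ImageMagick's convert, the tesseract command line), the deskew threshold, and the directory layout are illustrative assumptions rather than the exact production pipeline.

    #!/bin/dash
    # Sketch: render each page at 300 DPI, straighten it, then OCR it.
    for f in scanned/*.pdf; do
        base=${f%.pdf}
        pdftoppm -r 300 -png "$f" "$base"        # produces $base-1.png, $base-2.png, ...
        for page in "$base"-*.png; do
            convert "$page" -deskew 40% "$page"  # auto-deskew the scanned page in place
            tesseract "$page" "${page%.png}"     # writes ${page%.png}.txt
        done
    done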

We also used http://coolpythoncodes.com/flesch-index-python-script/ to analyze the quality of the documents.
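
For reference (our note, not from the script linked above): the Flesch Reading Ease score is 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words). Higher scores indicate easier text, so anomalous scores can help flag garbled extractions.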

The Process

  1. Download the documents (Complete)
  2. Determine whether the downloaded files are actually PDFs or junk downloads
  3. Determine whether the valid PDFs are text-based or scanned
    1. If text, extract and dump all text
    2. Else, convert to high-res image
      1. If skewed, align
      2. Else, convert to text
  4. Measure the extracted contents to determine whether the text extraction was valid.
    1. If valid, pass and mark the document complete
    2. Else, fail and mark the document for manual review
  5. Take the valid documents and run them through word analysis.

Step by Step

Step 1: Completed in our previous post (Link Here). Although the script is very simple, we did not feel threading or concurrent downloads were needed to accomplish this.
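
For readers who skipped that post, a sequential download loop of this kind can be sketched as follows; the URL pattern below is a placeholder, not the real endpoint, and this is not the script from the earlier post.

    #!/bin/dash
    # Sketch: fetch each document by sequential ID, one at a time (no concurrency).
    START=160000
    END=170000
    for i in $(seq $START $END); do
        wget -O Document-ID-$i "https://example.gov/documents/$i"   # placeholder URL
    done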

Step 2: Part of the problem with this approach is that every request to the web server returned a valid 200 response code, even when no document was retrieved; in those cases wget saved an HTML error page instead of a PDF. To get rid of these, let's use a script to move everything that is HTML rather than a PDF out of the way.

    #!/bin/dash
    START=160000
    END=170000
    for i in $(seq $START $END); do
        # 'file' reports the detected type; anything not identified as a PDF
        # is moved aside for later review.
        if file Document-ID-$i | grep -qiv pdf; then
            mv Document-ID-$i failed-downloads/
        fi
    done

Simple. Now we have a clean directory containing only the files that are actually PDFs.

Step 3: Let's test the PDF documents by counting the number of images within each PDF. If there are no images, the document is a true text PDF. Our previous methodology of incrementing file numbers no longer works, since we have now moved files out of the folder if they were not valid PDFs. Let's modify accordingly:
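
One way to sketch such a test is with poppler's pdfimages, which can list the images embedded in a PDF. The loop below is our illustration, not the post's script: the directory names are placeholders and pdftotext stands in for whichever text-extraction tool is ultimately used.

    #!/bin/dash
    # Sketch: iterate over the files that survived step 2 rather than a numeric range.
    mkdir -p extracted-text scanned-pdfs
    for f in Document-ID-*; do
        # pdfimages -list prints a two-line header, then one line per embedded image
        images=$(pdfimages -list "$f" | tail -n +3 | wc -l)
        if [ "$images" -eq 0 ]; then
            pdftotext "$f" "extracted-text/$f.txt"   # true text PDF: dump the text
        else
            mv "$f" scanned-pdfs/                    # scan: handle via the image path
        fi
    done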

Step 3a: We used a custom action job in Adobe Acrobat Pro DC to run Optical Character Recognition (OCR) on all the documents. The process of opening this many PDFs, recognizing the text with OCR, and saving to a new directory took over 30 hours. This was time-prohibitive in itself, since using an action job for this type of conversion took weeks to perform on only a few thousand documents. We accept that legal e-Discovery could use this technique on a few hundred documents, but what about e-Discovery on tens if not hundreds of thousands of documents? Even with a few high-specification computers, using a GUI to do a script's job is always taxing. None of the conversion steps is difficult, but be advised that using Acrobat this way requires leaving the system alone; screen focus will be stolen from you if you keep using the machine for regular tasks.
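
For comparison with the Acrobat workflow, batch OCR can also be scripted with a command-line wrapper such as ocrmypdf, which drives Tesseract and writes searchable PDFs. The sketch below is our illustration, not the process described above, and the directory names are placeholders.

    #!/bin/dash
    # Sketch: OCR each scanned PDF and write a searchable copy to a separate directory.
    mkdir -p ocr-output
    for f in scanned-pdfs/*.pdf; do
        ocrmypdf "$f" "ocr-output/$(basename "$f")"
    done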

More to follow soon….