Data Extraction Pipelines

Data extraction pipelines extract metadata and images from PDF files. A pipeline can process a single PDF file or a directory of PDF files, either from the command line with the visarch command or from Python by importing VisArchPy as a package. VisArchPy provides three extraction pipelines: Layout, OCR, and LayoutOCR.

Layout Pipeline

Extracts metadata and visuals (images) from PDF files using layout analysis. Layout analysis uses the algorithm in pdfminer.six to recursively traverse the elements of a PDF file and classify them as images, text, and so on.

Examples

The following examples show how to extract images and metadata from PDF files using the Layout pipeline.

visarch layout from-file <path-to-pdf-file> <path-output-directory>

Python: not available.

Tip

Use visarch layout [SUBCOMMAND] -h to see which options are available in the CLI, or consult the Python Reference if you are using Python.

OCR Pipeline

Extracts metadata and visuals (images) from PDF files using OCR analysis. OCR analysis extracts images from PDF files using Tesseract OCR.

Examples

The following examples show how to extract images and metadata from PDF files using the OCR pipeline.

visarch ocr from-file <path-to-pdf-file> <path-output-directory>

Python: not available.

Tip

Use visarch ocr [SUBCOMMAND] -h to see which options are available in the CLI, or consult the Python Reference if you are using Python.

LayoutOCR Pipeline

Extracts metadata and visuals (images) from PDF files using a combination of Layout and OCR analysis. This pipeline first applies layout analysis to extract images from the PDF file. It then applies OCR analysis only to the pages that produced no images in the first step. This condition avoids extracting the same image twice; however, images that neither analysis detects are still missed.
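The fallback rule described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not VisArchPy's implementation: the names layout_then_ocr, fake_layout, and fake_ocr are hypothetical, and the two extractors are toy stand-ins for the real analyses.

```python
from typing import Callable, Dict, List

# Hypothetical per-page extractor: takes a page number and returns the
# images found on that page. Stands in for the real layout/OCR analyses.
Extractor = Callable[[int], List[str]]


def layout_then_ocr(pages: List[int], layout: Extractor,
                    ocr: Extractor) -> Dict[int, List[str]]:
    """Run layout analysis on every page and fall back to OCR only on
    pages where layout analysis found no images."""
    results: Dict[int, List[str]] = {}
    for page in pages:
        images = layout(page)
        if not images:
            # Layout analysis found nothing here, so try OCR instead.
            images = ocr(page)
        results[page] = images
    return results


def fake_layout(page: int) -> List[str]:
    # Toy stand-in: layout analysis only finds an image on page 1.
    return ["layout-img"] if page == 1 else []


def fake_ocr(page: int) -> List[str]:
    # Toy stand-in: OCR analysis only finds an image on page 2.
    return ["ocr-img"] if page == 2 else []


found = layout_then_ocr([1, 2, 3], fake_layout, fake_ocr)
print(found)
```

In this toy run, page 1 is handled by layout analysis alone, page 2 is recovered by OCR, and page 3 stays empty because neither analysis detects anything there.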

Examples

The following examples show how to extract images and metadata from PDF files using the LayoutOCR pipeline.

visarch layoutocr from-file <path-to-pdf-file> <path-output-directory>

Python: not available.

Tip

Use visarch layoutocr [SUBCOMMAND] -h to see which options are available in the CLI, or consult the Python Reference if you are using Python.
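For scripted use, the CLI commands shown in the pipeline sections can be assembled and launched from Python with the standard subprocess module. The sketch below is a thin wrapper around the command line and does not use VisArchPy's Python API. The helper names build_command and run_pipeline are hypothetical, and it assumes that the --settings option is accepted by every pipeline in the position shown.

```python
import subprocess
from pathlib import Path
from typing import List, Optional

# Pipeline names as they appear in the CLI commands of this page.
PIPELINES = ("layout", "ocr", "layoutocr")


def build_command(pipeline: str, pdf: Path, output_dir: Path,
                  settings: Optional[Path] = None) -> List[str]:
    """Assemble 'visarch <pipeline> from-file ...' as an argument list."""
    if pipeline not in PIPELINES:
        raise ValueError(f"unknown pipeline: {pipeline!r}")
    cmd = ["visarch", pipeline, "from-file"]
    if settings is not None:
        # Assumption: --settings precedes the positional arguments.
        cmd += ["--settings", str(settings)]
    cmd += [str(pdf), str(output_dir)]
    return cmd


def run_pipeline(pipeline: str, pdf: Path, output_dir: Path,
                 settings: Optional[Path] = None) -> int:
    """Run the command and return its exit code (requires visarch on PATH)."""
    return subprocess.run(build_command(pipeline, pdf, output_dir, settings),
                          check=False).returncode


print(build_command("layout", Path("report.pdf"), Path("out")))
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.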

Pipeline Outputs

All extraction pipelines produce the following outputs, which are saved to <output-directory>.

<output-directory>
 └──00000/  # result directory
    ├── pdf-001  # PDF directory, one per PDF. Extracted images are saved here.
    ├── 00000-metadata.csv  # extracted metadata as CSV
    ├── 00000-metadata.json  # extracted metadata as JSON
    ├── 00000-settings.json  # a copy of settings used by the pipeline
    └── 00000.log  # processing log file
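The result directory can be read back with the Python standard library. The following is a minimal sketch that relies only on the file names shown above; the function name load_results is hypothetical. It makes no assumption about which fields the metadata contains, and it assumes the CSV file starts with a header row.

```python
import csv
import json
from pathlib import Path


def load_results(output_dir, result_id="00000"):
    """Load metadata (JSON and CSV) and settings from a result directory."""
    result_dir = Path(output_dir) / result_id

    with open(result_dir / f"{result_id}-metadata.json", encoding="utf-8") as fh:
        metadata = json.load(fh)

    # Assumption: the first CSV row is a header, so DictReader can map
    # each following row to column names.
    with open(result_dir / f"{result_id}-metadata.csv", newline="",
              encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

    with open(result_dir / f"{result_id}-settings.json", encoding="utf-8") as fh:
        settings = json.load(fh)

    return metadata, rows, settings
```

A call such as load_results("<output-directory>") returns the parsed JSON metadata, the CSV rows as a list of dictionaries, and the settings the pipeline ran with.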

Warning

Be mindful when running a pipeline multiple times on the same <output-directory>. The 00000 directory is created if it does not exist. If it already exists, the pipeline overwrites or updates its contents as follows:

  • pdf-001: existing images are kept, new images are added.

  • 00000-metadata.csv: existing metadata will be overwritten.

  • 00000-metadata.json: existing metadata will be overwritten.

  • 00000-settings.json: existing settings will be overwritten.

  • 00000.log: existing records are kept, new records are added.
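Because the metadata and settings files are overwritten on every run, it can be useful to copy them elsewhere before re-running a pipeline on the same output directory. The sketch below uses only the standard library; the function name backup_overwritten_files and the choice of backup location are assumptions, not part of VisArchPy.

```python
import shutil
from pathlib import Path
from typing import List

# Files that, per the list above, are overwritten on a re-run.
OVERWRITTEN = ("{id}-metadata.csv", "{id}-metadata.json", "{id}-settings.json")


def backup_overwritten_files(output_dir, backup_dir,
                             result_id="00000") -> List[Path]:
    """Copy the files a re-run would overwrite into backup_dir.

    Returns the paths of the copies; files that do not exist are skipped.
    """
    result_dir = Path(output_dir) / result_id
    backup_dir = Path(backup_dir)
    copies: List[Path] = []
    for template in OVERWRITTEN:
        src = result_dir / template.format(id=result_id)
        if src.is_file():
            backup_dir.mkdir(parents=True, exist_ok=True)
            # copy2 also preserves timestamps, which helps tell runs apart.
            copies.append(Path(shutil.copy2(src, backup_dir / src.name)))
    return copies
```

Images and log records are not copied here, since the pipeline keeps existing ones and only adds to them.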

Settings

The pipeline settings determine how image extraction is performed. By default, the pipelines use the settings in visarchpy/default-settings.json. These defaults can be overridden by passing custom settings to the pipeline.

Default settings can be shown on the terminal by using the following command:

visarch [PIPELINE] settings

Default Settings

Extraction pipelines use the following default settings:

{
    "layout": {
        "caption": {
            "offset": [
                4,
                "mm"
            ],
            "direction": "down",
            "keywords": [
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": {
            "width": 120,
            "height": 120
        }
    },
    "ocr": {
        "caption": {
            "offset": [
                50,
                "px"
            ],
            "direction": "down",
            "keywords": [
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": {
            "width": 120,
            "height": 120
        },
        "resolution": 250,
        "resize": 30000,
        "tesseract" : "--psm 1 --oem 3"
    }
}

Settings for the data extraction pipelines in VisArchPy.

Setting            Meaning                                   Expected values
-----------------  ----------------------------------------  -----------------------------
layout             Group settings for Layout analysis
ocr                Group settings for OCR analysis
caption.offset     Distance around an image boundary where   [ number, "mm" ] (for layout)
                   captions will be searched for             [ number, "px" ] (for OCR)
caption.direction  Direction relative to an image where      all, up, down, right, left,
                   captions are searched for                 down-right, up-left
caption.keywords   Keywords used to find captions based on   [keyword1, keyword2, ...]
                   text analysis
image.width        Minimum width of an image to be           integer
                   extracted, in pixels
image.height       Minimum height of an image to be          integer
                   extracted, in pixels
ocr.resolution     DPI used to convert PDF pages into        integer
                   images before applying OCR
ocr.resize         Maximum width and height, in pixels, of   integer
                   the PDF page used as input by Tesseract.
                   If page conversion results in a larger
                   image, it is downsized to fit this
                   value. Tesseract's maximum width and
                   height is 2^15 (32768) pixels.
ocr.tesseract      Tesseract command line options passed to  string
                   Tesseract. See the Tesseract man page
                   [1] for more information.
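
To see how ocr.resolution and ocr.resize interact, the sketch below converts a page size to pixels at a given DPI and then downsizes the result so its longest side fits a limit. It is an illustration of the arithmetic only. The function names page_pixels and fit_within are hypothetical, and preserving the aspect ratio when downsizing is an assumption about how the resize is applied.

```python
from typing import Tuple

MM_PER_INCH = 25.4


def page_pixels(width_mm: float, height_mm: float, dpi: int) -> Tuple[int, int]:
    """Pixel size of a page rendered at the given DPI."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Downsize (width, height) so the longest side is at most max_side.

    Assumption: the aspect ratio is preserved. Integer arithmetic avoids
    floating-point rounding on the longest side.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    return (max(1, width * max_side // longest),
            max(1, height * max_side // longest))


# An A4 page (210 x 297 mm) at the default 250 DPI stays well below 30000 px.
a4 = page_pixels(210, 297, 250)
print(a4, fit_within(*a4, 30000))

# An A0 page (841 x 1189 mm) at 1200 DPI exceeds the limit and is downsized.
a0 = page_pixels(841, 1189, 1200)
print(a0, fit_within(*a0, 30000))
```

With the default values, ordinary page sizes never reach the limit; the resize step only matters for large-format pages or high resolutions.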

Custom Settings

Custom settings must follow the schema shown above. Settings are grouped by extraction approach (layout and ocr), so when a pipeline implements only one approach, the group for the other approach can be omitted. Custom settings can be passed to a pipeline as a JSON file (CLI) or as a dictionary (Python).

visarch layoutocr from-file --settings <settings-file> <path-to-pdf-file> \
<path-output-directory>
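
A settings file for the CLI can be produced from Python by overriding selected values and writing the result as JSON. The sketch below is one way to do this with the standard library; merge_settings and write_settings are hypothetical helpers, and the override values are arbitrary examples. It builds a file that contains only the layout group, which is suitable for the Layout pipeline; a file for LayoutOCR would also need the ocr group.

```python
import copy
import json
from pathlib import Path


def merge_settings(defaults: dict, overrides: dict) -> dict:
    """Return a copy of defaults with overrides applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# The layout group, copied from the default settings listed on this page.
LAYOUT_DEFAULTS = {
    "layout": {
        "caption": {
            "offset": [4, "mm"],
            "direction": "down",
            "keywords": ["figure", "caption", "figuur"],
        },
        "image": {"width": 120, "height": 120},
    }
}

# Example overrides: search captions in all directions and skip
# images smaller than 200 x 200 pixels.
custom = merge_settings(LAYOUT_DEFAULTS, {
    "layout": {
        "caption": {"direction": "all"},
        "image": {"width": 200, "height": 200},
    }
})


def write_settings(settings: dict, path) -> None:
    """Write settings as JSON so they can be passed with --settings."""
    Path(path).write_text(json.dumps(settings, indent=4), encoding="utf-8")


print(json.dumps(custom, indent=4))
```

The merge is recursive, so values that are not overridden, such as caption.offset and caption.keywords, keep their defaults.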