Data Extraction Pipelines

Data extraction pipelines are used to extract metadata and images from PDF files. These pipelines can be used to extract data from a single PDF file or a directory of PDF files, using the CLI and the visarch command or as a Python package. VisArchPy provides three different extraction pipelines: Layout, OCR, and LayoutOCR.

Layout Pipeline

Extracts metadata and visuals (images) from PDF files using a layout analysis. Layout analysis uses the algorithm in pdfminer.six to recursively check elements in a PDF file and sort them into images, text, etc.

Examples

The following examples show how to extract images and metadata from PDF files using the Layout pipeline.

visarch layout from-file <path-to-pdf-file> <path-output-directory>

"Not available"

visarch layout from-dir <path-pdf-directory> <path-output-directory>

from visarchpy.pipelines import Layout

pipeline = Layout('path-to-data-dir', 'path-to-output-dir',
                 metadata_file='path-to-mods-file',
                 settings=None, # use default settings
                 )

pipeline.run()

Tip

Use visarch layout [SUBCOMMAND] -h to see which options are available in the CLI. Or consult the Python Reference if using Python.

OCR Pipeline

Extracts metadata and visuals (images) from PDF files using OCR analysis. OCR analysis extracts images from PDF files using Tesseract OCR.

Examples

The following examples show how to extract images and metadata from PDF files using the OCR pipeline.

visarch ocr from-file <path-to-pdf-file> <path-output-directory>

"Not available"

visarch ocr from-dir <path-pdf-directory> <path-output-directory>

from visarchpy.pipelines import OCR

pipeline = OCR('path-to-data-dir', 'path-to-output-dir',
              metadata_file='path-to-mods-file',
              settings=None, # use default settings
              )

pipeline.run()

Tip

Use visarch ocr [SUBCOMMAND] -h to see which options are available in the CLI. Or consult the Python Reference if using Python.

LayoutOCR Pipeline

Extracts metadata and visuals (images) from PDF files using a combination of Layout and OCR analysis. This pipeline first uses the layout analysis to extract images from PDF files. Then, it applies OCR analysis pages in the PDF file that did not produce any images using the first analysis. This condition avoids extracting the same images twice; however, it may miss images not detected by any of the analyses.

Examples

The following examples show how to extract images and metadata from PDF files using the LayoutOCR pipeline.

visarch layoutocr from-file <path-to-pdf-file> <path-output-directory>

"Not available"

visarch layoutocr from-dir <path-pdf-directory> <path-output-directory>

from visarchpy.pipelines import LayoutOCR

pipeline = LayoutOCR('path-to-data-dir', 'path-to-output-dir',
                    metadata_file='path-to-mods-file',
                    settings=None, # use default settings
                    )

pipeline.run()

Tip

Use visarch ocr [SUBCOMMAND] -h to see which options are available in the CLI. Or consult the Python Reference if using Python.

Pipeline Outputs

All extraction pipelines result in the following outputs. Outputs are saved to the <output directory>.

<output-directory>
 └──00000/  # result directory
    ├── pdf-001  # PDF directory, one per PDF. Extracted images are saved here.
    ├── 00000-metadata.csv  # extracted metadata as CSV
    ├── 00000-metadata.json  # extracted metadata as JSON
    ├── 00000-settings.json  # a copy of settings used by the pipeline
    └── 00000.log  # processing log file

Warning

Be mindful when running the pipeline multiple times on the same <output-directory. The 00000 directory is created if it does not exist. However, if it exists, the pipeline will overwrite/update its contents.

pdf-001: existing images are kept, new images are added.

00000-metadata.csv: existing metadata will be overwritten.

00000-metadata.json: existing metadata will be overwritten.

00000-settings.json: existing settings will be overwritten.

00000.log: existing records are kept, new records are added.

Settings

The pipeline settings determine how image extraction is performed. By default, the pipelines use the settings in visarchpy/default-settings.json. However, these settings can be overwritten by passing custom settings to the pipeline.

Default settings can be shown on the terminal by using the following command:

visarch [PIPELINE] settings

Default Setting

Extraction pipelines use the following default settings:

{
    "layout": {
        "caption": {
            "offset": [
                4,
                "mm"
            ],
            "direction": "down",
            "keywords": [
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": {
            "width": 120,
            "height": 120
        }
    },
    "ocr": {
        "caption": {
            "offset": [
                50,
                "px"
            ],
            "direction": "down",
            "keywords": [
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": {
            "width": 120,
            "height": 120
        },
        "resolution": 250,
        "resize": 30000,
        "tesseract" : "--psm 1 --oem 3"
    }
}

Settings for the data extraction pipelines in VisArchPy.
Setting	Meaning	Expected values
layout	Group settings for Layout analysis
ocr	Group settings for OCR analysis
caption.offset	Distance around an image boundary where captions will be searched for	`[ number, "mm" ]` (for layout) `[ number, "px" ]` (for OCR)
caption.direction	Direction relative to an image where captions are searched for	`all, up, down,` right, left ` `down-right, up-left,`
caption.keywords	Keywords used to find captions based on text analysis	`[keyword1, keyword2, ...]`
image.width	minimum width of an image to be extracted, in pixels	`integer`
image.height	minimum height of an image to be extracted, in pixels	`integer`
ocr.resolution	DPI used to convert PDF pages into images before applying OCR	`integer`
ocr.resize	Maximum width and height of PDF page used as input by Tesseract. in pixels. If page conversion results in a larger image, it will be downsized to fit this value. Tesseract maximum values for width and height is \(2^{15}\)	`integer`
ocr.tesseract	Tesseract command line options passed to Tesseract. See Tesseract man page [1] for more information.	`string`

Custom Settings

When defining custom settings, the schema defined above should be used. Note that settings for different extraction approaches are grouped together. When using a pipeline that implements only one approach, settings for the other can be omitted. Custom settings can be passed to a pipeline as a JSON file (CLI) or a dictionary (Python).

visarch layoutocr from-file --settings <settings-file> <path-pdf-directory> \
<path-output-directory>

from visarchpy.pipelines import LayoutOCR

custom_settings = {}  # a dictionary with custom settings following schema above

pipeline = LayoutOCR('path-to-data-dir', 'path-to-output-dir',
                    metadata_file='path-to-mods-file',
                    settings=custom_settings
                    )

pipeline.run()