Quick Start

Start here if you are new to VisArchPy. VisArchPy provides a set of pipelines and tools for extracting, transforming, and visualizing images from PDF files. It was developed to support the creation of a visual archive of architectural visuals (photographs, drawings, floorplans, 3D renders, etc.), but it can be used on any PDF file and image data set.

The main features of VisArchPy are:

Layout: pipeline for extracting metadata and visuals (images) from PDF files using layout analysis. Layout analysis recursively checks the elements in a PDF file and sorts them into images, text, and other elements.
OCR: pipeline for extracting metadata and visuals from PDF files using OCR analysis. OCR analysis extracts images from PDF files using Tesseract OCR.
LayoutOCR: pipeline for extracting metadata and visuals from PDF files that combines layout and OCR analysis.
Dino: utility for transforming images into visual features using self-supervised learning with [DinoV2](https://ai.meta.com/blog/dino-v2-computer-vision-self-supervised-learning/).
Viz: a utility for creating a bounding box plot, which gives an overview of the shapes and sizes of images in a data set.

Note

VisArchPy also extracts metadata for each extracted image, and it identifies captions based on text analysis and proximity to images.

Installation

VisArchPy requires the following dependencies:

Dependencies

Installing from PyPI

After installing the dependencies, install VisArchPy using pip.

pip install visarchpy

Installing from source

Install the dependencies.

Clone the repository.

git clone https://github.com/AiDAPT-A/VisArchPy.git

Go to the root of the repository.
```
cd VisArchPy/
```
Install the package using pip.
```
pip install .
```
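
To confirm that the installation worked, a small check along the following lines may help. This is a minimal sketch, not part of VisArchPy: the function name `check_installation` is hypothetical, and it assumes the distribution is registered as `visarchpy` (the name used with pip above) and that the CLI executable is called `visarch`.

```python
import shutil
from importlib import metadata


def check_installation(dist_name="visarchpy", cli_name="visarch"):
    """Return the installed version and CLI path; each is None if not found."""
    try:
        # Look up the version recorded by pip for the distribution.
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    # shutil.which returns None when the executable is not on PATH.
    return {"version": version, "cli_path": shutil.which(cli_name)}


if __name__ == "__main__":
    print(check_installation())
```

If either value is `None`, the package or its command line entry point is probably not available in the active environment.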

Usage

The quickest way to get started with VisArchPy is to use the command line interface (CLI). Once installed, you can access the CLI by typing visarch in the terminal.

visarch -h

To access a particular pipeline or tool:

visarch [PIPELINE] -h

For example, to access the layout pipeline:

visarch layout -h

To run the layout pipeline on a PDF file, do the following:

visarch layout from-file <path-to-pdf-file> <path-output-directory>

To run the layout pipeline on a directory containing PDF files, do the following:

visarch layout from-dir <path-to-pdf-directory> <path-output-directory>
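
The two commands above can also be launched from a Python script. The following is a minimal sketch that only wraps the CLI with `subprocess`; the helper names `build_layout_command` and `run_layout` are hypothetical and are not part of VisArchPy, and the sketch assumes `visarch` is available on the PATH.

```python
import subprocess
from pathlib import Path


def build_layout_command(source, output_dir):
    """Build the `visarch layout` command for one PDF or a directory of PDFs."""
    source = Path(source)
    # Use the documented subcommands: from-dir for directories, from-file otherwise.
    subcommand = "from-dir" if source.is_dir() else "from-file"
    return ["visarch", "layout", subcommand, str(source), str(output_dir)]


def run_layout(source, output_dir):
    """Run the layout pipeline; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_layout_command(source, output_dir), check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the command before running it.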

Tip

Use visarch [PIPELINE] [SUBCOMMAND] -h for help.

Outputs

Results from the data extraction pipelines (Layout, OCR, LayoutOCR) are saved to the output directory and organized as follows:

00000/  # results directory
├── pdf-001  # directory of extracted images, one per PDF file
├── 00000-metadata.csv  # extracted metadata as CSV
├── 00000-metadata.json  # extracted metadata as JSON
├── 00000-settings.json  # settings used by pipeline
└── 00000.log  # log file
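
A results directory with this layout can be read back using only the Python standard library. The sketch below is illustrative and makes several assumptions: the function name `load_results` is hypothetical, file names are assumed to share the results directory name as a prefix (as in the tree above), files are assumed to be UTF-8 encoded, and no particular metadata columns are assumed.

```python
import csv
import json
from pathlib import Path


def load_results(results_dir):
    """Collect metadata rows, settings, and image folders from one results directory."""
    results_dir = Path(results_dir)
    prefix = results_dir.name  # e.g. "00000" (assumed to prefix the file names)

    # Read metadata rows generically; column names are taken from the CSV header.
    with open(results_dir / f"{prefix}-metadata.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Load the settings recorded by the pipeline run.
    with open(results_dir / f"{prefix}-settings.json", encoding="utf-8") as f:
        settings = json.load(f)

    # Treat every subdirectory as the image folder of one PDF file.
    image_dirs = sorted(p.name for p in results_dir.iterdir() if p.is_dir())

    return {"rows": rows, "settings": settings, "image_dirs": image_dirs}
```

For example, `load_results("00000")["image_dirs"]` would list the per-PDF image folders found in that results directory.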