Python Reference

This is the API reference for the most relevant features of the visarchpy package. For other features, please refer to the source code.

Extraction Pipelines

All data extraction pipelines inherit from the Pipeline abstract class.

class visarchpy.pipelines.Pipeline(data_directory: str, output_directory: str, settings: dict = None, metadata_file: str = None, temp_directory: str = None, ignore_id: bool = False)

Abstract base class for all pipelines.

property ignore_id: bool: Gets the ignore_id flag.

property metadata_file: str: Gets the path to the metadata file.

abstract run() → dict: Run the pipeline.

property settings: dict: Gets settings for the pipeline.

property temp_directory: str: Gets the path to the temporary directory.
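
The relationship between the abstract base class and its subclasses can be illustrated with a small sketch. It is not visarchpy's source code: SketchPipeline and EchoPipeline are hypothetical stand-ins that only mirror the constructor arguments and two of the properties documented above, assuming the base class follows Python's standard abc pattern.

```python
from abc import ABC, abstractmethod
from typing import Optional


class SketchPipeline(ABC):
    """Hypothetical stand-in mirroring the documented constructor."""

    def __init__(self, data_directory: str, output_directory: str,
                 settings: Optional[dict] = None,
                 metadata_file: Optional[str] = None,
                 temp_directory: Optional[str] = None,
                 ignore_id: bool = False) -> None:
        self.data_directory = data_directory
        self.output_directory = output_directory
        self._settings = settings
        self._metadata_file = metadata_file
        self._temp_directory = temp_directory
        self._ignore_id = ignore_id

    @property
    def settings(self) -> Optional[dict]:
        # Read-only access, as with the documented 'settings' property.
        return self._settings

    @property
    def ignore_id(self) -> bool:
        # Read-only access, as with the documented 'ignore_id' property.
        return self._ignore_id

    @abstractmethod
    def run(self) -> dict:
        """Subclasses must provide the extraction logic."""


class EchoPipeline(SketchPipeline):
    """Toy subclass: reports its directories instead of extracting data."""

    def run(self) -> dict:
        return {"data_directory": self.data_directory,
                "output_directory": self.output_directory}
```

Instantiating SketchPipeline directly raises a TypeError because run() is abstract, while EchoPipeline("in", "out").run() returns a dictionary. The concrete pipelines below are used the same way: construct, then call run().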

class visarchpy.pipelines.Layout(data_directory: str, output_directory: str, settings: dict = None, metadata_file: str = None, temp_directory: str = None, ignore_id: bool = False)

Bases: Pipeline

A pipeline for extracting metadata and visuals from PDF files using a layout analysis. Layout analysis recursively checks elements in the PDF file and sorts them into images, text, and other elements.

run() → dict: Run the pipeline.
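
A minimal usage sketch based only on the signature above. The helper name run_layout_pipeline is hypothetical, and whether the default values (settings=None, metadata_file=None, and so on) are sufficient for a real run is an assumption; consult the source code if the pipeline requires explicit settings.

```python
def run_layout_pipeline(data_directory: str, output_directory: str) -> dict:
    """Run the Layout pipeline on an input directory and return its results."""
    # Imported lazily so this module can be loaded without visarchpy installed.
    from visarchpy.pipelines import Layout

    # Only the two required arguments are passed; relying on the defaults
    # for the remaining arguments is an assumption of this sketch.
    pipeline = Layout(data_directory=data_directory,
                      output_directory=output_directory)
    return pipeline.run()
```

Because OCR and LayoutOCR share the same constructor signature, the same pattern should apply to them by swapping the class name.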

class visarchpy.pipelines.OCR(data_directory: str, output_directory: str, settings: dict = None, metadata_file: str = None, temp_directory: str = None, ignore_id: bool = False)

Bases: Pipeline

A pipeline for extracting metadata and visuals from PDF files using OCR analysis. OCR analysis extracts images from PDF files using Tesseract OCR.

run(): Run the pipeline.

class visarchpy.pipelines.LayoutOCR(data_directory: str, output_directory: str, settings: dict = None, metadata_file: str = None, temp_directory: str = None, ignore_id: bool = False)

Bases: Pipeline

A pipeline for extracting metadata and visuals from PDF files that combines layout and OCR analysis. Layout analysis recursively checks elements in the PDF file and sorts them into images, text, and other elements. OCR analysis extracts images using Tesseract OCR.

It applies image search and analysis in two steps. First, it analyses the layout of the PDF file using the pdfminer.six library. Second, it applies OCR to the pages where layout analysis found no images.

run(): Run the pipeline.

Transformation Utilities

Utility functions to extract visual features using the DINOv2 model and the Hugging Face transformers package.

visarchpy.dino.transformer.load_pickle_dinov2(pickle_filename: str) → BaseModelOutputWithPooling

Load outputs of dinov2 model from a file.

Parameters:

pickle_filename (str) – Path to pickle file

Returns:

outputs – Outputs of dinov2 model according to the ‘transformers’ package data classes. A Python object.

Return type:

BaseModelOutputWithPooling

visarchpy.dino.transformer.save_csv_dinov2(csv_filename: str, tensor: Tensor) → None

Save a 2D PyTorch tensor to a CSV file formatted as a pandas DataFrame.

Parameters:

csv_filename (str) – Path to csv file
tensor (Tensor) – 2D tensor to be saved to csv file.

Return type:

None

Raises:

TypeError – If tensor is not a pytorch Tensor object.
ValueError – If tensor is not a 2D pytorch Tensor object.

visarchpy.dino.transformer.save_pickle_dinov2(pickle_filename: str, model_outputs: BaseModelOutputWithPooling) → None

Save outputs of dinov2 model to a file.

Parameters:

pickle_filename (str) – Path to pickle file
model_outputs (BaseModelOutputWithPooling) – Outputs object of the dinov2 model to be written to the pickle file. The file will be saved to the same directory as the image file, and with the same name as the image file.

Return type:

None
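
The save and load functions form a round trip. The sketch below shows that pattern with Python's standard pickle module, assuming (the reference does not confirm this) that the package functions wrap it in a similar way. The helpers save_pickle, load_pickle and round_trip are hypothetical, and a plain dictionary stands in for the model outputs object.

```python
import os
import pickle
import tempfile
from typing import Any


def save_pickle(pickle_filename: str, outputs: Any) -> None:
    """Serialize an outputs object to a pickle file."""
    with open(pickle_filename, "wb") as handle:
        pickle.dump(outputs, handle)


def load_pickle(pickle_filename: str) -> Any:
    """Deserialize an outputs object from a pickle file."""
    with open(pickle_filename, "rb") as handle:
        return pickle.load(handle)


def round_trip(outputs: Any) -> Any:
    """Save to a temporary file, then load the object back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "image-001.pickle")  # hypothetical file name
        save_pickle(path, outputs)
        return load_pickle(path)


# A plain dict stands in for the model outputs object.
fake_outputs = {"last_hidden_state": [[0.1, 0.2], [0.3, 0.4]],
                "pooler_output": [0.5, 0.6]}
restored = round_trip(fake_outputs)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources.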

visarchpy.dino.transformer.transform_to_dinov2(image_file: str, model_name: str = 'facebook/dinov2-small') → Dict[Tensor, BaseModelOutputWithPooling]

Extract features from an image using DINOv2 model.

Parameters:

image_file (str) – Path to image file.
model_name (str) – pretrained DINOv2 model name (e.g. ‘facebook/dinov2-small’)

Returns:

results – Last hidden state of DINOv2 model as squeezed tensor, and model outputs object.

Return type:

Dict

Visualization Utilities

Functions for analyzing and visualizing extracted data.

visarchpy.analytics.plot_bboxes(images: List[str], cmap: str = 'cool', predictor: Any = None, show: bool = True, size: int = 10, resolution: int = 300, scale_factor: float = 1.0, max_image_size: int = 89478485, save_to_file: str = None) → None

Creates a plot of the bounding boxes, organized concentrically, for the given images. This type of plot is useful for visualizing the distribution of sizes and shapes of the given images.

Parameters:

images (List[str]) – A list of image file paths.
cmap (str) – Name of the matplotlib color map to be used. Consult the matplotlib documentation for valid values.
predictor (Kmeans) – A trained KMeans clustering model (scikit-learn) for assigning a label and color to each image bounding box. If None, a pretrained model with the features width and height, and 20 classes, will be used.
show (bool) – Shows plot. Default is True.
size (int) – Size of the figure plot in inches. Default is 10. This value influences the quality of the plot when saving to a file.
resolution (int) – Resolution of the plot and figure in dots per inch (dpi). Default is 300.
scale_factor (float) – Scale factor for the image size. Default is 1.0, which means that images will be plotted at their original size. Values larger than 1.0 will increase the image size and values smaller than 1.0 will decrease the image size.
max_image_size (int) – Maximum size of an image in pixels. Images larger than this value will not be plotted. Default is 89478485, which is the maximum size of an image in pixels that can be stored in a 32-bit system.
save_to_file (str) – Path to a PNG file to save the plot. If None, no file is saved.

Return type:

None

Raises:

Warning – If an image has no bounding box in the alpha channel.
Warning – Decompression Bomb. If an image is larger than the maximum size allowed for a 32-bit system.
Killed – If system runs out of memory during plotting. Adjusting the max_image_size and scale_factor parameters may help.
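
To make the roles of scale_factor and max_image_size concrete, the following sketch computes concentric bounding boxes from image sizes. It is a simplified illustration of the geometry, not the code behind plot_bboxes: concentric_boxes is a hypothetical helper, and it assumes that "concentric" means every box is centred on the same origin.

```python
from typing import Iterable, List, Tuple

# (left, bottom, right, top), centred on the origin
Box = Tuple[float, float, float, float]


def concentric_boxes(
    sizes: Iterable[Tuple[int, int]],
    scale_factor: float = 1.0,
    max_image_size: int = 89478485,
) -> List[Box]:
    """Return one origin-centred box per image that passes the size filter."""
    boxes: List[Box] = []
    for width, height in sizes:
        # Skip images whose pixel count exceeds the limit.
        if width * height > max_image_size:
            continue
        w = width * scale_factor
        h = height * scale_factor
        boxes.append((-w / 2, -h / 2, w / 2, h / 2))
    return boxes


# Two images as (width, height); the second exceeds the 50-pixel limit.
boxes = concentric_boxes([(4, 2), (10, 10)], scale_factor=0.5, max_image_size=50)
```

Lowering max_image_size drops the largest images entirely, while lowering scale_factor shrinks every box; both reduce the memory needed for plotting.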