Python Reference
The API reference for the most relevant features in the visarchpy package. For other features, please refer to the source code.
Extraction Pipelines
All data extraction pipelines inherit from the Pipeline abstract class.
- class visarchpy.pipelines.Pipeline(data_directory: str, output_directory: str, settings: dict = None, metadata_file: str = None, temp_directory: str = None, ignore_id: bool = False)
Abstract base class for all pipelines.
- property ignore_id: bool
Gets the ignore_id flag.
- property metadata_file: str
Gets the path to the metadata file.
- abstract run() → dict
Run the pipeline.
- property settings: dict
Gets settings for the pipeline.
- property temp_directory: str
Gets the path to the temporary directory.
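The abstract-class contract described above can be illustrated with a small stand-in. This is a minimal sketch, not the visarchpy source: only the constructor arguments, the property names, and the abstract run() method come from this reference, while PipelineSketch, EchoPipeline, and the "dpi" settings key are hypothetical names introduced for illustration.

```python
from abc import ABC, abstractmethod


class PipelineSketch(ABC):
    """Stand-in mirroring the documented constructor and properties.

    Hypothetical: this is not the visarchpy implementation.
    """

    def __init__(self, data_directory, output_directory, settings=None,
                 metadata_file=None, temp_directory=None, ignore_id=False):
        self.data_directory = data_directory
        self.output_directory = output_directory
        self._settings = settings
        self._metadata_file = metadata_file
        self._temp_directory = temp_directory
        self._ignore_id = ignore_id

    @property
    def settings(self):
        return self._settings

    @property
    def ignore_id(self):
        return self._ignore_id

    @abstractmethod
    def run(self) -> dict:
        """Subclasses must implement the extraction step."""


class EchoPipeline(PipelineSketch):
    """Hypothetical subclass that reports its configuration instead of extracting."""

    def run(self) -> dict:
        return {"data_directory": self.data_directory,
                "settings": self.settings or {}}


# The "dpi" key is an invented example setting, not a documented one.
pipeline = EchoPipeline("in", "out", settings={"dpi": 200})
result = pipeline.run()
```

Because run() is abstract, the base class cannot be instantiated directly; only subclasses that implement run(), such as the concrete pipelines listed below, can be created and executed.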
- class visarchpy.pipelines.Layout(data_directory: str, output_directory: str, settings: dict = None, metadata_file: str = None, temp_directory: str = None, ignore_id: bool = False)
Bases: Pipeline
A pipeline for extracting metadata and visuals from PDF files using layout analysis. Layout analysis recursively checks elements in the PDF file and sorts them into images, text, and other elements.
- run() → dict
Run the pipeline.
- class visarchpy.pipelines.OCR(data_directory: str, output_directory: str, settings: dict = None, metadata_file: str = None, temp_directory: str = None, ignore_id: bool = False)
Bases: Pipeline
A pipeline for extracting metadata and visuals from PDF files using OCR analysis. OCR analysis extracts images from PDF files using Tesseract OCR.
- run()
Run the pipeline.
- class visarchpy.pipelines.LayoutOCR(data_directory: str, output_directory: str, settings: dict = None, metadata_file: str = None, temp_directory: str = None, ignore_id: bool = False)
Bases: Pipeline
A pipeline for extracting metadata and visuals from PDF files that combines layout and OCR analysis. Layout analysis recursively checks elements in the PDF file and sorts them into images, text, and other elements. OCR analysis extracts images using Tesseract OCR.
It applies image search and analysis in two steps. First, it analyses the layout of the PDF file using the pdfminer.six library. Second, it applies OCR to the pages where layout analysis found no images.
- run()
Run the pipeline.
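The two-step strategy of LayoutOCR (layout analysis first, OCR only on pages where no images were found) can be sketched as a simple fallback loop. This is an illustrative sketch under stated assumptions, not the package's implementation: layout_then_ocr, fake_layout, and fake_ocr are hypothetical names, and the real pipeline works on PDF pages through pdfminer.six and Tesseract rather than on integers.

```python
from typing import Callable, Dict, List


def layout_then_ocr(pages: List[int],
                    layout_step: Callable[[int], List[str]],
                    ocr_step: Callable[[int], List[str]]) -> Dict[int, List[str]]:
    """Run layout_step on every page; fall back to ocr_step where it finds nothing."""
    results: Dict[int, List[str]] = {}
    for page in pages:
        images = layout_step(page)
        if not images:
            # Layout analysis found no images on this page, so try OCR.
            images = ocr_step(page)
        results[page] = images
    return results


# Toy stand-ins for the two analyses (hypothetical, for illustration only).
def fake_layout(page: int) -> List[str]:
    # Pretend layout analysis only finds images on even-numbered pages.
    return ["layout-img"] if page % 2 == 0 else []


def fake_ocr(page: int) -> List[str]:
    return [f"ocr-img-{page}"]


found = layout_then_ocr([0, 1, 2], fake_layout, fake_ocr)
```

In this sketch OCR is invoked only for page 1, the one page on which the layout step returned no images.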
Transformation Utilities
Visualization Utilities
Functions for analyzing and visualizing extracted data.
- visarchpy.analytics.plot_bboxes(images: List[str], cmap: str = 'cool', predictor: Any = None, show: bool = True, size: int = 10, resolution: int = 300, scale_factor: float = 1.0, max_image_size: int = 89478485, save_to_file: str = None) → None
Creates a plot of the bounding boxes of the given images, organized concentrically. This type of plot is useful for visualizing the distribution of sizes and shapes of the given images.
- Parameters:
image_paths (List[str]) – A list of image file paths.
cmap (str) – Name of the matplotlib color map to be used. Consult the matplotlib documentation for valid values.
predictor (Kmeans) – A trained KMeans clustering model (Scikit Learn) used to assign a label and color to each image bounding box. If None, a pretrained model with two features (width and height) and 20 classes will be used.
show (bool) – Shows plot. Default is True.
size (int) – Size of the figure plot in inches. Default is 10. This value influences the quality of the plot when saving to a file.
resolution (int) – Resolution of the plot and figure in dots per inch (dpi). Default is 300.
scale_factor (float) – Scale factor for the image size. Default is 1.0, which means that images will be plotted at their original size. Values larger than 1.0 will increase the image size and values smaller than 1.0 will decrease the image size.
max_image_size (int) – Maximum size of an image in pixels. Images larger than this value will not be plotted. Default is 89478485, which is the maximum size of an image in pixels that can be stored in a 32-bit system.
save_to_file (str) – Path to a PNG file to save the plot. If None, no file is saved.
- Return type:
None
- Raises:
Warning – If an image has no bounding box in the alpha channel.
Warning – Decompression Bomb. If an image is larger than the maximum size allowed for a 32-bit system.
Killed – If the system runs out of memory during plotting. Adjusting the max_image_size and scale_factor parameters may help.
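To make the roles of scale_factor and max_image_size concrete, the following sketch computes centered rectangles for a list of image sizes. It is a simplified illustration, not the code behind plot_bboxes: concentric_boxes is a hypothetical helper, it takes (width, height) tuples instead of image file paths, and applying the size filter before scaling is an assumption.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), lower-left origin


def concentric_boxes(sizes: List[Tuple[int, int]],
                     scale_factor: float = 1.0,
                     max_image_size: int = 89478485) -> List[Box]:
    """Return rectangles centered on (0, 0), one per image that passes the size filter."""
    boxes: List[Box] = []
    for width, height in sizes:
        # Assumption: the pixel-count filter is applied to the unscaled image.
        if width * height > max_image_size:
            continue
        w = width * scale_factor
        h = height * scale_factor
        # Centering every rectangle on the origin makes the boxes concentric.
        boxes.append((-w / 2, -h / 2, w, h))
    return boxes


boxes = concentric_boxes([(100, 50), (10000, 10000)],
                         scale_factor=0.5, max_image_size=1_000_000)
```

Here the second image (100,000,000 pixels) exceeds the limit and is skipped, while the first is halved in size and centered on the origin. Lowering either parameter reduces the amount of data to draw, which is why the reference suggests adjusting them when memory runs out.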