Metadata Reference

This section describes the metadata fields used in the metadata.json file. Most of the metadata fields are based on the MODS standard. However, some fields were added to support TU Delft-specific fields and metadata about extracted images (visuals).

Metadata fields in the metadata.json file.

Field

Explanation

Expected values

documents [2]

List of documents processed by the pipeline

documents.location [2]

Location of the document

{ root_path: string, file_path: string }

persons [3]

List of persons involved in documents

array of persons

persons.name

Name of the person

string

persons.role

Role of the person

string

faculty [3]

List of faculties involved in documents

array of faculties

faculty.name

Name of the faculty

string

faculty.departments

List of departments in the faculty

[{ name: string }]

mods_file [2]

Path to the MODS file

string

title [3]

Title of thesis

string

abstract [3]

Abstract of thesis

array of strings

submission_date [3]

Submission date of thesis

YYYY-MM-DD

thesis_type [3]

Type of thesis

array of strings

subjects [3]

List of subjects

array of strings

copyright [3]

Copyright information

array of strings

languages [3]

List of languages

array of language identifiers

uuid [1]

Repository unique identifier

string

iid [3]

Repository identifier

string

media_type [3]

Media type of resource

array

issuance [3]

Issuance of resource

array

digital_origin [3]

Digital origin of resource

string

doi [3]

Digital Object Identifier

string

edition [3]

Edition of resource

string

extent [3]

Extent of resource

array

form [3]

Form of resource

array

classification [3]

Classification of resource

array

collection [3]

Catalog collection

string

geo_code [3]

Geographic code

string

corp_names [3]

Corporate names

array

creators [3]

Creators of resource

array

physical_description [3]

Physical description of resource

array

physical_location [3]

Physical location of resource

array

pid [3]

Persistent identifier

string

publication_place [3]

Publication place of resource

array

publisher [3]

Publisher of resource

array

purl [3]

Persistent URL

string

type_resource [3]

Type of resource

string

web_url [1]

Web URL of resource

string

total_visuals [2]

Total number of visuals extracted from the documents listed in documents

integer

visuals [2]

List of extracted visuals

array of visuals

visuals.document

Document where the visual is located

same as documents.location

visuals.document_page

Document page where visual is located

integer

visuals.bbox

Bounding box of visual

array of floats

visuals.bbox_units

Units of the bounding box

pt (point), px (pixel)

visuals.id

Unique identifier of visual

string

visuals.caption

Extracted caption of visual

array of strings

visuals.visual_type

Type of visual

string

visuals.location

Location where extracted visual is stored

same as documents.location

Footnotes

Metadata File Example

Running any of the VisArchPy data extraction pipelines generates a metadata.json file in the output directory. The following example shows the structure of that file.

{
"documents": [
    {
        "location": {
            "root_path": "./tests/data/00001/",
            "file_path": "00001_sample.pdf"
        }
    }
],
"persons": [
    {
        "name": "Rom, O.",
        "role": "mentor"
    },
    {
        "name": "Bir, H.",
        "role": "mentor"
    },
    {
        "name": "Jen, P.",
        "role": "mentor"
    },
    {
        "name": "Hans, Y.",
        "role": "author"
    }
],
"faculty": [
    {
        "name": "Architecture",
        "departments": [
            {
                "name": "Architecture"
            }
        ]
    }
],
"mods_file": "./tests/data/00001/00001_mods.xml",
"title": "Resource title",
"abstract": [
    "This is an example."
],
"submission_date": "2020-09-03",
"thesis_type": [
    "master thesis"
],
"subjects": [
    "Mapping",
    "Mental Border"
],
"copyright": [
    "(c) 2020 Hans, Y."
],
"languages": [
    {
        "code": "en",
        "authority": "rfc3066"
    }
],
"uuid": "uuid:0008286e-f16c-4e7b-8334-fe36fe9b09e4",
"iid": null,
"media_type": [],
"issuance": [],
"digital_origin": null,
"doi": null,
"edition": null,
"extent": [],
"form": [],
"classification": [],
"collection": null,
"geo_code": null,
"corp_names": [],
"creators": [],
"physical_description": [],
"physical_location": [],
"pid": null,
"publication_place": [],
"publisher": [],
"purl": [],
"type_resource": "text",
"web_url": "http://resolver.tudelft.nl/uuid:0008286e-f16c",
"total_visuals": 1,
"visuals": [
    {
        "document": {
            "location": {
                "root_path": "./tests/data/00001/",
                "file_path": "00001_sample.pdf"
            }
        },
        "document_page": 1,
        "bbox": [
            49.9742,
            143.462,
            234.7092,
            331.342
        ],
        "bbox_units": "pt",
        "id": "5e5cc208-cd2c-4b09-a61a-5b6203d111b7",
        "caption": [
            "Figure 1: Caption of the figure extracted from the document."
        ],
        "visual_type": null,
        "location": {
            "root_path": "./tests/data/layout/",
            "file_path": "00001/pdf-001/00001-page1-Im0.0.jpg"
        }
    }
]
}
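As a minimal sketch of how this structure can be consumed, the snippet below parses a trimmed copy of the example with the standard json module and resolves where each visual is stored. It assumes that root_path and file_path are simply joined into one path; the helper name stored_path is hypothetical and is not part of VisArchPy.

```python
import json
import posixpath

# Trimmed copy of the example above; a real file would be read with json.load().
raw = """
{
  "total_visuals": 1,
  "visuals": [
    {
      "document": {"location": {"root_path": "./tests/data/00001/",
                                "file_path": "00001_sample.pdf"}},
      "document_page": 1,
      "bbox": [49.9742, 143.462, 234.7092, 331.342],
      "bbox_units": "pt",
      "caption": ["Figure 1: Caption of the figure extracted from the document."],
      "location": {"root_path": "./tests/data/layout/",
                   "file_path": "00001/pdf-001/00001-page1-Im0.0.jpg"}
    }
  ]
}
"""


def stored_path(location: dict) -> str:
    """Join root_path and file_path (assumed here to be a plain path join)."""
    return posixpath.join(location["root_path"], location["file_path"])


metadata = json.loads(raw)

# One resolved path per entry in the "visuals" array.
paths = [stored_path(visual["location"]) for visual in metadata["visuals"]]

for visual, path in zip(metadata["visuals"], paths):
    print(visual["document_page"], visual["bbox_units"], path)
```

In this example total_visuals matches the length of the visuals array, which makes it a convenient consistency check after loading a file.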

Metadata Classes

VisArchPy handles metadata extracted from the MODS file (if given) and images (visuals) using the following classes:

class visarchpy.metadata.FilePath(root_path: str, file_path: str)

Represents a file path

file_path: str
full_path() → str

Returns the full path of the file

Returns:

full path of the file

Return type:

str

root_path: str
update_root_path(root_path: str) → None

Updates the root path of the file path

Parameters:

root_path (str) – new root path

Return type:

None
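The behaviour described for FilePath can be illustrated with a small stand-in class. This is only a sketch under stated assumptions, not VisArchPy's implementation: FilePathSketch is a hypothetical name, the full path is assumed to be root_path joined with file_path, and the path /archive/00001/ is made up for the example.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class FilePathSketch:
    """Hypothetical stand-in for FilePath; not the library's own class."""

    root_path: str
    file_path: str

    def full_path(self) -> str:
        # Assumption: the full path is root_path joined with file_path.
        return posixpath.join(self.root_path, self.file_path)

    def update_root_path(self, root_path: str) -> None:
        # Only the root changes; the relative file_path is kept as-is.
        self.root_path = root_path


location = FilePathSketch("./tests/data/00001/", "00001_sample.pdf")
before = location.full_path()

# Point the same relative file at a different root directory.
location.update_root_path("/archive/00001/")
after = location.full_path()
```

Keeping the root separate from the relative part means a dataset can be moved to another directory without rewriting every file_path.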

class visarchpy.metadata.Person(name: str, role: str)

Represents a person and their role

name: str
role: str
class visarchpy.metadata.Department(name: str)

Represents a department in a Faculty

name: str
class visarchpy.metadata.Faculty(name: str, departments: List[Department])

Represents a Faculty

departments: List[Department]
name: str
class visarchpy.metadata.Document(location: FilePath = None)

Represents a (PDF) document

location: FilePath = None
update_root_path(path: str) → None

Updates the root path of the document

class visarchpy.metadata.Visual(document: Document, document_page: int, bbox: List[int], bbox_units: str)

A class for handling metadata of visuals (images) extracted from PDF files

bbox: List[int]
bbox_units: str
caption: list | None = None
document: Document
document_page: int
id: str | None
location: FilePath = None
set_caption(caption: str) → None

Sets the caption for the visual

Parameters:

caption (str) – caption for the visual

Return type:

None

Raises:

Warning – If the caption already contains two elements

set_location(location: FilePath, update: bool = False) → None

Sets the location where the visual is stored

Parameters:
  • location (FilePath) – location where the visual is stored

  • update (bool) – if True, the root_path of location will be updated. If False, an error will be raised if the location is already set

Return type:

None

Raises:

ValueError – If the location is already set and update is False

set_visual_type(visual_type: str) → None

Sets the visual type. One of photo, drawing, map, etc.

Parameters:

visual_type (str) – type of visual

Return type:

None

visual_type: str | None = None
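
Because bbox_units can be either pt or px, bounding boxes from different sources may need to be brought to a common unit before they are compared. A point is 1/72 of an inch, so converting points to pixels depends on the resolution at which the page is rendered. The helper below is a hypothetical sketch (bbox_to_pixels is not a VisArchPy function), and the 144 dpi value is an arbitrary choice for illustration.

```python
from typing import List

POINTS_PER_INCH = 72  # one point is 1/72 of an inch


def bbox_to_pixels(bbox: List[float], bbox_units: str, dpi: float) -> List[float]:
    """Return the bounding box in pixels at the given resolution (hypothetical helper)."""
    if bbox_units == "px":
        # Already in pixels; return a copy so the input is not shared.
        return list(bbox)
    if bbox_units == "pt":
        scale = dpi / POINTS_PER_INCH
        return [value * scale for value in bbox]
    raise ValueError(f"unsupported bbox_units: {bbox_units!r}")


bbox_pt = [49.9742, 143.462, 234.7092, 331.342]  # values from the example file
bbox_px = bbox_to_pixels(bbox_pt, "pt", dpi=144)  # 144 dpi gives a scale factor of 2
```

The conversion scales every coordinate by the same factor, so it does not depend on how the four values are ordered.
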
class visarchpy.metadata.Metadata

Represents the collection of metadata of an entry. An entry consists of a MODS file and zero or more PDF files. Most of the fields are based on the MODS standard.

abstract: str = None
add_document(document: Document) → None

Adds a document object to the metadata

Parameters:

document (Document) – document object

Return type:

None

Raises:

TypeError – if document is not a Document object

add_pdf_location(path_pdf: str, overwrite: bool = False) → None

Sets location of the PDF file

Parameters:
  • path_pdf (str) – path to the PDF file

  • overwrite (bool) – if True, overwrites the PDF location if it is already set

Return type:

None

Raises:

ValueError – if PDF location is already set and overwrite is False

add_visual(visual: Visual) → None

Adds a visual to the metadata

Parameters:

visual (Visual) – visual object

Return type:

None

Raises:

TypeError – if visual is not a Visual object

add_web_url(base_url: str, overwrite: bool = False) → None

Adds a URL to the metadata

Parameters:
  • base_url (str) – base URL of the repository

  • overwrite (bool) – if True, overwrites the web URL if it is already set

Return type:

None

Raises:

ValueError – if web URL is already set and overwrite is False

as_dataframe() → DataFrame

Returns metadata as a Pandas DataFrame

as_dict() → dict

Returns metadata as a dictionary

classification: List = None
collection: str = None
copyright: str = None
corp_names: List = None
creators: List = None
digital_origin: str = None
documents: List[Document] = None
doi: str = None
edition: str = None
extent: List = None
faculty: Faculty = None
form: List = None
geo_code: List = None
iid: str = None
issuance: List = None
languages: List[dict] = None
media_type: List = None
mods_file: str = None
persons: List[Person] = None
physical_description: List = None
physical_location: List = None
pid: str = None
publication_place: List = None
publisher: List = None
purl: List = None
save_to_csv(filename: str) → None

Writes metadata to a CSV file

Parameters:

filename (str) – name of the CSV file

Return type:

None

save_to_json(filename: str) → None

Writes metadata to a JSON file

Parameters:

filename (str) – name of the JSON file

Return type:

None

set_metadata(metadata: dict) → None

Sets metadata for a repository entry

Parameters:

metadata (dict) – dictionary with metadata from MODS file

Return type:

None

subjects: List = None
submission_date: str = None
thesis_type: str = None
title: str = None
total_visuals: int | None = 0
type_resource: str = None
uuid: str | None = None
visuals: List[Visual] | None = None
web_url: str = None
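
Several of the setters documented above (set_location, add_pdf_location, add_web_url) share one contract: a value that is already set is replaced only when the caller opts in, and a ValueError is raised otherwise. The sketch below illustrates that contract with a hypothetical EntrySketch class. How VisArchPy actually builds the URL from base_url is not described in this reference, so joining the base URL with the uuid is an assumption, and http://example.org is a placeholder.

```python
from typing import Optional


class EntrySketch:
    """Hypothetical stand-in that mimics the documented overwrite guard."""

    def __init__(self, uuid: str) -> None:
        self.uuid = uuid
        self.web_url: Optional[str] = None

    def add_web_url(self, base_url: str, overwrite: bool = False) -> None:
        # Refuse to silently replace an existing value.
        if self.web_url is not None and not overwrite:
            raise ValueError("web URL is already set; pass overwrite=True to replace it")
        # Assumption: the URL is the base URL followed by the uuid.
        self.web_url = f"{base_url.rstrip('/')}/{self.uuid}"


entry = EntrySketch("uuid:0008286e-f16c-4e7b-8334-fe36fe9b09e4")
entry.add_web_url("http://resolver.tudelft.nl")

try:
    entry.add_web_url("http://example.org")  # no overwrite flag: rejected
    second_call_rejected = False
except ValueError:
    second_call_rejected = True

# Opting in replaces the stored value.
entry.add_web_url("http://example.org", overwrite=True)
```

Defaulting the flag to False makes accidental replacement an explicit error instead of a silent data change.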