Metadata Reference
This section describes the metadata fields used in the metadata.json file. Most fields are based on the MODS standard; the others were added to hold TU Delft-specific information and metadata about extracted images (visuals).
Field | Explanation | Expected values |
---|---|---|
documents [2] | List of documents processed by the pipeline | |
documents.location [2] | Location of the document | { root_path: string, file_path: string } |
persons [3] | List of persons involved in documents | array of persons |
persons.name | Name of the person | string |
persons.role | Role of the person | string |
faculty [3] | List of faculties involved in documents | array of faculties |
faculty.name | Name of the faculty | string |
faculty.departments | List of departments in the faculty | |
mods_file [2] | Path to the MODS file | string |
title [3] | Title of thesis | string |
abstract [3] | Abstract of thesis | array of strings |
submission_date [3] | Submission date of thesis | |
thesis_type [3] | Type of thesis | array of strings |
subjects [3] | List of subjects | array of strings |
copyright [3] | Copyright information | array of strings |
languages [3] | List of languages | array of language identifiers |
uuid [1] | Repository unique identifier | string |
iid [3] | Repository identifier | string |
media_type [3] | Media type of resource | array |
issuance [3] | Issuance of resource | array |
digital_origin [3] | Digital origin of resource | string |
doi [3] | Digital Object Identifier | string |
edition [3] | Edition of resource | string |
extent [3] | Extent of resource | array |
form [3] | Form of resource | array |
classification [3] | Classification of resource | array |
collection [3] | Catalog collection | string |
geo_code [3] | Geographic code | string |
corp_names [3] | Corporate names | array |
creators [3] | Creators of resource | array |
physical_description [3] | Physical description of resource | array |
physical_location [3] | Physical location of resource | array |
pid [3] | Persistent identifier | string |
publication_place [3] | Publication place of resource | array |
publisher [3] | Publisher of resource | array |
purl [3] | Persistent URL | string |
type_resource [3] | Type of resource | string |
web_url [1] | Web URL of resource | string |
total_visuals [2] | Total number of extracted visuals for documents in documents | integer |
visuals [2] | List of extracted visuals | array of visuals |
visuals.document | Document where the visual is located | same as documents.location |
visuals.document_page | Document page where visual is located | integer |
visuals.bbox | Bounding box of visual | array of floats |
visuals.bbox_units | Units of the bounding box | |
visuals.id | Unique identifier of visual | string |
visuals.caption | Extracted caption of visual | array of strings |
visuals.visual_type | Type of visual | string |
visuals.location | Location where extracted visual is stored | same as documents.location |
Footnotes
Metadata File Example
Running any VisArchPy data extraction pipeline generates a metadata.json
file in the output directory. The following example shows the structure of that
file.
{
  "documents": [
    {
      "location": {
        "root_path": "./tests/data/00001/",
        "file_path": "00001_sample.pdf"
      }
    }
  ],
  "persons": [
    {
      "name": "Rom, O.",
      "role": "mentor"
    },
    {
      "name": "Bir, H.",
      "role": "mentor"
    },
    {
      "name": "Jen, P.",
      "role": "mentor"
    },
    {
      "name": "Hans, Y.",
      "role": "author"
    }
  ],
  "faculty": [
    {
      "name": "Architecture",
      "departments": [
        {
          "name": "Architecture"
        }
      ]
    }
  ],
  "mods_file": "./tests/data/00001/00001_mods.xml",
  "title": "Resource title",
  "abstract": [
    "This is an example."
  ],
  "submission_date": "2020-09-03",
  "thesis_type": [
    "master thesis"
  ],
  "subjects": [
    "Mapping",
    "Mental Border"
  ],
  "copyright": [
    "(c) 2020 Hans, Y."
  ],
  "languages": [
    {
      "code": "en",
      "authority": "rfc3066"
    }
  ],
  "uuid": "uuid:0008286e-f16c-4e7b-8334-fe36fe9b09e4",
  "iid": null,
  "media_type": [],
  "issuance": [],
  "digital_origin": null,
  "doi": null,
  "edition": null,
  "extent": [],
  "form": [],
  "classification": [],
  "collection": null,
  "geo_code": null,
  "corp_names": [],
  "creators": [],
  "physical_description": [],
  "physical_location": [],
  "pid": null,
  "publication_place": [],
  "publisher": [],
  "purl": [],
  "type_resource": "text",
  "web_url": "http://resolver.tudelft.nl/uuid:0008286e-f16c",
  "total_visuals": 1,
  "visuals": [
    {
      "document": {
        "location": {
          "root_path": "./tests/data/00001/",
          "file_path": "00001_sample.pdf"
        }
      },
      "document_page": 1,
      "bbox": [
        49.9742,
        143.462,
        234.7092,
        331.342
      ],
      "bbox_units": "pt",
      "id": "5e5cc208-cd2c-4b09-a61a-5b6203d111b7",
      "caption": [
        "Figure 1: Caption of the figure extracted from the document."
      ],
      "visual_type": null,
      "location": {
        "root_path": "./tests/data/layout/",
        "file_path": "00001/pdf-001/00001-page1-Im0.0.jpg"
      }
    }
  ]
}
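Because metadata.json is plain JSON, it can be inspected with Python's standard json module. The following sketch is not part of VisArchPy: it parses a trimmed copy of the example (embedded as a string so the snippet runs on its own) and lists the page and image path of each visual. The helper name summarize_visuals is hypothetical, and joining root_path with file_path to obtain the image path is an assumption about how the two fields relate.

```python
import json
import os

# Trimmed copy of the metadata example; with a real output directory the
# file would be read with json.load() instead of json.loads().
SAMPLE = """
{
  "total_visuals": 1,
  "visuals": [
    {
      "document_page": 1,
      "bbox": [49.9742, 143.462, 234.7092, 331.342],
      "bbox_units": "pt",
      "caption": ["Figure 1: Caption of the figure extracted from the document."],
      "location": {
        "root_path": "./tests/data/layout/",
        "file_path": "00001/pdf-001/00001-page1-Im0.0.jpg"
      }
    }
  ]
}
"""


def summarize_visuals(metadata: dict) -> list:
    """Return (page, image path) pairs, one per visual (hypothetical helper)."""
    summary = []
    for visual in metadata["visuals"]:
        location = visual["location"]
        # Assumption: the stored image lives at root_path + file_path.
        image_path = os.path.join(location["root_path"], location["file_path"])
        summary.append((visual["document_page"], image_path))
    return summary


metadata = json.loads(SAMPLE)
pairs = summarize_visuals(metadata)
for page, path in pairs:
    print(page, path)
```

Comparing len(pairs) with metadata["total_visuals"] is a quick consistency check on a generated file.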
Metadata Classes
VisArchPy handles metadata extracted from the MODS file (if given) and images (visuals) using the following classes:
- class visarchpy.metadata.FilePath(root_path: str, file_path: str)
Represents a file path
- file_path: str
- full_path() → str
Returns the full path of the file
- Returns:
full path of the file
- Return type:
str
- root_path: str
- update_root_path(root_path: str) → None
Updates the root path of the file path
- Parameters:
root_path (str) – new root path
- Return type:
None
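The interface documented for FilePath can be illustrated with a small stand-in. This is a minimal sketch, not the library's implementation: the class name FilePathSketch is hypothetical, and it assumes that full_path() simply joins root_path and file_path.

```python
import os
from dataclasses import dataclass


@dataclass
class FilePathSketch:
    """Hypothetical stand-in mirroring the documented FilePath interface."""

    root_path: str
    file_path: str

    def full_path(self) -> str:
        # Assumption: the full path is root_path joined with file_path.
        return os.path.join(self.root_path, self.file_path)

    def update_root_path(self, root_path: str) -> None:
        # Only the root changes; the relative file_path is kept.
        self.root_path = root_path


location = FilePathSketch("./tests/data/00001/", "00001_sample.pdf")
original = location.full_path()

# Pointing the same entry at a relocated data directory.
location.update_root_path("/archive/00001/")
moved = location.full_path()

print(original)
print(moved)
```

Keeping the root separate from the relative path is what allows a dataset to be moved without rewriting every entry.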
- class visarchpy.metadata.Person(name: str, role: str)
Represents a person and their role
- name: str
- role: str
- class visarchpy.metadata.Faculty(name: str, departments: List[Department])
Represents a Faculty
- departments: List[Department]
- name: str
- class visarchpy.metadata.Document(location: FilePath = None)
Represents a (PDF) document
- update_root_path(path: str) → None
Updates the root path of the document
- class visarchpy.metadata.Visual(document: Document, document_page: int, bbox: List[int], bbox_units: str)
A class for handling metadata of visuals (images) extracted from PDF files
- bbox: List[int]
- bbox_units: str
- document_page: int
- id: str | None
- set_caption(caption: str) → None
Sets the caption for the visual
- Parameters:
caption (str) – caption for the visual
- Return type:
None
- Raises:
Warning – If the caption already contains two elements
- set_location(location: FilePath, update: bool = False) → None
Sets the location where the visual is stored
- Parameters:
location (FilePath) – location where the visual is stored
update (bool) – if True, the root_path of location will be updated. If False, an error will be raised if the location is already set
- Return type:
None
- Raises:
ValueError – If the location is already set and update is False
- set_visual_type(visual_type: str) → None
Sets the visual type. One of photo, drawing, map, etc.
- Parameters:
visual_type (str) – type of visual
- Return type:
None
- visual_type: str | None = None
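The guarded behaviour described for set_location (raise ValueError when a location is already set, unless update is True) can be sketched as follows. VisualSketch is a hypothetical stand-in, not the library class; it stores the location as a plain dict and, as a simplification, replaces the whole location on update rather than only its root_path.

```python
from typing import Optional


class VisualSketch:
    """Hypothetical stand-in showing a setter guarded by an update flag."""

    def __init__(self) -> None:
        self.location: Optional[dict] = None

    def set_location(self, location: dict, update: bool = False) -> None:
        # Refuse to silently overwrite an existing location.
        if self.location is not None and not update:
            raise ValueError("location is already set; pass update=True")
        # Simplification: the whole location is replaced.
        self.location = location


visual = VisualSketch()
visual.set_location({"root_path": "./out/", "file_path": "page1-Im0.jpg"})

try:
    visual.set_location({"root_path": "./other/", "file_path": "page1-Im0.jpg"})
    second_call_failed = False
except ValueError:
    second_call_failed = True

# An explicit update is accepted.
visual.set_location(
    {"root_path": "./other/", "file_path": "page1-Im0.jpg"}, update=True
)
print(second_call_failed, visual.location["root_path"])
```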
- class visarchpy.metadata.Metadata
Represents the collection of metadata of an entry. An entry consists of a MODS file and zero or more PDF files. Most of the fields are based on the MODS standard.
- abstract: str = None
- add_document(document: Document) → None
Adds a document object to the metadata
- Parameters:
document (Document) – document object
- Return type:
None
- Raises:
TypeError – if document is not a Document object
- add_pdf_location(path_pdf: str, overwrite: bool = False) → None
Sets location of the PDF file
- Parameters:
path_pdf (str) – path to the PDF file
overwrite (bool) – if True, overwrites the PDF location if it is already set
- Return type:
None
- Raises:
ValueError – if PDF location is already set and overwrite is False
- add_visual(visual: Visual) → None
Adds a visual to the metadata
- Parameters:
visual (Visual) – visual object
- Return type:
None
- Raises:
TypeError – if visual is not a Visual object
- add_web_url(base_url: str, overwrite: bool = False) → None
Adds a URL to the metadata
- Parameters:
base_url (str) – base URL of the repository
overwrite (bool) – if True, overwrites the web URL if it is already set
- Return type:
None
- Raises:
ValueError – if web URL is already set and overwrite is False
- as_dataframe() → DataFrame
Returns metadata as a Pandas DataFrame
- as_dict() → dict
Returns metadata as a dictionary
- classification: List = None
- collection: str = None
- copyright: str = None
- corp_names: List = None
- creators: List = None
- digital_origin: str = None
- doi: str = None
- edition: str = None
- extent: List = None
- form: List = None
- geo_code: List = None
- iid: str = None
- issuance: List = None
- languages: List[dict] = None
- media_type: List = None
- mods_file: str = None
- physical_description: List = None
- physical_location: List = None
- pid: str = None
- publication_place: List = None
- publisher: List = None
- purl: List = None
- save_to_csv(filename: str) → None
Writes metadata to a CSV file
- Parameters:
filename (str) – name of the CSV file
- Return type:
None
- save_to_json(filename: str) → None
Writes metadata to a JSON file
- Parameters:
filename (str) – name of the JSON file
- Return type:
None
- set_metadata(metadata: dict) → None
Sets metadata for a repository entry
- Parameters:
metadata (dict) – dictionary with metadata from MODS file
- Return type:
None
- subjects: List = None
- submission_date: str = None
- thesis_type: str = None
- title: str = None
- total_visuals: int | None = 0
- type_resource: str = None
- uuid: str | None = None
- web_url: str = None
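The relationship between as_dict() and save_to_json() can be illustrated with a reduced stand-in. This is a sketch under stated assumptions, not the library's code: MetadataSketch is hypothetical, carries only three of the documented fields, and assumes the JSON file is simply the dictionary form written to disk.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class MetadataSketch:
    """Hypothetical stand-in with three of the documented fields."""

    title: Optional[str] = None
    subjects: List[str] = field(default_factory=list)
    total_visuals: int = 0

    def as_dict(self) -> dict:
        # dataclasses.asdict converts the fields to plain Python types.
        return asdict(self)

    def save_to_json(self, filename: str) -> None:
        # Assumption: the file holds exactly the dictionary form.
        with open(filename, "w", encoding="utf-8") as handle:
            json.dump(self.as_dict(), handle, indent=2)


entry = MetadataSketch(title="Resource title", subjects=["Mapping", "Mental Border"])

# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "metadata.json")
    entry.save_to_json(json_path)
    with open(json_path, encoding="utf-8") as handle:
        restored = json.load(handle)

print(restored)
```

Reading the file back and comparing it with as_dict() is a simple way to confirm that nothing is lost in serialization.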