A Metadata Workflow for Digitizing Archival Materials

By Heather Charlotte Owen, Brendan Honick, and Qiaoyi (Joy) Liu

Prepared for the Theodore Burr Covered Bridge Resource Center and IST 676 at Syracuse University


Table of Contents

  • Overview

  • Metadata Intake Form

  • Tableau Map Visualization

  • Data Model and Guide

  • Reflections

  • Resource List


Overview

The Theodore Burr Covered Bridge Resource Center is a non-profit organization whose purpose is to make covered bridge images (photos, postcards, and slides) and other information resources available to both scholars and enthusiasts. The primary user groups for these resources are covered bridge enthusiasts and those who specialize in architecture or engineering. As the center begins digitizing these materials, we are working with the institution’s staff on a workflow for creating metadata records and making that metadata accessible to users. To this end, we have created a metadata intake form, a data model, and a sample visualization in Tableau. We have prepared these tools so that the staff and volunteers at the Theodore Burr Covered Bridge Resource Center can create dynamic metadata records. We deeply appreciate the benefits that independent archives bring to their communities, and we intend our work here to reduce the barriers to entry that those organizations face when managing metadata.


Metadata Intake Form

We have created a series of steps for the Theodore Burr Covered Bridge Resource Center to follow as they digitize records and create metadata. This workflow is still a work in progress, and it will likely change as the center begins digitization in the coming months.


Tableau Map Visualization

As noted in the previous section’s video, the metadata that staff and volunteers at the Theodore Burr Covered Bridge Resource Center capture with the form can power new pathways for user interaction with the center’s resources. For instance, elements like bridge names, geographic coordinates, and links to relevant resources in the collections can be displayed in a map visualization. Below is an example of this practice, created with covered bridge data from Wikidata (a free and open database) and Tableau Public, a free visualization tool.
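Once a database like the one described in the Data Model and Guide section below is in place, a single query could export the map-ready fields (a label, coordinates, and a resource link) that a tool like Tableau Public expects. This is a minimal sketch: the table names anticipate the data model later in this guide, and the Resource_link column and Item_ID join column are illustrative assumptions.

    -- Sketch: exporting map-ready fields for a Tableau visualization.
    -- Resource_link and the Item_ID join column are illustrative assumptions.
    SELECT i.Title         AS bridge_name,
           g.Latitude      AS latitude,
           g.Longitude     AS longitude,
           i.Resource_link AS resource_link
    FROM   Items i
    JOIN   Geographic g ON g.Item_ID = i.Item_ID;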


Data Model and Guide

Introduction

This guide aims to support archivists, librarians, and volunteers in learning the techniques for building a database for the Theodore Burr Covered Bridges Collection. The database is intended as a supplement to the repository that stores the archival collection’s data and items. Although the two terms often overlap, the prevailing view is that repositories specialize in storing objects, while databases excel at storing and managing data. We propose using a database management system because it allows for interoperability, reusability, and accessibility in more flexible ways, both professional and personal.

Once the database is built, back-end staff can follow protocols to insert values while avoiding errors, duplicates, and redundant data. Managing data across collections or merging multiple collections would become more straightforward and less error-prone. Front-end users would benefit from faster access, a personalized interface, and more advanced filtering. However, it should also be clear that this design is not an essential requirement for this collection; the idea is to provide an advanced service to users, and there are many challenges to implementing it. We suggest the resource center first ensure that digitization and metadata creation are complete. Then, if funding is sufficient and staff have basic skills in database management or related areas, this approach would be practical to pursue.

The purpose of discussing this method is not to intimidate anyone with data but to apply techniques already proven in managing business and research data. Previous projects can shed light on how to create the database; for example, the University of Maryland applied database management to create its archival management system, ArchivesUM. We are confident in the benefits this approach could bring to this collection. In this guide, we demonstrate our workflow for creating the database, its requirements, and future considerations for the Theodore Burr Covered Bridge Resource Center.

Conceptual Data Modeling

Figure 1: Conceptual Data Model (Click to enlarge)

Database management begins with defining the metadata; details about this step are covered in the previous sections on the Metadata Intake Form. We used the entity-relationship (E-R) model to organize the entities (whole tables), attributes (elements within a table), and relations in this data set. It is one of the most frequently used approaches to database design because it supports data accuracy, data integrity, and security. We created two models, a conceptual data model and a logical data model; they represent the same design at different stages. The logical data model is based on the conceptual data model, which establishes which entities are required, which attributes belong to each entity, and what the relations between those entities are. The main entity/table, named [Items], contains the attributes/metadata that describe each item (see Figure 1). We used Diagrams.net to create all the diagrams in this project.

Figure 2: The singular-to-multiple relation using Crow’s Foot notation (Click to enlarge)

All required attributes/metadata are labeled with [R] and cannot be NULL; for example, Truss_type is required, but Language is not. Some attributes/metadata need to be unique so that they can represent only one particular object; these are labeled with [U]. It follows that attributes/metadata labeled with both (i.e., [RU]) must be both not null and unique. The relations between the entities/tables are shown in what is called Crow’s Foot notation: the shapes at the ends of the lines between entities represent the relative cardinality of the relationship. Two types of notation appear here. The first is the singular-to-multiple (one-to-many) relation: -|-|———|<-. Every single piece of data on the left can relate to multiple pieces of data on the right. For instance, the Collection-Item relation uses this notation, meaning a collection includes multiple items, but one item can belong to only one collection (see Figure 2). The second notation, -|-|———|-|-, is the singular-to-singular (one-to-one) relation. It appears in the Item-Geographic relation, so one coordinate identifies only one item and cannot be shared with another. The conceptual model is the first step toward database management; it leaves room to adjust and redesign.
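As a minimal sketch of how these labels would translate into database constraints, the SQL below defines a simplified [Items] table. The column names echo the model, but the data types and the File_name column are assumptions for illustration.

    -- Sketch: [R] becomes NOT NULL, [U] becomes UNIQUE, [RU] becomes both.
    -- Data types and the File_name column are illustrative assumptions.
    CREATE TABLE Items (
        Item_ID    INTEGER PRIMARY KEY,          -- unique identifier for each row
        Title      VARCHAR(255) NOT NULL,        -- [R]: required, cannot be NULL
        Truss_type VARCHAR(100) NOT NULL,        -- [R]: required
        Language   VARCHAR(50),                  -- not required, so NULL is allowed
        File_name  VARCHAR(255) NOT NULL UNIQUE  -- [RU]: required and unique
    );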

Logical Data Modeling

Figure 3: Logical Model of the Theodore Burr Covered Bridges Collection (Click to enlarge)

Next, we can build the logical data model. The only difference from the conceptual data model is that instead of using Crow’s Foot notation, we express the relations using Primary Keys and Foreign Keys. The Primary Key is the attribute that uniquely identifies each record; often it is a number with no specific meaning, generated by the database management system. Every table/entity must have a Primary Key (labeled PK and listed as the first attribute in every table). The system generates a new Primary Key value each time new item information is inserted. Some attributes besides the Primary Key should also be unique; as in the conceptual data model, these are labeled U (followed by sequential numbers). To show the relations, the logical data model uses a Primary Key-Foreign Key (PK-FK) relation. For example, the [Items] table contains an attribute called Collection_ID (see Figure 3). It is labeled FK, meaning it is a Foreign Key: it corresponds exactly to the Primary Key of the [Collections] table, also named Collection_ID (see Figure 4). Every time the [Items] table is retrieved from the database, each item carries an attribute showing which collection it belongs to. Multiple items can belong to the same collection (i.e., share the same Collection_ID number), as we noted in the conceptual data model. In this way, the relations between the tables are presented in a form a machine can understand.
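The sketch below expresses this PK-FK relation in SQL. It creates a simplified [Collections] table and adds the Collection_ID Foreign Key from Figure 3 to the [Items] table sketched earlier; the data types remain illustrative assumptions.

    -- Sketch of the Items-Collections PK-FK relation from the logical model.
    CREATE TABLE Collections (
        Collection_ID   INTEGER PRIMARY KEY,   -- PK, generated by the DBMS
        Collection_name VARCHAR(255) NOT NULL  -- required collection label
    );

    -- Extend the earlier Items sketch with the Foreign Key column:
    ALTER TABLE Items
        ADD COLUMN Collection_ID INTEGER
            REFERENCES Collections (Collection_ID);  -- FK: must match a Collections PK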

By using database management techniques, data can be stored, managed, retrieved, and processed with greater speed and accuracy. The tables do not interfere with one another apart from the PK-FK relations. Back-end administrators and front-end users can extract, update, delete, and insert data based on this structure.

Figure 4: A showcase of the Item and Collection relation (Click to enlarge)
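To make the showcase concrete, the sketch below inserts a hypothetical collection and item and then joins the two tables so that each item is listed with the collection it belongs to. All sample values are invented for illustration.

    -- Sketch: inserting and retrieving data through the PK-FK structure.
    INSERT INTO Collections (Collection_ID, Collection_name)
    VALUES (1, 'Postcards');                 -- a hypothetical collection

    INSERT INTO Items (Item_ID, Title, Truss_type, File_name, Collection_ID)
    VALUES (101, 'Hyde Hall Covered Bridge', 'Burr arch',
            'hyde_hall_001.tif', 1);         -- the FK links the item to collection 1

    -- Join the tables to list each item with its collection name.
    SELECT i.Title, c.Collection_name
    FROM   Items i
    JOIN   Collections c ON c.Collection_ID = i.Collection_ID;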

Possible Challenges

It is also crucial to consider the actions likely to take place before building the database, which is the main reason we did not mix items, collections, creators, and geographic information in the same table. First, we do not know whether the resource center has a single collection containing all items or multiple collections divided by some feature (material, format, year, etc.). Separating the collection metadata (e.g., Collection_name) allows more collections to be added to the database without affecting existing collection data. Second, the collection may come to be associated with state archives, national collections, or institutional repositories. Because SQL is a unified and widely shared query language among database management practitioners, a database management system can keep the collection data accessible while connecting with other institutional repositories, databases, or software systems. Finally, our ultimate goal for this project is to provide personalized information. In today’s big data environment, data is not in short supply; the value lies in personalized, organized, filtered, and cleaned data. Information with real value comes from processed data.

Using database management, visualizations can display only the elements a user seeks. For instance, a historian may focus on the cultural significance of covered bridges in New York State; as the user submits their request, the relevant data is selected from the database and visualized on the map. More advanced SELECT statements in SQL could even surface related information based on a user’s search history. If a user is interested in architectural information, the database could select on truss type or other architectural elements that might be added. All of the functions mentioned above rely on a stable, secure, carefully scrutinized database design.
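A sketch of such a SELECT statement follows, filtering on truss type and region for an architecture-focused user. It assumes the [Geographic] table carries an Item_ID Foreign Key and a State attribute; both are illustrative assumptions.

    -- Sketch: returning only the elements an architecture-focused user seeks.
    -- The Item_ID join column and State column are illustrative assumptions.
    SELECT i.Title, i.Truss_type, g.Latitude, g.Longitude
    FROM   Items i
    JOIN   Geographic g ON g.Item_ID = i.Item_ID
    WHERE  i.Truss_type = 'Burr arch'     -- the user's architectural filter
      AND  g.State = 'New York';          -- the user's regional filter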


Reflections

  • At the onset of this project for the Theodore Burr Covered Bridge Society (Historic Oxford, NY, n.d.), my team had two questions to consider: What type of data product would users be interested in, and what challenges would we need to account for in creating our data product’s workflow?


    Early in the development of our data product, we settled on creating a map that would allow users to browse the collection of covered bridge materials geographically. This decision was formed after careful consideration of the needs of the users we predict will utilize the collection. As noted by Patil, it is imperative that data product designers consider whether users would want or need the product before they begin development (2012, p. 4). And, once development starts, it is similarly paramount to have “user needs drive the design process” (Banerjee & Reese, 2019, p. 3).


    Digital collections often suffer from inadequate interfaces that offer poor browsing opportunities and depend too heavily on a search bar—a method that “demands a query, discourages exploration, and withholds more than it provides” (Whitelaw, 2015). Digitizing primary sources allows individuals all over the world to view materials that once required travel to see. Moreover, digitized collections that are displayed innovatively allow users to interact with, browse, and view the collection in new ways. What patterns can emerge if we arrange materials in a timeline or geographically, patterns we could not detect when the material was a stack of physical copies? As noted by Cameron, “digital cultural heritage is a distinctly late modern idea and… reveals how we use data to connect with the past, and how we seek to use them to fortify our future” (2021, p. 10). Because covered bridges are architectural constructions, the ability to browse them by physical location would not only increase the findability of items but could also reveal patterns of note to users.


    Creating such a map, however, presents some difficulties which may feel insurmountable to individuals new to metadata and data tools. The work of metadata creation should not be underestimated; as noted by Patil, “one of the biggest challenges of working with data is getting the data in a useful form” (2012, p. 4). Creating a workflow that encourages clean metadata creation and could be replicated by the Theodore Burr Covered Bridge Center was therefore paramount to the success of the project. As an individual who works for the Connecticut Digital Archive (n.d.) and is involved with research focused on lowering the barrier to entry for small institutions and/or marginalized communities, I feel the conundrum of metadata keenly. Metadata increases the findability of resources but can also be prohibitive for small institutions with little funding and experience. As someone with training in metadata, I believe it is crucial that we develop ways to streamline metadata creation to ease the burden on these small institutions and ensure more institutions can share their resources with the world.


    Therefore, I spearheaded the creation of a workflow that would hopefully ease the burden of metadata creation for the Theodore Burr Covered Bridge Center and allow the center to duplicate our process and create a Tableau map. This workflow uses a Google Form that eases metadata entry through detailed instructions guiding the process and multiple-choice questions for elements with only a few possible answers. The metadata is interoperable with New York Heritage Digital Collections (2019), in case the bridge center decides to share its resources with them. Additionally, authority control lists were created, as well as a truss-type vocabulary (which borrows from the Getty Art & Architecture Thesaurus, AAT), to boost interoperability and accuracy. Finally, I created a video tutorial to help volunteers create metadata in the future.

    References


    Banerjee, K., & Reese, T., Jr. (2019). Building digital libraries: A how-to-do-it manual for librarians (2nd ed.). ALA Neal-Schuman.


    Cameron, F. R. (2021). Refiguring digital cultural heritage and curation in a more-than-human world. Routledge.


    Connecticut Digital Archive. (n.d.). Welcome to the Connecticut digital archive. https://ctdigitalarchive.org/


    Historic Oxford, New York. (n.d.). Theodore Burr covered bridge resource center. http://www.oxfordny.com/community/library/burr-bridge-resource-center.php


    New York Heritage Digital Collections. (2019, January). Metadata dictionary and usage guide (version 5). Empire State Library Network. https://nyheritage.org/sites/default/files/extras/NYH-MetadataDictionary-V5.pdf


    Patil, D. J. (2012). Data jujitsu: The art of turning data into product. O’Reilly Media, Inc.


    Whitelaw, M. (2015). Generous interfaces for digital cultural collections. Digital Humanities Quarterly, 9(1).

  • This project has been an engaging experience in which I was able to leverage my previously developed professional skills in an information science context. During my undergraduate education at Stockton University, I worked for a year as a web content management intern for the institution's fiftieth-anniversary initiative. In that role, I learned how to create intuitive and engaging web pages, as well as design principles that encourage web accessibility, especially for those who are blind or have vision impairments. In my work for this project, I have made sure to make the web page more accessible by implementing Hypertext Markup Language (HTML) heading elements and alt text for images. Sharing my group's collaborations online has thus allowed me to build upon previous experiences.


    This project has also prompted me to consider why sharing library workflows is essential. Librarians understand how to organize their information based on their educational and workplace experiences. However, what we do may appear opaque or highly technical to those outside the LIS field. To overcome this problem, librarians must make their data practices more accessible (a different sense of the word than above) to colleagues and users. A web page like this one both makes digital data services more transparent and prompts librarians to reflect on their methodologies.

  • We began this project with the aim of producing a visualization product for users. That starting point led us to mapping in Tableau, then to setting the metadata, and finally to data modeling, an extension of the collection's design. More ideas flowed in as we worked on this project, and it provided a unique opportunity for us to demonstrate the principles of data curation. I personally gained experience applying skills I learned in the Database Management course to an actual archival setting. The data model feeds the visualization and keeps up with trends in big data management, user-friendly interfaces, and system development. It also works toward the goal of the FAIR principles by making the data more findable, accessible, interoperable, and reusable.

    It has been satisfying to work through my ideas about collection data and information and to play with the logical relations between entities. This project stands as proof that archival collections also need systematic approaches to managing their data in order to provide better service to their users. Structured data exists not only in business but also in areas that may seem less ‘technical’. Data modeling, if designed correctly, can speed up the management of digitized collection objects and support archivists and librarians in their work. However, as mentioned in the data modeling guide, challenges emerge alongside the benefits. Databases are not perfect; creating a well-functioning database for future use requires careful thinking and design. One has to anticipate the possible uses of the database so that every possible action has an outcome, whether an error response or a direction to another data set. The payoff is long-term, but eventually the work proves its worth.


Resource List

Metadata Resources

  • New York Heritage Metadata Dictionary: This dictionary explains how to create metadata that can be used by institutions that wish to join New York Heritage. Its detailed instructions and examples make it a powerful tool for helping individuals create metadata records.

  • Metadata Matters: The Basics by New York State Archives: This is a 38-minute webinar describing the various basics of metadata. It is intended to help individuals who are just getting started with metadata.

  • Capital District Library Council LibGuide on Metadata: This guide provides tips on how your collection can be added to New York Heritage’s hosted collections. It includes information about metadata, digitization, rights statements, and copyright.

Tableau Resources

  • Tableau Public: Tableau Public is a free version of Tableau. It can create maps, graphs, and other visualizations which can be shared.

  • Tableau: Get Started with Tableau Desktop: This website provides directions on how to use Tableau, and has numerous video tutorials which can be followed.

Data Modeling Resources

  • Diagrams.net: A free web-based tool for creating conceptual data models and logical data models.

  • Data Modeling Tutorial: A very basic introduction to data modeling that could help to better understand this project.

Extra Resources

  • OpenRefine: This is a free tool that can be used to clean up messy metadata. If your metadata contains mistakes, OpenRefine can help you fix them; its website offers the download and numerous video tutorials.