I made a mistake prepping some metadata files the other day. I picked the wrong three-letter code for a division field for use in a spreadsheet, exported the rows of the spreadsheet to 1200 structured data files we use, and then updated those files with technical metadata. So how to repair that mistake?

Option 1: Go back to my spreadsheet, update the columns, update some dependent columns, re-export, and re-update.

Option 2: Copy my Python script to update the technical metadata and rewrite it to update the fields with the typo.

Option 3: Write a sed command.

For me, the right answer was sed.

sed, the stream editor

Here’s the Wikipedia definition: “a Unix utility that parses and transforms text, using a simple, compact programming language.” In other words, sed edits text as it reads it, according to whatever short editing recipe you give it. For my purposes, the goal is replacing one three-letter string with another three-letter string. Let’s dig in.

# Replace typo string with correct string
# sed, edits a stream of text
# 's/"doh/"div/g', recipe for the edit, '/' divides the parts of the recipe
## s, substitute - replace one string with another
## "doh, string to look for
## "div, string to replace it with
## g, global - perform the substitution for every match on a line, not just the first
# /path/to/file

> sed 's/"doh/"div/g' /path/to/file

That command prints the result to your terminal, which is always good practice while you’re building a recipe.
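If the file is long, a handy way to preview exactly what would change is to diff the original against sed’s output:

# Preview the changes before touching anything
# diff compares the original file against sed's output
# '-' tells diff to read its second input from the pipe

> sed 's/"doh/"div/g' /path/to/file | diff /path/to/file -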

That recipe performs global replacements, so defining the strings too broadly might lead to problems. For example, if I just looked for ‘doh’, I could overwrite ‘Play-doh’ as ‘Play-div’ (and a case-insensitive search could even turn ‘Doha Airport’ into ‘Diva Airport’). I knew my typos always occurred with a preceding ‘”’, so I included that ‘”’ in my recipe.
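You can see the difference by testing both recipes on a throwaway string; the broad recipe rewrites every ‘doh’, while the anchored one leaves ‘Play-doh’ alone:

# Test a recipe on a sample line before touching real files
# echo pipes the sample text into sed

> echo 'code: "doh", toy: Play-doh' | sed 's/doh/div/g'
> echo 'code: "doh", toy: Play-doh' | sed 's/"doh/"div/g'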

Even more powerfully, sed can immediately write changes back to the original file if you add a specific flag.

# Replace typo string with correct string and rewrite original file
# sed, edits a stream of text
# -i '', save the edited text back to the input file (for Mac)
# 's/"doh/"div/g', recipe for the edit, '/' divides the parts of the recipe
# /path/to/*.json, path wildcard to match all the JSON files in a directory

> sed -i '' 's/"doh/"div/g' /path/to/*.json

Once the recipe is performing as expected, the ‘-i’ flag writes the edits back to the file instead of printing the result to the terminal. Using the ‘*’ pushes the automation one step further by opening all of the files that match the path pattern. In one fell swoop, that’s 2400 corrections across 1200 files for me.
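If rewriting files in place feels risky, the same flag accepts a suffix instead of the empty string, and sed keeps a backup copy of each original (on Mac; GNU sed on Linux wants the suffix attached, as in -i.bak):

# Same substitution, but keep each original as filename.json.bak
# -i '.bak', write the edits back and save the original with a .bak extension

> sed -i '.bak' 's/"doh/"div/g' /path/to/*.json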

The tool can do much more, but I only really know how to do text substitution with it. If you want to go really deep, there’s a grymoire for it.
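For a small taste of what else it can do, sed will also delete or print lines by pattern or by line number:

# Delete every blank line in a file (prints the result, doesn't change the file)
> sed '/^$/d' /path/to/file

# Print only lines 1 through 5 (-n suppresses the default printing)
> sed -n '1,5p' /path/to/file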

Structured Data is Nice for Digitization Projects

A side note about metadata editing tools: the format of a metadata container will often define your options for editing.

Using spreadsheets as the primary container for metadata would mean using spreadsheet software to perform any edits. There’s a longer post to write comparing spreadsheets and structured data files, but the basic point is that spreadsheets offer enormously powerful tools within a single spreadsheet. If you have more spreadsheets, you need to repeat those actions in each one.

On the other hand, structured data formats are readable by a number of bash tools and programming languages. And those tools often have idioms for applying one change to a large number of files. For example, using more wildcards expands my reach from a single directory to multiple directories.

# Replace typo string with correct string and rewrite original file
# sed, edits a stream of text
# -i '', save the edited text back to the input file (for Mac)
# 's/"doh/"div/g', recipe for the edit, '/' divides the parts of the recipe
# /path/to/project*/*.json, path wildcard to match all the JSON files in directories that start with 'project'

> sed -i '' 's/"doh/"div/g' /path/to/project*/*.json

And the really nice thing about structured data formats is that they are stored as text, so you can use both tools optimized for text processing, like sed, and tools that understand the data structure itself, like XML or JSON parsers.
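As a sketch of the structure-aware route, the same kind of check with jq might look like this, assuming the files have a top-level field named ‘division’ (the field name here is just an illustration):

# Print the division field from every JSON file, to spot any stray typos
# jq, a command-line JSON processor (installed separately)

> jq '.division' /path/to/*.json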

Conclusions

If you want to see the possibilities, I’d recommend pulling together some XML files in EAD, PBCore, or whatever flavor you tend to use and trying out some manipulations.
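A low-stakes first manipulation is counting the lines that contain a given element across a batch of files before you change anything; something like the following, swapping in an element from your own schema:

# Count lines containing an element across a batch of XML files
# grep -c, print a per-file count of matching lines

> grep -c '<unittitle>' /path/to/*.xml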