Ingest Issues Troubleshooting

 

  1. Ingest Issues Troubleshooting
    1. How are issues spotted?
      1. Exception log file
      2. Workflow status on the Preservica website ingest monitor
    2. Specific issues and how we resolve them
      1. Files with unexpected extensions or incorrect file naming conventions
        1. Example 1: Incorrect metadata file
        2. Example 2: Incorrect folder names
      2. Filename encoding issues
      3. Filenames with non-XML compatible characters
      4. Virus detected in the file
        1. Example

Ingests into Preservica encounter issues from time to time. The Digital Preservation program uses the methods below to identify the causes and resolve them.

How are issues spotted?

We check two places to see whether the ingest process has hit a problem.

Exception log file

The “exception log” is a plaintext file generated by the ingest script, and it lists any errors encountered by the ingest script as well as the file(s) that are not included in the package staged for ingest. This is the first place to check.

Workflow status on the Preservica website ingest monitor

After logging into Preservica, click Ingest > Monitor. The page displays the status of all active and completed processes.

If the ingest workflow completed successfully, there will be a green check mark for the package. If the workflow has issue(s), there will be an orange check mark. At the bottom of the page, a section displays a running log of “Info”, “Warning”, and “Error” messages. Successful workflows can sometimes include “Warning” and “Error” messages for some issues, such as filename encoding errors.

To check the details of an ingest’s status, follow the steps below:

  1. Log into Preservica
  2. Click Ingest > Completed
  3. Set up the Filter to show workflows created by “All Users”, within the date range you’re interested in, and check “All Workflow Instances”. The date filter assumes a time of midnight, so choose a date one day after the one you are interested in.
  4. Any workflow shown in orange or red text needs to be reviewed
  5. To review what step went wrong, click the link in the “Workflow Context” field
  6. Check if any “State” field does not have a green check mark
  7. Click into “List all child workflows” to see if any workflow context has issues, which would be shown in red text
  8. Click into the “Workflow Context” for the child workflow shown in red text, and the “Step Progress” table will show which step blocked the ingest, such as “Virus Check”

Specific issues and how we resolve them

Files with unexpected extensions or incorrect file naming conventions

We review the files listed in the exception log individually. The exceptions are defined in the ingest script. Specifically, they catch files with unexpected extensions and files or folders that do not follow the naming conventions.

Example 1: Incorrect metadata file

Our program expects the Born Digital Archives metadata file to be a comma-separated values or tab-separated values file, with a .CSV or .TSV extension respectively.
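
As a rough illustration of this check, the sketch below flags metadata filenames whose extension is neither .CSV nor .TSV. It is not the ingest script itself: the function names and the sample filenames are hypothetical, and the comparison is assumed to be case-insensitive.

```python
from pathlib import Path

# Extensions named in the text above; case-insensitive matching is an assumption.
EXPECTED_METADATA_EXTENSIONS = {".csv", ".tsv"}


def has_expected_metadata_extension(filename: str) -> bool:
    """Return True if the filename ends in .csv or .tsv, ignoring case."""
    return Path(filename).suffix.lower() in EXPECTED_METADATA_EXTENSIONS


def find_unexpected_metadata_files(filenames: list) -> list:
    """Return the filenames that would need individual review."""
    return [name for name in filenames if not has_expected_metadata_extension(name)]


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    for name in find_unexpected_metadata_files(["metadata.CSV", "metadata.txt"]):
        print(f"unexpected metadata extension: {name}")
```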

In some instances, we have found metadata files that are plaintext files, with the .TXT extension. We review the files individually with the Digital Archives program and decide what the solution should be. The most common cause of a plaintext metadata file is selecting the wrong export file type when the archivist exports the metadata.

In this case, the content of the file is the same, so our program migrates the data from the plaintext file to the .CSV format.

Sometimes the file is an artifact of a process that is no longer necessary.

In this case, the file is deleted.

Example 2: Incorrect folder names

Our program expects specific naming conventions for some folders and files. For Born Digital Archives, the package folder name should conform to the regular expression M[0-9]+_(ER|DI|EM)_[0-9]+, where ER means Electronic Records, DI means Disk Images, and EM means Email. A valid example is M1126_ER_1.
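
The naming rule can be tested with a few lines of Python. This is a minimal sketch rather than the ingest script's own code: the function name is hypothetical, and it assumes the whole folder name must match the pattern, not just part of it.

```python
import re

# Pattern quoted from the naming convention above.
PACKAGE_NAME_PATTERN = re.compile(r"M[0-9]+_(ER|DI|EM)_[0-9]+")


def is_valid_package_name(folder_name: str) -> bool:
    """Return True if the entire folder name matches the convention."""
    return PACKAGE_NAME_PATTERN.fullmatch(folder_name) is not None


if __name__ == "__main__":
    for name in ["M1126_ER_1", "M1234_ER1"]:
        status = "ok" if is_valid_package_name(name) else "needs review"
        print(f"{name}: {status}")
```

Using fullmatch rather than match or search means a name with extra trailing text is also reported for review.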

In one collection, we found packages named M1234_ER1, M1234_ER2, etc. When similar cases show up, our program reviews the packages with the Digital Archives program to decide what the solution should be.

This issue usually results from human error: the rest of the structure is correct, and only the folder names need to be updated. Therefore, after the review process with Digital Archives staff, we update the folder names to M1234_ER_1 and M1234_ER_2.
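
For this particular mistake, a missing underscore before the package number, a corrected name can be proposed automatically and then confirmed during review. The sketch below only suggests a name and does not rename anything on disk. The helper name is hypothetical, and it handles only this one pattern of error.

```python
import re
from typing import Optional

# Matches names like M1234_ER1, where the underscore before the number is missing.
MISSING_UNDERSCORE_PATTERN = re.compile(r"(M[0-9]+)_(ER|DI|EM)([0-9]+)")


def suggest_package_name(folder_name: str) -> Optional[str]:
    """Return a corrected name for the missing-underscore case, else None."""
    match = MISSING_UNDERSCORE_PATTERN.fullmatch(folder_name)
    if match is None:
        return None
    collection, package_type, number = match.groups()
    return f"{collection}_{package_type}_{number}"


if __name__ == "__main__":
    for name in ["M1234_ER1", "M1234_ER2"]:
        print(f"{name} -> {suggest_package_name(name)}")
```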

Filename encoding issues

Filenames may include UTF-8 characters with unclear or unusable renderings. Two common sources of these issues are Private Use Area (PUA) characters and control characters such as ASCII BEL.

To view the byte values of a specific unrenderable character, we use a hex viewer such as xxd. For instance, to see the hex values of the filename of a file called digital_preservation.docx, we can run this command in the terminal:

    ls -1 path/digital_preservation.docx | xxd

In two collections, our program discovered files with a hidden character, \x7f, in their filenames. After some research, we found that \x7f is DEL, the ASCII delete control character.

Because there did not seem to be a reason for this character, we delete it. Another option is to replace it with the Unicode visual representation of the character, in this case U+2421 (␡).
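
Both options can be sketched in Python. The functions below are illustrative and not part of the ingest script: they treat the ASCII C0 control characters (\x00 to \x1f) and DEL (\x7f) as the characters of interest, which is an assumption about scope, and the sample filename is made up.

```python
SYMBOL_FOR_DELETE = "\u2421"      # visual stand-in for DEL
CONTROL_PICTURES_BASE = 0x2400    # U+2400..U+241F depict \x00..\x1f


def is_control(ch: str) -> bool:
    """True for ASCII C0 control characters and DEL."""
    return ord(ch) < 0x20 or ord(ch) == 0x7F


def find_control_characters(filename: str) -> list:
    """Return (position, escaped hex value) for each control character."""
    return [(i, f"\\x{ord(ch):02x}") for i, ch in enumerate(filename) if is_control(ch)]


def remove_control_characters(filename: str) -> str:
    """Option 1: delete the control characters."""
    return "".join(ch for ch in filename if not is_control(ch))


def show_control_characters(filename: str) -> str:
    """Option 2: replace each control character with its Unicode picture."""
    def picture(ch: str) -> str:
        if ord(ch) == 0x7F:
            return SYMBOL_FOR_DELETE
        return chr(CONTROL_PICTURES_BASE + ord(ch))
    return "".join(picture(ch) if is_control(ch) else ch for ch in filename)


if __name__ == "__main__":
    name = "digital\x7f_preservation.docx"  # made-up example
    print(find_control_characters(name))
    print(remove_control_characters(name))
    print(show_control_characters(name))
```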

In one collection, we discovered files with the bytes \xEF \x80 \xA1, which are the UTF-8 encoding of the PUA character U+F021, used by Microsoft to represent * in filenames created on HFS and transferred to NTFS.

In this case, we convert the unrenderable bytes to the original intended character *.
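
A minimal sketch of that conversion follows. The mapping table holds only the one character discussed here; any other PUA character is reported rather than guessed at. The function names and the sample filename are hypothetical.

```python
# PUA code point -> intended character. Only the mapping described above is included.
PUA_REPLACEMENTS = {"\uf021": "*"}

PUA_START, PUA_END = 0xE000, 0xF8FF  # Basic Multilingual Plane Private Use Area


def find_pua_characters(filename: str) -> list:
    """Return the code points (as U+XXXX strings) of PUA characters in the name."""
    return [f"U+{ord(ch):04X}" for ch in filename if PUA_START <= ord(ch) <= PUA_END]


def replace_known_pua_characters(filename: str) -> str:
    """Swap PUA characters with a known meaning for the intended character."""
    return filename.translate(str.maketrans(PUA_REPLACEMENTS))


if __name__ == "__main__":
    name = "draft\uf021.txt"  # made-up example
    print(find_pua_characters(name))
    print(replace_known_pua_characters(name))
```

Since * is not allowed in filenames on Windows filesystems, this replacement is only appropriate where the files are stored on a filesystem that accepts it.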

Filenames with non-XML compatible characters

Preservica uses XML to store and transact metadata. In XML files, the characters &, <, >, ", and ' must be escaped when they appear in a value. As a result, any of these characters in a filename must be escaped wherever the filename appears in the XML. The ingest script does this escaping, but it may occasionally fail on complex cases.

When that happens, the actual filename does not need to be changed. The XML file is updated manually with the correct escaping.
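
To work out the correct escaped form for a manual fix, Python's standard library can do the escaping. This is a sketch, not the ingest script's own routine. The function name and sample filename are hypothetical.

```python
from xml.sax.saxutils import escape

# escape() handles &, <, and > by default; quotes are added through the entities map.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_filename_for_xml(filename: str) -> str:
    """Return the filename with the five XML special characters escaped."""
    return escape(filename, QUOTE_ENTITIES)


if __name__ == "__main__":
    print(escape_filename_for_xml("Q&A <draft> \"final\" 'v2'.txt"))
```

The ampersand is escaped first, so the entities produced for the other characters are not escaped a second time.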

Virus detected in the file

Preservica scans each package for viruses with ClamAV, an open-source antivirus engine, before ingesting it. If a virus is detected in any file in a package, the ingest workflow halts and the package is not ingested.

After the workflow is halted, our program researches the specific computer virus to understand its effects on modern computer environments and assess the risks. Depending on the results, we decide whether to ingest the file.

If the virus still poses a risk, we remove the infected file from the package before ingest.
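
To identify which files to research or remove, it can help to scan a local copy of the package with ClamAV's clamscan command and pull the detections out of its report. The sketch below assumes the usual clamscan report format, in which each detection is a line of the form "path: SignatureName FOUND". The function name, the sample report, and its paths are made up for illustration. Preservica's own workflow messages may be formatted differently.

```python
FOUND_SUFFIX = " FOUND"


def parse_clamscan_report(report: str) -> dict:
    """Map each infected file path to the signature name clamscan reported."""
    detections = {}
    for line in report.splitlines():
        line = line.rstrip()
        if not line.endswith(FOUND_SUFFIX):
            continue  # skip "OK" lines, blank lines, and the scan summary
        # Split on the last ": " so that colons inside the path are preserved.
        path, _, signature = line[: -len(FOUND_SUFFIX)].rpartition(": ")
        if path:
            detections[path] = signature
    return detections


if __name__ == "__main__":
    # Made-up report text, for illustration only.
    sample = (
        "M1234_ER_1/objects/letter.doc: Win.Trojan.Cap-1 FOUND\n"
        "M1234_ER_1/objects/notes.txt: OK\n"
    )
    for path, signature in parse_clamscan_report(sample).items():
        print(f"{signature}: {path}")
```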

In some cases, such as when the virus specifically targeted now-obsolete software, we ingest the file into the repository. To do so, we temporarily alter the virus checking stage of the ingest workflow.

  1. Log into Preservica
  2. Click Ingest > Manage
  3. Click “Workflow Error Configuration” for the Ingest Workflow context
  4. Scroll to the step that says “The Virus Check step found a virus in the package”
  5. On the status drop-down menu, it should show “Abort workflow”. Change it to “Continue workflow”
  6. Click Ingest > Monitor
  7. Find the package that needs to be re-ingested
  8. Click the menu icon on the top right of the box. It appears as three vertical dots.
  9. Click “Resubmit” and the workflow will run again
  10. After the package is ingested, repeat steps 2 to 4
  11. On the status drop-down menu, change it back to “Abort workflow”

Example

In one collection, we found a Windows Trojan virus named “Win.Trojan.Cap-1” in ClamAV’s virus registry. After some research using the Internet Archive, we found that this computer virus, “CAP”, was most likely a Microsoft Word macro virus. This Microsoft Security Intelligence page, this Internet Archive capture, the Virus Encyclopedia, and F-Secure give us the information most relevant to this virus.

A macro, in computer programming, is a rule, pattern, or sequence of events that allows users to automate repetitive tasks. In the Microsoft Office suite, users can create macros to record a set of actions that they want to run as many times as they’d like.

The Virus Encyclopedia describes the behavior of this virus as follows:

When an infected file is opened, CAP removes the macros in NORMAL.DOT and replaces them with its own. It removes the options of Macros and Customize under the Tools drop menu, as well as Templates under file. If there is an icon on the toolbar, it will still be there, but it will not function. When the macros are decrypted, the following text can be seen: ‘C.A.P: Un virus social.. y ahora digital.. ‘“j4cKy Qw3rTy” (jqw3rty@hotmail.com). ‘Venezuela, Maracay, Dic 1996. ‘P.D. Que haces gochito ? Nunca seras Simon Bolivar.. Bolsa ! This translates into: “‘C.A.P: A social virus, and now a digital one. (The next two lines are about the creator and the time and location of the virus’s creation.) PS, What are you doing little cowboy? You will never be Simon Bolivar! Stupid!

In this case, we decided to ingest files with this virus for several reasons.

  1. Over the years, Microsoft has introduced many protections against these viruses. Since 2022, macros in files from the internet are blocked by default in Microsoft Office.
  2. Microsoft also added more warnings before the user can enable a macro (see 25 years on, Microsoft makes another stab at stopping macro malware)
  3. On top of the interventions from Microsoft, NYPL’s processes for accessing these Microsoft Word documents create surrogates of the original files that cannot contain macros.

With these considerations, we decided to ingest the files.