Media Ingest

media ingest diagram

Cluster

Information about the QA and Production clusters to help people who want to access the machines.

QA/Development Tier

This tier is called both “QA” and “Development”. It’s the same tier. It’s our one-and-only pre-production tier.

Apps	IPs
Filewatcher, FilePusher	10.224.6.14
Capture Creator, ET Enqueuer, S3Uploader	10.224.6.15
JSON Bag Validator	10.224.6.16

Production

Apps	IPs
Filewatcher (VideoProcessor)	10.224.6.32
FilePusher (FilestorePusher)	10.224.6.31
Capture Creator, ET Enqueuer, S3Uploader	10.224.6.33, 10.224.6.34, 10.224.6.36
JSON Bag Validator	10.224.6.17, 10.224.6.19, 10.224.6.20

Access

All machines have a sudo-powered user named developer who can login with with nypl-digital-dev private key (in parameter store).

RabbitMQ

Many applications in media-ingest talk to each other via RabbitMQ queues.
Read more about our Rabbit setup here.

Special Directories

“hot folder” is where PAMI drops bags to be seen by FileWatcher. “upload purgatory” is where FileStorePusher drops files that will uploaded by S3Uploader.

QA

hot folder: /ifs/prod/video_ingest/qa_parking_lot
upload purgatory: /ifs/prod/video_ingest/qa_media_ingest_pending_s3_upload

Production

hot folder: /ifs/prod/parking/lpa/Video
upload purgatory: /ifs/prod/parking/media_ingest_pending_s3_upload

Deploying

We use NYPL’s Bamboo to build & deploy all of the applications.

Eccentricities

Some (mostly QA/Development) tiers are deployed through a build, not deploy job. Other apps have distinct build, and deploy jobs.
Some deployments work by bamboo calling Capistrano scripts contained in the app’s repo. Other deployments SSH into the machines directly and do the work. (git pull origin [BRANCH] && systemctl restart [SERVICE-NAME])

We should seek to unify how the apps are deployed.

In general, The QA/Development tier is deployed from the apps’ “qa” branch and the production tier is deployed from the “production” branch. We should seek to have each app’s README document its git/deploy workflow.

Re-deploying is a good way to restart services.

The Application Machines…

SSH Access

As of writing, we have user-specific linux users, tied to actual people’s SSH keys and/or passwords to get onto the machine. SEB-1580 is a ticket for getting a general developer user account for those machines.

Running Commands

systemd

All machines use systemd’s systemctl command to stop/start/restart processes. Each application’s Capistrano scripts end up calling systemctl commands, post deploy to restart the process. (example) .
See each applications’s “Process Control” below for info about how to manually stop/start processes.

Finding WHERE Apps Are on a Machine

systemd has a configuration .service file that has stop/start commands and rules.
If you cat the .service file, the rules usually tell you where the app lives.

Example (JSON bag parser Rabbit Consumer):

$ systemctl status json_bag_parser_rabbit_consumer.service

● json_bag_parser_rabbit_consumer.service - JSON Bag Parser Rabbit Consumer
   Loaded: loaded (/etc/systemd/system/json_bag_parser_rabbit_consumer.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2019-10-11 14:28:17 EDT; 6 days ago
 Main PID: 26644 (ruby)
   CGroup: /system.slice/json_bag_parser_rabbit_consumer.service
           └─26644 ruby /opt/cap/json_bag_parser/current/consume_json_bags_from_rabbit.rb run

This tells you the config file lives in /etc/systemd/system/json_bag_parser_rabbit_consumer.service.

cat /etc/systemd/system/json_bag_parser_rabbit_consumer.service
[Unit]
Description=JSON Bag Parser Rabbit Consumer

[Service]
Type=simple
User=git
Group=git
WorkingDirectory=/opt/cap/json_bag_parser/current/
ExecStart=/usr/bin/bash -lc 'bundle exec ruby /opt/cap/json_bag_parser/current/consume_json_bags_from_rabbit.rb run'
TimeoutSec=30
RestartSec=15s
Restart=always

This tells you the app lives in /opt/cap/json_bag_parser/current.

To list all processes under systemd’s control.

$ systemctl list-units
UNIT                                                                                          LOAD   ACTIVE SUB       DESCRIPTION
proc-sys-fs-binfmt_misc.automount                                                             loaded active waiting   Arbitrary Executable File Formats File System Automount Point
sys-devices-pci0000:00-0000:00:11.0-0000:02:01.0-ata1-host2-target2:0:0-2:0:0:0-block-sr0.device loaded active plugged   VMware_Virtual_SATA_CDRW_Drive RHEL-7.2_Server.x86_64
sys-devices-pci0000:00-0000:00:15.0-0000:03:00.0-host0-target0:0:0-0:0:0:0-block-sda-sda1.device loaded active plugged   Virtual_disk 1
sys-devices-pci0000:00-0000:00:15.0-0000:03:00.0-host0-target0:0:0-0:0:0:0-block-sda-sda2.device loaded active plugged   LVM PV QkZtEs-qOli-IwqV-wtHW-mWvo-Et6A-F4ltKI on /dev/sda2 2
...snip
json_bag_parser.service                                                                       loaded active running   JSON Bag Parser Resque Worker
json_bag_parser_rabbit_consumer.service                                                       loaded active running   JSON Bag Parser Rabbit Consumer
kdump.service                                                                                 loaded active exited    Crash recovery kernel arming
kmod-static-nodes.service                                                                     loaded active exited    Create list of required static device nodes for the current kernel
lvm2-lvmetad.service                                                                          loaded active running   LVM2 metadata daemon

systemctl list-units is verbboooossseee so maybe pipe it to grep with a likelike search term like:

$ systemctl list-units | grep -i json
json_bag_parser.service                                                                          loaded active running   JSON Bag Parser Resque Worker
json_bag_parser_rabbit_consumer.service                                                          loaded active running   JSON Bag Parser Rabbit Consumer

Applications

FileWatcher (Processor)

Watches for bags, validates them, and sends them to be processed.

Listens to:

SQS Queue that gets pushed to via S3 Events when S3Uploader uploads a file.

Process Control:

sudo systemctl [status|start|stop|restart] elastic-transcoder-enqueuer.service

Source control:

https://bitbucket.org/NYPL/media-ingest-elastic-transcoder-enqueuer

URLSigner

Source control:

https://github.com/NYPL/rights-aware-cloudfront-url-signer

IngestReporter

Lives at: http://ingest-reporter.nypl.org/ It’s hosted on Elastic Beanstalk. It’s a rails app with a Rabbit consumer that runs alongside it to insert records in its database once a file is done being ingested.

Listens to:

FileStorePusher (via Rabbit)

Source control:

https://bitbucket.org/NYPL/ingest_reporter

mock_filewatcher

MockFilewatcher is a Ruby gem for testing media-ingest.
Source control: https://github.com/NYPL/mock_filewatcher

Log Aggregation & Failure Notification

Logs

All services have their logs aggregated and set to our loggly account. OPS (Ho-Ling/Brett) are admins and can give you an account.

Failure Notifications

We use loggly alerts to get notified about exceptions. Like CloudWatch alerts, they are filters/metrics that send an email once they reach a certain threshold.

The alerts are configured to send an email to media.ingest.failures@nypl.org. We’ve configured Jira to poll that inbox and automatically create tickets in a project named MIF for each exception.

Metadata Tools

ami-tools
ami-data
MediaIngest JSON schema repository
Mapping of metadata-JSON to MODS