Compute Resources and Data Storage
The two main computers used in the ingestion pipeline are a Dell XPS 15 laptop and a newer Exxact workstation. The workstation is the better choice for memory-intensive and multi-core data processing. It can be used directly in the lab or accessed over SSH to run processing jobs. With VS Code's Remote-SSH extension, you can connect and edit files over SSH without using command-line editors. To start a connection, click the green icon in the bottom-left corner of the VS Code window. The IP address for the workstation is 128.208.238.117.
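Remote-SSH reads host entries from the OpenSSH client config file (usually ~/.ssh/config), so a saved entry spares you retyping the address. The sketch below only builds the text of such an entry; the alias "cmap-workstation" and the user name "your_username" are placeholders, not values taken from the lab setup.

```python
def ssh_config_entry(alias: str, hostname: str, user: str) -> str:
    """Return an OpenSSH client config stanza for a single host."""
    return (
        f"Host {alias}\n"
        f"    HostName {hostname}\n"
        f"    User {user}\n"
    )


if __name__ == "__main__":
    # The alias and user name are hypothetical; only the IP address
    # comes from this document.
    print(ssh_config_entry("cmap-workstation", "128.208.238.117", "your_username"))
```

Appending the printed stanza to ~/.ssh/config should make the alias selectable from the Remote-SSH host list and usable as `ssh cmap-workstation` in a terminal.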
Data Flow
The web validator stores submitted datasets in Dropbox (Dropbox/Apps/<dataset_short_name>/<dataset_short_name_timestamp.xlsx>). After submission, the CMAP data team runs the dataset through the QC API. The outputs from the QC API are saved in Dropbox (Dropbox/Apps/<dataset_short_name>/iterations/1/propose). When the submitter approves the changes, a copy of the finalized dataset is added to the accept folder within the iteration folder structure, as well as to the final folder that ingestion pulls from (Dropbox/Apps/<dataset_short_name>/final). Only one file should be saved in the final folder for ingestion.
When a dataset submitted through the validator is ingested, the pipeline pulls the file from the final folder and creates a folder based on the table name in the vault/ directory.
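The folder layout described above can be sketched as a small path helper, together with a check that the final folder holds exactly one file before ingestion. This is an illustration rather than the pipeline's actual code: the function names are hypothetical, the location of the accept folder (next to propose inside the iteration folder) is an assumption, and ignoring hidden files is a choice made for the example.

```python
from pathlib import Path


def dataset_paths(apps_root, short_name: str, iteration: int = 1) -> dict:
    """Build the Dropbox folders used at each stage for one dataset."""
    base = Path(apps_root) / short_name
    iter_dir = base / "iterations" / str(iteration)
    return {
        "submitted": base,                # validator uploads land here
        "propose": iter_dir / "propose",  # QC API outputs
        "accept": iter_dir / "accept",    # assumed to sit beside "propose"
        "final": base / "final",          # ingestion reads from here
    }


def final_file(final_dir) -> Path:
    """Return the single file in final/, or raise if there is not exactly one."""
    files = [
        p for p in Path(final_dir).iterdir()
        if p.is_file() and not p.name.startswith(".")  # skip hidden files
    ]
    if len(files) != 1:
        raise ValueError(
            f"expected exactly 1 file in {final_dir}, found {len(files)}"
        )
    return files[0]
```

Calling `final_file(dataset_paths(apps_root, "my_dataset")["final"])` before ingestion would surface a stray second file early instead of ingesting the wrong one.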
Data Storage
The web application and the data ingestion pipeline share storage over Dropbox. With an unlimited account, we can use Dropbox to store all of our pre-DB data. In addition to the cloud copy, vault/ is also synced to the workstation under: ~/data/CMAP Data Submission Dropbox/Simons CMAP/vault/
For details on the vault structure, see Jira ticket 329 (https://simonscmap.atlassian.net/browse/CMAP-329)
├── assimilation
├── r2r_cruise
├── model
└── observation
    ├── in-situ
    │   ├── cruise
    │   │   ├── {table_name}
    │   │   ├── code
    │   │   ├── doc
    │   │   ├── metadata
    │   │   ├── nrt
    │   │   ├── raw
    │   │   ├── rep
    │   │   └── stats
    │   ├── drifter
    │   ├── float
    │   ├── mixed
    │   └── station
    └── remote
        └── satellite
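As a rough sketch, the branches in the tree above can be written as a lookup table so that a script resolves where a table's folder belongs. The names `VAULT_BRANCHES` and `table_dir` are hypothetical, and the tree only shows a {table_name} folder under cruise, so placing table folders under the other branches is an assumption.

```python
from pathlib import Path

# Branch paths relative to vault/, taken from the directory tree.
VAULT_BRANCHES = {
    "assimilation": "assimilation",
    "r2r_cruise": "r2r_cruise",
    "model": "model",
    "cruise": "observation/in-situ/cruise",
    "drifter": "observation/in-situ/drifter",
    "float": "observation/in-situ/float",
    "mixed": "observation/in-situ/mixed",
    "station": "observation/in-situ/station",
    "satellite": "observation/remote/satellite",
}


def table_dir(vault_root, branch: str, table_name: str) -> Path:
    """Return the folder for table_name under the given vault branch."""
    if branch not in VAULT_BRANCHES:
        raise KeyError(f"unknown vault branch: {branch!r}")
    return Path(vault_root) / VAULT_BRANCHES[branch] / table_name
```

Creating the folder is then a single call, for example `table_dir(vault_root, "cruise", table_name).mkdir(parents=True, exist_ok=True)`.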
Dropbox's CLI tools are installed on the workstation. Using Dropbox's selective sync feature, the vault/ directory stored on disk can be synced with the cloud. Reading from and writing to the local disk, rather than the cloud copy, should improve IO speeds for data processing.
If Dropbox has stopped syncing, you can start the Dropbox daemon and check its state by typing in a terminal:
dropbox start
dropbox status
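The same two commands can be wrapped in a script so that a processing job checks sync before it reads from vault/. This is a minimal sketch: the function names are hypothetical, and the assumption that a stopped daemon is reported with the words "isn't running" should be checked against the output of the installed CLI version.

```python
import subprocess


def needs_start(status_text: str) -> bool:
    """Return True if the status output suggests the daemon is not running."""
    # Assumption: a stopped daemon is reported with a message containing
    # "isn't running"; adjust the marker if the CLI words it differently.
    return "isn't running" in status_text.lower()


def dropbox_status() -> str:
    """Run `dropbox status` and return its output as text."""
    result = subprocess.run(
        ["dropbox", "status"], capture_output=True, text=True
    )
    return result.stdout.strip()


def ensure_dropbox_running() -> str:
    """Start the daemon if it appears stopped, then return the latest status."""
    status = dropbox_status()
    if needs_start(status):
        subprocess.run(["dropbox", "start"], check=True)
        status = dropbox_status()
    return status
```

Keeping the text check in `needs_start` separate from the subprocess calls makes it possible to adjust the marker string without touching the rest.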
Workstation Repositories
Scripts for new SQL tables and indices are written to the DB repository found here: ~/Documents/CMAP/DB/
Python scripts for collection, ingestion, and processing are written to the cmapdata repository found here: ~/Documents/CMAP/cmapdata/. The dataingest branch contains the most recent updates.
The vault directory that syncs with Dropbox is found here: /data/CMAPDataSubmissionDropbox/SimonsCMAP/vault/. Note that on disk there are spaces in the directory names “CMAP Data Submission” and “Simons CMAP”, so the path must be quoted or escaped when used in a shell.
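Because of those spaces, a path passed to a shell command needs quoting, while Python's own file APIs accept it unchanged. A small sketch, assuming the spelling with spaces given in the Data Storage section (adjust it to the actual location on disk); `shell_path` is a hypothetical helper name.

```python
import shlex
from pathlib import Path


def shell_path(path) -> str:
    """Quote a path so that it is safe to paste into a shell command."""
    return shlex.quote(str(path))


if __name__ == "__main__":
    # Assumed spelling of the vault path; verify it on the workstation.
    vault = Path("/data/CMAP Data Submission Dropbox/Simons CMAP/vault")
    # pathlib needs no escaping for spaces.
    print(vault / "observation")
    # A shell command does, so quote the path first.
    print("ls", shell_path(vault))
```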
Thumbnails for the catalog page are saved here: /data/CMAPDataSubmissionDropbox/SimonsCMAP/static/mission_icons
Synology NAS and Drobo Storage
Before storing data on Dropbox, two non-cloud storage methods were tried. Both the Drobo and the Synology NAS are desktop-sized hard-disk storage units. Each contains ~40-50TB of disk space. Each has limitations. The Drobo requires a connection through USB-C/Thunderbolt. The Synology NAS is network-attached storage (NAS) and can be accessed over the internet. The read/write speed for both is quite slow compared to the disks on the workstation. Perhaps one or both could be used as another backup?