collect
This submodule contains various scripts for collecting outside data. Methods include FTP, curl, wget and others. Scripts are usually one-off and are a record of the method used to collect the data. They are organized hierarchically in a similar fashion to /vault.
├── assimilation
├── model
├── observation
├── in-situ
│ ├── cruise
| | |── cruise_name
| | |── collect_{cruise_name}.py
│ ├── drifter
│ ├── float
│ ├── mixed
│ └── station
└── remote
└── satellite
collection strategies
The oceanography data in CMAP comes from multiple sources which vary in the amount of data processing required and available metadata. The first step of ingesting a dataset from an outside source into CMAP is collecting the data. This generally starts with a python collection script. This both servers to collect the data as well as leave a record.
FTP Servers
Some datasets, especially when there are multiple files, are available over FTP servers. To retrieve this data, you can either use some GUI FTP application such as FileZilla or a command line utility such as wget or curl. Examples of using wget are available in some of the collect.py scripts. Some FTP sites required registrations and username/passwords.
Zipped File Links
Some data providers such as Pangea provide datasets and metadata as zipped files. While this is very convenient, it is a good idea to still create a collect_datasetname.py file with the zipped file link.
Webscrapping
Some of the cruise trajectory and metadata was initially collected from R2R (Rolling Deck to Repository). Generally, webscraping is only a last resort.