Welcome to pytroll-collectors’s documentation!

Pytroll-collectors is a collection of scripts and modules to aid in operational processing of satellite reception data. The scripts run continuously (like daemons) and communicate with each other using posttroll messages.

For example, a chain for processing Metop AVHRR data from direct reception, in which external software deposits files on the file system, may look somewhat like this:

  • The chain starts with trollstalker to monitor files. trollstalker uses inotify and sends a posttroll message when a file appears.

  • This message is received by geographic_gatherer. Depending on the reception system, a single Metop AVHRR overpass may produce multiple files. geographic_gatherer determines what files belong together in a region and sends a posttroll message containing all those filenames.

  • AVHRR data need preprocessing with the external software AAPP before Satpy can read them. This preprocessing can be done with aapp-runner. For this preprocessing, it is advantageous to pass a single file. Therefore, the cat.py script may listen to messages from geographic_gatherer and concatenate the files (it needs Kai to do so). When done, it sends another message.

  • For pre-processing data with AAPP and ANA, aapp-runner is responsible and can be configured to read posttroll messages either from cat.py or directly from geographic_gatherer. See documentation for aapp-runner.

The exact configuration varies with what satellite data are processed, whether those come from direct readout, EUMETCast, or another source, what system is used for direct readout, and other factors. Some users use the third-party software supervisor to start and monitor the different scripts in pytroll-collectors.
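For users going the supervisor route, a minimal supervisord program section might look like this (program name, paths, and options are illustrative assumptions, not a prescribed setup):

; supervisord snippet: keep trollstalker running and restart it on failure
[program:trollstalker]
command=trollstalker.py -c /etc/pytroll/trollstalker_config.ini -C noaa_hrpt
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/var/log/pytroll/trollstalker.log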

There are example configurations in the examples/ directory.

Scripts

All scripts use posttroll messages to communicate. They normally run in the background (as daemons). Most wait for a posttroll message to trigger their task and send another posttroll message when the task is done.

cat.py

Concatenates granules or segments from NOAA and Metop level 0 data into a single file. This may be a useful step before using external software for preprocessing the data. In particular, AAPP removes some scanlines from each end of a granule, so processing single granules would leave gaps between them. You will need Kai from EUMETSAT to concatenate Metop granules (but not NOAA granules). Cat listens to input messages via posttroll according to topics defined in the configuration file. Upon completion, it publishes a posttroll message with the topic defined in the configuration file.

The cat.py configuration file is in the INI format and should have one or more sections. Each section may have the following fields:

output_file_pattern

A pattern (trollsift syntax) for the output file.

aliases

Optional (what does it do?)

min_length

Optional, integer; minimum number of minutes of data needed for processing to proceed

command

Command used for concatenation.

stdout

Optional; if command writes to stdout, redirect output here.

publish_topic

Optional; publish a message when file is produced, using this topic.

publish_port

Optional; use a custom port when publishing a message.

nameservers

Optional; nameservers to publish on.

subscriber_nameserver

Optional; nameserver to listen to.

Example configuration:

[kai_cat]
topic=/EPS/0/
# Command used for concatenation.
command=kai -i {input_files} -o {output_file}
# Pattern for produced output file, with trollsift syntax.
output_file_pattern=/san1/polar_in/ears/metop/avhrr_{platform_name}_{start_time:%Y%m%d%H%M%S}_{end_time:%Y%m%d%H%M%S}.eps
# Minimum number of minutes to continue processing.  If the files cover
# less time, does not write an output file.
#minutes=10
# Topic to use for posttroll publishing.
publish_topic=/EPS/cat
# Minimum number of granules to consider processing
min_length=2

[hrpt_cat]
topic=/HRPT/0/
command=cat {input_files}
# Redirect output from command to this file
stdout={output_file}
output_file_pattern=/san1/polar_in/ears/hrpt/avhrr_{platform_name}_{start_time:%Y%m%d%H%M%S}_{end_time:%Y%m%d%H%M%S}.hrp
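Conceptually, a section such as [kai_cat] amounts to composing the output filename from the pattern and substituting it, together with the gathered input files, into the command. A minimal sketch of that behaviour in Python (not the actual cat.py code; filenames and metadata are made up):

import subprocess
from datetime import datetime

from trollsift import compose

command = "kai -i {input_files} -o {output_file}"
pattern = ("/san1/polar_in/ears/metop/avhrr_{platform_name}_"
           "{start_time:%Y%m%d%H%M%S}_{end_time:%Y%m%d%H%M%S}.eps")
# In cat.py, the input files and metadata come from the received posttroll message
input_files = ["/data/granule1.eps", "/data/granule2.eps"]
metadata = {"platform_name": "Metop-B",
            "start_time": datetime(2024, 1, 1, 10, 0, 0),
            "end_time": datetime(2024, 1, 1, 10, 15, 0)}

output_file = compose(pattern, metadata)
cmd = command.format(input_files=" ".join(input_files), output_file=output_file)
subprocess.run(cmd.split(), check=True)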

catter

An alternative to cat.py with a different configuration format. Example configuration:

[noaa 19/0]
subject=HRPT/0
cat=bz2
pattern=/tmp/avhrr_{start_time:%Y%m%d_%H%M%S}_{platform:4s}{number:2s}.hrp

[metop-a/0]
subject=EPS/0
cat=bz2
pattern=AVHR_HRP_00_M02_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{proc_time:%Y%m%d%H%M%S}Z

[metop-b/0]
subject=EPS/0
cat=bz2
pattern=AVHR_HRP_00_M01_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{proc_time:%Y%m%d%H%M%S}Z

geographic_gatherer

This was previously known as gatherer, but was renamed to clarify the usage.

Collects granulated swath data so that the granules cover the configured target area(s) in a contiguous manner. Uses pytroll-schedule (which uses pyorbital) to calculate the required granules from orbital parameters (two-line elements; TLEs). Pyresample is required to handle the area definitions that describe the target area.

Watches files or messages and gathers satellite granules into “collections”, then sends the collection of files in a message for further processing. It determines which granules with different start times belong together. A use case is a reception system in which a single overpass results in multiple files that should be grouped together for further processing. It uses pytroll-schedule to estimate the area coverage based on start and end times contained in the filenames.

The geographic_gatherer collection is started when it receives a posttroll message, perhaps from trollstalker or segment_gatherer. Using the configured granule duration and the area of interest, it calculates the start times of the granules it expects to cover this area before and after the granule it was messaged about. Collection is considered finished when either all expected granules have been collected or a timeout is reached, whichever comes first. The timeout is configured with the timeliness option (see below).
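As an illustration of that calculation, given a fixed granule duration and one observed granule, the neighbouring expected start times are simply multiples of the duration (a hypothetical sketch, not the gatherer's actual code, which additionally checks area coverage with pytroll-schedule):

from datetime import datetime, timedelta

# Hypothetical: 3-minute granules, one granule observed at 10:06 UTC
duration = timedelta(seconds=180)
seen = datetime(2024, 1, 1, 10, 6, 0)
# Candidate start times two granules before and after the observed one;
# the real gatherer keeps only those whose swath overlaps the target area
expected = [seen + i * duration for i in range(-2, 3)]
# [10:00, 10:03, 10:06, 10:09, 10:12]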

The configuration file in INI format needs a section called [DEFAULT] and one or more sections corresponding to what should be gathered. The [DEFAULT] section holds common items for all other sections. It can be used to define the regions:

regions

A whitespace-separated list of names corresponding to areas for which granules are gathered.

All other sections have the following mandatory fields:

pattern

Defines the file pattern. This is needed to create the full list of expected files to know what to wait for. If you don’t pass this, gatherer will not fail, but …

topics

Defines what posttroll topics to listen to for messages related to files having arrived.

publish_topic

Defines the posttroll topic used to publish the message listing all the files that have been gathered.

timeliness

Defines the maximum allowed age of the granule in minutes (Warning: unit different compared to duration). Collection is stopped timeliness minutes after the expected end time of the last expected granule.

And the following optional fields:

service

The posttroll service name that publishes the messages. If given, subscribe only to messages from this service.

sensor

Defines the sensor. This is used for …

platform_name

Defines the platform name. This is used for …

format

Defines the file format. This is used for …

type

File type. Used how? Difference with format?

variant

Defines variant through which data come in. Used how?

level

Data level. Some downstream scripts may expect to see this in the messages they receive.

duration

Duration of a granule in seconds (Warning: unit different compared to timeliness)

orbit_type

What type of orbit? Some downstream scripts may expect to receive this information through posttroll messages.

inbound_connection

The list of addresses to get the messages from when using posttroll. Addresses are given in host:port format. One of the addresses can be given as just host, in which case it is interpreted as a nameserver to query addresses from. If omitted, the default behaviour is to use localhost as a nameserver.

[DEFAULT]
# gather data within those areas
regions = euron1 afghanistan afhorn
area_definition_file = /path/to/areas.yaml

[local_viirs]
# gatherer needs to create the full list of expected files to know what to wait for
pattern = /san1/pps/import/PPS_data/source/npp_????????_????_?????/SV{channel:3s}_{platform}_d{start_date:%Y%m%d}_t{start_time:%H%M%S%f}_e{end_time:%H%M%S%f}_b{orbit_number:5d}_c{proctime:%Y%m%d%H%M%S%f}_cspp_dev.h5
format = SDR
type = HDF5
level = 1B
platform_name = Suomi-NPP
sensor = viirs
# max allowed age of the granule in MINUTES.  Collection is stopped if
# the current time is ``timeliness`` minutes after the estimated end of
# the estimated last expected granule to be collected.  That means that
# if the gatherer expects 5-minute granules at 10:05, 10:10, 10:15, 10:20,
# and 10:25, but gets nothing after 10:10, and timeout is 10 minutes, it
# will wait until 10:25 + 5 minutes + 10 minutes = 10:40 before giving up.
timeliness = 10
# duration of a granule in SECONDS
duration = 180
publish_topic =
# The topics to listen for:
topics = /viirs/sdr/1

[ears_viirs]
pattern = /data/prod/satellit/ears/viirs/SVMC_{platform}_d{start_date:%Y%m%d}_t{start_time:%H%M%S%f}_e{end_time:%H%M%S%f}_b{orbit_number:5d}_c{proctime:%Y%m%d%H%M%S%f}_eum_ops.h5.bz2
format = SDR_compact
type = HDF5
level = 1B
platform_name = Suomi-NPP
sensor = viirs
timeliness = 30
duration = 85.4
variant = EARS
publish_topic =
# The topics to listen for:
topics = /ears/viirs/sdr/1

[ears_noaa18_avhrr]
pattern = /data/prod/satellit/ears/avhrr/avhrr_{start_time:%Y%m%d_%H%M%S}_noaa18.hrp.bz2
format = HRPT
type = binary
level = 0
duration = 60
platform_name = NOAA-18
sensor = avhrr/3
timeliness = 15
variant = EARS
publish_topic =
topics = /ears/avhrr/hrpt/1

[ears_noaa19_avhrr]
pattern = /data/prod/satellit/ears/avhrr/avhrr_{start_time:%Y%m%d_%H%M%S}_noaa19.hrp.bz2
format = HRPT
type = binary
level = 0
duration = 60
platform_name = NOAA-19
sensor = avhrr/3
timeliness = 15
variant = EARS
publish_topic =
topics = /ears/avhrr/hrpt/1

[ears_metop-b]
pattern = /data/prod/satellit/ears/avhrr/AVHR_HRP_{data_processing_level:2s}_M01_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{proc_time:%Y%m%d%H%M%S}Z.bz2
format = EPS
type = binary
platform_name = Metop-B
sensor = avhrr/3
timeliness = 15
level = 0
variant = EARS
publish_topic =
topics = /ears/avhrr/metop/eps/1

[ears_metop-a]
pattern = /data/prod/satellit/ears/avhrr/AVHR_HRP_{data_processing_level:2s}_M02_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{proc_time:%Y%m%d%H%M%S}Z.bz2
format = EPS
type = binary
platform_name = Metop-A
sensor = avhrr/3
timeliness = 15
level = 0
variant = EARS
publish_topic = /EARS/Metop-A
topics = /ears/avhrr/metop/eps/1

[gds_metop-b]
pattern = /data/prod/satellit/metop2/AVHR_xxx_{data_processing_level:2s}_M01_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{proc_time:%Y%m%d%H%M%S}Z
format = EPS
type = binary
platform_name = Metop-B
sensor = avhrr/3
timeliness = 100
variant = GDS
orbit_type = polar
publish_topic = /GDS/Metop-B
topics = /gds/avhrr/metop/eps/1

[gds_metop-a]
pattern = /data/prod/satellit/metop2/AVHR_xxx_{level:2s}_M02_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{proc_time:%Y%m%d%H%M%S}Z
format = EPS
type = PDS
platform_name = Metop-A
sensor = avhrr/3
timeliness = 100
variant = GDS
publish_topic = /GDS/Metop-A
topics = /gds/avhrr/metop/eps/1

[EARS_terra]
pattern = /data/prod/satellit/modis/lvl1/thin_MOD021KM.A{start_time:%Y%j.%H%M}.005.{proc_time:%Y%j%H%M%S}.NRT.hdf
format = EOS_thinned
type = HDF4
level = 1B
platform_name = EOS-Terra
sensor = modis
timeliness = 180
duration = 300
variant = EARS
topics = /ears/modis/hdf4/1

[EARS_aqua]
pattern = /data/prod/satellit/modis/lvl1/thin_MYD021KM.A{start_time:%Y%j.%H%M}.005.{proc_time:%Y%j%H%M%S}.NRT.hdf
format = EOS_thinned
type = HDF4
level = 1B
platform_name = EOS-Aqua
sensor = modis
timeliness = 180
duration = 300
variant = EARS
topics = /ears/modis/hdf4/1

scisys_receiver

Translates messages published by Scisys reception software to posttroll messages.

segment_gatherer

Collects together files that belong together for a single time slot.

Geostationary example: a single full disk dataset of Meteosat SEVIRI data is segmented into 114 separate files: a prolog (PRO), an epilog (EPI), 24 segments for HRV, and 8 segments for each of the eleven lower-resolution channels. For processing, some of those segments are essential (if absent, no processing can take place), while others are optional (if one segment in the middle is missing, an image can still be produced, but it will have a gap).

Low Earth Orbit (LEO) example: EARS/VIIRS data are split into M-channel files (including all M-channels) and DNB-channel files. These files have the same start and end times and coverage, just different data.

Historically this was created to collect SEVIRI segments, which has some impact on the configuration.

In the segment gatherer YAML configuration, the user can define one or more patterns to be collected. The following top-level variables may be defined:

patterns

Mapping of pattern names to pattern definitions. Each pattern definition is itself a mapping that must contain the key pattern and may contain the keys critical_files, wanted_files, all_files, is_critical_set, and variable_tags. When patterns is not defined, the segment gatherer will not do anything useful.

pattern

Defines the pattern used to parse filenames obtained from incoming posttroll messages. The string follows trollsift syntax. The labels channel_name and segment have special meaning. Labels must be defined as string type (for example {segment:4s}), because the segment gatherer formats the filename pattern only after converting numeric segments or segment ranges to strings.

critical_files

Describes the files that must be unconditionally present. If the timeout is reached and one or more critical files are missing, no message is published and all further processing ceases. The critical files are described as a comma-separated string. Each item must contain exactly one colon (:). The part before the colon is a string describing the channel. The channel string may be empty, such as in cases where the filename does not contain a channel label. The part after the colon is a list of segments separated by a hyphen-minus character (-). If this list contains more than one segment, each item must be parseable as a base-10 integer, and it will be interpreted as a range between the first and the last segment. For each channel, the segments are matched against the segment extracted from the filename using the pattern defined above. If the filename pattern has no segments or channels, they are matched against the entire filename, with variable_tags (see below) replaced by wildcards.
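To make the syntax concrete, here is a hypothetical parser for such a specification (illustrative only, not the segment gatherer's actual implementation):

def parse_file_spec(spec):
    """Expand e.g. 'VIS006:000006-000008,:PRO' into segments per channel."""
    result = {}
    for item in spec.split(","):
        channel, _, segments = item.partition(":")
        parts = segments.split("-")
        if len(parts) > 1:
            # More than one item: an inclusive range of base-10 integers
            first, last = int(parts[0]), int(parts[-1])
            width = len(parts[0])
            segs = [str(n).zfill(width) for n in range(first, last + 1)]
        else:
            segs = parts
        result.setdefault(channel, []).extend(segs)
    return result

parse_file_spec("VIS006:000006-000008,:PRO")
# {'VIS006': ['000006', '000007', '000008'], '': ['PRO']}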

wanted_files

Describes files that are wanted, but not critical. If one or more wanted files are missing, the segment gatherer will wait for them to appear until the timeout is reached. If timeout is reached and one or more wanted files are missing, a message will be published without the missing files. If all wanted files are present before timeout is reached, collection is finished and a message will be published immediately. The syntax is as for critical_files.

all_files

Describes files that are accepted, but not needed. Any file matching the all_files pattern is included with the published message, but the segment gatherer will not wait for those files.

is_critical_set

A boolean that marks this set of files as critical for the whole collection. Used for example when cloud mask data are required to successfully create a masked image.

variable_tags

List of strings for tags that are expected to vary between segments. Those are replaced with wildcards for the purposes of pattern matching.

group_by_minutes

Optional integer. Group the data into intervals of the given number of minutes. For example, with group_by_minutes = 10, all files with times from “201712081120” to “201712081129” would go into the slot “2017-12-08T11:20:00”. (Can also be defined globally.) By default, no grouping by minutes is performed and times are matched exactly or within a tolerance of time_tolerance.
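The grouping boils down to flooring the timestamp to the nearest interval boundary, roughly like this (an illustrative sketch, not the actual implementation):

from datetime import datetime

group_by_minutes = 10
t = datetime(2017, 12, 8, 11, 29)
slot = t.replace(minute=(t.minute // group_by_minutes) * group_by_minutes,
                 second=0, microsecond=0)
# slot == datetime(2017, 12, 8, 11, 20)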

start_time_pattern

Optional. Mapping with the keys start_time, end_time, and delta_time, which are all strings in the format %H:%M. This defines a pattern of time slots that will be considered for processing. Any time slot that does not match this pattern will be discarded. For example, a start_time of 06:00, end_time of 18:00, and delta_time of 01:00 will result in processing only whole-hour time slots between 06:00 and 18:00. By default, all time slots are processed.
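With those example settings, a qualifying time slot must fall inside the window and on the hourly grid, roughly like this (hypothetical sketch):

from datetime import datetime, time, timedelta

# start_time 06:00, end_time 18:00, delta_time 01:00
slot = datetime(2024, 1, 1, 9, 0)
in_window = time(6, 0) <= slot.time() <= time(18, 0)
anchor = slot.replace(hour=6, minute=0, second=0, microsecond=0)
on_grid = (slot - anchor) % timedelta(hours=1) == timedelta(0)
process = in_window and on_grid  # True for 09:00, False for 09:30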

keep_parsed_keys

Optional. The segment gatherer normally combines metadata from the filename and from the received posttroll message. The keys listed here will be taken from the filename pattern rather than from the message metadata. By default, only the parsed keys hardcoded in the source code are taken from the filename pattern. (Can also be defined globally.)

timeliness

Time in seconds from the first arrived file until timeout. When timeout is reached, all collected files (meaning all files that match the all_files pattern) are broadcast in a posttroll message.

time_name

Name of the time tag used in all patterns.

time_tolerance

Time difference in seconds for which start times are considered to belong to the same time slot.

posttroll

Configuration related to posttroll messaging, with the keys topics (list of topics to listen to), publish_topic (topic used for published messages), publish_port, nameservers, and addresses.

bundle_datasets

Optional. Merge the datasets within a collection into a single dataset.

num_files_premature_publish

Optional. Defines a number of received files after which a message will be published even though some files are still missing. After publishing such a message, the segment gatherer still waits for further file messages for this time slot.

providing_server

Optional. Affects posttroll listening in a multicast environment. In a multicast environment, messages may come in from different servers. By setting a server name here, only messages from that server will be considered.

check_existing_files_after_start

Optional. When the first posttroll message arrives after the segment gatherer has started, check the file system for existing files that should also be added to this time slot. Currently does not support (remote) S3 file systems. Defaults to False.

all_files_are_local

Optional. If set to True (defaults to False), the segment gatherer will handle all files as locally accessible. That is, it will drop the transport protocol/scheme and host name from the URI of the incoming messages. The use case is protocols that fsspec does not recognize and cannot handle, such as scp://.
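Conceptually, the option reduces a remote URI to its path component, for example:

from urllib.parse import urlparse

uri = "scp://user@remote-host/path/to/file.h5"
local_path = urlparse(uri).path
# '/path/to/file.h5'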

The YAML format supports collecting several different datasets together, for example SEVIRI data and NWC SAF GEO products.

Configuration for segment_gatherer can be in either INI or YAML files. There are several examples in the examples/ directory in the pytroll-collectors source tree.

Example INI config:

[ears-viirs]
# Pattern used to identify time slots and segments.  NOTE: this
# pattern is not used in forming the metadata sent forward; the
# published metadata comes directly from e.g. trollstalker.
pattern = SV{segment}C_{orig_platform_name}_d{start_time:%Y%m%d_t%H%M%S}{start_decimal:1d}_e{end_time:%H%M%S}{end_decimal:1d}_b{orbit_number:5d}_c{proctime:s}_eum_ops.h5
# Segments critical to production
critical_files =
# These segments we want to have, but it's still ok if they are missed
wanted_files = :M,:DNB
# All possible segments
all_files = :M,:DNB
# Listen to messages with this topic
topics = /EARS/Suomi-NPP/viirs/1b
# Publish the dataset with this topic
publish_topic = /segment-EARS/Suomi-NPP/viirs/1b
# Time to wait after the first segment, in seconds
timeliness = 240
# Name of a time field in the pattern above
time_name = start_time
# Comma separated list of tag names in the pattern that vary between different
# segments of the same time slot
variable_tags = proctime,proc_decimal
# Listen to messages coming from these extra IP addresses and port
# addresses = tcp://192.168.0.101:12345 tcp://192.168.0.102:12345
# Publish messages via this port.  If not set, random free port is used
# publish_port = 12345
# Force all files to be local. That is, drop scheme and host from the URIs before handling the messages
# all_files_are_local = True

# nameserver host to register publisher
# WARNING: 
# if nameserver option is set, address broadcasting via multicasting is not used any longer.
# The corresponding nameserver has to be started with command line option "--no-multicast".
#nameserver = localhost

[ears-pps]
pattern = W_XX-EUMETSAT-Darmstadt,SING+LEV+SAT,{orig_platform_name}+{segment}_C_EUMS_{start_time:%Y%m%d%H%M%S}_{orbit_number:5d}.nc
critical_files = 
wanted_files = :CTTH,:CT,:CMA
all_files = :CTTH,:CT,:CMA
topics = /test-stalker/4/dev
publish_topic = /test-geo_gatherer/4/dev
timeliness = 1200
time_name = start_time

[msg]
pattern = H-000-{orig_platform_name:4s}__-{orig_platform_name:4s}________-{channel_name:_<9s}-{segment:_<9s}-{start_time:%Y%m%d%H%M}-__
critical_files = :PRO,:EPI
wanted_files = VIS006:000006-000008,VIS008:000006-000008,IR_016:000006-000008,IR_039:000006-000008,WV_062:000006-000008,WV_073:000006-000008,IR_087:000006-000008,IR_097:000006-000008,IR_108:000006-000008,IR_120:000006-000008,IR_134:000006-000008,HRV:000022-000024
all_files = VIS006:000001-000008,VIS008:000001-000008,IR_016:000001-000008,IR_039:000001-000008,WV_062:000001-000008,WV_073:000001-000008,IR_087:000001-000008,IR_097:000001-000008,IR_108:000001-000008,IR_120:000001-000008,IR_134:000001-000008,HRV:000001-000024
topics = /foo/bar
publish_topic = /pub/foo/bar
timeliness = 900
time_name = start_time

[rss]
pattern = H-000-{orig_platform_name:4s}__-{orig_platform_name:4s}_RSS____-{channel_name:_<9s}-{segment:_<9s}-{start_time:%Y%m%d%H%M}-__
critical_files = :PRO,:EPI
wanted_files = VIS006:000006-000008,VIS008:000006-000008,IR_016:000006-000008,IR_039:000006-000008,WV_062:000006-000008,WV_073:000006-000008,IR_087:000006-000008,IR_097:000006-000008,IR_108:000006-000008,IR_120:000006-000008,IR_134:000006-000008,HRV:000022-000024
all_files = VIS006:000006-000008,VIS008:000006-000008,IR_016:000006-000008,IR_039:000006-000008,WV_062:000006-000008,WV_073:000006-000008,IR_087:000006-000008,IR_097:000006-000008,IR_108:000006-000008,IR_120:000006-000008,IR_134:000006-000008,HRV:000022-000024
topics = /foo/bar
publish_topic = /pub/foo/bar
timeliness = 300
time_name = start_time

[hrptl0]
# Example to collect multiple direct readout instruments for a single
# timeslot.  This may be needed for some downstream software, such
# as NWC/PPS.  It does not collect multiple timeslots for a single
# overpass, that's what gatherer is for.
pattern = {segment:4s}_HRP_00_{orig_platform_name}_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{processing_time:%Y%m%d%H%M%S}Z
wanted_files = :AVHR,:AMSA,:HIRS,:MHSx
all_files = :AVHR,:AMSA,:HIRS,:MHSx
critical_files = :AVHR,:AMSA,:HIRS,:MHSx
topics = /new/file/yay
publish_topic = /data/for/aapp
timeliness = 20
time_name = start_time
# Allowed time tolerance in seconds in the "time_name" for segments to be
# included in the same time slot.  Default: 0
time_tolerance = 30
# Here all the time fields are varying
variable_tags = start_time,end_time,processing_time

Example YAML config:

# Example how to collect several filesets which are segmented

patterns:
  # First pattern and the different segments collected for it
  msg:
    pattern:
      "H-000-{orig_platform_name:4s}__-{orig_platform_name:4s}________-{channel_name:_<9s}-{segment:_<9s}-{start_time:%Y%m%d%H%M}-__"
    critical_files: :EPI,:PRO
    wanted_files: VIS006:000001-000008,:PRO,:EPI
    all_files: VIS006:000001,VIS006:000002,VIS006:000003,VIS006:000004,VIS006:000005,VIS006:000006,VIS006:000007,VIS006:000008,:PRO,:EPI
    # This set of files is crucial
    is_critical_set: true
    # The platform name is different between different filesets, so needs to
    # be defined as variable
    variable_tags: ['orig_platform_name', ]
  iodc:
    pattern:
      "H-000-{orig_platform_name:4s}__-{orig_platform_name:4s}_IODC___-{channel_name:_<9s}-{segment:_<9s}-{start_time:%Y%m%d%H%M}-__"
    critical_files: :EPI,:PRO
    wanted_files: VIS006:000001-000008,:PRO,:EPI
    all_files: VIS006:000001,VIS006:000002,VIS006:000003,VIS006:000004,VIS006:000005,VIS006:000006,VIS006:000007,VIS006:000008,:PRO,:EPI
    # This set of files is not crucial, but we'll wait until timeout in any case
    is_critical_set: false
    variable_tags: ['orig_platform_name', ]


# Time in seconds from the first arrived file until timeout
timeliness:
  900
# Time tag used in ALL of the patterns
time_name:
  start_time
# The time, shown by "time_name", can differ by time_tolerance seconds
time_tolerance:
  30

posttroll:
  topics:
    - /foo/bar
  publish_topic:
    /segment-foo/bar
  publish_port:
    0
  nameservers:
    null
  addresses:
    null

If the collected segments are in an S3 object store, the check_existing_files_after_start feature needs some additional configuration. The connection configuration is done using the fsspec configuration system.

An example configuration could be placed in ~/.config/fsspec/s3.json:

{
    "s3": {
        "client_kwargs": {"endpoint_url": "https://s3.server.foo.com"},
        "secret": "VERYBIGSECRET",
        "key": "ACCESSKEY"
    }
}

trollstalker

Trollstalker is an alternative for users who do not use the trollmoves client/server system. If file transfers are done through trollmoves, there is no need for trollstalker. If file transfers are done through any other software, trollstalker can be used to detect file arrival.

It is typically run as a daemon or via a process control system such as supervisord or daemontools. When a new file is detected, a posttroll message is sent on the network via the posttroll nameserver (which must be running) to notify other interested processes.

In order to start trollstalker:

$ cd pytroll-collectors/bin/
$ ./trollstalker.py -c ../examples/trollstalker_config.ini -C noaa_hrpt

Now you can test whether the messaging works by copying a data file to your input directory. Trollstalker should send a message and, depending on the configuration, also print the message on the terminal. If there is no message, check in the configuration file that the input directory and file pattern are set correctly.

The config determines, among other things, which file patterns are monitored and what posttroll messages will be sent. Listeners to these messages may be, for example, segment_gatherer or aapp-runner.

Configuration files have one section per file type that is listened to. To listen to multiple file types, start trollstalker multiple times. The message sent by trollstalker contains a dictionary with:

  • All fields from the filepattern, and

  • Any keys starting with var_ in the configuration file and their values.

The additional keys may be essential if the package listening to trollstalker messages expects an entry in the posttroll message that is normally extracted from the filename. For example, geographic_gatherer needs a platform_name to be present at all times. If a filename does not contain a platform name, or is for some other reason not matched by a trollsift pattern, it may need to be sent explicitly with var_platform_name.
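For illustration, a message produced with the [noaa_hrpt] configuration below might look like this on the wire (all values are made up):

pytroll://HRPT/l1b/dev/mystation file pytroll@myhost 2024-01-01T10:05:00.123456 v1.01 application/json {"uri": "/path/to/satellite/data/hrpt_noaa19_20240101_1000_12345.l1b", "uid": "hrpt_noaa19_20240101_1000_12345.l1b", "platform_name": "NOAA-19", "start_time": "2024-01-01T10:00:00", "orbit_number": 12345}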

# This config is used in Trollstalker.

[noaa_hrpt]

# posttroll message topic that provides information on new files
# This could follow the pytroll standard: 
# https://github.com/mraspaud/pytroll/wiki/Metadata
topic=/HRPT/l1b/dev/mystation

# input directory that trollstalker watches
directory=/path/to/satellite/data/

# filepattern of the input files for trollstalker
# uses the trollsift syntax:
# http://trollsift.readthedocs.org/en/latest/index.html
filepattern={path}hrpt_{platform_name}_{start_time:%Y%m%d_%H%M}_{orbit_number:05d}.l1b

# instrument names for mpop
instruments=avhrr/3,mhs,amsu-b,amsu-a,hirs/3,hirs/4

# logging config for trollstalker. Comment out to log to console instead.
stalker_log_config=/usr/local/etc/pytroll/trollstalker_logging.ini

# logging level, if stalker_log_config is not set above. Possible values are:
#  DEBUG, INFO, WARNING, ERROR, CRITICAL
loglevel=DEBUG

# inotify events that trigger trollstalker to send messages
event_names=IN_CLOSE_WRITE,IN_MOVED_TO

# port to send the posttroll messages to, optional so use "0" to take a random
# free port.
posttroll_port=0

# nameserver hosts to register publisher
# WARNING: 
# if nameservers option is set, address broadcasting via multicasting is not used any longer.
# The corresponding nameserver has to be started with command line option "--no-multicast".
#nameservers=localhost

# use an alias to convert from platform in the filename to OSCAR naming
alias_platform_name = noaa18:NOAA-18|noaa19:NOAA-19

# Keep 10 last events in history, and process only if the new event
# isn't in this history.  If option not given, or set to zero (0), all
# matching events will be processed
history=10

# Uncomment if the files you want to stalk will be created in a
# subdirectory of the directory you are watching.
# For example if the base dir for inotify to watch is 2 levels up from
# the origin sift match.
# origin_inotify_base_dir_skip_levels=-2

[hrit]
topic=/HRIT/topic/or/something/
directory=/path/to/satellite/data/
filepattern={path}H-000-{platform_name:4s}__-{orig_platform_name:4s}________-{channel_name:_<9s}-{segment:_<9s}-{start_time:%Y%m%d%H%M}-__
instruments=seviri
stalker_log_config=/usr/local/etc/pytroll/trollstalker_logging.ini
loglevel=DEBUG
event_names=IN_CLOSE_WRITE,IN_MOVED_TO
posttroll_port=0
alias_platform_name = MSG2:Meteosat-9|MSG3:Meteosat-10


[himawari8]
topic=/H8/topic/or/something/
directory=/path/to/satellite/data/
filepattern = {path}IMG_{platform_name:4s}{channel:3s}_{time:%Y%m%d%H%M}_{segment}

# The posttroll message sent by trollstalker will include the keys from the
# filepattern with their corresponding values, but some clients may need
# additional information; for example, gatherer needs a "platform_name".
# This is usually included in the filepattern, but not always.  Even when
# it is, we may need to override it.  Therefore, we define new variable
# "platform_name" here.
var_platform_name = himawari8

# define a new datetime variable aligned/ceiled to 15-minute intervals
# (Himawari filename timestamps are not constant for a timeslot)
var_h8_gather_time={time:%Y%m%d%H%M|align(15)}

# override start_time and end_time because the default values are derived from
# {time} in the filepattern, which is not constant for all files of a timeslot.
# "end_time" should be 1 interval after start_time (3rd parameter of the align function)
var_start_time = {time:%Y%m%d%H%M%S|align(15)}
var_end_time = {time:%Y%m%d%H%M%S|align(15,0,1)}

instruments = ahi
stalker_log_config=/usr/local/etc/pytroll/trollstalker_logging.ini
loglevel=DEBUG
event_names=IN_CLOSE_WRITE,IN_MOVED_TO
posttroll_port=0


[avhrrl0]
topic=/new/file/yay
# /HRPT/L1b/dev/hki
directory=/home/lahtinep/data/satellite/polar/hrpt
publish_port=
event_names=IN_CLOSE_WRITE,IN_MOVED_TO
loglevel=DEBUG
stalker_log_config=/home/lahtinep/Software/pytroll/config_files/stalker_logging.ini
filepattern={path}AVHR_HRP_00_{platform_name}_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{processing_time:%Y%m%d%H%M%S}Z
instruments=avhrr-3
alias_platform_name=M01:Metop-B|M02:Metop-A|noaa15:NOAA-15|noaa16:NOAA-16|noaa18:NOAA-18|noaa19:NOAA-19|metop01:Metop-B
history=0

[amsual0]
topic=/new/file/yay
# /HRPT/L1b/dev/hki
directory=/home/lahtinep/data/satellite/polar/hrpt
publish_port=
event_names=IN_CLOSE_WRITE,IN_MOVED_TO
loglevel=DEBUG
stalker_log_config=/home/lahtinep/Software/pytroll/config_files/stalker_logging.ini
filepattern={path}AMSA_HRP_00_{platform_name}_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{processing_time:%Y%m%d%H%M%S}Z
instruments=amsu-a
alias_platform_name=M01:Metop-B|M02:Metop-A|noaa15:NOAA-15|noaa16:NOAA-16|noaa18:NOAA-18|noaa19:NOAA-19|metop01:Metop-B
history=0

[mhsl0]
topic=/new/file/yay
# /HRPT/L1b/dev/hki
directory=/home/lahtinep/data/satellite/polar/hrpt
publish_port=
event_names=IN_CLOSE_WRITE,IN_MOVED_TO
loglevel=DEBUG
stalker_log_config=/home/lahtinep/Software/pytroll/config_files/stalker_logging.ini
filepattern={path}MHSx_HRP_00_{platform_name}_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{processing_time:%Y%m%d%H%M%S}Z
instruments=mhs
alias_platform_name=M01:Metop-B|M02:Metop-A|noaa15:NOAA-15|noaa16:NOAA-16|noaa18:NOAA-18|noaa19:NOAA-19|metop01:Metop-B
history=0

[hirsl0]
topic=/new/file/yay
# /HRPT/L1b/dev/hki
directory=/home/lahtinep/data/satellite/polar/hrpt
publish_port=
event_names=IN_CLOSE_WRITE,IN_MOVED_TO
loglevel=DEBUG
stalker_log_config=/home/lahtinep/Software/pytroll/config_files/stalker_logging.ini
filepattern={path}HIRS_HRP_00_{platform_name}_{start_time:%Y%m%d%H%M%S}Z_{end_time:%Y%m%d%H%M%S}Z_N_O_{processing_time:%Y%m%d%H%M%S}Z
instruments=hirs
alias_platform_name=M01:Metop-B|M02:Metop-A|noaa15:NOAA-15|noaa16:NOAA-16|noaa18:NOAA-18|noaa19:NOAA-19|metop01:Metop-B
history=0

trollstalker2

A new, alternative implementation of trollstalker. Not really needed, as trollstalker works fine and is actively maintained.

s3stalker

A counterpart to trollstalker for polling for new files in an S3 bucket. It is intended to be run regularly, e.g. from cron. For a daemon version, see the next item. Example configuration: https://github.com/pytroll/pytroll-collectors/blob/main/examples/s3stalker.yaml

s3stalker_daemon

The daemon version of s3stalker, which keeps running and polls until stopped (preferably with a SIGTERM). Example configuration: https://github.com/pytroll/pytroll-collectors/blob/main/examples/s3stalker_runner.yaml_template

See also https://s3fs.readthedocs.io/en/latest/#credentials for options on how to define the S3 credentials.

zipcollector_runner

To be documented.

Interface to other packages in the Pytroll ecosystem

posttroll

The pytroll-collectors scripts use posttroll to exchange messages with other pytroll packages. For example, such a message might be “input file available”. Therefore, posttroll must be running for processing with pytroll-collectors to function.
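A minimal sketch of the publish/subscribe pattern the scripts follow (topic names, payload, and processing are made-up placeholders):

from posttroll.message import Message
from posttroll.publisher import Publish
from posttroll.subscriber import Subscribe

with Publish("my_processor") as pub:
    with Subscribe("", topics=["/HRPT/l1b/dev/mystation"]) as sub:
        for msg in sub.recv(timeout=10):
            if msg is None:  # recv() yields None on timeout
                continue
            # ... process the file referenced by msg.data["uri"] ...
            out = Message("/processed/hrpt", "file", {"uri": "/tmp/out.l1b"})
            pub.send(str(out))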

pytroll-aapp-runner

aapp-runner may be listening to messages from cat.py or segment_gatherer.

trollflow2

Trollflow2 is the successor of the now-retired trollduction package. Some of the scripts in pytroll-collectors, such as trollstalker, segment_gatherer, and gatherer, were previously part of trollduction, but now live here rather than in trollflow2. Today, trollflow2 may listen to messages sent by scripts from pytroll-collectors.

trollsift

Used for filename pattern matching, see trollsift documentation.
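A short example of the trollsift calls these scripts rely on (pattern and filename are made up):

from trollsift import compose, globify, parse

pattern = "avhrr_{platform_name}_{start_time:%Y%m%d%H%M%S}.l1b"
info = parse(pattern, "avhrr_NOAA-19_20240101100500.l1b")
# {'platform_name': 'NOAA-19', 'start_time': datetime(2024, 1, 1, 10, 5)}
filename = compose(pattern, info)  # back to the original filename
glob_pattern = globify(pattern)    # wildcard version, usable with glob.glob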
