To connect to S3 buckets, see the CSV S3 Collector video or read this article.
This article contains sections on:
Providing read access to Anodot and creating a cross-account AWS role
Creating an S3 Source
Creating a Stream Query
Editing a Stream Query
Recommended Practices
Providing read access for Anodot using a cross-account AWS role
Anodot can access the files in your S3 bucket only after you explicitly allow it. To grant Anodot read access by creating a cross-account AWS role, refer to this section (when creating the cross-account AWS role, replace the references to CUR with S3).
IMPORTANT: New S3 data sources should use the role-based access method mentioned above. The previous method of updating the bucket policy (described in Creating an S3 Source) is currently supported for existing data sources, but is no longer available for new sources. Contact us to ensure safe migration of existing sources without losing data.
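If you want to sanity-check the new role before adding the source, a minimal sketch along the following lines can help (this is not an Anodot tool; it assumes boto3 is installed, and the role ARN, bucket name, and folder prefix shown are placeholder values):

# Informal sketch: assume the cross-account role and confirm it can list and read objects.
import boto3

ROLE_ARN = "arn:aws:iam::123456789012:role/anodot-s3-read"  # placeholder role ARN
BUCKET = "my-metrics-bucket"                                 # placeholder bucket name
PREFIX = "metrics/"                                          # optional folder prefix

# Assume the role and build an S3 client from its temporary credentials
creds = boto3.client("sts").assume_role(
    RoleArn=ROLE_ARN, RoleSessionName="anodot-access-check")["Credentials"]
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# List a few objects and read the start of the first one
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=5).get("Contents", [])
for obj in objects:
    print(obj["Key"], obj["Size"])
if objects:
    print(s3.get_object(Bucket=BUCKET, Key=objects[0]["Key"])["Body"].read(200))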
Creating an S3 Source
- On the Streams page, click Sources +.
- In Source Types, click START on the AWS S3 tile. The S3 dialog box is displayed.
- From the dropdown menu, select the AWS region in which the bucket is located.
- Enter the bucket name you created in your AWS account.
- [Optional] To restrict access to a specific folder in the bucket, enter the name of the folder in the Folder Name field.
- Click Continue to display the Stream Query window.
Creating an S3 Stream Query
If you have just created an S3 data source, skip to step 3.
- In the Data Manager Source panel, choose the source to which you want to add a stream query; the streams associated with that source are displayed in the Streams panel.
Note: If the Streams panel is empty, no stream queries exist for that source.
- Click Streams + to display the Stream Query window.
- In the Files Path field, enter a path relative to the data source.
[Optional] You can further restrict access to the source by extending the path to a specific folder within the bucket.
- Choose a File Name Date Pattern.
Note: The file name indicates the start of the collection interval. See the examples below.
Filename Timestamp Pattern | Example | Supported Intervals
yyyyMMdd | test_20180715_daily.csv.gz | Daily
yyyyMMddHH | test_2018071503_hourly.csv.gz, test_2018071504_hourly.csv.gz, test_2018071505_hourly.csv.gz | Daily, hourly [1]
yyyyMMddHHmm | test_201807150305_5min.csv.gz, test_201807150310_5min.csv.gz, test_201807150300_15min.csv.gz, test_201807150315_15min.csv.gz, test_201807150330_15min.csv.gz, test_201807150345_15min.csv.gz | Daily, hourly, 15 minutes, 5 minutes, 1 minute [2]
[1] If daily - the most recent file will be used.
[2] For daily, hourly, 15 minutes, 5 minutes, and 1 minute - the most recent file will be used.
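As an informal illustration (not part of the product; the prefix test_ and suffix _hourly.csv.gz are example values taken from the table above), the following Python sketch shows how a file name built from a prefix, one of the timestamp patterns above, and a suffix can be generated and parsed back to the start of its collection interval:

# Informal sketch: build and parse file names of the form <prefix><timestamp><suffix>.
import re
from datetime import datetime

# Timestamp patterns from the table above, mapped to Python strftime equivalents
PATTERNS = {
    "yyyyMMdd": "%Y%m%d",
    "yyyyMMddHH": "%Y%m%d%H",
    "yyyyMMddHHmm": "%Y%m%d%H%M",
}

def build_name(prefix, pattern, suffix, when):
    # e.g. build_name("test_", "yyyyMMddHH", "_hourly.csv.gz", ...) -> test_2018071503_hourly.csv.gz
    return f"{prefix}{when.strftime(PATTERNS[pattern])}{suffix}"

def interval_start(name, prefix, pattern, suffix):
    # Return the collection-interval start encoded in the file name, or None if the name does not match
    digits = len(datetime(2018, 7, 15, 3, 5).strftime(PATTERNS[pattern]))
    match = re.fullmatch(re.escape(prefix) + rf"(\d{{{digits}}})" + re.escape(suffix), name)
    return datetime.strptime(match.group(1), PATTERNS[pattern]) if match else None

print(build_name("test_", "yyyyMMddHH", "_hourly.csv.gz", datetime(2018, 7, 15, 3)))
print(interval_start("test_2018071503_hourly.csv.gz", "test_", "yyyyMMddHH", "_hourly.csv.gz"))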
File Name Examples
File Name | Holds data from...to
test_2018071503_hourly.csv.gz | hourly data from 03:00 to 03:59
test_2018071504_hourly.csv.gz | hourly data from 04:00 to 04:59
test_2018071505_hourly.csv.gz | hourly data from 05:00 to 05:59
test_201807150305_5min.csv.gz | minute-based data from 03:05 to 03:09
test_201807150310_5min.csv.gz | minute-based data from 03:10 to 03:14
test_201807150300_15min.csv.gz | 15 minutes of data from 03:00 to 03:14
test_201807150345_15min.csv.gz | 15 minutes of data from 03:45 to 03:59
- In the File name prefix field, enter the prefix (e.g. test_).
- In the File name suffix field, enter the suffix and extension (e.g. demo.csv.gz).
Notes:
- In each of the Timestamp/Prefix/Suffix fields, enter only the specified data.
- Files may be compressed [.gz]
- If you get an error message, fix the error in the bucket or files, and click RETRY to re-fetch the files.
- [Optional] To preview the file, click the preview icon.
- To change the Parsing and Import Settings, click the settings icon. See CSV - Parsing and Importing Settings.
- Click GO! The imported Measures & Dimensions are displayed.
Note: If there are no Measures in the Stream Query, an editable Events Count column is automatically inserted in the Stream Table.
Editing a Stream Query
1. To edit the Stream Query, click the pen icon in the Measures & Dimensions panel.
- To add an item, drag from the repository on the left to the relevant Measure or Dimension panel. At least one dimension and one measure [a numerical value] are required.
- To clear an item, click the X icon on the item.
- Choose a time format from the drop-down menu.
- Choose a time zone.
2. Click X to accept the edits and return to the Stream Query window.
3. To edit the Schedule File Collection, click the pen icon in the Schedule panel.
In Schedule File Collection, define the following:
- From the Collect files every menu, set the intraday or daily collection interval.
Note: The timestamp at the start of the source file name indicates the period of the minute, hourly, or daily data. Each row in a file increases incrementally by the intraday or daily interval chosen.
- An Ignore files older than time span.
Note: "Older than" refers to backfilling of historical data and does not affect ongoing collection.
- A File Time Zone sets the scheduling time zone according to the time zone of the data records.
- A Lagging files policy sets the number of intervals to wait for lagging files.
Note: In general, the system keeps the data ordered and creates as few gaps as possible. The heuristics are as follows:
- While there is no new data (no new files), the system retries collection from the last file collected. In this case, the lag policy parameter has no effect.
- When there is a gap followed by new files, the Lagging files parameter states how many intervals the system should wait before processing the new files, so that files are not held too long because of missing files that may arrive late or never.
Example
File 2018032806_test.csv was collected OK - Anodot has collected the data up to 2018/03/28 at 06:59
The files for 07, 08, and 09 have not arrived in the relevant folder.
The file for 10, called 2018032810_test.csv, arrived at 11:15 - therefore there is a gap.
The Lagging files parameter determines the processing of the 10 o'clock file:
Lag policy = 0 --> The 10 file will be processed upon arrival, and a gap in the previous hours is created.
Lag policy = 1 --> The 10 file will not be processed yet; it will wait until 12:15 and then be processed. [If older files arrive during the waiting time, they will be processed first.]
Lag policy = 2 --> The 10 file will not be processed yet; it will wait until 13:15 and then be processed. [If older files arrive during the waiting time, they will be processed first.]
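To make the arithmetic of the example concrete, here is a minimal sketch (an informal illustration, not Anodot code) of how the lag policy delays processing of a file that arrives after a gap:

# Informal sketch of the lagging-files heuristic described above: a file that
# arrives after a gap is held for `lag_policy` intervals (hours here) before processing.
from datetime import datetime, timedelta

def earliest_processing(arrival, lag_policy, interval=timedelta(hours=1)):
    return arrival + lag_policy * interval

arrival = datetime(2018, 3, 28, 11, 15)  # 2018032810_test.csv arrives at 11:15
for lag_policy in (0, 1, 2):
    print(f"lag policy = {lag_policy}: processed at {earliest_processing(arrival, lag_policy)}")
# lag policy = 0: processed at 2018-03-28 11:15:00  (a gap remains for hours 07-09)
# lag policy = 1: processed at 2018-03-28 12:15:00
# lag policy = 2: processed at 2018-03-28 13:15:00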
4. Click X to accept the scheduling settings.
5. Click NEXT. The Stream Table is displayed.
Recommended Practices
Before uploading a data set file to AWS S3 source type:
- Prepare the file in a test folder; check that the file format meets the S3 data source file parameters.
- Records within files must be sorted chronologically (see the sketch after this list).
Note: Files smaller than 16MB are automatically sorted by timestamp by Anodot.
- Move the file to the production folder only after you have verified that the file is complete.
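The following rough sketch illustrates the sorting requirement above (it is not Anodot tooling; the file name sample_metrics.csv and the column name timestamp are assumed example values, and the sort assumes the timestamp column compares correctly as a string, e.g. ISO 8601 or fixed-length epoch values):

# Informal sketch: verify that a CSV file is sorted chronologically, and fix it if not.
import csv

def ensure_sorted(path, timestamp_column="timestamp"):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    ordered = sorted(rows, key=lambda r: r[timestamp_column])
    if rows != ordered:
        # Rewrite the file with rows in chronological order
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(ordered[0].keys()))
            writer.writeheader()
            writer.writerows(ordered)
    return rows == ordered

print(ensure_sorted("sample_metrics.csv"))  # True if the file was already sorted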
Notes:
- If the file timestamp is in epoch seconds or epoch milliseconds, do not define a time zone; doing so is treated as an error (see the illustration after these notes).
- The file must be formatted using UTF-8 or ASCII character sets.
- An S3 bucket can be accessed from one Anodot account.
- To ensure there is no impact on performance, the S3 folder should contain a maximum of 5000 files.
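As an informal illustration of the epoch-timestamp note above: an epoch value is already an absolute instant measured from 1970-01-01 00:00:00 UTC, so no separate time-zone setting is needed (the example values below are hypothetical):

# Informal sketch: epoch seconds and epoch milliseconds denote the same unambiguous UTC instant.
from datetime import datetime, timezone

epoch_seconds = 1531623600      # example value
epoch_millis = 1531623600000    # the same instant, in milliseconds

print(datetime.fromtimestamp(epoch_seconds, tz=timezone.utc))
print(datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc))
# Both print 2018-07-15 03:00:00+00:00 - no additional time-zone setting is required.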
See Also:
Using Data Collectors
Collecting and Streaming Data
Stream Tables
Stream Summaries