AWS S3

To connect to S3 buckets, watch the CSV S3 Collector video or read this article.

This article contains sections on:
Creating an S3 Source
Creating an S3 Stream Query

CREATING AN S3 SOURCE

  1. In the Data Manager window, click Sources +.
  2. In the Connect to Source dialog box, choose the S3 icon. The S3 dialog box is displayed.
    [Image: S3 bucket dialog]
  3. From the drop-down menu, select the AWS region where the bucket is located.
  4. Enter the bucket name you created in your AWS account.
    [Optional] To restrict access to a specific folder in the bucket, enter the name of the folder in the Folder Name field.
    Note Make sure that the bucket policy is updated to enable Anodot to access the files:
    i. In your AWS account, navigate to Permissions > Bucket Policy.
    [Image: Bucket Policy page]
    ii. Copy the policy listed below and change the "Resource" entries to your own bucket name.

{
    "Id": "Policy1520872584746",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1520872574434",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<bucket-name>",
                "arn:aws:s3:::<bucket-name>/*"
            ],
            "Principal": {
                "AWS": [
                    "arn:aws:iam::340481513670:user/anodot-bc"
                ]
            }
        }
    ]
}

For more details, see the AWS documentation on Using Bucket Policies and User Policies.
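If you prefer to apply the policy from code instead of the console, the same JSON can be set with boto3's put_bucket_policy. This is a minimal sketch, assuming your AWS credentials are already configured; "my-bucket" is a placeholder for your bucket name:

# Minimal sketch: apply the Anodot bucket policy with boto3.
# Assumes configured AWS credentials; "my-bucket" is a placeholder.
import json
import boto3

BUCKET = "my-bucket"

policy = {
    "Id": "Policy1520872584746",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1520872574434",
            "Action": ["s3:Get*", "s3:List*"],
            "Effect": "Allow",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Principal": {"AWS": ["arn:aws:iam::340481513670:user/anodot-bc"]},
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))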

5. Click Continue to display the Stream Query window.

CREATING AN S3 STREAM QUERY

If you have just created an S3 data source, skip to step 3.

    1. In the Data Manager Source panel, choose the source to which you want to add a stream query; the streams associated with that source are displayed in the Streams panel.
      Note If the Streams panel is empty, no stream queries exist for that source.
    2. Click Streams + to display the Stream Query window.
    3. In the Files Path field, enter a path relative to the data source.
      Note [Optional] You can further restrict collection by extending the path to a specific folder within the bucket; the sketch below shows how to list the files such a path would match.
      [Image: Files to collect]
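Before creating the stream, you can sanity-check which files a given path would match by listing the objects under that prefix. A minimal boto3 sketch, with "my-bucket" and "data/hourly/" as placeholder bucket and folder names:

# Minimal sketch: list the objects that fall under a given folder prefix.
# "my-bucket" and "data/hourly/" are placeholders.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="data/hourly/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])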
    4. In the Filename suffix field, enter the suffix of the file names to collect. File names must start with the time format YYYYMMDDHH; if you copied a file name from AWS S3, delete this timestamp and the path, so that only the remainder of the name (the suffix) appears in the Filename suffix box:
      Example For the file name 2018031711_sampledata.csv, enter _sampledata.csv
      Files may be compressed using gz. In this case, include the gz in the suffix as well:
      Example For the file name 2018031711_test.csv.gz, enter _test.csv.gz
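To check that a file name splits correctly into the timestamp and the suffix, the first 10 characters (YYYYMMDDHH) can be separated from the rest. A quick Python sketch using the example file names above:

# Sketch: split a file name into its YYYYMMDDHH timestamp and the suffix
# that goes into the Filename suffix box.
from datetime import datetime

for name in ["2018031711_sampledata.csv", "2018031711_test.csv.gz"]:
    stamp, suffix = name[:10], name[10:]       # first 10 chars = YYYYMMDDHH
    file_time = datetime.strptime(stamp, "%Y%m%d%H")
    print(file_time, "->", suffix)             # 2018-03-17 11:00:00 -> _sampledata.csv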

    5. [Optional] To preview the file, click the Eye icon.
    6. To change the Parsing and Import Settings, click the File Settings icon.
      See CSV - Parsing and Importing Settings.
    7. Click GO!
    8. To edit the Measures & Dimensions, click the Pen icon in the Measures & Dimensions panel.
      [Image: Measures & Dimensions panel]
  • To add an item, drag from the repository on the left to the relevant Measure or Dimension panel. At least one dimension and one measure [a numerical value] are required.
  • To clear an item, click the X icon on the item.
  • Choose a time format from the drop-down menu.
  • Choose a time zone; it may be taken from your browser, from your file, or set to a specific location, as the sketch below illustrates. The time stamp definition and the time pattern are required.
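The time zone choice matters because the same timestamp string maps to different absolute times depending on the zone it is interpreted in. A small sketch (the zone names and timestamp are example values):

# Sketch: one timestamp string, two time zones, two different epochs.
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

naive = datetime.strptime("2018-03-17 11:00:00", "%Y-%m-%d %H:%M:%S")
for zone in ["UTC", "America/New_York"]:
    aware = naive.replace(tzinfo=ZoneInfo(zone))
    print(zone, "->", int(aware.timestamp()))  # epochs differ by the UTC offset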

9. Click X to accept the edits and return to the Stream Query window.

10. To edit the Schedule File Collection, click the Pen icon in the Schedule panel.
[Image: Schedule File Collection]

In Schedule File Collection, choose: 

  • From the Collect files every menu, set the collection interval [hourly/daily].
    Note The timestamp at the start of the source file name indicates the period of the hourly or daily data. The row timestamps within each file advance accordingly, hour by hour or day by day.
  • An Ignore files older than time span (see the sketch below this list).
  • A File Time Zone, so that Anodot knows how to interpret the files' internal timestamps.
  • A Lagging files policy to set the number of intervals to wait for lagging files.
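The Ignore files older than setting behaves like a cutoff filter on the timestamps embedded in the file names. A minimal sketch, with example file names and a hypothetical 3-day span:

# Sketch: "Ignore files older than" as a cutoff on file-name timestamps.
# File names and the 3-day span are example values.
from datetime import datetime, timedelta

now = datetime(2018, 3, 28, 12, 0)
cutoff = now - timedelta(days=3)               # Ignore files older than 3 days
for name in ["2018032420_test.csv", "2018032806_test.csv"]:
    file_time = datetime.strptime(name[:10], "%Y%m%d%H")
    print(name, "->", "collect" if file_time >= cutoff else "ignore")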

Note The general system behavior is to keep the data ordered and to create as few gaps as possible. Therefore, the heuristics are as follows:

1. While there is no new data, meaning no new files, the system retries collection from the last file collected. In this case, the lag policy parameter has no effect.
2. When there is a gap followed by new files, the Lagging files parameter states how many intervals the system should wait before processing the new files, so that processing is not held up too long by missing files that arrive late or never.

Example
File 2018032806_test.csv was collected successfully; Anodot has collected the data up to 2018/03/28 at 06:59.
The files for 07, 08, and 09 have not arrived in the relevant folder.
The file for 10, called 2018032810_test.csv, arrived at 11:15; therefore there is a gap.

The Lagging files parameter determines the processing of the 10 o'clock file:

Lag policy = 0 --> The 10 o'clock file is processed upon arrival, and a gap is created for the previous hours.

Lag policy = 1 --> The 10 o'clock file is not processed yet; it waits until 12:15 and is then processed. [If older files arrive during the waiting time, they are processed first.]

Lag policy = 2 --> The 10 o'clock file is not processed yet; it waits until 13:15 and is then processed. [If older files arrive during the waiting time, they are processed first.]
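As an illustration of the example above (not Anodot's actual implementation), the earliest processing time of the 10 o'clock file can be computed as arrival time plus the lag, assuming an hourly interval and collection at 15 minutes past the hour:

# Sketch: earliest processing time of the 10 o'clock file per lag setting,
# assuming an hourly interval and collection runs at :15 past the hour.
from datetime import datetime, timedelta

arrival = datetime(2018, 3, 28, 11, 15)        # 2018032810_test.csv arrives
for lag in (0, 1, 2):
    process_at = arrival + timedelta(hours=lag)
    print(f"Lag policy = {lag} -> processed at {process_at:%H:%M}")
# Lag policy = 0 -> processed at 11:15
# Lag policy = 1 -> processed at 12:15
# Lag policy = 2 -> processed at 13:15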

11. Click X to accept the scheduling settings.

12. Click NEXT. The Stream Table is displayed.

SEE
Using Data Collectors
Data Manager
Stream Tables
Viewing Stream Summaries
