This article describes the three main steps required to create an S3 Parquet Collector:
- Providing read access for Anodot using a cross-account AWS role
- Creating an S3 Parquet Collector data source
- Creating an S3 Parquet Collector stream query
Providing read access for Anodot using a cross-account AWS role
Anodot can access the files in your S3 Parquet Collector bucket only after you explicitly allow it. To grant read access, create a cross-account AWS role for Anodot as described in this section. When following those instructions, replace the references to CUR with S3 Parquet Collector.
IMPORTANT: New S3 Parquet Collector data sources should use the role-based access method mentioned above. The previous method of updating the bucket policy (described in Creating an S3 Parquet Collector Data Source) is currently supported for existing data sources, but is no longer available for new sources. Contact us to ensure safe migration of existing sources without losing data.
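As an illustration of what a read-only cross-account role typically consists of, the sketch below builds the two standard IAM policy documents in Python: a permissions policy that allows listing the bucket and reading its objects, and a trust policy that lets another AWS account assume the role. The function names, the bucket and folder values, the account ID 111122223333, and the optional external ID are hypothetical placeholders, not values published by Anodot; use the values given in the referenced instructions.

```python
import json


def read_only_policy(bucket, folder=""):
    """Permissions policy: list the bucket and read objects under an optional folder."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    folder = folder.strip("/")
    objects_arn = f"{bucket_arn}/{folder}/*" if folder else f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": objects_arn,
            },
        ],
    }


def trust_policy(trusted_account_id, external_id=None):
    """Trust policy: allow principals in the trusted account to assume the role."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if external_id:
        # Assumption: an external ID is only needed if the instructions ask for one.
        statement["Condition"] = {"StringEquals": {"sts:ExternalId": external_id}}
    return {"Version": "2012-10-17", "Statement": [statement]}


if __name__ == "__main__":
    # Hypothetical bucket, folder, and account ID, for illustration only.
    print(json.dumps(read_only_policy("my-parquet-bucket", "exports"), indent=2))
    print(json.dumps(trust_policy("111122223333"), indent=2))
```

The printed JSON can be pasted into the IAM console when creating the role. Restricting the object resource to a folder keeps the grant no wider than the data the collector needs to read.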
Creating an S3 Parquet Collector Data Source
- In the Streams page, click Sources +.
- In Source Types, click START on the S3 Parquet tile.
- In the displayed window, select a bucket region, enter the bucket name, and enter a folder name (optional).
- Click Continue to display the Stream Query window, as shown in the following section.
Creating an S3 Parquet Collector Stream Query
- In the Streams page, choose the S3 Parquet source for which you want to create a stream query.
Note: The streams associated with that source are displayed. If the Streams panel is empty, no stream queries exist for that source.
- Click Streams + to display the Stream Query window.
- At the top of the Stream Query window, set a name for the stream.
- In the Files to Collect section, define a Root Path (optional, such as example-month-il) and the Daily Partition Pattern (such as {{DD}}-{{MM}}-{{YYYY}}).
- In the File Properties section, define a Format (by default, Parquet is selected), and Compression (choose from None or Snappy).
Note: Non-Parquet files cannot be uploaded to the Parquet files directory.
- Click the Measures & Dimensions Edit icon to define measures for the stream.
- To add an item, drag it from the Available fields repository on the left to the relevant Measures or Dimensions panel. At least one dimension and one measure (a numerical value) are required.
- Select a time format from the dropdown menu, and then select a time zone.
- When finished, click X to accept the edits and return to the Stream Query window.
- Click the Schedule File Collection Edit icon to define file collection settings for the stream.
Define the following:
- Set the collection interval in the Collect Files Every field.
- Set the time zone of the data records, which is used for scheduling, in the Files Time Zone field.
- Set the delay according to the time it takes the data to be available in the Delay (Minutes) field.
- Set a time span in the Ignore Files Older Than field.
- Set the number of intervals to wait for lagging files in the Lagging Files Policy field.
For further details on the above fields and their settings, see Editing a Stream Query.
- Click X to confirm the scheduling settings.
- Click NEXT. The Stream Table is displayed, as described in Stream Tables.
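To check that files land where the stream query expects them, it can help to expand the Daily Partition Pattern for a given date. The Python sketch below does that for the example values used above. It assumes that {{DD}}, {{MM}}, and {{YYYY}} stand for a zero-padded day, a zero-padded month, and a four-digit year, and that the expanded pattern is a folder under the Root Path. These are assumptions about how the collector resolves paths, and the function names are hypothetical.

```python
from datetime import date

# Assumed meaning of each pattern token (zero-padded day and month, four-digit year).
TOKENS = {"{{YYYY}}": "%Y", "{{MM}}": "%m", "{{DD}}": "%d"}


def expand_partition(pattern, day):
    """Replace each date token in the pattern with the matching part of the date."""
    for token, fmt in TOKENS.items():
        pattern = pattern.replace(token, day.strftime(fmt))
    return pattern


def daily_prefix(root_path, pattern, day):
    """Join the optional root path and the expanded pattern into an S3 key prefix."""
    parts = [root_path.strip("/"), expand_partition(pattern, day)]
    return "/".join(p for p in parts if p) + "/"


if __name__ == "__main__":
    day = date(2024, 3, 7)
    print(expand_partition("{{DD}}-{{MM}}-{{YYYY}}", day))  # 07-03-2024
    print(daily_prefix("example-month-il", "{{DD}}-{{MM}}-{{YYYY}}", day))
    # example-month-il/07-03-2024/
```

Comparing the printed prefix with the folder names in the bucket is a quick way to spot a mismatch, such as a year-first layout or unpadded day numbers, before the first collection runs.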
See Also:
Using Data Collectors
Collecting and Streaming Data
Stream Tables
Stream Summaries