Google Cloud Storage

This article contains sections on:

Creating a Google Cloud Storage Source 
Creating and Editing a Stream Query
Recommended Practices

CREATING A GOOGLE CLOUD STORAGE SOURCE

  1. In the Data Manager window, click Sources +.
  2. In the Source Types window, click START on the Google Storage tile. The Google Storage dialog box is displayed.
  3. Either enter your Google Service Account (see Google Service Accounts), or click Sign in with Google and follow the on-screen instructions. The Stream Query Google Cloud Storage window is displayed.
  4. In the Files to Collect panel:
    a/ Choose a project from the drop-down menu.
    b/ Choose a bucket from the drop-down menu. 
  5. In the Files Path field, enter a path relative to the data source.
    [Optional] You can further restrict access to the source by continuing the path to a specific folder within the bucket.
  6. Choose a File Name Date Pattern and collection interval:

    Filename Timestamp Pattern   Example                           Supported Intervals

    yyyyMMdd                     test_20180715_daily.csv.gz        Daily

    yyyyMMddHH                   test_2018071503_hourly.csv.gz     Daily, hourly [1]
                                 test_2018071504_hourly.csv.gz
                                 test_2018071505_hourly.csv.gz

    yyyyMMddHHmm                 test_201807150305_5min.csv.gz     Daily, hourly,
                                 test_201807150310_5min.csv.gz     15 minutes, 5 minutes [2]
                                 test_201807150300_15min.csv.gz
                                 test_201807150315_15min.csv.gz
                                 test_201807150330_15min.csv.gz
                                 test_201807150345_15min.csv.gz

    [1] With a daily interval, the most recent file is used.
    [2] For daily, hourly, 15 minutes, 5 minutes, 1 minute: the most recent file is used.


  7. In the File name prefix field, enter the prefix (e.g. test_ ).
  8. In the File name suffix field, enter the suffix and extension (e.g. _demo.csv.gz).

    Note

    • In each of the Timestamp/Prefix/Suffix fields enter the specified data only.
    • Files may be compressed [.gz]
    • If you get an error message, fix the error in the bucket or files, and click RETRY to re-fetch the files.
  9. [Optional] To preview the file, click the Eye icon.
  10. To change the Parsing and Import Settings, click the File Settings icon.
    See CSV - Parsing and Importing Settings.
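
The naming scheme in steps 6 to 8 (prefix, then timestamp, then suffix) can be illustrated with a short Python sketch. It is not Anodot code: the function names and the mapping from the patterns in the table to Python `strftime` formats are assumptions made for illustration only.

```python
from datetime import datetime

# Assumed mapping from the patterns in the table to Python format strings.
PATTERNS = {
    "yyyyMMdd": "%Y%m%d",
    "yyyyMMddHH": "%Y%m%d%H",
    "yyyyMMddHHmm": "%Y%m%d%H%M",
}


def build_file_name(prefix, pattern, suffix, when):
    """Compose a file name as prefix + timestamp + suffix."""
    return prefix + when.strftime(PATTERNS[pattern]) + suffix


def parse_file_time(name, prefix, pattern, suffix):
    """Return the timestamp embedded in name, or None if name does not match."""
    if not (name.startswith(prefix) and name.endswith(suffix)):
        return None
    stamp = name[len(prefix):len(name) - len(suffix)]
    # Check length and digits explicitly so that partial timestamps are rejected.
    if len(stamp) != len(pattern) or not stamp.isdigit():
        return None
    try:
        return datetime.strptime(stamp, PATTERNS[pattern])
    except ValueError:
        return None
```

For example, `build_file_name("test_", "yyyyMMddHH", "_hourly.csv.gz", datetime(2018, 7, 15, 3))` reproduces the first hourly file name in the table. A check like this can help confirm that your files match the prefix, pattern, and suffix you plan to enter before you create the source.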

CREATING AND EDITING A STREAM QUERY

If you just created a Google Cloud Storage data source, skip to step 3.

  1. In the Data Manager source panel, choose the source to which you want to add a stream query; the streams associated with that source are displayed in the Streams panel.
    Note If the Streams panel is empty, no stream queries exist for that source.
  2. Click Stream + to open the Stream Query window.
    Note To edit a stream, click the More icon > Edit.
  3. In the Measures & Dimensions panel, click the Pen icon.
  • To add an item, drag from the Available Fields repository to the relevant Measure or Dimension panel. At least one dimension and one measure [a numerical value] are required.
    Note Use Search if an item is not displayed.
  • To clear an item, click the X icon on the item.
  • Choose a time format from the drop-down menu.
  • Choose a time zone. 

4. Click X to accept the edits and return to the Stream Query window. 
5. In the Schedule File Collection panel, click the Pen icon.

  • From the Collect files every menu, set the collection interval [daily/ hourly/ 15 min/ 5 min].
    Note The timestamp at the start of the source file name indicates the period of the hourly or daily data. The timestamps of the rows in a file increase incrementally, hourly or daily accordingly.
  • Set a File Time Zone so that Anodot knows how to interpret the timestamps inside the file.
  • Set an Ignore files older than time span.
    Note "Older than" refers to backfilling of historical data and does not affect ongoing collection.
  • Set a Lagging files policy: the number of intervals to wait for lagging files.

Note In general, the system keeps the data ordered and creates as few gaps as possible. It therefore applies the following heuristics:

1. While there is no new data, meaning no new files, the system keeps retrying collection from the last file collected. In this case, the Lagging files parameter has no effect.
2. When a gap is followed by new files, the Lagging files parameter states how many intervals the system should wait before processing the new files. This keeps new files from being held too long because of missing files that arrive very late or never arrive.

Example
File 2018032806_test.csv was collected successfully, so Anodot has collected the data up to 2018/03/28 at 06:59.
The files for hours 07, 08, and 09 have not arrived in the relevant folder.
The file for hour 10, called 2018032810_test.csv, arrived at 11:15, so there is a gap.

The Lagging files parameter determines when the 10 o'clock file is processed:

Lag policy = 0 --> The 10 o'clock file is processed upon arrival, and a gap is created for the previous hours.

Lag policy = 1 --> The 10 o'clock file is not processed yet; it waits until 12:15 and is then processed. [If older files arrive during the waiting time, they are processed first.]

Lag policy = 2 --> The 10 o'clock file is not processed yet; it waits until 13:15 and is then processed. [If older files arrive during the waiting time, they are processed first.]

6. Click X to accept the scheduling settings.

7. Click NEXT. The Stream Table is displayed.
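
The Lagging files example in step 5 can be reproduced with a small Python sketch. It is inferred from the worked example only (the new file is held for lag policy × interval after it arrives) and is not Anodot's implementation; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def missing_periods(last_collected, new_file, interval):
    """List the period start times expected between two collected files."""
    missing = []
    t = last_collected + interval
    while t < new_file:
        missing.append(t)
        t += interval
    return missing


def earliest_processing_time(arrived_at, lag_policy, interval, has_gap):
    """Return when a newly arrived file would be processed.

    Assumption taken from the example: with a gap, the file is held for
    lag_policy intervals after its arrival; without a gap it is processed
    on arrival.
    """
    if not has_gap:
        return arrived_at
    return arrived_at + lag_policy * interval


hour = timedelta(hours=1)
gap = missing_periods(datetime(2018, 3, 28, 6), datetime(2018, 3, 28, 10), hour)
arrived = datetime(2018, 3, 28, 11, 15)
for lag in (0, 1, 2):
    when = earliest_processing_time(arrived, lag, hour, has_gap=bool(gap))
    print(f"Lag policy = {lag}: processed at {when:%H:%M}")
```

The sketch does not model older files arriving during the waiting time; as the example notes, those files are processed first.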

RECOMMENDED PRACTICES

Before uploading a data set file to a Google Cloud Storage source type:

  1. Prepare the file in a test folder; check that the file format meets the Google Storage data source file parameters. 
  2. Records within files must be sorted chronologically.
    Note Anodot automatically sorts files smaller than 16 MB by timestamp.
  3. Move the file to the production folder only after you have verified that the file is complete.

Note

  • If the file timestamp is in epoch seconds or epoch milliseconds, do not define a time zone; defining one is treated as an error.
  • The file must be UTF-8 encoded and use the ASCII character set.
  • A Google Storage bucket can be accessed from one Anodot account.
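
Some of the practices above can be checked locally in the test folder before a file is moved to production. The following Python sketch is one possible check, not an Anodot tool: it assumes a CSV file with a header row, and the function name, column name, and time format are placeholders to replace with your own.

```python
import csv
import io
from datetime import datetime


def check_rows(text, time_column, time_format):
    """Return a list of problems found in CSV text; an empty list means none."""
    problems = []
    if not text.isascii():
        problems.append("non-ASCII characters found")
    previous = None
    reader = csv.DictReader(io.StringIO(text))
    # Line 1 is the header, so data rows start at line 2
    # (assuming no field spans several lines).
    for line_no, row in enumerate(reader, start=2):
        current = datetime.strptime(row[time_column], time_format)
        if previous is not None and current < previous:
            problems.append(f"row on line {line_no} is out of chronological order")
        previous = current
    return problems


sample = "timestamp,value\n2018-07-15 03:00,1\n2018-07-15 04:00,2\n"
print(check_rows(sample, "timestamp", "%Y-%m-%d %H:%M"))
```

For a compressed file, the text can be read first with `gzip.open(path, "rt", encoding="utf-8")` and then passed to the same function.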

SEE
Using Data Collectors
Data Manager
Stream Tables
Stream Summaries
