Create a connection to ingest files stored in Amazon S3 buckets.
Click New Connection. If you want to use an existing connection, select the connection from the list and proceed to Step 5.
Enter a Source Name for the new connection.
Select the Authentication Type from the drop-down list.
Use the following JSON configuration when S3 is set up with basic authentication using an access key and secret key.
Property | Required | Description |
---|---|---|
inputDirURI | Yes | URI of the input directory/files for ingestion. It tells Pinot where the data resides. |
input.fs.prop.region | Yes | Region of the file system (e.g., us-east-1). |
input.fs.prop.accessKey | Yes | Access key for authentication to the file system. |
input.fs.prop.secretKey | Yes | Secret key for authentication (paired with access key). |
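A minimal sketch of this configuration, assuming a hypothetical bucket path and placeholder credentials:

```json
{
  "inputDirURI": "s3://my-bucket/path/to/data/",
  "input.fs.prop.region": "us-east-1",
  "input.fs.prop.accessKey": "<ACCESS_KEY>",
  "input.fs.prop.secretKey": "<SECRET_KEY>"
}
```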
Follow these steps to create an IAM role for the S3 setup.
Use the following JSON configuration when S3 is set up with IAM-based authentication, where Pinot assumes an IAM role for secure access.
Property | Required | Description |
---|---|---|
inputDirURI | Yes | URI of the input directory/files for ingestion. It tells Pinot where the data resides. |
input.fs.prop.region | Yes | Region of the file system (e.g., us-east-1). |
input.fs.prop.externalId | Yes | External ID of the AWS Account of your StarTree Cloud. |
input.fs.prop.roleArn | Yes | The Amazon Resource Name (ARN) of an AWS IAM role to assume for accessing Amazon S3. This allows Pinot to securely access resources in a different AWS account. Example use case: if Pinot runs in Account A but the S3 bucket is in Account B, Pinot can assume a role in Account B that grants access to the bucket. The role must have permissions such as `s3:List*` and `s3:GetObject` for proper access. |
input.fs.className | Yes | The file system implementation class to access the input directory. Examples: - org.apache.pinot.plugin.filesystem.S3PinotFS (for Amazon S3) - org.apache.pinot.plugin.filesystem.LocalPinotFS (for local file systems) - org.apache.pinot.plugin.filesystem.HadoopPinotFS (for HDFS) |
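A minimal sketch of this configuration; the bucket path, external ID, account ID, and role name are all hypothetical:

```json
{
  "inputDirURI": "s3://my-bucket/path/to/data/",
  "input.fs.prop.region": "us-east-1",
  "input.fs.prop.externalId": "<EXTERNAL_ID>",
  "input.fs.prop.roleArn": "arn:aws:iam::123456789012:role/pinot-s3-access",
  "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS"
}
```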
After you have configured the connection properties, test the connection to ensure it is working.
When the connection is successful, use the following JSON to configure additional data settings:
Property | Required | Description |
---|---|---|
inputFormat | Yes | The format of the input files. Supported values include csv, json, avro, parquet, etc. |
includeFileNamePattern | Yes | The glob pattern to filter which files to include for ingestion. Used when the input directory contains mixed files and only specific files should be ingested. |
excludeFileNamePattern | No | The glob pattern to filter which files to exclude from ingestion. Used when the input directory contains mixed files and specific files should be excluded. |
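A minimal sketch of these settings, assuming CSV input and illustrative glob patterns:

```json
{
  "inputFormat": "csv",
  "includeFileNamePattern": "glob:**/*.csv",
  "excludeFileNamePattern": "glob:**/*.tmp"
}
```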
Configure the record reader to customize how the file format is read during ingestion.
CSV
The CSVRecordReaderConfig is used for handling CSV files, with the following customizable options:
Example: Provide a header when the input file has no headers.
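A sketch with hypothetical column names:

```json
{
  "header": "id,name,age"
}
```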
Example: Provide an alternate header when the input file has a corrupt header.
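A sketch that supplies a replacement header and skips the corrupt one in the file (column names are hypothetical):

```json
{
  "header": "id,name,age",
  "skipHeader": true
}
```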
Example: Use an alternate delimiter when fields are not separated by the default comma delimiter.
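For instance, semicolon-separated fields (the delimiter value is illustrative):

```json
{
  "delimiter": ";"
}
```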
OR
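Tab-separated fields:

```json
{
  "delimiter": "\t"
}
```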
Example: Skip records that are not parseable instead of failing ingestion. Use this option with caution, as it can lead to data loss.
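A sketch using Pinot's skipUnParseableLines option:

```json
{
  "skipUnParseableLines": true
}
```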
Example: Handling CSV files with no header, tab-separated fields, empty lines, and unparsable records.
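A combined sketch covering all four cases at once (column names are hypothetical):

```json
{
  "header": "id,name,age",
  "delimiter": "\t",
  "ignoreEmptyLines": true,
  "skipUnParseableLines": true
}
```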
For a comprehensive list of available CSV record reader configurations, see the Pinot CSV documentation.
AVRO
The AvroRecordReaderConfig supports one configuration option, which controls how Avro logical types are converted.
For example, if the Avro schema type is INT, the logical type is DATE, the conversion applied is a TimeConversion, and the value is V, then the generated date is V days from the epoch start.
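A minimal sketch enabling this conversion; enableLogicalTypes is the flag Pinot's AvroRecordReaderConfig exposes for logical-type handling:

```json
{
  "enableLogicalTypes": true
}
```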
Parquet
For Parquet files, Data Manager provides the ParquetRecordReaderConfig with customizable configurations.
Use Parquet Avro Record Reader:
When this config is used, the Parquet record reader is org.apache.pinot.plugin.inputformat.parquet.ParquetAvroRecordReader.
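A minimal sketch selecting this reader via the corresponding boolean flag:

```json
{
  "useParquetAvroRecordReader": true
}
```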
Use Parquet Native Record Reader:
When this config is used, the Parquet record reader is org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader.
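A minimal sketch selecting this reader via the corresponding boolean flag:

```json
{
  "useParquetNativeRecordReader": true
}
```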
Click Show Sample Data to preview the source data before finalizing the configuration.
Proceed with Data Modeling.