Batch Onboarding Technology Options
Batch: Google Cloud Storage (GCS)
The Banyan data lake is built on top of Google Cloud Storage (GCS). Each merchant brought onto the network is provisioned their own GCS bucket that only they are allowed to access. The data within the bucket is encrypted at rest.
Each bucket contains three folders: Input, Historical, and Error. Merchants write data into the Input folder. Once data is loaded into the folder, an automated ETL process is kicked off by Banyan and the data is moved into the Historical folder. The data is not transformed in any way; the move simply indicates that it has been processed. If data anomalies are found, those specific records are written to files and placed in the Error folder for examination. Merchants have read/write access to the Input folder, read-only access to the Error folder, and no access to the Historical folder.
Folders and File Naming Formats
Input
- Files should be placed within the designated Input folder on a regular cadence; our preference is at least daily.
- If the data is not sent as one file, please create a folder within the Input folder for each file type. Examples:
  - Transactions
  - Items
  - Payments
  - Stores
  - Product Catalog
- Your files may not match the above example, but this is close to how our internal data model stores receipts: a transaction may have multiple items, and it may have multiple payments.
- If the data is sent in more than one file, the files must contain common columns that make it clear how we will join the data, such as primary keys and foreign keys.
- Do not exceed an uncompressed size of 5GB per file.
- Create a subdirectory called "manifest" that will contain a daily manifest file written at the end of your upload process. This file should contain the names of all files that were attempted to be uploaded on that day. A minimal upload and manifest sketch follows this list.
- Filenames should include at least the date and the type of data they contain.
  - example: 2022-03-04-transactions-0001.csv
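This document does not prescribe an upload tool, so the following is a minimal daily-upload sketch assuming the gsutil CLI; the bucket name gs://banyan-merchant-COMPANY_NAME, the plain-text manifest format, and the Input/manifest location are placeholders to be confirmed during onboarding.

#!/usr/bin/env bash
# Hypothetical daily upload into the GCS Input folder (sketch, not a Banyan specification).
TODAY=$(date +%F)                               # e.g. 2022-03-04
BUCKET="gs://banyan-merchant-COMPANY_NAME"      # placeholder; use the bucket Banyan provisions for you
# Upload each file type into its own subfolder under Input.
gsutil cp "${TODAY}"-transactions-*.csv "${BUCKET}/Input/Transactions/"
gsutil cp "${TODAY}"-items-*.csv "${BUCKET}/Input/Items/"
gsutil cp "${TODAY}"-payments-*.csv "${BUCKET}/Input/Payments/"
# Write the manifest last, listing every file attempted today (assumed plain-text format).
ls "${TODAY}"-*.csv > "${TODAY}-manifest.txt"
gsutil cp "${TODAY}-manifest.txt" "${BUCKET}/Input/manifest/"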
Error
Files will be created in this folder when records are caught by our row-level data checks. This data should be examined for completeness by the merchant and, if there is an issue, revised and resent with a subsequent batch. Examples of data checks are:
- Transactions which have no associated items and/or payments
- Items or Payments that do not belong to a transaction
- Extreme outlier amounts such as -$1,000,000
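For reviewing these files from the command line, a sketch assuming gsutil and the same placeholder bucket name as above:

gsutil ls gs://banyan-merchant-COMPANY_NAME/Error/
mkdir -p ./error-review
gsutil cp "gs://banyan-merchant-COMPANY_NAME/Error/*" ./error-review/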
Updated
Files will be placed in this folder if there was an issue with records or whole files sent previously that had the wrong date/time or unique transaction ID. These types of updates cannot go through our normal pipeline and need to be handled in a more precise manner.
Amazon S3
Introduction
In order to integrate with Banyan's AWS solution, you will use the AWS CLI sync command.
What we'll need from you
If your data is located in S3
This option is for situations where you have your files already in S3, and would like to move them from your bucket to ours.
- Company Name
- AWS Account ID
- The bucket name of the data that will be copied over
- The KMS Key ARN if your data is encrypted at rest
If the location is other than S3
This option is for situations where your data is on a drive or database and can be copied directly from a server.
- Company Name
- AWS Account ID
Creating an IAM user
Before you can assume the IAM role created for your company, you must create an IAM user that you will use to assume that role; the role we create references both the IAM username and the AWS Account ID that you provided us.
Navigate to the IAM console, then:
- Create a new user named banyan_input_s3.
- Select programmatic access.
- Hit next until you are shown your security credentials. Be sure to record them safely.
- Once the user is created, attach the following inline policy.
Important
Make sure to replace COMPANY_NAME with your company name.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Resource": "arn:aws:iam::356687812700:role/banyan_input_s3_merchant_COMPANY_NAME"
}
]
}
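If you prefer to script this instead of clicking through the console, the same user, access key, and inline policy can be created with the standard AWS CLI commands below. This is a sketch: the local file name assume_banyan_role.json and the policy name banyan-assume-role are placeholders of our own, assuming you saved the policy above to that file.

aws iam create-user --user-name banyan_input_s3
aws iam create-access-key --user-name banyan_input_s3
aws iam put-user-policy --user-name banyan_input_s3 --policy-name banyan-assume-role --policy-document file://assume_banyan_role.json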
Preparation before assuming the role
Note
- Install the latest version of the AWS CLI.
- The following work needs to be done on a UNIX-like system.
First, we need to make sure you have the right environment setup.
If you've used the AWS CLI before
If you have used the AWS CLI in the past and have run the aws configure command, you will have a folder called .aws in the home directory of the OS user you are logged in as, with all the files needed in place. You can skip the next step.
If you've never used the AWS CLI before
If you have never used the AWS CLI, follow the instructions below to create the folder and files.
- mkdir ~/.aws - creates the folder in your home directory.
- touch ~/.aws/config - creates an empty file for the AWS CLI configuration.
- touch ~/.aws/credentials - creates an empty file where the user credentials will go.
Set up the credentials
With the folder structure and files now in place, add the following content to the ~/.aws/config file, making sure to replace COMPANY_NAME with the one you provided:
[profile banyan]
role_arn = arn:aws:iam::356687812700:role/banyan_input_s3_merchant_COMPANY_NAME
source_profile = banyan_credentials
In the ~/.aws/credentials file, add the following content, making sure to replace DATA with the values you saved when creating the IAM user:
[banyan_credentials]
aws_access_key_id=DATA
aws_secret_access_key=DATA
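To confirm the profile works before moving any data, you can run a standard identity check; if the role assumption succeeds, the returned ARN should reference the banyan_input_s3_merchant_COMPANY_NAME role.

aws sts get-caller-identity --profile banyan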
Give us access to your resources
At this point all the policies are set for your role on our side. However, because this is cross-account access, we cannot simply state in our policy that you can copy data from your bucket to ours; if that were possible, anyone could access resources in other accounts. For this reason, you have to give our account explicit access to your data for the role you are going to assume.
Bucket policy
- Go to the bucket whose name you provided to us.
- In the Permissions tab, scroll down to the Bucket policy section.
- Click Edit, and add the following policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::356687812700:root"
},
"Action": [
"s3:ListBucket",
"s3:GetObject",
"s3:GetObjectTagging"
],
"Resource": [
"arn:aws:s3:::YOUR-BUCKET-NAME/*",
"arn:aws:s3:::YOUR-BUCKET-NAME"
]
}
]
}
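If you manage the bucket from the command line rather than the console, the same policy can be applied with the standard s3api command below (a sketch, assuming the policy above is saved locally as banyan_bucket_policy.json). Note that put-bucket-policy replaces the existing bucket policy, so merge this statement into your current policy first if you already have one.

aws s3api put-bucket-policy --bucket YOUR-BUCKET-NAME --policy file://banyan_bucket_policy.json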
KMS Key Policy
If you also provided us a KMS key ARN, you must also update the key policy to allow our account to use the key to decrypt your data in the bucket. Add the following policy statement to the already existing key policy:
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::356687812700:root"
},
"Action": "kms:Decrypt",
"Resource": "THE-FULL-ARN-OF-THE-KEY"
}
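If you manage the key from the command line, one possible approach is sketched below; key policies are written as a whole document, so keep your existing statements when editing.

aws kms get-key-policy --key-id THE-FULL-ARN-OF-THE-KEY --policy-name default --output text > key_policy.json
# edit key_policy.json to add the statement above, then write it back:
aws kms put-key-policy --key-id THE-FULL-ARN-OF-THE-KEY --policy-name default --policy file://key_policy.json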
Transferring data to our bucket
Important
Make sure to replace COMPANY_NAME with your company name.
Now that the data is in the right place, you can run the appropriate command below in the terminal. It will copy your data (from your bucket, or from the folder in which you run the command) using the profile you just created above; the CLI takes care of the IAM role assumption.
Bucket to Bucket
In the terminal, run the following command to copy data from your bucket to ours.
aws s3 sync s3://YOUR-BUCKET-NAME s3://by-production-us-east-1-input-s3-COMPANY_NAME --delete --profile banyan
Drive to Bucket
In the terminal, run the following command to copy data from the local drive to our bucket.
aws s3 sync . s3://by-production-us-east-1-input-s3-COMPANY_NAME --delete --profile banyan
Additionally, you can copy, list, and delete individual objects in our bucket with the same profile, as sketched below.
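These one-off operations use the standard aws s3 subcommands with the same profile and destination bucket; the object name is just an example.

aws s3 cp 2022-03-04-transactions-0001.csv s3://by-production-us-east-1-input-s3-COMPANY_NAME/ --profile banyan
aws s3 ls s3://by-production-us-east-1-input-s3-COMPANY_NAME/ --profile banyan
aws s3 rm s3://by-production-us-east-1-input-s3-COMPANY_NAME/2022-03-04-transactions-0001.csv --profile banyan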
Snowflake
Introduction
When sending data to Banyan via Snowflake, you will create a view that is shared with the Banyan Snowflake account hosted on AWS.
Credentials
Schedule
File Format
SFTP
Introduction
Our SFTP solution allows you to send us data to ingest. When you send us the files, we process them in real time and subsequently delete the files from the server.
Connection requirements
You can use any SFTP client (for example, FileZilla) that supports an SSH private key for authentication.
Credentials
Once the contract is signed, you will generate an SSH public/private key pair, and share the public key with us. Banyan will then provide you with the hostname and user name for your SFTP server, which you will access with your private SSH key.
In general:
- Server address: YOUR_COMPANY_NAME.sftp.getbanyan.com
- User: sftpuser
- Authentication: SSH private key
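For example, with the OpenSSH client (the private key path is a placeholder for wherever you stored the key whose public half you shared with us):

sftp -i ~/.ssh/banyan_sftp_key sftpuser@YOUR_COMPANY_NAME.sftp.getbanyan.com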
Where to upload
Once you are logged in to the server, please use the input folder under the data folder to upload the files. Within this folder, create subdirectories for each "type" of file you are uploading. If you have flattened your data into a single file, you do not have to do this step. Examples of file types:
- Transactions
- Items
- Product Catalog
- Tenders
- Stores
Within the /data directory, also include a subdirectory called "manifest" that will contain a daily manifest file written at the end of your upload process. This file should contain the names of all files that were attempted to be uploaded on that day, so the Banyan ingestion process can check for missing files and delay ingestion if needed. A batch-upload sketch follows below.
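A minimal non-interactive upload sketch, assuming the OpenSSH sftp client; the batch file name, key path, subdirectory names, and plain-text manifest format are placeholders of our own. First create a small batch file, e.g. upload-2022-03-04.batch (the leading "-" tells sftp to ignore errors, such as a directory that already exists):

-mkdir data/input/transactions
-mkdir data/input/items
put 2022-03-04-transactions-0001.csv data/input/transactions/
put 2022-03-04-items-0001.csv data/input/items/
put 2022-03-04-manifest.txt data/manifest/

Then run it:

sftp -i ~/.ssh/banyan_sftp_key -b upload-2022-03-04.batch sftpuser@YOUR_COMPANY_NAME.sftp.getbanyan.com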
Schedule
Depending on your infrastructure, we would prefer to receive the files at least once per day, or as otherwise specified in your agreement with Banyan.
File format
- We recommend that files be no larger than 2GB.
- Filenames should include the date of upload, the file type, and a part number if the data is broken into multiple files within a single day.
  - example: 2022-03-04-transactions-0001.csv
Caveats
When you write your custom implementation to upload data to our SFTP server, make sure to take the following scenarios into account:
- The server can become unavailable for a short period of time. Make sure to have a retry mechanism in place (a sketch follows this list).
- The signature of the server might change due to hardware failure or changes in the hardware configuration. Make sure to take this into account (the URI won't change).
- Due to a limitation of the cloud storage backend, files cannot be updated in place: they must first be deleted, and then re-uploaded.
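A minimal retry sketch for the batch upload shown earlier; the attempt count and delay are arbitrary placeholders.

for attempt in 1 2 3; do
    if sftp -i ~/.ssh/banyan_sftp_key -b upload-2022-03-04.batch sftpuser@YOUR_COMPANY_NAME.sftp.getbanyan.com; then
        echo "upload succeeded on attempt ${attempt}"
        break
    fi
    echo "upload failed on attempt ${attempt}; retrying in 60 seconds" >&2
    sleep 60
done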