The Banyan data lake is built on top of Google Cloud Storage (GCS). Each merchant brought onto the network is provisioned its own GCS bucket, which only that merchant is allowed to access. The data within the bucket is encrypted at rest.
Each bucket will contain three folders: Input, Historical, and Error. Merchants write data into the Input folder. Once data is loaded into that folder, Banyan kicks off an automated ETL process and moves the data into the Historical folder. The data is not transformed in any way by this move; its presence in the Historical folder simply indicates that it has been processed. If data anomalies are found, the affected records are written to files and placed in the Error folder for examination. Merchants have read/write access to the Input folder, read-only access to the Error folder, and no access to the Historical folder.
Files should be placed within the designated Input folder on a regular cadence; our preference is at least daily.
If the data is not sent as one file, please create a folder within the Input folder for each file type. Examples:
- Product Catalog
Your files may not match the above example, but this is close to how our internal data model stores receipts: a transaction may have multiple items, and it may have multiple payments.
If the data is sent in more than one file, the files must contain common columns, such as primary keys and foreign keys, that make it clear how the data should be joined.
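As a rough illustration of the join requirement, the sketch below checks that two files share at least one column before they are uploaded. The column name `transaction_id` and both file layouts are assumptions made for this example, not names Banyan requires.

```python
import csv
import io


def shared_columns(parent_csv, child_csv):
    """Return the column names that appear in both CSV headers."""
    parent_header = next(csv.reader(parent_csv))
    child_header = next(csv.reader(child_csv))
    return [name for name in parent_header if name in child_header]


# Hypothetical sample data; real files would be opened with open(path, newline="").
transactions = io.StringIO("transaction_id,store_id,total\nT1,S1,12.50\n")
items = io.StringIO("item_id,transaction_id,sku,price\nI1,T1,ABC,12.50\n")

print(shared_columns(transactions, items))  # ['transaction_id']
```

An empty result would suggest that the two files cannot be joined as they stand.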
Do not exceed an uncompressed size of 5 GB per file.
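A size check along these lines could run before each upload. This is a minimal sketch: it assumes the limit means 5 * 1000**3 bytes (the stricter reading of "5 GB") and that it is run against the file before any compression is applied. The function name is hypothetical.

```python
import os
import tempfile

# Assumption: "5 GB" is read as 5 * 1000**3 bytes, the stricter interpretation.
MAX_UNCOMPRESSED_BYTES = 5 * 1000**3


def within_size_limit(path, limit=MAX_UNCOMPRESSED_BYTES):
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit


# Demonstrate on a small temporary file standing in for a real export.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    handle.write(b"transaction_id,total\nT1,12.50\n")
    sample_path = handle.name

print(within_size_limit(sample_path))  # True
os.remove(sample_path)
```

Files that fail the check can be split into several smaller files before upload.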
Create a subdirectory called "manifest" that will contain a daily manifest file, written at the end of your upload process. This file should contain the name of every file you attempted to upload that day.
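The manifest could be produced by something like the sketch below. The text above does not specify a manifest format, so the naming pattern (`manifest/<date>-manifest.txt`) and the one-filename-per-line layout are assumptions for illustration only.

```python
from datetime import date


def build_manifest(filenames, upload_date):
    """Return (manifest object name, manifest body) for one day's upload.

    The naming pattern and one-name-per-line layout are assumptions.
    """
    name = f"manifest/{upload_date.isoformat()}-manifest.txt"
    body = "\n".join(sorted(filenames)) + "\n"
    return name, body


# List every file that was attempted, including any whose upload failed.
name, body = build_manifest(
    ["2022-03-04-transactions-0001.csv", "2022-03-04-items-0001.csv"],
    date(2022, 3, 4),
)
print(name)
print(body, end="")
```

Writing the manifest last means its presence can signal that the day's upload has finished.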
Filenames should include at least the date and the type of data the file contains.
- example: 2022-03-04-transactions-0001.csv
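A small helper can keep filenames consistent with the example above. This is a sketch only: the function name, the four-digit sequence number, and the default `csv` extension are inferred from that single example rather than stated requirements.

```python
from datetime import date


def data_filename(file_date, data_type, sequence, extension="csv"):
    """Build a name of the form YYYY-MM-DD-<type>-<NNNN>.<ext>."""
    return f"{file_date.isoformat()}-{data_type}-{sequence:04d}.{extension}"


print(data_filename(date(2022, 3, 4), "transactions", 1))
# 2022-03-04-transactions-0001.csv
```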
Files will be created in the Error folder when records are caught by our row-level data checks. The merchant should examine this data for completeness and, if there is an issue, revise it and resend it with a subsequent batch. Examples of data checks are:
- Transactions which have no associated items and/or payments
- Items or Payments that do not belong to a transaction
- Extreme outlier amounts such as -$1,000,000
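Merchants who want to catch such records before sending a batch could run similar checks locally. The sketch below is not Banyan's actual implementation; the field names (`transaction_id`, `item_id`, `payment_id`, `total`) and the outlier threshold are assumptions chosen for illustration.

```python
def find_anomalies(transactions, items, payments, max_abs_amount=100_000):
    """Return a list of (record_id, reason) pairs for records that fail a check."""
    txn_ids = {t["transaction_id"] for t in transactions}
    item_txn_ids = {i["transaction_id"] for i in items}
    payment_txn_ids = {p["transaction_id"] for p in payments}
    anomalies = []

    for t in transactions:
        tid = t["transaction_id"]
        if tid not in item_txn_ids:
            anomalies.append((tid, "transaction has no items"))
        if tid not in payment_txn_ids:
            anomalies.append((tid, "transaction has no payments"))
        # The threshold is an assumed value, not a documented limit.
        if abs(t["total"]) > max_abs_amount:
            anomalies.append((tid, "extreme outlier amount"))

    for i in items:
        if i["transaction_id"] not in txn_ids:
            anomalies.append((i["item_id"], "item does not belong to a transaction"))
    for p in payments:
        if p["transaction_id"] not in txn_ids:
            anomalies.append((p["payment_id"], "payment does not belong to a transaction"))
    return anomalies


# Hypothetical sample batch.
transactions = [
    {"transaction_id": "T1", "total": 12.5},
    {"transaction_id": "T2", "total": -1_000_000},
]
items = [
    {"item_id": "I1", "transaction_id": "T1"},
    {"item_id": "I2", "transaction_id": "T9"},
]
payments = [
    {"payment_id": "P1", "transaction_id": "T1"},
    {"payment_id": "P2", "transaction_id": "T2"},
]

for record_id, reason in find_anomalies(transactions, items, payments):
    print(record_id, reason)
```

With this sample batch, T2 is flagged twice (no items, outlier amount) and I2 is flagged as an item without a transaction.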
Files will be placed in this bucket if previously sent records or whole files had the wrong date/time or unique transaction ID. These types of updates cannot go through our normal pipeline and need to be handled in a more precise manner.