Overview & Challenge
Operating in an industry where data arrives in large volumes and in many different formats requires integration to enable a seamless flow and transformation. Yet even with integration in place, transforming all of that data can take too long, and the process needs to be faster.
One of our clients in the retail industry exchanges data with many manufacturers, receiving files over several channels (FTP, API, e-mail, etc.). A MuleSoft flow transforms the received data into a unified file format. A single file can contain anywhere from several hundred to millions of records, typically about 5–20 GB of data. To speed up the data load into the database in a secure manner, we decided to set up a data pipeline that automates the process and uses S3 buckets for file uploads from the MuleSoft flow.
⋮IWConnect’s analysis and development team proposed and built a Data Load solution that loads data from multiple files into Aurora RDS in parallel while keeping track of each file’s status. The AWS services involved in the solution are:
- An S3 bucket structure for receiving files from different vendors;
- A trigger that publishes a message to an SNS topic when a new file arrives in S3;
- A Lambda function that loads the data into Aurora RDS (MySQL engine);
- A Lambda function that indexes the data in Elasticsearch;
- A process that keeps track of the number of processed files using DynamoDB.
Benefits
- The database and Elasticsearch are placed in a private subnet and are available only to internal apps with special permissions, so the data is fully secured.
- Fast data transformation with MuleSoft and fast data loading with AWS managed services (S3, Lambda, and Aurora RDS).
- Because MuleSoft unifies the data, the AWS solution is built once and reused across the client’s different providers.
- Data can be loaded into Aurora RDS as a transactional or operational database, but also into ETL-ready databases such as Redshift.
- A tracking process monitors file processing and notifies the client about each file’s status and any problems.
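To make the pipeline concrete, the core of the S3-triggered load step can be sketched as a Lambda handler like the one below. This is a minimal illustration, not the client's actual implementation: the table schema, column names, environment variables, and the `file-status` DynamoDB table are all assumptions made for the example, and the real solution also fans out through SNS and indexes into Elasticsearch.

```python
import csv
import io
import os

BATCH_SIZE = 1000


def chunk(rows, size=BATCH_SIZE):
    """Yield lists of at most `size` rows, so inserts run in batches."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def handler(event, context):
    # boto3 ships with the Lambda runtime; pymysql would be bundled in
    # the deployment package. Imported lazily to keep the sketch light.
    import boto3
    import pymysql

    # Pull the bucket/key of the newly arrived file from the S3 event.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    reader = csv.reader(io.TextIOWrapper(body, encoding="utf-8"))

    # Aurora endpoint lives in the private subnet; credentials come from
    # environment variables (illustrative names).
    conn = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
    )
    loaded = 0
    try:
        with conn.cursor() as cur:
            for batch in chunk(reader):
                # Hypothetical target table and columns.
                cur.executemany(
                    "INSERT INTO records (sku, qty, price) VALUES (%s, %s, %s)",
                    batch,
                )
                loaded += len(batch)
        conn.commit()
    finally:
        conn.close()

    # Record the file's status so the tracking process can notify the client.
    boto3.resource("dynamodb").Table("file-status").put_item(
        Item={"file_key": key, "status": "LOADED", "records": loaded}
    )
    return {"file": key, "records": loaded}
```

Batching the inserts (rather than one round trip per record) is what keeps multi-gigabyte files loading quickly, and writing a status item per file is what lets the DynamoDB-backed tracking process count processed files and surface failures.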