E-Commerce ETL Processing Using Spark
Straps Case Study
PRASHIDDHA RAJ JOSHI

GBD SCRAP YARD

GBD SCRAP PROCESS: THE LOAD PHASE
- Load data into a staging area.
- Break down the data into various entities.
- Straps: we had to write a custom Excel parser.
- RGW: we had to create a product catalog and extract information from the database.

GBD SCRAP PROCESS

E-Commerce Entities - BUCKETS
- Product
- Attributes
- Categories
- Images
- Manufacturer
- Pricing

E-Commerce scrap tools
- SQL
- Python
- Manual workforce

ETL Phases
1. Parse Data
2. Load
3. Disassemble
4. Transform
5. Assemble
6. Destination Mapping
7. Export

EXPECTED OUTPUT

STRAPWORKS
Plan executed for optimizations. We had 60 sets to process.
1. Choose a random set based on MIN-MAX.
2. Divide the total SKUs into 4 batches of approx.
3. Run the script for each batch in parallel. It took approximately 4 hours.
4. Check the generated data for accuracy in terms of redundancy.
5. If a data redundancy error is found, start again from step 1.

TOTAL TIME FOR 60 SETS = 4 * 60 = 240 hours = 10 days

ALTERNATIVE SOLUTIONS
- Use Big Data tools like MapReduce and Spark.

SPARK
Apache Spark™ is a fast and general engine for large-scale data processing.

SPARK STANDALONE DEPLOYMENT

WordCount Problem - Hadoop MapReduce

WordCount Problem - Spark Scala

References
- http://spark.apache.org/
- https://databricks.com/
- https://www.linkedin.com/pulse/apache-spark-game-changer-mohan-krishna-mannava
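
The batch plan in the Strapworks slide (split one set's SKUs into 4 batches, run the script on each batch in parallel, then estimate total time for 60 sets) can be sketched roughly as below. This is only an illustrative sketch, not the script actually used: `split_into_batches`, `process_batch`, `run_set`, and `total_hours` are hypothetical names, the per-batch work is a placeholder, and near-equal batch sizes are an assumption, since the slide does not state the batch size.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(skus, n_batches=4):
    """Split a list of SKUs into n_batches contiguous chunks of near-equal size.

    Assumption: the slide only says "4 batches"; equal sizing is our guess.
    """
    size, extra = divmod(len(skus), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches take one additional SKU each.
        end = start + size + (1 if i < extra else 0)
        batches.append(skus[start:end])
        start = end
    return batches


def process_batch(batch):
    """Hypothetical stand-in for the per-batch ETL script (here: normalize SKUs)."""
    return [sku.strip().upper() for sku in batch]


def run_set(skus, n_batches=4):
    """Process one set: split into batches, run them in parallel, merge in order."""
    batches = split_into_batches(skus, n_batches)
    with ThreadPoolExecutor(max_workers=n_batches) as pool:
        results = list(pool.map(process_batch, batches))
    return [row for batch in results for row in batch]


def total_hours(n_sets=60, hours_per_set=4):
    """Total time when sets are processed one after another, as in the slide."""
    return n_sets * hours_per_set
```

With the slide's figures, `total_hours()` gives 240 hours, and `total_hours() / 24` gives 10 days. The sequential handling of the 60 sets is what makes the total grow linearly, which is the motivation the slides give for moving to MapReduce or Spark.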