An unusual ETL problem - Excel, SQL Server, Java and Powershell

Traditional ETL scenarios will typically be like this - you get data from multiple sources. You extract data, cleanse it and transform it and finally load it into a data warehouse for analytics. The reporting systems can plug to these warehouses to build reports that are good to look at and easy to understand etc. What if you get a bunch of finished reports with pivoted data and you are asked to extract that data and build dimensions from it? That is an unusual ETL problem and that's exactly what we ran into for one of our customers recently.

The Problem

We have a customer (who shall not be named) and they work with huge brands who are in beverages space. So this customer gets yearly reports of spend, TRP, impressions etc per brand, per country, per year. Considering the said brand has presence in 70 odd countries, we are looking at lot of excel files consisting data that seemingly has no reason or rhyme.

We were tasked with designing a data mart for this. Our Head of Analytics spent a couple of days understanding the business and with the help of our customer, we were able identify facts and dimensions and built we were ready with the design pretty quick.

Let the data be extracted

And now the challenge. How do we go about extracting this data? There were a few challenges

  • The source excel files were encrypted.
  • The data has no fixed pattern. Each excel report was different and files typically had multiple sheets with different kinds of data.
  • Due to the reason stated above, it was difficult to use a tool like SSIS or Talend to extract data.

Sure we could do an as-is load to multiple tables and sanitize data later - we tried that with an SSIS package. But getting SSIS to ignore rows with images on it, merged columns etc was really cumbersome and we were getting nowhere with it.

Java to the rescue

Since we had a deadline to meet, we decided to start talking to each excel file, each sheet and each row and pickup data that we need. We looked at excel libraries for Python and finally decided to go with Java. We fortunately have rock star java programmers in-house, Ram (Director of Engineering) being one of them. The source files were encrypted, password protected in .xlsx format. The encryption can be removed in .xls format using Java so we first had to convert all files to .xls.

Enter Powershell

Since there were more than a thousand files, automating file conversion was the way to go and what better choice than Powershell to do this. This is what we came up with.

function ConvertTo-XLS {  
<#  
.USAGE   
ConvertTo-XLS -Path 'C:\BlackGate' -ToFolder 'C:\Gotham'  
#>
    [cmdletbinding()]
    param (
        [parameter(mandatory=$true)][string]$Path,
        [parameter(mandatory=$false)][switch]$Visible,
        [parameter(mandatory=$false)][string]$ToFolder
    )
    begin {

        $Excel = New-Object -ComObject excel.application
        if ($Visible -eq $true) {
            $Excel.visible = $true
        } else {
            $Excel.visible = $false
        }
        $filetype = "*xlsx"
    } process {
        if (Test-Path -Path $Path) {
            Get-ChildItem -Path $Path -Include '*.xlsx' -recurse | ForEach-Object {
                if ($ToFolder -ne '') {
                    $FilePath = Join-Path $ToFolder $_.BaseName
                    $FilePath += ".xls"
                } else {
                    $FilePath = ($_.fullname).substring(0, ($_.FullName).lastindexOf("."))
                    $FilePath += ".xls"
                }
                $Excel.DisplayAlerts = $false
                $WorkBook = $Excel.workbooks.Open($_.fullname,[Type]::Missing)
                $WorkBook.SaveAs($FilePath, 56)
                $WorkBook.close()
                $OldFolder = $Path.substring(0, $Path.lastIndexOf("\")) + "\old"
            }
        } else {
            return 'No path provided or access has been denied.'
        }
    } end {
        $Excel.Quit()
        $Excel = $null
        [gc]::collect()
        [gc]::WaitForPendingFinalizers()
    }
}

Each excel file had a first sheet which had selector values like Brands, collectionDates, Markets etc, so we decided to extract data based on selector values to avoid the redundant and dummy data. The XSSFWorkbook class in Java comes handy to process excel sheets.

The sample code snippets below read the brands from Sheet1 and based on that extract the values from Sheet2

To read brands from Sheet1:

String sheetName = "Media Investment Data";  
            String sheetNameRef = "Market | Brands";
            String searchStr = "Brand";

            HSSFSheet sheet1 = workbook1.createSheet(sheetName);
            CellStyle cellStyle1 = workbook1.createCellStyle();

            ArrayList<String> brands = new ArrayList<String>();

            org.apache.poi.ss.usermodel.Sheet sheet0 = workbook.getSheet(sheetNameRef);

            Iterator rowIterator0 = sheet0.iterator();
            while (rowIterator0.hasNext()) {

                Row row0 = (Row) rowIterator0.next();
                Iterator cellIterator0 = row0.cellIterator();
                while (cellIterator0.hasNext()) {
                    Cell cell0 = (Cell) cellIterator0.next();
                    if (Cell.CELL_TYPE_STRING == cell0.getCellType()) {
                        if (cell0.getStringCellValue().contains(searchStr)) {
                            cell0 = (Cell) cellIterator0.next();
                            brands.add(cell0.getStringCellValue());
                        }
                    }

                }
            }

Extract data from Sheet2 and update it in a new file without formatting:

search: while (rowIterator.hasNext()) {

                Row row = (Row) rowIterator.next();
                HSSFRow row1 = sheet1.createRow((short) rowCount);
                Iterator cellIterator = row.cellIterator();
                int cellCount = 0;
                boolean hasData = false;
                while (cellIterator.hasNext()) {
                    Cell cell = (Cell) cellIterator.next();
                    HSSFCell cell1 = row1.createCell((short) cellCount);

                    if (Cell.CELL_TYPE_STRING == cell.getCellType()) {
                        if (cell.getStringCellValue().equals("Jan")) {
                            cell1 = row1.createCell((short) cellCount + 2);
                            cellCount += 2;
                        } else if (cell.getStringCellValue().contains("Enter Currency")) {
                            hasData = false;
                            break;
                        }
                        System.out.println(cell.getStringCellValue());
                        cell1.setCellValue(cell.getStringCellValue());
                        hasData = true;
                    } else if (Cell.CELL_TYPE_NUMERIC == cell.getCellType()) {
                        System.out.println(cell.getNumericCellValue());
                        cell1.setCellValue(cell.getNumericCellValue());
                        hasData = true;
                    } else if (Cell.CELL_TYPE_FORMULA == cell.getCellType()) {
                        System.out.println("^^^^^" + cell.getCellFormula());
                        if (cell.getCellFormula().contains(sheetNameRef)
                                && !cell.getCellFormula().contains("VLOOKUP")) {
                            switch (cell.getCachedFormulaResultType()) {
                            case HSSFCell.CELL_TYPE_STRING:
                                rowHeader = cell.getStringCellValue();
                                if (!brands.contains(rowHeader)) {
                                    break search;
                                }
                                CellStyle cellStyle = cell.getCellStyle();
                                System.out.println(cell.getStringCellValue());
                                // cell1.setCellValue(cell.getStringCellValue());
                                // cellStyle1.cloneStyleFrom(cellStyle);
                                // cell1.setCellStyle(cellStyle1);
                                break;
                            case HSSFCell.CELL_TYPE_NUMERIC:
                                System.out.println(cell.getNumericCellValue());
                                // cell1.setCellValue(cell.getNumericCellValue());
                                break;
                            default:
                                // cell1.setCellValue("");
                                break;
                            }
                        } else if (cell.getCellFormula().contains("SUM")) {
                            cell1.setCellValue(cell.getNumericCellValue());
                        }
                    } else {
                        if (cellCount == 0) {
                            cell1.setCellValue(rowHeader);
                        } else {
                            cell1.setCellValue("");
                        }
                    }
                    cellCount++;
                }
                if (hasData == false)
                    sheet1.removeRow(sheet1.getRow(rowCount));
                else {
                    HSSFCell cellx = row1.createCell((short) cellCount);
                    cellx.setCellValue(country);
                    cellx = row1.createCell((short) cellCount + 1);
                    cellx.setCellValue(year);
                    rowCount++;
                }
            }
            // }
            workbook1.write(fop);
            fop.close();
            fis.close();

The Transpose Problem

After writing a lot of java code to deal with different types of files, we finally had data in a format consumable by traditional ETL tools like Talend and SSIS. We quickly came up with a Talend package to load data from processed files to a stage database.

The next challenge was to make this data in stage database fit into our dimensional model. As you would expect from most of the finished reports, we still had data in a heavily hierarchical manner.

  • You can see that the media investment data is present row-wise in months. The dimension are in rows here, time should be moved to a column to represent the relational table.
  • Data also contains NULL and blanks spaces which need to be ignored.
  • Data is sliced by different dimensions for different metrics.

So we had to transpose the data

SELECT camp  
    ,brand
    ,'TIME1' AS TIME
    ,time1 AS value
FROM transpose_test  
WHERE time1 IS NOT NULL

UNION

SELECT camp  
    ,brand
    ,'TIME2'
    ,time2
FROM transpose_test  
WHERE time2 IS NOT NULL  

Once this is done, we were almost done -

  • The warehouse design was that of a hybrid where there are multiple fact tables + normalized dimensions.
  • Dimension loading wasn’t a challenge considering the relational constraints were in place to avoid duplicates and automatically creating surrogate keys.
  • Fact load required mapping back to dimension values to avoid any Null values for the corresponding dimension keys.

We had loads of fun trying to solve this problem. Hope this helps someone. If you have a similar problem, poke us and we will be happy to help.

Arun Britto

Britto is Director, Analytics - does awesome stuff in enterprise ETL, Bigdata and Machine Learning space. To top this all, he is also a marathon runner in his free time.

comments powered by Disqus