These tutorials will provide a few technical tutorials that discuss some of the underlying technologies used by Omni Customer Support and other related areas. These tutorials are meant to give back some knowledge to the vast community of software developers who (with Google’s help) have enabled each other to learn a great many things and save many frustrating hours of debugging. These are not production ready examples and may demonstrate poor software practices in an effort to focus in the core message of the tutorial.
This is a quick tutorial to show you how you can leverage two open source components from Apache to create a powerful index of your excel documents. This tutorial assumes you have a basic understanding of Lucene. According to the Lucene wiki in order to index an Excel document, you must first extract the text that you want to index. Apache POI provides a Microsoft Excel Parser to handle this. So out of the box it is pretty easy to index your excel document.
What the parser will do is extract all the text from every field in the Excel document and put it all in one large string allowing you insert it for indexing. In a lot of cases, this may be all you need. But if you are dealing with a known style of document, say inventory list, invoice, purchase orders, or any structured document where you know the format of the excel document, you can index the file in a much smarter way.
I’m going to use a simple invoice, you can see an example here. I downloaded this particular template from Microsoft, so there is a good chance someone is really using it for their invoicing needs. Now I want to be able to search all of my excel documents quickly but be able to search specific fields. To do this I am going to extract specific cells into its own field in the Lucene document.
The Apache POI Project provides the ability to create, read, modify, and write XLS spreadsheets. In this case all we are interested in is the ability to read spreadsheets. It is really quite simple to do.
First we open the xls file using the POFFileSystem and create the workbook. A workbook represents the xls file and contains 1 or more worksheets. If you don’t know a work sheet is represented as a tab at the bottom of Excel. For simplicity, we are assuming we are only dealing with the first work sheet.
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(file));
HSSFWorkbook wb = new HSSFWorkbook(fs);
//Hard coding the first worksheet
HSSFSheet sheet = wb.getSheetAt(0);
Since we know the format of the document, we can pull out the values in each cell that we want to index. To do that, first get the HSSFRow from the sheet. The rows are indexed started at 0, so if you are looking at the excel row number, remember to subtract one. This is an easy mistake to fall into. Once we have the correct row, get the desired cell, again it is 0 based counting. Now that we have the cell, we add a Field to our Lucene Document. If it is a string value, using the getStringCellValue() method, if it is a numeric value, use the getNumericCellValue() method.
//We know the format of the document, so start pulling out cell
//data and adding to our lucene document
//Grab the customer name
HSSFRow row = sheet.getRow(9);
HSSFCell customer = row.getCell((short)0);
doc.add(new Field("customer", customer.getStringCellValue(),
Field.Store.NO, Field.Index.TOKENIZED));
//grab the company
row = sheet.getRow(10);
HSSFCell company = row.getCell((short)0);
doc.add(new Field("company", company.getStringCellValue(),
Field.Store.NO, Field.Index.TOKENIZED));
One thing to note is that even if cells are merged, all the cells are still in the row. You can see this when we are pulling out the amount field from the invoice. The A and B cells are merged, but to pull out the amount cell, the index is still 2 and not 1.
//Loop through the items in this invoice. This has an arbitrary
//limit of 16 items, based on the template.
//Logic should go here to test to see if there are more items or not.
DecimalFormat dollarFormat = new DecimalFormat("#.00");
for (int i =17; i < 34; i++) {
row = sheet.getRow(i);
HSSFCell description = row.getCell((short)0);
HSSFCell amount = row.getCell((short)2);
if (description.getStringCellValue().length() > 0) {
//We are overloading a field by adding multiple instances of that field.
doc.add(new Field("description", description.getStringCellValue(),
Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("amount", dollarFormat.format(amount.getNumericCellValue()),
Field.Store.NO, Field.Index.TOKENIZED));
}
}
Now you have a customized Lucene Document that you can add to your index and be able to search it for specific fields from your invoice. There are still a few loose ends to tie up, but I’ll leave that as an exercise for the readers.
You can download all the source code and sample files under resources section below. When you run the tutorial, it will parse the three sample files and create an index. Then it will prompt you on the console for your query. You can query individual fields by using the Lucene Query Syntax, "customer:mary" for example. It comes with a maven POM file to build and run the tutorial. You must have Maven2 installed. To run the tutorial, unzip the file and run "mvn exec:java"