Document Processors

Specialized parsing tools for all types of documents.

The agent SDK has built-in processors for many standard document types, including CSV, Excel, PowerPoint, Word, and plain text. Sometimes, though, you may need more control over how your data is processed before it is stored in the Vector Store - no one wants bad data!

To facilitate this, you can create a new Spring Bean that implements the DocumentProcessor interface shown below:

DocumentProcessor.java
public interface DocumentProcessor {
    public List<Document> processDocument(Path file, String tenant, String userId) throws Exception;
    public boolean canProcess(Path file, boolean othersHaveProcessed);
}

These two methods are fairly self-explanatory. The processDocument method takes an input file and transforms it into as many RAG Document objects as you would like. The example below shows a very simple processor that splits a text file on a page delimiter and adds some special metadata.

Additionally, it implements the canProcess method, which says, "I should only process documents whose path contains -review-export-".

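import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.core.annotation.Order;
import org.springframework.stereotype.Component;
// Document, Metadata, and DocumentProcessor come from the SDK (imports omitted here).

@Component
@Order(0) // Run first
public class PerformanceReviewProcessor implements DocumentProcessor {
    // Assumes SLF4J is used for logging
    private static final Logger log = LoggerFactory.getLogger(PerformanceReviewProcessor.class);
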
    @Override
    public List<Document> processDocument(Path file, String tenant, String userId) {
        List<Document> documents = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            // Load the entire file so we can then work on splitting it
            StringBuilder wholeFile = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                wholeFile.append(line).append("\n");
            }

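            // split() returns a leading empty chunk when the file starts with the delimiter
            String[] pages = wholeFile.toString().split("Page \\d+:\n===");
            int pageNumber = 0;
            for (String page : pages) {
                if (page.trim().isEmpty()) {
                    continue; // Skip blank chunks so they are neither stored nor numbered
                }

                Map<String, Object> metadata = new HashMap<>();
                metadata.put("tenant", tenant);
                metadata.put("user", userId);
                // Resolve to an absolute path first; getParent() is null for a bare file name
                metadata.put("absolute_directory_path", file.toAbsolutePath().getParent().toString());
                metadata.put("file_name", file.getFileName().toString());
                metadata.put("page", ++pageNumber);

                documents.add(Document.from(page, Metadata.from(metadata)));
            }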
        } catch (Exception e) {
            // Pass the exception along so the stack trace is logged, not just the message
            log.error("Error reading file " + file + ": " + e.getMessage(), e);
        }
        
        return documents;
    }
    
    @Override
    public boolean canProcess(Path file, boolean othersHaveProcessed) {
        // Match on the string form: Path.endsWith/startsWith compare whole path elements, not substrings
        return file.toString().contains("-review-export-");
    }
}
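
The two decisions this processor makes, which files to claim and how to split them into pages, can be tried out in isolation. The sketch below is a minimal stand-alone version using only the JDK; ReviewExportSketch and its method names are hypothetical and not part of the SDK. It also drops blank chunks, because splitting returns a leading empty string whenever the text starts with the delimiter.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the SDK.
final class ReviewExportSketch {

    // Same delimiter as the processor above: "Page <n>:" followed by a "===" line.
    private static final Pattern PAGE_DELIMITER = Pattern.compile("Page \\d+:\n===");

    // Substring match on the full path string, as in canProcess above.
    static boolean isReviewExport(Path file) {
        return file.toString().contains("-review-export-");
    }

    // Splits the text into pages, trimming whitespace and dropping blank chunks.
    static List<String> splitPages(String wholeFile) {
        List<String> pages = new ArrayList<>();
        for (String chunk : PAGE_DELIMITER.split(wholeFile)) {
            String cleaned = chunk.trim();
            if (!cleaned.isEmpty()) {
                pages.add(cleaned);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        Path file = Paths.get("exports", "2024-review-export-01.txt"); // made-up file name
        System.out.println(isReviewExport(file));  // true: substring match on the path string
        System.out.println(file.endsWith(".txt")); // false: Path.endsWith compares whole path elements

        String text = "Page 1:\n===\nStrong year overall.\nPage 2:\n===\nGoals for next year.\n";
        System.out.println(splitPages(text)); // [Strong year overall., Goals for next year.]
    }
}
```

Keeping these steps in small static methods makes them easy to unit test without a Spring context or a Vector Store.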

While the implementation of the processor above is a throwaway, the key concepts to pay attention to are adding Metadata to your Document and ensuring that you clean the Document content. Having the appropriate Metadata is critical to fine-tuning your Content Retrievers so that your searches return just the content you want. Let's see how.
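
As a rough illustration of why the metadata matters, the sketch below filters stored chunks by metadata before any search would run. It uses plain Java collections as a stand-in; the Chunk class and the filter method are hypothetical, and the SDK's actual Content Retriever API is not shown here and may look quite different.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Illustration only: plain Java stand-ins, not the SDK's retriever API.
final class MetadataFilterSketch {

    // Hypothetical stand-in for a stored document: text plus its metadata.
    static final class Chunk {
        final String text;
        final Map<String, Object> metadata;

        Chunk(String text, Map<String, Object> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Builds a chunk carrying a subset of the metadata keys set by the processor above.
    static Chunk chunk(String text, String tenant, String user, int page) {
        Map<String, Object> metadata = new HashMap<>();
        metadata.put("tenant", tenant);
        metadata.put("user", user);
        metadata.put("page", page);
        return new Chunk(text, metadata);
    }

    // Keeps only the chunks whose metadata contains every required key/value pair.
    static List<Chunk> filter(List<Chunk> chunks, Map<String, Object> required) {
        List<Chunk> matches = new ArrayList<>();
        for (Chunk candidate : chunks) {
            boolean allMatch = true;
            for (Map.Entry<String, Object> entry : required.entrySet()) {
                if (!Objects.equals(candidate.metadata.get(entry.getKey()), entry.getValue())) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Chunk> stored = new ArrayList<>();
        stored.add(chunk("Strong year overall.", "acme", "u-1", 1));
        stored.add(chunk("Goals for next year.", "acme", "u-1", 2));
        stored.add(chunk("Unrelated tenant data.", "globex", "u-9", 1));

        Map<String, Object> required = new HashMap<>();
        required.put("tenant", "acme");
        required.put("page", 1);

        for (Chunk match : filter(stored, required)) {
            System.out.println(match.text); // prints "Strong year overall."
        }
    }
}
```

Without the tenant and page keys written at processing time, there would be nothing to filter on, and every chunk would compete in the search.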
