What Apache PDFBox offers:
- Apache PDFBox is an open-source Java library that lets you work with PDF documents programmatically.
- It provides a wide range of features for creating, manipulating, and extracting data from PDF files.
- PDFBox supports various operations, including text extraction, image extraction, metadata extraction, and more.
Setting up Apache PDFBox:
To get started, add Apache PDFBox to your Java project in one of two ways:
- Download the latest version of PDFBox from the Apache PDFBox website (pdfbox.apache.org) and add the necessary JAR files to your project’s classpath.
- Or use Maven and declare the dependency in your project’s build configuration (pom.xml). Here is the Maven dependency for PDFBox:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.28</version>
</dependency>
Code for Extracting Text from PDF:
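import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
// try-with-resources closes the document even if extraction fails
try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
    String text = new PDFTextStripper().getText(document);
    System.out.println(text);
}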
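If you only need part of a document, PDFTextStripper can be limited to a page range. The following is a minimal sketch, assuming PDFBox 2.0.x as in the Maven dependency above. The class name PageRangeTextExample and the helper extractPages are hypothetical names chosen for illustration, and input.pdf is the same placeholder file as in the snippet above.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical example class, not part of PDFBox.
public class PageRangeTextExample {

    // Extracts text from startPage to endPage (1-based, inclusive).
    static String extractPages(PDDocument document, int startPage, int endPage)
            throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(startPage);
        stripper.setEndPage(endPage);
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        // "input.pdf" is a placeholder path; replace it with your own file.
        try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
            System.out.println(extractPages(document, 1, 2));
        }
    }
}
```

Page numbers here start at 1, so extractPages(document, 1, 2) covers the first two pages.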
Conclusion:
You can use PDFBox in a similar way to extract images and metadata from a PDF.
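As a starting point for metadata extraction, here is a minimal sketch, again assuming PDFBox 2.0.x. It reads the document information dictionary through PDDocument.getDocumentInformation(). The class name PdfMetadataExample and the helper describe are hypothetical names used only for this illustration.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

// Hypothetical example class, not part of PDFBox.
public class PdfMetadataExample {

    // Builds a one-line summary from the document information dictionary.
    // Fields that are not set in the PDF come back as null.
    static String describe(PDDocument document) {
        PDDocumentInformation info = document.getDocumentInformation();
        return "Title: " + info.getTitle()
                + ", Author: " + info.getAuthor()
                + ", Pages: " + document.getNumberOfPages();
    }

    public static void main(String[] args) throws IOException {
        // "input.pdf" is a placeholder path; replace it with your own file.
        try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
            System.out.println(describe(document));
        }
    }
}
```

Many PDFs leave the title or author unset, so real code would usually check those values for null before using them.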