Extracting text from a PDF in Java using Apache PDFBox

What Apache PDFBox offers:

Apache PDFBox is an open-source Java library that lets you work with PDF documents programmatically.
It provides a wide range of features for creating, manipulating, and extracting data from PDF files.
PDFBox supports various operations, including text extraction, image extraction, metadata extraction, and more.

Setting up Apache PDFBox:

To get started, you need to set up Apache PDFBox in your Java project.

You can download the latest version of PDFBox from the Apache PDFBox website (pdfbox.apache.org) and include the necessary JAR files in your project’s classpath.
Alternatively, use Maven and add the dependency to your project’s build configuration (pom.xml). Here is the Maven dependency for PDFBox:

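<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <!-- Add a <version> element for the release you use; the code in this article uses the PDFBox 2.x API. -->
</dependency>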

Code for Extracting Text from PDF:

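import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// PDFBox 2.x API (in 3.x, PDDocument.load was replaced by Loader.loadPDF).
// try-with-resources closes the document even if extraction throws.
try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
    // PDFTextStripper walks every page and returns the text as one String.
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println(text);
}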
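
If you only need part of a document, PDFTextStripper can be limited to a range of pages. The following is a minimal sketch rather than a definitive implementation. It assumes PDFBox 2.x on the classpath, and the class name PageRangeExtractor and the helper extractPages are hypothetical names chosen for this illustration.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class for this article; not part of PDFBox.
public class PageRangeExtractor {

    // Returns the text of pages firstPage..lastPage (1-based, inclusive).
    static String extractPages(PDDocument document, int firstPage, int lastPage)
            throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(firstPage);
        stripper.setEndPage(lastPage);
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        // "input.pdf" is the same placeholder file name used above.
        try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
            // Extract at most the first two pages.
            int lastPage = Math.min(2, document.getNumberOfPages());
            System.out.println(extractPages(document, 1, lastPage));
        }
    }
}
```

Page numbers in PDFTextStripper start at 1, so passing the same value for both arguments extracts a single page.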

Conclusion:

You can use PDFBox in the same way to extract images and metadata from a PDF.
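
As an illustration of the metadata case, here is a minimal sketch that reads a few fields from the document information dictionary. It assumes PDFBox 2.x, and the class name MetadataReader and the helper describe are hypothetical names used only for this example. Fields that are not set in the PDF come back as null.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

// Hypothetical helper class for this article; not part of PDFBox.
public class MetadataReader {

    // Builds a one-line summary from the document information dictionary.
    static String describe(PDDocument document) {
        PDDocumentInformation info = document.getDocumentInformation();
        return "Title: " + info.getTitle()
                + ", Author: " + info.getAuthor()
                + ", Pages: " + document.getNumberOfPages();
    }

    public static void main(String[] args) throws IOException {
        // "input.pdf" is a placeholder file name.
        try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
            System.out.println(describe(document));
        }
    }
}
```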
