How to Read a PDF File in Java

By James Cooper

It is not difficult to read PDF files in Java using libraries that are readily available. Reading PDF files allows you to write Java programs that can process the text in those files. One option for reading PDF files is the free, open-source PDFBox library available from Apache. The Eclipse Java development platform makes this job easier and manages the libraries you will be downloading. You need to be familiar with Java programming to make use of these Java libraries.

Gather Needed Libraries

Step 1

Download the Java JDK from Sun's website. Versions are available for Windows, Mac and Linux. Click on the red "Download" button and save the file called "jdk-6uxx-windows-xxx.exe" when prompted. This is an executable file that installs Java on your system: double-click on it to launch the Java installer.

Step 2

Download the Eclipse development system and unzip it into a top-level directory. Select "Eclipse IDE for Java Developers." This will start the download of "eclipse-java-galileo-SR2-win32.zip." Double-click on the file to unzip it after the download is complete. Select the "C:" root directory as the location to unzip Eclipse.

Step 3

Start Eclipse by double-clicking on "eclipse.exe" in the directory you just created by unzipping the Eclipse zip file. In Eclipse, create a project named "PrintPdf." Select "File" then "New" then "Java Project." Type in the project name "PrintPdf" in the dialog box that appears. Be sure that the radio button labeled "Create separate folders for source and class files" is selected. Click "Finish."

Step 4

Create a "lib" folder in your "PrintPdf" project. Right-click on the "PrintPdf" project and select "New" and then "Folder." Enter the name "lib" and click on "Finish."

Step 5

Download Apache "PDFBox.jar" from the Apache site. On the same web page, download the "fontbox-nn.jar" file and the "jempbox-nn.jar" file. In each case, when you click on the jar file, it will take you to a page where you can select one of several servers that can provide this file. Pick a server for each file and the jar file will download. Copy each jar file into the lib directory you just created.

Step 6

Download the Apache log4j.jar package in the same fashion and copy the log4j.jar file into the lib directory. The Apache PDFBox library uses this Apache logging library, so this file needs to be present.

Step 7

Download the Apache Commons Logging package as a zip file. Double-click on the zip file, select the "commons-logging-nn.jar" and extract it into the lib directory.

Step 8

In Eclipse, click on the "lib" directory and press "F5" to refresh it. Make sure that all the jar files you added are displayed.

Step 9

Right-click on the PrintPdf project and select "Properties." Select "Java Build Path" and select the "Libraries" tab. Click on "Add JARs," go to the lib directory you have just created, and add "commons-logging-nn.jar," "fontbox-nn.jar," "jempbox-nn.jar," "log4j-nn.jar," and "pdfbox-nn.jar." Click on "OK."

Write the Code to Read PDFs

Step 1

Right-click on the "src" folder of your "PrintPdf" project and select "New" and then "Package." Create a package using any meaningful name. For example, name the package "com.pdf.util." Click "Finish."

Step 2

Right-click on the package name you just created and select "New" and then "Class." Create a class named "PDFTextParser." Be sure to click the check box marked "public static void main..." so that the system will create a "main" method.

Step 3

Edit the "main" method in the "PDFTextParser" class to contain the following code:

public static void main(String[] args) {
    PDFTextParser pdf = new PDFTextParser("data/javaPDF.pdf");
    // print out results
    System.out.println(pdf.getParsedText());
}

Note that the file you wish to print out is spelled out in the constructor to PDFTextParser ("data/javaPDF.pdf"). It could just as easily be a command line argument:

PDFTextParser pdf = new PDFTextParser(args[0]);

or selected from a GUI.

The main method creates an instance of the PDFTextParser class and then calls its "getParsedText" method.
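
The command-line variant can be tried on its own before PDFBox is involved. The following is a minimal sketch, assuming a hypothetical helper class named PdfPathChooser (it is not part of PDFBox); it falls back to the sample path used in the main method above when no argument is given.

```java
/** Hypothetical helper (not part of PDFBox): picks the PDF path to read. */
final class PdfPathChooser {

    // Default used when no argument is supplied; same sample path as in main above.
    static final String DEFAULT_PATH = "data/javaPDF.pdf";

    /** Returns the first non-blank command-line argument, or the default path. */
    static String choosePath(String[] args) {
        if (args != null && args.length > 0 && !args[0].trim().isEmpty()) {
            return args[0].trim();
        }
        return DEFAULT_PATH;
    }

    public static void main(String[] args) {
        // In the real program this value would be passed to new PDFTextParser(...).
        System.out.println("Reading: " + choosePath(args));
    }
}
```

Falling back to a default keeps the program from failing with an ArrayIndexOutOfBoundsException when it is started without arguments, which is what a bare args[0] would do.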

Step 4

Insert the following code just below the top class line "public class PDFTextParser" that was created for you. The import statements at the start of the listing belong at the top of the file, above the class declaration.

// These imports go above the class declaration.
// The PDFBox package names assume the 1.x jars; other versions may differ.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

private PDFParser parser = null;

// Set up a parser for the given PDF file
public PDFTextParser(String fileName) {
    File file = new File(fileName);
    if (!file.isFile()) {
        System.err.println("File " + fileName + " does not exist.");
        return; // parser stays null; getParsedText() checks for this
    }
    // Set up instance of PDF parser
    try {
        parser = new PDFParser(new FileInputStream(file));
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
    }
}

//-------------------------------
// Extract text from PDF Document
public String getParsedText() {
    if (parser == null) {
        return null; // the file could not be opened in the constructor
    }
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    String parsedText = null;

    try {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        parser.parse();
        cosDoc = parser.getDocument();
        pdDoc = new PDDocument(cosDoc);

        // get list of all pages
        List<?> list = pdDoc.getDocumentCatalog().getAllPages();

        // note that you can print out any pages you want
        // by choosing different values of the start and end page
        pdfStripper.setStartPage(1); // 1-based
        int length = list.size(); // total number of pages
        pdfStripper.setEndPage(length); // last page

        // get the text for the pages selected
        parsedText = pdfStripper.getText(pdDoc);
    } catch (IOException e) {
        System.err.println("An exception occurred in parsing the PDF Document. "
                + e.getMessage());
    } finally {
        try {
            if (cosDoc != null)
                cosDoc.close();
            if (pdDoc != null)
                pdDoc.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return parsedText;
}
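
The start and end page values are ordinary 1-based integers, so a requested range can be checked against the page count before it is handed to the text stripper. The following is only a sketch: PageRange is a hypothetical helper, not part of PDFBox, and the choice to clamp rather than reject out-of-range values is an assumption made for illustration.

```java
/** Hypothetical helper (not part of PDFBox): a 1-based, inclusive page range. */
final class PageRange {
    final int start;
    final int end;

    private PageRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Clamps a requested range to [1, pageCount]. A range that lies entirely
     * past the end collapses to the last page. Throws IllegalArgumentException
     * if the document has no pages or the requested range is reversed.
     */
    static PageRange clamp(int requestedStart, int requestedEnd, int pageCount) {
        if (pageCount < 1) {
            throw new IllegalArgumentException("Document has no pages");
        }
        if (requestedStart > requestedEnd) {
            throw new IllegalArgumentException("Start page is after end page");
        }
        int start = Math.max(1, Math.min(requestedStart, pageCount));
        int end = Math.max(1, Math.min(requestedEnd, pageCount));
        return new PageRange(start, end);
    }

    public static void main(String[] args) {
        // Ask for pages 2 to 50 of a 10-page document.
        PageRange range = PageRange.clamp(2, 50, 10);
        // These values would then go to setStartPage and setEndPage.
        System.out.println(range.start + " to " + range.end); // prints "2 to 10"
    }
}
```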

Step 5

Run the program. Right-click on the PDFTextParser class and click on "Run As" and then on "Java Application." The program should run and print out the text contents of the PDF file you entered in your code.
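
If the run fails with a NoClassDefFoundError or a ClassNotFoundException, one of the jars is probably missing from the build path. A small check along the following lines can show which one. It is a sketch that relies only on the standard Class.forName call; ClasspathCheck is a hypothetical class name, and the three class names listed are ones that the PDFBox, log4j and Commons Logging jars are expected to contain.

```java
/** Hypothetical diagnostic (not part of any library): reports which classes can be loaded. */
final class ClasspathCheck {

    /** Returns true if the named class can be found on the classpath. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found but something it depends on could not be loaded.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] expected = {
            "org.apache.pdfbox.pdmodel.PDDocument",   // pdfbox-nn.jar
            "org.apache.log4j.Logger",                // log4j-nn.jar
            "org.apache.commons.logging.LogFactory"   // commons-logging-nn.jar
        };
        for (String name : expected) {
            System.out.println(name + ": " + (isAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```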

Suppress Log4j Startup Error Message

Step 1

Create a configuration file to suppress the error message that log4j, the Java logging system, prints when it cannot find a configuration file at startup. Right-click on the "src" folder of the PrintPdf project and select "New" and then "File." Name the file "log4j.properties." Eclipse will display an empty editor for this new file.

Step 2

Paste the following lines into the empty editor for the "log4j.properties" file.

# Set root logger level to WARN and its only appender to A1.
log4j.rootLogger=WARN, A1

# A1 is set to be a ConsoleAppender.
log4j.appender.A1=org.apache.log4j.ConsoleAppender

# A1 uses PatternLayout.
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n

Step 3

Save the "log4j.properties" file. The presence of this file in the top-level "src" directory will suppress the log4j startup message and any trivial logging messages. With the root logger set to WARN, log4j will print out only warnings and errors.
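
log4j 1.x looks for "log4j.properties" as a classpath resource, and Eclipse copies files from the "src" folder to the output folder alongside the compiled classes. One way to confirm that the file ended up on the classpath is to ask the class loader for it. This sketch uses only the standard library; Log4jConfigCheck is a hypothetical class name.

```java
import java.net.URL;

/** Hypothetical diagnostic: checks whether a resource is visible on the classpath. */
final class Log4jConfigCheck {

    /** Returns the location of a classpath resource, or null if it is not found. */
    static URL locate(String resourceName) {
        return Log4jConfigCheck.class.getClassLoader().getResource(resourceName);
    }

    public static void main(String[] args) {
        URL url = locate("log4j.properties");
        System.out.println(url != null
                ? "Found log4j.properties at " + url
                : "log4j.properties is not on the classpath");
    }
}
```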

Tips & Warnings

  • There are also a number of commercial packages that you can use to extract text from PDF files, but they can be expensive.