How to Read a PDF File in Java

Reading PDF files in Java is straightforward with readily available libraries. Once you can extract the text from a PDF file, your Java programs can process it like any other string. One option is PDFBox, a free, open-source library available from Apache. The Eclipse Java development platform makes the job easier by managing the libraries you will be downloading. You need to be familiar with Java programming to make use of these libraries.

Gather Needed Libraries

Step

Download the Java JDK from Sun's website (Sun is now part of Oracle). The download is an executable installer, and versions are available for Windows, Mac and Linux. Click on the red "Download" button and, when prompted, save the file called "jdk-6uxx-windows-xxx.exe." Double-click on the saved file to launch the Java installer.

Step

Download the Eclipse development system. Select "Eclipse IDE for Java Developers." This will start the download of "eclipse-java-galileo-SR2-win32.zip." After the download is complete, double-click on the file to unzip it, and select the "C:" root directory as the location so that Eclipse ends up in a top-level directory.

Step

Start Eclipse by double-clicking on "eclipse.exe" in the directory you just created by unzipping the Eclipse zip file. In Eclipse, create a project named "PrintPdf." Select "File," then "New," then "Java Project." Type the project name "PrintPdf" in the dialog box that appears. Make sure the radio button labeled "Create separate folders for source and class files" is selected. Click "Finish."

Step

Create a "lib" folder in your "PrintPdf" project. Right-click on the "PrintPdf" project and select "New" and then "Folder." Enter the name "lib" and click on "Finish."

Step

Download the Apache "PDFBox.jar" file from the Apache site. On the same web page, download the "fontbox-nn.jar" file and the "jempbox-nn.jar" file. In each case, clicking on the jar file takes you to a page where you can select one of several servers that provide the file. Pick a server for each file and the jar will download. Copy each jar file into the lib directory you just created.

Step

Download the Apache log4j.jar package in the same fashion and copy the log4j.jar file into the lib directory. The Apache PDFBox library uses this Apache logging library, so this file needs to be present.

Step

Download the Apache Commons Logging package as a zip file. Double-click on the zip file, select the "commons-logging-nn.jar" file and extract it into the lib directory.

Step

In Eclipse, click on the "lib" directory and press "F5" to refresh it. Make sure that all the jar files you added are displayed.

Step

Right-click on the "PrintPdf" project and select "Properties." Select "Java Build Path" and select the "Libraries" tab. Click on "Add JARs," go to the lib directory you have just created, and add "commons-logging-nn.jar," "fontbox-nn.jar," "jempbox-nn.jar," "log4j-nn.jar," and "pdfbox-nn.jar." Click on "OK."
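A quick way to confirm that the jars really are on the build path is to try loading one class from each. The following is a minimal sketch rather than part of the original tutorial: the class name "ClasspathCheck" and the helper "isOnClasspath" are made up for illustration, and the PDFBox class name assumes a 1.x release of the library.

```java
// Hypothetical helper: reports whether the libraries added to the build path can be loaded.
class ClasspathCheck {

    // Returns true if the named class can be found on the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] expected = {
            "org.apache.pdfbox.pdfparser.PDFParser", // pdfbox-nn.jar (1.x package name, assumed)
            "org.apache.commons.logging.Log",        // commons-logging-nn.jar
            "org.apache.log4j.Logger"                // log4j-nn.jar
        };
        for (String name : expected) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it from inside the "PrintPdf" project so that it sees the same build path. Any line reported as "MISSING" points to a jar that was not added in the "Libraries" tab.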

Write the Code to Read PDFs

Step

Right-click on the "src" folder of your "PrintPdf" project and select "New" and then "Package." Create a package using any meaningful name. For example, name the package "com.pdf.util." Click "Finish."

Step

Right-click on the package name you just created and select "New" and then "Class." Create a class named "PDFTextParser." Be sure to click the check box marked "public static void main..." so that the system will create a "main" method.

Step

Edit the "main" method in the "PDFTextParser" class to contain the following code:

    public static void main(String args[]) {
        PDFTextParser pdf = new PDFTextParser("data/javaPDF.pdf");
        // print out results
        System.out.println(pdf.getParsedText());
    }

Step

Note that the file you wish to print out is spelled out in the constructor to PDFTextParser ("data/javaPDF.pdf"). It could just as easily be passed in as a command-line argument or selected from a GUI.
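As an illustration of the command-line option, the sketch below picks the path from the first program argument and falls back to the path used in this tutorial. The class name "PdfPathFromArgs" and the helper "choosePath" are hypothetical names introduced here, not part of the original code.

```java
// Hypothetical variation on the tutorial's main method: take the PDF path from the command line.
class PdfPathFromArgs {

    // Path used by the tutorial when no argument is supplied.
    static final String DEFAULT_PATH = "data/javaPDF.pdf";

    // Returns the first command-line argument, or the default path if none was given.
    static String choosePath(String[] args) {
        return (args != null && args.length > 0) ? args[0] : DEFAULT_PATH;
    }

    public static void main(String[] args) {
        String path = choosePath(args);
        System.out.println("Would parse: " + path);
        // In the tutorial's class this would become:
        // PDFTextParser pdf = new PDFTextParser(path);
    }
}
```

In Eclipse, program arguments can be supplied on the "Arguments" tab of the run configuration.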

Step

The "main" method creates an instance of the PDFTextParser class, and then calls its "getParsedText" method.

Step

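Insert the following code just below the top class line "public class PDFTextParser" that was created for you.

    private PDFParser parser = null;

    // Assumed: constructor signature and File creation inferred from the call in "main".
    public PDFTextParser(String fileName) {
        File file = new File(fileName);
        if (!file.isFile()) {
            System.err.println("File " + fileName + " does not exist.");
        }
        // Set up instance of PDF parser
        try {
            parser = new PDFParser(new FileInputStream(file));
        } catch (IOException e) {
            System.err.println("Unable to open PDF Parser. " + e.getMessage());
        }
    }

    //-------------------------------
    public String getParsedText() {
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        String parsedText = null;
        if (parser == null) {
            return null; // the constructor could not open the file
        }
        try {
            // Assumed: the text stripper lines are inferred; the listing is incomplete here.
            PDFTextStripper pdfStripper = new PDFTextStripper();
            parser.parse();
            cosDoc = parser.getDocument();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (IOException e) {
            System.err.println("An exception occurred in parsing the PDF Document. " + e.getMessage());
        } finally {
            try {
                if (cosDoc != null) cosDoc.close();
                if (pdDoc != null) pdDoc.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return parsedText;
    }

The code needs the following imports above the class line. The PDFBox package names assume a 1.x release of the library.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;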
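The "isFile" test in the constructor only confirms that the path points to a regular file. A stricter pre-check can also look for the "%PDF-" signature that PDF files begin with. The sketch below uses only the standard library; the class "PdfSignatureCheck" and its methods are hypothetical names, and it assumes the signature sits at the very first byte of the file, which is the common case.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical pre-check, not part of PDFBox: does the file start with the "%PDF-" signature?
class PdfSignatureCheck {

    private static final byte[] SIGNATURE = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes begin with "%PDF-".
    static boolean hasPdfSignature(byte[] header) {
        if (header == null || header.length < SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (header[i] != SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns true if the path is a regular file whose first bytes are "%PDF-".
    static boolean looksLikePdf(File file) {
        if (!file.isFile()) {
            return false;
        }
        try (InputStream in = new FileInputStream(file)) {
            byte[] header = new byte[SIGNATURE.length];
            int read = 0;
            // read() may return fewer bytes than requested, so loop until full or end of file
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            return read == header.length && hasPdfSignature(header);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        File file = new File(args.length > 0 ? args[0] : "data/javaPDF.pdf");
        System.out.println(file + (looksLikePdf(file) ? " looks like a PDF" : " is missing or not a PDF"));
    }
}
```

Calling a check like this before constructing the parser would let the program stop early with a clear message instead of failing later inside the parse step.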

Step

Run the program. Right-click on the PDFTextParser class and click on "Run As" and then on "Java Application." The program should run and print out the text contents of the PDF file you entered in your code.
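Once the text prints correctly, it can be processed like any other Java string. As a small illustration, the sketch below counts the words in a piece of extracted text. The class "ParsedTextStats" and the method "countWords" are hypothetical names, and a fixed sample string stands in for the value returned by "getParsedText" so that the example runs without the PDFBox jars.

```java
// Hypothetical post-processing step: count words in text returned by getParsedText().
class ParsedTextStats {

    // Counts whitespace-separated words; returns 0 for null or blank input.
    static int countWords(String text) {
        if (text == null) {
            return 0; // getParsedText() returns null when parsing fails
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Stand-in for: String text = new PDFTextParser("data/javaPDF.pdf").getParsedText();
        String text = "Reading PDF files\nin Java";
        System.out.println("Words: " + countWords(text));
    }
}
```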

Suppress Log4j Startup Error Message

Step

Create a configuration file to suppress the error message that the log4j logging system prints when it cannot find a configuration file at startup. Right-click on the "src" folder of the "PrintPdf" project and select "New" and then "File." Name the file "log4j.properties." Eclipse will display an empty screen for this new file.

Step

Paste the following lines into the empty screen representing the "log4j.properties" file.

    # Set root logger level to WARN and its only appender to A1.
    log4j.rootLogger=WARN, A1

    # A1 is set to be a ConsoleAppender.
    log4j.appender.A1=org.apache.log4j.ConsoleAppender

    # A1 uses PatternLayout.
    log4j.appender.A1.layout=org.apache.log4j.PatternLayout
    log4j.appender.A1.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n

Step

Save the "log4j.properties" file. The presence of this file in the top-level "src" directory will suppress the log4j startup message and any trivial logging messages. With the root logger set to WARN, the log4j system will print out only warnings and errors.
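If the startup message still appears, one thing worth checking is that the file parses as a properties file and that the root logger line is present. The sketch below does this with the standard "java.util.Properties" class instead of log4j itself. The class "Log4jConfigCheck" and the method "rootLevel" are hypothetical names, and the configuration is supplied as an in-memory string so that the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sanity check: parse log4j.properties text and report the root logger level.
class Log4jConfigCheck {

    // Returns the level part of log4j.rootLogger (the text before the first comma), or null if absent.
    static String rootLevel(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            return null; // not expected for an in-memory reader
        }
        String value = props.getProperty("log4j.rootLogger");
        if (value == null) {
            return null;
        }
        int comma = value.indexOf(',');
        return (comma >= 0 ? value.substring(0, comma) : value).trim();
    }

    public static void main(String[] args) {
        String config =
              "# Set root logger level to WARN and its only appender to A1.\n"
            + "log4j.rootLogger=WARN, A1\n"
            + "log4j.appender.A1=org.apache.log4j.ConsoleAppender\n";
        System.out.println("Root level: " + rootLevel(config));
    }
}
```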