Customize pdfgrep output

I am trying to use pdfgrep to search for a keyword across multiple pdf files in a folder. My command is this:

pdfgrep -in keyword *.pdf

This command will output the pdf file names and page numbers on the terminal. I would like each pdf file name to only show once if its contents contain the keyword. That will make it easier to read.

Also, I would like the output to be in a txt file in the current directory instead of showing on the terminal.

Try piping the output to some other utilities. I believe the following should work.

pdfgrep -in keyword *.pdf | sed 's/\.pdf.*/.pdf/' | sort --unique

This worked for me in a directory that had many pdf files, many of which had several repetitions of the keyword.
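To see what the sed and sort stages do without needing any pdf files, here is a small sketch that feeds made-up lines through the same pipeline. The sample lines only roughly imitate the "file name, page number, text" shape of pdfgrep -n output, and the file names and the helper names sample_output and dedupe_names are invented for illustration. I use sort -u here, which is the short form of sort --unique.

```shell
# Made-up lines standing in for `pdfgrep -in keyword *.pdf` output (assumed shape).
sample_output() {
  printf '%s\n' \
    'report.pdf:3: the keyword appears here' \
    'report.pdf:7: another keyword' \
    'notes.pdf:1: keyword again'
}

# Cut everything after the first ".pdf" on each line, then drop duplicate names.
dedupe_names() {
  sed 's/\.pdf.*/.pdf/' | sort -u
}

sample_output | dedupe_names
# prints:
# notes.pdf
# report.pdf
```

Each file name comes out once, in sorted order, no matter how many pages matched.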

Sending the output to a file is as simple as redirecting it using “command > file”.

This code worked for me:

pdfgrep -in keyword *.pdf | sed 's/\.pdf.*/.pdf/' | sort --unique > note.txt

However, it runs very slowly. Is there any way to speed it up?

The amount of data determines the speed. Each line returned from pdfgrep is piped through both sed and sort, so just be patient.

Your definition of “very slow” may also be very different from mine.


Ok, if you only want the pdf file name to show once, what about

pdfgrep -i keyword *.pdf

This will of course not give you the page number, but might be faster because pdfgrep can stop scanning once it has found an occurrence.

Edit: I just checked, it seems to be faster, but only a small bit in my test case.

This command returns file names repeatedly.

Yes, because my command is misspelled, sorry about that.

The correct command would be

pdfgrep -l keyword *.pdf

Note the -l (print only the names of files that contain a match) instead of -i.
This command seems to return every matching pdf only once.

I will go with pdfgrep -il keyword *.pdf to ignore case.
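Putting the pieces of this thread together, a minimal sketch of the final command with the output sent to a txt file could look like the following. The function name list_matching_pdfs and its two arguments are my own invention, not part of pdfgrep; the only pdfgrep options used are the -i and -l already discussed above.

```shell
# Hypothetical helper: write the names of matching pdf files to a text file.
#   $1 = keyword to search for, $2 = output file (both names are assumptions)
list_matching_pdfs() {
  keyword=$1
  outfile=$2

  # Bail out early if pdfgrep is not available.
  if ! command -v pdfgrep >/dev/null 2>&1; then
    echo "pdfgrep not found" >&2
    return 127
  fi

  # -i ignores case, -l prints each matching file name only once;
  # the redirect sends the list to the file instead of the terminal.
  pdfgrep -il "$keyword" *.pdf > "$outfile"
}

# Example call, run from the folder that holds the pdf files:
#   list_matching_pdfs keyword note.txt
```

Since -l already prints each file once, the sed and sort stages are no longer needed.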