Viewing PDFs as PDFs (not as images)

Hello,

I am not able to view PDFs as PDFs.
When PDFs are loaded, images of all pages are generated (which can take a very long time to complete). These images, not the PDFs themselves, are what is displayed in both Providence and Pawtucket2. This causes a number of problems: pages take a long time to load; the image quality of the text is poor; the text is not searchable; and screen readers cannot read the pages.

From previous discussions on this site, though, it looks like I should be able to load the PDFs themselves.

The steps I have tried so far are as follows.

Switching between the viewers. I have tried the following:

viewer = UniversalViewer
viewer = Mirador
viewer = TileViewer

I have done this in each of the following sections of the media_display.conf files (for both Providence and Pawtucket2):

detail = {
pdf = {...

media_overlay = {
pdf = {...

default_viewers = {
pdf = {...

I have tried changing the display_version to:

    display_version = original,

In earlier posts on this topic there is also the suggestion to include:

    use_book_viewer = 1

(Is the 'book_viewer' an additional piece of software I need to get from somewhere?)

So, for example, in the media_overlay section of media_display.conf, I have tried the following:

pdf = {
mimetypes = {application/pdf},
display_version = original,
alt_display_version = mediumlarge,
width = 580, height = 450,
use_book_viewer = 1
},  

But, I am still getting images loading instead of the PDFs.
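
One way to double-check which settings each pdf block actually carries, in each section and in each application's copy of the file, is a small script. The following is only a sketch, assuming the brace-delimited layout quoted above (a top-level section such as media_overlay containing a pdf = {...} block). The function name pdf_settings is hypothetical and is not part of CollectiveAccess.

```python
import re

# Assumed layout: "section = {" at the top level, "pdf = {" one level down,
# and simple "key = value" settings inside the pdf block.
SECTION_RE = re.compile(r"^(\w+)\s*=\s*\{")
PDF_RE = re.compile(r"^pdf\s*=\s*\{")
# Matches "key = value" pairs whose value is not itself a {...} list.
SETTING_RE = re.compile(r"(\w+)\s*=\s*([^\s,{}][^,{}]*)")


def pdf_settings(conf_text):
    """Return {section: {key: value}} for every pdf block found."""
    results = {}
    section = None
    in_pdf = False
    depth = 0  # brace nesting depth before the current line
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if depth == 0:
            match = SECTION_RE.match(line)
            if match:
                section = match.group(1)
        elif depth == 1 and PDF_RE.match(line):
            in_pdf = True
            results[section] = {}
        elif depth == 2 and in_pdf:
            for key, value in SETTING_RE.findall(line):
                results[section][key] = value.strip()
        depth += line.count("{") - line.count("}")
        if depth < 2:
            in_pdf = False
    return results
```

Calling pdf_settings on the text of each media_display.conf (the Providence copy and the Pawtucket2 copy) and comparing the results should show whether viewer, display_version and use_book_viewer were changed in every section or only in some.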

Are there any suggestions on what I am/could be doing incorrectly in the steps noted above, other settings that I have missed, or software that is required that I may have forgotten to install? Or any other suggestions?

[I am using: Providence 1.7.5; Pawtucket2 1.7.5; PHP 7.1; CentOS 7.6]

With thanks,
Clifford.

Comments

  • You can't view PDFs in-browser. You will only be able to download them.

  • Now I am confused.
    To me, the following passages suggest that CollectiveAccess can handle PDFs.

    "CollectiveAccess provides a growing list of tools to enable nuanced and detailed media viewing and playback. These include [...] PDF viewers that allow you to scroll and search through multiple pages." [...]
    "There are several PDF viewers available to CollectiveAccess users, [...] The advantage of this viewer ["Document viewer"] is that it provides an in-document search feature with search result navigation. "
    https://docs.collectiveaccess.org/wiki/Media_Viewer

    And:
    "The Bookviewer allows all users to read multi-page documents or scroll through multiple images in a single record in a clean, user-friendly manner." [...]
    "The Bookviewer is enabled on a per media-type basis, allowing you to isolate its use to images, PDFs, xcel documents, and other types if you wish to do so."
    https://docs.collectiveaccess.org/wiki/Bookviewer

    Am I misinterpreting what is being said here?
    Or are these discontinued features?

    Clifford.

  • edited January 30

    Yes, it handles PDFs, but it doesn't include a native PDF viewer. If you want to look at the actual PDF, as opposed to a JPEG or PNG rendering of it, you'll need to download it. We've been considering integrating PDF.js, which would let you look at the actual PDF in a browser, but when we tried a few years back it wasn't nearly stable enough. All of the "book viewers" out there (Mirador, Universal Viewer, Internet Archive book viewer) display pre-rendered images of pages. Supposedly UniversalViewer can use PDF.js, but I've never been able to get it to work reliably.

  • Thank you for your explanation, Seth.

    So, in the attached screen capture the search box in the viewer is something that would be working if we could get PDF.js working, and stable?

    (In this example, we have added transcriptions in order to make the documents searchable, etc.; so my dismay at the PDFs not working in the way I had expected/hoped for is largely because that effort looks like it is being undone, with the text, ironically, being turned back into unsearchable images...)

    But, as things stand at the moment, PDF.js just doesn't work satisfactorily. So the best solution at present is to let users know that, if they want to search the PDFs, etc., they need to download them.

    I wonder if there is an (easy) way of de-activating the search box in the viewer if it doesn't work - so as not to confuse users?

    Clifford.

  • The search box will work with the interface you show above if you install PDFMiner (https://github.com/euske/pdfminer) on your server. If you're installing PDFMiner after you've already uploaded PDFs then you'll also need to run caUtils reindex-pdfs once to retroactively index them for search. Subsequent PDFs would be indexed for search as they are uploaded.
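
Before re-indexing, it can be worth confirming that PDFMiner's command-line tool is reachable and can pull text out of one of your PDFs. A minimal sketch, assuming the tool is installed under the name pdf2txt.py (the script PDFMiner ships) and is on the PATH of the user running it; the function names and the sample file name are hypothetical.

```python
import shutil
import subprocess


def find_pdfminer(names=("pdf2txt.py", "pdf2txt")):
    """Return the full path of the first PDFMiner CLI found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def extract_text(pdf_path, tool):
    """Run the PDFMiner CLI on one PDF and return the text it writes to stdout."""
    result = subprocess.run(
        [tool, pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    tool = find_pdfminer()
    if tool is None:
        print("PDFMiner CLI not found on PATH")
    else:
        # "sample.pdf" is a placeholder; point this at one of your own PDFs.
        print(extract_text("sample.pdf", tool)[:500])
```

If the script prints recognisable text from the PDF, the extraction step itself is working, and any remaining search problems are more likely to lie in configuration or in PDFs that have not yet been re-indexed.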

  • Thank you, Seth.
    That is interesting.
    Will give it a try.
