Large media import

benben
edited November 2020 in General Support

Is there a command-line option for media import, as there is for metadata, so that I can use tmux or something similar to ensure that my browser/client's connection doesn't impact the process?

When doing media import in-browser, what happens if my client/laptop loses its connection to the server? Is the client connection irrelevant once the process is started? If so, what/where are the relevant logs I can monitor on the server to keep tabs on progress?

Comments

  • Ok I found the import-media command for caUtils.

    I'm getting the following error: "You must specify a directory to import media from". There is no help info or documentation anywhere that I can find that explains the proper usage or syntax.

    Here's what I'm doing: $ sudo /var/www/html/support/bin/caUtils import-media [path here]

  • edited November 2020

    ETA: I misread part of your question and realized it as soon as I hit send. I'm leaving the original response in case it helps others.

    Going back to your original question, have you enabled the background queue? Here's the link from the instructions in the installation guide.

    You will probably still need to be connected to the server in order to upload the files, as far as I know.

    One thought would be to upload the batches of images using ftp/sftp, perform a media import via the GUI, and then have the server run the processing at a later time if the files are too large. Let me try this option out and let you know if it works.

  • benben
    edited November 2020

    Yes, I've enabled the background queue.

    The media already lives on the server in a folder, awaiting import.

    I have tried three import tactics so far with limited success:
    1. Import via the browser from a folder on the server, not in the background – fails (silently, no error messages) after about 3 large videos – not ideal but at least it kind of works?
    2. Import via the browser with background processing enabled – seems more ideal for my use case of a large volume of large-ish videos – I've enabled background processing, and it adds them to the queue, but when I run the "process queue" command, it finishes almost instantly without doing anything.
    3. Import via the CLI using the above commands – this strikes me as really the most ideal for my use case as I would assume the CLI tool has some kind of progress output, so I could leave this running in a tmux session on the server for as long as it needs, and check back in later – I can't get this to work at all. When I use the above syntax, it fails with the error message about needing to specify a directory but with no example of the correct syntax. caUtils doesn't appear to be documented at all. Any pointers, other than me sifting through the 1,000s of lines of code to figure it out?
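
A minimal sketch of the tmux approach described above, assuming tmux is installed on the server. The session name ca_import and the helper start_detached are hypothetical, and the import command is left as a placeholder string; nothing here comes from the CollectiveAccess documentation.

```shell
#!/bin/sh
# Hypothetical session name; override by exporting SESSION before running.
SESSION="${SESSION:-ca_import}"

# Start one command string in a detached tmux session, so it keeps
# running on the server after the SSH or browser connection drops.
start_detached() {
    session=$1
    cmd=$2
    if ! command -v tmux >/dev/null 2>&1; then
        echo "tmux is not installed" >&2
        return 1
    fi
    tmux new-session -d -s "$session" "$cmd"
}

# Usage (placeholder command; substitute your own import command line):
#   start_detached "$SESSION" "your-import-command --with-options"
# Check on it later, even from a different connection:
#   tmux attach-session -t "$SESSION"
```

If the command needs a sudo password, it may be simpler to open the session interactively with `tmux new-session -s ca_import`, start the import by hand, and detach with Ctrl-b d.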

  • @ben Have you tried running bin/caUtils import-media help ? This is the output:

    Help for "import-media":
    
        Import media from a directory or directory tree.
    
    Options for import-media are:
    
        --source (-s)            Data to import. For files provide the path; for database, OAI and other
                                 non-file sources provide a URL.
    
        --username (-u)          User name of user to log import against.
    
        --log (-l)               Path to directory in which to log import details. If not set no logs will
                                 be recorded.
    
        --log-level (-d)         Logging threshold. Possible values are, in ascending order of important:
                                 DEBUG, INFO, NOTICE, WARN, ERR, CRIT, ALERT. Default is INFO.
    
        --add-to-set (-S)        Optional identifier of set to add all imported items to.
    
        --log-to-tmp-directory-as-fallback Use the system temporary directory for the import log if the application
                                 logging directory is not writable. Default report an error if the
                                 application log directory is not writeable.
    
        --include-subdirectories Process media in sub-directories. Default is false.
    
        --match-type             Sets how match between media and target record identifier is made. Valid
                                 values are: STARTS, ENDS, CONTAINS, EXACT. Default is EXACT.
    
        --match-mode             Determines how matches are made between media and records. Valid values are
                                 DIRECTORY_NAME, FILE_AND_DIRECTORY_NAMES, FILE_NAME. Set to DIRECTORY_NAME
                                 to match media directory names to target record identifiers; to
                                 FILE_AND_DIRECTORY_NAMES to match on both file and directory names; to
                                 FILE_NAME to match only on file names. Default is FILE_NAME.
    
        --import-mode            Determines if target records are created for media that do not match
                                 existing target records. Set to TRY_TO_MATCH to create new target records
                                 when no match is found. Set to ALWAYS_MATCH to only import media for
                                 existing records. Default is TRY_TO_MATCH.
    
        --allow-duplicate-media  Import media even if it already exists in CollectiveAccess. Default is
                                 false – skip import of duplicate media.
    
        --import-target          Table name of record to import media into. Should be a valid
                                 representation-taking table such as ca_objects, ca_entities,
                                 ca_occurrences, ca_places, etc. Default is ca_objects.
    
        --import-target-type (-t)Type to use for all newly created target records. Default is the first type
                                 in the target's type list.
    
        --import-target-idno (-i)Identifier to use for all newly created target records.
    
        --import-target-idno-mode (-m)Sets how identifiers of newly created target records are set. Valid values
                                 are AUTO, FILENAME, FILENAME_NO_EXT, DIRECTORY_AND_FILENAME. Set to AUTO to
                                 use an identifier calculated according to system numbering settings; set to
                                 FILENAME to use the file name as identifier; set to FILENAME_NO_EXT to use
                                 the file name stripped of extension as the identifier; use
                                 DIRECTORY_AND_FILENAME to set the identifer to the directory name and file
                                 name with extension. Default is AUTO.
    
        --import-target-access (-a)Set access for newly created target records. Possible values are 0 (not
                                 accessible to public), 1 (accessible to public), 2 (restricted public
                                 access). Default is 0 (not accessible to public).
    
        --import-target-status (-w)Set status for newly created target records. Possible values are 0 (new), 1
                                 (editing in progress), 2 (editing complete), 3 (review in progress), 4
                                 (completed). Default is 0 (new).
    
        --representation-type (-T)Type to use for all newly created representations. Possible values are
                                 after_treatment (Image AT), analysis (Analysis), archical (Archival),
                                 before_treatment (Image BT), collection_item (Conservation),
                                 collection_management (Collection Management), diagram (Diagram),
                                 during_treatment (Image DT), non_treatment (Image Non-Treatment), other
                                 (Contextual), primary (Primary), publication (Publication). Default is .
    
        --representation-idno (-I)Identifier to use for all newly created representation records.
    
        --representation-idno-mode (-M)Sets how identifiers of newly created representations are set. Valid values
                                 are AUTO, FILENAME, FILENAME_NO_EXT, DIRECTORY_AND_FILENAME. Set to AUTO to
                                 use an identifier calculated according to system numbering settings; set to
                                 FILENAME to use the file name as identifier; set to FILENAME_NO_EXT to use
                                 the file name stripped of extension as the identifier; use
                                 DIRECTORY_AND_FILENAME to set the identifer to the directory name and file
                                 name with extension. Default is AUTO.
    
        --representation-access (-A)Set access for newly created representations. Possible values are 0 (not
                                 accessible to public), 1 (accessible to public), 2 (restricted public
                                 access). Default is 0 (not accessible to public).
    
        --representation-status (-W)Set status for newly created representations. Possible values are 0 (new),
                                 1 (editing in progress), 2 (editing complete), 3 (review in progress), 4
                                 (completed). Default is 0 (new).
    
        --remove-media-on-import (-R)Remove media from directory after it has been successfully imported.
                                 Default is false.
    
  • In general, all caUtils commands are documented by invoking the command followed by "help". You can get a list of all commands by running caUtils followed by "help".

    Note that import-media is basically a CLI version of the web UI for media imports. Just about every option in the web UI is available in the CLI version and operates similarly. If you're doing fuzzy-ish matching of file names against record identifiers, the same app.conf config used to define matching behaviors for the web UI is used for the CLI as well.

    Also, when you run things on the command line you should mind permissions. The media directories must be writeable by the web server. The user you're running caUtils as may not have enough privs on some systems. Running as the web server user via sudo or some other mechanism may be called for.

    I hope this helps.
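
The advice above could be put together roughly like this. The caUtils path is the one used earlier in this thread; www-data as the web server user is an assumption (it varies by distribution), and show_ca_help is a hypothetical wrapper, not part of caUtils.

```shell
#!/bin/sh
# Path used earlier in this thread; adjust to your installation.
CA_UTILS="${CA_UTILS:-/var/www/html/support/bin/caUtils}"
# Assumed web server user; commonly www-data on Debian/Ubuntu.
WEB_USER="${WEB_USER:-www-data}"

# Print caUtils help as the web server user.
#   no argument   -> list of all commands ("caUtils help")
#   one argument  -> help for that command ("caUtils <command> help")
show_ca_help() {
    if [ ! -x "$CA_UTILS" ]; then
        echo "caUtils not found or not executable: $CA_UTILS" >&2
        return 1
    fi
    if [ -n "${1:-}" ]; then
        sudo -u "$WEB_USER" "$CA_UTILS" "$1" help
    else
        sudo -u "$WEB_USER" "$CA_UTILS" help
    fi
}

# Usage:
#   show_ca_help                 # list all commands
#   show_ca_help import-media    # options for import-media
```

Running as the web server user rather than root is one way to keep any files the import writes owned by the account that needs to read them later.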

  • That helps (pun intended)! I wasn't able to find that anywhere in the docs and had been trying -h.

  • Ok, gave it my first shot, and I'm getting the following error:

    CollectiveAccess 1.7.8 (158/RELEASE) Utilities
    (c) 2013-2019 Whirl-i-Gig
    
    PHP Fatal error:  Uncaught Error: Call to a member function get() on null in /var/www/html/app/lib/Utils/CLIUtils.php:4340
    Stack trace:
    #0 /var/www/html/support/bin/caUtils(167): CLIUtils::import_media(Object(Zend_Console_Getopt))
    #1 {main}
      thrown in /var/www/html/app/lib/Utils/CLIUtils.php on line 4340
    

    Here's my command:

    sudo /var/www/html/support/bin/caUtils import-media --source /mnt/075b500f-49ed-4d7c-b83d-6594a9c1be82/import/CLItest/ --username ben  --log ~/ --log-level DEBUG --add-to-set "batch_1" --include-subdirectories --match-mode DIRECTORY_NAME --import-mode ALWAYS_MATCH --import-target ca_objects --import-target-type "Video"
    
  • Hmm. This isn't an issue with current code. I'll check the release and try to reproduce.

  • benben
    edited December 2020

    After upgrading to CA 1.7.9 I'm getting the following more informative error, but it still doesn't add up…

    $ sudo /var/www/html/support/bin/caUtils import-media --source /mnt/075b500f-49ed-4d7c-b83d-6594a9c1be82/import/CLItest/ --username [username]  --log ~/ --log-level DEBUG --add-to-set "batch_1" --include-subdirectories --match-mode DIRECTORY_NAME --import-mode ALWAYS_MATCH --import-target ca_objects --import-target-type "Video"
    CollectiveAccess 1.7.9 (158/RELEASE) Utilities
    (c) 2013-2019 Whirl-i-Gig
    
    Setting match type to default value EXACT
    Setting target identifier type to default value AUTO
    Setting representation identifier type to default value AUTO
    Setting target type to default film
    Setting representation type to default representation_image
    Setting target access to default internal staff only
    Setting representation access to default internal staff only
    Setting target status to default new
    Setting representation status to default new
    Found 1 files in 1 directories
    Processing media                                                  0.0% 0/1 ETC: ???. Elapsed: < 1 sec [>                              ]
    
    
    Could not import media from /mnt/075b500f-49ed-4d7c-b83d-6594a9c1be82/import/CLItest/: Specified import directory '/mnt/075b500f-49ed-4d7c-b83d-6594a9c1be82/import/CLItest/' is invalid
    

    Any ideas? The directory is absolutely there, there are no typos in the command, and it is readable by www-data.

  • You're running this on the command line so it would need to be readable by the user you're logged in as. I'd look at that first.

  • Oh oops, I see you're running this as sudo. I'll assume this means root, so permissions would not be an issue no matter who you're logged in as. In that case my only guess is that the path is wrong...

  • I mean it's not… I can copy/paste the same path and ls both with and without sudo, and see the contents. The path is valid, and I've even set permissions to 777 temporarily to rule that out. I've tried running the command with and without sudo to no avail. Doesn't seem like a permissions issue, or a typo issue.

  • Ok I'm an idiot. I forgot a really basic limitation: the media importer will only import directories within the directory structure defined by batch_media_import_root_directory in app.conf. This is to provide a security sandbox of sorts so imports can't snoop files from the entire directory hierarchy.

    Your options to address this are:

    1. Change batch_media_import_root_directory in app.conf to /mnt/075b500f-49ed-4d7c-b83d-6594a9c1be82/import
    2. Create a symlink for /mnt/075b500f-49ed-4d7c-b83d-6594a9c1be82/import in whatever your import directory is set to in batch_media_import_root_directory
    3. Set batch_media_import_root_directory to / and enjoy pulling any path at all.

    Don't do #3. We are in the habit of doing #2. #1 is fine too, but it means changing app.conf every time you change where import data is mounted.
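
Option #2 might look something like the sketch below. The helper link_into_import_root and the link name are hypothetical; the import root must be whatever batch_media_import_root_directory is set to in your app.conf, which this thread doesn't state.

```shell
#!/bin/sh
# Create a symlink inside the import root that points at the directory
# where the media actually lives.
#   $1: directory holding the media (e.g. the mounted volume)
#   $2: value of batch_media_import_root_directory from app.conf
#   $3: name for the symlink inside the import root (hypothetical)
link_into_import_root() {
    media_dir=$1
    import_root=$2
    link_name=$3
    if [ ! -d "$media_dir" ]; then
        echo "media directory not found: $media_dir" >&2
        return 1
    fi
    if [ ! -d "$import_root" ]; then
        echo "import root not found: $import_root" >&2
        return 1
    fi
    ln -s "$media_dir" "$import_root/$link_name"
}

# Usage (IMPORT_ROOT is a placeholder for your configured import root):
#   link_into_import_root /mnt/075b500f-49ed-4d7c-b83d-6594a9c1be82/import "$IMPORT_ROOT" external_import
```

The --source passed to import-media would then presumably be the path through the symlink, so that it falls under the import root.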

  • Ha, yeah that'll do it. It's now seeing the folder.

    New issue however has cropped up. I'm getting the following error:

    [1/1] Processing CLItest/AV.00189/AV.00189-me.mov (1)       100.0% 1/1 ETC: < 1 sec. Elapsed: < 1 sec [===============================]
    
    PHP Fatal error: 
    
    Uncaught Error: [] operator not supported for strings in /var/www/html/app/lib/Parsers/getid3/module.audio.mp3.php:464
    Stack trace:
    #0 /var/www/html/app/lib/Parsers/getid3/module.audio.mp3.php(1085): getid3_mp3->decodeMPEGaudioHeader(3163606, Array, false)
    #1 /var/www/html/app/lib/Parsers/getid3/module.audio.mp3.php(878): getid3_mp3->RecursiveFrameScanning(3163605, 3163606, true)
    #2 /var/www/html/app/lib/Parsers/getid3/module.audio.mp3.php(1417): getid3_mp3->decodeMPEGaudioHeader(3163605, Array, true)
    #3 /var/www/html/app/lib/Parsers/getid3/module.audio-video.quicktime.php(84): getid3_mp3->getOnlyMPEGaudioInfo(3163597, false)
    #4 /var/www/html/app/lib/Parsers/getid3/getid3.php(428): getid3_quicktime->Analyze()
    #5 /var/www/html/app/helpers/avHelpers.php(129): getID3->analyze('/mnt/075b500f-4...')
    #6 /var/www/html/app/helpers/avHelpers.php(69): caExtractMetadataWithGetID3('/mnt/075b500f-4...')
    #7 /var/www/html/app/lib/Plugins/Media/Video.php(211): caGetID3GuessFileFormat('/mnt/075b500f-4...')
    #8 /var/www/html/ap in /var/www/html/app/lib/Parsers/getid3/module.audio.mp3.php on line 464
    

    Any ideas?

  • Ok at least that's not our code :-)

    These are all QuickTime files? What version of PHP are you running? That error is one I haven't seen before in GetID3, the library we use to identify what format a file is based upon content.

  • Yes these are Quicktime MOVs

  • I suspect the solution here will be to switch to the very latest version of GetID3. I'll get that into another 1.7.9 package soon.

  • An updated getID3 is in Git for 1.7.9 and will be in the next 1.7.9rc. That'll probably be available tomorrow.

  • benben
    edited December 2020

    Looks like there's no release for this fix yet? If I want to do a quick patch should I just pull from GitHub?

  • There's a new 1.7.9rc that integrates the latest version of getID3. Let me know how it works for you.

    https://github.com/collectiveaccess/providence/releases/tag/1.7.9rc4

  • In app/lib/Plugins/Media/Audio.php (around line 189), change $ID3 = new getid3(); to $ID3 = new getID3();

  • Thanks for that! I've integrated the change.

  • Hi Seth – just ran it and so far no errors, but I'm having trouble finding a way to monitor the progress of the import… I used the same exact command as above, which asked for "DEBUG" logging to be stored in my home folder, but there's no log showing up. The script is running according to htop, and I can see ffmpeg running, but there's still no way to know what it is doing other than the output of caUtils, which looks like the attached screenshot.

    So… two questions:
    1) how can I get the log working?
    2) when I do a big import with 100s of objects, will the output of caUtils update as it gets to each new file at the very least?

  • 1. Make sure your log directory is writeable by the user you're running the script as. If it's not, no log will be written.
    2. The progress bar will reflect progress across multiple files.
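
A rough way to check the first point before starting a long import. check_log_dir is a hypothetical helper, and the directory paths in the usage comments are placeholders; it only tests writeability for whichever user runs it.

```shell
#!/bin/sh
# Report whether a directory is usable as a --log target for the
# user running this script.
check_log_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writeable by $(id -un): $dir" >&2
        return 1
    fi
    echo "ok: $dir"
}

# Usage: run it as the same user that will run caUtils, e.g.
#   check_log_dir /path/to/log/dir
# While an import runs, list the log directory by modification time to
# see whether anything is being written (log file names are not given
# in this thread, so none are assumed here):
#   ls -lt /path/to/log/dir
```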