Help needed importing vocabulary list items

I'm giving up.

I've read *all* Data Importer articles in the wiki:

[](https://docs.collectiveaccess.org/wiki/Data_Importer)

[](https://docs.collectiveaccess.org/wiki/Cookbook_Chapter_3:_Data_Import#Importing_a_list)

[](https://docs.collectiveaccess.org/wiki/Data_Import:_Creating_and_Running_a_Mapping)

https://docs.collectiveaccess.org/wiki/Basic_data_import_tutorial


It feels like a carrot dangling on a stick: The docs definitely hint at it being possible to import ca_list_items, but the text is very vague and the given examples only show a detail of a detail...

I'm dealing with hierarchical vocabularies, but do I need the listItemSplitter, the listItemHierarchyBuilder or the listItemIndentedHierarchyBuilder? The difference between the latter 2 seems to be only the data layout (indent or not), but only the "listItemSplitter" docs show where to assign idno and item attributes...


I have complete control over my source data layout (I'm defining/writing it), so I'd take the easiest option.

btw: If I just copy/paste a link here, I get the error "Request failed with status code 404".


Grateful for example files!

Comments

  • Hi,

    listItemSplitter splits values and relates records being imported to list items with those values. You'd use this, for example, to related subjects in a list to objects you are importing. You would not generally use this to build a list from a file.

    listItemIndentedHierarchyBuilder builds list item hierarchies from column-based data (Eg. Excel, CSV) where the value's column placement indicates level in the hierarchy. Its primary use is to allow import of "visually" formatted lists. Unless you have data formatted like this then it's not useful.

    listItemHierarchyBuilder will build a list hierarchy by allowing you to specify parents for a given record. Depending upon the data you can define the immediate parent or several levels. This is probably what you want.

    Attached is a sample mapping that imports hierarchical lists using listItemHierarchyBuilder as well as some test data. This mapping assumes that there is a list with code = "thesaurus" in your system and that your data is in XLSX (Excel format). Tab or CSV data can be used by setting the mapping's inputFormats setting. If you want to import data into a list with a different list code change the mapping's code setting.

    Regarding pasting of links, maybe try adding the link to existing text in your post using the chain-link icon shown whenever you select text? I'm afraid I don't know why you're getting errors on this forum pasting in links. It has worked for me, but I've only copy-pasted links from browsers.

  • Thank you very much for the explanation and sample files!

    --big smile!--


    About the links: Will try in a different browser. Maybe it's some script-restriction addons that confuse the forum ;)

  • edited March 9

    It worked!

    I've finally got all my Lego-stuff in CA! :)

    How can I enable the imported list items? In your example, all items are disabled by default, although the refinery parameter contains "is_enabled:1". I couldn't find that setting in the documentation.


    Thanks again!

  • Oh, and how can I define that created parent items are "hierarchy names" - not "concepts"?

  • To set list items "enabled" map a constant "1" to the is_enabled field in the mapping.

  • @peter_b i had the same problem recently with importing large amounts of information to the list system of CA. The solution i found the best for me was to study the database, in particular the "ca_lists, "ca_list_labels", "ca_list_items" and "ca_list_item_labels" and produce a customized code to do the direct inserts in the database replicating the table structure and fields. I'm not sure how comfortably you are with programming, but i understand you point, the data importer documentation, is severely lacking when it comes to real examples with real data.

    A sugestion for @seth, in the documentation, every type of refinery or splitter should be coupled with a pair of a real data example and a mapping file that demonstrates the use of that particular splitter/refinery.

  • @PedroLadino: I support your suggestion about adding real data examples/mapping to the docs. It would also be nice to have all options/switches documented, such as "is_enabled" (Is it documented somwhere? I couldn't find it).

  • @seth: Setting the `is_enabled`constant to 1 worked. Thank you!

    There was an error message though (already yesterday, so independent of "is_enabled"), but I find it impossible to map that error to any corresponding source data entry:

    • (2021-03-10 11:51:37) [] Could not add default label Item name (plural) must be at least 1 characters
    • (2021-03-10 11:51:37) [] Could not add preferred label to ca_list_items. Record was deleted because no preferred label could be applied: Item name (plural) must be at least 1 characters

    Would it be possible to have any hint in the import log messages which line/value caused an error?

  • Unfortunately, the source contains duplicate entries, so I've configured: `Setting > existingRecordPolicy > skip_on_idno`, but existing items are not matched, but created again (with duplicate idnos).

    How is the "existingRecordPolicy" setting to be used in this case?

  • existingRecordPolicy is supposed work with lists the same as with other imports. I just checked and there is a list-specific problem that breaks deduping. There's a push in develop to fix this. With the fix in place duplicates will be skipped based upon identifier.

  • Regarding your error, the idno of the record that threw the error will be between the brackets. That it's blank implies that you're importing blank rows at some point.

    We could add line numbers to the log entries, but not all sources are line oriented, so it would be of variable utility. The logged idno usually gets you to the part of the data you need to look at quickly, unless of course it's blank.

  • idno between brackets: Ah, okay! Thanks.

    There shouldn't be any blank lines, but I'll check if I spot something suspicious.

    Would it be possible to add some message saying the the "idno is blank"? Because the error message only tells me that a plural name was missing.


    I'm curious: Which supported data sources are not line oriented?

  • XML and JSON. How they're broken up into records for import is variable, so an index is often not the easiest way to find where the problem is. The identifier is usually the quickest way to get to the relevant data. Of course that assumes the the identifier is defined, which it should be (until it isn't I guess).

    We can add line numbers and a blank indicator to the logs if it would be helpful to you.

  • edited March 12

    But XML and JSON also have line numbers, don't they? (unless they don't contain linebreaks, but is that case so popular?)

    The idno is definitely a good search reference to find issues, but IMO it's a bit of a "boot" problem: The importer mapping needs to "find" the idno - if it can't you're blind...

    I'd totally be for line numbers, unless of course you have good reasons not to.

  • XML and JSON are parsed whole into data structures using PHP's XML parser, not line by line. By the time the importer gets them there are no line numbers, just a content tree.

    For data where line numbers are available (Excel, tab, csv) having them in the logs would clearly be helpful in cases where idnos are not being resolve for some reason, and I agree they should be added.

  • @seth: Thanks again for the voc-import worksheet and data example!

    I've been using it as a base for hundreds of successfully imported terms :)

    However, I've now tried to declare level-2+ hierarchies (so, there are sub-sub-[sub-...]-items), and the 1st hierarchy level is honored, but the 2nd hierarchy level is ignored. I silently assumed that assigning a parent to a parent would work identically.

    I'm declaring hierarchy terms as new items right above before I assign them to following terms.


    I've attached the mapping worksheet and the vocabulary data. The problem happens e.g. at "building_construction" which has several sub-levels.

    Warning: Contains gruesome vocabularies, since it's for a holocaust history project.


    Grateful for any suggestions or worksheet examples.

  • I've now also noticed that other settings don't work when defining nested hierarchies.

    For example, I've added a column "enabled" (0/1) - which works fine, except it's ignored for all terms that would be assigned to a level2+ hierarchy. They are all enabled, regardless of the value in the sheet.

  • Out of curiosity: Is this actually a bug of the "listItemHierarchyBuilder", or is it designed to do only 1 hierarchy level?

  • *sigh*

    My colleagues just informed me that the CA Profile XML now invalidated itself:

    The exported profile XML cannot be used, because it fails to validate against the profile.xsd during installation.


    The reason for that is that the above importer nested-hierarchy issue causes it to create duplicate identifier entries - which then invalidate the XML.

    Luckily, we still have the instance running, so I deleted the broken-by-importer list, but: now the list doesn't show up in the UI anymore, but is still in the profile XML with a long number at the end, like this: "vhh_PlaceTypeIMP_1620223962".


    Thanks for yet another Catch-22.

  • edited May 5

    I'd suggest editing the exported profile by hand to remove the list and/or resolving the repeating identifiers. If there's invalid data in the running system it's going to be included in the exported profile, but these are all resolvable. Catch-22's generally aren't.

  • edited May 5

    I've just pushed a change to omit deleted lists from profile exports. This is appropriate behavior (I've no idea why we weren't filtering these already) and should help with the XML issue.

    https://collectiveaccess.atlassian.net/browse/PROV-3127

    https://github.com/collectiveaccess/providence/pull/937

  • I'll try to make time to look at your hierarchy builder issue, but I'm not sure when that will be.

Sign In or Register to comment.