mex.extractors.datenkompass package

Subpackages

Submodules

mex.extractors.datenkompass.extract module

mex.extractors.datenkompass.extract.get_extracted_item_stable_target_ids(entity_type: list[str], referenced_identifier: list[str] | None) list[MergedIdentifier]

Fetch extracted items from backend and return their stableTargetId.

Parameters:
  • entity_type – List of entity types.

  • referenced_identifier – List of MergedIdentifiers to filter for.

Returns:

List of stableTargetIds of extracted items of the given entity type(s).

mex.extractors.datenkompass.extract.get_filtered_primary_source_ids(filtered_primary_sources: list[str] | str | None) list[str]

Get a list of MergedIdentifier of filtered primary sources.

Parameters:

filtered_primary_sources – List of primary sources.

Returns:

List of IDs of the filtered relevant primary sources.

mex.extractors.datenkompass.extract.get_merged_items(*, query_string: str | None = None, entity_type: list[str] | None = None, referenced_identifier: list[str] | None = None, reference_field: str | None = None) list[AnyMergedModel]

Fetch merged items from backend.

Parameters:
  • query_string – Query string.

  • entity_type – List of entity types.

  • referenced_identifier – List of identifiers.

  • reference_field – Name of the field accepting identifiers.

Returns:

List of merged items.

mex.extractors.datenkompass.filter module

mex.extractors.datenkompass.filter.filter_activities_by_organization(datenkompass_merged_activities_by_primary_source: list[MergedActivity]) list[MergedActivity]

Filter the merged activities based on the mapping specifications.

Parameters:

datenkompass_merged_activities_by_primary_source – merged activities by unit.

Returns:

filtered list of merged activities by unit.

mex.extractors.datenkompass.filter.filter_merged_items_for_primary_source(merged_items_by_primary_source: dict[str, list[MergedResource]], entity_type: str) dict[str, list[MergedResource]]
mex.extractors.datenkompass.filter.filter_merged_items_for_primary_source(merged_items_by_primary_source: dict[str, list[MergedActivity]], entity_type: str) dict[str, list[MergedActivity]]

Filter the merged items for primary source as defined in settings.

Special treatment for items which were created or edited in the editor: merged items that are referenced via stableTargetId by an extracted item are filtered out, so that only merged items consisting solely of rules are kept.

Parameters:
  • merged_items_by_primary_source – merged items dictionary by primary source.

  • entity_type – entity type of the merged items

Settings: primary source which needs to be filtered

Returns:

dictionary with list of filtered merged items
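
The special treatment for editor items can be pictured as a set exclusion. The following is a minimal sketch and not the package's implementation: it works on plain identifier strings rather than merged models, and the name keep_rule_only_items_sketch is hypothetical.

```python
from __future__ import annotations


def keep_rule_only_items_sketch(
    merged_item_ids: list[str],
    extracted_stable_target_ids: set[str],
) -> list[str]:
    """Keep merged items that no extracted item points to (hypothetical helper)."""
    # A merged item referenced by an extracted item's stableTargetId is not
    # rule-only, so it is dropped. Order of the remaining items is preserved.
    return [
        identifier
        for identifier in merged_item_ids
        if identifier not in extracted_stable_target_ids
    ]


merged = ["m1", "m2", "m3"]
referenced_by_extracted = {"m2"}
rule_only = keep_rule_only_items_sketch(merged, referenced_by_extracted)
```

In this sketch "m2" is removed because an extracted item references it, while "m1" and "m3" remain.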

mex.extractors.datenkompass.filter.filter_merged_resources_by_unit(merged_resources_by_primary_source: dict[str, list[MergedResource]], resource_filter_mapping: DatenkompassFilterMapping, merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) dict[str, dict[str, list[MergedResource]]]

Filter the merged resources by unit (and its child units) in field unitInCharge.

Parameters:
  • merged_resources_by_primary_source – merged resources by primary source.

  • resource_filter_mapping – Datenkompass resource filter mapping

  • merged_organizational_units_by_id – all merged units by their id

Returns:

dictionary of filtered merged resources by primary source and by unit.

mex.extractors.datenkompass.filter.find_descendant_units(merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], parent_unit_name: str) list[str]

Based on filter settings find descendant unit ids.

Parameters:
  • merged_organizational_units_by_id – merged organizational units by identifier.

  • parent_unit_name – name of the parent unit for which to find all descendants

Returns:

identifiers of the units which are descendants of the given parent unit.
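
Finding descendants amounts to walking the unit hierarchy downwards from the named parent. The sketch below illustrates that idea under stated assumptions: UnitSketch is a hypothetical stand-in for MergedOrganizationalUnit with a single name and a single parent reference, and the parent unit itself is assumed not to be part of the result.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class UnitSketch:
    """Hypothetical, simplified stand-in for a merged organizational unit."""

    identifier: str
    name: str
    parent_unit: str | None = None


def find_descendant_units_sketch(
    units_by_id: dict[str, UnitSketch], parent_unit_name: str
) -> list[str]:
    # Start from every unit carrying the requested name.
    roots = [u.identifier for u in units_by_id.values() if u.name == parent_unit_name]
    # Index children by the identifier of their parent.
    children: dict[str, list[str]] = {}
    for unit in units_by_id.values():
        if unit.parent_unit is not None:
            children.setdefault(unit.parent_unit, []).append(unit.identifier)
    # Depth-first walk; `seen` guards against cycles in the hierarchy.
    found: list[str] = []
    seen = set(roots)
    stack = list(roots)
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found


units = {
    "a": UnitSketch("a", "Abteilung 1"),
    "b": UnitSketch("b", "Fachgebiet 11", parent_unit="a"),
    "c": UnitSketch("c", "Projektgruppe 111", parent_unit="b"),
    "d": UnitSketch("d", "Abteilung 2"),
}
descendants = find_descendant_units_sketch(units, "Abteilung 1")
```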

mex.extractors.datenkompass.main module

mex.extractors.datenkompass.settings module

class mex.extractors.datenkompass.settings.DatenkompassSettings(*, schedule: str | None = None, organization_filter: str = 'Organization', cutoff_number_authors: int = 3, list_delimiter: str = '; ', min_keyword_item_length: int = 2, max_keyword_str_length: int = 50, mapping_path: AssetsPath = AssetsPath('mappings/mapping-to-external-schema/datenkompass'))

Bases: BaseModel

Settings submodel for the datenkompass extractor.

cutoff_number_authors: int
list_delimiter: str
mapping_path: AssetsPath
max_keyword_str_length: int
min_keyword_item_length: int
model_config = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_by_alias': True, 'validate_by_name': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

organization_filter: str
schedule: str | None

mex.extractors.datenkompass.transform module

mex.extractors.datenkompass.transform.filter_schlagworte(words: list[str | None], delim: str, min_word_length: int, max_string_length: int) str

Filter out certain words and limit final string to maximum length.

Parameters:
  • words – list of entries.

  • delim – list delimiter for joining the strings in list

  • min_word_length – minimal length of each word.

  • max_string_length – maximal length of final string of joined words.

Returns:

combined string.
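
A possible reading of the filtering and length limit is sketched below. Which words the real function treats as "certain words" and how it shortens the result are not stated here, so both rules in the sketch are assumptions: None entries and words shorter than min_word_length are skipped, and joining stops before the string would exceed max_string_length.

```python
from __future__ import annotations


def filter_schlagworte_sketch(
    words: list[str | None],
    delim: str,
    min_word_length: int,
    max_string_length: int,
) -> str:
    kept: list[str] = []
    for word in words:
        # Assumption: drop missing entries and words that are too short.
        if word is None or len(word) < min_word_length:
            continue
        candidate = delim.join([*kept, word])
        # Assumption: stop as soon as adding a word would exceed the limit.
        if len(candidate) > max_string_length:
            break
        kept.append(word)
    return delim.join(kept)


keywords = filter_schlagworte_sketch(
    ["Influenza", None, "a", "Impfung", "Surveillance"],
    delim="; ",
    min_word_length=2,
    max_string_length=20,
)
```

Here "Surveillance" is left out because the joined string would be longer than 20 characters.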

mex.extractors.datenkompass.transform.fix_quotes(string: str) str

Fix quote characters in titles or descriptions.

Removes surrounding (leading and trailing) double quotes and replaces in-string double quotes with single quotes.

Parameters:

string – The string to fix quotes for.

Returns:

The fixed string.
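
The described behaviour can be approximated with two standard string methods. This is a sketch of the documented effect, not the package's code; in particular, str.strip removes every leading and trailing double quote, which may differ from the real function when several quotes are stacked.

```python
def fix_quotes_sketch(string: str) -> str:
    # Remove double quotes at the start and end of the string.
    stripped = string.strip('"')
    # Replace double quotes remaining inside the string with single quotes.
    return stripped.replace('"', "'")


fixed = fix_quotes_sketch('"A "quoted" title"')
```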

mex.extractors.datenkompass.transform.get_abstract_or_description(abstracts: list[Text], delim: str) str

Get German list entries, join them and reformat HTML-formatted links.

Parameters:
  • abstracts – list of mixed-language strings with possible HTML-formatted links

  • delim – list delimiter for joining the strings in list

Returns:

joined German strings with links reformatted as plain-text URLs.
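
The link handling can be illustrated with a regular expression. The target format of the real function is not documented here, so this sketch assumes that each HTML anchor is replaced by its bare href URL. Entries are modelled as hypothetical (value, language) tuples instead of Text objects.

```python
from __future__ import annotations

import re

# Matches an HTML anchor and captures the URL in its href attribute.
LINK_PATTERN = re.compile(
    r'<a\s+[^>]*href="([^"]+)"[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL
)


def get_abstract_or_description_sketch(
    abstracts: list[tuple[str, str]], delim: str
) -> str:
    # Keep only entries tagged as German ("de" is an assumed language code).
    german = [value for value, language in abstracts if language == "de"]
    joined = delim.join(german)
    # Assumption: an anchor is replaced by the URL it points to.
    return LINK_PATTERN.sub(r"\1", joined)


description = get_abstract_or_description_sketch(
    [
        ('Siehe <a href="https://example.org/info">hier</a>.', "de"),
        ("See here.", "en"),
        ("Zweiter Teil", "de"),
    ],
    delim="; ",
)
```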

mex.extractors.datenkompass.transform.get_datenbank(item: MergedBibliographicResource) str | None

Get the first DOI URL or the first repository URL.

Parameters:

item – MergedBibliographicResource item.

Returns:

url as string.

mex.extractors.datenkompass.transform.get_email(responsible_unit_ids: list[MergedOrganizationalUnitIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) str | None

Get the first email address of referenced responsible units.

Parameters:
  • responsible_unit_ids – List of responsible unit identifiers

  • merged_organizational_units_by_id – dict of all merged organizational units by id

Returns:

first found email of a responsible unit as string, or None if no email is found.
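
The lookup can be sketched as a loop that returns on the first hit. UnitWithEmailSketch is a hypothetical stand-in for MergedOrganizationalUnit that carries only a list of email addresses, and identifiers are plain strings in this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class UnitWithEmailSketch:
    """Hypothetical stand-in holding only the email addresses of a unit."""

    email: list[str] = field(default_factory=list)


def get_email_sketch(
    responsible_unit_ids: list[str],
    units_by_id: dict[str, UnitWithEmailSketch],
) -> str | None:
    for unit_id in responsible_unit_ids:
        unit = units_by_id.get(unit_id)
        # Skip unknown units and units without any email address.
        if unit is not None and unit.email:
            return unit.email[0]
    return None


units_with_email = {
    "u1": UnitWithEmailSketch(),
    "u2": UnitWithEmailSketch(email=["fg11@example.org", "info@example.org"]),
}
email = get_email_sketch(["u1", "u2"], units_with_email)
```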

mex.extractors.datenkompass.transform.get_german_text(text_entries: list[Text]) list[str]

Get German entries of list as strings, if any exist.

If no German entry exists, return original list entries as strings. Always fix quotes in entries.

Parameters:

text_entries – list of text entries

Returns:

list of entries as strings
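
The fallback rule can be sketched as follows. TextSketch is a hypothetical stand-in for the Text type with a value and an optional language code, "de" is the assumed code for German, and the quote fixing is inlined as a simple strip and replace.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TextSketch:
    """Hypothetical stand-in for a text entry with a language tag."""

    value: str
    language: str | None = None


def get_german_text_sketch(text_entries: list[TextSketch]) -> list[str]:
    german = [entry.value for entry in text_entries if entry.language == "de"]
    # Fall back to all entries when no German entry exists.
    values = german or [entry.value for entry in text_entries]
    # Quotes are fixed in every returned entry.
    return [value.strip('"').replace('"', "'") for value in values]


mixed = get_german_text_sketch([TextSketch("Grippe", "de"), TextSketch("Flu", "en")])
fallback = get_german_text_sketch([TextSketch('"Flu"', "en")])
```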

mex.extractors.datenkompass.transform.get_german_vocabulary(entries: list[VocabularyT] | None) list[str | None]

Get German prefLabel for vocabularies.

Parameters:

entries – list of vocabulary type entries.

Returns:

list of German vocabulary entries as strings.

mex.extractors.datenkompass.transform.get_resource_email(responsible_reference_ids: list[MergedOrganizationalUnitIdentifier | MergedPersonIdentifier | MergedContactPointIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], merged_contact_points_by_id: dict[MergedContactPointIdentifier, MergedContactPoint]) str | None

Get the first email address of referenced responsible units or contact points.

Ignore referenced Persons.

Parameters:
  • responsible_reference_ids – List of referenced unit, contact point or person ids

  • merged_organizational_units_by_id – dict of all merged organizational units by id

  • merged_contact_points_by_id – Dict of all merged contact points by id

Returns:

first found email of a unit or contact as string, or None if no email is found.

mex.extractors.datenkompass.transform.get_title(item: MergedActivity) list[str]

Get shortName and title from merged activity item.

Parameters:

item – MergedActivity item.

Returns:

List of short names and titles of the activity as strings.

mex.extractors.datenkompass.transform.get_unit_shortname(responsible_unit_ids: list[MergedOrganizationalUnitIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], delim: str) str | None

Get shortName of merged units.

Parameters:
  • responsible_unit_ids – List of responsible unit identifiers

  • merged_organizational_units_by_id – dict of all merged organizational units by id

  • delim – delimiter for joining short name entries

Returns:

Short names of the responsible units, joined by the delimiter into one string.

mex.extractors.datenkompass.transform.handle_setval(set_value: list[str] | str | None) str

Return value of mapping setValues as string, even if setValues is a list.

Parameters:

set_value – setValues value of mapping

Returns:

stringified value of setValues.
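
One way to normalise such a value is shown below. How the real function joins list entries and what it returns for None are not documented here, so the delim parameter (which the real signature does not have) and the empty-string result for None are assumptions of this sketch.

```python
from __future__ import annotations


def handle_setval_sketch(set_value: list[str] | str | None, delim: str = "; ") -> str:
    # Assumption: a missing value becomes an empty string.
    if set_value is None:
        return ""
    # Assumption: list entries are joined with a delimiter.
    if isinstance(set_value, list):
        return delim.join(set_value)
    # A plain string is returned unchanged.
    return set_value


from_list = handle_setval_sketch(["Daten", "Studie"])
from_string = handle_setval_sketch("Daten")
```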

mex.extractors.datenkompass.transform.mapping_lookup_default(model: type[BaseModel], mapping: DatenkompassMapping) dict[str, DatenkompassMappingField]

Create a dictionary of fields by field name of Datenkompass mappings.

For this the alias name needs to be used as intermediate step, because the alias (not the field name) is the identifier in the mapping.

Parameters:
  • model – Datenkompass model.

  • mapping – Datenkompass mapping.

Returns:

dictionary of mapping field names to values.

mex.extractors.datenkompass.transform.transform_activities(filtered_merged_activities: list[MergedActivity], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) list[DatenkompassActivity]

Transform merged to datenkompass activities.

Parameters:
  • filtered_merged_activities – List of merged activities

  • merged_organizational_units_by_id – dict of merged organizational units by id

Returns:

list of DatenkompassActivity instances.

mex.extractors.datenkompass.transform.transform_bibliographic_resources(merged_bibliographic_resources: list[MergedBibliographicResource], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], datenkompass_person_str_by_id: dict[MergedPersonIdentifier, str]) list[DatenkompassBibliographicResource]

Transform merged to datenkompass bibliographic resources.

Parameters:
  • merged_bibliographic_resources – List of merged bibliographic resources

  • merged_organizational_units_by_id – dict of merged organizational units by id

  • datenkompass_person_str_by_id – dictionary of merged person names by id

Returns:

list of DatenkompassBibliographicResource instances.

mex.extractors.datenkompass.transform.transform_resources(merged_resources_by_primary_source_by_unit: dict[str, dict[str, list[MergedResource]]], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], merged_contact_points_by_id: dict[MergedContactPointIdentifier, MergedContactPoint]) dict[str, dict[str, list[DatenkompassResource]]]

Transform merged to datenkompass resources.

Parameters:
  • merged_resources_by_primary_source_by_unit – dictionary of merged resources

  • merged_organizational_units_by_id – dict of merged organizational units by id

  • merged_contact_points_by_id – dict of merged contact points

Returns:

dictionary of DatenkompassResource instances by primary source and by unit.

Module contents