mex.extractors.datenkompass package

Subpackages

Submodules

mex.extractors.datenkompass.extract module

mex.extractors.datenkompass.extract.get_filtered_primary_source_ids(filtered_primary_sources: list[str] | None) list[str]

Get the IDs of the relevant primary sources.

Parameters:

filtered_primary_sources – List of primary sources.

Returns:

List of IDs of the filtered relevant primary sources.

mex.extractors.datenkompass.extract.get_merged_items(*, query_string: str | None = None, entity_type: list[str] | None = None, referenced_identifier: list[str] | None = None, reference_field: str | None = None) list[MergedAccessPlatform | MergedActivity | MergedBibliographicResource | MergedConsent | MergedContactPoint | MergedDistribution | MergedOrganization | MergedOrganizationalUnit | MergedPerson | MergedPrimarySource | MergedResource | MergedVariable | MergedVariableGroup]

Fetch merged items from backend.

Parameters:
  • query_string – Query string.

  • entity_type – List of entity types.

  • referenced_identifier – List of identifiers.

  • reference_field – Field accepting identifiers.

Returns:

List of merged items.

mex.extractors.datenkompass.filter module

mex.extractors.datenkompass.filter.filter_for_organization(fetched_merged_activities: Sequence[MergedActivity], filtered_merged_organization_ids: set[MergedOrganizationIdentifier]) list[MergedActivity]

Filter the merged activities based on the mapping specifications.

Parameters:
  • fetched_merged_activities – merged activities as sequence.

  • filtered_merged_organization_ids – relevant merged organization ids.

Returns:

filtered list of merged activities.
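
The filtering step can be pictured as keeping every activity that references at least one of the relevant organizations. This is a minimal sketch under stated assumptions, not the package's implementation: `Activity`, its `organization_ids` attribute and `keep_for_organizations` are hypothetical stand-ins, and the actual fields consulted are defined by the mapping specifications.

```python
from collections.abc import Sequence
from dataclasses import dataclass, field


@dataclass
class Activity:
    """Hypothetical stand-in for MergedActivity."""

    identifier: str
    # Assumed attribute: ids of organizations referenced by the activity.
    organization_ids: list[str] = field(default_factory=list)


def keep_for_organizations(
    activities: Sequence[Activity], organization_ids: set[str]
) -> list[Activity]:
    """Keep activities that reference at least one relevant organization."""
    return [
        activity
        for activity in activities
        if organization_ids.intersection(activity.organization_ids)
    ]


activities = [
    Activity("a1", ["org-1", "org-9"]),
    Activity("a2", ["org-2"]),
    Activity("a3"),
]
kept = keep_for_organizations(activities, {"org-1"})
```

Here `kept` contains only the activity `a1`, because it is the only one sharing an id with the relevant set.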

mex.extractors.datenkompass.filter.find_descendant_units(merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) list[str]

Based on filter settings find descendant unit ids.

Parameters:

merged_organizational_units_by_id – merged organizational units by identifier.

Returns:

identifiers of units that are descendants of the unit named in the unit filter setting.
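
Finding descendants amounts to walking the unit hierarchy downwards from the filtered unit. The sketch below illustrates one way to do this with a breadth-first search. It is not the package's implementation: `Unit`, its `parent_ids` attribute and the explicit `root_id` argument are hypothetical stand-ins for MergedOrganizationalUnit and the unit filter setting, and whether the real function includes the filtered unit itself in its result is not stated here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical stand-in for MergedOrganizationalUnit."""

    identifier: str
    # Assumed attribute: ids of the direct parent units.
    parent_ids: list[str] = field(default_factory=list)


def find_descendants(units_by_id: dict[str, Unit], root_id: str) -> list[str]:
    """Return ids of all units below root_id (root itself excluded)."""
    # Invert the parent links into a parent -> children index.
    children: dict[str, list[str]] = {}
    for unit in units_by_id.values():
        for parent_id in unit.parent_ids:
            children.setdefault(parent_id, []).append(unit.identifier)

    descendants: list[str] = []
    seen = {root_id}
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for child_id in children.get(current, []):
            if child_id not in seen:  # guard against cyclic references
                seen.add(child_id)
                descendants.append(child_id)
                queue.append(child_id)
    return descendants


units = {
    "a": Unit("a"),
    "b": Unit("b", ["a"]),
    "c": Unit("c", ["b"]),
    "d": Unit("d"),
}
```

With this data, `find_descendants(units, "a")` yields the child `b` and the grandchild `c`, while the unrelated unit `d` is left out.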

mex.extractors.datenkompass.main module

mex.extractors.datenkompass.settings module

class mex.extractors.datenkompass.settings.DatenkompassSettings(*, unit_filter: str = 'e.g. unit', organization_filter: str = 'Organization', cutoff_number_authors: int = 3, list_delimiter: str = '; ', mapping_path: AssetsPath = AssetsPath('mappings/mapping-to-external-schema/datenkompass'))

Bases: BaseModel

Settings submodel for the datenkompass extractor.

cutoff_number_authors: int
list_delimiter: str
mapping_path: AssetsPath
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'cutoff_number_authors': FieldInfo(annotation=int, required=False, default=3, description='Maximum number of extracted authors for Bibliographic resources'), 'list_delimiter': FieldInfo(annotation=str, required=False, default='; ', description='Separator for different entries in a datenkompass model field.'), 'mapping_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("mappings/mapping-to-external-schema/datenkompass"), description='Path to the directory with the datenkompass mapping files containing the default values, absolute path or relative to `assets_dir`.'), 'organization_filter': FieldInfo(annotation=str, required=False, default='Organization', description='Filter for organization'), 'unit_filter': FieldInfo(annotation=str, required=False, default='e.g. unit', description='Filter for unit')}

Metadata about the fields defined on the model, mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

organization_filter: str
unit_filter: str

mex.extractors.datenkompass.transform module

mex.extractors.datenkompass.transform.fix_quotes(string: str) str

Fix quote characters in titles or descriptions.

Removes surrounding (leading and trailing) double quotes and replaces in-string double quotes with single quotes.

Parameters:

string – The string to fix quotes for.

Returns:

The fixed string.
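
The two steps described above (strip the surrounding double quotes, then replace the remaining ones) can be sketched as follows. `fix_quotes_sketch` is a hypothetical re-creation for illustration only; the real function may treat edge cases such as repeated outer quotes differently.

```python
def fix_quotes_sketch(string: str) -> str:
    """Strip surrounding double quotes, then turn inner ones into single quotes."""
    # str.strip removes every leading and trailing double quote character.
    stripped = string.strip('"')
    # Any double quote still present sits inside the string.
    return stripped.replace('"', "'")


fixed = fix_quotes_sketch('"A "quoted" title"')
```

After this call, `fixed` holds the text `A 'quoted' title`.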

mex.extractors.datenkompass.transform.get_abstract_or_description(abstracts: list[Text], delim: str) str

Get German list entries, join them and reformat HTML-formatted links.

Parameters:
  • abstracts – list of mixed language strings with possible HTML-formatted links

  • delim – list delimiter for joining the strings in list

Returns:

joined German strings with links reformatted as plain-text URLs.

mex.extractors.datenkompass.transform.get_datenbank(item: MergedBibliographicResource) str | None

Get the first DOI URL or, if there is none, the first repository URL.

Parameters:

item – MergedBibliographicResource item.

Returns:

URL as string, or None if neither exists.

mex.extractors.datenkompass.transform.get_email(responsible_unit_ids: list[MergedOrganizationalUnitIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) str | None

Get the first email address of referenced responsible units.

Parameters:
  • responsible_unit_ids – List of responsible unit identifiers

  • merged_organizational_units_by_id – dict of all merged organizational units by id

Returns:

first found email of a responsible unit as string, or None if no email is found.
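
The lookup described above can be sketched as a loop that returns on the first unit carrying an email address. This is an illustrative sketch, not the package's code: `Unit` and its `email` list are hypothetical stand-ins for MergedOrganizationalUnit, and plain strings stand in for the identifier types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Unit:
    """Hypothetical stand-in for MergedOrganizationalUnit."""

    # Assumed attribute: email addresses of the unit, possibly empty.
    email: list[str] = field(default_factory=list)


def first_email(unit_ids: list[str], units_by_id: dict[str, Unit]) -> Optional[str]:
    """Return the first email found among the referenced units, else None."""
    for unit_id in unit_ids:
        unit = units_by_id.get(unit_id)
        # Skip ids that cannot be resolved and units without any email.
        if unit is not None and unit.email:
            return unit.email[0]
    return None


units = {
    "u1": Unit(),
    "u2": Unit(["unit2@example.org", "alt@example.org"]),
}
```

The order of `unit_ids` decides which address wins, so `first_email(["u1", "u2"], units)` skips the unit without an address and returns the first address of `u2`.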

mex.extractors.datenkompass.transform.get_german_text(text_entries: list[Text]) list[str]

Get German entries of list as strings, if any exist.

If no German entry exists, return original list entries as strings. Always fix quotes in entries.

Parameters:

text_entries – list of text entries

Returns:

list of entries as strings
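
The fallback rule (prefer German entries, otherwise return everything) can be sketched as below. The `Text` dataclass here is a simplified, hypothetical stand-in with assumed `value` and `language` attributes, and the quote fixing mentioned above is deliberately left out to keep the sketch focused on the language selection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Text:
    """Simplified, hypothetical stand-in for the Text type."""

    value: str
    language: Optional[str] = None  # assumed language code such as "de" or "en"


def german_or_all(entries: list[Text]) -> list[str]:
    """Return the German values if there are any, else all values."""
    german = [entry.value for entry in entries if entry.language == "de"]
    # An empty list is falsy, so this falls back to all entries.
    return german or [entry.value for entry in entries]


mixed = [Text("Titel", "de"), Text("Title", "en")]
english_only = [Text("Title", "en"), Text("Untagged")]
```

For `mixed` only the German value is returned, while `english_only` comes back in full because it has no German entry.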

mex.extractors.datenkompass.transform.get_german_vocabulary(entries: list[_VocabularyT] | None) list[str | None]

Get German prefLabel for vocabularies.

Parameters:

entries – list of vocabulary type entries.

Returns:

list of German vocabulary entries as strings.

mex.extractors.datenkompass.transform.get_resource_email(responsible_reference_ids: list[MergedOrganizationalUnitIdentifier | MergedPersonIdentifier | MergedContactPointIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], merged_contact_points_by_id: dict[MergedContactPointIdentifier, MergedContactPoint]) str | None

Get the first email address of referenced responsible units or contact points.

Ignore referenced Persons.

Parameters:
  • responsible_reference_ids – List of referenced unit, contact point or person ids

  • merged_organizational_units_by_id – dict of all merged organizational units by id

  • merged_contact_points_by_id – Dict of all merged contact points by id

Returns:

first found email of a unit or contact as string, or None if no email is found.

mex.extractors.datenkompass.transform.get_title(item: MergedActivity) list[str]

Get shortName and title from merged activity item.

Parameters:

item – MergedActivity item.

Returns:

List of short names and titles of the activity as strings.

mex.extractors.datenkompass.transform.get_unit_shortname(responsible_unit_ids: list[MergedOrganizationalUnitIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], delim: str) str | None

Get shortName of merged units.

Parameters:
  • responsible_unit_ids – List of responsible unit identifiers

  • merged_organizational_units_by_id – dict of all merged organizational units by id

  • delim – delimiter for joining short name entries

Returns:

Short names of the responsible units joined into one string with delim, or None.
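
Collecting and joining the short names can be sketched as follows. This is only an illustration: `Unit` and its `short_name` list are hypothetical stand-ins for MergedOrganizationalUnit, and returning None when no short name is found is an assumption based on the `str | None` return type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Unit:
    """Hypothetical stand-in for MergedOrganizationalUnit."""

    # Assumed attribute: short names of the unit, possibly empty.
    short_name: list[str] = field(default_factory=list)


def joined_short_names(
    unit_ids: list[str], units_by_id: dict[str, Unit], delim: str
) -> Optional[str]:
    """Join the short names of all resolvable units, or return None."""
    names = [
        name
        for unit_id in unit_ids
        if unit_id in units_by_id  # ignore ids that cannot be resolved
        for name in units_by_id[unit_id].short_name
    ]
    # Assumption: an empty result is reported as None, not as "".
    return delim.join(names) if names else None


units = {"u1": Unit(["U1"]), "u2": Unit(["U2"]), "u3": Unit()}
```

With the default list delimiter from the settings, `joined_short_names(["u1", "u2"], units, "; ")` produces `U1; U2`.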

mex.extractors.datenkompass.transform.handle_setval(set_value: list[str] | str | None) str

Return value of mapping setValues as string, even if setValues is a list.

Parameters:

set_value – setValues value of mapping

Returns:

stringified value of setValues.
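
One plausible way to normalize the three accepted input shapes is sketched below. How the real function combines list entries or represents None is not documented here, so the `delim` parameter, the empty-string result for None and the name `setval_to_str` are all assumptions made for illustration.

```python
from typing import Union


def setval_to_str(set_value: Union[list[str], str, None], delim: str = "; ") -> str:
    """Return set_value as a single string, whatever shape it arrives in."""
    if set_value is None:
        return ""  # assumption: a missing value becomes an empty string
    if isinstance(set_value, list):
        return delim.join(set_value)  # assumption: list entries are joined
    return set_value


from_list = setval_to_str(["a", "b"])
from_str = setval_to_str("a")
```

In this sketch a list of two entries collapses into `a; b`, and a plain string passes through unchanged.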

mex.extractors.datenkompass.transform.mapping_lookup_default(model: type[BaseModel], mapping: DatenkompassMapping) dict[str, DatenkompassMappingField]

Create a dictionary of fields by field name of Datenkompass mappings.

For this the alias name needs to be used as intermediate step, because the alias (not the field name) is the identifier in the mapping.

Parameters:
  • model – Datenkompass model.

  • mapping – Datenkompass mapping.

Returns:

dictionary of mapping field names to values.

mex.extractors.datenkompass.transform.transform_activities(filtered_merged_activities: list[MergedActivity], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], activity_mapping: DatenkompassMapping) list[DatenkompassActivity]

Transform merged activities to Datenkompass activities.

Parameters:
  • filtered_merged_activities – List of merged activities

  • merged_organizational_units_by_id – dict of merged organizational units by id

  • activity_mapping – Datenkompass mapping.

Returns:

list of DatenkompassActivity instances.

mex.extractors.datenkompass.transform.transform_bibliographic_resources(merged_bibliographic_resources: list[MergedBibliographicResource], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], person_name_by_id: dict[MergedPersonIdentifier, str], bibliographic_resource_mapping: DatenkompassMapping) list[DatenkompassBibliographicResource]

Transform merged bibliographic resources to Datenkompass bibliographic resources.

Parameters:
  • merged_bibliographic_resources – List of merged bibliographic resources

  • merged_organizational_units_by_id – dict of merged organizational units by id

  • person_name_by_id – dictionary of merged person names by id

  • bibliographic_resource_mapping – Datenkompass mapping.

Returns:

list of DatenkompassBibliographicResource instances.

mex.extractors.datenkompass.transform.transform_resources(merged_resources_by_primary_source: dict[str, list[MergedResource]], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], merged_contact_points_by_id: dict[MergedContactPointIdentifier, MergedContactPoint], resource_mapping: DatenkompassMapping) list[DatenkompassResource]

Transform merged resources to Datenkompass resources.

Parameters:
  • merged_resources_by_primary_source – dictionary of merged resources

  • merged_organizational_units_by_id – dict of merged organizational units by id

  • merged_contact_points_by_id – dict of merged contact points

  • resource_mapping – Datenkompass mapping.

Returns:

list of DatenkompassResource instances.

Module contents