mex.extractors.datenkompass package

Subpackages

Submodules

mex.extractors.datenkompass.extract module

mex.extractors.datenkompass.extract.get_merged_items(*, query_string: str | None = None, entity_type: list[str] | None = None, referenced_identifier: list[str] | None = None, reference_field: str | None = None) list[MergedAccessPlatform | MergedActivity | MergedBibliographicResource | MergedConsent | MergedContactPoint | MergedDistribution | MergedOrganization | MergedOrganizationalUnit | MergedPerson | MergedPrimarySource | MergedResource | MergedVariable | MergedVariableGroup]

Fetch merged items from backend.

Parameters:
  • query_string – Query string.

  • entity_type – List of entity types.

  • referenced_identifier – List of identifiers.

  • reference_field – Field accepting identifiers.

Returns:

List of merged items.

mex.extractors.datenkompass.extract.get_relevant_primary_source_ids(relevant_primary_sources: list[str]) list[str]

Get the IDs of the relevant primary sources.

Parameters:

relevant_primary_sources – List of primary sources.

Returns:

List of IDs of the relevant primary sources.

mex.extractors.datenkompass.filter module

mex.extractors.datenkompass.filter.filter_for_organization(fetched_merged_activities: Sequence[MergedActivity], filtered_merged_organization_ids: set[MergedOrganizationIdentifier]) list[MergedActivity]

Filter the merged activities based on the mapping specifications.

Parameters:
  • fetched_merged_activities – merged activities as sequence.

  • filtered_merged_organization_ids – relevant merged organization ids.

Returns:

filtered list of merged activities.

mex.extractors.datenkompass.filter.find_descendant_units(merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) list[str]

Based on filter settings find descendant unit ids.

Parameters:

merged_organizational_units_by_id – merged organizational units by identifier.

Returns:

Identifiers of the units that are descendants of the unit named in the unit filter setting.
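
The descendant search can be pictured as a traversal of the unit hierarchy. The following is only a minimal sketch, not the extractor's implementation: it assumes each unit refers to at most one parent, and it replaces the merged unit models with a plain child-to-parent dict. The names find_descendants and parent_by_unit are hypothetical. Whether the real function also returns the filter unit itself is not stated here; the sketch returns descendants only.

```python
from __future__ import annotations

from collections import defaultdict, deque


def find_descendants(parent_by_unit: dict[str, str | None], root_id: str) -> list[str]:
    """Return the ids of all units below root_id in breadth-first order."""
    # Invert the child -> parent mapping into parent -> children.
    children_by_parent: dict[str, list[str]] = defaultdict(list)
    for unit_id, parent_id in parent_by_unit.items():
        if parent_id is not None:
            children_by_parent[parent_id].append(unit_id)

    descendants: list[str] = []
    seen = {root_id}
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for child_id in children_by_parent.get(current, []):
            if child_id not in seen:  # guard against cyclic data
                seen.add(child_id)
                descendants.append(child_id)
                queue.append(child_id)
    return descendants


# Hypothetical hierarchy: "a" and "b" sit below "root", "a1" sits below "a".
units = {"root": None, "a": "root", "b": "root", "a1": "a", "other": None}
print(find_descendants(units, "root"))  # ['a', 'b', 'a1']
```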

mex.extractors.datenkompass.load module

mex.extractors.datenkompass.load.start_s3_client() BaseClient

Start up S3 session.

Returns:

BaseClient of an S3 session.

mex.extractors.datenkompass.load.write_items_to_xlsx(datenkompassitems: Sequence[DatenkompassActivity | DatenkompassBibliographicResource | DatenkompassResource], s3: BaseClient) None

Write Datenkompass items to xlsx.

Parameters:
  • datenkompassitems – List of Datenkompass items.

  • s3 – S3 session.

mex.extractors.datenkompass.main module

mex.extractors.datenkompass.settings module

class mex.extractors.datenkompass.settings.DatenkompassSettings(*, unit_filter: str = 'e.g. unit', organization_filter: str = 'Organization', cutoff_number_authors: int = 3, list_delimiter: str = '; ')

Bases: BaseModel

Settings submodel for the datenkompass extractor.

cutoff_number_authors: int
list_delimiter: str
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'cutoff_number_authors': FieldInfo(annotation=int, required=False, default=3, description='Maximum number of extracted authors for Bibliographic resources'), 'list_delimiter': FieldInfo(annotation=str, required=False, default='; ', description='Seperator for different entries in a datenkompass model field.'), 'organization_filter': FieldInfo(annotation=str, required=False, default='Organization', description='Filter for organization'), 'unit_filter': FieldInfo(annotation=str, required=False, default='e.g. unit', description='Filter for unit')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

organization_filter: str
unit_filter: str
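
To illustrate how cutoff_number_authors and list_delimiter might interact, here is a small sketch using the documented defaults (3 and '; '). The helper join_authors is hypothetical and not part of the package. How the extractor treats authors beyond the cutoff is not documented here, so the sketch simply drops them.

```python
def join_authors(
    names: list,
    cutoff_number_authors: int = 3,
    list_delimiter: str = "; ",
) -> str:
    """Keep at most cutoff_number_authors names and join them into one string."""
    # Assumption: authors beyond the cutoff are dropped without a marker.
    return list_delimiter.join(names[:cutoff_number_authors])


print(join_authors(["Ada", "Bob", "Cem", "Dee"]))  # Ada; Bob; Cem
```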

mex.extractors.datenkompass.transform module

mex.extractors.datenkompass.transform.fix_quotes(string: str) str

Fix quote characters in titles or descriptions.

Removes surrounding (leading and trailing) double quotes and replaces in-string double quotes with single quotes.

Parameters:

string – The string to fix quotes for.

Returns:

The fixed string.
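
The behaviour described above can be approximated with two standard string methods. This is a sketch under assumptions, not the package's code: fix_quotes_sketch is a hypothetical name, and str.strip removes every leading and trailing double quote, which may differ from the real function if it removes only a single pair.

```python
def fix_quotes_sketch(string: str) -> str:
    """Drop surrounding double quotes, then turn inner ones into single quotes."""
    # strip('"') removes all double quotes at both ends of the string.
    without_outer = string.strip('"')
    return without_outer.replace('"', "'")


print(fix_quotes_sketch('"Study on "long COVID" outcomes"'))
# Study on 'long COVID' outcomes
```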

mex.extractors.datenkompass.transform.get_datenbank(item: MergedBibliographicResource) str | None

Get the first DOI URL or the first repository URL.

Parameters:

item – MergedBibliographicResource item.

Returns:

URL as string, or None if no URL is found.

mex.extractors.datenkompass.transform.get_email(responsible_unit_ids: list[MergedOrganizationalUnitIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) str | None

Get the first email address of referenced responsible units.

Parameters:
  • responsible_unit_ids – List of responsible unit identifiers

  • merged_organizational_units_by_id – dict of all merged organizational units by id

Returns:

first found email of a responsible unit as string, or None if no email is found.

mex.extractors.datenkompass.transform.get_german_text(text_entries: list[Text]) list[str]

Get the German entries of the list as strings, if any exist.

If no German entry exists, return the original list entries as strings. Quotes are always fixed in the returned entries.

Parameters:

text_entries – list of text entries

Returns:

list of entries as strings
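
The fallback rule can be sketched as follows. TextStandIn is a hypothetical stand-in for the Text model, assumed to carry a value and a language code with "de" marking German. The quote handling is inlined and follows the description of fix_quotes rather than calling it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TextStandIn:
    """Hypothetical stand-in for the Text model (assumed fields)."""

    value: str
    language: str | None = None


def german_text_sketch(text_entries: list[TextStandIn]) -> list[str]:
    """Prefer German entries; fall back to all entries if none is German."""
    german = [entry for entry in text_entries if entry.language == "de"]
    selected = german or text_entries  # an empty list triggers the fallback
    # Quote fixing as described for fix_quotes, inlined here.
    return [entry.value.strip('"').replace('"', "'") for entry in selected]


mixed = [TextStandIn("Title", "en"), TextStandIn('"Titel"', "de")]
english_only = [TextStandIn('An "English" title', "en")]
print(german_text_sketch(mixed))         # ['Titel']
print(german_text_sketch(english_only))  # ["An 'English' title"]
```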

mex.extractors.datenkompass.transform.get_resource_email(responsible_reference_ids: list[MergedOrganizationalUnitIdentifier | MergedPersonIdentifier | MergedContactPointIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], merged_contact_points_by_id: dict[MergedContactPointIdentifier, MergedContactPoint]) str | None

Get the first email address of referenced responsible units or contact points.

Ignore referenced Persons.

Parameters:
  • responsible_reference_ids – List of referenced unit, contact point or person ids

  • merged_organizational_units_by_id – dict of all merged organizational units by id

  • merged_contact_points_by_id – Dict of all merged contact points by id

Returns:

first found email of a unit or contact as string, or None if no email is found.
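
One way to read this lookup is sketched below. HasEmail is a hypothetical stand-in for both merged units and merged contact points, assumed to expose a list of email addresses. The sketch also assumes that "first" follows the order of the reference ids, which the description does not confirm. Person ids appear in neither lookup dict and are therefore skipped.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HasEmail:
    """Hypothetical stand-in for a merged unit or contact point."""

    email: list[str] = field(default_factory=list)


def first_email(
    reference_ids: list[str],
    units_by_id: dict[str, HasEmail],
    contact_points_by_id: dict[str, HasEmail],
) -> str | None:
    """Return the first email of a referenced unit or contact point."""
    for reference_id in reference_ids:
        item = units_by_id.get(reference_id)
        if item is None:
            item = contact_points_by_id.get(reference_id)
        # Ids found in neither dict (e.g. persons) leave item as None.
        if item is not None and item.email:
            return item.email[0]
    return None


units = {"u1": HasEmail(), "u2": HasEmail(["unit@example.org"])}
contacts = {"c1": HasEmail(["contact@example.org"])}
print(first_email(["p1", "u1", "c1", "u2"], units, contacts))  # contact@example.org
```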

mex.extractors.datenkompass.transform.get_title(item: MergedActivity) list[str]

Get shortName and title from merged activity item.

Parameters:

item – MergedActivity item.

Returns:

List of short names and titles of the activity as strings.

mex.extractors.datenkompass.transform.get_unit_shortname(responsible_unit_ids: list[MergedOrganizationalUnitIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) list[str]

Get shortName of merged units.

Parameters:
  • responsible_unit_ids – List of responsible unit identifiers

  • merged_organizational_units_by_id – dict of all merged organizational units by id

Returns:

List of short names of contact units as strings.

mex.extractors.datenkompass.transform.get_vocabulary(entries: list[_VocabularyT]) list[str | None]

Get the German prefLabel for vocabulary entries.

Parameters:

entries – list of vocabulary type entries.

Returns:

list of German vocabulary entries as strings.

mex.extractors.datenkompass.transform.transform_activities(filtered_merged_activities: list[MergedActivity], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) list[DatenkompassActivity]

Transform merged to datenkompass activities.

Parameters:
  • filtered_merged_activities – List of merged activities

  • merged_organizational_units_by_id – dict of merged organizational units by id

Returns:

list of DatenkompassActivity instances.

mex.extractors.datenkompass.transform.transform_bibliographic_resources(merged_bibliographic_resources: list[MergedBibliographicResource], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], person_name_by_id: dict[MergedPersonIdentifier, str]) list[DatenkompassBibliographicResource]

Transform merged to datenkompass bibliographic resources.

Parameters:
  • merged_bibliographic_resources – List of merged bibliographic resources

  • merged_organizational_units_by_id – dict of merged organizational units by id

  • person_name_by_id – dictionary of merged person names by id

Returns:

list of DatenkompassBibliographicResource instances.

mex.extractors.datenkompass.transform.transform_resources(merged_resources_by_primary_source: dict[str, list[MergedResource]], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], merged_contact_points_by_id: dict[MergedContactPointIdentifier, MergedContactPoint]) list[DatenkompassResource]

Transform merged to datenkompass resources.

Parameters:
  • merged_resources_by_primary_source – dictionary of merged resources

  • merged_organizational_units_by_id – dict of merged organizational units by id

  • merged_contact_points_by_id – dict of merged contact points

Returns:

list of DatenkompassResource instances.

Module contents