mex.extractors.datenkompass package
Subpackages
- mex.extractors.datenkompass.models package
- Submodules
- mex.extractors.datenkompass.models.item module
DatenkompassActivity
DatenkompassActivity.beschreibung
DatenkompassActivity.datenbank
DatenkompassActivity.datenerhalt
DatenkompassActivity.datenhalter
DatenkompassActivity.datennutzungszweck
DatenkompassActivity.entityType
DatenkompassActivity.format
DatenkompassActivity.frequenz
DatenkompassActivity.hauptkategorie
DatenkompassActivity.herausgeber
DatenkompassActivity.identifier
DatenkompassActivity.kommentar
DatenkompassActivity.kontakt
DatenkompassActivity.model_computed_fields
DatenkompassActivity.model_config
DatenkompassActivity.model_fields
DatenkompassActivity.organisationseinheit
DatenkompassActivity.rechtsgrundlage
DatenkompassActivity.schlagwort
DatenkompassActivity.status
DatenkompassActivity.titel
DatenkompassActivity.unterkategorie
DatenkompassActivity.voraussetzungen
DatenkompassBibliographicResource
DatenkompassBibliographicResource.beschreibung
DatenkompassBibliographicResource.datenbank
DatenkompassBibliographicResource.datenerhalt
DatenkompassBibliographicResource.datenhalter
DatenkompassBibliographicResource.datennutzungszweck
DatenkompassBibliographicResource.datennutzungszweck_erweitert
DatenkompassBibliographicResource.dk_format
DatenkompassBibliographicResource.entityType
DatenkompassBibliographicResource.frequenz
DatenkompassBibliographicResource.hauptkategorie
DatenkompassBibliographicResource.herausgeber
DatenkompassBibliographicResource.identifier
DatenkompassBibliographicResource.kommentar
DatenkompassBibliographicResource.kontakt
DatenkompassBibliographicResource.model_computed_fields
DatenkompassBibliographicResource.model_config
DatenkompassBibliographicResource.model_fields
DatenkompassBibliographicResource.organisationseinheit
DatenkompassBibliographicResource.rechtsgrundlage
DatenkompassBibliographicResource.rechtsgrundlagen_benennung
DatenkompassBibliographicResource.schlagwort
DatenkompassBibliographicResource.status
DatenkompassBibliographicResource.titel
DatenkompassBibliographicResource.unterkategorie
DatenkompassBibliographicResource.voraussetzungen
DatenkompassResource
DatenkompassResource.beschreibung
DatenkompassResource.datenbank
DatenkompassResource.datenerhalt
DatenkompassResource.datenhalter
DatenkompassResource.datennutzungszweck
DatenkompassResource.datennutzungszweck_erweitert
DatenkompassResource.dk_format
DatenkompassResource.entityType
DatenkompassResource.frequenz
DatenkompassResource.hauptkategorie
DatenkompassResource.herausgeber
DatenkompassResource.identifier
DatenkompassResource.kommentar
DatenkompassResource.kontakt
DatenkompassResource.model_computed_fields
DatenkompassResource.model_config
DatenkompassResource.model_fields
DatenkompassResource.organisationseinheit
DatenkompassResource.rechtsgrundlage
DatenkompassResource.rechtsgrundlagen_benennung
DatenkompassResource.schlagwort
DatenkompassResource.status
DatenkompassResource.titel
DatenkompassResource.unterkategorie
DatenkompassResource.voraussetzungen
- Module contents
Submodules
mex.extractors.datenkompass.extract module
- mex.extractors.datenkompass.extract.get_merged_items(*, query_string: str | None = None, entity_type: list[str] | None = None, referenced_identifier: list[str] | None = None, reference_field: str | None = None) → list[MergedAccessPlatform | MergedActivity | MergedBibliographicResource | MergedConsent | MergedContactPoint | MergedDistribution | MergedOrganization | MergedOrganizationalUnit | MergedPerson | MergedPrimarySource | MergedResource | MergedVariable | MergedVariableGroup]
Fetch merged items from backend.
- Parameters:
query_string – Query string.
entity_type – List of entity types.
referenced_identifier – List of referenced identifiers.
reference_field – Field accepting the referenced identifiers.
- Returns:
List of merged items.
- mex.extractors.datenkompass.extract.get_relevant_primary_source_ids(relevant_primary_sources: list[str]) → list[str]
Get the IDs of the relevant primary sources.
- Parameters:
relevant_primary_sources – List of primary sources.
- Returns:
List of IDs of the relevant primary sources.
mex.extractors.datenkompass.filter module
- mex.extractors.datenkompass.filter.filter_for_organization(fetched_merged_activities: Sequence[MergedActivity], filtered_merged_organization_ids: set[MergedOrganizationIdentifier]) → list[MergedActivity]
Filter the merged activities based on the mapping specifications.
- Parameters:
fetched_merged_activities – Sequence of merged activities.
filtered_merged_organization_ids – Set of relevant merged organization ids.
- Returns:
Filtered list of merged activities.
- mex.extractors.datenkompass.filter.find_descendant_units(merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) → list[str]
Find the ids of descendant units based on the filter settings.
- Parameters:
merged_organizational_units_by_id – merged organizational units by identifier.
- Returns:
Identifiers of the units that are descendants of the unit given in the unit filter setting.
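The descendant lookup can be pictured as a walk over a parent/child hierarchy. The following is a minimal sketch, not the package's implementation: it assumes each unit is reduced to a hypothetical `parent_by_id` mapping (unit id to parent id or None) and that the filter has already been resolved to a `root_id`. Whether the real function includes the filter unit itself in the result is not stated here; the sketch returns descendants only.

```python
from typing import Dict, List, Optional


def find_descendants_sketch(
    parent_by_id: Dict[str, Optional[str]], root_id: str
) -> List[str]:
    """Return ids of all units below root_id (hypothetical stand-in)."""
    # Invert the child -> parent mapping into parent -> children.
    children: Dict[str, List[str]] = {}
    for unit_id, parent_id in parent_by_id.items():
        if parent_id is not None:
            children.setdefault(parent_id, []).append(unit_id)

    descendants: List[str] = []
    seen = {root_id}  # guards against cycles in malformed data
    stack = [root_id]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                descendants.append(child)
                stack.append(child)
    return descendants


# Example hierarchy: root -> a -> b, with an unrelated unit c.
example_parents = {"root": None, "a": "root", "b": "a", "c": None}
example_result = sorted(find_descendants_sketch(example_parents, "root"))
```

In this example `example_result` contains the ids `a` and `b`; the unrelated unit `c` is left out.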
mex.extractors.datenkompass.load module
- mex.extractors.datenkompass.load.start_s3_client() → BaseClient
Start up S3 session.
- Returns:
BaseClient of an S3 session.
- mex.extractors.datenkompass.load.write_items_to_xlsx(datenkompassitems: Sequence[DatenkompassActivity | DatenkompassBibliographicResource | DatenkompassResource], s3: BaseClient) → None
Write Datenkompass items to xlsx.
- Parameters:
datenkompassitems – List of Datenkompass items.
s3 – S3 session.
mex.extractors.datenkompass.main module
mex.extractors.datenkompass.settings module
- class mex.extractors.datenkompass.settings.DatenkompassSettings(*, unit_filter: str = 'e.g. unit', organization_filter: str = 'Organization', cutoff_number_authors: int = 3, list_delimiter: str = '; ')
Bases: BaseModel
Settings submodel for the datenkompass extractor.
- cutoff_number_authors: int
- list_delimiter: str
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'cutoff_number_authors': FieldInfo(annotation=int, required=False, default=3, description='Maximum number of extracted authors for Bibliographic resources'), 'list_delimiter': FieldInfo(annotation=str, required=False, default='; ', description='Seperator for different entries in a datenkompass model field.'), 'organization_filter': FieldInfo(annotation=str, required=False, default='Organization', description='Filter for organization'), 'unit_filter': FieldInfo(annotation=str, required=False, default='e.g. unit', description='Filter for unit')}
Metadata about the fields defined on the model; a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- organization_filter: str
- unit_filter: str
mex.extractors.datenkompass.transform module
- mex.extractors.datenkompass.transform.fix_quotes(string: str) → str
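To illustrate how `cutoff_number_authors` and `list_delimiter` might interact, here is a small sketch. It uses a plain dataclass (`SettingsStub`) with the documented defaults instead of the real pydantic model, and the helper `join_authors_sketch` is hypothetical: the actual extractor may apply the cutoff and the delimiter differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SettingsStub:
    """Stand-in mirroring the documented defaults of DatenkompassSettings."""

    unit_filter: str = "e.g. unit"
    organization_filter: str = "Organization"
    cutoff_number_authors: int = 3
    list_delimiter: str = "; "


def join_authors_sketch(authors: List[str], settings: SettingsStub) -> str:
    """Keep at most cutoff_number_authors names and join them (assumed usage)."""
    limited = authors[: settings.cutoff_number_authors]
    return settings.list_delimiter.join(limited)


example_authors = ["Muster, A.", "Beispiel, B.", "Probe, C.", "Test, D."]
example_joined = join_authors_sketch(example_authors, SettingsStub())
```

With the default settings, the fourth author is dropped and the remaining three names are joined with `'; '`.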
Fix quote characters in titles or descriptions.
Removes surrounding (leading and trailing) double quotes and replaces in-string double quotes with single quotes.
- Parameters:
string – The string to fix quotes for.
- Returns:
The fixed string.
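The behaviour described for `fix_quotes` can be approximated as follows. This is a sketch based only on the description above, not the package's code; in particular, using `str.strip` means that several consecutive leading or trailing double quotes are all removed, which is an assumption.

```python
def fix_quotes_sketch(string: str) -> str:
    """Approximate the documented quote handling (hypothetical stand-in)."""
    # Drop double quotes surrounding the whole string.
    stripped = string.strip('"')
    # Replace double quotes that remain inside the string with single quotes.
    return stripped.replace('"', "'")


example_fixed = fix_quotes_sketch('"A "quoted" title"')
```

Here `example_fixed` is `A 'quoted' title`.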
- mex.extractors.datenkompass.transform.get_datenbank(item: MergedBibliographicResource) → str | None
Get the first DOI URL or the first repository URL.
- Parameters:
item – MergedBibliographicResource item.
- Returns:
URL as string, or None.
- mex.extractors.datenkompass.transform.get_email(responsible_unit_ids: list[MergedOrganizationalUnitIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) → str | None
Get the first email address of referenced responsible units.
- Parameters:
responsible_unit_ids – List of responsible unit identifiers
merged_organizational_units_by_id – dict of all merged organizational units by id
- Returns:
first found email of a responsible unit as string, or None if no email is found.
- mex.extractors.datenkompass.transform.get_german_text(text_entries: list[Text]) → list[str]
Get the German entries of a list as strings, if any exist.
If no German entry exists, return the original list entries as strings. Always fix quotes in entries.
- Parameters:
text_entries – List of text entries.
- Returns:
List of entries as strings.
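The fallback logic described for `get_german_text` can be sketched like this. `TextStub` is a hypothetical stand-in for the `Text` type, and the language code `"de"` as well as the inline quote handling are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextStub:
    """Hypothetical stand-in for a text entry with an optional language."""

    value: str
    language: Optional[str] = None


def get_german_text_sketch(text_entries: List[TextStub]) -> List[str]:
    """Prefer German entries; otherwise fall back to all entries."""
    german = [entry.value for entry in text_entries if entry.language == "de"]
    selected = german or [entry.value for entry in text_entries]
    # Quote fixing as described for fix_quotes (assumed behaviour).
    return [value.strip('"').replace('"', "'") for value in selected]


example_mixed = get_german_text_sketch(
    [TextStub('"Titel"', "de"), TextStub("Title", "en")]
)
example_fallback = get_german_text_sketch([TextStub('A "b" c', "en")])
```

In the first call only the German entry is kept; in the second call no German entry exists, so the English entry is returned with its quotes fixed.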
- mex.extractors.datenkompass.transform.get_resource_email(responsible_reference_ids: list[MergedOrganizationalUnitIdentifier | MergedPersonIdentifier | MergedContactPointIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], merged_contact_points_by_id: dict[MergedContactPointIdentifier, MergedContactPoint]) → str | None
Get the first email address of referenced responsible units or contact points.
Ignore referenced Persons.
- Parameters:
responsible_reference_ids – List of referenced unit, contact point or person ids
merged_organizational_units_by_id – dict of all merged organizational units by id
merged_contact_points_by_id – Dict of all merged contact points by id
- Returns:
first found email of a unit or contact as string, or None if no email is found.
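A lookup of this kind could be sketched as below. `EmailHolderStub` is a hypothetical stand-in for both units and contact points, and the sketch assumes that person ids are ignored simply because they appear in neither dictionary; the real function may distinguish them by type instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EmailHolderStub:
    """Hypothetical stand-in for a unit or contact point with email addresses."""

    email: List[str] = field(default_factory=list)


def get_resource_email_sketch(
    reference_ids: List[str],
    units_by_id: Dict[str, EmailHolderStub],
    contact_points_by_id: Dict[str, EmailHolderStub],
) -> Optional[str]:
    """Return the first email found among referenced units or contact points."""
    for reference_id in reference_ids:
        holder = units_by_id.get(reference_id) or contact_points_by_id.get(
            reference_id
        )
        # Ids found in neither dict (e.g. persons) yield None and are skipped.
        if holder is not None and holder.email:
            return holder.email[0]
    return None


example_units = {"u1": EmailHolderStub([])}
example_contacts = {"c1": EmailHolderStub(["contact@example.org"])}
example_email = get_resource_email_sketch(
    ["p1", "u1", "c1"], example_units, example_contacts
)
```

The person id `p1` and the unit without an email are skipped, so `example_email` is the contact point's address.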
- mex.extractors.datenkompass.transform.get_title(item: MergedActivity) → list[str]
Get shortName and title from merged activity item.
- Parameters:
item – MergedActivity item.
- Returns:
List of short names and titles of the activity as strings.
- mex.extractors.datenkompass.transform.get_unit_shortname(responsible_unit_ids: list[MergedOrganizationalUnitIdentifier], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) → list[str]
Get the shortName of merged units.
- Parameters:
responsible_unit_ids – List of responsible unit identifiers
merged_organizational_units_by_id – dict of all merged organizational units by id
- Returns:
List of short names of contact units as strings.
- mex.extractors.datenkompass.transform.get_vocabulary(entries: list[_VocabularyT]) → list[str | None]
Get the German prefLabel for vocabularies.
- Parameters:
entries – List of vocabulary type entries.
- Returns:
List of German vocabulary entries as strings.
- mex.extractors.datenkompass.transform.transform_activities(filtered_merged_activities: list[MergedActivity], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit]) → list[DatenkompassActivity]
Transform merged activities to Datenkompass activities.
- Parameters:
filtered_merged_activities – List of merged activities
merged_organizational_units_by_id – dict of merged organizational units by id
- Returns:
list of DatenkompassActivity instances.
- mex.extractors.datenkompass.transform.transform_bibliographic_resources(merged_bibliographic_resources: list[MergedBibliographicResource], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], person_name_by_id: dict[MergedPersonIdentifier, str]) → list[DatenkompassBibliographicResource]
Transform merged bibliographic resources to Datenkompass bibliographic resources.
- Parameters:
merged_bibliographic_resources – List of merged bibliographic resources
merged_organizational_units_by_id – dict of merged organizational units by id
person_name_by_id – dictionary of merged person names by id
- Returns:
list of DatenkompassBibliographicResource instances.
- mex.extractors.datenkompass.transform.transform_resources(merged_resources_by_primary_source: dict[str, list[MergedResource]], merged_organizational_units_by_id: dict[MergedOrganizationalUnitIdentifier, MergedOrganizationalUnit], merged_contact_points_by_id: dict[MergedContactPointIdentifier, MergedContactPoint]) → list[DatenkompassResource]
Transform merged resources to Datenkompass resources.
- Parameters:
merged_resources_by_primary_source – dict of merged resources by primary source
merged_organizational_units_by_id – dict of merged organizational units by id
merged_contact_points_by_id – dict of merged contact points by id
- Returns:
list of DatenkompassResource instances.