mex.extractors.datscha_web package

Subpackages

Submodules

mex.extractors.datscha_web.connector module

class mex.extractors.datscha_web.connector.DatschaWebConnector

Bases: HTTPConnector

Connector class that handles credentials and parsing for the datscha web registry.

_set_authentication() None

Authenticate to the host.

_set_url() None

Set the URL of the host.

get_item(item_url: str) DatschaWebItem

Load and parse a single datscha item from the given URL.

Parameters:

item_url – URL of the item’s datscha page

Returns:

Parsed datscha item

Return type:

DatschaWebItem

get_item_urls() list[str]

Accumulate datscha item URLs by scraping the page verzeichnis.php.

Returns:

List of item URLs

mex.extractors.datscha_web.extract module

mex.extractors.datscha_web.extract.extract_datscha_web_items() Generator[DatschaWebItem, None, None]

Load datscha source items by scraping datscha-web pages.

Returns:

Generator for datscha web items
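A minimal sketch of how such a generator could combine the two connector methods documented above, get_item_urls() and get_item(). StubConnector and iter_items are hypothetical stand-ins so the example runs without network access; the package's own function may differ, for example in error handling.

```python
from typing import Iterator


class StubConnector:
    """Stand-in exposing the two public methods documented for DatschaWebConnector."""

    def __init__(self, pages: dict[str, str]) -> None:
        # Maps an item URL to an already "parsed" item (a plain string here).
        self._pages = pages

    def get_item_urls(self) -> list[str]:
        return list(self._pages)

    def get_item(self, item_url: str) -> str:
        return self._pages[item_url]


def iter_items(connector: StubConnector) -> Iterator[str]:
    """Lazily yield one item per URL reported by the connector."""
    for item_url in connector.get_item_urls():
        yield connector.get_item(item_url)
```

Because the function is a generator, no item is loaded until the caller starts iterating.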

mex.extractors.datscha_web.extract.extract_datscha_web_organizations(datscha_web_items: Iterable[DatschaWebItem]) dict[str, MergedOrganizationIdentifier]

Search for and extract organizations from Wikidata.

Parameters:

datscha_web_items – Iterable of DatschaWebItem

Returns:

Dict with keys DatschaWebItem.Auftragsverarbeiter, DatschaWebItem.Empfaenger_der_Daten_im_Drittstaat, and DatschaWebItem.Empfaenger_der_verarbeiteten_uebermittelten_oder_offengelegten_Daten, and MergedOrganizationIdentifier values
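The shape of the returned mapping can be illustrated with a small sketch. It assumes the three fields hold optional organization names and that a lookup callable resolves a name to an identifier. ItemStub, collect_organizations and the lookup parameter are hypothetical; the real function queries Wikidata and returns MergedOrganizationIdentifier values.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# The three DatschaWebItem fields named in the return description.
ORG_FIELDS = (
    "Auftragsverarbeiter",
    "Empfaenger_der_Daten_im_Drittstaat",
    "Empfaenger_der_verarbeiteten_uebermittelten_oder_offengelegten_Daten",
)


@dataclass
class ItemStub:
    """Stand-in for DatschaWebItem holding only the three organization fields."""

    Auftragsverarbeiter: Optional[str] = None
    Empfaenger_der_Daten_im_Drittstaat: Optional[str] = None
    Empfaenger_der_verarbeiteten_uebermittelten_oder_offengelegten_Daten: Optional[str] = None


def collect_organizations(
    items: Iterable[ItemStub],
    lookup: Callable[[str], Optional[str]],
) -> dict[str, str]:
    """Map each distinct organization name to the identifier the lookup returns."""
    found: dict[str, str] = {}
    for item in items:
        for field_name in ORG_FIELDS:
            name = getattr(item, field_name)
            if not name or name in found:
                continue  # empty field, or name already resolved
            identifier = lookup(name)
            if identifier is not None:
                found[name] = identifier
    return found
```

Names the lookup cannot resolve are left out of the result in this sketch; whether the package does the same is not stated here.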

mex.extractors.datscha_web.extract.extract_datscha_web_source_contacts(datscha_web_items: Iterable[DatschaWebItem]) Generator[LDAPPersonWithQuery, None, None]

Extract LDAP persons with their query string for datscha-web source contacts.

Parameters:

datscha_web_items – Datscha-web items

Returns:

Generator for LDAP persons with query

mex.extractors.datscha_web.main module

mex.extractors.datscha_web.parse_html module

mex.extractors.datscha_web.parse_html.parse_detail_block(detail_block: Tag) tuple[str, str]

Get the values of the first divs with classes “input_vorgabe” and “input_feld”.

Parameters:

detail_block – BeautifulSoup tag element

Returns:

First values for “input_vorgabe” and “input_feld”
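The lookup described above can be sketched with only the standard library. This is not the package's implementation, which receives a BeautifulSoup Tag. FirstDivTextParser and first_label_and_value are hypothetical names, and the sketch assumes the label and value are plain text inside the two divs.

```python
from html.parser import HTMLParser
from typing import Optional


class FirstDivTextParser(HTMLParser):
    """Collect the text of the first div seen for each wanted CSS class."""

    def __init__(self, wanted_classes: tuple[str, ...]) -> None:
        super().__init__()
        self.wanted = wanted_classes
        self.found: dict[str, str] = {}
        self._current: Optional[str] = None  # class currently being captured
        self._depth = 0  # divs nested inside the captured div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in self.wanted:
            if name in classes and name not in self.found:
                self._current = name
                self.found[name] = ""
                return

    def handle_endtag(self, tag):
        if tag != "div" or self._current is None:
            return
        if self._depth:
            self._depth -= 1
        else:
            # Closing the captured div: finalize its text.
            self.found[self._current] = self.found[self._current].strip()
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current] += data


def first_label_and_value(html: str) -> tuple[str, str]:
    """Return the first "input_vorgabe" text and the first "input_feld" text."""
    parser = FirstDivTextParser(("input_vorgabe", "input_feld"))
    parser.feed(html)
    parser.close()
    return parser.found.get("input_vorgabe", ""), parser.found.get("input_feld", "")
```

Later divs with the same class are ignored, which mirrors the "first values" wording of the return description.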

mex.extractors.datscha_web.parse_html.parse_item_urls_from_overview_html(html_data: str, url: str) list[str]

Parse the item URLs from the overview page.

Parameters:
  • html_data – Raw HTML data

  • url – Datscha URL prefix

Returns:

List of parsed URLs
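As an illustration of joining scraped links with the URL prefix, here is a standard-library sketch. It assumes that every anchor on the overview page points to an item; the real page may need a stricter filter. LinkCollector, collect_item_urls and the detail.php link target used in the usage note are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Record the href of every anchor tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without a link target
                self.hrefs.append(href)


def collect_item_urls(html_data: str, url: str) -> list[str]:
    """Resolve each collected href against the given URL prefix."""
    collector = LinkCollector()
    collector.feed(html_data)
    collector.close()
    return [urljoin(url, href) for href in collector.hrefs]
```

Called with the prefix "https://datscha/" and a page containing a link to "detail.php?id=1", the sketch yields "https://datscha/detail.php?id=1".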

mex.extractors.datscha_web.parse_html.parse_single_item_html(html_data: str, item_url: str) DatschaWebItem

Parse a single Datscha item from a details page.

Parameters:
  • html_data – Raw HTML

  • item_url – Datscha item url

Raises:

MExError – When the URL does not contain an ID

Returns:

Parsed datscha item
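The error case can be sketched as follows, assuming the item ID is carried in an "id" query parameter. That URL layout is an assumption, not something stated here. MissingItemIdError is a local stand-in for MExError, and item_id_from_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlparse


class MissingItemIdError(ValueError):
    """Stand-in for MExError, which this sketch does not import."""


def item_id_from_url(item_url: str) -> int:
    """Extract a numeric ID from the (assumed) "id" query parameter."""
    query = parse_qs(urlparse(item_url).query)
    values = query.get("id", [])
    if not values or not values[0].isdecimal():
        raise MissingItemIdError(f"no item ID in URL: {item_url}")
    return int(values[0])
```

Failing early on a missing ID keeps items without a stable identifier out of later processing steps.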

mex.extractors.datscha_web.parse_html.parse_unit_loz(bs4_object: BeautifulSoup) tuple[str, list[str]]

Parse units from the HTML of a single item.

Parameters:

bs4_object – BeautifulSoup object holding the content of a single item html

Returns:

Tuple of unit key and list of unit values, for example: “Liegenschaften/Organisationseinheiten (LOZ)”, [“FGx”, “FGy”]

mex.extractors.datscha_web.settings module

class mex.extractors.datscha_web.settings.DatschaWebSettings(*, url: str = 'https://datscha/', vorname: SecretStr = SecretStr('**********'), nachname: SecretStr = SecretStr('**********'), pw: SecretStr = SecretStr('**********'), organisation: str = 'RKI')

Bases: BaseModel

Settings submodel definition for datscha web extractor.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'nachname': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='Last name for login to datscha web service.'), 'organisation': FieldInfo(annotation=str, required=False, default='RKI', description='Organisation for login to datscha web service.'), 'pw': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='Password for login to datscha web service.'), 'url': FieldInfo(annotation=str, required=False, default='https://datscha/', description='URL of datscha web service.'), 'vorname': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='First name for login to datscha web service.')}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

nachname: SecretStr
organisation: str
pw: SecretStr
url: str
vorname: SecretStr

mex.extractors.datscha_web.transform module

mex.extractors.datscha_web.transform.transform_datscha_web_items_to_mex_activities(datscha_web_items: Iterable[DatschaWebItem], primary_source: ExtractedPrimarySource, person_stable_target_ids_by_query_string: dict[Hashable, list[Identifier]], unit_stable_target_ids_by_synonym: dict[str, Identifier], organizations_stable_target_ids_by_query_string: dict[str, MergedOrganizationIdentifier]) Generator[ExtractedActivity, None, None]

Transform datscha-web items to extracted activities.

Parameters:
  • datscha_web_items – Datscha-web items

  • primary_source – MEx primary_source for datscha-web

  • person_stable_target_ids_by_query_string – Mapping from author query to person stable target IDs

  • unit_stable_target_ids_by_synonym – Mapping from unit acronyms and labels to unit stable target IDs

  • organizations_stable_target_ids_by_query_string – Mapping from org queries to org stable target IDs

Returns:

Generator for ExtractedActivity items
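How the lookup mappings could be applied is sketched below with plain dicts. The item keys ("id", "contact_query", "units"), the output keys, and the name resolve_items are hypothetical; the real function works on DatschaWebItem objects and yields ExtractedActivity models.

```python
from collections.abc import Hashable, Iterable, Iterator


def resolve_items(
    items: Iterable[dict],
    person_ids_by_query: dict[Hashable, list[str]],
    unit_ids_by_synonym: dict[str, str],
) -> Iterator[dict]:
    """Yield one record per item with names replaced by stable target IDs."""
    for item in items:
        # Unknown or missing contact queries resolve to an empty list.
        contacts = person_ids_by_query.get(item.get("contact_query"), [])
        # Unit names without a known synonym are dropped.
        units = [
            unit_ids_by_synonym[name]
            for name in item.get("units", [])
            if name in unit_ids_by_synonym
        ]
        yield {"id": item["id"], "contact": contacts, "responsible_unit": units}
```

Passing the mappings in as arguments keeps the transformation free of LDAP or Wikidata access, so it can be exercised with small in-memory dicts.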

Module contents