mex.extractors.confluence_vvt package

Submodules

mex.extractors.confluence_vvt.connector module

class mex.extractors.confluence_vvt.connector.ConfluenceVvtConnector

Bases: HTTPConnector

Connector class to create a session for all requests to confluence-vvt.

_set_authentication() None

Authenticate to the host.

_set_url() None

Set url of the host.

get_page_by_id(page_id: str) ConfluenceVvtPage | None

Get confluence page data by its id.

Parameters:

page_id – confluence page id

Returns:

ConfluenceVvtPage or None

mex.extractors.confluence_vvt.extract module

mex.extractors.confluence_vvt.extract.extract_confluence_vvt_authors(authors: list[str]) list[LDAPPersonWithQuery]

Extract LDAP persons with their query string for confluence-vvt authors.

Parameters:

authors – list of authors

Returns:

List of LDAP persons with query

mex.extractors.confluence_vvt.extract.fetch_all_vvt_pages_ids() Generator[str, None, None]

Fetch all the ids for data pages.

Settings:

confluence_vvt.url: Confluence-vvt base url

confluence_vvt.overview_page_id: page id of the overview page

Raises:

MExError – When the pagination limit is exceeded

Returns:

Generator for page IDs
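
The pagination behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the extractor's actual implementation: iter_page_ids, fake_fetch, batch_size, max_batches and PaginationLimitError (a stand-in for MExError) are hypothetical names, and the fake backend replaces the real Confluence API.

```python
from __future__ import annotations

from collections.abc import Callable, Generator


class PaginationLimitError(Exception):
    """Hypothetical stand-in for MExError."""


def iter_page_ids(
    fetch_batch: Callable[[int, int], list[str]],
    batch_size: int = 100,
    max_batches: int = 10,
) -> Generator[str, None, None]:
    """Yield ids batch by batch and raise if the batch limit is exceeded."""
    for batch_number in range(max_batches):
        batch = fetch_batch(batch_number * batch_size, batch_size)
        yield from batch
        if len(batch) < batch_size:
            return  # a short batch means there are no more results
    raise PaginationLimitError("pagination limit exceeded")


# Fake backend with five ids, standing in for the Confluence API.
ALL_IDS = ["101", "102", "103", "104", "105"]


def fake_fetch(start: int, limit: int) -> list[str]:
    """Return the slice of ids that a paginated request would deliver."""
    return ALL_IDS[start : start + limit]


page_ids = list(iter_page_ids(fake_fetch, batch_size=2, max_batches=5))
```

Because the function is a generator, ids are produced lazily and the error only surfaces once the consumer has iterated past the allowed number of batches.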

mex.extractors.confluence_vvt.extract.get_all_persons_from_all_pages(pages: list[ConfluenceVvtPage], confluence_vvt_activity_mapping: Any) list[str]

Get a list of all persons from all confluence pages.

Parameters:
  • pages – all confluence-vvt pages

  • confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of all persons on all confluence pages

mex.extractors.confluence_vvt.extract.get_all_units_from_all_pages(pages: list[ConfluenceVvtPage], confluence_vvt_activity_mapping: Any) list[str]

Get a list of all units from all confluence pages.

Parameters:
  • pages – all confluence-vvt pages

  • confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of all units on all confluence pages

mex.extractors.confluence_vvt.extract.get_contact_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: Any) list[str]

Get contact from confluence page.

Parameters:
  • page – confluence-vvt page

  • confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of contacts

mex.extractors.confluence_vvt.extract.get_involved_persons_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: Any) list[str]

Get involved persons from confluence page.

Parameters:
  • page – confluence-vvt page

  • confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of involved persons

mex.extractors.confluence_vvt.extract.get_involved_units_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: Any) list[str]

Get involved units from confluence page.

Parameters:
  • page – confluence-vvt page

  • confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of involved units

mex.extractors.confluence_vvt.extract.get_page_data_by_id(page_ids: Iterable[str]) Generator[ConfluenceVvtPage, None, None]

Get confluence page data for each of the given ids.

Parameters:

page_ids – iterable of confluence page ids

Returns:

Generator of ConfluenceVvtPage
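
As a rough sketch of how such a generator could combine with the connector's get_page_by_id: FakeConnector and iter_pages are hypothetical stand-ins, pages are plain dicts rather than ConfluenceVvtPage models, and skipping ids for which the connector returns None is an assumption, not documented behaviour.

```python
from __future__ import annotations

from collections.abc import Generator, Iterable


class FakeConnector:
    """Hypothetical stand-in for ConfluenceVvtConnector, backed by a dict."""

    def __init__(self, pages: dict[str, dict]) -> None:
        self._pages = pages

    def get_page_by_id(self, page_id: str) -> dict | None:
        """Return the page data, or None if the page is unknown."""
        return self._pages.get(page_id)


def iter_pages(
    connector: FakeConnector, page_ids: Iterable[str]
) -> Generator[dict, None, None]:
    """Lazily yield page data for every id the connector can resolve."""
    for page_id in page_ids:
        page = connector.get_page_by_id(page_id)
        if page is not None:  # assumption: unresolved pages are skipped
            yield page


connector = FakeConnector({"1": {"title": "Page one"}, "3": {"title": "Page three"}})
titles = [page["title"] for page in iter_pages(connector, ["1", "2", "3"])]
```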

mex.extractors.confluence_vvt.extract.get_responsible_unit_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: Any) list[str]

Get responsible unit from confluence page.

Parameters:
  • page – confluence-vvt page

  • confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of responsible units

mex.extractors.confluence_vvt.main module

mex.extractors.confluence_vvt.models module

class mex.extractors.confluence_vvt.models.ConfluenceVvtCell

Bases: BaseModel

Base class for cells in a confluence table.

abstract get_texts() list[str]

Returns all texts in this cell.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

search(pattern: str) list[str]

Returns found strings with matching pattern.
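
The relationship between the abstract get_texts() and search() can be sketched with standard-library classes. This is only an illustration: the real models are pydantic classes, the names Cell, Heading and Value are simplified stand-ins, and the matching rule (keep every text that contains a regular-expression match) is an assumption, since the actual behaviour of search() is not documented here.

```python
from __future__ import annotations

import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Cell(ABC):
    """Simplified stand-in for ConfluenceVvtCell."""

    @abstractmethod
    def get_texts(self) -> list[str]:
        """Return all texts in this cell."""

    def search(self, pattern: str) -> list[str]:
        # Assumption: keep each text that contains a match for the pattern.
        return [text for text in self.get_texts() if re.search(pattern, text)]


@dataclass
class Heading(Cell):
    """Heading cell holding a single optional text."""

    text: str | None

    def get_texts(self) -> list[str]:
        return [self.text] if self.text else []


@dataclass
class Value(Cell):
    """Value cell holding an optional list of texts."""

    texts: list[str] | None

    def get_texts(self) -> list[str]:
        return list(self.texts or [])


# Made-up example data.
heading = Heading(text="Interne Vorgangsnummer")
value = Value(texts=["DS-2021-001", "see also DS-2020-017", "n/a"])
matches = value.search(r"DS-\d{4}-\d{3}")
```

Defining search() once on the base class means each concrete cell type only has to say how its texts are stored.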

class mex.extractors.confluence_vvt.models.ConfluenceVvtHeading(*, text: str | None)

Bases: ConfluenceVvtCell

Model class for confluence heading.

get_texts() list[str]

Returns all text in this heading.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'text': FieldInfo(annotation=Union[str, NoneType], required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

text: str | None

class mex.extractors.confluence_vvt.models.ConfluenceVvtPage(*, id: int, title: str, tables: list[ConfluenceVvtTable])

Bases: BaseRawData

Model class for confluence page.

get_end_year() TemporalEntity | None

Return end year from extractor.

get_identifier_in_primary_source() str | None

Return identifier in primary source from extractor.

get_partners() Sequence[str | None]

Return partners from extractor.

get_start_year() TemporalEntity | None

Return start year from extractor.

get_units() Sequence[str | None]

Return units from extractor.

id: int

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'id': FieldInfo(annotation=int, required=True), 'tables': FieldInfo(annotation=list[ConfluenceVvtTable], required=True), 'title': FieldInfo(annotation=str, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

tables: list[ConfluenceVvtTable]
title: str

class mex.extractors.confluence_vvt.models.ConfluenceVvtRow(*, cells: list[ConfluenceVvtHeading | ConfluenceVvtValue])

Bases: BaseModel

Model class for confluence row.

cells: list[ConfluenceVvtHeading | ConfluenceVvtValue]

get_texts() list[str]

Returns all text in this row.

is_heading() bool

Returns whether all cells in this row are headings.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'cells': FieldInfo(annotation=list[Union[ConfluenceVvtHeading, ConfluenceVvtValue]], required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

class mex.extractors.confluence_vvt.models.ConfluenceVvtTable(*, rows: list[ConfluenceVvtRow])

Bases: BaseModel

Model class for confluence table.

get_value_row_by_heading(heading: str) ConfluenceVvtRow

If heading is found in a row, return the next row.

Parameters:

heading – Heading string to search for.

Returns:

ConfluenceVvt row instance.

Raises:
  • ValueError – If no row was found matching the given heading.

  • TypeError – If the next row is not a ConfluenceVvt value row.

  • IndexError – If there is no row after the heading that was found.
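
The lookup and its three error cases can be sketched as follows. This is a simplified illustration using dataclasses instead of the pydantic models; the exact-match rule for the heading text and the order in which the checks run are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Cell:
    """Simplified cell: one text plus a flag marking heading cells."""

    text: str
    heading: bool = False


@dataclass
class Row:
    cells: list[Cell]

    def get_texts(self) -> list[str]:
        return [cell.text for cell in self.cells]

    def is_heading(self) -> bool:
        return all(cell.heading for cell in self.cells)


@dataclass
class Table:
    rows: list[Row]

    def get_value_row_by_heading(self, heading: str) -> Row:
        """Return the row that follows the heading row containing `heading`."""
        for index, row in enumerate(self.rows):
            # Assumption: the heading must match one cell text exactly.
            if row.is_heading() and heading in row.get_texts():
                if index + 1 >= len(self.rows):
                    raise IndexError(f"no row after heading {heading!r}")
                next_row = self.rows[index + 1]
                if next_row.is_heading():
                    raise TypeError(f"row after {heading!r} is not a value row")
                return next_row
        raise ValueError(f"no row with heading {heading!r}")


# Made-up example table: heading row, value row, trailing heading row.
table = Table(
    rows=[
        Row([Cell("Verantwortliche(r)", heading=True), Cell("OE", heading=True)]),
        Row([Cell("Erika Mustermann"), Cell("FG 99")]),
        Row([Cell("Zwecke", heading=True)]),
    ]
)
value_texts = table.get_value_row_by_heading("OE").get_texts()
```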

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'rows': FieldInfo(annotation=list[ConfluenceVvtRow], required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

rows: list[ConfluenceVvtRow]

class mex.extractors.confluence_vvt.models.ConfluenceVvtValue(*, texts: list[str] | None)

Bases: ConfluenceVvtCell

Model class for confluence value cell.

get_texts() list[str]

Returns all text in this value cell.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'texts': FieldInfo(annotation=Union[list[str], NoneType], required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

texts: list[str] | None

mex.extractors.confluence_vvt.parse_html module

mex.extractors.confluence_vvt.parse_html.get_clean_current_row_all_cols_data(current_row_all_cols_data: list[str]) list[str]

Get clean data for all cols in current row, removing all unwanted characters.

Parameters:

current_row_all_cols_data – List of all columns of current row

Returns:

list of cleaned strings for all columns of current row
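
Which characters count as unwanted is not specified here. The sketch below assumes, purely for illustration, that cleaning means replacing non-breaking spaces, collapsing runs of whitespace and trimming each column; clean_columns is a hypothetical name, not the module's function.

```python
from __future__ import annotations

import re


def clean_columns(columns: list[str]) -> list[str]:
    """Normalise whitespace in every column text of one table row."""
    cleaned = []
    for column in columns:
        text = column.replace("\xa0", " ")  # non-breaking space from HTML
        text = re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace
        cleaned.append(text)
    return cleaned


# Made-up row data as it might come out of an HTML table.
cleaned_row = clean_columns(["  Abteilung\xa01 ", "Max\n Mustermann", "\t"])
```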

mex.extractors.confluence_vvt.parse_html.get_interne_vorgangsnummer_from_all_rows_data(intnmr_dict: Any | None | list[str]) list[str] | Any

Get Interne Vorgangsnummer from the extracted table.

Parameters:

intnmr_dict – Extracted dict or list of Interne Vorgangsnummer

Returns:

list of extracted Interne Vorgangsnummer

mex.extractors.confluence_vvt.parse_html.get_interne_vorgangsnummer_from_title(interne_vorgangsnummer_title: str) list[str]

Extract Interne Vorgangsnummer from the title row.

Parameters:

interne_vorgangsnummer_title – Interne Vorgangsnummer title

Returns:

list of extracted Interne Vorgangsnummer from the title

mex.extractors.confluence_vvt.parse_html.get_row_data_for_all_rows(table_rows: ResultSet[Any], min_ignorable_cols: int = 1) dict[str, str | list[str]]

Get all the data from the provided rows.

Parameters:
  • table_rows – Table rows ResultSet from bs4

  • min_ignorable_cols – If a row has multiple columns, rows whose column count is below this number are ignored. Defaults to 1.

Returns:

structured dict of all the extracted data

mex.extractors.confluence_vvt.parse_html.get_verantwortlichen(field_name: str, all_rows_data: dict[str, str | list[str]]) tuple[list[str], list[str]]

Get Verantwortliche (responsible persons) from the extracted data of all rows.

Parameters:
  • field_name – Name of the field in all_rows_data that is to be extracted

  • all_rows_data – All extracted rows data

Returns:

tuple of names and OEs of the Verantwortliche(r)

mex.extractors.confluence_vvt.parse_html.parse_data_html_page(html: str) tuple[str | list[str] | None, list[str], list[str], list[str], list[str], list[str], list[str], list[str] | Any] | None

Parse required data from html string.

Parameters:

html – Raw html in string format

Returns:

abstract, verantwortliche_studienleiterin, OE names and interne_vorgangsnummer

mex.extractors.confluence_vvt.settings module

class mex.extractors.confluence_vvt.settings.ConfluenceVvtSettings(*, url: str = 'https://confluence.vvt', username: SecretStr = SecretStr('**********'), password: SecretStr = SecretStr('**********'), overview_page_id: str = '123456', template_v1_mapping_path: AssetsPath = AssetsPath('mappings/__final__/confluence-vvt_template_v1'), skip_pages: list[str] = ['123456'])

Bases: BaseModel

Confluence-vvt settings submodule definition for the Confluence-vvt extractor.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'overview_page_id': FieldInfo(annotation=str, required=False, default='123456', description='Confluence id of the overview page.'), 'password': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='Confluence-vvt password'), 'skip_pages': FieldInfo(annotation=list[str], required=False, default=['123456'], description='List of Confluence-vvt page ids that must be skipped for incomplete or broken data, otherwise it will break the extractor.'), 'template_v1_mapping_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("mappings/__final__/confluence-vvt_template_v1"), description='Path to the directory with the confluence-vvt mapping files containing the default values, absolute path or relative to `assets_dir`.'), 'url': FieldInfo(annotation=str, required=False, default='https://confluence.vvt', description='URL of Confluence-vvt.'), 'username': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='Confluence-vvt user name')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

overview_page_id: str
password: SecretStr
skip_pages: list[str]
template_v1_mapping_path: AssetsPath
url: str
username: SecretStr

mex.extractors.confluence_vvt.transform module

mex.extractors.confluence_vvt.transform.transform_confluence_vvt_activities_to_extracted_activities(pages: Iterable[ConfluenceVvtPage], extracted_primary_source: ExtractedPrimarySource, confluence_vvt_activity_mapping: Any, merged_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_merged_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) list[ExtractedActivity]

Transform Confluence-vvt pages to extracted activities.

Parameters:
  • pages – All Confluence-vvt pages

  • extracted_primary_source – Extracted primary source for Confluence-vvt

  • confluence_vvt_activity_mapping – activity mapping for confluence-vvt

  • merged_ids_by_query_string – Mapping from author query to merged IDs

  • unit_merged_ids_by_synonym – Map from unit acronyms and labels to their merged ID

Returns:

List of ExtractedActivity

mex.extractors.confluence_vvt.transform.transform_confluence_vvt_page_to_extracted_activity(page: ConfluenceVvtPage, extracted_primary_source: ExtractedPrimarySource, confluence_vvt_activity_mapping: Any, merged_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_merged_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) ExtractedActivity | None

Transform Confluence-vvt page to extracted activity.

Parameters:
  • page – Confluence-vvt page

  • extracted_primary_source – Extracted primary source for Confluence-vvt

  • confluence_vvt_activity_mapping – activity mapping for confluence-vvt

  • merged_ids_by_query_string – Mapping from author query to merged IDs

  • unit_merged_ids_by_synonym – Map from unit acronyms and labels to their merged ID

Returns:

ExtractedActivity or None
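
The two signatures above suggest that the list-level function applies the page-level function to every page and leaves out pages for which it returns None. A minimal sketch of that pattern, under stated assumptions: Page and Activity are simplified stand-ins for ConfluenceVvtPage and ExtractedActivity, and the skip_ids rule is a hypothetical reason for returning None.

```python
from __future__ import annotations

from collections.abc import Iterable
from dataclasses import dataclass


@dataclass
class Page:
    """Simplified stand-in for ConfluenceVvtPage."""

    id: int
    title: str


@dataclass
class Activity:
    """Simplified stand-in for ExtractedActivity."""

    identifier_in_primary_source: str
    title: str


def transform_page(page: Page, skip_ids: set[int]) -> Activity | None:
    """Transform one page, or return None if it cannot be transformed."""
    if page.id in skip_ids:  # hypothetical rule for illustration only
        return None
    return Activity(identifier_in_primary_source=str(page.id), title=page.title)


def transform_pages(pages: Iterable[Page], skip_ids: set[int]) -> list[Activity]:
    """Transform all pages and drop those that yield None."""
    activities = []
    for page in pages:
        activity = transform_page(page, skip_ids)
        if activity is not None:
            activities.append(activity)
    return activities


# Made-up pages; page 2 is skipped.
pages = [Page(1, "Study A"), Page(2, "Broken page"), Page(3, "Study C")]
activities = transform_pages(pages, skip_ids={2})
```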

Module contents