mex.extractors.confluence_vvt package¶
Submodules¶
mex.extractors.confluence_vvt.connector module¶
- class mex.extractors.confluence_vvt.connector.ConfluenceVvtConnector¶
Bases: HTTPConnector
Connector class to create a session for all requests to confluence-vvt.
- _set_authentication() None¶
Authenticate to the host.
- _set_url() None¶
Set url of the host.
- get_page_by_id(page_id: str) ConfluenceVvtPage | None¶
Get confluence page data by its id.
- Parameters:
page_id – confluence page id
- Returns:
ConfluenceVvtPage or None
mex.extractors.confluence_vvt.extract module¶
- mex.extractors.confluence_vvt.extract.extract_confluence_vvt_authors(authors: list[str]) list[LDAPPersonWithQuery]¶
Extract LDAP persons with their query string for confluence-vvt authors.
- Parameters:
authors – list of authors
- Returns:
List of LDAP persons with query
- mex.extractors.confluence_vvt.extract.fetch_all_vvt_pages_ids() Generator[str, None, None]¶
Fetch all the ids for data pages.
- Settings:
confluence_vvt.url: Confluence-vvt base url
confluence_vvt.overview_page_id: page id of the overview page
- Raises:
MExError – When the pagination limit is exceeded
- Returns:
Generator for page IDs
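The pagination behaviour described above can be illustrated with a minimal sketch. It is not the extractor's implementation: `fetch_batch` is a hypothetical callable standing in for the Confluence request, `RuntimeError` is used in place of `MExError`, and the batch size and limit are assumed values.

```python
from collections.abc import Callable, Generator


def iter_page_ids(
    fetch_batch: Callable[[int, int], list[str]],  # hypothetical stand-in for the HTTP call
    batch_size: int = 2,  # assumed page size
    max_batches: int = 10,  # assumed pagination limit
) -> Generator[str, None, None]:
    """Yield page ids batch by batch until an empty batch is returned."""
    for batch_number in range(max_batches):
        batch = fetch_batch(batch_number * batch_size, batch_size)
        if not batch:
            return  # an empty batch means there are no more results
        yield from batch
    # The loop ended without ever seeing an empty batch: give up loudly.
    raise RuntimeError("pagination limit exceeded")


ALL_IDS = ["101", "102", "103"]


def fake_fetch(start: int, limit: int) -> list[str]:
    """In-memory replacement for the overview page request."""
    return ALL_IDS[start : start + limit]


page_ids = list(iter_page_ids(fake_fetch))
```

Because the function is a generator, the error is only raised once the consumer has iterated past the last permitted batch.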
- mex.extractors.confluence_vvt.extract.get_all_persons_from_all_pages(pages: list[ConfluenceVvtPage], confluence_vvt_activity_mapping: ActivityMapping) list[str]¶
Get a list of all persons from all confluence pages.
- Parameters:
pages – all confluence-vvt pages
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of all persons on all confluence pages
- mex.extractors.confluence_vvt.extract.get_all_units_from_all_pages(pages: list[ConfluenceVvtPage], confluence_vvt_activity_mapping: ActivityMapping) list[str]¶
Get a list of all units from all confluence pages.
- Parameters:
pages – all confluence-vvt pages
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of all units on all confluence pages
- mex.extractors.confluence_vvt.extract.get_contact_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: ActivityMapping) list[str]¶
Get contact from confluence page.
- Parameters:
page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of contacts
- mex.extractors.confluence_vvt.extract.get_involved_persons_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: ActivityMapping) list[str]¶
Get involved persons from confluence page.
- Parameters:
page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of involved persons
- mex.extractors.confluence_vvt.extract.get_involved_units_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: ActivityMapping) list[str]¶
Get involved units from confluence page.
- Parameters:
page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of involved units
- mex.extractors.confluence_vvt.extract.get_page_data_by_id(page_ids: Iterable[str]) Generator[ConfluenceVvtPage, None, None]¶
Get confluence page data for each of the given page ids.
- Parameters:
page_ids – iterable of confluence page ids
- Returns:
Generator of ConfluenceVvtPage
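A minimal sketch of how such a generator might combine with a per-id lookup that can return `None` (as `get_page_by_id` does). The `Page` class and the `get_page` callable are hypothetical stand-ins, and skipping missing pages is an assumption rather than documented behaviour.

```python
from collections.abc import Callable, Generator, Iterable
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:  # hypothetical, simplified stand-in for ConfluenceVvtPage
    id: int
    title: str


def iter_pages(
    page_ids: Iterable[str],
    get_page: Callable[[str], Optional[Page]],  # stand-in for the connector lookup
) -> Generator[Page, None, None]:
    """Yield one page per id, leaving out ids for which no page is returned."""
    for page_id in page_ids:
        page = get_page(page_id)
        if page is not None:  # assumption: missing pages are skipped
            yield page


# A dict lookup plays the role of the connector in this sketch.
STORE = {"1": Page(id=1, title="Study A"), "3": Page(id=3, title="Study C")}
pages = list(iter_pages(["1", "2", "3"], STORE.get))
```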
- mex.extractors.confluence_vvt.extract.get_responsible_unit_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: ActivityMapping) list[str]¶
Get responsible unit from confluence page.
- Parameters:
page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of responsible units
mex.extractors.confluence_vvt.main module¶
mex.extractors.confluence_vvt.models module¶
- class mex.extractors.confluence_vvt.models.ConfluenceVvtCell¶
Bases: BaseModel
Base class for cells in a confluence table.
- abstractmethod get_texts() list[str]¶
Returns all texts in this cell.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_by_alias': True, 'validate_by_name': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- search(pattern: str) list[str]¶
Returns found strings with matching pattern.
- class mex.extractors.confluence_vvt.models.ConfluenceVvtHeading(*, text: str | None)¶
Bases: ConfluenceVvtCell
Model class for confluence heading.
- get_texts() list[str]¶
Returns all text in this heading.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_by_alias': True, 'validate_by_name': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- text: str | None¶
- class mex.extractors.confluence_vvt.models.ConfluenceVvtPage(*, id: int, title: str, tables: list[ConfluenceVvtTable])¶
Bases: BaseRawData
Model class for confluence page.
- get_end_year() TemporalEntity | None¶
Return end year from extractor.
- get_identifier_in_primary_source() str | None¶
Return identifier in primary source from extractor.
- get_partners() Sequence[str | None]¶
Return partners from extractor.
- get_start_year() TemporalEntity | None¶
Return start year from extractor.
- get_units() Sequence[str | None]¶
Return units from extractor.
- id: int¶
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_by_alias': True, 'validate_by_name': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- tables: list[ConfluenceVvtTable]¶
- title: str¶
- class mex.extractors.confluence_vvt.models.ConfluenceVvtRow(*, cells: list[ConfluenceVvtHeading | ConfluenceVvtValue])¶
Bases: BaseModel
Model class for confluence row.
- cells: list[ConfluenceVvtHeading | ConfluenceVvtValue]¶
- get_texts() list[str]¶
Returns all text in this row.
- is_heading() bool¶
Returns whether all cells in a row are headings.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_by_alias': True, 'validate_by_name': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mex.extractors.confluence_vvt.models.ConfluenceVvtTable(*, rows: list[ConfluenceVvtRow])¶
Bases: BaseModel
Model class for confluence table.
- get_value_row_by_heading(heading: str) ConfluenceVvtRow¶
If heading is found in a row, return the next row.
- Parameters:
heading – Heading string to search for.
- Returns:
ConfluenceVvt row instance.
- Raises:
ValueError – If no row was found matching the given heading.
TypeError – If next row is not ConfluenceVvt value row.
IndexError – If there is no row after the heading that was found.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_by_alias': True, 'validate_by_name': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- rows: list[ConfluenceVvtRow]¶
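The lookup performed by `get_value_row_by_heading` and its three error cases can be sketched with plain dataclasses. The `Cell` and `Row` classes below are simplified, hypothetical stand-ins for the pydantic models, and matching the heading by exact text equality is an assumption; the real method may match differently.

```python
from dataclasses import dataclass


@dataclass
class Cell:  # hypothetical stand-in for heading and value cells
    text: str
    is_heading: bool = False


@dataclass
class Row:  # hypothetical stand-in for ConfluenceVvtRow
    cells: list[Cell]

    def is_heading(self) -> bool:
        """Return whether all cells in this row are headings."""
        return all(cell.is_heading for cell in self.cells)

    def get_texts(self) -> list[str]:
        return [cell.text for cell in self.cells]


def get_value_row_by_heading(rows: list[Row], heading: str) -> Row:
    """Return the row that follows the first heading row containing `heading`."""
    for index, row in enumerate(rows):
        if row.is_heading() and heading in row.get_texts():
            value_row = rows[index + 1]  # IndexError if the heading row is last
            if value_row.is_heading():
                raise TypeError("row after heading is not a value row")
            return value_row
    raise ValueError(f"no row matches heading {heading!r}")


rows = [
    Row([Cell("Title", True), Cell("Contact", True)]),
    Row([Cell("Study A"), Cell("Jane Doe")]),
]
value_row = get_value_row_by_heading(rows, "Contact")
```

Callers that want to tolerate incomplete tables would need to catch all three exception types, since each signals a different kind of malformed input.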
- class mex.extractors.confluence_vvt.models.ConfluenceVvtValue(*, texts: list[str] | None)¶
Bases: ConfluenceVvtCell
Model class for confluence value cell.
- get_texts() list[str]¶
Returns all text in this value cell.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_by_alias': True, 'validate_by_name': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- texts: list[str] | None¶
mex.extractors.confluence_vvt.parse_html module¶
- mex.extractors.confluence_vvt.parse_html.get_clean_current_row_all_cols_data(current_row_all_cols_data: list[str]) list[str]¶
Get clean data for all cols in current row, removing all unwanted characters.
- Parameters:
current_row_all_cols_data – List of all columns of current row
- Returns:
list of cleaned strings for all columns of current row
- mex.extractors.confluence_vvt.parse_html.get_interne_vorgangsnummer_from_all_rows_data(intnmr_dict: Any | None | list[str]) list[str] | Any¶
Get Interne Vorgangsnummer from the extracted table.
- Parameters:
intnmr_dict – Extracted dict or list of Interne Vorgangsnummer
- Returns:
list of extracted Interne Vorgangsnummer
- mex.extractors.confluence_vvt.parse_html.get_interne_vorgangsnummer_from_title(interne_vorgangsnummer_title: str) list[str]¶
Extract Interne Vorgangsnummer from the title row.
- Parameters:
interne_vorgangsnummer_title – Interne Vorgangsnummer title
- Returns:
list of extracted Interne Vorgangsnummer from the title
- mex.extractors.confluence_vvt.parse_html.get_row_data_for_all_rows(table_rows: ResultSet[Any], min_ignorable_cols: int = 1) dict[str, str | list[str]]¶
Get all the data from the provided rows.
- Parameters:
table_rows – Table rows ResultSet from bs4
min_ignorable_cols – For rows with multiple columns, rows with fewer columns than this number are ignored. Defaults to 1.
- Returns:
structured dict of all the extracted data
- mex.extractors.confluence_vvt.parse_html.get_verantwortlichen(field_name: str, all_rows_data: dict[str, str | list[str]]) tuple[list[str], list[str]]¶
Get verantwortlichen from the extracted data of all rows.
- Parameters:
field_name – Name of the field in all_rows_data that is to be extracted
all_rows_data – All extracted rows data
- Returns:
tuple of names and OEs of verantwortlicher(in)
- mex.extractors.confluence_vvt.parse_html.parse_data_html_page(html: str) tuple[str | list[str] | None, list[str], list[str], list[str], list[str], list[str], list[str], list[str] | Any] | None¶
Parse required data from html string.
- Parameters:
html – Raw html in string format
- Returns:
abstract, verantwortliche_studienleiterin, OE names and interne_vorgangsnummer
mex.extractors.confluence_vvt.settings module¶
- class mex.extractors.confluence_vvt.settings.ConfluenceVvtSettings(*, url: str = 'https://confluence.vvt', username: SecretStr = SecretStr('**********'), password: SecretStr = SecretStr('**********'), overview_page_id: str = '123456', template_v1_mapping_path: AssetsPath = AssetsPath('mappings/confluence-vvt_template_v1'), skip_pages: list[str] = ['123456'])¶
Bases: BaseModel
Confluence-vvt settings submodule definition for the Confluence-vvt extractor.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_by_alias': True, 'validate_by_name': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- overview_page_id: str¶
- password: SecretStr¶
- skip_pages: list[str]¶
- template_v1_mapping_path: AssetsPath¶
- url: str¶
- username: SecretStr¶
mex.extractors.confluence_vvt.transform module¶
- mex.extractors.confluence_vvt.transform.transform_confluence_vvt_activities_to_extracted_activities(pages: Iterable[ConfluenceVvtPage], confluence_vvt_activity_mapping: ActivityMapping, merged_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_merged_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) list[ExtractedActivity]¶
Transform Confluence-vvt pages to extracted activities.
- Parameters:
pages – All Confluence-vvt pages
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
merged_ids_by_query_string – Mapping from author query to merged IDs
unit_merged_ids_by_synonym – Map from unit acronyms and labels to their merged ID
- Returns:
List of ExtractedActivity
- mex.extractors.confluence_vvt.transform.transform_confluence_vvt_page_to_extracted_activity(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: ActivityMapping, merged_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_merged_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) ExtractedActivity | None¶
Transform Confluence-vvt page to extracted activity.
- Parameters:
page – Confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
merged_ids_by_query_string – Mapping from author query to merged IDs
unit_merged_ids_by_synonym – Map from unit acronyms and labels to their merged ID
- Returns:
ExtractedActivity or None
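The relationship between the two transform functions can be sketched as follows: the plural function applies the per-page transform and keeps only the results that are not `None`. Everything here is illustrative. `Page` and `Activity` are hypothetical, simplified stand-ins for `ConfluenceVvtPage` and `ExtractedActivity`, merged identifiers are plain strings, and the rule that a page without a resolvable contact yields `None` is an assumption.

```python
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:  # hypothetical stand-in for ConfluenceVvtPage
    id: int
    title: str
    contacts: list[str]


@dataclass
class Activity:  # hypothetical stand-in for ExtractedActivity
    identifier_in_primary_source: str
    title: str
    contact: list[str]


def transform_page(
    page: Page, merged_ids_by_query_string: dict[str, list[str]]
) -> Optional[Activity]:
    """Resolve contact names to merged ids and build one activity, or None."""
    contact_ids = [
        merged_id
        for name in page.contacts
        for merged_id in merged_ids_by_query_string.get(name, [])
    ]
    if not contact_ids:
        return None  # assumption: pages without a resolvable contact are dropped
    return Activity(str(page.id), page.title, contact_ids)


def transform_pages(
    pages: Iterable[Page], merged_ids_by_query_string: dict[str, list[str]]
) -> list[Activity]:
    """Transform every page and keep only the successful results."""
    activities = []
    for page in pages:
        activity = transform_page(page, merged_ids_by_query_string)
        if activity is not None:
            activities.append(activity)
    return activities


merged = {"Jane Doe": ["id-1"]}
example_pages = [
    Page(1, "Study A", ["Jane Doe"]),
    Page(2, "Study B", ["Unknown"]),
]
activities = transform_pages(example_pages, merged)
```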