mex.extractors.confluence_vvt package¶

Submodules¶

mex.extractors.confluence_vvt.connector module¶

class mex.extractors.confluence_vvt.connector.ConfluenceVvtConnector¶

Bases: HTTPConnector

Connector class to create a session for all requests to confluence-vvt.

_set_authentication() → None¶: Authenticate to the host.

_set_url() → None¶: Set url of the host.

get_page_by_id(page_id: str) → ConfluenceVvtPage | None¶

Get confluence page data by its id.

Parameters: page_id – confluence page id
Returns: ConfluenceVvtPage or None

mex.extractors.confluence_vvt.extract module¶

mex.extractors.confluence_vvt.extract.extract_confluence_vvt_authors(authors: list[str]) → list[LDAPPersonWithQuery]¶

Extract LDAP persons with their query string for confluence-vvt authors.

Parameters: authors – list of authors
Returns: List of LDAP persons with their query strings

mex.extractors.confluence_vvt.extract.fetch_all_vvt_pages_ids() → Generator[str, None, None]¶

Fetch all the ids for data pages.

Settings:

confluence_vvt.url – Confluence-vvt base URL
confluence_vvt.overview_page_id – page id of the overview page

Raises: MExError – When the pagination limit is exceeded
Returns: Generator for page IDs
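
The pagination guard described above could look roughly like the following sketch. It is not the extractor's implementation: `iter_page_ids`, `fetch_batch`, `batch_size`, `max_batches` and `PaginationLimitError` (standing in for `MExError`) are hypothetical names introduced for illustration, and the in-memory `fake_fetch` replaces the real Confluence request.

```python
from collections.abc import Callable, Generator


class PaginationLimitError(Exception):
    """Hypothetical stand-in for MExError."""


def iter_page_ids(
    fetch_batch: Callable[[int, int], list[str]],
    batch_size: int = 2,
    max_batches: int = 3,
) -> Generator[str, None, None]:
    """Yield page ids batch by batch; fail if max_batches is not enough."""
    for batch_number in range(max_batches):
        batch = fetch_batch(batch_number * batch_size, batch_size)
        yield from batch
        if len(batch) < batch_size:
            return  # a short batch means this was the last one
    raise PaginationLimitError("pagination limit exceeded")


# In-memory stand-in for the paginated overview endpoint (assumption).
ALL_IDS = ["11", "12", "13", "14", "15"]


def fake_fetch(start: int, limit: int) -> list[str]:
    return ALL_IDS[start : start + limit]


page_ids = list(iter_page_ids(fake_fetch))
```

Because the function is a generator, the error only surfaces once the consumer has iterated past the last permitted batch.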

mex.extractors.confluence_vvt.extract.get_all_persons_from_all_pages(pages: list[ConfluenceVvtPage], confluence_vvt_activity_mapping: ActivityMapping) → list[str]¶

Get a list of all persons from all confluence pages.

Parameters:

pages – all confluence-vvt pages
confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of all persons on all confluence pages

mex.extractors.confluence_vvt.extract.get_all_units_from_all_pages(pages: list[ConfluenceVvtPage], confluence_vvt_activity_mapping: ActivityMapping) → list[str]¶

Get a list of all units from all confluence pages.

Parameters:

pages – all confluence-vvt pages
confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of all units on all confluence pages

mex.extractors.confluence_vvt.extract.get_contact_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: ActivityMapping) → list[str]¶

Get contact from confluence page.

Parameters:

page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of contacts

mex.extractors.confluence_vvt.extract.get_involved_persons_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: ActivityMapping) → list[str]¶

Get involved persons from confluence page.

Parameters:

page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of involved persons

mex.extractors.confluence_vvt.extract.get_involved_units_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: ActivityMapping) → list[str]¶

Get involved units from confluence page.

Parameters:

page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of involved units

mex.extractors.confluence_vvt.extract.get_page_data_by_id(page_ids: Iterable[str]) → Generator[ConfluenceVvtPage, None, None]¶

Get confluence page data for each of the given ids.

Parameters: page_ids – list of confluence page ids
Returns: Generator of ConfluenceVvtPage
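
Since the connector's `get_page_by_id` may return `None` while this generator yields only pages, one plausible reading is that missing pages are skipped. The sketch below illustrates that reading only; `Page`, `FAKE_PAGES` and `iter_pages` are hypothetical stand-ins, and the real function talks to Confluence rather than to a dictionary.

```python
from collections.abc import Generator, Iterable
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical, simplified stand-in for ConfluenceVvtPage."""

    id: int
    title: str


# Hypothetical in-memory store replacing the real Confluence connector.
FAKE_PAGES = {"1": Page(1, "First"), "3": Page(3, "Third")}


def get_page_by_id(page_id: str) -> Optional[Page]:
    """Return the page for an id, or None if it does not exist."""
    return FAKE_PAGES.get(page_id)


def iter_pages(page_ids: Iterable[str]) -> Generator[Page, None, None]:
    """Yield one page per id, skipping ids that resolve to None (assumption)."""
    for page_id in page_ids:
        page = get_page_by_id(page_id)
        if page is not None:
            yield page


titles = [page.title for page in iter_pages(["1", "2", "3"])]
```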

mex.extractors.confluence_vvt.extract.get_responsible_unit_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: ActivityMapping) → list[str]¶

Get responsible unit from confluence page.

Parameters:

page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt

Returns:

list of responsible units

mex.extractors.confluence_vvt.main module¶

mex.extractors.confluence_vvt.models module¶

class mex.extractors.confluence_vvt.models.ConfluenceVvtCell¶

Bases: BaseModel

Base class for cells in a confluence table.

abstractmethod get_texts() → list[str]¶: Returns all texts in this cell.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

search(pattern: str) → list[str]¶: Returns found strings with matching pattern.
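
The relationship between the abstract `get_texts()` and the shared `search()` can be sketched with the standard library alone. This is an illustration, not the package's code: the classes below are plain Python stand-ins for the pydantic models, and it is an assumption that `search` returns every whole text that matches the pattern (it might instead return only the matched substrings).

```python
import re
from abc import ABC, abstractmethod
from typing import Optional


class Cell(ABC):
    """Hypothetical stand-in for ConfluenceVvtCell."""

    @abstractmethod
    def get_texts(self) -> list[str]:
        """Return all texts in this cell."""

    def search(self, pattern: str) -> list[str]:
        # Assumption: return the whole text of every entry that matches.
        return [text for text in self.get_texts() if re.search(pattern, text)]


class HeadingCell(Cell):
    """Stand-in for ConfluenceVvtHeading: at most one text."""

    def __init__(self, text: Optional[str]) -> None:
        self.text = text

    def get_texts(self) -> list[str]:
        return [self.text] if self.text else []


class ValueCell(Cell):
    """Stand-in for ConfluenceVvtValue: any number of texts."""

    def __init__(self, texts: Optional[list[str]]) -> None:
        self.texts = texts

    def get_texts(self) -> list[str]:
        return list(self.texts or [])


units = ValueCell(["Unit A", "Person B", "Unit C"]).search(r"^Unit")
```

Keeping `search` on the base class means each subclass only has to say where its texts live.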

class mex.extractors.confluence_vvt.models.ConfluenceVvtHeading(*, text: str | None)¶

Bases: ConfluenceVvtCell

Model class for confluence heading.

get_texts() → list[str]¶: Returns all text in this heading.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'text': FieldInfo(annotation=Union[str, NoneType], required=True)}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

text: str | None¶

class mex.extractors.confluence_vvt.models.ConfluenceVvtPage(*, id: int, title: str, tables: list[ConfluenceVvtTable])¶

Bases: BaseRawData

Model class for confluence page.

get_end_year() → TemporalEntity | None¶: Return end year from extractor.

get_identifier_in_primary_source() → str | None¶: Return identifier in primary source from extractor.

get_partners() → Sequence[str | None]¶: Return partners from extractor.

get_start_year() → TemporalEntity | None¶: Return start year from extractor.

get_units() → Sequence[str | None]¶: Return units from extractor.

id: int¶

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'id': FieldInfo(annotation=int, required=True), 'tables': FieldInfo(annotation=list[ConfluenceVvtTable], required=True), 'title': FieldInfo(annotation=str, required=True)}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

tables: list[ConfluenceVvtTable]¶

title: str¶

class mex.extractors.confluence_vvt.models.ConfluenceVvtRow(*, cells: list[ConfluenceVvtHeading | ConfluenceVvtValue])¶

Bases: BaseModel

Model class for confluence row.

cells: list[ConfluenceVvtHeading | ConfluenceVvtValue]¶

get_texts() → list[str]¶: Returns all text in this row.

is_heading() → bool¶: Returns whether all cells in a row are headings.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'cells': FieldInfo(annotation=list[Union[ConfluenceVvtHeading, ConfluenceVvtValue]], required=True)}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

class mex.extractors.confluence_vvt.models.ConfluenceVvtTable(*, rows: list[ConfluenceVvtRow])¶

Bases: BaseModel

Model class for confluence table.

get_value_row_by_heading(heading: str) → ConfluenceVvtRow¶

If heading is found in a row, return the next row.

Parameters:

heading – Heading string to search for.

Returns:

ConfluenceVvt row instance.

Raises:

ValueError – If no row was found matching the given heading.
TypeError – If next row is not ConfluenceVvt value row.
IndexError – If there is no row after the heading that was found.
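
The three error cases listed above can be made concrete with a small sketch. It is a simplified illustration under stated assumptions: `Row` is a hypothetical dataclass standing in for `ConfluenceVvtRow`, the function takes the rows as an argument instead of being a method on the table, and a heading matches when it equals one of the row's texts exactly (the real matching rule is not documented here).

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical stand-in for ConfluenceVvtRow."""

    texts: list[str]
    heading: bool = False


def get_value_row_by_heading(rows: list[Row], heading: str) -> Row:
    """Return the row following the first heading row that contains heading."""
    for index, row in enumerate(rows):
        if row.heading and heading in row.texts:
            if index + 1 >= len(rows):
                raise IndexError(f"no row after heading {heading!r}")
            next_row = rows[index + 1]
            if next_row.heading:
                raise TypeError(f"row after heading {heading!r} is not a value row")
            return next_row
    raise ValueError(f"no row matches heading {heading!r}")


rows = [
    Row(["Title", "Owner"], heading=True),
    Row(["Study X", "Unit 1"]),
    Row(["Partners"], heading=True),
]
value_row = get_value_row_by_heading(rows, "Owner")
```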

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'rows': FieldInfo(annotation=list[ConfluenceVvtRow], required=True)}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

rows: list[ConfluenceVvtRow]¶

class mex.extractors.confluence_vvt.models.ConfluenceVvtValue(*, texts: list[str] | None)¶

Bases: ConfluenceVvtCell

Model class for confluence value cell.

get_texts() → list[str]¶: Returns all text in this value cell.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'texts': FieldInfo(annotation=Union[list[str], NoneType], required=True)}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

texts: list[str] | None¶

mex.extractors.confluence_vvt.parse_html module¶

mex.extractors.confluence_vvt.parse_html.get_clean_current_row_all_cols_data(current_row_all_cols_data: list[str]) → list[str]¶

Get clean data for all cols in current row, removing all unwanted characters.

Parameters: current_row_all_cols_data – List of all columns of current row
Returns: list of cleaned strings for all columns of current row
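
The documentation does not say which characters count as unwanted. As a rough sketch only, the example below assumes they are non-breaking spaces, zero-width spaces and redundant whitespace; `clean_columns` is a hypothetical name, and the real function may clean different characters.

```python
def clean_columns(columns: list[str]) -> list[str]:
    """Normalise whitespace in every column of a row (assumed cleaning rules)."""
    cleaned = []
    for column in columns:
        # Assumption: non-breaking and zero-width spaces are unwanted.
        text = column.replace("\xa0", " ").replace("\u200b", "")
        # Collapse runs of whitespace and strip both ends.
        cleaned.append(" ".join(text.split()))
    return cleaned


cleaned_row = clean_columns(["  Abteilung\xa01 \n", "FG\u200b 99"])
```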

mex.extractors.confluence_vvt.parse_html.get_interne_vorgangsnummer_from_all_rows_data(intnmr_dict: Any | None | list[str]) → list[str] | Any¶

Get Interne Vorgangsnummer from the extracted table.

Parameters: intnmr_dict – Extracted dict or list of Interne Vorgangsnummer
Returns: list of extracted Interne Vorgangsnummer

mex.extractors.confluence_vvt.parse_html.get_interne_vorgangsnummer_from_title(interne_vorgangsnummer_title: str) → list[str]¶

Extract Interne Vorgangsnummer from the title row.

Parameters: interne_vorgangsnummer_title – Interne Vorgangsnummer title
Returns: list of extracted Interne Vorgangsnummer from the title

mex.extractors.confluence_vvt.parse_html.get_row_data_for_all_rows(table_rows: ResultSet[Any], min_ignorable_cols: int = 1) → dict[str, str | list[str]]¶

Get all the data from the provided rows.

Parameters:

table_rows – Table rows ResultSet from bs4
min_ignorable_cols – Threshold for rows with multiple columns: columns below this number are ignored. Defaults to 1.

Returns:

structured dict of all the extracted data

mex.extractors.confluence_vvt.parse_html.get_verantwortlichen(field_name: str, all_rows_data: dict[str, str | list[str]]) → tuple[list[str], list[str]]¶

Get verantwortlichen from the extracted all rows data.

Parameters:

field_name – Name of the field in all_rows_data that is to be extracted
all_rows_data – All extracted rows data

Returns:

tuple of names and OEs of verantwortlicher(in)

mex.extractors.confluence_vvt.parse_html.parse_data_html_page(html: str) → tuple[str | list[str] | None, list[str], list[str], list[str], list[str], list[str], list[str], list[str] | Any] | None¶

Parse required data from html string.

Parameters: html – Raw html in string format
Returns: abstract, verantwortliche_studienleiterin, OE names and interne_vorgangsnummer, or None

mex.extractors.confluence_vvt.settings module¶

class mex.extractors.confluence_vvt.settings.ConfluenceVvtSettings(*, url: str = 'https://confluence.vvt', username: SecretStr = SecretStr('**********'), password: SecretStr = SecretStr('**********'), overview_page_id: str = '123456', template_v1_mapping_path: AssetsPath = AssetsPath('mappings/confluence-vvt_template_v1'), skip_pages: list[str] = ['123456'])¶

Bases: BaseModel

Confluence-vvt settings submodule definition for the Confluence-vvt extractor.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'overview_page_id': FieldInfo(annotation=str, required=False, default='123456', description='Confluence id of the overview page.'), 'password': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='Confluence-vvt password'), 'skip_pages': FieldInfo(annotation=list[str], required=False, default=['123456'], description='List of Confluence-vvt page ids that must be skipped for incomplete or broken data, otherwise it will break the extractor.'), 'template_v1_mapping_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("mappings/confluence-vvt_template_v1"), description='Path to the directory with the confluence-vvt mapping files containing the default values, absolute path or relative to `assets_dir`.'), 'url': FieldInfo(annotation=str, required=False, default='https://confluence.vvt', description='URL of Confluence-vvt.'), 'username': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='Confluence-vvt user name')}¶

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

overview_page_id: str¶

password: SecretStr¶

skip_pages: list[str]¶

template_v1_mapping_path: AssetsPath¶

url: str¶

username: SecretStr¶
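
How `skip_pages` might be applied is sketched below, assuming it is used to filter page ids before they are fetched. `VvtSettings` is a hypothetical dataclass that copies only three of the documented fields and their placeholder defaults; the real class is a pydantic model, and `filter_page_ids` is not part of the package.

```python
from dataclasses import dataclass, field


@dataclass
class VvtSettings:
    """Hypothetical, simplified stand-in for ConfluenceVvtSettings."""

    url: str = "https://confluence.vvt"
    overview_page_id: str = "123456"
    skip_pages: list[str] = field(default_factory=lambda: ["123456"])


def filter_page_ids(page_ids: list[str], settings: VvtSettings) -> list[str]:
    """Drop every page id listed in settings.skip_pages (assumed usage)."""
    return [page_id for page_id in page_ids if page_id not in settings.skip_pages]


settings = VvtSettings(skip_pages=["42"])
kept = filter_page_ids(["41", "42", "43"], settings)
```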

mex.extractors.confluence_vvt.transform module¶

mex.extractors.confluence_vvt.transform.transform_confluence_vvt_activities_to_extracted_activities(pages: Iterable[ConfluenceVvtPage], extracted_primary_source: ExtractedPrimarySource, confluence_vvt_activity_mapping: ActivityMapping, merged_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_merged_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) → list[ExtractedActivity]¶

Transform Confluence-vvt pages to extracted activities.

Parameters:

pages – All Confluence-vvt pages
extracted_primary_source – Extracted primary source for Confluence-vvt
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
merged_ids_by_query_string – Mapping from author query to merged IDs
unit_merged_ids_by_synonym – Map from unit acronyms and labels to their merged ID

Returns:

List of ExtractedActivity

mex.extractors.confluence_vvt.transform.transform_confluence_vvt_page_to_extracted_activity(page: ConfluenceVvtPage, extracted_primary_source: ExtractedPrimarySource, confluence_vvt_activity_mapping: ActivityMapping, merged_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_merged_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) → ExtractedActivity | None¶

Transform Confluence-vvt page to extracted activity.

Parameters:

page – Confluence-vvt page
extracted_primary_source – Extracted primary source for Confluence-vvt
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
merged_ids_by_query_string – Mapping from author query to merged IDs
unit_merged_ids_by_synonym – Map from unit acronyms and labels to their merged ID

Returns:

ExtractedActivity or None
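
The two transform functions relate as "one page" to "many pages": the per-page function may return `None`, and the list function returns only activities. The sketch below shows that pattern with heavily simplified, hypothetical types (`Page`, `Activity`). The rule for when a page yields `None` is an assumption made for illustration (here: no unit could be resolved), and the real functions take more arguments than shown.

```python
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical stand-in for ConfluenceVvtPage."""

    id: int
    title: str
    units: list[str]


@dataclass
class Activity:
    """Hypothetical stand-in for ExtractedActivity."""

    identifier_in_primary_source: str
    title: str
    responsible_unit: list[str]


def transform_page(
    page: Page, unit_merged_ids_by_synonym: dict[str, str]
) -> Optional[Activity]:
    """Resolve unit names to merged ids and build one activity."""
    unit_ids = [
        unit_merged_ids_by_synonym[unit]
        for unit in page.units
        if unit in unit_merged_ids_by_synonym
    ]
    if not unit_ids:
        return None  # assumption: pages without a resolvable unit are dropped
    return Activity(str(page.id), page.title, unit_ids)


def transform_pages(
    pages: Iterable[Page], unit_merged_ids_by_synonym: dict[str, str]
) -> list[Activity]:
    """Transform all pages, keeping only those that produced an activity."""
    activities = []
    for page in pages:
        activity = transform_page(page, unit_merged_ids_by_synonym)
        if activity is not None:
            activities.append(activity)
    return activities


pages = [Page(1, "Study A", ["FG1"]), Page(2, "Study B", ["unknown"])]
activities = transform_pages(pages, {"FG1": "unit-id-1"})
```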

Module contents¶
