mex.extractors.confluence_vvt package¶
Submodules¶
mex.extractors.confluence_vvt.connector module¶
- class mex.extractors.confluence_vvt.connector.ConfluenceVvtConnector¶
Bases:
HTTPConnector
Connector class to create a session for all requests to confluence-vvt.
- _set_authentication() None ¶
Authenticate to the host.
- _set_url() None ¶
Set url of the host.
- get_page_by_id(page_id: str) ConfluenceVvtPage | None ¶
Get confluence page data by its id.
- Parameters:
page_id – confluence page id
- Returns:
ConfluenceVvtPage or None
mex.extractors.confluence_vvt.extract module¶
- mex.extractors.confluence_vvt.extract.extract_confluence_vvt_authors(authors: list[str]) list[LDAPPersonWithQuery] ¶
Extract LDAP persons with their query string for confluence-vvt authors.
- Parameters:
authors – list of authors
- Returns:
List of LDAP persons with query
- mex.extractors.confluence_vvt.extract.fetch_all_vvt_pages_ids() Generator[str, None, None] ¶
Fetch all the ids for data pages.
- Settings:
confluence_vvt.url: Confluence-vvt base url
confluence_vvt.overview_page_id: page id of the overview page
- Raises:
MExError – When the pagination limit is exceeded
- Returns:
Generator for page IDs
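A minimal sketch of how a paginated ID generator with a hard limit might look. It is not the extractor's actual implementation: `iter_page_ids`, `fake_fetch`, `PaginationLimitError` (a stand-in for MExError), the batch size, and the batch limit are all hypothetical, and an in-memory list replaces the HTTP request to the overview page.

```python
from collections.abc import Callable, Generator


class PaginationLimitError(Exception):
    """Hypothetical stand-in for MExError."""


def iter_page_ids(
    fetch_batch: Callable[[int, int], list[str]],
    batch_size: int = 2,
    max_batches: int = 100,
) -> Generator[str, None, None]:
    """Yield page ids batch by batch; raise once the batch limit is used up."""
    for batch_number in range(max_batches):
        batch = fetch_batch(batch_number * batch_size, batch_size)
        yield from batch
        if len(batch) < batch_size:
            return  # a short or empty batch means there is nothing left
    raise PaginationLimitError("pagination limit exceeded")


# Assumption: in-memory data replaces the request to the overview page.
ALL_IDS = ["101", "102", "103", "104", "105"]


def fake_fetch(start: int, limit: int) -> list[str]:
    """Return one slice of the fake id list."""
    return ALL_IDS[start : start + limit]


page_ids = list(iter_page_ids(fake_fetch))
```

Raising instead of silently stopping makes an unexpectedly large result set visible to the caller rather than truncating it.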
- mex.extractors.confluence_vvt.extract.get_all_persons_from_all_pages(pages: list[ConfluenceVvtPage], confluence_vvt_activity_mapping: Any) list[str] ¶
Get a list of all persons from all confluence pages.
- Parameters:
pages – all confluence-vvt pages
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of all persons on all confluence pages
- mex.extractors.confluence_vvt.extract.get_all_units_from_all_pages(pages: list[ConfluenceVvtPage], confluence_vvt_activity_mapping: Any) list[str] ¶
Get a list of all units from all confluence pages.
- Parameters:
pages – all confluence-vvt pages
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of all units on all confluence pages
- mex.extractors.confluence_vvt.extract.get_contact_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: Any) list[str] ¶
Get contact from confluence page.
- Parameters:
page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of contacts
- mex.extractors.confluence_vvt.extract.get_involved_persons_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: Any) list[str] ¶
Get involved persons from confluence page.
- Parameters:
page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of involved persons
- mex.extractors.confluence_vvt.extract.get_involved_units_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: Any) list[str] ¶
Get involved units from confluence page.
- Parameters:
page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of involved units
- mex.extractors.confluence_vvt.extract.get_page_data_by_id(page_ids: Iterable[str]) Generator[ConfluenceVvtPage, None, None] ¶
Get confluence page data for each of the given page ids.
- Parameters:
page_ids – list of confluence page ids
- Returns:
Generator of ConfluenceVvtPage
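A minimal sketch of a generator that resolves page ids one by one. Since `get_page_by_id` on the connector may return `None`, this sketch assumes such pages are skipped; that behaviour is an assumption, not confirmed by the documentation above. `Page`, `iter_pages`, and `fake_get_page` are hypothetical names.

```python
from collections.abc import Callable, Generator, Iterable
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical stand-in for ConfluenceVvtPage."""

    id: int
    title: str


def iter_pages(
    page_ids: Iterable[str],
    get_page: Callable[[str], Optional[Page]],
) -> Generator[Page, None, None]:
    """Yield one page per id, skipping ids for which no page is returned."""
    for page_id in page_ids:
        page = get_page(page_id)
        if page is not None:  # assumption: missing pages are skipped
            yield page


def fake_get_page(page_id: str) -> Optional[Page]:
    """Pretend that page "2" cannot be loaded."""
    if page_id == "2":
        return None
    return Page(id=int(page_id), title=f"Page {page_id}")


pages = list(iter_pages(["1", "2", "3"], fake_get_page))
```

Because the function is a generator, pages are only requested as the caller consumes them.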
- mex.extractors.confluence_vvt.extract.get_responsible_unit_from_page(page: ConfluenceVvtPage, confluence_vvt_activity_mapping: Any) list[str] ¶
Get responsible unit from confluence page.
- Parameters:
page – confluence-vvt page
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
- Returns:
list of responsible units
mex.extractors.confluence_vvt.main module¶
mex.extractors.confluence_vvt.models module¶
- class mex.extractors.confluence_vvt.models.ConfluenceVvtCell¶
Bases:
BaseModel
Base class for cells in a confluence table.
- abstract get_texts() list[str] ¶
Returns all texts in this cell.
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {}¶
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- search(pattern: str) list[str] ¶
Returns found strings with matching pattern.
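A minimal sketch of the cell hierarchy, using plain dataclasses instead of pydantic models. `Cell`, `Heading`, and `Value` are hypothetical stand-ins for the documented classes. Two things are assumptions here: that a missing text yields an empty list, and that `search` returns the whole texts containing a match rather than only the matched substrings.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


class Cell(ABC):
    """Hypothetical stand-in for ConfluenceVvtCell."""

    @abstractmethod
    def get_texts(self) -> list[str]:
        """Return all texts in this cell."""

    def search(self, pattern: str) -> list[str]:
        """Return every text in which the regular expression matches."""
        return [text for text in self.get_texts() if re.search(pattern, text)]


@dataclass
class Heading(Cell):
    """Heading cell holding at most one text."""

    text: Optional[str]

    def get_texts(self) -> list[str]:
        return [self.text] if self.text else []


@dataclass
class Value(Cell):
    """Value cell holding any number of texts."""

    texts: Optional[list[str]]

    def get_texts(self) -> list[str]:
        return list(self.texts or [])


matches = Value(texts=["A-1", "B-2", "none"]).search(r"\d")
```

Putting `search` on the base class lets headings and value cells share one implementation that only depends on `get_texts`.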
- class mex.extractors.confluence_vvt.models.ConfluenceVvtHeading(*, text: str | None)¶
Bases:
ConfluenceVvtCell
Model class for confluence heading.
- get_texts() list[str] ¶
Returns all text in this heading.
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'text': FieldInfo(annotation=Union[str, NoneType], required=True)}¶
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- text: str | None¶
- class mex.extractors.confluence_vvt.models.ConfluenceVvtPage(*, id: int, title: str, tables: list[ConfluenceVvtTable])¶
Bases:
BaseRawData
Model class for confluence page.
- get_end_year() TemporalEntity | None ¶
Return end year from extractor.
- get_identifier_in_primary_source() str | None ¶
Return identifier in primary source from extractor.
- get_partners() Sequence[str | None] ¶
Return partners from extractor.
- get_start_year() TemporalEntity | None ¶
Return start year from extractor.
- get_units() Sequence[str | None] ¶
Return units from extractor.
- id: int¶
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'id': FieldInfo(annotation=int, required=True), 'tables': FieldInfo(annotation=list[ConfluenceVvtTable], required=True), 'title': FieldInfo(annotation=str, required=True)}¶
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- tables: list[ConfluenceVvtTable]¶
- title: str¶
- class mex.extractors.confluence_vvt.models.ConfluenceVvtRow(*, cells: list[ConfluenceVvtHeading | ConfluenceVvtValue])¶
Bases:
BaseModel
Model class for confluence row.
- cells: list[ConfluenceVvtHeading | ConfluenceVvtValue]¶
- get_texts() list[str] ¶
Returns all text in this row.
- is_heading() bool ¶
Returns whether all cells in a row are headings.
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'cells': FieldInfo(annotation=list[Union[ConfluenceVvtHeading, ConfluenceVvtValue]], required=True)}¶
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- class mex.extractors.confluence_vvt.models.ConfluenceVvtTable(*, rows: list[ConfluenceVvtRow])¶
Bases:
BaseModel
Model class for confluence table.
- get_value_row_by_heading(heading: str) ConfluenceVvtRow ¶
If heading is found in a row, return the next row.
- Parameters:
heading – Heading string to search for.
- Returns:
ConfluenceVvt row instance.
- Raises:
ValueError – If no row was found matching the given heading.
TypeError – If next row is not ConfluenceVvt value row.
IndexError – If there is no row after the heading we have found.
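A minimal sketch of the lookup described above: find the row containing the heading, then return the row that follows it. `Row` and `value_row_after_heading` are hypothetical names, and matching the heading by exact text is an assumption; the real method may compare differently.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical stand-in for ConfluenceVvtRow."""

    texts: list[str]
    heading: bool  # True if every cell in the row is a heading


def value_row_after_heading(rows: list[Row], heading: str) -> Row:
    """Return the value row that directly follows the matching heading row."""
    for index, row in enumerate(rows):
        if row.heading and heading in row.texts:
            # Indexing past the end raises IndexError when the heading
            # row is the last row of the table.
            next_row = rows[index + 1]
            if next_row.heading:
                raise TypeError("row after the heading is not a value row")
            return next_row
    raise ValueError(f"no row found for heading {heading!r}")


rows = [
    Row(texts=["Title"], heading=True),
    Row(texts=["Study A"], heading=False),
    Row(texts=["Contact"], heading=True),
]
value_row = value_row_after_heading(rows, "Title")
```

The three exception types let a caller distinguish a missing heading, a malformed table, and a truncated table.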
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'rows': FieldInfo(annotation=list[ConfluenceVvtRow], required=True)}¶
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- rows: list[ConfluenceVvtRow]¶
- class mex.extractors.confluence_vvt.models.ConfluenceVvtValue(*, texts: list[str] | None)¶
Bases:
ConfluenceVvtCell
Model class for confluence value cell.
- get_texts() list[str] ¶
Returns all text in this value cell.
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'texts': FieldInfo(annotation=Union[list[str], NoneType], required=True)}¶
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- texts: list[str] | None¶
mex.extractors.confluence_vvt.parse_html module¶
- mex.extractors.confluence_vvt.parse_html.get_clean_current_row_all_cols_data(current_row_all_cols_data: list[str]) list[str] ¶
Get clean data for all cols in current row, removing all unwanted characters.
- Parameters:
current_row_all_cols_data – List of all columns of current row
- Returns:
list of cleaned strings for all columns of current row
- mex.extractors.confluence_vvt.parse_html.get_interne_vorgangsnummer_from_all_rows_data(intnmr_dict: Any | None | list[str]) list[str] | Any ¶
Get Interne Vorgangsnummer from the extracted table.
- Parameters:
intnmr_dict – Extracted dict or list of Interne Vorgangsnummer
- Returns:
list of extracted Interne Vorgangsnummer
- mex.extractors.confluence_vvt.parse_html.get_interne_vorgangsnummer_from_title(interne_vorgangsnummer_title: str) list[str] ¶
Extract Interne Vorgangsnummer from the title row.
- Parameters:
interne_vorgangsnummer_title – Interne Vorgangsnummer title
- Returns:
list of extracted Interne Vorgangsnummer from the title
- mex.extractors.confluence_vvt.parse_html.get_row_data_for_all_rows(table_rows: ResultSet[Any], min_ignorable_cols: int = 1) dict[str, str | list[str]] ¶
Get all the data from the provided rows.
- Parameters:
table_rows – Table rows ResultSet from bs4
min_ignorable_cols – If row has multiple columns, number of columns below this number will be ignored. Defaults to 1.
- Returns:
structured dict of all the extracted data
- mex.extractors.confluence_vvt.parse_html.get_verantwortlichen(field_name: str, all_rows_data: dict[str, str | list[str]]) tuple[list[str], list[str]] ¶
Get Verantwortliche from the extracted data of all rows.
- Parameters:
field_name – Name of the field in all_rows_data that is to be extracted
all_rows_data – All extracted rows data
- Returns:
tuple of names and OEs of Verantwortliche(r)
- mex.extractors.confluence_vvt.parse_html.parse_data_html_page(html: str) tuple[str | list[str] | None, list[str], list[str], list[str], list[str], list[str], list[str], list[str] | Any] | None ¶
Parse required data from html string.
- Parameters:
html – Raw html in string format
- Returns:
abstract, verantwortliche_studienleiterin, OE names and interne_vorgangsnummer
mex.extractors.confluence_vvt.settings module¶
- class mex.extractors.confluence_vvt.settings.ConfluenceVvtSettings(*, url: str = 'https://confluence.vvt', username: SecretStr = SecretStr('**********'), password: SecretStr = SecretStr('**********'), overview_page_id: str = '123456', template_v1_mapping_path: AssetsPath = AssetsPath('mappings/__final__/confluence-vvt_template_v1'), skip_pages: list[str] = ['123456'])¶
Bases:
BaseModel
Confluence-vvt settings submodule definition for the Confluence-vvt extractor.
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'overview_page_id': FieldInfo(annotation=str, required=False, default='123456', description='Confluence id of the overview page.'), 'password': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='Confluence-vvt password'), 'skip_pages': FieldInfo(annotation=list[str], required=False, default=['123456'], description='List of Confluence-vvt page ids that must be skipped for incomplete or broken data, otherwise it will break the extractor.'), 'template_v1_mapping_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("mappings/__final__/confluence-vvt_template_v1"), description='Path to the directory with the confluence-vvt mapping files containing the default values, absolute path or relative to `assets_dir`.'), 'url': FieldInfo(annotation=str, required=False, default='https://confluence.vvt', description='URL of Confluence-vvt.'), 'username': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='Confluence-vvt user name')}¶
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- overview_page_id: str¶
- password: SecretStr¶
- skip_pages: list[str]¶
- template_v1_mapping_path: AssetsPath¶
- url: str¶
- username: SecretStr¶
mex.extractors.confluence_vvt.transform module¶
- mex.extractors.confluence_vvt.transform.transform_confluence_vvt_activities_to_extracted_activities(pages: Iterable[ConfluenceVvtPage], extracted_primary_source: ExtractedPrimarySource, confluence_vvt_activity_mapping: Any, merged_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_merged_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) list[ExtractedActivity] ¶
Transform Confluence-vvt pages to extracted activities.
- Parameters:
pages – All Confluence-vvt pages
extracted_primary_source – Extracted primary source for Confluence-vvt
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
merged_ids_by_query_string – Mapping from author query to merged IDs
unit_merged_ids_by_synonym – Map from unit acronyms and labels to their merged ID
- Returns:
List of ExtractedActivity
- mex.extractors.confluence_vvt.transform.transform_confluence_vvt_page_to_extracted_activity(page: ConfluenceVvtPage, extracted_primary_source: ExtractedPrimarySource, confluence_vvt_activity_mapping: Any, merged_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_merged_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) ExtractedActivity | None ¶
Transform Confluence-vvt page to extracted activity.
- Parameters:
page – Confluence-vvt page
extracted_primary_source – Extracted primary source for Confluence-vvt
confluence_vvt_activity_mapping – activity mapping for confluence-vvt
merged_ids_by_query_string – Mapping from author query to merged IDs
unit_merged_ids_by_synonym – Map from unit acronyms and labels to their merged ID
- Returns:
ExtractedActivity or None