mex.extractors.ff_projects package

Submodules

mex.extractors.ff_projects.extract module

mex.extractors.ff_projects.extract.extract_ff_project_authors(ff_projects_sources: Iterable[FFProjectsSource]) → Generator[LDAPPersonWithQuery, None, None]

Extract LDAP persons with their query string for FF Projects authors.

Parameters:

ff_projects_sources – FF Projects sources

Returns:

Generator for LDAP persons with query

mex.extractors.ff_projects.extract.extract_ff_projects_organizations(ff_projects_sources: Iterable[FFProjectsSource]) → dict[str, MergedOrganizationIdentifier]

Search for and extract organizations from Wikidata.

Parameters:

ff_projects_sources – Iterable of ff-project sources

Returns:

Dict mapping each organization label to its organization ID (MergedOrganizationIdentifier)

mex.extractors.ff_projects.extract.extract_ff_projects_source(row: pd.Series[Any]) → FFProjectsSource | None

Extract one FF Projects source from a single pandas series row.

Parameters:

row – pandas Series holding one DataFrame row, which represents one source

Returns:

FF Projects source or None

mex.extractors.ff_projects.extract.extract_ff_projects_sources() → Generator[FFProjectsSource, None, None]

Extract FF Projects sources by loading data from an MS Excel file.

Settings:

ff_projects.file_path: Path to the ff-projects list, absolute or relative to assets_dir

Returns:

Generator for FF Projects sources

mex.extractors.ff_projects.extract.filter_out_duplicate_source_ids(sources: Iterable[FFProjectsSource]) → Generator[FFProjectsSource, None, None]

Remove duplicate `lfd_nr`s from the given sources.

Parameters:

sources – Iterable of FF Projects sources

Returns:

Filtered FF Projects sources
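
The reference does not say whether the first occurrence of a duplicated `lfd_nr` is kept or whether every source sharing it is dropped. The following is a minimal sketch of the second reading. `SourceStub` and `drop_duplicate_lfd_nrs` are hypothetical names introduced only for illustration, and the drop-all policy is an assumption, not the documented behaviour.

```python
from collections import Counter
from collections.abc import Generator, Iterable
from dataclasses import dataclass


@dataclass
class SourceStub:
    """Hypothetical stand-in for FFProjectsSource; only lfd_nr matters here."""

    lfd_nr: str
    title: str


def drop_duplicate_lfd_nrs(
    sources: Iterable[SourceStub],
) -> Generator[SourceStub, None, None]:
    """Yield only the sources whose lfd_nr occurs exactly once (assumed policy)."""
    # Materialize the iterable so it can be counted first and iterated again.
    materialized = list(sources)
    counts = Counter(source.lfd_nr for source in materialized)
    for source in materialized:
        if counts[source.lfd_nr] == 1:
            yield source


rows = [SourceStub("1", "a"), SourceStub("2", "b"), SourceStub("1", "c")]
unique_rows = list(drop_duplicate_lfd_nrs(rows))
```

If the real function keeps the first occurrence instead, a `set` of already seen `lfd_nr` values would replace the `Counter`.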

mex.extractors.ff_projects.extract.get_clean_names(name: str) → str

Strip unwanted characters and numerals from a name.

Parameters:

name – Name of the person

Returns:

Cleaned name

Return type:

str
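
Which characters count as "unwanted" is not specified here. As a rough sketch, a cleanup of this kind could be written with regular expressions. The function name `clean_name` and the exact character rules (digits removed; only word characters, whitespace, commas, periods and hyphens kept) are assumptions for illustration.

```python
import re


def clean_name(name: str) -> str:
    """Remove digits and stray symbols from a person name (assumed rules)."""
    # Drop all numerals, e.g. footnote markers or counters.
    without_digits = re.sub(r"\d+", "", name)
    # Keep word characters (including umlauts), whitespace, commas, periods, hyphens.
    without_symbols = re.sub(r"[^\w\s,.\-]", "", without_digits)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", without_symbols).strip()


cleaned = clean_name("Müller, A. (2)")
```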

mex.extractors.ff_projects.extract.get_optional_string_from_cell(cell_value: Any) → str | None

Try to extract the string value from a cell by truncating floats.

Parameters:

cell_value – Value of a cell, could be string, int or datetime

Returns:

String or None

mex.extractors.ff_projects.extract.get_string_from_cell(cell_value: Any) → str

Try to extract the string value from a cell by truncating floats.

Parameters:

cell_value – Value of a cell, could be string, int or datetime

Returns:

String
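
Spreadsheet readers often return whole numbers as floats, so an ID such as 12 arrives as 12.0. The sketch below illustrates what "truncating floats" could look like for the two string helpers above. The function names are hypothetical, and the handling of NaN, empty strings and None is an assumption rather than documented behaviour.

```python
from __future__ import annotations

import math
from typing import Any


def optional_string_from_cell(cell_value: Any) -> str | None:
    """Return the cell as a string, or None if it holds no usable value."""
    if cell_value is None:
        return None
    if isinstance(cell_value, float):
        # NaN and infinities mark empty or broken cells (assumed handling).
        if not math.isfinite(cell_value):
            return None
        # Truncate the fractional part: 12.0 becomes "12".
        return str(int(cell_value))
    text = str(cell_value).strip()
    return text or None


def string_from_cell(cell_value: Any) -> str:
    """Like optional_string_from_cell, but fall back to an empty string."""
    return optional_string_from_cell(cell_value) or ""


truncated = optional_string_from_cell(12.0)
```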

mex.extractors.ff_projects.extract.get_temporal_entity_from_cell(cell_value: Any) → TemporalEntity | None

Try to extract a TemporalEntity from a cell.

Parameters:

cell_value – Value of a cell, could be int, string or datetime

Returns:

TemporalEntity or None

mex.extractors.ff_projects.filter module

mex.extractors.ff_projects.filter.filter_and_log_ff_projects_source(source: FFProjectsSource, primary_source_id: Identifier, unit_stable_target_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) → bool

Filter an FFProjectsSource according to the settings and log the filtering.

Parameters:
  • source – FFProjectsSource

  • primary_source_id – Identifier of primary source

  • unit_stable_target_ids_by_synonym – Unit IDs grouped by synonyms

Settings:

ff_projects.skip_funding: Skip sources with this funding
ff_projects.skip_topics: Skip sources with these topics
ff_projects.skip_years_strings: Skip sources with these years
ff_projects.skip_clients: Skip sources with these clients

Returns:

False if source is filtered out, else True
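
To illustrate how the four skip lists could drive this predicate, here is a minimal sketch. The default values are taken from FFProjectsSettings in this reference; the `SourceStub` field names (`funding`, `topic`, `year`, `client`) and the function `keep_source` are hypothetical. The sketch also leaves out `primary_source_id` and `unit_stable_target_ids_by_synonym`, whose role in the filter is not described here.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class SkipSettings:
    """Stand-in for the skip lists of FFProjectsSettings (defaults from the docs)."""

    skip_funding: list[str] = field(default_factory=lambda: ["Sonstige"])
    skip_topics: list[str] = field(default_factory=lambda: ["Sonstige"])
    skip_years_strings: list[str] = field(
        default_factory=lambda: ["fehlt", "keine", "offen"]
    )
    skip_clients: list[str] = field(default_factory=lambda: ["Sonstige"])


@dataclass
class SourceStub:
    """Hypothetical stand-in for FFProjectsSource; field names are assumptions."""

    lfd_nr: str
    funding: str | None = None
    topic: str | None = None
    year: str | None = None
    client: str | None = None


def keep_source(source: SourceStub, settings: SkipSettings) -> bool:
    """Return False and log the reason if any field matches a skip list."""
    checks = [
        ("funding", source.funding, settings.skip_funding),
        ("topic", source.topic, settings.skip_topics),
        ("year", source.year, settings.skip_years_strings),
        ("client", source.client, settings.skip_clients),
    ]
    for label, value, skip_values in checks:
        if value in skip_values:
            logger.info("skipping %s: %s %r is skipped", source.lfd_nr, label, value)
            return False
    return True


settings = SkipSettings()
kept = keep_source(SourceStub("2", funding="DFG", year="2020"), settings)
```

A generator variant for many sources can then be written as `(s for s in sources if keep_source(s, settings))`.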

mex.extractors.ff_projects.filter.filter_and_log_ff_projects_sources(sources: Iterable[FFProjectsSource], primary_source_id: Identifier, unit_stable_target_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) → Generator[FFProjectsSource, None, None]

Filter FF Projects sources and log filtered sources.

Parameters:
  • sources – Iterable of FFProjectsSource objects

  • primary_source_id – Identifier of primary source

  • unit_stable_target_ids_by_synonym – Unit IDs grouped by synonyms

Returns:

Generator for filtered FF Projects sources

mex.extractors.ff_projects.main module

mex.extractors.ff_projects.settings module

class mex.extractors.ff_projects.settings.FFProjectsSettings(*, file_path: AssetsPath = AssetsPath('raw-data/ff-projects/ff-projects.xlsx'), skip_funding: list[str] = ['Sonstige'], skip_topics: list[str] = ['Sonstige'], skip_years_strings: list[str] = ['fehlt', 'keine', 'offen'], skip_clients: list[str] = ['Sonstige'], mapping_path: AssetsPath = AssetsPath('mappings/__final__/ff-projects'))

Bases: BaseModel

Settings submodel for the FF Projects extractor.

file_path: AssetsPath
mapping_path: AssetsPath
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'file_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("raw-data/ff-projects/ff-projects.xlsx"), description='Path to the FF Projects excel file, absolute path or relative to `assets_dir`.'), 'mapping_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("mappings/__final__/ff-projects"), description='Path to the directory with the ff-projects mapping filesvalues, absolute path or relative to `assets_dir`.'), 'skip_clients': FieldInfo(annotation=list[str], required=False, default=['Sonstige'], description='Skip sources with these clients'), 'skip_funding': FieldInfo(annotation=list[str], required=False, default=['Sonstige'], description='Skip sources with this funding'), 'skip_topics': FieldInfo(annotation=list[str], required=False, default=['Sonstige'], description='Skip sources with these topics'), 'skip_years_strings': FieldInfo(annotation=list[str], required=False, default=['fehlt', 'keine', 'offen'], description='Skip sources with these years')}

Metadata about the fields defined on the model, mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

skip_clients: list[str]
skip_funding: list[str]
skip_topics: list[str]
skip_years_strings: list[str]

mex.extractors.ff_projects.transform module

mex.extractors.ff_projects.transform.transform_ff_projects_source_to_extracted_activity(ff_projects_source: FFProjectsSource, extracted_primary_source: ExtractedPrimarySource, person_stable_target_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_stable_target_id_by_synonym: dict[str, MergedOrganizationalUnitIdentifier], organization_stable_target_id_by_synonyms: dict[str, MergedOrganizationIdentifier], ff_projects_activity: Any) → ExtractedActivity

Transform FF Projects source to an extracted activity.

Parameters:
  • ff_projects_source – FF Projects source

  • extracted_primary_source – Extracted primary source for FF Projects

  • person_stable_target_ids_by_query_string – Mapping from author query to a list of person stable target IDs

  • unit_stable_target_id_by_synonym – Mapping from unit acronyms and labels to unit stable target ID

  • organization_stable_target_id_by_synonyms – Mapping from organization synonyms to organization stable target ID

  • ff_projects_activity – Activity mapping model with default values

Returns:

Extracted activity for the given FF Projects source

Module contents