mex.extractors.ff_projects package¶

Submodules¶

mex.extractors.ff_projects.extract module¶

mex.extractors.ff_projects.extract.extract_ff_project_authors(ff_projects_sources: Iterable[FFProjectsSource]) → Generator[LDAPPersonWithQuery, None, None]¶

Extract LDAP persons with their query string for FF Projects authors.

Parameters: ff_projects_sources – FF Projects sources
Returns: Generator for LDAP persons with query

mex.extractors.ff_projects.extract.extract_ff_projects_organizations(ff_projects_sources: Iterable[FFProjectsSource]) → dict[str, MergedOrganizationIdentifier]¶

Search for and extract organizations from Wikidata.

Parameters: ff_projects_sources – Iterable of FF Projects sources
Returns: Dict mapping each organization label to its organization ID (a MergedOrganizationIdentifier, per the signature)

mex.extractors.ff_projects.extract.extract_ff_projects_source(row: pd.Series[Any]) → FFProjectsSource | None¶

Extract one FF Projects source from a single pandas series row.

Parameters: row – pandas Series representing one row of the data frame, i.e. one source
Returns: FF Projects source or None

mex.extractors.ff_projects.extract.extract_ff_projects_sources() → Generator[FFProjectsSource, None, None]¶

Extract FF Projects sources by loading data from MS-Excel file.

Settings:

ff_projects.file_path: Path to the ff-projects list, absolute or relative to `assets_dir`

Returns: Generator for FF Projects sources

mex.extractors.ff_projects.extract.get_clean_names(name: str) → str¶

Remove unwanted characters and numerals from a name.

Parameters: name – Name of the person
Returns: Cleaned name
Return type: str
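
The kind of cleanup described above can be illustrated with a minimal sketch. This is not the package's implementation: the function name `clean_name_sketch` is hypothetical, and the set of characters treated as "unwanted" is an assumption made for the example.

```python
import re


def clean_name_sketch(name: str) -> str:
    """Hypothetical stand-in for get_clean_names; the real rules may differ."""
    # Assumption: digits, underscores and anything other than letters,
    # whitespace, commas, periods and hyphens count as "unwanted".
    cleaned = re.sub(r"[^\w\s,.\-]|[\d_]", "", name)
    # Collapse the whitespace runs left behind by removed characters.
    return re.sub(r"\s+", " ", cleaned).strip()


print(clean_name_sketch("Müller, Anna 2 (FG99)"))  # prints: Müller, Anna FG
```

Keeping commas in the output is a deliberate choice in this sketch, since a "Surname, Given name" form is often useful for later person lookups; whether the real function does the same is not stated in this reference.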

mex.extractors.ff_projects.extract.get_optional_string_from_cell(cell_value: Any) → str | None¶

Try to extract the string value from a cell by truncating floats.

Parameters: cell_value – Value of a cell, could be string, int or datetime
Returns: String or None

mex.extractors.ff_projects.extract.get_string_from_cell(cell_value: Any) → str¶

Try to extract the string value from a cell by truncating floats.

Parameters: cell_value – Value of a cell, could be string, int or datetime
Returns: String
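
To make "truncating floats" concrete, here is a minimal sketch of both cell helpers. The function names are hypothetical stand-ins, and the handling of empty cells (None or NaN mapped to None) is an assumption; the actual functions may treat more cases.

```python
import math
from typing import Any, Optional


def string_from_cell_sketch(cell_value: Any) -> str:
    """Hypothetical stand-in for get_string_from_cell."""
    # Spreadsheet readers often return whole numbers as floats (2019.0);
    # truncating to int keeps the ".0" out of the resulting string.
    if isinstance(cell_value, float):
        return str(int(cell_value))
    return str(cell_value)


def optional_string_from_cell_sketch(cell_value: Any) -> Optional[str]:
    """Hypothetical stand-in for get_optional_string_from_cell."""
    # Assumption: empty cells arrive as None or NaN and should become None.
    if cell_value is None:
        return None
    if isinstance(cell_value, float) and math.isnan(cell_value):
        return None
    return string_from_cell_sketch(cell_value)
```

Note that `int()` truncates toward zero, so a cell holding `7.9` becomes `"7"` in this sketch rather than being rounded.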

mex.extractors.ff_projects.extract.get_temporal_entity_from_cell(cell_value: Any) → TemporalEntity | None¶

Try to extract a TemporalEntity from a cell.

Parameters: cell_value – Value of a cell, could be int, string or datetime
Returns: TemporalEntity or None

mex.extractors.ff_projects.filter module¶

mex.extractors.ff_projects.filter.filter_and_log_ff_projects_source(source: FFProjectsSource, primary_source_id: Identifier, unit_stable_target_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) → bool¶

Filter an FFProjectsSource according to settings and log the filtering.

Parameters:

source – FFProjectsSource
primary_source_id – Identifier of primary source
unit_stable_target_ids_by_synonym – Unit IDs grouped by synonyms

Settings:

ff_projects.skip_funding: Skip sources with this funding
ff_projects.skip_topics: Skip sources with these topics
ff_projects.skip_years_strings: Skip sources with these years
ff_projects.skip_clients: Skip sources with these clients

Returns: False if source is filtered out, else True
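
The skip-list settings can be pictured with the following minimal sketch. It is an illustration only: `SourceSketch` and its field names (`funding`, `topic`, `year`, `client`) are hypothetical, the skip lists are the defaults shown in the `FFProjectsSettings` signature, and the sketch leaves out whatever the real function does with `primary_source_id` and `unit_stable_target_ids_by_synonym`.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

# Defaults copied from the FFProjectsSettings signature.
SKIP_FUNDING = ["Sonstige"]
SKIP_TOPICS = ["Sonstige"]
SKIP_YEARS_STRINGS = ["fehlt", "keine", "offen"]
SKIP_CLIENTS = ["Sonstige"]


@dataclass
class SourceSketch:
    """Hypothetical stand-in for FFProjectsSource; field names are assumed."""

    lfd_nr: str
    funding: Optional[str] = None
    topic: Optional[str] = None
    year: Optional[str] = None
    client: Optional[str] = None


def keep_source_sketch(source: SourceSketch) -> bool:
    """Return False (and log why) if any field matches its skip list."""
    checks = [
        ("funding", source.funding, SKIP_FUNDING),
        ("topic", source.topic, SKIP_TOPICS),
        ("year", source.year, SKIP_YEARS_STRINGS),
        ("client", source.client, SKIP_CLIENTS),
    ]
    for field_name, value, skip_values in checks:
        if value in skip_values:
            logger.info(
                "skipping source %s: %s %r is in the skip list",
                source.lfd_nr, field_name, value,
            )
            return False
    return True
```

Returning a boolean and logging at the point of rejection mirrors the documented contract (False if the source is filtered out, else True), so a caller can apply the check with a plain list comprehension.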

mex.extractors.ff_projects.filter.filter_and_log_ff_projects_sources(sources: Iterable[FFProjectsSource], primary_source_id: Identifier, unit_stable_target_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) → list[FFProjectsSource]¶

Filter FF Projects sources and log filtered sources.

Parameters:

sources – Iterable of FFProjectsSource objects
primary_source_id – Identifier of primary source
unit_stable_target_ids_by_synonym – Unit IDs grouped by synonyms

Returns:

List of filtered FF Projects sources

mex.extractors.ff_projects.filter.filter_out_duplicate_source_ids(sources: Iterable[FFProjectsSource]) → list[FFProjectsSource]¶

Remove duplicate `lfd_nr`s from the given sources.

Parameters: sources – Iterable of FF Projects sources
Returns: Filtered FF Projects sources
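
The description does not say whether the first source with a repeated `lfd_nr` is kept or whether every source sharing that number is dropped. The sketch below assumes the stricter reading (drop all of them); `SourceSketch` and `drop_duplicate_ids_sketch` are hypothetical names used only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SourceSketch:
    """Hypothetical stand-in for FFProjectsSource."""

    lfd_nr: str
    title: str = ""


def drop_duplicate_ids_sketch(sources: Iterable[SourceSketch]) -> List[SourceSketch]:
    """Keep only sources whose lfd_nr occurs exactly once (assumed policy)."""
    # Materialize first: the iterable may be a one-shot generator,
    # and it has to be walked twice (count, then filter).
    materialized = list(sources)
    counts = Counter(source.lfd_nr for source in materialized)
    return [source for source in materialized if counts[source.lfd_nr] == 1]
```

If the intended policy is to keep the first occurrence instead, a set of already seen `lfd_nr` values would replace the `Counter`.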

mex.extractors.ff_projects.main module¶

mex.extractors.ff_projects.settings module¶

class mex.extractors.ff_projects.settings.FFProjectsSettings(*, file_path: AssetsPath = AssetsPath('raw-data/ff-projects/ff-projects.xlsx'), skip_funding: list[str] = ['Sonstige'], skip_topics: list[str] = ['Sonstige'], skip_years_strings: list[str] = ['fehlt', 'keine', 'offen'], skip_clients: list[str] = ['Sonstige'], mapping_path: AssetsPath = AssetsPath('mappings/ff-projects'))¶

Bases: BaseModel

Settings submodel for the FF Projects extractor.

file_path: AssetsPath¶

mapping_path: AssetsPath¶

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}¶: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'file_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("raw-data/ff-projects/ff-projects.xlsx"), description='Path to the FF Projects excel file, absolute path or relative to `assets_dir`.'), 'mapping_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("mappings/ff-projects"), description='Path to the directory with the ff-projects mapping filesvalues, absolute path or relative to `assets_dir`.'), 'skip_clients': FieldInfo(annotation=list[str], required=False, default=['Sonstige'], description='Skip sources with these clients'), 'skip_funding': FieldInfo(annotation=list[str], required=False, default=['Sonstige'], description='Skip sources with this funding'), 'skip_topics': FieldInfo(annotation=list[str], required=False, default=['Sonstige'], description='Skip sources with these topics'), 'skip_years_strings': FieldInfo(annotation=list[str], required=False, default=['fehlt', 'keine', 'offen'], description='Skip sources with these years')}¶

Metadata about the fields defined on the model, mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

skip_clients: list[str]¶

skip_funding: list[str]¶

skip_topics: list[str]¶

skip_years_strings: list[str]¶

mex.extractors.ff_projects.transform module¶

mex.extractors.ff_projects.transform.transform_ff_projects_source_to_extracted_activity(ff_projects_source: FFProjectsSource, extracted_primary_source: ExtractedPrimarySource, person_stable_target_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_stable_target_id_by_synonym: dict[str, MergedOrganizationalUnitIdentifier], organization_stable_target_id_by_synonyms: dict[str, MergedOrganizationIdentifier], ff_projects_activity: ActivityMapping) → ExtractedActivity¶

Transform FF Projects source to an extracted activity.

Parameters:

ff_projects_source – FF Projects source
extracted_primary_source – Extracted primary_source for FF Projects
person_stable_target_ids_by_query_string – Mapping from author query to person stable target ID
unit_stable_target_id_by_synonym – Mapping from unit acronyms and labels to unit stable target ID
organization_stable_target_id_by_synonyms – Mapping from organization synonyms to organization stable target ID
ff_projects_activity – activity mapping model with default values

Returns:

Extracted activity for the given projects source
