mex.extractors.ff_projects package
Subpackages
- mex.extractors.ff_projects.models package
- Submodules
- mex.extractors.ff_projects.models.source module
FFProjectsSource
FFProjectsSource.foerderprogr
FFProjectsSource.get_end_year()
FFProjectsSource.get_identifier_in_primary_source()
FFProjectsSource.get_partners()
FFProjectsSource.get_start_year()
FFProjectsSource.get_units()
FFProjectsSource.kategorie
FFProjectsSource.laufzeit_bis
FFProjectsSource.laufzeit_cells
FFProjectsSource.laufzeit_von
FFProjectsSource.lfd_nr
FFProjectsSource.model_computed_fields
FFProjectsSource.model_config
FFProjectsSource.model_fields
FFProjectsSource.projektleiter
FFProjectsSource.rki_az
FFProjectsSource.rki_oe
FFProjectsSource.thema_des_projekts
FFProjectsSource.zuwendungs_oder_auftraggeber
- Module contents
Submodules
mex.extractors.ff_projects.extract module
- mex.extractors.ff_projects.extract.extract_ff_project_authors(ff_projects_sources: Iterable[FFProjectsSource]) → Generator[LDAPPersonWithQuery, None, None]
Extract LDAP persons with their query string for FF Projects authors.
- Parameters:
ff_projects_sources – FF Projects sources
- Returns:
Generator for LDAP persons with query
- mex.extractors.ff_projects.extract.extract_ff_projects_organizations(ff_projects_sources: Iterable[FFProjectsSource]) → dict[str, MergedOrganizationIdentifier]
Search for and extract organizations from Wikidata.
- Parameters:
ff_projects_sources – Iterable of ff-project sources
- Returns:
Dict mapping each organization label to the identifier of the matching Wikidata organization
- mex.extractors.ff_projects.extract.extract_ff_projects_source(row: pd.Series[Any]) → FFProjectsSource | None
Extract one FF Projects source from a single pandas series row.
- Parameters:
row – pandas df series row representing one source
- Returns:
FF Projects source
- mex.extractors.ff_projects.extract.extract_ff_projects_sources() → Generator[FFProjectsSource, None, None]
Extract FF Projects sources by loading data from MS-Excel file.
- Settings:
- ff_projects.file_path: Path to the ff-projects list, absolute or relative to assets_dir
- Returns:
Generator for FF Projects sources
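The "absolute or relative to assets_dir" rule for `ff_projects.file_path` can be illustrated with a minimal sketch. The helper name `resolve_asset_path` and the example directories are hypothetical and not part of the package; the real `AssetsPath` type may resolve paths differently.

```python
from __future__ import annotations

from pathlib import Path


def resolve_asset_path(file_path: str | Path, assets_dir: str | Path) -> Path:
    """Return file_path unchanged if it is absolute, else join it onto assets_dir.

    Hypothetical helper, shown only to illustrate the documented rule.
    """
    path = Path(file_path)
    if path.is_absolute():
        return path
    return Path(assets_dir) / path


# A relative path is interpreted relative to the assets directory.
relative = resolve_asset_path("raw-data/ff-projects/ff-projects.xlsx", "assets")
# An absolute path is returned as-is.
absolute = resolve_asset_path(Path.cwd() / "ff-projects.xlsx", "assets")
```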
- mex.extractors.ff_projects.extract.filter_out_duplicate_source_ids(sources: Iterable[FFProjectsSource]) → Generator[FFProjectsSource, None, None]
Remove duplicate `lfd_nr`s from the given sources.
- Parameters:
sources – Iterable of FF Projects sources
- Returns:
Filtered FF Projects sources
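The description does not say whether one copy of a duplicated `lfd_nr` is kept. The following sketch assumes that every source whose `lfd_nr` occurs more than once is dropped; keeping the first occurrence would be the obvious alternative. `Source` is a hypothetical stand-in for `FFProjectsSource`, reduced to two fields.

```python
from collections import Counter
from collections.abc import Iterable, Iterator
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for FFProjectsSource with only two fields."""

    lfd_nr: str
    thema_des_projekts: str


def drop_duplicated_ids(sources: Iterable[Source]) -> Iterator[Source]:
    """Yield only sources whose lfd_nr is unique (assumed behaviour)."""
    # Materialize first: the input is iterated twice (count, then filter).
    materialized = list(sources)
    counts = Counter(source.lfd_nr for source in materialized)
    for source in materialized:
        if counts[source.lfd_nr] == 1:
            yield source


example = [Source("1", "a"), Source("2", "b"), Source("1", "c")]
unique = list(drop_duplicated_ids(example))
```

Counting before filtering costs one extra pass, but it lets the generator decide about each source without knowing what comes later in the input.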
- mex.extractors.ff_projects.extract.get_clean_names(name: str) → str
Strip unwanted characters and numerals from a name.
- Parameters:
name – Name of the person
- Returns:
Cleaned name
- Return type:
str
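Which characters count as "unwanted" is not documented. The sketch below assumes that letters, whitespace, commas, periods and hyphens are kept and everything else, including digits, is removed. The function name `clean_name` and the chosen character set are illustrative assumptions, not the package's implementation.

```python
import re


def clean_name(name: str) -> str:
    """Remove digits and punctuation other than comma, period and hyphen.

    The set of characters kept here is an assumption for illustration.
    """
    # First alternative: anything that is not a word character, whitespace,
    # comma, period or hyphen. Second alternative: digits and underscores,
    # which \w would otherwise let through.
    cleaned = re.sub(r"[^\w\s,.\-]|[\d_]", "", name)
    # Collapse the gaps left behind by removed characters.
    return re.sub(r"\s+", " ", cleaned).strip()
```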
- mex.extractors.ff_projects.extract.get_optional_string_from_cell(cell_value: Any) → str | None
Try to extract the string value from a cell by truncating floats.
- Parameters:
cell_value – Value of a cell, could be string, int or datetime
- Returns:
String or None
- mex.extractors.ff_projects.extract.get_string_from_cell(cell_value: Any) → str
Try to extract the string value from a cell by truncating floats.
- Parameters:
cell_value – Value of a cell, could be string, int or datetime
- Returns:
String
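Spreadsheet readers often return whole numbers as floats (for example `12.0`), which is presumably why these two helpers truncate floats before converting to a string. A minimal sketch of that idea follows. The function names are hypothetical, and treating `None` and NaN as "no value" in the optional variant is an assumption about how empty cells arrive.

```python
import math
from typing import Any, Optional


def string_from_cell(cell_value: Any) -> str:
    """Convert a cell value to a string, truncating floats first."""
    if isinstance(cell_value, float):
        # int() truncates toward zero, so 12.0 becomes "12", not "12.0".
        return str(int(cell_value))
    return str(cell_value)


def optional_string_from_cell(cell_value: Any) -> Optional[str]:
    """Like string_from_cell, but return None for empty cells.

    Assumption: empty cells show up as None or as a float NaN.
    """
    if cell_value is None:
        return None
    if isinstance(cell_value, float) and math.isnan(cell_value):
        return None
    return string_from_cell(cell_value)
```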
- mex.extractors.ff_projects.extract.get_temporal_entity_from_cell(cell_value: Any) → TemporalEntity | None
Try to extract a TemporalEntity from a cell.
- Parameters:
cell_value – Value of a cell, could be int, string or datetime
- Returns:
TemporalEntity or None
mex.extractors.ff_projects.filter module
- mex.extractors.ff_projects.filter.filter_and_log_ff_projects_source(source: FFProjectsSource, primary_source_id: Identifier, unit_stable_target_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) → bool
Filter an FFProjectsSource according to settings and log the filtering.
- Parameters:
source – FF Projects source
primary_source_id – Identifier of primary source
unit_stable_target_ids_by_synonym – Unit IDs grouped by synonyms
- Settings:
- ff_projects.skip_funding: Skip sources with this funding
- ff_projects.skip_topics: Skip sources with these topics
- ff_projects.skip_years_strings: Skip sources with these years
- ff_projects.skip_clients: Skip sources with these clients
- Returns:
False if source is filtered out, else True
- mex.extractors.ff_projects.filter.filter_and_log_ff_projects_sources(sources: Iterable[FFProjectsSource], primary_source_id: Identifier, unit_stable_target_ids_by_synonym: dict[str, MergedOrganizationalUnitIdentifier]) → Generator[FFProjectsSource, None, None]
Filter FF Projects sources and log filtered sources.
- Parameters:
sources – Iterable of FF Projects sources
primary_source_id – Identifier of primary source
unit_stable_target_ids_by_synonym – Unit IDs grouped by synonyms
- Returns:
Generator for filtered FF Projects sources
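The skip-list filtering described in this module can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `SkipSettings` and `keep_source` are hypothetical names, sources are plain dicts, and the pairing of source fields with skip lists (for example `foerderprogr` with `skip_funding`) is a guess based on the field names. The unit lookup via `unit_stable_target_ids_by_synonym` is left out.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class SkipSettings:
    """Hypothetical stand-in carrying the documented skip lists and defaults."""

    skip_funding: list[str] = field(default_factory=lambda: ["Sonstige"])
    skip_topics: list[str] = field(default_factory=lambda: ["Sonstige"])
    skip_years_strings: list[str] = field(
        default_factory=lambda: ["fehlt", "keine", "offen"]
    )
    skip_clients: list[str] = field(default_factory=lambda: ["Sonstige"])


def keep_source(source: dict[str, str | None], settings: SkipSettings) -> bool:
    """Return False if the source matches any skip list, else True."""
    # Assumed pairing of source fields and skip lists.
    checks = [
        ("foerderprogr", settings.skip_funding),
        ("thema_des_projekts", settings.skip_topics),
        ("laufzeit_cells", settings.skip_years_strings),
        ("zuwendungs_oder_auftraggeber", settings.skip_clients),
    ]
    for field_name, skip_values in checks:
        value = source.get(field_name)
        if value in skip_values:
            logger.info("skipping source: %s is %r", field_name, value)
            return False
    return True


settings = SkipSettings()
kept = keep_source(
    {
        "foerderprogr": "EU",
        "thema_des_projekts": "Impfung",
        "laufzeit_cells": "2019",
        "zuwendungs_oder_auftraggeber": "BMG",
    },
    settings,
)
skipped = keep_source({"foerderprogr": "Sonstige"}, settings)
```

Returning a boolean per source matches the documented contract (False if filtered out, else True), so the plural variant can be a thin generator that yields only the sources for which the check passes.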
mex.extractors.ff_projects.main module
mex.extractors.ff_projects.settings module
- class mex.extractors.ff_projects.settings.FFProjectsSettings(*, file_path: AssetsPath = AssetsPath('raw-data/ff-projects/ff-projects.xlsx'), skip_funding: list[str] = ['Sonstige'], skip_topics: list[str] = ['Sonstige'], skip_years_strings: list[str] = ['fehlt', 'keine', 'offen'], skip_clients: list[str] = ['Sonstige'], mapping_path: AssetsPath = AssetsPath('mappings/__final__/ff-projects'))
Bases: BaseModel
Settings submodel for the FF Projects extractor.
- file_path: AssetsPath
- mapping_path: AssetsPath
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'file_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("raw-data/ff-projects/ff-projects.xlsx"), description='Path to the FF Projects excel file, absolute path or relative to `assets_dir`.'), 'mapping_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("mappings/__final__/ff-projects"), description='Path to the directory with the ff-projects mapping filesvalues, absolute path or relative to `assets_dir`.'), 'skip_clients': FieldInfo(annotation=list[str], required=False, default=['Sonstige'], description='Skip sources with these clients'), 'skip_funding': FieldInfo(annotation=list[str], required=False, default=['Sonstige'], description='Skip sources with this funding'), 'skip_topics': FieldInfo(annotation=list[str], required=False, default=['Sonstige'], description='Skip sources with these topics'), 'skip_years_strings': FieldInfo(annotation=list[str], required=False, default=['fehlt', 'keine', 'offen'], description='Skip sources with these years')}
Metadata about the fields defined on the model, mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- skip_clients: list[str]
- skip_funding: list[str]
- skip_topics: list[str]
- skip_years_strings: list[str]
mex.extractors.ff_projects.transform module
- mex.extractors.ff_projects.transform.transform_ff_projects_source_to_extracted_activity(ff_projects_source: FFProjectsSource, extracted_primary_source: ExtractedPrimarySource, person_stable_target_ids_by_query_string: dict[str, list[MergedPersonIdentifier]], unit_stable_target_id_by_synonym: dict[str, MergedOrganizationalUnitIdentifier], organization_stable_target_id_by_synonyms: dict[str, MergedOrganizationIdentifier], ff_projects_activity: Any) → ExtractedActivity
Transform FF Projects source to an extracted activity.
- Parameters:
ff_projects_source – FF Projects source
extracted_primary_source – Extracted primary source for FF Projects
person_stable_target_ids_by_query_string – Mapping from author query string to a list of person stable target IDs
unit_stable_target_id_by_synonym – Mapping from unit acronyms and labels to unit stable target ID
organization_stable_target_id_by_synonyms – Mapping from organization synonyms to organization stable target ID
ff_projects_activity – activity mapping model with default values
- Returns:
Extracted activity for the given projects source