mex.extractors package
Subpackages
- mex.extractors.artificial package
- mex.extractors.biospecimen package
- mex.extractors.blueant package
- Subpackages
- Submodules
- mex.extractors.blueant.connector module
BlueAntConnector
BlueAntConnector._get_json_from_api()
BlueAntConnector._set_authentication()
BlueAntConnector._set_url()
BlueAntConnector.get_client_name()
BlueAntConnector.get_department_name()
BlueAntConnector.get_persons()
BlueAntConnector.get_projects()
BlueAntConnector.get_status_name()
BlueAntConnector.get_type_description()
- mex.extractors.blueant.extract module
- mex.extractors.blueant.filter module
- mex.extractors.blueant.main module
- mex.extractors.blueant.settings module
- mex.extractors.blueant.transform module
- Module contents
- mex.extractors.confluence_vvt package
- Submodules
- mex.extractors.confluence_vvt.connector module
- mex.extractors.confluence_vvt.extract module
- mex.extractors.confluence_vvt.main module
- mex.extractors.confluence_vvt.models module
ConfluenceVvtCell
ConfluenceVvtHeading
ConfluenceVvtPage
ConfluenceVvtPage.get_end_year()
ConfluenceVvtPage.get_identifier_in_primary_source()
ConfluenceVvtPage.get_partners()
ConfluenceVvtPage.get_start_year()
ConfluenceVvtPage.get_units()
ConfluenceVvtPage.id
ConfluenceVvtPage.model_computed_fields
ConfluenceVvtPage.model_config
ConfluenceVvtPage.model_fields
ConfluenceVvtPage.tables
ConfluenceVvtPage.title
ConfluenceVvtRow
ConfluenceVvtTable
ConfluenceVvtValue
- mex.extractors.confluence_vvt.parse_html module
- mex.extractors.confluence_vvt.settings module
ConfluenceVvtSettings
ConfluenceVvtSettings.model_computed_fields
ConfluenceVvtSettings.model_config
ConfluenceVvtSettings.model_fields
ConfluenceVvtSettings.overview_page_id
ConfluenceVvtSettings.password
ConfluenceVvtSettings.skip_pages
ConfluenceVvtSettings.template_v1_mapping_path
ConfluenceVvtSettings.url
ConfluenceVvtSettings.username
- mex.extractors.confluence_vvt.transform module
- Module contents
- mex.extractors.datscha_web package
- mex.extractors.ff_projects package
- mex.extractors.grippeweb package
- Submodules
- mex.extractors.grippeweb.connector module
- mex.extractors.grippeweb.extract module
- mex.extractors.grippeweb.main module
- mex.extractors.grippeweb.settings module
- mex.extractors.grippeweb.transform module
get_or_create_external_partner()
transform_grippeweb_access_platform_to_extracted_access_platform()
transform_grippeweb_resource_mappings_to_dict()
transform_grippeweb_resource_mappings_to_extracted_resources()
transform_grippeweb_variable_group_to_extracted_variable_groups()
transform_grippeweb_variable_to_extracted_variables()
- Module contents
- mex.extractors.ifsg package
- Subpackages
- mex.extractors.ifsg.models package
- Submodules
- mex.extractors.ifsg.models.meta_catalogue2item module
- mex.extractors.ifsg.models.meta_catalogue2item2schema module
- mex.extractors.ifsg.models.meta_datatype module
- mex.extractors.ifsg.models.meta_disease module
- mex.extractors.ifsg.models.meta_field module
- mex.extractors.ifsg.models.meta_item module
- mex.extractors.ifsg.models.meta_schema2field module
- mex.extractors.ifsg.models.meta_schema2type module
- mex.extractors.ifsg.models.meta_type module
- Module contents
- Submodules
- mex.extractors.ifsg.connector module
- mex.extractors.ifsg.extract module
- mex.extractors.ifsg.filter module
- mex.extractors.ifsg.main module
- mex.extractors.ifsg.settings module
- mex.extractors.ifsg.transform module
- Module contents
- mex.extractors.international_projects package
- mex.extractors.odk package
- mex.extractors.open_data package
- mex.extractors.pipeline package
- mex.extractors.primary_source package
- mex.extractors.publisher package
- mex.extractors.rdmo package
- mex.extractors.seq_repo package
- Submodules
- mex.extractors.seq_repo.extract module
- mex.extractors.seq_repo.filter module
- mex.extractors.seq_repo.main module
- mex.extractors.seq_repo.model module
SeqRepoSource
SeqRepoSource.customer_org_unit_id
SeqRepoSource.customer_sample_name
SeqRepoSource.lims_sample_id
SeqRepoSource.model_computed_fields
SeqRepoSource.model_config
SeqRepoSource.model_fields
SeqRepoSource.project_coordinators
SeqRepoSource.project_id
SeqRepoSource.project_name
SeqRepoSource.sequencing_date
SeqRepoSource.sequencing_platform
SeqRepoSource.species
- mex.extractors.seq_repo.settings module
- mex.extractors.seq_repo.transform module
- Module contents
- mex.extractors.sinks package
- mex.extractors.sumo package
- Subpackages
- mex.extractors.sumo.models package
- Submodules
- mex.extractors.sumo.models.base module
- mex.extractors.sumo.models.cc1_data_model_nokeda module
- mex.extractors.sumo.models.cc1_data_valuesets module
- mex.extractors.sumo.models.cc2_aux_mapping module
- mex.extractors.sumo.models.cc2_aux_model module
- mex.extractors.sumo.models.cc2_aux_valuesets module
- mex.extractors.sumo.models.cc2_feat_projection module
- Module contents
- Submodules
- mex.extractors.sumo.extract module
- mex.extractors.sumo.filter module
- mex.extractors.sumo.main module
- mex.extractors.sumo.settings module
- mex.extractors.sumo.transform module
create_new_organization_with_official_name()
get_contact_merged_ids_by_emails()
get_contact_merged_ids_by_names()
transform_feat_projection_variable_to_mex_variable()
transform_feat_variable_to_mex_variable_group()
transform_model_nokeda_variable_to_mex_variable_group()
transform_nokeda_aux_variable_to_mex_variable()
transform_nokeda_aux_variable_to_mex_variable_group()
transform_nokeda_model_variable_to_mex_variable()
transform_resource_feat_model_to_mex_resource()
transform_resource_nokeda_to_mex_resource()
transform_sumo_access_platform_to_mex_access_platform()
transform_sumo_activity_to_extracted_activity()
- Module contents
- mex.extractors.synopse package
- Subpackages
- Submodules
- mex.extractors.synopse.connector module
- mex.extractors.synopse.extract module
- mex.extractors.synopse.filter module
- mex.extractors.synopse.main module
- mex.extractors.synopse.settings module
SynopseSettings
SynopseSettings.datensatzuebersicht_path
SynopseSettings.mapping_path
SynopseSettings.metadaten_zu_datensaetzen_path
SynopseSettings.model_computed_fields
SynopseSettings.model_config
SynopseSettings.model_fields
SynopseSettings.projekt_und_studienverwaltung_path
SynopseSettings.report_server_password
SynopseSettings.report_server_url
SynopseSettings.report_server_username
SynopseSettings.variablenuebersicht_path
- mex.extractors.synopse.transform module
transform_overviews_to_resource_lookup()
transform_synopse_data_to_mex_resources()
transform_synopse_project_to_activity()
transform_synopse_projects_to_mex_activities()
transform_synopse_studies_into_access_platforms()
transform_synopse_variables_belonging_to_same_variable_group_to_mex_variables()
transform_synopse_variables_to_mex_variable_groups()
transform_synopse_variables_to_mex_variables()
- Module contents
- mex.extractors.voxco package
- mex.extractors.wikidata package
Submodules
mex.extractors.drop module
- class mex.extractors.drop.DropApiConnector
Bases: HTTPConnector
Connector class to handle interaction with the Drop API.
- API_VERSION = 'v0'
- _check_availability() → None
Send a GET request to verify the API is available.
- _set_authentication() → None
Add the Drop API key to all session headers.
- _set_url() → None
Set the Drop API URL, including the version path.
- get_file(x_system: str, file_id: str) → dict[str, Any]
Get the content of a file from the x_system.
- Parameters:
x_system – name of the x_system
file_id – name of the file
- Returns:
content of the file
- list_files(x_system: str) → list[str]
Get available files for the x_system.
- Parameters:
x_system – name of the x_system to list the files for
- Returns:
list of available filenames for the x_system
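The two public methods suggest a list-then-fetch pattern. The following is a minimal sketch of that pattern, not the connector's implementation: `FakeDropConnector` is a hypothetical in-memory stand-in so the example runs without network access, and the x_system name and file contents are invented. Only the method names and signatures `list_files(x_system)` and `get_file(x_system, file_id)` come from the entries above.

```python
from typing import Any


class FakeDropConnector:
    """Hypothetical in-memory stand-in mirroring the two documented methods."""

    def __init__(self, files: dict[str, dict[str, dict[str, Any]]]) -> None:
        # Maps x_system -> file_id -> file content (invented example data).
        self._files = files

    def list_files(self, x_system: str) -> list[str]:
        """Return the available file names for the x_system."""
        return sorted(self._files.get(x_system, {}))

    def get_file(self, x_system: str, file_id: str) -> dict[str, Any]:
        """Return the content of one file from the x_system."""
        return self._files[x_system][file_id]


def fetch_all(
    connector: FakeDropConnector, x_system: str
) -> dict[str, dict[str, Any]]:
    """List every file of an x_system, then fetch each one by its id."""
    return {
        file_id: connector.get_file(x_system, file_id)
        for file_id in connector.list_files(x_system)
    }


connector = FakeDropConnector(
    {"example-system": {"a.json": {"value": 1}, "b.json": {"value": 2}}}
)
contents = fetch_all(connector, "example-system")
```

With the real connector, `fetch_all` would presumably receive a `DropApiConnector` instance instead of the stand-in; how that instance is obtained is not covered on this page.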
mex.extractors.filters module
- mex.extractors.filters.filter_by_global_rules(primary_source_id: Identifier, items: Iterable[RawDataT]) → Generator[RawDataT, None, None]
Filter out items according to the global filter rules and return a generator over the remaining items.
- Parameters:
primary_source_id – identifier of the primary source
items – the items (sources or resources) to be filtered
mex.extractors.logging module
- mex.extractors.logging.log_filter(identifier_in_primary_source: str | None, primary_source_id: Identifier, reason: str) → None
Log filtered sources.
- Parameters:
identifier_in_primary_source – optional identifier in the primary source
primary_source_id – identifier of the primary source
reason – string explaining the reason for filtering
mex.extractors.main module
mex.extractors.models module
- class mex.extractors.models.BaseRawData
Bases: BaseModel
Base class for raw data that provides standardized access to the attributes used for filtering.
- abstractmethod get_end_year() → TemporalEntity | None
Return end year from extractor.
- abstractmethod get_identifier_in_primary_source() → str | None
Return identifier in primary source from extractor.
- abstractmethod get_partners() → Sequence[str | None]
Return partners from extractor.
- abstractmethod get_start_year() → TemporalEntity | None
Return start year from extractor.
- abstractmethod get_units() → Sequence[str | None]
Return units from extractor.
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}
Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- model_fields: ClassVar[Dict[str, FieldInfo]] = {}
Metadata about the fields defined on the model: a mapping of field names to FieldInfo (pydantic.fields.FieldInfo) objects.
This replaces Model.__fields__ from Pydantic V1.
mex.extractors.settings module
- class mex.extractors.settings.Settings(_env_file: Path | str | Sequence[Path | str] | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_nested_delimiter: str | None = None, _secrets_dir: str | Path | None = None, *, pdb: bool = False, MEX_SINK: list[Sink] = [Sink.NDJSON], MEX_ASSETS_DIR: Path = PosixPath('/home/runner/work/mex-extractors/mex-extractors/assets'), MEX_WORK_DIR: Path = PosixPath('/home/runner/work/mex-extractors/mex-extractors'), MEX_IDENTITY_PROVIDER: IdentityProvider = IdentityProvider.MEMORY, MEX_BACKEND_API_URL: Url = Url('http://localhost:8080/'), MEX_BACKEND_API_KEY: SecretStr = SecretStr('**********'), MEX_VERIFY_SESSION: bool | AssetsPath = True, MEX_ORGANIGRAM_PATH: AssetsPath = AssetsPath('raw-data/organigram/organizational_units.json'), MEX_PRIMARY_SOURCES_PATH: AssetsPath = AssetsPath('raw-data/primary-sources/primary-sources.json'), MEX_LDAP_URL: SecretStr = SecretStr('**********'), MEX_WIKI_API_URL: Url = Url('https://wikidata/'), MEX_WIKI_QUERY_SERVICE_URL: Url = Url('https://wikidata/'), MEX_WEB_USER_AGENT: str = 'rki/mex', MEX_ORCID_API_URL: Url = Url('https://orcid/'), MEX_SKIP_EXTRACTORS: list[str] = [], MEX_SKIP_MERGED_ITEMS: list[str] = ['MergedPrimarySource', 'MergedConsent', 'MergedPerson'], MEX_SKIP_PARTNERS: list[str] = ['test'], MEX_SKIP_UNITS: list[str] = ['IT', 'PRAES', 'ZV'], MEX_SKIP_YEARS_BEFORE: int = 1970, MEX_DROP_API_KEY: SecretStr = SecretStr('**********'), MEX_DROP_API_URL: Url = Url('http://localhost:8081/'), MEX_SCHEDULE: str = '0 0 * * *', kerberos_user: str = 'user@domain.tld', kerberos_password: SecretStr = SecretStr('**********'), s3_endpoint_url: Url = Url('https://s3/'), s3_access_key_id: SecretStr = SecretStr('**********'), s3_secret_access_key: SecretStr = SecretStr('**********'), s3_bucket_key: str = 's3_bucket', artificial: ArtificialSettings = ArtificialSettings(count=100, chattiness=10, seed=0, locale=['de_DE', 'en_US'], mesh_file=AssetsPath('raw-data/artificial/asciimesh.bin')), 
biospecimen: BiospecimenSettings = BiospecimenSettings(raw_data_path=AssetsPath('raw-data/biospecimen'), key_col='Feldname', val_col='zu extrahierender Wert (maschinenlesbar)', mapping_path=AssetsPath('mappings/biospecimen')), blueant: BlueAntSettings = BlueAntSettings(api_key=SecretStr('**********'), url='https://blueant', skip_labels=['test'], delete_prefixes=['_', '1_', '2_', '3_', '4_', '5_', '6_', '7_', '8_', '9_'], mapping_path=AssetsPath('mappings/blueant')), confluence_vvt: ConfluenceVvtSettings = ConfluenceVvtSettings(url='https://confluence.vvt', username=SecretStr('**********'), password=SecretStr('**********'), overview_page_id='123456', template_v1_mapping_path=AssetsPath('mappings/confluence-vvt_template_v1'), skip_pages=['123456']), datscha_web: DatschaWebSettings = DatschaWebSettings(url='https://datscha/', vorname=SecretStr('**********'), nachname=SecretStr('**********'), pw=SecretStr('**********'), organisation='RKI'), ff_projects: FFProjectsSettings = FFProjectsSettings(file_path=AssetsPath('raw-data/ff-projects/ff-projects.xlsx'), skip_funding=['Sonstige'], skip_topics=['Sonstige'], skip_years_strings=['fehlt', 'keine', 'offen'], skip_clients=['Sonstige'], mapping_path=AssetsPath('mappings/ff-projects')), grippeweb: GrippewebSettings = GrippewebSettings(mapping_path=AssetsPath('mappings/grippeweb'), mssql_connection_dsn='DRIVER={ODBC Driver 18 for SQL Server};SERVER=domain.tld;DATABASE=database'), ifsg: IFSGSettings = IFSGSettings(mapping_path=AssetsPath('mappings/ifsg'), mssql_connection_dsn='DRIVER={ODBC Driver 18 for SQL Server};SERVER=domain.tld;DATABASE=database'), international_projects: InternationalProjectsSettings = InternationalProjectsSettings(file_path=AssetsPath('raw-data/international-projects/international_projects.xlsx'), mapping_path=AssetsPath('mappings/international-projects')), odk: ODKSettings = ODKSettings(raw_data_path=AssetsPath('raw-data/odk'), mapping_path=AssetsPath('mappings/odk')), open_data: OpenDataSettings = 
OpenDataSettings(url='https://zenodo', community_rki='robertkochinstitut'), rdmo: RDMOSettings = RDMOSettings(url='https://rdmo/', username=SecretStr('**********'), password=SecretStr('**********')), seq_repo: SeqRepoSettings = SeqRepoSettings(mapping_path=AssetsPath('mappings/seq-repo')), sumo: SumoSettings = SumoSettings(raw_data_path=AssetsPath('raw-data/sumo'), mapping_path=AssetsPath('mappings/sumo')), voxco: VoxcoSettings = VoxcoSettings(mapping_path=AssetsPath('mappings/voxco')), synopse: SynopseSettings = SynopseSettings(report_server_url='https://report-server/', report_server_username=SecretStr('**********'), report_server_password=SecretStr('**********'), variablenuebersicht_path=AssetsPath('raw-data/synopse/variablenuebersicht.csv'), projekt_und_studienverwaltung_path=AssetsPath('raw-data/synopse/projekt_und_studienverwaltung.csv'), metadaten_zu_datensaetzen_path=AssetsPath('raw-data/synopse/metadaten_zu_datensaetzen.csv'), datensatzuebersicht_path=AssetsPath('raw-data/synopse/datensatzuebersicht.csv'), mapping_path=AssetsPath('mappings/synopse')), wikidata: WikidataSettings = WikidataSettings(mapping_path=AssetsPath('mappings/wikidata')))¶
Bases: BaseSettings
Settings definition class for extractors and related scripts.
- artificial: ArtificialSettings
- biospecimen: BiospecimenSettings
- blueant: BlueAntSettings
- confluence_vvt: ConfluenceVvtSettings
- datscha_web: DatschaWebSettings
- drop_api_key: SecretStr
- drop_api_url: Url
- ff_projects: FFProjectsSettings
- grippeweb: GrippewebSettings
- ifsg: IFSGSettings
- international_projects: InternationalProjectsSettings
- kerberos_password: SecretStr
- kerberos_user: str
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': '.env', 'env_file_encoding': 'utf-8', 'env_ignore_empty': False, 'env_nested_delimiter': '__', 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'mex_', 'extra': 'ignore', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'populate_by_name': True, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_assignment': True, 'validate_default': True, 'yaml_file': None, 'yaml_file_encoding': None}¶
Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'artificial': FieldInfo(annotation=ArtificialSettings, required=False, default=ArtificialSettings(count=100, chattiness=10, seed=0, locale=['de_DE', 'en_US'], mesh_file=AssetsPath("raw-data/artificial/asciimesh.bin"))), 'assets_dir': FieldInfo(annotation=Path, required=False, default=PosixPath('/home/runner/work/mex-extractors/mex-extractors/assets'), alias_priority=2, validation_alias='MEX_ASSETS_DIR', description='Path to directory that contains input files treated as read-only, looks for a folder named `assets` in the current directory by default.'), 'backend_api_key': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), alias_priority=2, validation_alias='MEX_BACKEND_API_KEY', description='Backend API key with write access to call POST/PUT endpoints'), 'backend_api_url': FieldInfo(annotation=Url, required=False, default=Url('http://localhost:8080/'), alias_priority=2, validation_alias='MEX_BACKEND_API_URL', description='MEx backend API url.'), 'biospecimen': FieldInfo(annotation=BiospecimenSettings, required=False, default=BiospecimenSettings(raw_data_path=AssetsPath("raw-data/biospecimen"), key_col='Feldname', val_col='zu extrahierender Wert (maschinenlesbar)', mapping_path=AssetsPath("mappings/biospecimen"))), 'blueant': FieldInfo(annotation=BlueAntSettings, required=False, default=BlueAntSettings(api_key=SecretStr('**********'), url='https://blueant', skip_labels=['test'], delete_prefixes=['_', '1_', '2_', '3_', '4_', '5_', '6_', '7_', '8_', '9_'], mapping_path=AssetsPath("mappings/blueant"))), 'confluence_vvt': FieldInfo(annotation=ConfluenceVvtSettings, required=False, default=ConfluenceVvtSettings(url='https://confluence.vvt', username=SecretStr('**********'), password=SecretStr('**********'), overview_page_id='123456', template_v1_mapping_path=AssetsPath("mappings/confluence-vvt_template_v1"), skip_pages=['123456'])), 'datscha_web': FieldInfo(annotation=DatschaWebSettings, 
required=False, default=DatschaWebSettings(url='https://datscha/', vorname=SecretStr('**********'), nachname=SecretStr('**********'), pw=SecretStr('**********'), organisation='RKI')), 'debug': FieldInfo(annotation=bool, required=False, default=False, alias='pdb', alias_priority=2, validation_alias='MEX_DEBUG', description='Jump into post-mortem debugging after any uncaught exception.'), 'drop_api_key': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), alias_priority=2, validation_alias='MEX_DROP_API_KEY', description='Drop API key with admin access to call all GET endpoints'), 'drop_api_url': FieldInfo(annotation=Url, required=False, default=Url('http://localhost:8081/'), alias_priority=2, validation_alias='MEX_DROP_API_URL', description='MEx drop API url.'), 'ff_projects': FieldInfo(annotation=FFProjectsSettings, required=False, default=FFProjectsSettings(file_path=AssetsPath("raw-data/ff-projects/ff-projects.xlsx"), skip_funding=['Sonstige'], skip_topics=['Sonstige'], skip_years_strings=['fehlt', 'keine', 'offen'], skip_clients=['Sonstige'], mapping_path=AssetsPath("mappings/ff-projects"))), 'grippeweb': FieldInfo(annotation=GrippewebSettings, required=False, default=GrippewebSettings(mapping_path=AssetsPath("mappings/grippeweb"), mssql_connection_dsn='DRIVER={ODBC Driver 18 for SQL Server};SERVER=domain.tld;DATABASE=database')), 'identity_provider': FieldInfo(annotation=IdentityProvider, required=False, default=<IdentityProvider.MEMORY: 'memory'>, alias_priority=2, validation_alias='MEX_IDENTITY_PROVIDER', description='Provider to assign identifiers to new model instances.'), 'ifsg': FieldInfo(annotation=IFSGSettings, required=False, default=IFSGSettings(mapping_path=AssetsPath("mappings/ifsg"), mssql_connection_dsn='DRIVER={ODBC Driver 18 for SQL Server};SERVER=domain.tld;DATABASE=database')), 'international_projects': FieldInfo(annotation=InternationalProjectsSettings, required=False, 
default=InternationalProjectsSettings(file_path=AssetsPath("raw-data/international-projects/international_projects.xlsx"), mapping_path=AssetsPath("mappings/international-projects"))), 'kerberos_password': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='Kerberos password to authenticate against MSSQL server.'), 'kerberos_user': FieldInfo(annotation=str, required=False, default='user@domain.tld', description='Kerberos user to authenticate against MSSQL server.'), 'ldap_url': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), alias_priority=2, validation_alias='MEX_LDAP_URL', description='LDAP server for person queries with authentication credentials. Must follow format `ldap://user:pw@host:port`, where `user` is the username, and `pw` is the password for authenticating against ldap, `host` is the url of the ldap server, and `port` is the port of the ldap server.'), 'mex_web_user_agent': FieldInfo(annotation=str, required=False, default='rki/mex', alias_priority=2, validation_alias='MEX_WEB_USER_AGENT', description='a user agent is sent in the header of some requests to external services '), 'odk': FieldInfo(annotation=ODKSettings, required=False, default=ODKSettings(raw_data_path=AssetsPath("raw-data/odk"), mapping_path=AssetsPath("mappings/odk"))), 'open_data': FieldInfo(annotation=OpenDataSettings, required=False, default=OpenDataSettings(url='https://zenodo', community_rki='robertkochinstitut')), 'orcid_api_url': FieldInfo(annotation=Url, required=False, default=Url('https://orcid/'), alias_priority=2, validation_alias='MEX_ORCID_API_URL', description='URL of orcid api.'), 'organigram_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("raw-data/organigram/organizational_units.json"), alias_priority=2, validation_alias='MEX_ORGANIGRAM_PATH', description='Path to the JSON file describing the organizational units, absolute path or relative to `assets_dir`.'), 
'primary_sources_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("raw-data/primary-sources/primary-sources.json"), alias_priority=2, validation_alias='MEX_PRIMARY_SOURCES_PATH', description='Path to the JSON file describing the primary sources, absolute path or relative to `assets_dir`.'), 'rdmo': FieldInfo(annotation=RDMOSettings, required=False, default=RDMOSettings(url='https://rdmo/', username=SecretStr('**********'), password=SecretStr('**********'))), 's3_access_key_id': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='The access key to use when creating the client.'), 's3_bucket_key': FieldInfo(annotation=str, required=False, default='s3_bucket', description='The S3 bucket where to store objects.'), 's3_endpoint_url': FieldInfo(annotation=Url, required=False, default=Url('https://s3/'), description='The complete URL to use for the constructed client.'), 's3_secret_access_key': FieldInfo(annotation=SecretStr, required=False, default=SecretStr('**********'), description='The secret key to use when creating the client.'), 'schedule': FieldInfo(annotation=str, required=False, default='0 0 * * *', alias_priority=2, validation_alias='MEX_SCHEDULE', description='A valid cron string defining when to run extractor jobs'), 'seq_repo': FieldInfo(annotation=SeqRepoSettings, required=False, default=SeqRepoSettings(mapping_path=AssetsPath("mappings/seq-repo"))), 'sink': FieldInfo(annotation=list[Sink], required=False, default=[<Sink.NDJSON: 'ndjson'>], alias_priority=2, validation_alias='MEX_SINK', description='Where to send data that is extracted or ingested. 
Defaults to writing ndjson files, but can be configured to push to the backend or the graph.'), 'skip_extractors': FieldInfo(annotation=list[str], required=False, default=[], alias_priority=2, validation_alias='MEX_SKIP_EXTRACTORS', description='Skip execution of these extractors in dagster'), 'skip_merged_items': FieldInfo(annotation=list[str], required=False, default=['MergedPrimarySource', 'MergedConsent', 'MergedPerson'], alias_priority=2, validation_alias='MEX_SKIP_MERGED_ITEMS', description='Skip merged items with these types'), 'skip_partners': FieldInfo(annotation=list[str], required=False, default=['test'], alias_priority=2, validation_alias='MEX_SKIP_PARTNERS', description='Skip projects with these external partners'), 'skip_units': FieldInfo(annotation=list[str], required=False, default=['IT', 'PRAES', 'ZV'], alias_priority=2, validation_alias='MEX_SKIP_UNITS', description='Skip projects with these responsible units'), 'skip_years_before': FieldInfo(annotation=int, required=False, default=1970, alias_priority=2, validation_alias='MEX_SKIP_YEARS_BEFORE', description='Skip projects conducted before this year'), 'sumo': FieldInfo(annotation=SumoSettings, required=False, default=SumoSettings(raw_data_path=AssetsPath("raw-data/sumo"), mapping_path=AssetsPath("mappings/sumo"))), 'synopse': FieldInfo(annotation=SynopseSettings, required=False, default=SynopseSettings(report_server_url='https://report-server/', report_server_username=SecretStr('**********'), report_server_password=SecretStr('**********'), variablenuebersicht_path=AssetsPath("raw-data/synopse/variablenuebersicht.csv"), projekt_und_studienverwaltung_path=AssetsPath("raw-data/synopse/projekt_und_studienverwaltung.csv"), metadaten_zu_datensaetzen_path=AssetsPath("raw-data/synopse/metadaten_zu_datensaetzen.csv"), datensatzuebersicht_path=AssetsPath("raw-data/synopse/datensatzuebersicht.csv"), mapping_path=AssetsPath("mappings/synopse"))), 'verify_session': FieldInfo(annotation=Union[bool, 
AssetsPath], required=False, default=True, alias_priority=2, validation_alias='MEX_VERIFY_SESSION', description="Either a boolean that controls whether we verify the server's TLS certificate, or a path to a CA bundle to use. If a path is given, it can be either absolute or relative to the `assets_dir`. Defaults to True."), 'voxco': FieldInfo(annotation=VoxcoSettings, required=False, default=VoxcoSettings(mapping_path=AssetsPath("mappings/voxco"))), 'wiki_api_url': FieldInfo(annotation=Url, required=False, default=Url('https://wikidata/'), alias_priority=2, validation_alias='MEX_WIKI_API_URL', description='URL of Wikidata API, this URL is used to send wikidata organization ID to get all the info about the organization, which includes basic info, aliases, labels, descriptions, claims, and sitelinks'), 'wiki_query_service_url': FieldInfo(annotation=Url, required=False, default=Url('https://wikidata/'), alias_priority=2, validation_alias='MEX_WIKI_QUERY_SERVICE_URL', description='URL of Wikidata query service, this URL is to send organization name in plain text to wikidata and receive search results with wikidata organization ID'), 'wikidata': FieldInfo(annotation=WikidataSettings, required=False, default=WikidataSettings(mapping_path=AssetsPath("mappings/wikidata"))), 'work_dir': FieldInfo(annotation=Path, required=False, default=PosixPath('/home/runner/work/mex-extractors/mex-extractors'), alias_priority=2, validation_alias='MEX_WORK_DIR', description='Path to directory that stores generated and temporary files. Defaults to the current working directory.')}¶
Metadata about the fields defined on the model: a mapping of field names to FieldInfo (pydantic.fields.FieldInfo) objects.
This replaces Model.__fields__ from Pydantic V1.
- odk: ODKSettings
- open_data: OpenDataSettings
- rdmo: RDMOSettings
- s3_access_key_id: SecretStr
- s3_bucket_key: str
- s3_endpoint_url: Url
- s3_secret_access_key: SecretStr
- schedule: str
- seq_repo: SeqRepoSettings
- skip_extractors: list[str]
- skip_merged_items: list[str]
- skip_partners: list[str]
- skip_units: list[str]
- skip_years_before: int
- sumo: SumoSettings
- synopse: SynopseSettings
- voxco: VoxcoSettings
- wikidata: WikidataSettings
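The `model_config` above lists `env_prefix` as `'mex_'`, `env_nested_delimiter` as `'__'` and `case_sensitive` as `False`, which suggests how environment variable names for these fields are composed. The helper below is a hypothetical, standard-library sketch of that naming convention, not pydantic-settings code. It assumes that list-typed values are supplied as JSON strings, which is the usual pydantic-settings behaviour for complex types; the example URL is invented.

```python
import json

ENV_PREFIX = "mex_"  # taken from model_config['env_prefix']
NESTED_DELIMITER = "__"  # taken from model_config['env_nested_delimiter']


def env_name(*path: str) -> str:
    """Compose the environment variable name for a (possibly nested) field.

    Upper case is only a convention here, since case_sensitive is False.
    """
    return (ENV_PREFIX + NESTED_DELIMITER.join(path)).upper()


# Example environment, e.g. the contents of a `.env` file.
env = {
    env_name("skip_units"): json.dumps(["IT", "PRAES"]),
    env_name("blueant", "url"): "https://blueant.example",  # invented URL
    env_name("kerberos_user"): "user@domain.tld",
}
```

For top-level fields the composed names coincide with the validation aliases shown in `model_fields`, for example `MEX_SKIP_UNITS`, while fields of nested settings such as `blueant` would be addressed through the delimiter, as in `MEX_BLUEANT__URL`.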
mex.extractors.utils module
- mex.extractors.utils.load_yaml(path: PathLike[str]) → dict[str, Any]
Load the contents of a YAML file from the given path and return as a dict.