mex.extractors.artificial package
Submodules
mex.extractors.artificial.identity module
- mex.extractors.artificial.identity._create_numeric_ids(faker: Faker) → dict[str, list[int]]
Create a mapping from entity type to a list of numeric ids.
These numeric ids serve as seeds for the identities of artificial items. Each seed is passed to Identifier.generate(seed=…), so the artificial extractor produces the same identifiers across consecutive runs.
- Parameters:
faker – Instance of faker
- Returns:
Dict with entity types and lists of numeric ids
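The idea of a seeded mapping from entity type to numeric ids can be sketched as follows. This is a minimal illustration and not the package's implementation: `random.Random` stands in for the seeded Faker instance, and the entity type names, the `per_type` parameter, and the id range are assumptions made for the example.

```python
import random

# Hypothetical entity type names, chosen only for illustration.
ENTITY_TYPES = ["ExtractedPerson", "ExtractedResource", "ExtractedActivity"]


def create_numeric_ids(seed: int, per_type: int = 3) -> dict[str, list[int]]:
    """Map each entity type to a reproducible list of numeric ids."""
    rng = random.Random(seed)  # stand-in for a seeded Faker instance
    return {
        entity_type: [rng.randrange(10**6) for _ in range(per_type)]
        for entity_type in ENTITY_TYPES
    }


# Two runs with the same seed yield the same mapping.
first_run = create_numeric_ids(seed=0)
second_run = create_numeric_ids(seed=0)
```

Because the generator is seeded, every run yields the same ids, which is what allows the derived identifiers to stay stable between runs.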
- mex.extractors.artificial.identity._get_offset_int(cls: type) → int
Calculate an integer based on the crc32 checksum of the name of a class.
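A class-name checksum of this kind might look like the sketch below. It assumes the unqualified `__name__` is what gets hashed and that the raw checksum is returned unchanged; the actual function may post-process the value. The two classes are hypothetical placeholders.

```python
import zlib


def get_offset_int(cls: type) -> int:
    """Return an integer derived from the CRC32 checksum of the class name."""
    # Assumption: the unqualified class name, encoded as UTF-8, is hashed.
    return zlib.crc32(cls.__name__.encode("utf-8"))


class ExtractedPerson:  # hypothetical class, for illustration only
    pass


class ExtractedResource:  # hypothetical class, for illustration only
    pass


person_offset = get_offset_int(ExtractedPerson)
resource_offset = get_offset_int(ExtractedResource)
```

Unlike the built-in `hash()` of a string, which is salted per process by default, a CRC32 checksum is identical in every interpreter run, so it suits deterministic offsets.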
- mex.extractors.artificial.identity.create_identities(faker: Faker) → dict[str, list[Identity]]
Create the identities of the to-be-faked models.
We do this before actually creating the models, because we need to be able to set existing stableTargetIds on reference fields.
- Parameters:
faker – Instance of faker
- Returns:
Dict with entity types and lists of Identities
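The two-pass approach described above, identities first and models second, can be illustrated with a simplified sketch. The `Identity` dataclass, the `build_person` helper, the entity type names, and the `affiliation` field are all hypothetical stand-ins, not the package's real types.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Simplified stand-in for the real Identity model."""

    entity_type: str
    stable_target_id: str


def create_identities(numeric_ids: dict[str, list[int]]) -> dict[str, list[Identity]]:
    """First pass: derive one identity per numeric id, before any model exists."""
    return {
        entity_type: [Identity(entity_type, f"{entity_type}-{n}") for n in ids]
        for entity_type, ids in numeric_ids.items()
    }


def build_person(
    identity: Identity,
    identities: dict[str, list[Identity]],
    rng: random.Random,
) -> dict[str, str]:
    """Second pass: fill a reference field with an already known stableTargetId."""
    organization = rng.choice(identities["Organization"])
    return {
        "stableTargetId": identity.stable_target_id,
        "affiliation": organization.stable_target_id,
    }


identities = create_identities({"Person": [1, 2], "Organization": [10, 11]})
rng = random.Random(0)
people = [build_person(identity, identities, rng) for identity in identities["Person"]]
```

Since all identities exist before the first model is built, every reference field can point at a stableTargetId that is guaranteed to belong to some generated item.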
- mex.extractors.artificial.identity.restore_identities(identity_map: dict[str, list[Identity]]) → None
Restore the state of the memory identity provider.
Because identity creation and model instantiation happen in different subprocesses, the identity provider does not have access to previously stored identities.
- Parameters:
identity_map – Identity map that needs to be loaded back into the provider
mex.extractors.artificial.main module
mex.extractors.artificial.provider module
- class mex.extractors.artificial.provider.BuilderProvider(generator: Any)
Bases:
Provider
Faker provider that deals with interpreting pydantic model fields.
- extracted_data(model: type[ExtractedData]) → list[ExtractedData]
Get a list of extracted data instances for the given model class.
- field_value(field: FieldInfo, identity: Identity) → list[Any]
Get a single artificial value for the given field and identity.
- inner_type_and_pattern(field: FieldInfo) → tuple[Any, str | None]
Return the inner type and pattern of a field.
If the field's type has arguments, randomly pick one of them.
- min_max_for_field(field: FieldInfo) → tuple[int, int]
Return a min and max item count for a field.
- class mex.extractors.artificial.provider.IdentityProvider(factory: Any, identities: dict[str, list[Identity]])
Bases:
BaseProvider
Faker provider that creates identities and helps with referencing them.
- __init__(factory: Any, identities: dict[str, list[Identity]]) → None
Create and persist identities for all entity types.
- identities(model: type[ExtractedData]) → list[Identity]
Return a list of identities for the given model class.
- reference(inner_type: type[Identifier], exclude: Identity) → Identifier | None
Return the ID of a random identity of the given type, skipping the excluded identity.
- class mex.extractors.artificial.provider.LinkProvider(generator: Any)
Bases:
Provider, Provider
Faker provider that can return links with optional title and language.
- link() → Link
Return a link with optional title and language.
- class mex.extractors.artificial.provider.PatternProvider(factory: Any)
Bases:
BaseProvider
Faker provider to create strings matching given patterns.
- MESH_TO_TEMPLATE = {'^http://id\\.nlm\\.nih\\.gov/mesh/[A-Z0-9]{2,64}$': 'http://id.nlm.nih.gov/mesh/{}'}
- REGEX_TO_NUMERIFY = {'^(((http)|(https))://(dx.)?doi.org/)(10.\\d{4,9}/[-._;()/:A-Z0-9]+)$': 'https://dx.doi.org/10.####/#######', '^https://d\\-nb\\.info/gnd/[-X0-9]{3,10}$': 'https://d-nb.info/gnd/3########', '^https://gepris\\.dfg\\.de/gepris/institution/[0-9]{1,64}$': 'https://gepris.dfg.de/gepris/institution/#######', '^https://isni\\.org/isni/[X0-9]{16}$': 'https://isni.org/isni/################', '^https://loinc.org/([a-zA-z]*)|(([0-9]*(-[0-9])*))$': 'https://loinc.org/#####-#', '^https://orcid\\.org/[-X0-9]{9,21}$': 'https://orcid.org/0000-####-####-###X', '^https://ror\\.org/[a-z0-9]{9}$': 'https://ror.org/#########', '^https://viaf\\.org/viaf/[0-9]{2,22}$': 'https://viaf.org/viaf/#########', '^https://www\\.wikidata\\.org/entity/[PQ0-9]{2,64}$': 'https://www.wikidata.org/entity/P######'}
- __init__(factory: Any) → None
Initialize the provider by loading the contents of the mesh_file.
- pattern(regex: str) → str
Return a randomized string matching the given pattern.
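The REGEX_TO_NUMERIFY mapping pairs each known pattern with a template whose `#` placeholders are filled with digits. A minimal sketch of that lookup-and-fill step follows. It uses a two-entry excerpt of the mapping and a hand-written `numerify` helper that stands in for Faker's own; raising `KeyError` for unknown patterns is an assumption of this sketch, not documented behavior.

```python
import random
import re

# Two entries excerpted from REGEX_TO_NUMERIFY.
REGEX_TO_NUMERIFY = {
    r"^https://ror\.org/[a-z0-9]{9}$": "https://ror.org/#########",
    r"^https://viaf\.org/viaf/[0-9]{2,22}$": "https://viaf.org/viaf/#########",
}


def numerify(template: str, rng: random.Random) -> str:
    """Replace every '#' in the template with a random digit."""
    return "".join(
        str(rng.randint(0, 9)) if char == "#" else char for char in template
    )


def pattern(regex: str, rng: random.Random) -> str:
    """Return a random string for a known regex (KeyError if unknown)."""
    return numerify(REGEX_TO_NUMERIFY[regex], rng)


rng = random.Random(0)
samples = {regex: pattern(regex, rng) for regex in REGEX_TO_NUMERIFY}
# Each generated string satisfies the regex it was looked up by.
all_match = all(re.fullmatch(regex, value) for regex, value in samples.items())
```

Looking up a hand-written template avoids having to generate strings from arbitrary regular expressions, at the cost of supporting only the patterns listed in the mapping.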
- class mex.extractors.artificial.provider.TemporalEntityProvider(generator: Any)
Bases:
Provider
Faker provider that can return a custom TemporalEntity with random precision.
- temporal_entity(allowed_precision_levels: list[TemporalEntityPrecision]) → TemporalEntity
Return a custom temporal entity with random date, time and precision.
- class mex.extractors.artificial.provider.TextProvider(generator: Any)
Bases:
Provider
Faker provider that handles custom text related requirements.
- text_object() → Text
Return a random text paragraph with an auto-detected language.
- text_string() → str
Return a randomized sequence of words as a string.
mex.extractors.artificial.settings module
- class mex.extractors.artificial.settings.ArtificialSettings(*, count: Annotated[int, Ge(ge=26), Lt(lt=10000000.0)] = 100, chattiness: Annotated[int, Gt(gt=1), Le(le=100)] = 10, seed: int = 0, locale: list[str] = ['de_DE', 'en_US'], mesh_file: AssetsPath = AssetsPath('raw-data/artificial/asciimesh.bin'))
Bases:
BaseModel
Artificial settings submodel definition for the artificial data creator.
- chattiness: int
- count: int
- locale: list[str]
- mesh_file: AssetsPath
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'chattiness': FieldInfo(annotation=int, required=False, default=10, description='Maximum amount of words to produce for textual fields.', metadata=[Gt(gt=1), Le(le=100)]), 'count': FieldInfo(annotation=int, required=False, default=100, description='Amount of artificial entities to create. At least 2 per entity type are required, to ensure valid linking between the entities.', metadata=[Ge(ge=26), Lt(lt=10000000.0)]), 'locale': FieldInfo(annotation=list[str], required=False, default=['de_DE', 'en_US'], description='The locale to use for faker.'), 'mesh_file': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("raw-data/artificial/asciimesh.bin"), description='Binary MeSH file, absolute path or relative to `assets_dir`. MeSH (Medical Subject Headings) are used by the US National Library of Medicine as a controlled vocabulary thesaurus for indexing articles in PubMed. See: https://www.ncbi.nlm.nih.gov/mesh'), 'seed': FieldInfo(annotation=int, required=False, default=0, description='The seed value for faker randomness.')}
Metadata about the fields defined on the model, mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- seed: int
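The numeric bounds on `count` and `chattiness` can be made concrete with a small stand-in. The real class is a pydantic model; the dataclass below, including the name `ArtificialSettingsSketch`, is a hypothetical simplification that mirrors only the documented defaults and constraints and omits `mesh_file`.

```python
from dataclasses import dataclass, field


@dataclass
class ArtificialSettingsSketch:
    """Stdlib stand-in mirroring the documented defaults and bounds."""

    count: int = 100
    chattiness: int = 10
    seed: int = 0
    locale: list[str] = field(default_factory=lambda: ["de_DE", "en_US"])

    def __post_init__(self) -> None:
        # count: Ge(ge=26), Lt(lt=10000000.0)
        if not 26 <= self.count < 10_000_000:
            raise ValueError("count must be >= 26 and < 10000000")
        # chattiness: Gt(gt=1), Le(le=100)
        if not 1 < self.chattiness <= 100:
            raise ValueError("chattiness must be > 1 and <= 100")


defaults = ArtificialSettingsSketch()
small_run = ArtificialSettingsSketch(count=26, chattiness=2, seed=42)
```

Values outside the bounds, such as `count=25` or `chattiness=1`, raise `ValueError` in this sketch; the pydantic model would report a validation error instead.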