# The FMU results data model

This section describes the data model used for FMU results when exporting with fmu-dataio. For the time being, the data model is hosted as part of fmu-dataio.

The data model described herein is new and shiny, and experimental in many aspects. Any feedback on this is greatly appreciated. The most effective feedback is to apply the data model, then use the resulting metadata.

The FMU data model is described using a [Pydantic](https://pydantic.dev/) model which programmatically generates a [JSON Schema](https://json-schema.org/). This schema contains rules and definitions for all attributes in the data model. This means, in practice, that outgoing metadata from FMU needs to comply with the schema. If data is uploaded to e.g. Sumo, validation will be done on the incoming data to ensure consistency.

## Data model documentation

There are two closely related data models represented here: metadata generated from an FMU realization and metadata generated on a case level. The structure and documentation of these two models can be inspected from here.

```{eval-rst}
.. autosummary::
   :toctree: model/
   :recursive:

.. toctree::
   :maxdepth: -1

   ~fmu.dataio._models.fmu_results.fmu_results.ObjectMetadata
   ~fmu.dataio._models.fmu_results.fmu_results.CaseMetadata
```

## About the data model

### Why is it made?

FMU is a mighty system developed by and for the subsurface community in Equinor, to make reservoir modeling more efficient, less error-prone and more repeatable with higher quality, mainly through automation of cross-disciplinary workflows. It combines off-the-shelf software with in-house components such as the ERT orchestrator.

FMU is defined more and more by the data it produces, and direct and indirect dependencies on output from FMU are increasing. When FMU results started to be regularly transferred to cloud storage for direct consumption from 2017/2018 onwards, the need for stable metadata on outgoing data became imminent.
Local development on Johan Sverdrup was initiated to cater for the digital ecosystem evolving in and around that particular project, and the need for generalizing became apparent with the development of Sumo, Webviz and other initiatives.

The purpose of the data model is to cater for the existing dependencies, as well as enable more direct usage of FMU results in different contexts. The secondary objective of this data model is to create a normalization layer between the components that create data and the components that use those data.

The data model is designed to also be adapted to other sources of data than FMU.

### Scope of this data model

This data model covers data produced by FMU workflows. This includes data generated by direct runs of model templates, data produced by pre-processing workflows, data produced in individual realizations or hooked workflows, and data produced by post-processing workflows.

:::{note}
An example of a pre-processing workflow is a set of jobs modifying selected input data for later use in the FMU workflows and/or for comparison with other results in a QC context.
:::

:::{note}
An example of a post-processing workflow is a script that aggregates results across many realizations and/or iterations of an FMU case.
:::

This data model covers data that, in the FMU context, can be linked to a specific case.

Note that e.g. ERT and other components will, and should, have their own data models to cater for their needs. It is not the intention of this data model to cover all aspects of data in the FMU context. The scope is primarily data going *out* of FMU to be used elsewhere.

### A denormalized data model

The data model used for FMU results is a denormalized data model, at least to a certain point. This means that the static data will be repeated many times.
Example: Each exported data object contains basic information about the FMU case it belongs to, such as a unique ID for this case, its name, the user that made it, which model template was used, etc. This information is stored in *every* exported .yml file. This may seem counterintuitive, and differs from a relational database (where this information would typically be stored once, and referred to when needed).

There are a few reasons for choosing a denormalized data model:

First, the components for creating a relational database containing these data are not in place, and would be extremely difficult to implement quickly. Also, the nature of data in an FMU context is very distributed, with lots of data spread across many files and folders (currently).

Second, a denormalized data model enables us to utilize search engine technologies for indexing. This is not efficient for a normalized data model. The penalty for duplicating metadata across many individual files is returned in speed and ease-of-use.

:::{note}
The data model is only denormalized *to a certain point*. Most likely, it is better described as a hybrid.

Example: The concept of a *case* is used in FMU context. In the outgoing metadata for FMU results, some information about the current case is included. However, *details* about the case are out of scope. For this, a consumer would have to refer to the owner of the *case* definition. In FMU contexts, this will be the workflow manager (ERT).
:::

### Standardized vs anarchy

Creating a data model for FMU results brings with it some standard. In essence, this represents the next evolution of the existing FMU standard. We haven't called it "FMU standard 2.0" because although this would resonate with many people, many would find it revolting. But, sure, if you are so inclined you are allowed to think of it this way.
The FMU standard 1.0 is centered on folder structure and file names - a prerequisite for standardizing in the good old days when files were files, folders were folders, and data could be consumed by double-clicking. Or, by traversing the mounted file system.

With the transition to a cloud-native state come numerous opportunities - but also great responsibilities. Some of them are visible in the data model, and the data model is in itself a testament to the most important of them: We need to get our data straight.

There are many challenges. Aligning with everyone and everything is one. We probably don't succeed with that in the first iteration(s). Materializing metadata effectively, and without hassle, during FMU runs (meaning that *everything* must be *fully automated*) is another. This is what fmu-dataio solves. But, finding the balance between *retaining flexibility* and *enforcing a standard* is perhaps the trickiest of all.

This data model has been designed with the great flexibility of FMU in mind. If you are a geologist on an asset using FMU for something important, you need to be able to export any data from *your* workflow and *use that data* without having to wait for someone else to rebuild something. For FMU, one glove certainly does not fit all, and this has been taken into account. While the data model and the associated validation will set some requirements that you need to follow, you are still free to do more or less what you want.

We do, however, STRONGLY ENCOURAGE you to not invent too many private wheels. The risk is that your data cannot be used by others.

The materialized metadata has a nested structure which can be represented by Python dictionaries, yaml or json formats. The root level only contains key attributes, where most are nested sub-dictionaries.

### Relations to other data models

The data model for FMU results is designed with generalization in mind.
While in practice this data model covers data produced by, or in direct relation to, an FMU workflow - in *theory* it relates more to *subsurface predictive modeling* generally, than FMU specifically.

In Equinor, FMU is the primary system for creating, maintaining and using 3D predictive numerical models for the subsurface. Therefore, FMU is the main use case for this data model.

There are plenty of other data models in play in the complex world of subsurface predictive modeling. Each software applies its own data model, and in FMU this encompasses multiple different systems. Similarly, there are other data models in the larger scope where FMU workflows represent one out of many providers/consumers of data. A significant motivation for defining this data model is to ensure consistency towards other systems and enable stable conditions for integration.

fmu-dataio has three important roles in this context:

- Be a translating layer between the data models of individual software packages and the FMU results data model.
- Enable fully-automated materialization of metadata during FMU runs (hundreds of thousands of files being made).
- Abstract the FMU results data model through Python methods and functions, allowing them to be embedded into other systems - helping maintain a centralized definition of this data model.

### The parent/child principle

In the FMU results data model, the traditional hierarchy of an FMU setup is not continued. An individual file produced by an FMU workflow and exported to disk can be seen in relation to a hierarchy looking something like this:

`case > iteration > realization > file`

Many reading this will instinctively disagree with this definition, and significant confusion arises from trying to have meaningful discussions around this. There is no unified definition of this hierarchy (despite many *claiming to have* such a definition).
In the FMU results data model, this hierarchy is flattened down to two levels: the parent (*case*) and children of that parent (*files*). From this, it follows that the most fundamental definition in this context is a *case*. To a large degree, this definition belongs to the ERT workflow manager in the FMU context. For now, however, the case definitions are extracted by proxy from the file structure and from arguments passed to fmu-dataio.

Significant confusion can *also* arise from discussing the definition of a case, and the validity of this hierarchy, of course. But the consensus (albeit probably a local minimum) is that this serves the needs.

Each file produced *in relation to* an FMU case (meaning *before*, *during* or *after*) is tagged with information about the case - signalling that *this entity* belongs to *this case*. It is not the intention of the FMU results data model to maintain *all* information about a case, and in the future it is expected that ERT will serve case information beyond the basics.

```{eval-rst}
.. note::

   **Dot-annotation** - we like it and use it. This is what it means:

   The metadata structure is a dictionary-like structure, e.g.

   .. code-block:: json

      {
          "myfirstkey": {
              "mykey": "myvalue",
              "anotherkey": "anothervalue"
          }
      }
```

Annotating tracks along a dictionary can be tricky. With dot-annotation, we can refer to `mykey` in the example above as `myfirstkey.mykey`. This will be a pointer to `myvalue` in this case. You will see dot-annotation in the explanations of the various metadata blocks below. Now you know what it means!

### Weaknesses

**uniqueness**
The data model currently has challenges with respect to ensuring uniqueness. Uniqueness is a challenge in this context, as a centralized data model cannot (and should not!) dictate or define in detail which data an FMU user should be able to export from local workflows.

**understanding validation errors**
When validating against the current schema, understanding the reasons for non-validation can be tricky. The root cause of this is the use of conditional logic in the schemas - a functionality JSON Schema is not designed for. See Logical rules below.

### Logical rules

The schema contains some logical rules which are applied during validation. These are rules of type "if this, then that". They are, however, not explicitly written (nor readable) as such directly. This type of logic is implemented in the schema by explicitly generating subschemas that A) are only valid for specific conditions, and B) contain requirements for that specific situation. In this manner, one can ensure that if a specific condition is met, the associated requirements for that condition are used.

Example:

```json
"oneOf": [
  {
    "$comment": "Conditional schema A - 'if class == case make myproperty required'",
    "required": [
      "myproperty"
    ],
    "properties": {
      "class": {
        "enum": ["case"]
      },
      "myproperty": {
        "type": "string",
        "example": "sometext"
      }
    }
  },
  {
    "$comment": "Conditional schema B - 'if class != case do NOT make myproperty required'",
    "properties": {
      "myproperty": {
        "type": "string",
        "example": "sometext"
      }
    }
  }
]
```

For metadata describing a `case`, requirements are different compared to metadata describing data objects.

For selected contents, a content-specific block under `data` is required. This is implemented for `fluid_contact`, `field_outline` and `seismic`.

## Validation of data

When fmu-dataio exports data from FMU workflows, it produces a pair of data + metadata. The two are considered one entity. Data consumers who wish to validate the correct match of data and metadata can do so by verifying recreation of `file.checksum_md5` on the data object only. Metadata is not considered when generating the checksum.

This checksum is the string representation of the hash created using RSA's `MD5` algorithm.
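
The check described here can be sketched with the Python standard library alone. This is a minimal illustration, not fmu-dataio's own implementation: the helper names `md5_of_file` and `matches_metadata` are hypothetical, the metadata is assumed to be already loaded into a dictionary, and the stored value is assumed to be the hexadecimal digest of the exported file's bytes.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        # Read fixed-size chunks so large files are never fully loaded in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_metadata(data_path: Path, metadata: dict) -> bool:
    """Compare the recomputed digest with `file.checksum_md5` from the metadata.

    Assumption: the stored checksum is a hex digest string. Only the data
    file is hashed; the metadata file itself is not part of the checksum.
    """
    expected = metadata["file"]["checksum_md5"]
    return md5_of_file(data_path) == expected.lower()
```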
The checksum is computed from the _file_ that fmu-dataio exported. In most cases, this is the same file that is provided to the consumer. However, there are some exceptions:

- Seismic data may be transformed to other formats when stored out of FMU context and the checksum may be invalid.

## Changes and revisions

The only constant is change, as we know, and in the case of the FMU results data model - definitely so. The learning component here is huge, and there will be iterations. This poses a challenge, given that there are existing dependencies on top of this data model already, and more are arriving.

To handle this, two important concepts have been introduced.

1. **Versioning**. The current version of the FMU metadata is **{{ FmuResultsSchema.VERSION }}**.
2. **Contractual attributes**. Within the FMU ecosystem, we need to retain the ability to do rapid changes to the data model. As we are in early days, unknowns will become knowns and unknown unknowns will become known unknowns. However, from the outside perspective some stability is required. Therefore, we have labelled some key attributes as *contractual*. They are listed at the top of the schema. This is not to say that they will never change - but they should not change erratically, and when we need to change them, this needs to be subject to alignment.

### Contractual attributes

The following attributes are contractual:

{{ FmuResultsSchema.contractual }}

## Metadata example

Expand below to see a full example of valid metadata for a surface exported from FMU.

```{eval-rst}
.. toggle::

   .. literalinclude:: ../../../examples/0.8.0/surface_depth.yml
      :language: yaml
```

You will find more examples in the [fmu-dataio github repository](https://github.com/equinor/fmu-dataio/tree/main/examples/0.8.0).

## FAQ

We won't claim that these questions are really very *frequently* asked, but these are some key questions you may have along the way.

**My existing FMU workflow does not produce any metadata. Now I am told that it has to.
What do I do?**
First step: Start using fmu-dataio in your workflow. You will get a lot for free using it; amongst other things, metadata will start to appear from your workflow. To get started with fmu-dataio, see [the overview section](../overview).

**This data model is not what I would have chosen. How can I change it?**
The FMU community (almost always) builds what the FMU community wants. The first step would be to define what you are unhappy with, preferably formulated as an issue in the [fmu-dataio github repository](https://github.com/equinor/fmu-dataio).

**This data model allows me to create a smashing data visualisation component, but I fear that it is so immature that it will not be stable - will it change all the time?**
Yes, and no. It is definitely experimental and these are early days. Therefore, changes will occur as learning is happening. Part of that learning comes from development of components utilizing the data model, so your feedback may contribute to evolving this data model. However, you should not expect erratic changes. The concept of contractual attributes is introduced for this exact purpose. We have also chosen to version the metadata - partly to clearly separate from previous versions, but also to allow smooth evolution going forward. We don't yet know *exactly* how this will be done in practice, but perhaps you will tell us!