## Key problems of data-heavy R&D

The complexity of modern R&D data often keeps teams from realizing the
scientific progress it promises.

Here, we list key problems we see and how we think about solving them.

### Data can't be accessed

| Problem | Description | Solution |
| --- | --- | --- |
| *Object storage.* | Data in object storage can't be queried. | Index *observations* and *variables* and link them in a query database. |
| *Pile of data.* | Data can't be accessed as it's not structured and siloed in fragmented infrastructure. | Structure data both by biological entities and by provenance with one interface across storage and database backends. |
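
To make the indexing idea concrete, here is a minimal sketch in plain Python. It is not LaminDB's API: the file paths, table names, and helper functions are assumptions made up for illustration, and an in-memory SQLite database stands in for the query database.

```python
import sqlite3

# Hypothetical catalog of files in object storage (paths and labels are made up).
FILES = {
    "s3://bucket/run1.parquet": {
        "observations": ["cell_1", "cell_2"],
        "variables": ["CD8A", "CD4"],
    },
    "s3://bucket/run2.parquet": {
        "observations": ["cell_3"],
        "variables": ["CD4", "FOXP3"],
    },
}


def build_index(files):
    """Create an in-memory SQLite index linking files to observations and variables."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE file_observation (path TEXT, observation TEXT)")
    con.execute("CREATE TABLE file_variable (path TEXT, variable TEXT)")
    for path, content in files.items():
        con.executemany(
            "INSERT INTO file_observation VALUES (?, ?)",
            [(path, obs) for obs in content["observations"]],
        )
        con.executemany(
            "INSERT INTO file_variable VALUES (?, ?)",
            [(path, var) for var in content["variables"]],
        )
    con.commit()
    return con


def files_measuring(con, variable):
    """Return the paths of all files that contain the given variable."""
    rows = con.execute(
        "SELECT DISTINCT path FROM file_variable WHERE variable = ? ORDER BY path",
        (variable,),
    )
    return [row[0] for row in rows]


index = build_index(FILES)
print(files_measuring(index, "CD4"))
```

The files themselves stay in object storage. Only the small index is queried, so a question like "which files measured CD4?" no longer requires opening every file.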

### Data can't be accessed at scale

| Problem | Description | Solution |
| --- | --- | --- |
| *Anecdotal data.* | Data can't be accessed at scale as no viable programmatic interfaces exist. | API-first platform. |
| *Cross-storage integration.* | Molecular (high-dimensional) data can't be efficiently integrated with phenotypic (low-dimensional) data. | Index molecular data with the same biological entities as phenotypic data. Provide connectors for low-dimensional data management systems (ELN & LIMS). |
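
As a rough illustration of indexing molecular data by the same entities as phenotypic data, consider the sketch below. All names, paths, and records are hypothetical, and the two dictionaries merely stand in for an ELN/LIMS export and an index over files in object storage.

```python
# Hypothetical phenotypic records, as they might be exported from an ELN or LIMS.
PHENOTYPES = {
    "sample_A": {"tissue": "lung", "treatment": "drug_X"},
    "sample_B": {"tissue": "lung", "treatment": "control"},
}

# Hypothetical index of molecular datasets, keyed by the same sample identifiers.
MOLECULAR_INDEX = {
    "sample_A": "s3://bucket/rnaseq/sample_A.parquet",
    "sample_B": "s3://bucket/rnaseq/sample_B.parquet",
    "sample_C": "s3://bucket/rnaseq/sample_C.parquet",
}


def molecular_files_where(**criteria):
    """Return molecular file paths for samples whose phenotype matches all criteria."""
    return sorted(
        path
        for sample, path in MOLECULAR_INDEX.items()
        if sample in PHENOTYPES
        and all(PHENOTYPES[sample].get(key) == value for key, value in criteria.items())
    )


print(molecular_files_where(treatment="drug_X"))
```

Because both sides share the sample identifier, a low-dimensional filter selects high-dimensional files without loading them. In this sketch, samples without phenotypic records (`sample_C`) are simply excluded.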

### Scientific results aren't solid

| Problem | Description | Solution |
| --- | --- | --- |
| *Stand on solid ground.* | Key analytics results cannot be linked to supporting data as too many processing steps are involved. | Provide full data provenance. |
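
A toy example of what data provenance enables is sketched below. The lineage records, file names, and script names are invented for illustration, and the traversal is a generic graph walk, not a description of how LaminDB stores or queries provenance.

```python
# Hypothetical lineage records: each derived artifact names the step that
# produced it and the artifacts that step consumed.
LINEAGE = {
    "figure_3.png": ("plot.py", ["de_results.csv"]),
    "de_results.csv": ("differential_expression.py", ["counts_normalized.parquet"]),
    "counts_normalized.parquet": (
        "normalize.py",
        ["counts_raw.parquet", "sample_sheet.csv"],
    ),
}


def trace(artifact, lineage):
    """Return every (step, input) pair upstream of `artifact`."""
    pairs = []
    seen = set()
    stack = [artifact]
    while stack:
        current = stack.pop()
        # Skip artifacts already visited and artifacts no recorded step produced.
        if current in seen or current not in lineage:
            continue
        seen.add(current)
        step, inputs = lineage[current]
        for upstream in inputs:
            pairs.append((step, upstream))
            stack.append(upstream)
    return pairs


def raw_sources(artifact, lineage):
    """Return the upstream artifacts that were not produced by any recorded step."""
    return sorted({inp for _, inp in trace(artifact, lineage) if inp not in lineage})


print(raw_sources("figure_3.png", LINEAGE))
```

With such records in place, a figure in a report can be traced back through every processing step to the raw files that support it.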

### Collaboration across organizations is hard

| Problem | Description | Solution |
| --- | --- | --- |
| *Siloed infrastructure.* | Data can't be easily shared across organizations. | Federated collaboration hub on distributed infrastructure. |
| *Siloed semantics.* | External data can't be mapped onto in-house data and vice versa. | Provide a curation and ingestion API, and operate on open-source data models that can be adopted by any organization. |

### R&D could be more effective

| Problem | Description | Solution |
| --- | --- | --- |
| *Optimal decision making.* | There is no framework for tracking decision making in complex R&D teams. | Graph of data flow in an R&D team, including scientists, computation, decisions, and predictions. Unlike workflow frameworks, LaminDB creates an emergent graph. |
| *Dry lab is not integrated.* | Data platforms offer no adequate interface for the dry lab. | API-first with data scientist needs in mind. |
| *Support learning.* | There is no support for the learning-from-data cycle. | Support data models across the full lab cycle, including measured → relevant → derived features. Manage knowledge through rich semantic models that map high-dimensional data. |

### No support for basic R&D operations

| Problem | Description | Solution |
| --- | --- | --- |
| *Development data.* | Data associated with assay development can't be ingested as data models are too rigid. | Allow partial integrity in LaminDB's implementation of a data lakehouse: ingest data of any curation level and label it with corresponding QC flags. |
| *Corrupted data.* | Data is often corrupted. | Full provenance makes it possible to trace corruption back to its origin and write a simple fix, typically in the form of an ingestion constraint. |
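
The idea of partial integrity can be sketched as follows: every row is ingested, and rows that violate a constraint are labeled instead of rejected. The record fields, the vocabulary, and the constraint names are all hypothetical, and none of this is LaminDB's actual interface.

```python
from dataclasses import dataclass, field

# Hypothetical reference vocabulary used by one of the constraints.
KNOWN_CELL_TYPES = {"T cell", "B cell"}


@dataclass
class Record:
    sample_id: str
    cell_type: str
    qc_flags: list = field(default_factory=list)


# Each constraint maps a flag name to a check that returns True when the record passes.
CONSTRAINTS = {
    "missing_sample_id": lambda record: bool(record.sample_id),
    "unknown_cell_type": lambda record: record.cell_type in KNOWN_CELL_TYPES,
}


def ingest(rows, constraints):
    """Accept every row, attaching a QC flag for each constraint it violates."""
    records = []
    for row in rows:
        record = Record(**row)
        record.qc_flags = [
            flag for flag, passes in constraints.items() if not passes(record)
        ]
        records.append(record)
    return records


rows = [
    {"sample_id": "S1", "cell_type": "T cell"},
    {"sample_id": "", "cell_type": "T-cell"},
]
records = ingest(rows, CONSTRAINTS)
curated = [record for record in records if not record.qc_flags]
print([record.qc_flags for record in records])
```

Downstream analyses can then restrict themselves to `curated` records, while flagged development data remains queryable. Fixing a newly discovered source of corruption amounts to adding one more entry to `CONSTRAINTS`.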

### Building a data platform is hard

| Problem | Description | Solution |
| --- | --- | --- |
| *Aligning data models.* | Data models are hard to align across interdisciplinary stakeholders. | Lamin's data model templates cover 90% of cases; the remaining 10% can be configured. |
| *Lock-in.* | Existing platforms lock organizations into specific cloud infrastructure. | Open-source and multi-cloud stack with zero lock-in danger. |
| *Migrations are a pain.* | Migrating data models in a fast-paced R&D environment can be prohibitive. | LaminDB's schema modules migrate automatically. |

*Note: This problem statement was originally published as part of the
"lamindb" docs. It was linked from lamin.ai/about and moved through
several repositories with small edits: lamindb, lamin-about, and
lamin-docs. It was moved to the blog page on 2023-08-11.*