How best to index this table?

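I have a large table (> 2000 M rows) in PostgreSQL that must be queried as quickly as possible. It represents measurements of gene expression in biological samples. The catch is that sometimes the measurement is made directly on the gene (then "probe" is NULL), and sometimes through a "probe" for the gene (then "gene" is still set). A gene can have several probes. No other table contains the gene-probe relationship.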

CREATE TABLE "gene_measurements" (
  "gene" INTEGER NOT NULL REFERENCES "genes" ON DELETE CASCADE,
  "sample" INTEGER NOT NULL REFERENCES "samples" ON DELETE CASCADE,
  "probe" INTEGER REFERENCES "probes" ON DELETE CASCADE,
  "value" REAL NOT NULL
);

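Common queries include getting the expression of all genes in a given sample, getting the expression of a given gene/probe in all samples, or getting the expression of a given gene/probe in a given sample.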

For now, I have the following covering indexes. They work fine, but they take up a lot of space.

CREATE INDEX "gene_measurements_gene_sample_value_index" ON "gene_measurements" ("gene", "sample", "value");
CREATE INDEX "gene_measurements_sample_gene_value_index" ON "gene_measurements" ("sample", "gene", "value");
CREATE INDEX "gene_measurements_sample_probe_value_index" ON "gene_measurements" ("sample", "probe", "value");
CREATE INDEX "gene_measurements_probe_sample_value_index" ON "gene_measurements" ("probe", "sample", "value");

Is there something clever I can do to get a neater and/or smaller implementation while keeping the speed? Thanks!

若弱C

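You can pick any trade-off point between time and space. Right now, you have indexed the whole table four times. That obviously takes up a lot of space.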
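
To see how much space is actually at stake, the sizes can be read from PostgreSQL itself. This is a minimal sketch, assuming the table lives in a schema on your search path; it only reads statistics and changes nothing.

```sql
-- Size of the table itself (heap only) versus all of its indexes combined.
SELECT pg_size_pretty(pg_relation_size('gene_measurements')) AS table_size,
       pg_size_pretty(pg_indexes_size('gene_measurements'))  AS all_indexes_size;

-- Size and usage count of each index, largest first.
-- idx_scan is the number of index scans since statistics were last reset.
SELECT indexrelname                                 AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'gene_measurements'
ORDER BY pg_relation_size(indexrelid) DESC;
```

An index that is large but has a low idx_scan count would be a natural first candidate for removal.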

You could drop some of the data from the indexes, at the cost of slower queries:

  • For example, you could remove value from all indexes. A query then has to fetch value from the table itself in addition to the index lookup.
  • You could also remove some indexes altogether. For example, depending on your data, you might remove either (sample, gene) or (sample, probe). That removes one complete copy of the data, while the remaining index can still use its leading sample column for queries with conditions on sample and the removed column. Again, the case you removed is then not as fast as before.
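
The two options above could look roughly like this. It is only a sketch: the new index name, the sample id 42 and the probe id 7 are made-up placeholders, and which index is safe to drop depends on your actual query mix.

```sql
-- Option 1: keep the column order but leave out "value".
-- Shown for one index only; the other three would follow the same pattern.
-- The index name below is hypothetical.
CREATE INDEX "gene_measurements_gene_sample_index"
    ON "gene_measurements" ("gene", "sample");

-- Option 2: drop one of the two sample-leading indexes entirely.
DROP INDEX "gene_measurements_sample_probe_value_index";

-- Check what the planner does with a typical query afterwards.
-- "Index Only Scan" means the index alone answered the query;
-- "Index Scan" or "Bitmap Heap Scan" means the table was visited too.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "probe", "value"
FROM "gene_measurements"
WHERE "sample" = 42 AND "probe" = 7;
```

Note that EXPLAIN with ANALYZE really executes the query, so on a table of this size it is worth trying it with selective conditions first.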

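If your goal is the lowest possible run time at any cost, none of these suggestions will suit you. I don't think there is anything in the PostgreSQL universe that would solve your problem.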

Since your data is simple and your use cases are restricted, you could consider solutions other than PostgreSQL. In essence, you only need a B-tree data structure (or several of them). There are other tools for building one, e.g., QDBM. You would still need one such structure for each of your select types, so I would not expect the achievable space savings to be very high: you could get rid of the table data but none of the indexes. Assuming the table and each of the four indexes are roughly the same size, that would save about 1/5 of your current storage, at the cost of restricted functionality and additional complexity in your software ecosystem.

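You have to decide what you need, what you want, and what you are willing to sacrifice for those goals. Given what I have written here, I would stick with PostgreSQL.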
