PostgreSQL под капотом. Часть 6. Сложный SELECT запрос / forpes.ru

Главная
PostgreSQL под капотом. Часть 6. Сложный SELECT запрос

PostgreSQL под капотом. Часть 6. Сложный SELECT запрос +13

19.05.2023 07:30

AshBlade 0 1900 Источник

Приветствую!

Сегодня мы рассмотрим путь выполнения сложного SELECT запроса. Выполнение простого запроса было рассмотрено в предыдущей статье. На этот раз запрос выглядит следующим образом:

SELECT u.name 
FROM users AS u
WHERE u.age > 24
GROUP BY u.name
HAVING COUNT(*) = 1
ORDER BY u.name
LIMIT 50

Различия, при сравнении с выполнением простого запроса, имеются в 3 местах: парсинг, создание плана и выполнение. Поэтому рассмотрим только эти случаи.

Парсинг

Различия начинаются на этапе переписывания.

Напомню, что входная точка переписывания запроса - transformStmt (src/backend/parser/analyze.c). Эта функция различает основные типы запросов и запускает соответствующий обработчик. Для SELECT это снова transformSelectStmt (src/backend/parser/analyze.c).

FROM users AS u

Для обработки выражения FROM вызывается transformFromClause (src/backend/parser/parse_clause.c). Обработка будет вестись точно так же как и в простом запросе.

SELECT u.name

В простом запросе мы получали все столбцы, используя астериск (*). Для этого использовалась отдельная функция ExpandColumnRefStar (src/backend/parser/parse_target.c).

В случае когда в запросе нет астериска, вызывается общая функция transformTargetEntry (src/backend/parser/parse_target.c) - не важно столбец, выражение, вызов функции или константа.

/*
 * transformTargetEntry()
 *	Transform any ordinary "expression-type" node into a targetlist entry.
 *	This is exported so that parse_clause.c can generate targetlist entries
 *	for ORDER/GROUP BY items that are not already in the targetlist.
 *
 * node		the (untransformed) parse tree for the value expression.
 * expr		the transformed expression, or NULL if caller didn't do it yet.
 * exprKind expression kind (EXPR_KIND_SELECT_TARGET, etc)
 * colname	the column name to be assigned, or NULL if none yet set.
 * resjunk	true if the target should be marked resjunk, ie, it is not
 *			wanted in the final projected tuple.
 */
TargetEntry *
transformTargetEntry(ParseState *pstate, Node *node, Node *expr, ParseExprKind exprKind, char *colname, bool resjunk)

Внутри она состоит из 2 частей (вызовов функций):

transformExpr - создание выражения (src/backend/parser/parse_expr.c)
FigureColname - определение названия столбца (src/backend/parser/parse_target.c)

transformExpr создает объект ParseState и передает ее transformExprRecurse - настоящему парсеру.

Внутри он запускает обработчик на основании тэга узла. Для u.name тэг - T_ColumnRef (выражение - ссылка на столбец). Для него запустится обработчик transformColumnRef

На вход он получает созданную структуру состояния парсинга ParseState и структуру столбца ColumnRef

/*
 * ColumnRef - specifies a reference to a column, or possibly a whole tuple
 *
 * The "fields" list must be nonempty.  It can contain String nodes
 * (representing names) and A_Star nodes (representing occurrence of a '*').
 * Currently, A_Star must appear only as the last list element --- the grammar
 * is responsible for enforcing this!
 *
 * Note: any container subscripting or selection of fields from composite columns
 * is represented by an A_Indirection node above the ColumnRef.  However,
 * for simplicity in the normal case, initial field selection from a table
 * name is represented within ColumnRef and not by adding A_Indirection.
 */
typedef struct ColumnRef
{
	NodeTag		type;
	List	   *fields;			/* field names (String nodes) or A_Star */
	int			location;		/* token location, or -1 if unknown */
} ColumnRef;

/*
 * State information used during parse analysis
 *
 * parentParseState: NULL in a top-level ParseState.  When parsing a subquery,
 * links to current parse state of outer query.
 *
 * p_sourcetext: source string that generated the raw parsetree being
 * analyzed, or NULL if not available.  (The string is used only to
 * generate cursor positions in error messages: we need it to convert
 * byte-wise locations in parse structures to character-wise cursor
 * positions.)
 *
 * p_rtable: list of RTEs that will become the rangetable of the query.
 * Note that neither relname nor refname of these entries are necessarily
 * unique; searching the rtable by name is a bad idea.
 *
 * p_joinexprs: list of JoinExpr nodes associated with p_rtable entries.
 * This is one-for-one with p_rtable, but contains NULLs for non-join
 * RTEs, and may be shorter than p_rtable if the last RTE(s) aren't joins.
 *
 * p_joinlist: list of join items (RangeTblRef and JoinExpr nodes) that
 * will become the fromlist of the query's top-level FromExpr node.
 *
 * p_namespace: list of ParseNamespaceItems that represents the current
 * namespace for table and column lookup.  (The RTEs listed here may be just
 * a subset of the whole rtable.  See ParseNamespaceItem comments below.)
 *
 * p_lateral_active: true if we are currently parsing a LATERAL subexpression
 * of this parse level.  This makes p_lateral_only namespace items visible,
 * whereas they are not visible when p_lateral_active is FALSE.
 *
 * p_ctenamespace: list of CommonTableExprs (WITH items) that are visible
 * at the moment.  This is entirely different from p_namespace because a CTE
 * is not an RTE, rather "visibility" means you could make an RTE from it.
 *
 * p_future_ctes: list of CommonTableExprs (WITH items) that are not yet
 * visible due to scope rules.  This is used to help improve error messages.
 *
 * p_parent_cte: CommonTableExpr that immediately contains the current query,
 * if any.
 *
 * p_target_relation: target relation, if query is INSERT/UPDATE/DELETE/MERGE
 *
 * p_target_nsitem: target relation's ParseNamespaceItem.
 *
 * p_is_insert: true to process assignment expressions like INSERT, false
 * to process them like UPDATE.  (Note this can change intra-statement, for
 * cases like INSERT ON CONFLICT UPDATE.)
 *
 * p_windowdefs: list of WindowDefs representing WINDOW and OVER clauses.
 * We collect these while transforming expressions and then transform them
 * afterwards (so that any resjunk tlist items needed for the sort/group
 * clauses end up at the end of the query tlist).  A WindowDef's location in
 * this list, counting from 1, is the winref number to use to reference it.
 *
 * p_expr_kind: kind of expression we're currently parsing, as per enum above;
 * EXPR_KIND_NONE when not in an expression.
 *
 * p_next_resno: next TargetEntry.resno to assign, starting from 1.
 *
 * p_multiassign_exprs: partially-processed MultiAssignRef source expressions.
 *
 * p_locking_clause: query's FOR UPDATE/FOR SHARE clause, if any.
 *
 * p_locked_from_parent: true if parent query level applies FOR UPDATE/SHARE
 * to this subquery as a whole.
 *
 * p_resolve_unknowns: resolve unknown-type SELECT output columns as type TEXT
 * (this is true by default).
 *
 * p_hasAggs, p_hasWindowFuncs, etc: true if we've found any of the indicated
 * constructs in the query.
 *
 * p_last_srf: the set-returning FuncExpr or OpExpr most recently found in
 * the query, or NULL if none.
 *
 * p_pre_columnref_hook, etc: optional parser hook functions for modifying the
 * interpretation of ColumnRefs and ParamRefs.
 *
 * p_ref_hook_state: passthrough state for the parser hook functions.
 */
struct ParseState
{
	ParseState *parentParseState;	/* stack link */
	const char *p_sourcetext;	/* source text, or NULL if not available */
	List	   *p_rtable;		/* range table so far */
	List	   *p_joinexprs;	/* JoinExprs for RTE_JOIN p_rtable entries */
	List	   *p_joinlist;		/* join items so far (will become FromExpr
								 * node's fromlist) */
	List	   *p_namespace;	/* currently-referenceable RTEs (List of
								 * ParseNamespaceItem) */
	bool		p_lateral_active;	/* p_lateral_only items visible? */
	List	   *p_ctenamespace; /* current namespace for common table exprs */
	List	   *p_future_ctes;	/* common table exprs not yet in namespace */
	CommonTableExpr *p_parent_cte;	/* this query's containing CTE */
	Relation	p_target_relation;	/* INSERT/UPDATE/DELETE/MERGE target rel */
	ParseNamespaceItem *p_target_nsitem;	/* target rel's NSItem, or NULL */
	bool		p_is_insert;	/* process assignment like INSERT not UPDATE */
	List	   *p_windowdefs;	/* raw representations of window clauses */
	ParseExprKind p_expr_kind;	/* what kind of expression we're parsing */
	int			p_next_resno;	/* next targetlist resno to assign */
	List	   *p_multiassign_exprs;	/* junk tlist entries for multiassign */
	List	   *p_locking_clause;	/* raw FOR UPDATE/FOR SHARE info */
	bool		p_locked_from_parent;	/* parent has marked this subquery
										 * with FOR UPDATE/FOR SHARE */
	bool		p_resolve_unknowns; /* resolve unknown-type SELECT outputs as
									 * type text */

	QueryEnvironment *p_queryEnv;	/* curr env, incl refs to enclosing env */

	/* Flags telling about things found in the query: */
	bool		p_hasAggs;
	bool		p_hasWindowFuncs;
	bool		p_hasTargetSRFs;
	bool		p_hasSubLinks;
	bool		p_hasModifyingCTE;

	Node	   *p_last_srf;		/* most recent set-returning func/op found */

	/*
	 * Optional hook functions for parser callbacks.  These are null unless
	 * set up by the caller of make_parsestate.
	 */
	PreParseColumnRefHook p_pre_columnref_hook;
	PostParseColumnRefHook p_post_columnref_hook;
	ParseParamRefHook p_paramref_hook;
	CoerceParamHook p_coerce_param_hook;
	void	   *p_ref_hook_state;	/* common passthrough link for above */
};

Внутри проверяется допустимость использования ссылки на столбец. Проверка выполняется с помощью проверки типа выражения узла. Тип выражения определяется перечислением ParseExprKind(src/include/parser/parse_node.h)

/*
 * Expression kinds distinguished by transformExpr().  Many of these are not
 * semantically distinct so far as expression transformation goes; rather,
 * we distinguish them so that context-specific error messages can be printed.
 *
 * Note: EXPR_KIND_OTHER is not used in the core code, but is left for use
 * by extension code that might need to call transformExpr().  The core code
 * will not enforce any context-driven restrictions on EXPR_KIND_OTHER
 * expressions, so the caller would have to check for sub-selects, aggregates,
 * window functions, SRFs, etc if those need to be disallowed.
 */
typedef enum ParseExprKind
{
	EXPR_KIND_NONE = 0,			/* "not in an expression" */
	EXPR_KIND_OTHER,			/* reserved for extensions */
	EXPR_KIND_JOIN_ON,			/* JOIN ON */
	EXPR_KIND_JOIN_USING,		/* JOIN USING */
	EXPR_KIND_FROM_SUBSELECT,	/* sub-SELECT in FROM clause */
	EXPR_KIND_FROM_FUNCTION,	/* function in FROM clause */
	EXPR_KIND_WHERE,			/* WHERE */
	EXPR_KIND_HAVING,			/* HAVING */
	EXPR_KIND_FILTER,			/* FILTER */
	EXPR_KIND_WINDOW_PARTITION, /* window definition PARTITION BY */
	EXPR_KIND_WINDOW_ORDER,		/* window definition ORDER BY */
	EXPR_KIND_WINDOW_FRAME_RANGE,	/* window frame clause with RANGE */
	EXPR_KIND_WINDOW_FRAME_ROWS,	/* window frame clause with ROWS */
	EXPR_KIND_WINDOW_FRAME_GROUPS,	/* window frame clause with GROUPS */
	EXPR_KIND_SELECT_TARGET,	/* SELECT target list item */
	EXPR_KIND_INSERT_TARGET,	/* INSERT target list item */
	EXPR_KIND_UPDATE_SOURCE,	/* UPDATE assignment source item */
	EXPR_KIND_UPDATE_TARGET,	/* UPDATE assignment target item */
	EXPR_KIND_MERGE_WHEN,		/* MERGE WHEN [NOT] MATCHED condition */
	EXPR_KIND_GROUP_BY,			/* GROUP BY */
	EXPR_KIND_ORDER_BY,			/* ORDER BY */
	EXPR_KIND_DISTINCT_ON,		/* DISTINCT ON */
	EXPR_KIND_LIMIT,			/* LIMIT */
	EXPR_KIND_OFFSET,			/* OFFSET */
	EXPR_KIND_RETURNING,		/* RETURNING */
	EXPR_KIND_VALUES,			/* VALUES */
	EXPR_KIND_VALUES_SINGLE,	/* single-row VALUES (in INSERT only) */
	EXPR_KIND_CHECK_CONSTRAINT, /* CHECK constraint for a table */
	EXPR_KIND_DOMAIN_CHECK,		/* CHECK constraint for a domain */
	EXPR_KIND_COLUMN_DEFAULT,	/* default value for a table column */
	EXPR_KIND_FUNCTION_DEFAULT, /* default parameter value for function */
	EXPR_KIND_INDEX_EXPRESSION, /* index expression */
	EXPR_KIND_INDEX_PREDICATE,	/* index predicate */
	EXPR_KIND_STATS_EXPRESSION, /* extended statistics expression */
	EXPR_KIND_ALTER_COL_TRANSFORM,	/* transform expr in ALTER COLUMN TYPE */
	EXPR_KIND_EXECUTE_PARAMETER,	/* parameter value in EXECUTE */
	EXPR_KIND_TRIGGER_WHEN,		/* WHEN condition in CREATE TRIGGER */
	EXPR_KIND_POLICY,			/* USING or WITH CHECK expr in policy */
	EXPR_KIND_PARTITION_BOUND,	/* partition bound expression */
	EXPR_KIND_PARTITION_EXPRESSION, /* PARTITION BY expression */
	EXPR_KIND_CALL_ARGUMENT,	/* procedure argument in CALL */
	EXPR_KIND_COPY_WHERE,		/* WHERE condition in COPY FROM */
	EXPR_KIND_GENERATED_COLUMN, /* generation expression for a column */
	EXPR_KIND_CYCLE_MARK,		/* cycle mark value */
} ParseExprKind;

Единственные недопустимые значения - EXPR_KIND_COLUMN_DEFAULT и EXPR_KIND_PARTITION_BOUND.

switch (pstate->p_expr_kind)
{
    // ...
    case EXPR_KIND_COLUMN_DEFAULT:
        err = _("cannot use column reference in DEFAULT expression");
        break;
    case EXPR_KIND_PARTITION_BOUND:
        err = _("cannot use column reference in partition bound expression");
        break;
    // ...
}

Стоит обратить внимание на алгоритм парсинга ссылок на столбцы.

Для обращений к столбцам таблиц может быть использовано несколько вариантов. Например, A.B.C или A.*

Всего может быть 7 вариантов действий. Каждый описан в комментарии функции

/*----------
 * The allowed syntaxes are:
 *
 * A		First try to resolve as unqualified column name;
 *			if no luck, try to resolve as unqualified table name (A.*).
 * A.B		A is an unqualified table name; B is either a
 *			column or function name (trying column name first).
 * A.B.C	schema A, table B, col or func name C.
 * A.B.C.D	catalog A, schema B, table C, col or func D.
 * A.*		A is an unqualified table name; means whole-row value.
 * A.B.*	whole-row value of table B in schema A.
 * A.B.C.*	whole-row value of table C in schema B in catalog A.
 *
 * We do not need to cope with bare "*"; that will only be accepted by
 * the grammar at the top level of a SELECT list, and transformTargetList
 * will take care of it before it ever gets here.  Also, "A.*" etc will
 * be expanded by transformTargetList if they appear at SELECT top level,
 * so here we are only going to see them as function or operator inputs.
 *
 * Currently, if a catalog name is given then it must equal the current
 * database name; we check it here and then discard it.
 *----------
 */

Для определения используемого варианта, используется поле fields структуры ColumnRef. Это список содержащий названия A, B, C, D (из комментария).

В зависимости от длины списка определяется и метод обработки. Например, если длина списка 2, то проверяются варианты A.B и A.*

switch (list_length(cref->fields))
{
  // ...
  case 2:
  {
      Node	   *field1 = (Node *) linitial(cref->fields);
      Node	   *field2 = (Node *) lsecond(cref->fields);

      Assert(IsA(field1, String));
      relname = strVal(field1);

      /* Locate the referenced nsitem */
      nsitem = refnameNamespaceItem(pstate, nspname, relname,
                                    cref->location,
                                    &levels_up);
      if (nsitem == NULL)
      {
          crerr = CRERR_NO_RTE;
          break;
      }

      /* Whole-row reference? */
      if (IsA(field2, A_Star))
      {
          node = transformWholeRowRef(pstate, nsitem, levels_up,
                                      cref->location);
          break;
      }

      Assert(IsA(field2, String));
      colname = strVal(field2);

      /* Try to identify as a column of the nsitem */
      node = scanNSItemForColumn(pstate, nsitem, levels_up, colname,
                                 cref->location);
      if (node == NULL)
      {
          /* Try it as a function call on the whole row */
          node = transformWholeRowRef(pstate, nsitem, levels_up,
                                      cref->location);
          node = ParseFuncOrColumn(pstate,
                                   list_make1(makeString(colname)),
                                   list_make1(node),
                                   pstate->p_last_srf,
                                   NULL,
                                   false,
                                   cref->location);
      }
      break;
  }
  // ...
}

Также внутри функции определено вспомогательное перечисление для корректной обработки ошибок. Так в описанном выше случае, если таблица не найдена, то переменной crerr присваивается значение CRERR_NO_RTE

enum
{
    CRERR_NO_COLUMN,
    CRERR_NO_RTE,
    CRERR_WRONG_DB,
    CRERR_TOO_MANY
} crerr = CRERR_NO_COLUMN;

// ...

if (nsitem == NULL)
{
    crerr = CRERR_NO_RTE;
    break;
}

После успешного парсинга столбца нужно определить его отображаемое название. Если мы указали псевдоним ( “AS …”), то название уже задано, иначе пытаемся вывести. Для этого используется функция FigureColname (src/backend/parser/parse_target.c)

Название создается на основании типа узла. В случае столбца выбирается самое последнее название из списка (за исключением *)

switch (nodeTag(node))
{
    case T_ColumnRef:
        {
            char	   *fname = NULL;
            ListCell   *l;
  
            /* find last field name, if any, ignoring "*" */
            foreach(l, ((ColumnRef *) node)->fields)
            {
                Node	   *i = lfirst(l);
  
                if (IsA(i, String))
                    fname = strVal(i);
            }
            if (fname)
            {
                *name = fname;
                return 2;
            }
        }
        break;
    // ...
}

Сейчас мы имеем только названия столбцов, но куда они указывают не понятно. Последняя стадия - добавление OID таблицы, на которую столбец указывает. За это отвечает функция markTargetListOrigin (src/backend/parser/parse_target.c)

WHERE u.age > 24

Дальше идет парсинг предложения WHERE. За это отвечает transformWhereClause

В начале происходит парсинг самого выражения. В условии может быть любое выражение, поэтому для обработки вызывается transformExpr (общая функция парсинга выражений)

Выражение это такой же узел и для него определены свои тэги. Для выражений с бинарными/унарными операторами или SQL предложениями (LIKE, BETWEEN) тег узла равен T_A_Expr. А чтобы выявить подтип, используется дополнительное поле kind типа A_Expr_Kind (src/include/nodes/parsenodes.h)

/*
 * A_Expr - infix, prefix, and postfix expressions
 */
typedef enum A_Expr_Kind
{
	AEXPR_OP,					/* normal operator */
	AEXPR_OP_ANY,				/* scalar op ANY (array) */
	AEXPR_OP_ALL,				/* scalar op ALL (array) */
	AEXPR_DISTINCT,				/* IS DISTINCT FROM - name must be "=" */
	AEXPR_NOT_DISTINCT,			/* IS NOT DISTINCT FROM - name must be "=" */
	AEXPR_NULLIF,				/* NULLIF - name must be "=" */
	AEXPR_IN,					/* [NOT] IN - name must be "=" or "<>" */
	AEXPR_LIKE,					/* [NOT] LIKE - name must be "~~" or "!~~" */
	AEXPR_ILIKE,				/* [NOT] ILIKE - name must be "~~*" or "!~~*" */
	AEXPR_SIMILAR,				/* [NOT] SIMILAR - name must be "~" or "!~" */
	AEXPR_BETWEEN,				/* name must be "BETWEEN" */
	AEXPR_NOT_BETWEEN,			/* name must be "NOT BETWEEN" */
	AEXPR_BETWEEN_SYM,			/* name must be "BETWEEN SYMMETRIC" */
	AEXPR_NOT_BETWEEN_SYM		/* name must be "NOT BETWEEN SYMMETRIC" */
} A_Expr_Kind;

Для бинарной операции (сравнение) это поле равно AEXPR_OP. За его обработку отвечает функция transformAExprOp (src/backend/parser/parse_expr.c). В итоге получается такой порядок

switch (nodeTag(expr))
{
    case T_A_Expr:
	{
		A_Expr	   *a = (A_Expr *) expr;
		switch (a->kind)
		{
            case AEXPR_OP:
				result = transformAExprOp(pstate, a);
    			break;
            // ..
        }
        break;
    // ...
    }
}

Сам transformAExprOp просматривает несколько краевых случаев:

"= NULL"

Самый первый обрабатываемый случай - это трансформация “= NULL” в “IS NULL”. В комментарии поясняется, что это сделано для совместимости с другими языками, нарушающими стандарты SQL

/*
 * Special-case "foo = NULL" and "NULL = foo" for compatibility with
 * standards-broken products (like Microsoft's).  Turn these into IS NULL
 * exprs. (If either side is a CaseTestExpr, then the expression was
 * generated internally from a CASE-WHEN expression, and
 * transform_null_equals does not apply.)
 */
if (Transform_null_equals &&
    list_length(a->name) == 1 &&
    strcmp(strVal(linitial(a->name)), "=") == 0 &&
    (exprIsNullConstant(lexpr) || exprIsNullConstant(rexpr)) &&
    (!IsA(lexpr, CaseTestExpr) && !IsA(rexpr, CaseTestExpr)))
{ 
  // ...
}

ROW op SUBSELECT

Другой особый случай - когда один из операндов подзапрос. Этом случае тип выражения меняется с EXPR_SUBLINK на ROWCOMPARE_SUBLINK. Сам тип представляется перечислением SubLinkType

/*
 * SubLink
 *
 * A SubLink represents a subselect appearing in an expression, and in some
 * cases also the combining operator(s) just above it.  The subLinkType
 * indicates the form of the expression represented:
 *	EXISTS_SUBLINK		EXISTS(SELECT ...)
 *	ALL_SUBLINK			(lefthand) op ALL (SELECT ...)
 *	ANY_SUBLINK			(lefthand) op ANY (SELECT ...)
 *	ROWCOMPARE_SUBLINK	(lefthand) op (SELECT ...)
 *	EXPR_SUBLINK		(SELECT with single targetlist item ...)
 *	MULTIEXPR_SUBLINK	(SELECT with multiple targetlist items ...)
 *	ARRAY_SUBLINK		ARRAY(SELECT with single targetlist item ...)
 *	CTE_SUBLINK			WITH query (never actually part of an expression)
 * For ALL, ANY, and ROWCOMPARE, the lefthand is a list of expressions of the
 * same length as the subselect's targetlist.  ROWCOMPARE will *always* have
 * a list with more than one entry; if the subselect has just one target
 * then the parser will create an EXPR_SUBLINK instead (and any operator
 * above the subselect will be represented separately).
 * ROWCOMPARE, EXPR, and MULTIEXPR require the subselect to deliver at most
 * one row (if it returns no rows, the result is NULL).
 * ALL, ANY, and ROWCOMPARE require the combining operators to deliver boolean
 * results.  ALL and ANY combine the per-row results using AND and OR
 * semantics respectively.
 * ARRAY requires just one target column, and creates an array of the target
 * column's type using any number of rows resulting from the subselect.
 *
 * SubLink is classed as an Expr node, but it is not actually executable;
 * it must be replaced in the expression tree by a SubPlan node during
 * planning.
 *
 * NOTE: in the raw output of gram.y, testexpr contains just the raw form
 * of the lefthand expression (if any), and operName is the String name of
 * the combining operator.  Also, subselect is a raw parsetree.  During parse
 * analysis, the parser transforms testexpr into a complete boolean expression
 * that compares the lefthand value(s) to PARAM_SUBLINK nodes representing the
 * output columns of the subselect.  And subselect is transformed to a Query.
 * This is the representation seen in saved rules and in the rewriter.
 *
 * In EXISTS, EXPR, MULTIEXPR, and ARRAY SubLinks, testexpr and operName
 * are unused and are always null.
 *
 * subLinkId is currently used only for MULTIEXPR SubLinks, and is zero in
 * other SubLinks.  This number identifies different multiple-assignment
 * subqueries within an UPDATE statement's SET list.  It is unique only
 * within a particular targetlist.  The output column(s) of the MULTIEXPR
 * are referenced by PARAM_MULTIEXPR Params appearing elsewhere in the tlist.
 *
 * The CTE_SUBLINK case never occurs in actual SubLink nodes, but it is used
 * in SubPlans generated for WITH subqueries.
 */
typedef enum SubLinkType
{
	EXISTS_SUBLINK,
	ALL_SUBLINK,
	ANY_SUBLINK,
	ROWCOMPARE_SUBLINK,
	EXPR_SUBLINK,
	MULTIEXPR_SUBLINK,
	ARRAY_SUBLINK,
	CTE_SUBLINK					/* for SubPlans only */
} SubLinkType;

Когда обе части - это ROW выражения тоже отдельный случай (ROW op ROW).

Обычный оператор

Точнее сказать, это все остальные случаи. Сюда мы попадаем, если другие условия не сработали.

При исполнении нашего запроса, мы попадаем в последнюю ветку. Внутри каждая часть выражения рекурсивно разбирается (вызов transformExprRecurse):

Левая часть u.name попадает под случай ColumnRef - ссылка на столбец. Она обрабатывается так же как и выше (в SELECT)
Правая часть 24 попадает под случай T_Const : узел - константа. Для разбора констант узлов используется функция make_const. Разбор константы - тот же самый switch/case. Причем парсинг числа - самый простой (всего 4 строки)

/*
 * make_const
 *
 *	Convert an A_Const node (as returned by the grammar) to a Const node
 *	of the "natural" type for the constant.  Note that this routine is
 *	only used when there is no explicit cast for the constant, so we
 *	have to guess what type is wanted.
 *
 *	For string literals we produce a constant of type UNKNOWN ---- whose
 *	representation is the same as cstring, but it indicates to later type
 *	resolution that we're not sure yet what type it should be considered.
 *	Explicit "NULL" constants are also typed as UNKNOWN.
 *
 *	For integers and floats we produce int4, int8, or numeric depending
 *	on the value of the number.  XXX We should produce int2 as well,
 *	but additional cleanup is needed before we can do that; there are
 *	too many examples that fail if we try.
 */
Const *
make_const(ParseState *pstate, A_Const *aconst)
{
  // ...
  switch (nodeTag(&aconst->val))
  {
      case T_Integer:
          val = Int32GetDatum(intVal(&aconst->val));
          typeid = INT4OID;
          typelen = sizeof(int32);
          typebyval = true;
          break;
      // ...
  }
  // ..
}

После вызывается функция make_op (src/backend/parser/parse_oper.c) для создания узла операции.

HAVING COUNT(*) = 1

Следующий этап - разбор предложения HAVING. Для разбора HAVING используется та же функция, что и при парсинге WHERE - transformWhereClause

Различие начинается в функции transformExprRecurse: левый оператор теперь не ссылка на столбец, а агрегирующая функция. Поэтому тэг не T_ColumnRef, а T_FuncCall - вызов функции. За ее обработку отвечает transformFuncCall(src/backend/parser/parse_expr.c)

Сама логика парсинга заключается в функции ParseFuncOrColumn

tab.col = col(tab)

В Postgres обращение к столбцу в таблице может быть записано 2 способами: через точку (table.column) и как вызов функции (column(table)). Это легаси, с которым нужно мириться.

Поэтому название функции имеет такое название.

В комментарии дано более подробное описание

/*
 *	Parse a function call
 *
 *	For historical reasons, Postgres tries to treat the notations tab.col
 *	and col(tab) as equivalent: if a single-argument function call has an
 *	argument of complex type and the (unqualified) function name matches
 *	any attribute of the type, we can interpret it as a column projection.
 *	Conversely a function of a single complex-type argument can be written
 *	like a column reference, allowing functions to act like computed columns.
 *
 *	If both interpretations are possible, we prefer the one matching the
 *	syntactic form, but otherwise the form does not matter.
 *
 *	Hence, both cases come through here.  If fn is null, we're dealing with
 *	column syntax not function syntax.  In the function-syntax case,
 *	the FuncCall struct is needed to carry various decoration that applies
 *	to aggregate and window functions.
 *
 *	Also, when fn is null, we return NULL on failure rather than
 *	reporting a no-such-function error.
 * ...
 */

Как происходит поиск функции

Информация о функциях (и других . Для получения информации о функции используется func_get_detail(src/backend/parser/parse_func.c).

Для поиска используются кандидаты. Поиск производится по названию (разбирается на неймспейс и имя) и аргументам (количеству и именам).

Поиск кандидатов реализуется в FuncnameGetCandidates(src/backend/catalog/namespace.c). Вначале из системного кэша получаются все вхождения функций (список) с совпадающим названием.

После происходит итерирование по всем элементам этого списка в поиске кандидатов. Функция может стать кандидатом только если:

Неймспейс совпадает (поиск происходил только по названию)
Произошло совпадение по OUT параметрам
Именованные аргументы совпали
Совпадения по типам переданных параметров

Когда список кандидатов готов, он подвергается дополнительным испытаниям.

Происходит поиск лучшего кандидата: сравниваются типы всех аргументов. Если они все совпадают, то функция найдена

/*
 * Quickly check if there is an exact match to the input datatypes (there
 * can be only one)
 */
for (best_candidate = raw_candidates;
     best_candidate != NULL;
     best_candidate = best_candidate->next)
{
    /* if nargs==0, argtypes can be null; don't pass that to memcmp */
    if (nargs == 0 ||
        memcmp(argtypes, best_candidate->args, nargs * sizeof(Oid)) == 0)
        break;
}

Если такой функции не нашлось и при этом у функции только 1 аргумент - делается предположение, что функция - каст к типу. Для этого имя функции интерпретируется как название типа и происходит попытка привести его к касту (coercion).

if (best_candidate == NULL)
{
    if (nargs == 1 && fargs != NIL && fargnames == NIL)
    {
        Oid	targetType = FuncNameAsType(funcname);

        if (OidIsValid(targetType))
        {
           // ...
        }
   }
}

Потом происходит поиск по совпадению аргументов с возможным приведением типов - func_match_argtypes(src/backend/parser/parse_func.c).

Если нашлась одна подходящая функция, то возвращается она.
Если под условие подошло сразу несколько то происходит ~~последний рывок~~ вызов func_select_candidate (src/backend/parser/parse_func.c), которая пытается решить конфликт.

Если дошли до этого момента - то функцию не удалось найти.

Системный кэш, syscache

Во время работы нам может понадобиться множество объектов. Например, информация о существующих функциях как в примере выше. Для ускорения работы, подобные “горячие” элементы хранятся в памяти. Доступ к ним производится через системный кэш.

За работу с ним отвечает модуль syscache (src/include/utils/syscache.h).

В этом заголовочном файле объявлены необходимые для работы перечисление SysCacheIdentifier (src/include/utils/syscache.h), сигнатуры функций для доступа к кэшу и несколько полезных макросов.

Перечисление SysCacheIdentifier определяет область данных, которые мы ищем.

/*
 *		SysCache identifiers.
 *
 *		The order of these identifiers must match the order
 *		of the entries in the array cacheinfo[] in syscache.c.
 *		Keep them in alphabetical order (renumbering only costs a
 *		backend rebuild).
 */

enum SysCacheIdentifier
{
	AGGFNOID = 0,
	AMNAME,
	AMOID,
	AMOPOPID,
	AMOPSTRATEGY,
	AMPROCNUM,
	ATTNAME,
	ATTNUM,
	AUTHMEMMEMROLE,
	AUTHMEMROLEMEM,
	AUTHNAME,
	AUTHOID,
	CASTSOURCETARGET,
	CLAAMNAMENSP,
	CLAOID,
	COLLNAMEENCNSP,
	COLLOID,
	CONDEFAULT,
	CONNAMENSP,
	CONSTROID,
	CONVOID,
	DATABASEOID,
	DEFACLROLENSPOBJ,
	ENUMOID,
	ENUMTYPOIDNAME,
	EVENTTRIGGERNAME,
	EVENTTRIGGEROID,
	FOREIGNDATAWRAPPERNAME,
	FOREIGNDATAWRAPPEROID,
	FOREIGNSERVERNAME,
	FOREIGNSERVEROID,
	FOREIGNTABLEREL,
	INDEXRELID,
	LANGNAME,
	LANGOID,
	NAMESPACENAME,
	NAMESPACEOID,
	OPERNAMENSP,
	OPEROID,
	OPFAMILYAMNAMENSP,
	OPFAMILYOID,
	PARAMETERACLNAME,
	PARAMETERACLOID,
	PARTRELID,
	PROCNAMEARGSNSP,
	PROCOID,
	PUBLICATIONNAME,
	PUBLICATIONNAMESPACE,
	PUBLICATIONNAMESPACEMAP,
	PUBLICATIONOID,
	PUBLICATIONREL,
	PUBLICATIONRELMAP,
	RANGEMULTIRANGE,
	RANGETYPE,
	RELNAMENSP,
	RELOID,
	REPLORIGIDENT,
	REPLORIGNAME,
	RULERELNAME,
	SEQRELID,
	STATEXTDATASTXOID,
	STATEXTNAMENSP,
	STATEXTOID,
	STATRELATTINH,
	SUBSCRIPTIONNAME,
	SUBSCRIPTIONOID,
	SUBSCRIPTIONRELMAP,
	TABLESPACEOID,
	TRFOID,
	TRFTYPELANG,
	TSCONFIGMAP,
	TSCONFIGNAMENSP,
	TSCONFIGOID,
	TSDICTNAMENSP,
	TSDICTOID,
	TSPARSERNAMENSP,
	TSPARSEROID,
	TSTEMPLATENAMENSP,
	TSTEMPLATEOID,
	TYPENAMENSP,
	TYPEOID,
	USERMAPPINGOID,
	USERMAPPINGUSERSERVER

#define SysCacheSize (USERMAPPINGUSERSERVER + 1)
};

Как понятно из сигнатуры, функции InitCatalogCache и InitCatalogCachePhase2 инициализируют кэш. Обе эти функции вызываются в InitPostgres на моменте старта базы данных. Используется 2 функции, так как инициализация разбита на 2 фазы: выделение памяти и общая инициализация (не загрузка кэша), соответственно.

На примере поиска необходимой функции, работа с кэшем выглядит следующим образом:

Вызываем макрос SearchSysCacheList* для списка значений по ключу. Под * имеется ввиду количество ключей по которым идет поиск. В случае поиска функции, использовалось значение 1

/* Search syscache by name only */
catlist = SearchSysCacheList1(PROCNAMEARGSNSP, CStringGetDatum(funcname));

Нам возвращается структура CatCList. Она представляет собой список из элементов кэша. Элементами этого списка являются CatCTup

/*
 * A CatCList describes the result of a partial search, ie, a search using
 * only the first K key columns of an N-key cache.  We store the keys used
 * into the keys attribute to represent the stored key set.  The CatCList
 * object contains links to cache entries for all the table rows satisfying
 * the partial key.  (Note: none of these will be negative cache entries.)
 *
 * A CatCList is only a member of a per-cache list; we do not currently
 * divide them into hash buckets.
 *
 * A list marked "dead" must not be returned by subsequent searches.
 * However, it won't be physically deleted from the cache until its
 * refcount goes to zero.  (A list should be marked dead if any of its
 * member entries are dead.)
 *
 * If "ordered" is true then the member tuples appear in the order of the
 * cache's underlying index.  This will be true in normal operation, but
 * might not be true during bootstrap or recovery operations. (namespace.c
 * is able to save some cycles when it is true.)
 */
typedef struct catclist
{
	int			cl_magic;		/* for identifying CatCList entries */
#define CL_MAGIC   0x52765103

	uint32		hash_value;		/* hash value for lookup keys */

	dlist_node	cache_elem;		/* list member of per-catcache list */

	/*
	 * Lookup keys for the entry, with the first nkeys elements being valid.
	 * All by-reference are separately allocated.
	 */
	Datum		keys[CATCACHE_MAXKEYS];

	int			refcount;		/* number of active references */
	bool		dead;			/* dead but not yet removed? */
	bool		ordered;		/* members listed in index order? */
	short		nkeys;			/* number of lookup keys specified */
	int			n_members;		/* number of member tuples */
	CatCache   *my_cache;		/* link to owning catcache */
	CatCTup    *members[FLEXIBLE_ARRAY_MEMBER]; /* members */
} CatCList;

typedef struct catctup
{
	int			ct_magic;		/* for identifying CatCTup entries */
#define CT_MAGIC   0x57261502

	uint32		hash_value;		/* hash value for this tuple's keys */

	/*
	 * Lookup keys for the entry. By-reference datums point into the tuple for
	 * positive cache entries, and are separately allocated for negative ones.
	 */
	Datum		keys[CATCACHE_MAXKEYS];

	/*
	 * Each tuple in a cache is a member of a dlist that stores the elements
	 * of its hash bucket.  We keep each dlist in LRU order to speed repeated
	 * lookups.
	 */
	dlist_node	cache_elem;		/* list member of per-bucket list */

	/*
	 * A tuple marked "dead" must not be returned by subsequent searches.
	 * However, it won't be physically deleted from the cache until its
	 * refcount goes to zero.  (If it's a member of a CatCList, the list's
	 * refcount must go to zero, too; also, remember to mark the list dead at
	 * the same time the tuple is marked.)
	 *
	 * A negative cache entry is an assertion that there is no tuple matching
	 * a particular key.  This is just as useful as a normal entry so far as
	 * avoiding catalog searches is concerned.  Management of positive and
	 * negative entries is identical.
	 */
	int			refcount;		/* number of active references */
	bool		dead;			/* dead but not yet removed? */
	bool		negative;		/* negative cache entry? */
	HeapTupleData tuple;		/* tuple management header */

	/*
	 * The tuple may also be a member of at most one CatCList.  (If a single
	 * catcache is list-searched with varying numbers of keys, we may have to
	 * make multiple entries for the same tuple because of this restriction.
	 * Currently, that's not expected to be common, so we accept the potential
	 * inefficiency.)
	 */
	struct catclist *c_list;	/* containing CatCList, or NULL if none */

	CatCache   *my_cache;		/* link to owning catcache */
	/* properly aligned tuple data follows, unless a negative entry */
} CatCTup;

Итерируемся по всем элементам members, получаем хранимые в нем данные и работаем с ними

for (i = 0; i < catlist->n_members; i++)
{
	HeapTuple	proctup = &catlist->members[i]->tuple;
    // ...
}

Возвращаем список обратно в кэш - ReleaseSysCacheList

ReleaseSysCacheList(catlist);

Осталось только выяснить как этот кэш заполняется и хранится.

Хранение

Сам кэш хранится в глобальной переменной SysCache. Это массив типа CatCache

typedef struct catcache
{
	int			id;				/* cache identifier --- see syscache.h */
	int			cc_nbuckets;	/* # of hash buckets in this cache */
	TupleDesc	cc_tupdesc;		/* tuple descriptor (copied from reldesc) */
	dlist_head *cc_bucket;		/* hash buckets */
	CCHashFN	cc_hashfunc[CATCACHE_MAXKEYS];	/* hash function for each key */
	CCFastEqualFN cc_fastequal[CATCACHE_MAXKEYS];	/* fast equal function for
													 * each key */
	int			cc_keyno[CATCACHE_MAXKEYS]; /* AttrNumber of each key */
	dlist_head	cc_lists;		/* list of CatCList structs */
	int			cc_ntup;		/* # of tuples currently in this cache */
	int			cc_nkeys;		/* # of keys (1..CATCACHE_MAXKEYS) */
	const char *cc_relname;		/* name of relation the tuples come from */
	Oid			cc_reloid;		/* OID of relation the tuples come from */
	Oid			cc_indexoid;	/* OID of index matching cache keys */
	bool		cc_relisshared; /* is relation shared across databases? */
	slist_node	cc_next;		/* list link */
	ScanKeyData cc_skey[CATCACHE_MAXKEYS];	/* precomputed key info for heap
											 * scans */

	/*
	 * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
	 * doesn't break ABI for other modules
	 */
#ifdef CATCACHE_STATS
	long		cc_searches;	/* total # searches against this cache */
	long		cc_hits;		/* # of matches against existing entry */
	long		cc_neg_hits;	/* # of matches against negative entry */
	long		cc_newloads;	/* # of successful loads of new entry */

	/*
	 * cc_searches - (cc_hits + cc_neg_hits + cc_newloads) is number of failed
	 * searches, each of which will result in loading a negative entry
	 */
	long		cc_invals;		/* # of entries invalidated from cache */
	long		cc_lsearches;	/* total # list-searches */
	long		cc_lhits;		/* # of matches against existing lists */
#endif
} CatCache;


static CatCache *SysCache[SysCacheSize];

Заполнение

Заполнение - лениво: во время вызова SearchCatCacheList мы итерируемся по имеющимся в памяти данным, но если в результате мы ничего не нашли, то входим в область заполнения кэша. Эта область окаймлена PG_TRY/PG_CATCH так как во время работы с таблицей может возникнуть исключение и тогда нужно удалить все добавленные элементы в хэш таблицу.

CatCList *
SearchCatCacheList(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3)
{
    dlist_foreach(iter, &cache->cc_lists)
    {
        // Поиск в памяти
    }
    
    PG_TRY();
    {
        // Сканирование системной таблицы для поиска необходимых данных
    }
    PG_CATCH();
	{
		// Удаление добавленных данных из кэша

		PG_RE_THROW();
	}
	PG_END_TRY();
  
    foreach(ctlist_item, ctlist)
    {
        // Поиск в новых элементах таблицы
    }
}

ORDER BY u.name

После идет обработка ORDER BY u.name. За это отвечает transformSortClause(src/backend/parser/parse_clause.c)

Парсинг заключается в последовательной обработке выражений из списка сортировки (выражений, используемых для сортировки) и добавлению их в результирующий список. За их обработку отвечают 2 функции:

findTargetlistEntrySQL99 - стандарт SQL 99 (src/backend/parser/parse_clause.c)
findTargetlistEntrySQL92 - стандарт SQL 92 (src/backend/parser/parse_clause.c)

Сам разбор состоит из простого итерирования по списку и вызова transformTargetEntry для каждого элемента

foreach(olitem, orderlist)
{
    SortBy	   *sortby = (SortBy *) lfirst(olitem);
    TargetEntry *tle;

    if (useSQL99)
        tle = findTargetlistEntrySQL99(pstate, sortby->node,
                                       targetlist, exprKind);
    else
        tle = findTargetlistEntrySQL92(pstate, sortby->node,
                                       targetlist, exprKind);

    sortlist = addTargetToSortList(pstate, tle,
                                   sortlist, *targetlist, sortby);
}

Разница между SQL 92 и SQL 99

Одно из различий в стандартах SQL92 и SQL99 заключается в логике обработки списка выражений в ORDER BY и GROUP BY.

В стандарте SQL92 этот список состоит из названий столбцов или номера столбца (числа). Например, SELECT name, age FROM users ORDER BY 2 - будет сортировать по возрасту (age)

В стандарте SQL99 каждый элемент списка - это выражение, использующее столбцы таблиц. Например,

SELECT length(name), name from users ORDER BY age * 2

в ORDER BY содержится выражение, а не сырой столбец.

Важно отметить, что в логике парсинга сделана небольшая оптимизация: при парсинге стандартом SQL92, если проваливается попытка обнаружения ссылки на столбец из SELECT выражения, то запасным вариантом запускается парсинг SQL99 стандартом.

static TargetEntry *
findTargetlistEntrySQL92(ParseState *pstate, Node *node, List **tlist,
						 ParseExprKind exprKind)
{
    // ... Парсинг стандартом SQL92 ...

  	/*
	 * Otherwise, we have an expression, so process it per SQL99 rules.
	 */
	return findTargetlistEntrySQL99(pstate, node, tlist, exprKind);
}

Для определения какой метод разбора выбрать используется фича-флаг useSQL99, который передается в transformSortClause.

List *
transformSortClause(ParseState *pstate,
					List *orderlist,
					List **targetlist,
					ParseExprKind exprKind,
					bool useSQL99)
{
    // ...
    if (useSQL99)
        tle = findTargetlistEntrySQL99(pstate, sortby->node,
                                       targetlist, exprKind);
    else
        tle = findTargetlistEntrySQL92(pstate, sortby->node,
                                       targetlist, exprKind);
    // ...
}

GROUP BY u.name

После идет обработка GROUP BY. За это отвечает transformGroupClause (src/backend/parser/parse_clause.c).

В списке выражений группировки может быть 2 “граничных” случая, которые обрабатываются по разному:

Выражение (столбец, функция)
GROUPING SET

Все выражения группировки хранятся в общем списке flat_grouplist. В процессе обработки каждого выражения группировки этот список обновляется.

За обработку выражения отвечает transformGroupClauseExpr. Разбор выражения ведется точно так же как и выражения сортировки - findTargetlistEntrySQL99 и findTargetlistEntrySQL92. Но дальше идет оптимизация: учет сортировки.

Для обработки таких случаев используется структура SortGroupClause (src/include/nodes/parsenodes.h).

/*
 * SortGroupClause -
 *		representation of ORDER BY, GROUP BY, PARTITION BY,
 *		DISTINCT, DISTINCT ON items
 * ...
 */
typedef struct SortGroupClause
{
	NodeTag		type;
	Index		tleSortGroupRef;	/* reference into targetlist */
	Oid			eqop;			/* the equality operator ('=' op) */
	Oid			sortop;			/* the ordering operator ('<' op), or 0 */
	bool		nulls_first;	/* do NULLs come before normal values? */
	bool		hashable;		/* can eqop be implemented by hashing? */
} SortGroupClause;

Внутри себя она хранит информацию необходимую для сортировки и группировки (операторы сравнения, флаг хэшируемости), а для оптимизации памяти вместо самого выражения - индекс на элемент в списке столбцов SELECT.

Работа transformGroupClauseExpr зависит от типа используемой группировки.

В случае GROUPING SET действия зависят от подтипа. Подтип определяется перечислением GroupingSetKind (src/include/nodes/parsenodes.h)

typedef enum GroupingSetKind
{
	GROUPING_SET_EMPTY,
	GROUPING_SET_SIMPLE,
	GROUPING_SET_ROLLUP,
	GROUPING_SET_CUBE,
	GROUPING_SET_SETS
} GroupingSetKind;

Остальные случаи используют transformGroupClauseExpr, для парсинга выражений.

Результатом работы этого этапа является список SortGroupClause - ссылок на элементы из результирующего списка выражений.

LIMIT 50

Последним обрабатывается выражение LIMIT. За его обработку отвечает transformLimitClause(src/backend/parser/parse_clause.c)

Обработка заключается в парсинге выражения. За это отвечает базовый transformExpr. Дополнительно проводится проверка на то, что выражение не ссылается на переменные в запросе. Для этого используется функция checkExprIsVarFree. Она обходит дерево запроса и проверяет существование переменных в любом из узлов.

Этот паттерн посетитель реализуется функцией query_tree_walker(src/backend/nodes/nodeFuncs.c)

bool
query_tree_walker(Query *query,
				  bool (*walker) (),
				  void *context,
				  int flags)

Сам посетитель - функция, которую передаем аргументом. query_tree_walker проходит по всем узлам запроса и применяет ее к каждому узлу.

Посетитель, проверяющий существование переменных, - contain_vars_of_level_walker(src/backend/optimizer/util/var.c).

Важно заметить, что transformLimitClause используется для парсинга как и LIMIT так и OFFSET. Разница заключается в дополнительной проверке для LIMIT: в FETCH FIRST … WITH TIES не разрешено использование NULL

/*
* Don't allow NULLs in FETCH FIRST .. WITH TIES.  This test is ugly and
* extremely simplistic, in that you can pass a NULL anyway by hiding it
* inside an expression -- but this protects ruleutils against emitting an
* unadorned NULL that's not accepted back by the grammar.
*/
if (exprKind == EXPR_KIND_LIMIT && limitOption == LIMIT_OPTION_WITH_TIES &&
    IsA(clause, A_Const) && castNode(A_Const, clause)->isnull)
  ereport(ERROR,
          (errcode(ERRCODE_INVALID_ROW_COUNT_IN_LIMIT_CLAUSE),
           errmsg("row count cannot be null in FETCH FIRST ... WITH TIES clause")));

Предобработка

Следующий важный этап - составление плана запроса. Но перед этим происходит предобработка некоторых выражений.

LIMIT

Первым производится обработка LIMIT/OFFSET выражения. За это отвечает функция preprocess_limit(src/backend/optimizer/plan/planner.c). Эта функция не просто парсит значение узла LIMIT, здесь происходит оценка возможного количества кортежей на выходе - tuple_fraction.

tuple_fraction

Одно из достоинств PostgreSQL - умный планировщик. Один из компонентов такой репутации - логика определения количества кортежей на выходе. В GUC присутствует параметр cursor_tuple_fraction, который представляет оценку количества записей, которые будут получены. По умолчанию, он равен 0,1 (10%)

Это значение используется для определения лучшего плана исполнения: если планируется получить мало строк, то выбирается план с самым быстрым стартом.

Но cursor_tuple_fraction - не конечное значение. Это значение регулируется - preprocess_limit возвращает новое значение tuple_fraction, которое и будет использоваться в процессе создания плана.

Код его расчета - дерево выбора

static double
preprocess_limit(PlannerInfo *root, double tuple_fraction,
				 int64 *offset_est, int64 *count_est)
{
    // ...

    if (*count_est != 0)
	{
		/*
		 * A LIMIT clause limits the absolute number of tuples returned.
		 * However, if it's not a constant LIMIT then we have to guess; for
		 * lack of a better idea, assume 10% of the plan's result is wanted.
		 */
		if (*count_est < 0 || *offset_est < 0)
		{
			/* LIMIT or OFFSET is an expression ... punt ... */
			limit_fraction = 0.10;
		}
		else
		{
			/* LIMIT (plus OFFSET, if any) is max number of tuples needed */
			limit_fraction = (double) *count_est + (double) *offset_est;
		}

		/*
		 * If we have absolute limits from both caller and LIMIT, use the
		 * smaller value; likewise if they are both fractional.  If one is
		 * fractional and the other absolute, we can't easily determine which
		 * is smaller, but we use the heuristic that the absolute will usually
		 * be smaller.
		 */
		if (tuple_fraction >= 1.0)
		{
			if (limit_fraction >= 1.0)
			{
				/* both absolute */
				tuple_fraction = Min(tuple_fraction, limit_fraction);
			}
			else
			{
				/* caller absolute, limit fractional; use caller's value */
			}
		}
		else if (tuple_fraction > 0.0)
		{
			if (limit_fraction >= 1.0)
			{
				/* caller fractional, limit absolute; use limit */
				tuple_fraction = limit_fraction;
			}
			else
			{
				/* both fractional */
				tuple_fraction = Min(tuple_fraction, limit_fraction);
			}
		}
		else
		{
			/* no info from caller, just use limit */
			tuple_fraction = limit_fraction;
		}
	}
	else if (*offset_est != 0 && tuple_fraction > 0.0)
	{
		/*
		 * We have an OFFSET but no LIMIT.  This acts entirely differently
		 * from the LIMIT case: here, we need to increase rather than decrease
		 * the caller's tuple_fraction, because the OFFSET acts to cause more
		 * tuples to be fetched instead of fewer.  This only matters if we got
		 * a tuple_fraction > 0, however.
		 *
		 * As above, use 10% if OFFSET is present but unestimatable.
		 */
		if (*offset_est < 0)
			limit_fraction = 0.10;
		else
			limit_fraction = (double) *offset_est;

		/*
		 * If we have absolute counts from both caller and OFFSET, add them
		 * together; likewise if they are both fractional.  If one is
		 * fractional and the other absolute, we want to take the larger, and
		 * we heuristically assume that's the fractional one.
		 */
		if (tuple_fraction >= 1.0)
		{
			if (limit_fraction >= 1.0)
			{
				/* both absolute, so add them together */
				tuple_fraction += limit_fraction;
			}
			else
			{
				/* caller absolute, limit fractional; use limit */
				tuple_fraction = limit_fraction;
			}
		}
		else
		{
			if (limit_fraction >= 1.0)
			{
				/* caller fractional, limit absolute; use caller's value */
			}
			else
			{
				/* both fractional, so add them together */
				tuple_fraction += limit_fraction;
				if (tuple_fraction >= 1.0)
					tuple_fraction = 0.0;	/* assume fetch all */
			}
		}
	}

	return tuple_fraction;

GROUP BY

Дальше происходит обработка выражения GROUP BY - вызов preprocess_groupclause (src/backend/optimizer/plan/planner.c).

Главная задача этой функции - синхронизировать GROUP BY и ORDER BY выражения. Если это получится сделать, то сортировку и группировку можно выполнить одной операцией.

SELECT …

За предобработку возвращаемого списка значений отвечает preprocess_targetlist (src/backend/optimizer/prep/preptlist.c). Вкратце, на данном этапе добавляются рабочие столбцы, которые будут полезны в дальнейшей работе.

Как работает FOR UPDATE/SHARE

В случае, если в запросе было указано FOR UPDATE/SHARE, то полученные кортежи нужно заблокировать, чтобы не дать получить к ним доступ другим пользователям.

Интересен вариант реализации - если в выражении есть помеченные строки, то в результирующий список добавляются новые (мусорные, junk) столбцы, единственная задача которых - помечать строки.

foreach(lc, root->rowMarks)
{
    PlanRowMark *rc = (PlanRowMark *) lfirst(lc);

    // ...
  
    if (rc->allMarkTypes & ~(1 << ROW_MARK_COPY))
    {
        /* Need to fetch TID */
        var = makeVar(rc->rti,
                      SelfItemPointerAttributeNumber,
                      TIDOID,
                      -1,
                      InvalidOid,
                      0);
        snprintf(resname, sizeof(resname), "ctid%u", rc->rowmarkId);
        tle = makeTargetEntry((Expr *) var,
                              list_length(tlist) + 1,
                              pstrdup(resname),
                              true);
        tlist = lappend(tlist, tle);
    }
    if (rc->allMarkTypes & (1 << ROW_MARK_COPY))
    {
        /* Need the whole row as a junk var */
        var = makeWholeRowVar(rt_fetch(rc->rti, range_table),
                              rc->rti,
                              0,
                              false);
        snprintf(resname, sizeof(resname), "wholerow%u", rc->rowmarkId);
        tle = makeTargetEntry((Expr *) var,
                              list_length(tlist) + 1,
                              pstrdup(resname),
                              true);
        tlist = lappend(tlist, tle);
    }
    // ...
}

Имена этих столбцов создаются динамически и доступа к ним не получить обычным способом. Сами помеченные столбцы представляются структурой PlanRowMark

/*
 * PlanRowMark -
 *	   plan-time representation of FOR [KEY] UPDATE/SHARE clauses
 * ...
 */
typedef struct PlanRowMark
{
	NodeTag		type;
	Index		rti;			/* range table index of markable relation */
	Index		prti;			/* range table index of parent relation */
	Index		rowmarkId;		/* unique identifier for resjunk columns */
	RowMarkType markType;		/* see enum above */
	int			allMarkTypes;	/* OR of (1<<markType) for all children */
	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED options */
	bool		isParent;		/* true if this is a "dummy" parent entry */
} PlanRowMark;

Создание путей

Когда предобработка завершена, начинается создание путей выполнения.

Путь - это все данные необходимые для создания плана выполнения. Самый простой путь - последовательное сканирование - представляется структурой Path. Он является базовым для всех других.

/*
 * Type "Path" is used as-is for sequential-scan paths, as well as some other
 * simple plan types that we don't need any extra information in the path for.
 * For other path types it is the first component of a larger struct.
 *
 * "pathtype" is the NodeTag of the Plan node we could build from this Path.
 * It is partially redundant with the Path's NodeTag, but allows us to use
 * the same Path type for multiple Plan types when there is no need to
 * distinguish the Plan type during path processing.
 *
 * "parent" identifies the relation this Path scans, and "pathtarget"
 * describes the precise set of output columns the Path would compute.
 * In simple cases all Paths for a given rel share the same targetlist,
 * which we represent by having path->pathtarget equal to parent->reltarget.
 *
 * "param_info", if not NULL, links to a ParamPathInfo that identifies outer
 * relation(s) that provide parameter values to each scan of this path.
 * That means this path can only be joined to those rels by means of nestloop
 * joins with this path on the inside.  Also note that a parameterized path
 * is responsible for testing all "movable" joinclauses involving this rel
 * and the specified outer rel(s).
 *
 * "rows" is the same as parent->rows in simple paths, but in parameterized
 * paths and UniquePaths it can be less than parent->rows, reflecting the
 * fact that we've filtered by extra join conditions or removed duplicates.
 *
 * "pathkeys" is a List of PathKey nodes (see above), describing the sort
 * ordering of the path's output rows.
 */
typedef struct Path
{
	NodeTag		type;

	NodeTag		pathtype;		/* tag identifying scan/join method */

	RelOptInfo *parent;			/* the relation this path can build */
	PathTarget *pathtarget;		/* list of Vars/Exprs, cost, width */

	ParamPathInfo *param_info;	/* parameterization info, or NULL if none */

	bool		parallel_aware; /* engage parallel-aware logic? */
	bool		parallel_safe;	/* OK to use as part of parallel plan? */
	int			parallel_workers;	/* desired # of workers; 0 = not parallel */

	/* estimated size/costs for path (see costsize.c for more info) */
	Cardinality rows;			/* estimated number of result tuples */
	Cost		startup_cost;	/* cost expended before fetching any tuples */
	Cost		total_cost;		/* total cost (assuming all tuples fetched) */

	List	   *pathkeys;		/* sort ordering of path's output */
	/* pathkeys is a List of PathKey nodes; see above */
} Path;

Начальный путь создается вызовом query_planner. Затем, каждый из дополнительных выражений в запросе (GROUP BY, ORDER BY) добавляют свои пути.

Сортировка

Для создания путей, учитывающих сортировку, вызывается make_sort_input_target(src/backend/optimizer/plan/planner.c).

Вначале, просматривается каждое выражение участвующее в сортировке. В зависимости от его типа/свойств, обновляются переменные контекста.

Например, если в сортировке используются volatile функции или функции, стоимость которых слишком большая, то выставляется флаг have_volatile

foreach(lc, final_target->exprs)
{
    // ...
    if (contain_volatile_functions((Node *) expr))
    {
        /* Unconditionally postpone */
        postpone_col[i] = true;
        have_volatile = true;
    }
    // ...
}

Дополнительно учитываются столбцы, вычисление которых может быть отложено. Для отслеживания таких выражений используется массив флагов. Это показано выше: postpone_col[i] = true.

После создается новый PathTarget специально для сортировки.

Что такое PathTarget

В процессе выбора лучшего пути, нам нужны не все данные. В частности, больше всего нам полезны стоимость пути или количество используемых выражений. Для этого существует PathTarget

typedef struct PathTarget
{
	NodeTag		type;
	List	   *exprs;			/* list of expressions to be computed */
	Index	   *sortgrouprefs;	/* corresponding sort/group refnos, or 0 */
	QualCost	cost;			/* cost of evaluating the expressions */
	int			width;			/* estimated avg width of result tuples */
	VolatileFunctionStatus has_volatile_expr;	/* indicates if exprs contain
												 * any volatile functions. */
} PathTarget;

На текущем этапе создаются именно они. Сами планы будут созданы после, на основании этих путей

Первым заполняется список выражений, которые будут возвращены запросом - сперва идут простые переменные, а в конце те, что мы указали как отложенные.

В конце, вызывается set_pathtarget_cost_width(src/backend/optimizer/path/costsize.c) для определения стоимости пути.

set_pathtarget_cost_width

Почти все функции, создающие новые пути, подчиняются одному паттерну:

Создание нового пути (PathTarget)
Заполнение его данными
Вызов set_pathtarget_cost_width
Возврат полученного значения из этой функции

P.S. 3 и 4 объединяются в return set_pathtarget_cost_width

Задача этой функции - определить стоимость пути.

Стоимость представляется структурой QualCost

typedef double Cost;	/* execution cost (in page-access units) */

/*
 * The cost estimate produced by cost_qual_eval() includes both a one-time
 * (startup) cost, and a per-tuple cost.
 */
typedef struct QualCost
{
	Cost		startup;	/* one-time cost */
	Cost		per_tuple;	/* per-evaluation cost */
} QualCost;

Эта функция итерируется по всем выражениям, используемым в пути, и, в зависимости от его типа, обновляет стоимость.

Можно заметить, что при расчете стоимости учитывается только 2 типа выражений: переменные и обычные выражения. Причем стоимость пути могут изменить только обычные выражения - считается, что стоимость переменных равна 0

foreach(lc, target->exprs)
{
    Node	   *node = (Node *) lfirst(lc);

    if (IsA(node, Var))
    {
        Var		   *var = (Var *) node;
        int32		item_width;

        // ...
        
        tuple_width += item_width;
    }
    else
    {
        /*
         * Handle general expressions using type info.
         */
        int32		item_width;
        QualCost	cost;

        // ...
      
        target->cost.startup += cost.startup;
        target->cost.per_tuple += cost.per_tuple;
    }
}

Для определения стоимости выражения используется паттерн Visitor - функция cost_qual_eval_node (src/backend/optimizer/path/costsize.c) запускает посетитель cost_qual_eval_walker который обходит каждый узел выражения.

Например, в случае если выражение - вызов функции, то вызовется функция add_function_cost (src/backend/optimizer/util/plancat.c), которая и подсчитает стоимость функции

static bool
cost_qual_eval_walker(Node *node, cost_qual_eval_context *context)
{
	if (node == NULL)
		return false;

	if (IsA(node, RestrictInfo))
	{
		// ...
	}

	if (IsA(node, FuncExpr))
	{
		add_function_cost(context->root, ((FuncExpr *) node)->funcid, node,
						  &context->total);
	}
	else if (IsA(node, OpExpr) ||
			 IsA(node, DistinctExpr) ||
			 IsA(node, NullIfExpr))
	{
		// ...
	}
	else if (IsA(node, ScalarArrayOpExpr))
	{
		// ...
	}
	else if (IsA(node, Aggref) ||
			 IsA(node, WindowFunc))
	{
		// ...
	}
	else if (IsA(node, GroupingFunc))
	{
		// ...
	}
	else if (IsA(node, CoerceViaIO))
	{
		// ...
	}
	else if (IsA(node, ArrayCoerceExpr))
	{
		// ...
	}
	else if (IsA(node, RowCompareExpr))
	{
		// ...
	}
	else if (IsA(node, MinMaxExpr) ||
			 IsA(node, SQLValueFunction) ||
			 IsA(node, XmlExpr) ||
			 IsA(node, CoerceToDomain) ||
			 IsA(node, NextValueExpr))
	{
		// ...
	}
	else if (IsA(node, CurrentOfExpr))
	{
		// ...
	}
	else if (IsA(node, SubLink))
	{
		// ...
	}
	else if (IsA(node, SubPlan))
	{
		// ...
	}
	else if (IsA(node, AlternativeSubPlan))
	{
		// ...
	}
	else if (IsA(node, PlaceHolderVar))
	{
		// ...
	}

	/* recurse into children */
	return expression_tree_walker(node, cost_qual_eval_walker,
								  (void *) context);
}

Пути группировки

Дальше идет вычисление путей для группирующих выражений. За это отвечает make_group_input_target(src/backend/optimizer/plan/planner.c).

В начале все результирующие столбцы делятся на 2 группы: участвующие и нет в группировке.

После этого в результирующий путь сначала добавляются все столбцы, участвующие в группировке, а затем все остальные.

После так же как и в сортировке, вызывается set_pathtarget_cost_width для оценки стоимости.

Создание планов

Когда пути созданы, они сохраняются в поле upper_targets. Это массив хранящий в себе созданные пути. Для индексации по нему используется перечисление UpperRelationKind

/*
 * This enum identifies the different types of "upper" (post-scan/join)
 * relations that we might deal with during planning.
 */
typedef enum UpperRelationKind
{
	UPPERREL_SETOP,				/* result of UNION/INTERSECT/EXCEPT, if any */
	UPPERREL_PARTIAL_GROUP_AGG, /* result of partial grouping/aggregation, if
								 * any */
	UPPERREL_GROUP_AGG,			/* result of grouping/aggregation, if any */
	UPPERREL_WINDOW,			/* result of window functions, if any */
	UPPERREL_PARTIAL_DISTINCT,	/* result of partial "SELECT DISTINCT", if any */
	UPPERREL_DISTINCT,			/* result of "SELECT DISTINCT", if any */
	UPPERREL_ORDERED,			/* result of ORDER BY, if any */
	UPPERREL_FINAL				/* result of any remaining top-level actions */
	/* NB: UPPERREL_FINAL must be last enum entry; it's used to size arrays */
} UpperRelationKind;

После предобработки, сохраняются следующие планы

/*
 * Save the various upper-rel PathTargets we just computed into
 * root->upper_targets[].  The core code doesn't use this, but it
 * provides a convenient place for extensions to get at the info.  For
 * consistency, we save all the intermediate targets, even though some
 * of the corresponding upperrels might not be needed for this query.
 */
root->upper_targets[UPPERREL_FINAL] = final_target;
root->upper_targets[UPPERREL_ORDERED] = final_target;
root->upper_targets[UPPERREL_PARTIAL_DISTINCT] = sort_input_target;
root->upper_targets[UPPERREL_DISTINCT] = sort_input_target;
root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;

Планы группировки

Сейчас начинается создание дополнительных планов.

Первым создается план, учитывающий группировку. Это create_grouping_paths.

degenerate_grouping_paths

В Postgres есть понятие degenerate grouping. Это такое выражение, в котором не используются агрегирующие функции и GROUP BY, но при этом есть HAVING или GROUPING SET.

Для таких случаев имеется отдельный поток выполнения.

Проверка на этот случай содержится в is_degenerate_grouping (src/backend/optimizer/plan/planner.c)

/*
 * is_degenerate_grouping
 *
 * A degenerate grouping is one in which the query has a HAVING qual and/or
 * grouping sets, but no aggregates and no GROUP BY (which implies that the
 * grouping sets are all empty).
 */
static bool
is_degenerate_grouping(PlannerInfo *root)
{
	Query	   *parse = root->parse;

	return (root->hasHavingQual || parse->groupingSets) &&
		!parse->hasAggs && parse->groupClause == NIL;
}

Для них определен свой собственный узел - GroupResultPath

/*
 * GroupResultPath represents use of a Result plan node to compute the
 * output of a degenerate GROUP BY case, wherein we know we should produce
 * exactly one row, which might then be filtered by a HAVING qual.
 *
 * Note that quals is a list of bare clauses, not RestrictInfos.
 */
typedef struct GroupResultPath
{
	Path		path;
	List	   *quals;
} GroupResultPath;

Создание группирующих планов начинается с определения важных свойств выражения. Например, применимы ли сортировка или хэширование.

Эти флаги определены макросами

/*
 * Various flags indicating what kinds of grouping are possible.
 *
 * GROUPING_CAN_USE_SORT should be set if it's possible to perform
 * sort-based implementations of grouping.  When grouping sets are in use,
 * this will be true if sorting is potentially usable for any of the grouping
 * sets, even if it's not usable for all of them.
 *
 * GROUPING_CAN_USE_HASH should be set if it's possible to perform
 * hash-based implementations of grouping.
 *
 * GROUPING_CAN_PARTIAL_AGG should be set if the aggregation is of a type
 * for which we support partial aggregation (not, for example, grouping sets).
 * It says nothing about parallel-safety or the availability of suitable paths.
 */
#define GROUPING_CAN_USE_SORT       0x0001
#define GROUPING_CAN_USE_HASH       0x0002
#define GROUPING_CAN_PARTIAL_AGG	0x0004

Например, возможность использования хэширования определяется таким образом

if ((parse->groupClause != NIL &&
     root->numOrderedAggs == 0 &&
     (gd ? gd->any_hashable : grouping_is_hashable(parse->groupClause))))
    flags |= GROUPING_CAN_USE_HASH;

Сама логика добавления путей группировки заключена в add_paths_to_grouping_rel (src/backend/optimizer/plan/planner.c). Здесь создаются пути выполнения для вариантов с хэшированием и сортировкой.

static void
add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
						  RelOptInfo *grouped_rel,
						  RelOptInfo *partially_grouped_rel,
						  const AggClauseCosts *agg_costs,
						  grouping_sets_data *gd, double dNumGroups,
						  GroupPathExtraData *extra)
{
	Query	   *parse = root->parse;
	Path	   *cheapest_path = input_rel->cheapest_total_path;
	ListCell   *lc;
	bool		can_hash = (extra->flags & GROUPING_CAN_USE_HASH) != 0;
	bool		can_sort = (extra->flags & GROUPING_CAN_USE_SORT) != 0;
	List	   *havingQual = (List *) extra->havingQual;
	AggClauseCosts *agg_final_costs = &extra->agg_final_costs;

	if (can_sort)
	{
		/*
		 * Use any available suitably-sorted path as input, and also consider
		 * sorting the cheapest-total path.
		 */
		foreach(lc, input_rel->pathlist)
		{
            // ...
        }
    }

    if (can_hash)
	{
		if (parse->groupingSets)
		{
			/*
			 * Try for a hash-only groupingsets path over unsorted input.
			 */
			consider_groupingsets_paths(root, grouped_rel,
										cheapest_path, false, true,
										gd, agg_costs, dNumGroups);
		}
		else
		{
			/*
			 * Generate a HashAgg Path.  We just need an Agg over the
			 * cheapest-total input path, since input order won't matter.
			 */
			add_path(grouped_rel, (Path *)
					 create_agg_path(root, grouped_rel,
									 cheapest_path,
									 grouped_rel->reltarget,
									 AGG_HASHED,
									 AGGSPLIT_SIMPLE,
									 parse->groupClause,
									 havingQual,
									 agg_costs,
									 dNumGroups));
		}

        // ...
    }
}

Для случая GROUP BY u.name работа будет следующей.

Флаг can_hash выставлен, поэтому обрабатывается эта ветка.

Берется следующий путь (массив pathlist)
GROUPING SET у нас отсутствует, поэтому создаем HashAgg узел (create_agg_path)

В конце, вызывается set_cheapest (src/backend/optimizer/util/pathnode.c), чтобы обновить лучшие пути.

set_cheapest

set_cheapest - функция, которая занимается обновлением информации о лучших путях

При создании планов, для хранения информации используется структура RelOptInfo. В ней присутствуют поля, хранящие текущие лучшие пути

typedef struct RelOptInfo
{
    // ...
    struct Path *cheapest_startup_path;
	struct Path *cheapest_total_path;
	struct Path *cheapest_unique_path;
	List	   *cheapest_parameterized_paths;
    // ...
}

Но это только слепок - в процессе могут появиться новые более выгодные пути. Для обновления этих полей используется set_cheapest.

Внутри происходит простой проход по всем путям и сравнения с текущим лучшим

void
set_cheapest(RelOptInfo *parent_rel)
{
    Path *cheapest_startup_path;
	Path *cheapest_total_path;
	Path *best_param_path;

    cheapest_startup_path = cheapest_total_path = best_param_path = NULL;

  	foreach(p, parent_rel->pathlist)
	{
		Path	   *path = (Path *) lfirst(p);
		int			cmp;
    
        if (path->param_info)
		{
            // Обработка параметризованного выражения
        }
		else
		{
			/* Unparameterized path, so consider it for cheapest slots */
			if (cheapest_total_path == NULL)
			{
				cheapest_startup_path = cheapest_total_path = path;
				continue;
			}

			/*
			 * If we find two paths of identical costs, try to keep the
			 * better-sorted one.  The paths might have unrelated sort
			 * orderings, in which case we can only guess which might be
			 * better to keep, but if one is superior then we definitely
			 * should keep that one.
			 */
			cmp = compare_path_costs(cheapest_startup_path, path, STARTUP_COST);
			if (cmp > 0 ||
				(cmp == 0 &&
				 compare_pathkeys(cheapest_startup_path->pathkeys,
								  path->pathkeys) == PATHKEYS_BETTER2))
				cheapest_startup_path = path;

			cmp = compare_path_costs(cheapest_total_path, path, TOTAL_COST);
			if (cmp > 0 ||
				(cmp == 0 &&
				 compare_pathkeys(cheapest_total_path->pathkeys,
								  path->pathkeys) == PATHKEYS_BETTER2))
				cheapest_total_path = path;
		}
    }

    // ...
    
  	parent_rel->cheapest_startup_path = cheapest_startup_path;
	parent_rel->cheapest_total_path = cheapest_total_path;
	parent_rel->cheapest_unique_path = NULL;	/* computed only if needed */
	parent_rel->cheapest_parameterized_paths = parameterized_paths;
}

Сравнение стоимости путей реализуется функцией compare_path_costs (src/backend/optimizer/util/pathnode.c)

/*
 * compare_path_costs
 *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
 *	  or more expensive than path2 for the specified criterion.
 */
int
compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
{
	if (criterion == STARTUP_COST)
	{
		if (path1->startup_cost < path2->startup_cost)
			return -1;
		if (path1->startup_cost > path2->startup_cost)
			return +1;

		/*
		 * If paths have the same startup cost (not at all unlikely), order
		 * them by total cost.
		 */
		if (path1->total_cost < path2->total_cost)
			return -1;
		if (path1->total_cost > path2->total_cost)
			return +1;
	}
	else
	{
		if (path1->total_cost < path2->total_cost)
			return -1;
		if (path1->total_cost > path2->total_cost)
			return +1;

		/*
		 * If paths have the same total cost, order them by startup cost.
		 */
		if (path1->startup_cost < path2->startup_cost)
			return -1;
		if (path1->startup_cost > path2->startup_cost)
			return +1;
	}
	return 0;
}

Планы сортировки

Дальше идет создание планов, учитывающих сортировку. За это отвечает create_ordered_paths(src/backend/optimizer/plan/planner.c).

За создание сортирующего плана отвечает create_sort_path (src/backend/optimizer/util/pathnode.c), либо create_incremental_sort_path, если необходима инкрементальная сортировка.

Просто обходится каждый путь и добавляется узел сортировки

foreach(lc, input_rel->pathlist)
{
    Path	   *input_path = (Path *) lfirst(lc);
    Path	   *sorted_path = input_path;
    bool		is_sorted;
    int			presorted_keys;

    is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
                                            input_path->pathkeys, &presorted_keys);

    if (is_sorted)
    {
        /* Use the input path as is, but add a projection step if needed */
        if (sorted_path->pathtarget != target)
            sorted_path = apply_projection_to_path(root, ordered_rel,
                                                   sorted_path, target);

        add_path(ordered_rel, sorted_path);
    }
    else
    {
        // Добавление явных узлов сортировки   
    }
}

План LIMIT/OFFSET

В конце когда результирующий план построен, к нему добавляются узлы LIMIT и/или OFFSET. Это делает create_limit_path (src/backend/optimizer/util/pathnode.c).

Создание результирующего плана

Спустя несколько возвратов вверх по стеку вызовов, результирующий план вычисляется в standard_planner

/* Select best Path and turn it into a Plan */
final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL);
best_path = get_cheapest_fractional_path(final_rel, tuple_fraction);
top_plan = create_plan(root, best_path);

План построен, пора приступать к его выполнению.

Этап выполнения

Выполнение начинается в функции ExecutePlan.

Напомню, что:

Выполняется не план, а портал. Он хранит в себе все необходимое для выполнения плана
Каждый узел “наследуется” от PlanState
Каждый PlanState “реализует” ExecProcNodeMtd - функцию, которая исполняет работу узла (например, если это LIMIT, то следит за количеством полученных строк)

/* ----------------
 *		PlanState node
 *
 * We never actually instantiate any PlanState nodes; this is just the common
 * abstract superclass for all PlanState-type nodes.
 * ----------------
 */
typedef struct PlanState
{
	NodeTag		type;

	Plan	   *plan;			/* associated Plan node */

	EState	   *state;			/* at execution time, states of individual
								 * nodes point to one EState for the whole
								 * top-level plan */

	ExecProcNodeMtd ExecProcNode;	/* function to return next tuple */
	ExecProcNodeMtd ExecProcNodeReal;	/* actual function, if above is a
										 * wrapper */

	Instrumentation *instrument;	/* Optional runtime stats for this node */
	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */

	/* Per-worker JIT instrumentation */
	struct SharedJitInstrumentation *worker_jit_instrument;

	/*
	 * Common structural data for all Plan types.  These links to subsidiary
	 * state trees parallel links in the associated plan tree (except for the
	 * subPlan list, which does not exist in the plan tree).
	 */
	ExprState  *qual;			/* boolean qual condition */
	struct PlanState *lefttree; /* input plan tree(s) */
	struct PlanState *righttree;

	List	   *initPlan;		/* Init SubPlanState nodes (un-correlated expr
								 * subselects) */
	List	   *subPlan;		/* SubPlanState nodes in my expressions */

	/*
	 * State for management of parameter-change-driven rescanning
	 */
	Bitmapset  *chgParam;		/* set of IDs of changed Params */

	/*
	 * Other run-time state needed by most if not all node types.
	 */
	TupleDesc	ps_ResultTupleDesc; /* node's return type */
	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
	ExprContext *ps_ExprContext;	/* node's expression-evaluation context */
	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */

	bool		async_capable;	/* true if node is async-capable */

	/*
	 * Scanslot's descriptor if known. This is a bit of a hack, but otherwise
	 * it's hard for expression compilation to optimize based on the
	 * descriptor, without encoding knowledge about all executor nodes.
	 */
	TupleDesc	scandesc;

	/*
	 * Define the slot types for inner, outer and scanslots for expression
	 * contexts with this state as a parent.  If *opsset is set, then
	 * *opsfixed indicates whether *ops is guaranteed to be the type of slot
	 * used. That means that every slot in the corresponding
	 * ExprContext.ecxt_*tuple will point to a slot of that type, while
	 * evaluating the expression.  If *opsfixed is false, but *ops is set,
	 * that indicates the most likely type of slot.
	 *
	 * The scan* fields are set by ExecInitScanTupleSlot(). If that's not
	 * called, nodes can initialize the fields themselves.
	 *
	 * If outer/inneropsset is false, the information is inferred on-demand
	 * using ExecGetResultSlotOps() on ->righttree/lefttree, using the
	 * corresponding node's resultops* fields.
	 *
	 * The result* fields are automatically set when ExecInitResultSlot is
	 * used (be it directly or when the slot is created by
	 * ExecAssignScanProjectionInfo() /
	 * ExecConditionalAssignProjectionInfo()).  If no projection is necessary
	 * ExecConditionalAssignProjectionInfo() defaults those fields to the scan
	 * operations.
	 */
	const TupleTableSlotOps *scanops;
	const TupleTableSlotOps *outerops;
	const TupleTableSlotOps *innerops;
	const TupleTableSlotOps *resultops;
	bool		scanopsfixed;
	bool		outeropsfixed;
	bool		inneropsfixed;
	bool		resultopsfixed;
	bool		scanopsset;
	bool		outeropsset;
	bool		inneropsset;
	bool		resultopsset;
} PlanState;

Представим, что для нашего запроса создался план выполнения:

LIMIT
ORDER BY
GROUP BY
SEQ SCAN

Тогда получение следующего кортежа пройдет следующие узлы (этапы).

LIMIT

Корнем нашего дерева запроса будет LIMIT.

Узел, его представляющий, - LimitState

typedef struct LimitState
{
	PlanState	ps;				/* its first field is NodeTag */
	ExprState  *limitOffset;	/* OFFSET parameter, or NULL if none */
	ExprState  *limitCount;		/* COUNT parameter, or NULL if none */
	LimitOption limitOption;	/* limit specification type */
	int64		offset;			/* current OFFSET value */
	int64		count;			/* current COUNT, if any */
	bool		noCount;		/* if true, ignore count */
	LimitStateCond lstate;		/* state machine status, as above */
	int64		position;		/* 1-based index of last tuple returned */
	TupleTableSlot *subSlot;	/* tuple last obtained from subplan */
	ExprState  *eqfunction;		/* tuple equality qual in case of WITH TIES
								 * option */
	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
} LimitState;

Реализация ExecProcNodeMtd - ExecLimit (src/backend/executor/nodeLimit.c).

Узел работает по принципу машины состояний. Состояния описываются перечислением LimitStateCond

typedef enum
{
	LIMIT_INITIAL,				/* initial state for LIMIT node */
	LIMIT_RESCAN,				/* rescan after recomputing parameters */
	LIMIT_EMPTY,				/* there are no returnable rows */
	LIMIT_INWINDOW,				/* have returned a row in the window */
	LIMIT_WINDOWEND_TIES,		/* have returned a tied row */
	LIMIT_SUBPLANEOF,			/* at EOF of subplan (within window) */
	LIMIT_WINDOWEND,			/* stepped off end of window */
	LIMIT_WINDOWSTART			/* stepped off beginning of window */
} LimitStateCond;

Для LIMIT 50 переходы в процессе работы будут следующими:

LIMIT_INITIAL

Это начальное состояние. Здесь только вызывается recompute_limits, чтобы вычислить значения offset (сколько пропустить) и count (сколько взять). Из этого состояния мы сразу переходим в LIMIT_RESCAN.

LIMIT_RESCAN

Это основное рабочее состояние. Здесь мы вызываем подплан (получаем от него кортежи) до того момента пока не отправим указанное количество кортежей. Если предела достигли, то переходим в LIMIT_INWINDOW.

LIMIT_INWINDOW

Поведение в этом состоянии определяется опцией WITH TIES. В нашем случае, если мы попали сюда, то это означает, что мы отправили 50 строк и в таблице есть еще (так как не вернулся null).

WITH TIES мы не указывали, а значит следующее состояние - LIMIT_WINDOWEND.

LIMIT_WINDOWEND

Для нас это терминальное состояние (из этого состояния еще можно вернуть данные, но только если курсор движется в обратном направлении).

* Если случилось так, что из подплана вернулось меньше, чем указано в LIMIT, то следующее после LIMIT_RESCAN состояние - LIMIT_EMPTY. Это также терминальное состояние, но в отличие от LIMIT_WINDOWEND - всегда возвращает NULL (нет варианта передвинуться назад)

SORT

Следующий подузел - сортировка. Его состояние представляется структурой SortState

typedef struct SortState
{
	ScanState	ss;				/* its first field is NodeTag */
	bool		randomAccess;	/* need random access to sort output? */
	bool		bounded;		/* is the result set bounded? */
	int64		bound;			/* if bounded, how many tuples are needed */
	bool		sort_Done;		/* sort completed yet? */
	bool		bounded_Done;	/* value of bounded we did the sort with */
	int64		bound_Done;		/* value of bound we did the sort with */
	void	   *tuplesortstate; /* private state of tuplesort.c */
	bool		am_worker;		/* are we a worker? */
	bool		datumSort;		/* Datum sort instead of tuple sort? */
	SharedSortInfo *shared_info;	/* one entry per worker */
} SortState;

Реализация ExecProcNodeMtd - ExecSort (src/backend/executor/nodeSort.c).

Флаг sort_Done сигнализирует о том, была ли сортировка уже произведена (первый ли запуск). Если нет, то запускается логика сортировки.

Структура Tuplesortstate представляет собой состояние необходимое для выполнения сортировки. Сейчас важно знать, что она хранит в себе все кортежи, которые предстоит отсортировать. Также стоит подметить, что это часть реализации и закрыта для остальных (поэтому поле, в котором она хранится, имеет тип void*, см. выше)

struct Tuplesortstate
{
	TupSortStatus status;		/* enumerated value as shown above */
	int			nKeys;			/* number of columns in sort key */
	int			sortopt;		/* Bitmask of flags used to setup sort */
	bool		bounded;		/* did caller specify a maximum number of
								 * tuples to return? */
	bool		boundUsed;		/* true if we made use of a bounded heap */
	int			bound;			/* if bounded, the maximum number of tuples */
	bool		tuples;			/* Can SortTuple.tuple ever be set? */
	int64		availMem;		/* remaining memory available, in bytes */
	int64		allowedMem;		/* total memory allowed, in bytes */
	int			maxTapes;		/* max number of input tapes to merge in each
								 * pass */
	int64		maxSpace;		/* maximum amount of space occupied among sort
								 * of groups, either in-memory or on-disk */
	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
								 * space, false when it's value for in-memory
								 * space */
	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
	MemoryContext maincontext;	/* memory context for tuple sort metadata that
								 * persists across multiple batches */
	MemoryContext sortcontext;	/* memory context holding most sort data */
	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */

	/*
	 * These function pointers decouple the routines that must know what kind
	 * of tuple we are sorting from the routines that don't need to know it.
	 * They are set up by the tuplesort_begin_xxx routines.
	 *
	 * Function to compare two tuples; result is per qsort() convention, ie:
	 * <0, 0, >0 according as a<b, a=b, a>b.  The API must match
	 * qsort_arg_comparator.
	 */
	SortTupleComparator comparetup;

	/*
	 * Function to copy a supplied input tuple into palloc'd space and set up
	 * its SortTuple representation (ie, set tuple/datum1/isnull1).  Also,
	 * state->availMem must be decreased by the amount of space used for the
	 * tuple copy (note the SortTuple struct itself is not counted).
	 */
	void		(*copytup) (Tuplesortstate *state, SortTuple *stup, void *tup);

	/*
	 * Function to write a stored tuple onto tape.  The representation of the
	 * tuple on tape need not be the same as it is in memory; requirements on
	 * the tape representation are given below.  Unless the slab allocator is
	 * used, after writing the tuple, pfree() the out-of-line data (not the
	 * SortTuple struct!), and increase state->availMem by the amount of
	 * memory space thereby released.
	 */
	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
							 SortTuple *stup);

	/*
	 * Function to read a stored tuple from tape back into memory. 'len' is
	 * the already-read length of the stored tuple.  The tuple is allocated
	 * from the slab memory arena, or is palloc'd, see readtup_alloc().
	 */
	void		(*readtup) (Tuplesortstate *state, SortTuple *stup,
							LogicalTape *tape, unsigned int len);

	/*
	 * Whether SortTuple's datum1 and isnull1 members are maintained by the
	 * above routines.  If not, some sort specializations are disabled.
	 */
	bool		haveDatum1;

	/*
	 * This array holds the tuples now in sort memory.  If we are in state
	 * INITIAL, the tuples are in no particular order; if we are in state
	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
	 * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
	 * H.  In state SORTEDONTAPE, the array is not used.
	 */
	SortTuple  *memtuples;		/* array of SortTuple structs */
	int			memtupcount;	/* number of tuples currently present */
	int			memtupsize;		/* allocated length of memtuples array */
	bool		growmemtuples;	/* memtuples' growth still underway? */

	/*
	 * Memory for tuples is sometimes allocated using a simple slab allocator,
	 * rather than with palloc().  Currently, we switch to slab allocation
	 * when we start merging.  Merging only needs to keep a small, fixed
	 * number of tuples in memory at any time, so we can avoid the
	 * palloc/pfree overhead by recycling a fixed number of fixed-size slots
	 * to hold the tuples.
	 *
	 * For the slab, we use one large allocation, divided into SLAB_SLOT_SIZE
	 * slots.  The allocation is sized to have one slot per tape, plus one
	 * additional slot.  We need that many slots to hold all the tuples kept
	 * in the heap during merge, plus the one we have last returned from the
	 * sort, with tuplesort_gettuple.
	 *
	 * Initially, all the slots are kept in a linked list of free slots.  When
	 * a tuple is read from a tape, it is put to the next available slot, if
	 * it fits.  If the tuple is larger than SLAB_SLOT_SIZE, it is palloc'd
	 * instead.
	 *
	 * When we're done processing a tuple, we return the slot back to the free
	 * list, or pfree() if it was palloc'd.  We know that a tuple was
	 * allocated from the slab, if its pointer value is between
	 * slabMemoryBegin and -End.
	 *
	 * When the slab allocator is used, the USEMEM/LACKMEM mechanism of
	 * tracking memory usage is not used.
	 */
	bool		slabAllocatorUsed;

	char	   *slabMemoryBegin;	/* beginning of slab memory arena */
	char	   *slabMemoryEnd;	/* end of slab memory arena */
	SlabSlot   *slabFreeHead;	/* head of free list */

	/* Memory used for input and output tape buffers. */
	size_t		tape_buffer_mem;

	/*
	 * When we return a tuple to the caller in tuplesort_gettuple_XXX, that
	 * came from a tape (that is, in TSS_SORTEDONTAPE or TSS_FINALMERGE
	 * modes), we remember the tuple in 'lastReturnedTuple', so that we can
	 * recycle the memory on next gettuple call.
	 */
	void	   *lastReturnedTuple;

	/*
	 * While building initial runs, this is the current output run number.
	 * Afterwards, it is the number of initial runs we made.
	 */
	int			currentRun;

	/*
	 * Logical tapes, for merging.
	 *
	 * The initial runs are written in the output tapes.  In each merge pass,
	 * the output tapes of the previous pass become the input tapes, and new
	 * output tapes are created as needed.  When nInputTapes equals
	 * nInputRuns, there is only one merge pass left.
	 */
	LogicalTape **inputTapes;
	int			nInputTapes;
	int			nInputRuns;

	LogicalTape **outputTapes;
	int			nOutputTapes;
	int			nOutputRuns;

	LogicalTape *destTape;		/* current output tape */

	/*
	 * These variables are used after completion of sorting to keep track of
	 * the next tuple to return.  (In the tape case, the tape's current read
	 * position is also critical state.)
	 */
	LogicalTape *result_tape;	/* actual tape of finished output */
	int			current;		/* array index (only used if SORTEDINMEM) */
	bool		eof_reached;	/* reached EOF (needed for cursors) */

	/* markpos_xxx holds marked position for mark and restore */
	long		markpos_block;	/* tape block# (only used if SORTEDONTAPE) */
	int			markpos_offset; /* saved "current", or offset in tape block */
	bool		markpos_eof;	/* saved "eof_reached" */

	/*
	 * These variables are used during parallel sorting.
	 *
	 * worker is our worker identifier.  Follows the general convention that
	 * -1 value relates to a leader tuplesort, and values >= 0 worker
	 * tuplesorts. (-1 can also be a serial tuplesort.)
	 *
	 * shared is mutable shared memory state, which is used to coordinate
	 * parallel sorts.
	 *
	 * nParticipants is the number of worker Tuplesortstates known by the
	 * leader to have actually been launched, which implies that they must
	 * finish a run that the leader needs to merge.  Typically includes a
	 * worker state held by the leader process itself.  Set in the leader
	 * Tuplesortstate only.
	 */
	int			worker;
	Sharedsort *shared;
	int			nParticipants;

	/*
	 * The sortKeys variable is used by every case other than the hash index
	 * case; it is set by tuplesort_begin_xxx.  tupDesc is only used by the
	 * MinimalTuple and CLUSTER routines, though.
	 */
	TupleDesc	tupDesc;
	SortSupport sortKeys;		/* array of length nKeys */

	/*
	 * This variable is shared by the single-key MinimalTuple case and the
	 * Datum case (which both use qsort_ssup()).  Otherwise, it's NULL.  The
	 * presence of a value in this field is also checked by various sort
	 * specialization functions as an optimization when comparing the leading
	 * key in a tiebreak situation to determine if there are any subsequent
	 * keys to sort on.
	 */
	SortSupport onlyKey;

	/*
	 * Additional state for managing "abbreviated key" sortsupport routines
	 * (which currently may be used by all cases except the hash index case).
	 * Tracks the intervals at which the optimization's effectiveness is
	 * tested.
	 */
	int64		abbrevNext;		/* Tuple # at which to next check
								 * applicability */

	/*
	 * These variables are specific to the CLUSTER case; they are set by
	 * tuplesort_begin_cluster.
	 */
	IndexInfo  *indexInfo;		/* info about index being used for reference */
	EState	   *estate;			/* for evaluating index expressions */

	/*
	 * These variables are specific to the IndexTuple case; they are set by
	 * tuplesort_begin_index_xxx and used only by the IndexTuple routines.
	 */
	Relation	heapRel;		/* table the index is being built on */
	Relation	indexRel;		/* index being built */

	/* These are specific to the index_btree subcase: */
	bool		enforceUnique;	/* complain if we find duplicate tuples */
	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */

	/* These are specific to the index_hash subcase: */
	uint32		high_mask;		/* masks for sortable part of hash code */
	uint32		low_mask;
	uint32		max_buckets;

	/*
	 * These variables are specific to the Datum case; they are set by
	 * tuplesort_begin_datum and used only by the DatumTuple routines.
	 */
	Oid			datumType;
	/* we need typelen in order to know how to copy the Datums. */
	int			datumTypeLen;

	/*
	 * Resource snapshot for time of sort start.
	 */
#ifdef TRACE_SORT
	PGRUsage	ru_start;
#endif
};

Сортировать мы можем либо скалярные (единичные) значения, либо целые кортежи. Для обобщения их логики и была создана абстракция Tuplesortstate. Для каждого типа сравнения, определен свой класс функций выполняющих определенные шаги: добавление и получение данных из состояния.

Например, вот так происходит добавление данных

if (node->datumSort)
{
    ExecClearTuple(slot);
    if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
                           &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
        ExecStoreVirtualTuple(slot);
}
else
    (void) tuplesort_gettupleslot(tuplesortstate,
                                  ScanDirectionIsForward(dir),
                                  false, slot, NULL);

Что такое Datum

Datum представляет собой обертку над “любым” типом данных, с которым мы можем работать. Он является псевдонимом, поэтому с сырым значением ничего полезного с ним сделать нельзя

/*
 * A Datum contains either a value of a pass-by-value type or a pointer to a
 * value of a pass-by-reference type.  Therefore, we require:
 *
 * sizeof(Datum) == sizeof(void *) == 4 or 8
 *
 * The macros below and the analogous macros for other types should be used to
 * convert between a Datum and the appropriate C type.
 */

typedef uintptr_t Datum;

Но для работы с ним определена куча макросов. Например, для работы с int есть 2 макроса

/*
 * DatumGetInt32
 *		Returns 32-bit integer value of a datum.
 */

#define DatumGetInt32(X) ((int32) (X))

/*
 * Int32GetDatum
 *		Returns datum representation for a 32-bit integer.
 */

#define Int32GetDatum(X) ((Datum) (X))

Вообще макросы определены для всех простых типов + собственные простые типы Postgres, например TransactionId

/*
 * DatumGetTransactionId
 *		Returns transaction identifier value of a datum.
 */

#define DatumGetTransactionId(X) ((TransactionId) (X))

/*
 * TransactionIdGetDatum
 *		Returns datum representation for a transaction identifier.
 */

#define TransactionIdGetDatum(X) ((Datum) (X))

Дополнительно определена отдельная структура NullableDatum, для работы со значениями допускающими null. Но единственные места, где явно она используется - работа с jit функциями и компиляция llvm.

typedef struct NullableDatum
{
#define FIELDNO_NULLABLE_DATUM_DATUM 0
	Datum		value;
#define FIELDNO_NULLABLE_DATUM_ISNULL 1
	bool		isnull;
	/* due to alignment padding this could be used for flags for free */
} NullableDatum;

Из примера выше, для добавления единичных значений в Tuplesortstate сначала получается Datum из подузла.

if (node->datumSort)
{
    for (;;)
    {
        slot = ExecProcNode(outerNode);

        if (TupIsNull(slot))
            break;
        slot_getsomeattrs(slot, 1);
        tuplesort_putdatum(tuplesortstate,
                           slot->tts_values[0],
                           slot->tts_isnull[0]);
    }
}

Для добавления значения используется tuplesort_putdatum. Как я уже сказал, явный NullableDatum используется не везде. Здесь флаги null сделаны в виде дополнительного массива.

Вот так Datum добавляется в Tuplesortstate (в качестве Datum выступают ссылки на кортежи, поэтому используется ссылочная интерпретация значения)

void
tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
{
    // ...
    if (isNull || !state->tuples)
	{
		/*
		 * Set datum1 to zeroed representation for NULLs (to be consistent,
		 * and to support cheap inequality tests for NULL abbreviated keys).
		 */
		stup.datum1 = !isNull ? val : (Datum) 0;
		stup.isnull1 = isNull;
		stup.tuple = NULL;		/* no separate storage */
		MemoryContextSwitchTo(state->sortcontext);
	}
	else
	{
		Datum		original = datumCopy(val, false, state->datumTypeLen);

		stup.isnull1 = false;
		stup.tuple = DatumGetPointer(original);
        // ...
    }
}

Но вот логика сортировки - единая для всех. Содержится она в tuplesort_performsort(src/backend/utils/sort/tuplesort.c).

Внутри происходит выбор между несколькими стратегиями сортировки, определяемыми перечислением TupSortStatus. Например, в начальном состоянии, когда памяти достаточно, мы можем использовать quick sort

/*
 * All tuples have been provided; finish the sort.
 */
void
tuplesort_performsort(Tuplesortstate *state)
{
    switch (state->status)
	{
		case TSS_INITIAL:

			/*
			 * We were able to accumulate all the tuples within the allowed
			 * amount of memory, or leader to take over worker tapes
			 */
			if (SERIAL(state))
			{
				/* Just qsort 'em and we're done */
				tuplesort_sort_memtuples(state);
				state->status = TSS_SORTEDINMEM;
			}
     
        // ...
    }
}

На этом сортировка окончена и ее состояние сохранилось в SortState. Теперь мы можем начать получать уже отсортированные кортежи.

Получение данных из Tuplesortstate также зависит от типа данных. В случае кортежей используется tuplesort_gettupleslot (src/backend/utils/sort/tuplesort.c).

Получение кортежа из отсортированного списка довольно тривиальное - возвращается следующий элемент из отсортированного массива (но это в случае если сортировка происходила в памяти)

switch (state->status)
{
    case TSS_SORTEDINMEM:
        Assert(forward || state->sortopt & TUPLESORT_RANDOMACCESS);
        Assert(!state->slabAllocatorUsed);
        if (forward)
        {
            if (state->current < state->memtupcount)
            {
                *stup = state->memtuples[state->current++];
                return true;
            }
            state->eof_reached = true;

            /*
             * Complain if caller tries to retrieve more tuples than
             * originally asked for in a bounded sort.  This is because
             * returning EOF here might be the wrong thing.
             */
            if (state->bounded && state->current >= state->bound)
                elog(ERROR, "retrieved too many tuples in a bounded sort");

            return false;
        }
    // ...
}

Теперь мы можем читать кортежи из отсортированного массива.

GROUP

Нижележащий узел - группировка. Стратегий группировки много. Предположим что мы используем группировку хэшированием. Структура ее представляющая - AggState

typedef struct AggState
{
	ScanState	ss;				/* its first field is NodeTag */
	List	   *aggs;			/* all Aggref nodes in targetlist & quals */
	int			numaggs;		/* length of list (could be zero!) */
	int			numtrans;		/* number of pertrans items */
	AggStrategy aggstrategy;	/* strategy mode */
	AggSplit	aggsplit;		/* agg-splitting mode, see nodes.h */
	AggStatePerPhase phase;		/* pointer to current phase data */
	int			numphases;		/* number of phases (including phase 0) */
	int			current_phase;	/* current phase number */
	AggStatePerAgg peragg;		/* per-Aggref information */
	AggStatePerTrans pertrans;	/* per-Trans state information */
	ExprContext *hashcontext;	/* econtexts for long-lived data (hashtable) */
	ExprContext **aggcontexts;	/* econtexts for long-lived data (per GS) */
	ExprContext *tmpcontext;	/* econtext for input expressions */
#define FIELDNO_AGGSTATE_CURAGGCONTEXT 14
	ExprContext *curaggcontext; /* currently active aggcontext */
	AggStatePerAgg curperagg;	/* currently active aggregate, if any */
#define FIELDNO_AGGSTATE_CURPERTRANS 16
	AggStatePerTrans curpertrans;	/* currently active trans state, if any */
	bool		input_done;		/* indicates end of input */
	bool		agg_done;		/* indicates completion of Agg scan */
	int			projected_set;	/* The last projected grouping set */
#define FIELDNO_AGGSTATE_CURRENT_SET 20
	int			current_set;	/* The current grouping set being evaluated */
	Bitmapset  *grouped_cols;	/* grouped cols in current projection */
	List	   *all_grouped_cols;	/* list of all grouped cols in DESC order */
	Bitmapset  *colnos_needed;	/* all columns needed from the outer plan */
	int			max_colno_needed;	/* highest colno needed from outer plan */
	bool		all_cols_needed;	/* are all cols from outer plan needed? */
	/* These fields are for grouping set phase data */
	int			maxsets;		/* The max number of sets in any phase */
	AggStatePerPhase phases;	/* array of all phases */
	Tuplesortstate *sort_in;	/* sorted input to phases > 1 */
	Tuplesortstate *sort_out;	/* input is copied here for next phase */
	TupleTableSlot *sort_slot;	/* slot for sort results */
	/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
	AggStatePerGroup *pergroups;	/* grouping set indexed array of per-group
									 * pointers */
	HeapTuple	grp_firstTuple; /* copy of first tuple of current group */
	/* these fields are used in AGG_HASHED and AGG_MIXED modes: */
	bool		table_filled;	/* hash table filled yet? */
	int			num_hashes;
	MemoryContext hash_metacxt; /* memory for hash table itself */
	struct LogicalTapeSet *hash_tapeset;	/* tape set for hash spill tapes */
	struct HashAggSpill *hash_spills;	/* HashAggSpill for each grouping set,
										 * exists only during first pass */
	TupleTableSlot *hash_spill_rslot;	/* for reading spill files */
	TupleTableSlot *hash_spill_wslot;	/* for writing spill files */
	List	   *hash_batches;	/* hash batches remaining to be processed */
	bool		hash_ever_spilled;	/* ever spilled during this execution? */
	bool		hash_spill_mode;	/* we hit a limit during the current batch
									 * and we must not create new groups */
	Size		hash_mem_limit; /* limit before spilling hash table */
	uint64		hash_ngroups_limit; /* limit before spilling hash table */
	int			hash_planned_partitions;	/* number of partitions planned
											 * for first pass */
	double		hashentrysize;	/* estimate revised during execution */
	Size		hash_mem_peak;	/* peak hash table memory usage */
	uint64		hash_ngroups_current;	/* number of groups currently in
										 * memory in all hash tables */
	uint64		hash_disk_used; /* kB of disk space used */
	int			hash_batches_used;	/* batches used during entire execution */

	AggStatePerHash perhash;	/* array of per-hashtable data */
	AggStatePerGroup *hash_pergroup;	/* grouping set indexed array of
										 * per-group pointers */

	/* support for evaluation of agg input expressions: */
#define FIELDNO_AGGSTATE_ALL_PERGROUPS 53
	AggStatePerGroup *all_pergroups;	/* array of first ->pergroups, than
										 * ->hash_pergroup */
	ProjectionInfo *combinedproj;	/* projection machinery */
	SharedAggInfo *shared_info; /* one entry per worker */
} AggState;

Реализация ExecProcNodeMtd - ExecAgg (src/backend/executor/nodeAgg.c).

Группировка может производиться несколькими стратегиями. Например, хэширование или сканирование уже отсортированных данных. Такие стратегии определяются перечислением AggStrategy

/*
 * AggStrategy -
 *	  overall execution strategies for Agg plan nodes
 *
 * This is needed in both pathnodes.h and plannodes.h, so put it here...
 */
typedef enum AggStrategy
{
	AGG_PLAIN,					/* simple agg across all input rows */
	AGG_SORTED,					/* grouped agg, input must be sorted */
	AGG_HASHED,					/* grouped agg, use internal hashtable */
	AGG_MIXED					/* grouped agg, hash and sort both used */
} AggStrategy;

В нашем случае будет использоваться группировка хэшированием - AGG_HASHED.

Общая логика получения следующего кортежа находится в agg_retrieve_hash_table_in_memory (src/backend/executor/nodeAgg.c).

Для хэштаблицы используемой в хэш агрегировании есть своя структура - TupleHashTable

typedef struct TupleHashTableData
{
	tuplehash_hash *hashtab;	/* underlying hash table */
	int			numCols;		/* number of columns in lookup key */
	AttrNumber *keyColIdx;		/* attr numbers of key columns */
	FmgrInfo   *tab_hash_funcs; /* hash functions for table datatype(s) */
	ExprState  *tab_eq_func;	/* comparator for table datatype(s) */
	Oid		   *tab_collations; /* collations for hash and comparison */
	MemoryContext tablecxt;		/* memory context containing table */
	MemoryContext tempcxt;		/* context for function evaluations */
	Size		entrysize;		/* actual size to make each hash entry */
	TupleTableSlot *tableslot;	/* slot for referencing table entries */
	/* The following fields are set transiently for each table search: */
	TupleTableSlot *inputslot;	/* current input tuple's slot */
	FmgrInfo   *in_hash_funcs;	/* hash functions for input datatype(s) */
	ExprState  *cur_eq_func;	/* comparator for input vs. table */
	uint32		hash_iv;		/* hash-function IV */
	ExprContext *exprcontext;	/* expression context */
}  TupleHashTableData;

typedef struct TupleHashTableData *TupleHashTable;

Элементы TupleHashTable - TupleHashEntryData

typedef struct TupleHashEntryData
{
	MinimalTuple firstTuple;	/* copy of first tuple in this group */
	void	   *additional;		/* user data */
	uint32		status;			/* hash status */
	uint32		hash;			/* hash value (cached) */
} TupleHashEntryData;

typedef struct TupleHashEntryData *TupleHashEntry;

Для AggState есть несколько типов состояний:

AggStatePerAgg - содержит данные для каждого набора кортежей, которые будут использованы для вычисления результирующего кортежа
AggStatePerGroup - представляет собой единый блок группировки
AggStatePerHash - состояние, используемое при стратегии AGG_HASH (в обычном случае имеется один экземпляр, но если используется GROUPING SET, то для каждой группы свое состояние)

Вначале мы итерируемся по всем вхождениям в хэш таблице и пытаемся найти первый TupleHashEntryData, удовлетворяющий HAVING.

Для получения следующего кортежа из хэш таблицы используется макрос ScanTupleHashTable.

Хэш-таблицы генерируются во время компиляции и для каждого отдельного варианта использования можно создать свою структуру хэш-таблицы. Причем можно задавать свой префикс для этой структуры. Для хэш таблицы кортежей используется префикс tuplehash и все операции с этой хэш таблицей и имеют префикс tuplehash

Когда элемент найден, вызывается project_aggregates. Эта функция заключает в себе проверку условия HAVING и результирующее отображение.

За проверку HAVING отвечает ExecQual. По факту она просто вызывает функцию, хранящуюся в поле evalfunc

typedef Datum (*ExprStateEvalFunc) (struct ExprState *expression,
									struct ExprContext *econtext,
									bool *isNull);

typedef struct ExprState
{
    // ...
  	/*
	 * Function that actually evaluates the expression.  This can be set to
	 * different values depending on the complexity of the expression.
	 */
	ExprStateEvalFunc evalfunc;
    // ...
}

Если условие выполнилось, то вызовом ExecProject создается результирующий кортеж, который и возвращается.

SeqScan

Последним этапом будет последовательное сканирование таблицы. Ее реализация была обсуждена в предыдущей статье. Поэтому сейчас просто напомню, что:

Реализация ExecProcNodeMtd - ExecSeqScan. Она использует обобщенную функцию ExecScan
В начале приобретается ссылка на необходимую таблицу
Для чтения данных используется менеджер буферов, из которого поочередно получаются буферы представляющие область памяти на диске

Конец

Теперь эта цепочка функций будет вызываться до того момента, пока не будут возвращены все кортежи.

PostgreSQL под капотом. Часть 6. Сложный SELECT запрос +13

Парсинг

FROM users AS u

SELECT u.name

WHERE u.age > 24

HAVING COUNT(*) = 1

Хранение

Заполнение

ORDER BY u.name

LIMIT 50

Предобработка

LIMIT

GROUP BY

SELECT …

Создание путей

Сортировка

Пути группировки

Создание планов

Планы группировки

Планы сортировки

План LIMIT/OFFSET

Создание результирующего плана

Этап выполнения

LIMIT

SORT

GROUP

SeqScan

Конец

Комментарии (0)