Nov 11, 2024

Oracle's DELETE [FROM] (subquery) syntax

Deletes normally name the target table directly, but Oracle supports a special DELETE [FROM] (subquery) syntax that lets you specify the delete target through a subquery, something no other database appears to support. After digging into the details, I couldn't help admiring how thorough Oracle is.

1. Supported cases

Oracle's support for the DELETE [FROM] (subquery) syntax covers the following cases:

  1. Single-table subquery: the result of a single-table subquery can be used directly as the DELETE target.
  2. Multi-table JOIN subquery: DELETE based on a subquery containing a multi-table JOIN is supported.
  3. Updatable view: an updatable view is essentially a kind of subquery and is supported as well.

In every case, however, the subquery or updatable view must contain a "key-preserved table"; the key-preserved table is also the key to how this syntax is implemented.

What is a key-preserved table

In a multi-table JOIN query, if some table's primary or unique key is projected and remains unique in the JOIN result, that is, the JOIN does not break the key's uniqueness, then that table is called the key-preserved table of the query.

2. Analyzing Oracle's behavior

When executing DELETE [FROM] (subquery), Oracle determines the delete target from the structure of the subquery, and some interesting behavior shows up here.

Single-table subquery behavior

If the subquery contains only one table, Oracle allows the delete even if that table has no unique key. For example:

CREATE TABLE A (id int);

INSERT INTO A VALUES (1);

DELETE (SELECT * FROM A WHERE id = 1);

The more complex multi-table JOIN behavior

When DELETE [FROM] (subquery) contains a multi-table JOIN, the behavior gets complex. For example:

CREATE TABLE A (id int PRIMARY KEY);
CREATE TABLE B (id int, price int);

INSERT INTO A VALUES (1);
INSERT INTO B VALUES (1,1);

DELETE (SELECT * FROM A, B WHERE A.id = B.id);

In this example, if A.id is a primary key and table B has none, the DELETE removes rows from table B. If B.id is a primary key and table A has none, the DELETE removes rows from table A. If both A.id and B.id are primary keys, Oracle deletes from whichever key-preserved table comes first in the statement.

This is quite counter-intuitive: weren't rows supposed to be deleted from the "key-preserved table"? Why does the table without a primary key lose its rows instead? It actually makes sense: in the JOIN result, the unique key's values may appear multiple times, so the keyed table is not key-preserved there, while the multiple matched rows of the other table each keep a hidden unique key, Oracle's ROWID, so that table can be regarded as the key-preserved one.

For instance, in the following example table A has a primary key and the key is preserved in the query result, so A is the key-preserved table and the rows are deleted from A:

CREATE TABLE A (id int PRIMARY KEY);
INSERT INTO A VALUES (1);
INSERT INTO A VALUES (2);

CREATE TABLE B (id int, price int);
INSERT INTO B VALUES (1,1);
INSERT INTO B VALUES (1,1);

DELETE (SELECT A.id FROM A WHERE NOT EXISTS (SELECT 1 FROM B WHERE A.id = B.id));

3. How Oracle probably implements it

Oracle's implementation of DELETE [FROM] (subquery) relies on the key-preserved table to avoid ambiguity. The likely execution steps are:

  1. During analysis, determine whether the subquery or view contains a key-preserved table. If there are several, pick the first one; if there is none, introduce the ROWID as the unique key, in which case only one key-preserved table can exist (a complex single-table subquery without a unique key probably works the same way).
  2. Execute the subquery and delete the rows it selected from the chosen key-preserved table.

Closing thoughts

It sounds complicated, and implementing it is no simpler: deciding which table is key-preserved in a multi-table query takes genuinely intricate logic, and views such as USER_UPDATABLE_COLUMNS are related to it as well. One can only marvel at how complete Oracle is.


Nov 22, 2019

Why does offlineimap not work with launchd

TL;DR, don’t place your Maildir into Documents, Downloads or Desktop.

I’m a heavy user of offlineimap. Thanks to Homebrew, which provides quite a nice plist (short for “property list”) file for macOS’s launchd to start the service at login, offlineimap had always worked well, until macOS Catalina.

Days ago, after I upgraded the system to Catalina, offlineimap stopped working; no new mail came in at all. Although I appreciated the break, it was something that needed to be fixed.

  1. Run offlineimap -o directly: good, so it’s a launchd thing.
  2. Enable launchd logging: it reports “Operation not permitted” on os.listdir(). What?
  3. Check the UID: it has the permission.
  4. Check the effective UID: it’s the same as the UID.

After almost two hours of searching I found nothing interesting, so I went back to the question: “why does it stop working after upgrading to Catalina?”

The answer is right in macOS Catalina Release Notes:

Launch daemons and launch agents introduce new user privacy protections. Specifying privacy-sensitive files and folders in a launchd property list might not work as expected and prevent the service from running.

🙄 Good work, Apple.


Jun 30, 2018

Analyzing the Jump Consistent Hash algorithm

Jump Consistent Hash is a fast consistent hashing algorithm from Google with almost no memory footprint. This post is my translation and my way of reasoning through it.

Design goals of the hash algorithm

  1. Balance: objects are distributed evenly across all buckets.
  2. Monotonicity: when the number of buckets changes, only some objects move from old buckets to new buckets; no other movement (e.g. between old buckets) is needed.

Take a distributed storage system, for example: hash-based placement makes many optimizations convenient, balance avoids data skew, and monotonicity makes scaling out and scaling in faster.

How the algorithm is derived

Let's design it step by step. Let ch(key, num_buckets) be the hash function when there are num_buckets buckets. To meet the design goals we need:

  1. When num_buckets is 1, ch(key, 1) == 0 for every key: everything falls into bucket 0.
  2. When num_buckets goes from 1 to 2, ch(key, 2) should stay 0 for 1/2 of the keys and become 1 for the other 1/2.
  3. When num_buckets goes from n to n+1, ch(key, n+1) should stay unchanged for n/(n+1) of the keys and jump to the new bucket n for the remaining 1/(n+1).

So, take a uniformly distributed function, for example a pseudo-random number generator, make its state depend only on the key, and walk the bucket count up from one bucket to num_buckets; the final result is the hash value we want:

#include <stdlib.h>

int ch(int key, int num_buckets) {
    srand(key);                                  /* PRNG state depends only on the key */
    int b = 0;                                   /* with one bucket, everything is in bucket 0 */
    for (int j = 1; j < num_buckets; j++) {
        double r = (double)rand() / RAND_MAX;    /* uniform in [0, 1] */
        if (r < 1.0 / (j + 1))
            b = j;                               /* the key jumps to the new bucket j */
    }
    return b;
}

Obviously this is O(n), and as j grows, the probability that b = j executes gets lower and lower. Google's final algorithm is called Jump Consistent Hash, where "Jump" means that j jumps ahead. So how do they make j jump while keeping the result correct?

Think about it probabilistically. Suppose the previous result during the walk was b. The probability that the next result j is at least some i greater than b, i.e. the probability that we may jump to i, is the probability that the bucket index never changes while going from b+1 to i:

P(j>=i) = P(ch(key, b+1)==ch(key, i))

This is the product of the probabilities of the index staying unchanged at each consecutive step:

P(j>=i) = P(ch(key, b+1)==ch(key, b+2)) * P(ch(key, b+2)==ch(key, b+3)) * ... * P(ch(key, i-1)==ch(key, i))

Since the probability of staying unchanged in a single step is:

P(ch(key, i)==ch(key, i+1)) = i/(i+1)

the product over the consecutive steps telescopes to:

P(j>=i) = (b+1)/(b+2) * (b+2)/(b+3) * ... * (i-1)/i = (b+1)/i

Based on the earlier code, jumping to i when the random value r falls below (b+1)/i preserves these probabilities; in other words, for any r we can jump directly to floor((b+1)/r):

#include <math.h>
#include <stdlib.h>

int ch(int key, int num_buckets) {
    srand(key);                                  /* PRNG state depends only on the key */
    int b = -1;                                  /* last bucket the key landed in */
    int j = 0;                                   /* next candidate bucket */
    while (j < num_buckets) {
        b = j;
        double r = (rand() + 1.0) / (RAND_MAX + 1.0);   /* uniform in (0, 1] */
        j = (int)floor((b + 1) / r);
    }
    return b;
}

The final complete code

The final code does not call a library random function; instead it inlines a pseudo-random generator based on the "linear congruential method", which amounts to the same thing.

int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (b + 1) * (double(1LL << 31) / double((key >> 33) + 1));
    }
    return b;
}
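
As a quick sanity check of the two design goals (this demo is my own, not from the paper), we can count how many keys move when one bucket is added; monotonicity says every key that moves must land in the new bucket:

```c
#include <stdint.h>

/* Same algorithm as above, repeated so this demo is self-contained. */
static int32_t jump_hash(uint64_t key, int32_t num_buckets)
{
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (int64_t)((b + 1) * (double)(1LL << 31) / (double)((key >> 33) + 1));
    }
    return (int32_t)b;
}

/* Count how many of n sequential keys change buckets when growing
 * from `buckets` to `buckets + 1`. Expect roughly n / (buckets + 1). */
static int count_moved(int n, int32_t buckets)
{
    int moved = 0;
    for (uint64_t k = 0; k < (uint64_t)n; k++)
        if (jump_hash(k, buckets) != jump_hash(k, buckets + 1))
            moved++;
    return moved;
}
```

For 100000 sequential keys growing from 10 to 11 buckets, count_moved should report roughly 100000/11 ≈ 9090 keys moved, all of them into the new bucket 10.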

May 31, 2018

How Greenplum plans a JOIN

What is a “JOIN”

INNER JOIN is an AND, OUTER JOIN is an OR.

TableA        TableB
id name       id  name
-- ----       --  ----
1  Pirate     1   Rutabaga
2  Monkey     2   Pirate
3  Ninja      3   Darth Vader
4  Spaghetti  4   Ninja

INNER JOIN’s results are those tuples in table A AND B.

# SELECT * FROM TableA
INNER JOIN TableB
ON TableA.name = TableB.name;

 id |  name  | id |  name
----+--------+----+--------
  3 | Ninja  |  4 | Ninja
  1 | Pirate |  2 | Pirate
(2 rows)

FULL [OUTER] JOIN’s results are those tuples in table A OR B.

# SELECT * FROM TableA
FULL OUTER JOIN TableB
ON TableA.name = TableB.name;
 id |   name    | id |    name
----+-----------+----+-------------
  3 | Ninja     |  4 | Ninja
    |           |  3 | Darth Vader
  2 | Monkey    |    |
  1 | Pirate    |  2 | Pirate
    |           |  1 | Rutabaga
  4 | Spaghetti |    |
(6 rows)

LEFT [OUTER] JOIN’s results are those tuples in table A OR (A AND B).

# SELECT * FROM TableA
LEFT OUTER JOIN TableB
ON TableA.name = TableB.name;

 id |   name    | id |  name
----+-----------+----+--------
  1 | Pirate    |  2 | Pirate
  3 | Ninja     |  4 | Ninja
  2 | Monkey    |    |
  4 | Spaghetti |    |
(4 rows)

JOIN is not simple in Greenplum

Greenplum is (was?) a shared-nothing distributed data warehouse; data transfer (motion) between nodes costs a lot.

To have a better performance, Greenplum needs to avoid motions and loops as possible.

How does Greenplum choose the path

build_simple_rel()

This stage gathers the information about how data are distributed, the GpPolicy.

The data to store are partitioned onto the segment databases; the other data, for internal use such as the catalog, are not partitioned.

As for the entry database, I take it as the main backend, which holds either none or all of the data while processing, but never just a part of it.

/*
 * GpPolicyType represents a type of policy under which a relation's
 * tuples may be assigned to a component database.
 */
typedef enum GpPolicyType
{
    POLICYTYPE_PARTITIONED,     /* Tuples partitioned onto segment database. */
    POLICYTYPE_ENTRY            /* Tuples stored on entry database. */
} GpPolicyType;

set_base_rel_pathlists()

Finds all the paths available for scanning each base relation; sequential scans and any available indexes are considered.

RTE_FUNCTION: create_functionscan_path()
RTE_RELATION: create_external_path() / create_aocs_path() / create_seqscan_path() / create_index_paths() …
RTE_VALUES: create_valuesscan_path()

create_xxx_path()

Decides the Locus (the location, describing the distribution of a base relation).

The difference between the Entry policy and the Entry locus is that the entry database is a concept of how data are stored, while the entry locus is about where data come from.

For instance, the catalog’s locus is General and its policy is Entry; for a value being inserted, the locus is Entry, since the data go to the master first; generate_series()’s locus is SingleQE, as it could be planned onto any db.

typedef enum CdbLocusType
{
    CdbLocusType_Null,
    CdbLocusType_Entry,         /* a single backend process on the entry db:
                                 * usually the qDisp itself, but could be a
                                 * qExec started by the entry postmaster.
                                 */
    CdbLocusType_SingleQE,      /* a single backend process on any db: the
                                 * qDisp itself, or a qExec started by a
                                 * segment postmaster or the entry postmaster.
                                 */
    CdbLocusType_General,       /* compatible with any locus (data is
                                 * self-contained in the query plan or
                                 * generally available in any qExec or qDisp) */
    CdbLocusType_Replicated,    /* replicated over all qExecs of an N-gang */
    CdbLocusType_Hashed,        /* hash partitioned over all qExecs of N-gang */
    CdbLocusType_HashedOJ,      /* result of hash partitioned outer join */
    CdbLocusType_Strewn,        /* partitioned on no known function */
    CdbLocusType_End            /* = last valid CdbLocusType + 1 */
} CdbLocusType;

make_rel_from_joinlist()

Build access paths using a “joinlist” to guide the join path search: nestloop, merge or hash?

standard_join_search() -> set_cheapest()

for_each_cell(p, lnext(list_head(parent_rel->pathlist)))
{
	Path       *path = (Path *) lfirst(p);
	compare_path_costs(..., path, ...)
	if (BETTER)
		cheapest_total_path = path;
}

Quizzes

What and where are the motions when relations with Entry, SingleQE, General, Hashed and Strewn locuses JOIN and LEFT JOIN each other?


Aug 15, 2017

Fix "out of pty devices" in Guardian containers

Who are affected

Certain Linux distros used as Guardian containers, like SLES jobs on Concourse.

How to reproduce

Create a SLES 11 SP4 Guardian container, then

# useradd john
# su - john
$ python -c "import pty; pty.fork()"
...
OSError: out of pty devices

Why is this

A pseudo TTY (or “PTY”) is a pair of devices — a slave and a master — that provide a special sort of communication channel.

Linux has a virtual filesystem called devpts that is normally mounted at /dev/pts. Whenever a new pseudo TTY is created, a device node for the slave is created in that virtual filesystem.

“Unix98” Single Unix Specification (SUS) requires that the group ID of the slave device should not be that of the creating process, but rather some definite, though unspecified, value. The GNU C Library (glibc) takes responsibility for implementing this requirement; it quite reasonably chooses the ID of the group named “tty” (often 5) to fill this role. If the devpts filesystem is mounted with options gid=5,mode=620, this group ID and the required access mode will be used and glibc will be happy. If not, glibc will (if so configured) run a setuid helper found at /usr/libexec/pt_chown.

The problem is that some Linux distros don’t provide the setuid helper pt_chown; if devpts is not mounted with the option gid=5, the device node cannot be created with the tty group.
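
To see where this bites, here is a minimal C sketch of PTY allocation (my own illustration, not code from offlineimap or Guardian):

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Open a pseudo-TTY master and prepare its slave node. grantpt() is
 * the step where glibc enforces the Unix98 ownership rule: it chowns
 * the slave to the "tty" group, either directly (when devpts is
 * mounted with gid=5,mode=620) or by invoking the pt_chown setuid
 * helper. That chown is exactly the step that fails in the affected
 * containers. */
int open_pty_master(void)
{
    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0)
        return -1;
    if (grantpt(master) < 0 || unlockpt(master) < 0) {
        close(master);
        return -1;      /* this is where "out of pty devices" comes from */
    }
    printf("slave device: %s\n", ptsname(master));
    return master;
}
```

On a healthy system this prints a slave path such as /dev/pts/3; in the affected containers it is around the grantpt() step that things go wrong.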

Fix and workaround

Guardian’s default mount options have been fixed now; the workaround is to add normal users to the tty group.

ref:

1, https://lwn.net/Articles/688809/
2, https://en.wikipedia.org/wiki/Single_UNIX_Specification


May 26, 2017

Understanding _GLIBCXX_USE_CXX11_ABI

Should a C++ program ship its own runtime when released? I think it is well worth it, because even with the same C++ standard, the ABI is not necessarily the same.

#include <iostream>

int main()
{
    std::string str = "Hello, world!\n";
    std::cout << str;

    return 0;
}

Compile this simple C++ program with g++ 6.2.0 against the C++98 standard, copy the binary onto CentOS 5, run it, and you get:

./a.out: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./a.out)

No particularly new features were used, and the C++98 standard was explicitly specified; the g++ 4.1.2 on CentOS certainly supports C++98 as well, so why the error?

Because starting with GCC 5.1 [1], std::string and std::list use the new C++11 ABI, even if you explicitly request an old standard.

The fix is not hard: manually define the _GLIBCXX_USE_CXX11_ABI macro to 0. Of course, to avoid further trouble (who knows where the next pitfall hides?), you can also ship your own runtime; thanks to the GCC Runtime Library Exception [2], even proprietary software may do so.

Why does GCC do this? I don't really get it. The official documentation says it is so that your code is guaranteed to link with C++11 code, even when you explicitly request an old standard. Surprise!

ref:

1, https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html
2, https://www.gnu.org/licenses/gcc-exception-3.1.en.html


Apr 30, 2016

Handling Make dependencies automatically

How Make works

In short, a Makefile specifies targets and their prerequisites. When Make runs, it checks the files' timestamps; if a target is older than one of its prerequisites, the target is rebuilt. This achieves on-demand builds and saves time.

What the problem is in practice

A simple example:

OBJS := foo.o bar.o

program: $(OBJS)
	gcc $(OBJS) -o program

%.o: %.c
	gcc -c $(CFLAGS) $< -o $@

This Makefile says that program is built from foo.c and bar.c, with each intermediate object file depending on its corresponding source file. But real projects are never this simple; for example, each .o also depends on many different .h headers. What then? List them all by hand? Far too tedious, and the dependencies may change over time.

Can the dependencies be generated automatically

Of course. Straight to the code:

OBJS := foo.o bar.o

program: $(OBJS)
	gcc $(OBJS) -o program

-include $(OBJS:.o=.d)

%.o: %.c
	gcc -MD -MP -c $(CFLAGS) $< -o $@

Compared with the previous version, compiling each .o now also produces a .d file in Makefile syntax, listing the dependencies gcc discovered. Once these .d files are included, Make knows the detailed dependency information and can tell which prerequisites have changed and which targets need to be rebuilt.

PS: if you don't want system headers listed among the dependencies, replace -MD with -MMD.
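
For reference, the foo.d that gcc -MD -MP emits might look like this (foo.h and common.h are hypothetical headers included by foo.c):

```make
foo.o: foo.c foo.h common.h

# -MP additionally emits an empty phony rule for every header, so
# Make does not abort when a header file is deleted or renamed:
foo.h:

common.h:
```

Without -MP, deleting foo.h would make Make fail with “No rule to make target 'foo.h'”.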


Mar 14, 2016

Perf: a performance analysis tool on Linux

What is Perf

Perf is a software performance analysis tool tightly integrated with the Linux kernel.

How Perf works

Perf's basic principle is to hook events and sample the analyzed target, collecting data for analysis. For example, sampling the context on timer interrupts profiles how long functions run, and sampling on cache misses measures how well the caches work. Perf also supports events such as page faults, context switches and specific kernel functions; see perf list for the full set.

How to use Perf

  • perf stat ./a.out or perf stat -p 1234: analyze a program's overall performance, including CPU utilization, context switch counts, cache behavior and so on.

  • perf top: like the top command, shows the current state of the whole system, e.g. for finding the most time-consuming user process or kernel function right now.

  • perf record and perf report: record performance events over a period of time, then analyze them.

  • perf --help :)

Happy π day!

This is only a brief introduction; the point is that Perf is powerful and easy to use, and that this blog gets an update. Also, happy Pi Day!

ref:

1, https://perf.wiki.kernel.org/
2, http://www.brendangregg.com/perf.html


Aug 31, 2015

Why the tortoise-and-hare algorithm works

What is the tortoise-and-hare algorithm

A classic algorithm for detecting a cycle in a singly linked list is "Floyd's cycle-finding algorithm", also known as the "tortoise and hare" algorithm.

The basic idea

Is there a cycle

Start a tortoise and a hare from the head of the list; each round the tortoise moves one step and the hare moves two. If they meet after time_a rounds, the list has a cycle.

The cycle's length

When the cycle has just been detected, the tortoise and the hare stand on the same node. Now keep the hare still and let the tortoise keep walking; they are bound to meet again, and the number of steps the tortoise walks this time is exactly the cycle's length, length.

The cycle's start

Back at the meeting node from before: keep the hare where it is, put the tortoise back at the head of the list, then let both walk one step per round. After time_b rounds they meet again, and the node where they meet is the cycle's start, start.

Why the algorithm works

Suppose the hare walks step steps per round. As long as the list contains a cycle, then once the tortoise has entered the cycle, the two meet whenever the difference in the distance they have traveled is a multiple of the cycle length length. That is:

((step - 1) * time_a) % length == 0  && time_a >= start

Here start and length are fixed while time_a can grow without bound, so the divisibility condition is eventually met, which is why the algorithm can detect a cycle.

Skipping the obvious length computation, let's look at finding the cycle's start. Precisely when step is 2, i.e. the hare walks two steps per round, the condition above implies that time_a is divisible by length. Next, suppose time_b means the tortoise reaches the start after start steps and then goes around the cycle n more times, i.e. time_b == start + n * length. The hare's position at that moment (say it has gone around m times) is time_a + time_b - m * length, which substitutes to time_a + start + (n - m) * length. Since time_a is divisible by length when step is 2, the hare then sits exactly at the start node start, so the tortoise and the hare meet there, and the algorithm finds the cycle's start.
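
The phases above can be sketched in C as follows (a minimal illustration of my own, with a bare list node):

```c
#include <stddef.h>

struct node { struct node *next; };

/* Return the node where the cycle starts, or NULL if there is no
 * cycle. Phase 1 is the detection walk (tortoise one step, hare
 * two); the length-measuring phase is skipped since it is not
 * needed to find the start; the last phase resets the tortoise to
 * the head and advances both one step per round until they meet
 * at the cycle's start. */
struct node *cycle_start(struct node *head)
{
    struct node *tortoise = head, *hare = head;

    do {                               /* phase 1: detect the cycle */
        if (hare == NULL || hare->next == NULL)
            return NULL;               /* fell off the end: no cycle */
        tortoise = tortoise->next;
        hare = hare->next->next;
    } while (tortoise != hare);

    tortoise = head;                   /* last phase: find the start */
    while (tortoise != hare) {
        tortoise = tortoise->next;
        hare = hare->next;
    }
    return tortoise;
}
```

With a five-node list whose last node points back to the third node, cycle_start returns that third node.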


Apr 26, 2015

Hybrid bootable USB stick for legacy BIOS & UEFI

This is a quick note for people who are googling this topic right now. You are welcome.

Backgrounds

  • Legacy BIOS needs bootloaders to be installed into MBR and the reserved space after it.

  • UEFI only needs bootloaders in the ESP (EFI System Partition).

  • There is no resource competition between legacy BIOS and UEFI requirements.

Preparations

  • grub-pc-bin and grub-efi-amd64-bin packages for Debian users.

  • A USB storage stick, MBR or GPT, with some unallocated space (1MB is enough) at the beginning of the disk if you are using GPT.

  • A vfat partition with the “boot” flag.

Cooking

# mount /dev/sdb1 /media/sdb1

# grub-install --target=i386-pc --boot-directory=/media/sdb1/boot /dev/sdb

# grub-install --target=x86_64-efi --efi-directory=/media/sdb1/ --boot-directory=/media/sdb1/boot --removable /dev/sdb

Configuration(just a template)

# /media/sdb1/boot/grub/grub.cfg
set root=(hd0,1)

menuentry "FreeDOS" {
        linux16  /boot/freedos/memdisk
        initrd16 /boot/freedos/balder10.img
}
menuentry "Memtest86+" {
        loopback loop /boot/systemrescuecd-x86-x.y.z.iso
        linux16 (loop)/bootdisk/memtestp
}
menuentry "Ubuntu xx.yy" {
        set isofile="/boot/ubuntu-xx.yy-desktop-amd64.iso"
        loopback loop $isofile
        linux (loop)/casper/vmlinuz.efi boot=casper iso-scan/filename=$isofile noprompt noeject
        initrd (loop)/casper/initrd.lz
}
menuentry "SystemRescueCd" {
        set isofile="/boot/systemrescuecd-x86-x.y.z.iso"
        loopback loop $isofile
        linux   (loop)/isolinux/rescue64 docache isoloop=$isofile
        initrd  (loop)/isolinux/initram.igz
}
menuentry "Install Debian GNU/Linux" {
        linux   /boot/debian/vmlinuz priority=low recommends=false
        initrd  /boot/debian/initrd.gz
}
menuentry "Debian GNU/Linux Rescue Mode" {
        linux   /boot/debian/vmlinuz priority=low recommends=false rescue/enable=true
        initrd  /boot/debian/initrd.gz
}

Done. Enjoy fixing other people’s computers, a pastime I have already quit :D