Git Source Code Study Series (2): git-init-db & git-update-cache #161

soapgu · 2022-08-10T02:23:27Z

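Preface

After some thought, I decided to trace git-init-db first.
Before reading the code, one more warm-up.
Create a folder containing a single file and see what git init does.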

PS D:\WorkSpace\PlayPen\emptygit> git init
Initialized empty Git repository in D:/WorkSpace/PlayPen/emptygit/.git/
PS D:\WorkSpace\PlayPen\emptygit> git status
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        readme.txt

nothing added to commit but untracked files present (use "git add" to track)
PS D:\WorkSpace\PlayPen\emptygit> tree .\.git\
Folder PATH listing for volume 软件
Volume serial number is 000000F7 50CE:F5A9
D:\WORKSPACE\PLAYPEN\EMPTYGIT\.GIT
├─hooks
├─info
├─objects
│  ├─info
│  └─pack
└─refs
    ├─heads
    └─tags
PS D:\WorkSpace\PlayPen\emptygit>

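The .git folder is created
The objects database is created, with no objects in it yet
Files in the working directory are in the untracked state
The refs folder is created, but it has no content
A config file is created with a handful of basic core settings
The index file does not exist yet

So much for what can be observed from the outside. Now let's analyze it from the code.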

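git-init-db source code analysis

Read the docs before the code

git-init-db
The documentation is very clear.
The main job is:
This simply creates an empty git object database - basically a .git
directory and .git/object/??/ directories.
The two environment variables can be ignored for now.
One more sentence is interesting:
"git-init-db" won't hurt an existing repository.
Presumably this means that running it again will not damage what is already there.

Now to the main topic, init-db.c

safe_create_dir(git_dir); creates the .git folder
create_default_files(git_dir); creates the default folders and files
Let's look at some code snippets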

	/*
	 * Create .git/refs/{heads,tags}
	 */
	strcpy(path + len, "refs");
	safe_create_dir(path);
	strcpy(path + len, "refs/heads");
	safe_create_dir(path);
	strcpy(path + len, "refs/tags");
	safe_create_dir(path);

	/*
	 * Create the default symlink from ".git/HEAD" to the "master"
	 * branch
	 */
	strcpy(path + len, "HEAD");
	if (symlink("refs/heads/master", path) < 0) {
		if (errno != EEXIST) {
			perror(path);
			exit(1);
		}
	}

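The whole refs directory is now in place.
In addition, HEAD is created as a symlink to refs/heads/master, which makes master the default branch even though refs/heads is still empty.
Still no sign of the object database? Hang on.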

set up the object store

	/*
	 * And set up the object store.
	 */
	sha1_dir = get_object_directory();
	len = strlen(sha1_dir);
	path = xmalloc(len + 40);
	memcpy(path, sha1_dir, len);

	safe_create_dir(sha1_dir);
	for (i = 0; i < 256; i++) {
		sprintf(path+len, "/%02x", i);
		safe_create_dir(path);
	}
	strcpy(path+len, "/pack");
	safe_create_dir(path);

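The code is still simple.
sha1_dir holds the path of the default object database directory (objects)
That directory is created
Then one directory is created per possible first byte of the SHA-1, in preparation for spreading objects across buckets later. All 256 directories are created up front. Recent versions of Git do not seem to work this way and appear to create them lazily instead. It doesn't matter here.
The pack directory is created along the way, which does match current versions

Done. Yes, really done. git-init-db turns out to be the simplest command; yesterday I picked the wrong one to start with.
OK, straight on to the next chapter:
git-update-cache --add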

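git-update-cache source code analysis
As usual, go through the documentation first: git-update-cache.txt
--add

Adds a file to the cache. This is the one I want to focus on.

--remove

Removes a file that is in the cache.

--refresh

Here is the original explanation: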
'--refresh' does not calculate a new sha1 file or bring the cache
up-to-date for mode/content changes. But what it does do is to
"re-match" the stat information of a file with the cache, so that you
can refresh the cache for a file that hasn't been changed but where
the stat entry is out of date
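In other words, refresh does not compute a new SHA-1. It only re-matches the stat information stored in the index against the file on disk. Put differently, when the cached stat data has gone "dirty" while the content is unchanged, a refresh brings it back in sync.

--cacheinfo

Original text:
'--cacheinfo' is used to register a file that is not in the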
current working directory. This is useful for minimum-checkout
merging.

To pretend you have a file with mode and sha1 at path, say:

$ git-update-cache --cacheinfo mode sha1 path
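That is, even when the file is not in the working directory, an object already in the object database can be added to the cache "out of thin air".
This makes no sense in terms of physical files, but it is easy to explain from the cache structure: a cache entry refers to an object in the object database, so the cache can in fact be decoupled from the working directory.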

--info-only

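'--info-only' is
useful when the file is available, but you do not wish to update the
object database.
This is the reverse case: the file exists on disk, but you do not want it written to the object database yet. It makes sense logically, though I can't think of a use case right now.

--force-remove
This one is also interesting: the entry can be removed from the cache even though the file itself has not been deleted. A bit like what people now call "social death".
--replace
In this version at least, this handles conflicts between a file name and a directory name. It took me a long time going back and forth to understand it: in that situation --add alone is refused, so --replace is needed.
file
This version does not accept . as a path. Fair enough, it keeps the logic simpler.
update-cache.c analysis
Start from the main function

hold_index_file_for_update
This prepares an index.lock file in readiness for the update. It uses
the cache_file struct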

struct cache_file {
	struct cache_file *next;
	char lockfile[PATH_MAX];
};

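The sneaky part is the next pointer, which points to the previous cache_file.
It seems to be related to rollback. It is off the main line for now, so I'll skip it.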

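Read the index file
entries = read_cache(); was analyzed yesterday, so I'll skip it
Parse the options
This mainly handles the leading -- options
--add: allow_add = 1;
--remove: allow_remove = 1;
--replace: allow_replace = 1;
--info-only: info_only = 1;
--force-remove: force_remove = 1;
These options simply set a flag

--refresh: calls refresh_cache() directly. Off the main line, to be looked at later
--cacheinfo: parses its three arguments and calls add_cacheinfo to add the entry directly. Also off the main line for now

Check the path
Calls verify_path(path)
This does not check whether the file exists; it checks whether the name is "legal", and it also guards against operating on the .git folder by mistake.
Forced removal (if requested)
If force_remove = 1 was set earlier, remove_file_from_cache is called here. It is the counterpart of add_cacheinfo; we'll analyze it later.
Add the file to the cache
Calls add_file_to_cache(path), and the loop iteration ends. Now we are in the core logic of update-cache.
But wait, what about remove? Why is there no dedicated function for it? You may have guessed it: remove is also implemented inside add_file_to_cache.

a) Removing a file
Speak of the devil. Here is the code: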

	status = lstat(path, &st);
	if (status < 0 || S_ISDIR(st.st_mode)) {
		/* When we used to have "path" and now we want to add
		 * "path/file", we need a way to remove "path" before
		 * being able to add "path/file".  However,
		 * "git-update-cache --remove path" would not work.
		 * --force-remove can be used but this is more user
		 * friendly, especially since we can do the opposite
		 * case just fine without --force-remove.
		 */
		if (status == 0 || (errno == ENOENT || errno == ENOTDIR)) {
			if (allow_remove)
				return remove_file_from_cache(path);
		}
		return error("open(\"%s\"): %s", path, strerror(errno));
	}

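This is a "genuine" operation on the working directory. lstat just fetches the file's status, so it is very cheap.
If the file is missing (or the path is now a directory) and --remove was given earlier, remove_file_from_cache is called here, the same function used in the forced-removal step above.

b) Adding a cache_entry
That's the remove logic done. Everything left is about add.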

	int size, namelen, option, status;
	struct cache_entry *ce;
	struct stat st;

	/* ... other logic ... */

	namelen = strlen(path);
	size = cache_entry_size(namelen);
	ce = xmalloc(size);
	memset(ce, 0, size);
	memcpy(ce->name, path, namelen);
	fill_stat_cache_info(ce, &st);
	ce->ce_mode = create_ce_mode(st.st_mode);
	ce->ce_flags = htons(namelen);

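cache_entry_size computes how much memory the cache_entry needs. The size is a function of the path length: a cache_entry is a fixed-size header followed by the variable-length name, so its total size is tied directly to the length of path.
The fields of the new cache_entry ce are then filled in.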
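
To make the size calculation concrete, here is a minimal sketch. demo_entry, demo_entry_size and demo_entry_new are names made up for illustration, and the field layout is only a stand-in for the real struct cache_entry. The rounding formula is how I read cache_entry_size, so treat it as an assumption. The real code also stores the name length through htons, which the sketch skips.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct cache_entry: fixed fields, then the name. */
struct demo_entry {
	unsigned int stat_info[10];	/* placeholder for ctime/mtime/dev/ino/... */
	unsigned char sha1[20];
	unsigned short flags;
	char name[];			/* flexible array member: path bytes follow */
};

/* Header + name + at least one NUL byte, rounded to a multiple of 8. */
static size_t demo_entry_size(size_t namelen)
{
	return (offsetof(struct demo_entry, name) + namelen + 8) & ~(size_t)7;
}

/* Allocate one zeroed block that holds both the header and the path. */
static struct demo_entry *demo_entry_new(const char *path)
{
	size_t namelen = strlen(path);
	struct demo_entry *e = calloc(1, demo_entry_size(namelen));

	assert(e != NULL);
	memcpy(e->name, path, namelen);	/* trailing NUL comes from calloc */
	e->flags = (unsigned short)namelen;
	return e;
}
```

The point of the single allocation is that an entry and its name live in one contiguous block, which is also what makes it possible to walk entries in a mapped index file by size alone.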

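Next comes a switch on st_mode combined with the file type mask
There is a small branch here
For a regular file, index_fd is executed.
Inside, the whole file content is mapped into buf via mmap

For a symlink, the link target is read in full.

The two branches converge here: both end up in write_sha1_file

c) Writing to the objects database
This is the core logic of write_sha1_file
It first calls write_sha1_file_prepare to work out the name of the file to be written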

char *write_sha1_file_prepare(void *buf,
			      unsigned long len,
			      const char *type,
			      unsigned char *sha1,
			      unsigned char *hdr,
			      int *hdrlen)
{
	SHA_CTX c;

	/* Generate the header */
	*hdrlen = sprintf((char *)hdr, "%s %lu", type, len)+1;

	/* Sha1.. */
	SHA1_Init(&c);
	SHA1_Update(&c, hdr, *hdrlen);
	SHA1_Update(&c, buf, len);
	SHA1_Final(sha1, &c);

	return sha1_file_name(sha1);
}

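First the SHA-1 digest is computed
The digest covers the header (type + length) followed by the content
The resulting filename looks like .git/objects/{2}/{38}, i.e. the first 2 hex digits as the directory and the remaining 38 as the file name
The logic in sha1_file_name and fill_sha1_path is quite tricky, so I won't expand on it here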
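
The two mechanical parts of that step, building the header and turning a digest into a path, can be sketched as follows. This is not the code of sha1_file_name or fill_sha1_path; demo_object_header and demo_sha1_path are hypothetical helpers, and the SHA-1 computation itself is left out so the sketch needs no crypto library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the "<type> <len>" header that is hashed ahead of the content.
 * Returns the header length including the terminating NUL byte,
 * mirroring the "+1" in write_sha1_file_prepare.
 */
static int demo_object_header(char *hdr, size_t hdrsize,
			      const char *type, unsigned long len)
{
	int n = snprintf(hdr, hdrsize, "%s %lu", type, len);

	assert(n >= 0 && (size_t)n < hdrsize);
	return n + 1;
}

/* Format a 20-byte digest as "<objdir>/xx/<38 hex digits>". */
static void demo_sha1_path(char *out, size_t outsize,
			   const char *objdir, const unsigned char sha1[20])
{
	static const char hex[] = "0123456789abcdef";
	size_t dlen = strlen(objdir);
	char *p;
	int i;

	/* dir + '/' + 2 digits + '/' + 38 digits + NUL */
	assert(dlen + 43 <= outsize);
	memcpy(out, objdir, dlen);
	p = out + dlen;
	*p++ = '/';
	for (i = 0; i < 20; i++) {
		*p++ = hex[sha1[i] >> 4];
		*p++ = hex[sha1[i] & 0xf];
		if (i == 0)
			*p++ = '/';	/* first byte becomes the directory */
	}
	*p = '\0';
}
```

For a 12-byte blob the header is the 8 bytes "blob 12" plus NUL, and the NUL is part of what gets hashed.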

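has_sha1_file checks for duplicates
1 Look in objects/{2}/{38}
2 Look in the packs, the compressed format of the object database. Off the main line, so not expanded here

Then a temporary file objects/obj_XXXXXX is opened
The content is compressed with deflate and written to the temporary file
Once that succeeds, ret = link(tmpfile, filename); hard-links the temporary file to its final name.
At this point the write is finally complete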
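
The write-to-temp-then-link pattern is worth a small sketch of its own. This is a simplified illustration under my own assumptions, not the body of write_sha1_file: demo_write_then_link is a hypothetical name, the deflate step is skipped, and the error handling is reduced to the minimum.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Write buf to a temporary file inside dir, then hard-link it to final_path.
 * Returns 0 on success or if final_path already exists, -1 on other errors.
 * (The real code deflates the data first; this sketch writes it as-is.)
 */
static int demo_write_then_link(const char *dir, const char *final_path,
				const void *buf, size_t len)
{
	char tmp[4096];
	int fd, n, ok, ret = -1;

	assert(dir && final_path && buf);
	n = snprintf(tmp, sizeof(tmp), "%s/obj_XXXXXX", dir);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;

	fd = mkstemp(tmp);		/* creates and opens a unique file */
	if (fd < 0)
		return -1;
	ok = write(fd, buf, len) == (ssize_t)len;
	if (close(fd) < 0)
		ok = 0;

	/* link() never overwrites: EEXIST means the object is already there */
	if (ok && (link(tmp, final_path) == 0 || errno == EEXIST))
		ret = 0;

	unlink(tmp);			/* the temporary name is no longer needed */
	return ret;
}
```

The appeal of this design is that a reader can never see a half-written object: the final name only appears once the content is complete, and since the name is derived from the content, an existing file with that name can be treated as success.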

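One more remark: with --info-only, write_sha1_file_prepare is called directly to produce the SHA-1 and that's it. Nothing is written to the object database afterwards.

d) Writing to the cache
It gets a little dizzying here: the object is in the object database, but we are not at the finish line yet. So far only a new cache_entry exists; it has not been "written back" into the in-memory active_cache.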

	option = allow_add ? ADD_CACHE_OK_TO_ADD : 0;
	option |= allow_replace ? ADD_CACHE_OK_TO_REPLACE : 0;
	return add_cache_entry(ce, option);

First look at option: its value comes from the --add/--replace flags seen earlier

Then cache_name_pos is called

int cache_name_pos(const char *name, int namelen)
{
	int first, last;

	first = 0;
	last = active_nr;
	while (last > first) {
		int next = (last + first) >> 1;
		struct cache_entry *ce = active_cache[next];
		int cmp = cache_name_compare(name, namelen, ce->name, ntohs(ce->ce_flags));
		if (!cmp)
			return next;
		if (cmp < 0) {
			last = next;
			continue;
		}
		first = next+1;
	}
	return -first-1;
}

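The half-understood cache_name_pos!

Read literally, the code is a binary search comparing against the names in the cache
If a match is found, it returns that entry's index, which is non-negative
If not, it finds the position that keeps the array sorted and returns it encoded as a negative value, -pos-1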
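
The return convention is easier to see on a plain array of strings. The sketch below follows the same loop as cache_name_pos but compares with strcmp instead of cache_name_compare, which is a simplification on my part; demo_name_pos is a made-up name.

```c
#include <assert.h>
#include <string.h>

/*
 * Binary search over a sorted array of names.
 * Found:     returns the index of the match (>= 0).
 * Not found: returns -(insert_pos) - 1, which is always < 0.
 */
static int demo_name_pos(const char **names, int nr, const char *name)
{
	int first = 0, last = nr;

	assert(nr >= 0);
	while (last > first) {
		int next = (first + last) >> 1;	/* midpoint */
		int cmp = strcmp(name, names[next]);

		if (!cmp)
			return next;
		if (cmp < 0)
			last = next;		/* keep the lower half */
		else
			first = next + 1;	/* keep the upper half */
	}
	return -first - 1;
}
```

Encoding the miss as -pos-1 rather than -pos is what keeps "not found, insert at 0" distinguishable from "found at 0". The caller recovers the insertion point with pos = -pos-1.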

	/* existing match? Just replace it */
	if (pos >= 0) {
		active_cache_changed = 1;
		active_cache[pos] = ce;
		return 0;
	}

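This part means: if the file is already in the cache, the entry is simply replaced

pos = -pos-1;
Next comes the reverse operation, decoding the insertion position, since by now a new entry has to be inserted

And next comes a piece of code I don't understand???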

	/*
	 * Inserting a merged entry ("stage 0") into the index
	 * will always replace all non-merged entries..
	 */
	if (pos < active_nr && ce_stage(ce) == 0) {
		while (ce_same_name(active_cache[pos], ce)) {
			ok_to_add = 1;
			if (!remove_cache_entry_at(pos))
				break;
		}
	}

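Since pos came from a lookup that found no identical entry, why check ce_same_name afterwards? Surely it can never match.
Then I read it over several more times!
#define ce_namelen(ce) (CE_NAMEMASK & ntohs((ce)->ce_flags))
So ce_flags carries more than the name length: the bits above the mask can differ, and something must be done with them! To work this out I'll have to look at the other places that write this flag
Let me make a prediction here; if I'm proven wrong, so be it

Since the comment already talks about a merged entry
could this be related to a conflict-resolution flag? To be revealed?
In any case, entries with the same name but different high flag bits are removed from the cache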
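
A possible reading, offered as a hedged sketch rather than a confirmed answer: as far as I can tell from cache.h of that era, the low 12 bits of ce_flags hold the name length and two higher bits hold a merge "stage" (0 for a merged entry, 1 to 3 for unmerged ones). The constants below are written from memory and renamed with a DEMO_ prefix, so treat them as assumptions; byte-order conversion is left out.

```c
#include <assert.h>

/* Assumed layout of the 16-bit flags word (mirrors CE_NAMEMASK etc.). */
#define DEMO_NAMEMASK	0x0fff
#define DEMO_STAGEMASK	0x3000
#define DEMO_STAGESHIFT	12

/* Pack a name length and a stage number into one flags value. */
static unsigned short demo_make_flags(unsigned namelen, unsigned stage)
{
	assert(namelen <= DEMO_NAMEMASK && stage <= 3);
	return (unsigned short)((stage << DEMO_STAGESHIFT) | namelen);
}

/* Low 12 bits: length of the path. */
static unsigned demo_namelen(unsigned short flags)
{
	return flags & DEMO_NAMEMASK;
}

/* Bits 12-13: merge stage. */
static unsigned demo_stage(unsigned short flags)
{
	return (flags & DEMO_STAGEMASK) >> DEMO_STAGESHIFT;
}
```

If that reading is right, it would explain the puzzle: the lookup compares the full flags, so a stage-0 entry and a stage-2 entry for the same path are different keys. The search can then report "not found" for stage 0 while entries with the same name but a higher stage sit right at the insertion point, which is what the ce_same_name loop cleans up.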

	if (!skip_df_check && check_file_directory_conflict(ce, pos, ok_to_replace)) {
		if (!ok_to_replace)
			return -1;
		pos = cache_name_pos(ce->name, ntohs(ce->ce_flags));
		pos = -pos-1;
	}

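This part handles conflicts where a file name and a directory name are prefixes of one another. Because conflicting entries are removed from the cache, pos is recomputed

The logic that follows is a little simpler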

	/* Make sure the array is big enough .. */
	if (active_nr == active_alloc) {
		active_alloc = alloc_nr(active_alloc);
		active_cache = xrealloc(active_cache, active_alloc * sizeof(struct cache_entry *));
	}

	/* Add it in.. */
	active_nr++;
	if (active_nr > pos)
		memmove(active_cache + pos + 1, active_cache + pos, (active_nr - pos - 1) * sizeof(ce));
	active_cache[pos] = ce;
	active_cache_changed = 1;

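Check whether the array has reached its allocated capacity active_alloc; if so, reallocate a larger one.
Then shift the later entries one slot to the right and place the new cache_entry at position pos in active_cache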
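
Here is the same grow-then-shift insertion on a small stand-alone array, as a sketch. struct demo_cache and demo_insert_at are hypothetical names; demo_alloc_nr copies the growth rule as I read it from alloc_nr in cache.h, which is an assumption worth checking against the source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of active_cache / active_nr / active_alloc. */
struct demo_cache {
	const char **items;
	int nr, alloc;
};

/* Growth rule as I read alloc_nr(): (x + 16) * 3 / 2. */
#define demo_alloc_nr(x) (((x) + 16) * 3 / 2)

/* Insert item at index pos, keeping every other element in order. */
static void demo_insert_at(struct demo_cache *c, int pos, const char *item)
{
	assert(pos >= 0 && pos <= c->nr);

	/* Make sure the array is big enough */
	if (c->nr == c->alloc) {
		c->alloc = demo_alloc_nr(c->alloc);
		c->items = realloc(c->items, (size_t)c->alloc * sizeof(*c->items));
		assert(c->items != NULL);
	}

	/* Shift items[pos..] one slot right; memmove handles the overlap */
	c->nr++;
	memmove(c->items + pos + 1, c->items + pos,
		(size_t)(c->nr - pos - 1) * sizeof(*c->items));
	c->items[pos] = item;
}
```

Only pointers are moved, never the entries themselves, so an insertion costs one memmove of at most nr pointers.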

e) Writing back the index
Done by the write_cache function
It is the reverse of read_cache
Write the cache_header
Write the entries
Write the SHA-1 checksum
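
For the first of those three parts, a hedged sketch of what the header bytes look like. The three-field layout and the "DIRC" signature are from my recollection of cache.h, so treat them as assumptions; demo_header and demo_write_header are invented names, and the version number is passed in rather than guessed.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SIGNATURE 0x44495243u	/* the bytes "DIRC" */

/* Assumed header layout: three 32-bit fields in network byte order. */
struct demo_header {
	uint32_t signature;
	uint32_t version;
	uint32_t entries;
};

/* Serialize the header into the first 12 bytes of an index image. */
static void demo_write_header(unsigned char out[12],
			      uint32_t version, uint32_t entries)
{
	struct demo_header h;

	assert(out != NULL);
	h.signature = htonl(DEMO_SIGNATURE);
	h.version = htonl(version);
	h.entries = htonl(entries);
	memcpy(out, &h, sizeof(h));
}
```

Storing the fields in network byte order is what makes the file readable regardless of the byte order of the machine that wrote it.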

Note: all of these operations work on index.lock

There is one more step, commit_index_file
It renames index.lock to index
More questions than before, it seems

I'll write the summary tomorrow... I need to sort things out first...

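Taking stock

I spent the day on other things, then went back over Git Source Code Study Series (1) once more. The picture seems clearer now

What exactly is active_cache?
Thinking about active_cache again: first of all it is an array, but not an ordinary one. It is the core of the whole cache, so it can't be trivial.

Memory is allocated dynamically and efficiently. active_alloc is the number of slots allocated, and active_nr is the number actually in use. Whenever active_alloc == active_nr, a growth step is triggered. Each step grows the array to roughly 1.5 times its size plus a small constant (alloc_nr(x) is (x+16)*3/2), probably as a balance between performance and memory use.
The memory-shifting trick I finally understand
This is probably just a C thing; I blame myself for not working with it more.
The first place is reading the cache_entry records of the index: to get to the next record you just offset by the current ce_size! Very efficient, because the whole index file is already mapped into memory.
The second place is inserting a record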

	active_nr++;
	if (active_nr > pos)
		memmove(active_cache + pos + 1, active_cache + pos, (active_nr - pos - 1) * sizeof(ce));

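Look at memmove: it copies count bytes from the memory area at src to the memory area at dest, and it is safe even when the two areas overlap.
The literal description is still hard to digest. Here it shifts every pointer from pos onwards one slot to the right, which opens a gap at pos for the new entry.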

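An array that is always sorted
Unlike the arrays we are used to, this one stays sorted at all times. Every new item is inserted at its "correct position"

cache_name_pos does lookup and insertion-point search in one!

I didn't understand the >> 1 at first. It is a right shift: in binary, shifting right by one bit halves the value, so next is roughly the midpoint

There are basically three cases

Because the array is sorted,
binary search makes sense, and the three cases are exactly equal, less than and greater than
Depending on the case, either last drops to the midpoint or first moves just past it
Finally, when last and first meet, the search is over; even if nothing is found, the insertion position is known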

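There should be no tree in the cache!
Having turned the in-memory structure of the cache inside out, I find no concept of a tree in it, and of course no concept of a folder either. Not quite what I imagined.
It makes sense, though. The cache is built for efficiency. name already records the full path, so recording folders as well would be "redundant"
An interesting scenario just occurred to me: what happens if you modify a file so that it ends up exactly identical to some historical version?
No new object file is written
The deduplication logic was covered earlier.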

	if (has_sha1_file(sha1))
		return 0;

It returns right away: the SHA-1 already exists, so the entry simply refers to the "old" object. Very interesting.

Next steps

Probably commit

