Open vSwitch virtual switch (http://openvswitch.org/)
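I have heard that Xen Cloud Platform uses Open vSwitch to manage the network interfaces between its virtual machines. After skimming the documentation, my impression is that its main strength is a convenient management and control interface: the host can easily control the networking of its VMs, for example moving traffic from one NIC to another, and reconfiguring the vswitch dynamically should be easy. Operations such as inspecting the MAC table are probably awkward on a physical switch, but should be easy on a virtual device like a vswitch. In short, managing switch ports is convenient, especially when you have large groups of VMs and a lot of network configuration to do. Another feature is support for VLAN, OpenFlow, QoS traffic control and so on; other common switch management protocols seem to be supported as well.
I downloaded the code and took a rough look at how network packets flow through Open vSwitch. The key code is all under the datapath directory.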
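As shown in the figure above, Open vSwitch has four virtual ports: eth0, eth1, tap1 and tap2. These are presumably created by using the Open vSwitch control tools to attach the switch to existing system interfaces. Network devices such as tap1 (which in Xen would correspond to virtual devices like vif0) are yours to create, and you control how the virtual device handles packets: for example, when the vswitch forwards a packet to tap0, it is up to you to pass it on to the corresponding virtual machine through a hypercall. Open vSwitch itself is responsible for controlling the forwarding logic between the different network devices.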
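A key structure in the vswitch is struct vport, which represents one port of the switch. In the figure above, for example, the connection to eth0 is made through one of the vswitch's vports.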
================vport.h============================
/**
* struct vport - one port within a datapath
* @port_no: Index into @dp's @ports array.
* @dp: Datapath to which this port belongs.
* @kobj: Represents /sys/class/net/<devname>/brport.
* @linkname: The name of the link from /sys/class/net/<datapath>/brif to this
* &struct vport. (We keep this around so that we can delete it if the
* device gets renamed.) Set to the null string when no link exists.
* @node: Element in @dp's @port_list.
* @sflow_pool: Number of packets that were candidates for sFlow sampling,
* regardless of whether they were actually chosen and sent down to userspace.
* @hash_node: Element in @dev_table hash table in vport.c.
* @ops: Class structure.
* @percpu_stats: Points to per-CPU statistics used and maintained by the vport
* code if %VPORT_F_GEN_STATS is set to 1 in @ops flags, otherwise unused.
* @stats_lock: Protects @err_stats and @offset_stats.
* @err_stats: Points to error statistics used and maintained by the vport code
* if %VPORT_F_GEN_STATS is set to 1 in @ops flags, otherwise unused.
* @offset_stats: Added to actual statistics as a sop to compatibility with
* XAPI for Citrix XenServer. Deprecated.
*/
struct vport {
	u16 port_no;
	struct datapath *dp;
	struct kobject kobj;
	char linkname[IFNAMSIZ];
	struct list_head node;
	atomic_t sflow_pool;
	struct hlist_node hash_node;
	const struct vport_ops *ops;
	struct vport_percpu_stats *percpu_stats;
	spinlock_t stats_lock;
	struct vport_err_stats err_stats;
	struct rtnl_link_stats64 offset_stats;
};

#define VPORT_F_REQUIRED	(1 << 0) /* If init fails, module loading fails. */
#define VPORT_F_GEN_STATS	(1 << 1) /* Track stats at the generic layer. */
#define VPORT_F_FLOW		(1 << 2) /* Sets OVS_CB(skb)->flow. */
#define VPORT_F_TUN_ID		(1 << 3) /* Sets OVS_CB(skb)->tun_id. */

/**
* struct vport_parms - parameters for creating a new vport
*
* @name: New vport's name.
* @type: New vport's type.
* @config: Kernel copy of 'config' member of &struct odp_port describing
* configuration for new port. Exactly %VPORT_CONFIG_SIZE bytes.
* @dp: New vport's datapath.
* @port_no: New vport's port number.
*/
struct vport_parms {
	const char *name;
	const char *type;
	const void *config;

	/* For vport_alloc(). */
	struct datapath *dp;
	u16 port_no;
};

/**
* struct vport_ops - definition of a type of virtual port
*
* @type: Name of port type, such as "netdev" or "internal" to be matched
* against the device type when a new port needs to be created.
* @flags: Flags of type VPORT_F_* that influence how the generic vport layer
* handles this vport.
* @init: Called at module initialization. If VPORT_F_REQUIRED is set then the
* failure of this function will cause the module to not load. If the flag is
* not set and initialization fails then no vports of this type can be created.
* @exit: Called at module unload.
* @create: Create a new vport configured as specified. On success returns
* a new vport allocated with vport_alloc(), otherwise an ERR_PTR() value.
* @modify: Modify the configuration of an existing vport. May be null if
* modification is not supported.
* @destroy: Detach and destroy a vport.
* @set_mtu: Set the device's MTU. May be null if not supported.
* @set_addr: Set the device's MAC address. May be null if not supported.
* @set_stats: Provides stats as an offset to be added to the device stats.
* May be null if not supported.
* @get_name: Get the device's name.
* @get_addr: Get the device's MAC address.
* @get_kobj: Get the kobj associated with the device (may return null).
* @get_stats: Fill in the transmit/receive stats. May be null if stats are
* not supported or if generic stats are in use. If defined and
* VPORT_F_GEN_STATS is also set, the error stats are added to those already
* collected.
* @get_dev_flags: Get the device's flags.
* @is_running: Checks whether the device is running.
* @get_operstate: Get the device's operating state.
* @get_ifindex: Get the system interface index associated with the device.
* May be null if the device does not have an ifindex.
* @get_iflink: Get the system interface index associated with the device that
* will be used to send packets (may be different than ifindex for tunnels).
* May be null if the device does not have an iflink.
* @get_mtu: Get the device's MTU.
* @send: Send a packet on the device. Returns the length of the packet sent.
*/
struct vport_ops {
	const char *type;
	u32 flags;

	/* Called at module init and exit respectively. */
	int (*init)(void);
	void (*exit)(void);

	/* Called with RTNL lock. */
	struct vport *(*create)(const struct vport_parms *);
	int (*modify)(struct vport *, struct odp_port *);
	int (*destroy)(struct vport *);
	int (*set_mtu)(struct vport *, int mtu);
	int (*set_addr)(struct vport *, const unsigned char *);
	int (*set_stats)(const struct vport *, struct rtnl_link_stats64 *);

	/* Called with rcu_read_lock or RTNL lock. */
	const char *(*get_name)(const struct vport *);
	const unsigned char *(*get_addr)(const struct vport *);
	struct kobject *(*get_kobj)(const struct vport *);
	int (*get_stats)(const struct vport *, struct rtnl_link_stats64 *);
	unsigned (*get_dev_flags)(const struct vport *);
	int (*is_running)(const struct vport *);
	unsigned char (*get_operstate)(const struct vport *);
	int (*get_ifindex)(const struct vport *);
	int (*get_iflink)(const struct vport *);
	int (*get_mtu)(const struct vport *);
	int (*send)(struct vport *, struct sk_buff *);
};
========================================================
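It looks like you can implement your own vport type and register it with a datapath. The key is to implement the functions of the vport_ops interface, for example int (*send)(struct vport *, struct sk_buff *), which the vswitch calls to send a packet out of a port. Your own vport implementation in turn calls vport_receive to tell the vswitch core that a packet has arrived on this port and needs to pass through the switch. From the vswitch's point of view a port is just two data directions, send and receive; it does not seem to care about anything else.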
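To make the shape of that interface a bit more concrete, here is a small user-space sketch. It is not OVS code: `toy_packet`, `toy_port`, `toy_port_ops`, `capture_send` and `toy_switch_receive` are names made up for illustration, and the "forwarding decision" is simply "send to the peer port". It only shows the two directions described above: a port implementation hands incoming packets to the switch core, and the core pushes outgoing packets through the port's `send` callback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-ins; none of these names exist in OVS. */
struct toy_packet {
    unsigned char data[64];
    size_t len;
};

struct toy_port;

/* Per-type operations, loosely modelled on struct vport_ops. */
struct toy_port_ops {
    const char *type;
    /* Called by the switch to push a packet out of this port.
     * Returns the number of bytes sent. */
    int (*send)(struct toy_port *port, const struct toy_packet *pkt);
};

struct toy_port {
    const struct toy_port_ops *ops;
    struct toy_port *peer;      /* where this toy switch forwards to */
    size_t tx_bytes;            /* bytes accepted by ops->send */
    struct toy_packet last;     /* last packet "transmitted" */
};

/* A port type whose send() just records the packet. */
static int capture_send(struct toy_port *port, const struct toy_packet *pkt)
{
    port->last = *pkt;
    return (int)pkt->len;
}

static const struct toy_port_ops capture_ops = {
    .type = "capture",
    .send = capture_send,
};

/* Plays the role of vport_receive(): a port implementation calls this
 * when a packet arrives, and the switch core decides where it goes.
 * Here the decision is simply "forward to the peer port". */
static int toy_switch_receive(struct toy_port *in, const struct toy_packet *pkt)
{
    struct toy_port *out = in->peer;
    int sent;

    assert(pkt->len <= sizeof(pkt->data));
    if (!out || !out->ops || !out->ops->send)
        return 0;               /* nowhere to send: drop */
    sent = out->ops->send(out, pkt);
    if (sent > 0)
        out->tx_bytes += (size_t)sent;
    return sent;
}
```

A new port type in this toy only has to supply another `toy_port_ops` table; the switch core stays unchanged.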
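Next, let's see how packets flow once a net_device has been attached as a vswitch vport, and how this net_device type of vport is implemented.
First, the vswitch takes the interface name you pass in, such as eth0, and uses dev_get_by_name to find the corresponding net_device structure. It then registers an rx_handler function on that net_device, so that Linux calls our rx_handler whenever the net_device receives a packet.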
====================vport-netdev.c======================================
static struct vport *netdev_create(const struct vport_parms *parms)
{
	struct vport *vport;
	struct netdev_vport *netdev_vport;
	int err;

	vport = vport_alloc(sizeof(struct netdev_vport), &netdev_vport_ops, parms);
	if (IS_ERR(vport)) {
		err = PTR_ERR(vport);
		goto error;
	}

	netdev_vport = netdev_vport_priv(vport);

	netdev_vport->dev = dev_get_by_name(&init_net, parms->name);
	if (!netdev_vport->dev) {
		err = -ENODEV;
		goto error_free_vport;
	}

	if (netdev_vport->dev->flags & IFF_LOOPBACK ||
	    netdev_vport->dev->type != ARPHRD_ETHER ||
	    is_internal_dev(netdev_vport->dev)) {
		err = -EINVAL;
		goto error_put;
	}

	/* If we are using the vport stats layer initialize it to the current
	 * values so we are roughly consistent with the device stats. */
	if (USE_VPORT_STATS) {
		struct rtnl_link_stats64 stats;

		err = netdev_get_stats(vport, &stats);
		if (!err)
			vport_set_stats(vport, &stats);
	}

	err = netdev_rx_handler_register(netdev_vport->dev, netdev_frame_hook,
					 vport);
	if (err)
		goto error_put;

	dev_set_promiscuity(netdev_vport->dev, 1);
	dev_disable_lro(netdev_vport->dev);
	netdev_vport->dev->priv_flags |= IFF_OVS_DATAPATH;

	return vport;

error_put:
	dev_put(netdev_vport->dev);
error_free_vport:
	vport_free(vport);
error:
	return ERR_PTR(err);
}
================================================================
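Two patterns in netdev_create are worth isolating: the type-specific private data that lives behind the generic vport (vport_alloc(sizeof(struct netdev_vport), ...) followed by netdev_vport_priv(vport)), and the goto labels that undo only what has been set up so far. The following is a user-space sketch of both under my own assumptions. All `toy_*` names and the 8-byte alignment are invented for the example and say nothing about how vport_alloc is really implemented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; a user-space model of "base struct + private area". */
#define TOY_ALIGN 8

struct toy_vport {
    int port_no;
};

/* Round the base struct up so the private area starts on a TOY_ALIGN
 * boundary (an assumption that is enough for this example's data). */
#define TOY_BASE_SIZE \
    ((sizeof(struct toy_vport) + TOY_ALIGN - 1) / TOY_ALIGN * TOY_ALIGN)

/* One allocation holds the generic part and the type-specific part. */
static struct toy_vport *toy_vport_alloc(size_t priv_size, int port_no)
{
    struct toy_vport *vport = calloc(1, TOY_BASE_SIZE + priv_size);

    if (!vport)
        return NULL;
    vport->port_no = port_no;
    return vport;
}

/* The private area starts right after the padded base struct. */
static void *toy_vport_priv(struct toy_vport *vport)
{
    assert(vport != NULL);
    return (unsigned char *)vport + TOY_BASE_SIZE;
}

struct toy_netdev_priv {
    char name[16];
};

/* Same shape as netdev_create(): each failure jumps to a label that
 * releases only the resources acquired before it. */
static struct toy_vport *toy_netdev_create(const char *name, int port_no)
{
    struct toy_vport *vport;
    struct toy_netdev_priv *priv;

    vport = toy_vport_alloc(sizeof(*priv), port_no);
    if (!vport)
        goto error;

    priv = toy_vport_priv(vport);
    if (!name || strlen(name) >= sizeof(priv->name))
        goto error_free_vport;  /* stands in for "device not found" */
    strcpy(priv->name, name);   /* length was checked just above */
    return vport;

error_free_vport:
    free(vport);
error:
    return NULL;
}
```

The caller only ever holds a `struct toy_vport *`; code that knows the port type recovers its own data with `toy_vport_priv`.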
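rx_handler seems to have been introduced by a change in 2.6.36, apparently with the bridge implementation in mind. Before that, the kernel exported a br_handle_frame_hook function pointer and called it on the receive path to handle bridge-related logic. Judging from the current code, though, a net_device can have only one rx_handler registered. The Cisco VPN client I looked at earlier could also be implemented this way: it is an easy way to hook a device's receive point, and if the rx_handler consumes an skb, the kernel does not pass it any further.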
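The two properties mentioned here, one handler per device and "a consumed packet goes no further", can be modelled in a few lines of user-space C. This is only a sketch: `toy_dev`, `toy_skb`, `toy_receive` and the other `toy_*` names are hypothetical, the handler signature is simplified, and there is no RCU or locking as in the real kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space model; names are made up for illustration. */
struct toy_skb {
    int len;
};

struct toy_dev;
typedef struct toy_skb *toy_rx_handler_t(struct toy_dev *dev,
                                         struct toy_skb *skb);

struct toy_dev {
    toy_rx_handler_t *rx_handler;   /* at most one per device */
    int delivered_to_stack;         /* packets the handler did not take */
    int consumed;                   /* packets the handler took */
};

/* A second registration on the same device is refused with -EBUSY. */
static int toy_rx_handler_register(struct toy_dev *dev, toy_rx_handler_t *h)
{
    if (dev->rx_handler)
        return -EBUSY;
    dev->rx_handler = h;
    return 0;
}

static void toy_rx_handler_unregister(struct toy_dev *dev)
{
    dev->rx_handler = NULL;
}

/* Receive path: if the handler returns NULL it consumed the packet and
 * normal protocol processing is skipped. */
static void toy_receive(struct toy_dev *dev, struct toy_skb *skb)
{
    if (dev->rx_handler) {
        skb = dev->rx_handler(dev, skb);
        if (!skb)
            return;
    }
    dev->delivered_to_stack++;
}

/* A switch-style hook that takes every packet. */
static struct toy_skb *toy_switch_hook(struct toy_dev *dev,
                                       struct toy_skb *skb)
{
    (void)skb;
    dev->consumed++;
    return NULL;
}
```

With the hook registered, nothing reaches `delivered_to_stack` any more, which is the same effect the host sees once a device has been handed over to the vswitch.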
Here is the kernel code that registers the handler and calls it:
/**
 * netdev_rx_handler_register - register receive handler
 * @dev: device to register a handler for
 * @rx_handler: receive handler to register
 * @rx_handler_data: data pointer that is used by rx handler
 *
 * Register a receive handler for a device. This handler will then be
 * called from __netif_receive_skb. A negative errno code is returned
 * on a failure.
 *
 * The caller must hold the rtnl_mutex.
 */
int netdev_rx_handler_register(struct net_device *dev,
			       rx_handler_func_t *rx_handler,
			       void *rx_handler_data)
{
	ASSERT_RTNL();

	if (dev->rx_handler)
		return -EBUSY;

	rcu_assign_pointer(dev->rx_handler_data, rx_handler_data);
	rcu_assign_pointer(dev->rx_handler, rx_handler);

	return 0;
}
EXPORT_SYMBOL_GPL(netdev_rx_handler_register);

/**
 * netdev_rx_handler_unregister - unregister receive handler
 * @dev: device to unregister a handler from
 *
 * Unregister a receive handler from a device.
 *
 * The caller must hold the rtnl_mutex.
 */
void netdev_rx_handler_unregister(struct net_device *dev)
{

	ASSERT_RTNL();
	rcu_assign_pointer(dev->rx_handler, NULL);
	rcu_assign_pointer(dev->rx_handler_data, NULL);
}
EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);

static int __netif_receive_skb(struct sk_buff *skb)
{
	/* ... */

	/* Handle special case of bridge or macvlan */
	/* This is where __netif_receive_skb calls the registered handler. */
	rx_handler = rcu_dereference(skb->dev->rx_handler);
	if (rx_handler) {
		if (pt_prev) {
			ret = deliver_skb(skb, pt_prev, orig_dev);
			pt_prev = NULL;
		}
		skb = rx_handler(skb);
		if (!skb)
			goto out;
	}

	/* ... */
}
=======================vport-netdev.c================
static int netdev_init(void)
{
	/* Hook into callback used by the bridge to intercept packets.
	 * Parasites we are. */
	/* On older kernels, OVS still replaces the bridge hook exported by
	 * the kernel directly. */
	br_handle_frame_hook = netdev_frame_hook;

	return 0;
}

static struct sk_buff *netdev_frame_hook(struct sk_buff *skb)
{
	struct vport *vport;

	if (unlikely(skb->pkt_type == PACKET_LOOPBACK))
		return skb;

	vport = netdev_get_vport(skb->dev);

	netdev_port_receive(vport, skb);

	return NULL;
}

/* Must be called with rcu_read_lock. */
static void netdev_port_receive(struct vport *vport, struct sk_buff *skb)
{
	/* Make our own copy of the packet. Otherwise we will mangle the
	 * packet for anyone who came before us (e.g. tcpdump via AF_PACKET).
	 * (No one comes after us, since we tell handle_bridge() that we took
	 * the packet.) */
	skb = skb_share_check(skb, GFP_ATOMIC);
	if (unlikely(!skb))
		return;

	skb_warn_if_lro(skb);

	skb_push(skb, ETH_HLEN);
	compute_ip_summed(skb, false);

	/* Tell the switch core that data has arrived on this port. */
	vport_receive(vport, skb);
}
===========================vport.c=============================
/**
* vport_receive - pass up received packet to the datapath for processing
*
* @vport: vport that received the packet
* @skb: skb that was received
*
* Must be called with rcu_read_lock. The packet cannot be shared and
* skb->data should point to the Ethernet header. The caller must have already
* called compute_ip_summed() to initialize the checksumming fields.
*/
void vport_receive(struct vport *vport, struct sk_buff *skb)
{
	if (vport->ops->flags & VPORT_F_GEN_STATS) {
		struct vport_percpu_stats *stats;

		local_bh_disable();
		stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id());

		write_seqcount_begin(&stats->seqlock);
		stats->rx_packets++;
		stats->rx_bytes += skb->len;
		write_seqcount_end(&stats->seqlock);

		local_bh_enable();
	}

	if (!(vport->ops->flags & VPORT_F_FLOW))
		OVS_CB(skb)->flow = NULL;

	if (!(vport->ops->flags & VPORT_F_TUN_ID))
		OVS_CB(skb)->tun_id = 0;

	/* Hand the packet over to the datapath core for processing. */
	dp_process_received_packet(vport, skb);
}
============================datapath.c=========================================
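This function makes the processing decisions: it works out which flow the packet belongs to and then executes the corresponding actions, presumably according to your configuration. This is where the control core of Open vSwitch lives.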
/* Must be called with rcu_read_lock. */
void dp_process_received_packet(struct vport *p, struct sk_buff *skb)
{
	struct datapath *dp = p->dp;
	struct dp_stats_percpu *stats;
	int stats_counter_off;
	struct sw_flow_actions *acts;
	struct loop_counter *loop;
	int error;

	OVS_CB(skb)->vport = p;

	if (!OVS_CB(skb)->flow) {
		struct odp_flow_key key;
		struct tbl_node *flow_node;
		bool is_frag;

		/* Extract flow from 'skb' into 'key'. */
		error = flow_extract(skb, p ? p->port_no : ODPP_NONE, &key, &is_frag);
		if (unlikely(error)) {
			kfree_skb(skb);
			return;
		}

		if (is_frag && dp->drop_frags) {
			kfree_skb(skb);
			stats_counter_off = offsetof(struct dp_stats_percpu, n_frags);
			goto out;
		}

		/* Look up flow. */
		/* Presumably this finds the matching flow entry, e.g. whether
		 * the packet belongs to a particular TCP connection. */
		flow_node = tbl_lookup(rcu_dereference(dp->table), &key,
				       flow_hash(&key), flow_cmp);
		if (unlikely(!flow_node)) {
			dp_output_control(dp, skb, _ODPL_MISS_NR, OVS_CB(skb)->tun_id);
			stats_counter_off = offsetof(struct dp_stats_percpu, n_missed);
			goto out;
		}

		OVS_CB(skb)->flow = flow_cast(flow_node);
	}

	stats_counter_off = offsetof(struct dp_stats_percpu, n_hit);
	flow_used(OVS_CB(skb)->flow, skb);

	acts = rcu_dereference(OVS_CB(skb)->flow->sf_acts);

	/* Check whether we've looped too much. */
	loop = loop_get_counter();
	if (unlikely(++loop->count > MAX_LOOPS))
		loop->looping = true;
	if (unlikely(loop->looping)) {
		loop_suppress(dp, acts);
		kfree_skb(skb);
		goto out_loop;
	}

	/* Execute actions. */
	/* Presumably this applies the rules of the matched flow. */
	execute_actions(dp, skb, &OVS_CB(skb)->flow->key, acts->actions,
			acts->actions_len);

	/* Check whether sub-actions looped too much. */
	if (unlikely(loop->looping))
		loop_suppress(dp, acts);

out_loop:
	/* Decrement loop counter. */
	if (!--loop->count)
		loop->looping = false;
	loop_put_counter();

out:
	/* Update datapath statistics. */
	local_bh_disable();
	stats = per_cpu_ptr(dp->stats_percpu, smp_processor_id());

	write_seqcount_begin(&stats->seqlock);
	(*(u64 *)((u8 *)stats + stats_counter_off))++;
	write_seqcount_end(&stats->seqlock);

	local_bh_enable();
}
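The structure of dp_process_received_packet (build a key, look it up, count a hit or a miss, and pick the counter by offsetof) can be tried out in user space. The sketch below is my own simplification: the two-field `toy_key`, the linear-probing table and every `toy_*` name are assumptions for the example, and a miss just returns -1 where the real code hands the packet to dp_output_control.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified model; real OVS keys and tables are richer. */
struct toy_key {
    uint16_t in_port;
    uint8_t dl_dst[6];
};

struct toy_flow {
    struct toy_key key;
    int out_port;
    int used;                   /* how many packets hit this flow */
};

struct toy_stats {
    uint64_t n_hit;
    uint64_t n_missed;
};

#define TOY_TABLE_SIZE 16

struct toy_datapath {
    struct toy_flow *table[TOY_TABLE_SIZE]; /* linear probing */
    struct toy_stats stats;
};

static int toy_key_equal(const struct toy_key *a, const struct toy_key *b)
{
    return a->in_port == b->in_port && memcmp(a->dl_dst, b->dl_dst, 6) == 0;
}

static unsigned toy_hash(const struct toy_key *k)
{
    unsigned h = k->in_port;

    for (int i = 0; i < 6; i++)
        h = h * 31 + k->dl_dst[i];
    return h % TOY_TABLE_SIZE;
}

/* Returns 0 on success, -1 if the table is full. */
static int toy_insert(struct toy_datapath *dp, struct toy_flow *f)
{
    unsigned start = toy_hash(&f->key);

    for (unsigned i = 0; i < TOY_TABLE_SIZE; i++) {
        unsigned slot = (start + i) % TOY_TABLE_SIZE;

        if (!dp->table[slot]) {
            dp->table[slot] = f;
            return 0;
        }
    }
    return -1;
}

static struct toy_flow *toy_lookup(struct toy_datapath *dp,
                                   const struct toy_key *k)
{
    unsigned start = toy_hash(k);

    for (unsigned i = 0; i < TOY_TABLE_SIZE; i++) {
        struct toy_flow *f = dp->table[(start + i) % TOY_TABLE_SIZE];

        if (!f)
            return NULL;
        if (toy_key_equal(&f->key, k))
            return f;
    }
    return NULL;
}

/* Returns the output port, or -1 on a miss. */
static int toy_process(struct toy_datapath *dp, const struct toy_key *k)
{
    struct toy_flow *flow = toy_lookup(dp, k);
    size_t counter_off;
    int out = -1;

    if (!flow) {
        counter_off = offsetof(struct toy_stats, n_missed);
    } else {
        counter_off = offsetof(struct toy_stats, n_hit);
        flow->used++;
        out = flow->out_port;
    }
    /* Same trick as the original: bump whichever counter was selected. */
    (*(uint64_t *)((unsigned char *)&dp->stats + counter_off))++;
    return out;
}
```

Choosing the counter by offset keeps a single statistics update at the end of the function, whichever path the packet took.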
=========================actions.c============================
/* Execute a list of actions against 'skb'. */
int execute_actions(struct datapath *dp, struct sk_buff *skb,
		    const struct odp_flow_key *key,
		    const struct nlattr *actions, u32 actions_len)
{
	/* Every output action needs a separate clone of 'skb', but the common
	 * case is just a single output action, so that doing a clone and
	 * then freeing the original skbuff is wasteful. So the following code
	 * is slightly obscure just to avoid that. */
	int prev_port = -1;
	u32 priority = skb->priority;
	const struct nlattr *a;
	int rem, err;

	if (dp->sflow_probability) {
		struct vport *p = OVS_CB(skb)->vport;
		if (p) {
			atomic_inc(&p->sflow_pool);
			if (dp->sflow_probability == UINT_MAX ||
			    net_random() < dp->sflow_probability)
				sflow_sample(dp, skb, actions, actions_len, p);
		}
	}

	OVS_CB(skb)->tun_id = 0;

	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
		if (prev_port != -1) {
			/* The actions decide which port the packet leaves from. */
			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
			prev_port = -1;
		}

		switch (nla_type(a)) {
		case ODPAT_OUTPUT:
			prev_port = nla_get_u32(a);
			break;

		case ODPAT_CONTROLLER:
			err = output_control(dp, skb, nla_get_u64(a));
			if (err) {
				kfree_skb(skb);
				return err;
			}
			break;

		case ODPAT_SET_TUNNEL:
			OVS_CB(skb)->tun_id = nla_get_be64(a);
			break;

		case ODPAT_SET_DL_TCI:
			skb = modify_vlan_tci(dp, skb, key, a, rem);
			if (IS_ERR(skb))
				return PTR_ERR(skb);
			break;

		case ODPAT_STRIP_VLAN:
			skb = strip_vlan(skb);
			break;

		case ODPAT_SET_DL_SRC:
			skb = make_writable(skb, 0);
			if (!skb)
				return -ENOMEM;
			memcpy(eth_hdr(skb)->h_source, nla_data(a), ETH_ALEN);
			break;

		case ODPAT_SET_DL_DST:
			skb = make_writable(skb, 0);
			if (!skb)
				return -ENOMEM;
			memcpy(eth_hdr(skb)->h_dest, nla_data(a), ETH_ALEN);
			break;

		case ODPAT_SET_NW_SRC:
		case ODPAT_SET_NW_DST:
			skb = set_nw_addr(skb, key, a);
			break;

		case ODPAT_SET_NW_TOS:
			skb = set_nw_tos(skb, key, nla_get_u8(a));
			break;

		case ODPAT_SET_TP_SRC:
		case ODPAT_SET_TP_DST:
			skb = set_tp_port(skb, key, a);
			break;

		case ODPAT_SET_PRIORITY:
			skb->priority = nla_get_u32(a);
			break;

		case ODPAT_POP_PRIORITY:
			skb->priority = priority;
			break;

		case ODPAT_DROP_SPOOFED_ARP:
			if (unlikely(is_spoofed_arp(skb, key)))
				goto exit;
			break;
		}
		if (!skb)
			return -ENOMEM;
	}
exit:
	if (prev_port != -1)
		do_output(dp, skb, prev_port);
	else
		kfree_skb(skb);
	return 0;
}

static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port)
{
	struct vport *p;

	if (!skb)
		goto error;

	p = rcu_dereference(dp->ports[out_port]);
	if (!p)
		goto error;

	vport_send(p, skb);	/* Send the packet out of the port. */
	return;

error:
	kfree_skb(skb);
}
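The prev_port trick in execute_actions is easier to follow in isolation: an output is deferred until the next action shows up, and only then is a clone made, so the very common "single output" case sends the original packet without copying. Here is a user-space sketch of that idea, assuming a trivial packet type and two action kinds; all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: a "packet" is a heap object, cloning is a copy. */
struct toy_pkt {
    int vlan;
};

enum toy_action_type { TOY_OUTPUT, TOY_SET_VLAN };

struct toy_action {
    enum toy_action_type type;
    int arg;                        /* port number or VLAN id */
};

#define TOY_MAX_PORTS 8

struct toy_counters {
    int clones;                     /* how many copies were made */
    int sent;                       /* packets sent in total */
    int vlan_seen[TOY_MAX_PORTS];   /* VLAN of the last packet per port */
};

static struct toy_pkt *toy_clone(const struct toy_pkt *p,
                                 struct toy_counters *c)
{
    struct toy_pkt *copy = malloc(sizeof(*copy));

    if (copy) {
        *copy = *p;
        c->clones++;
    }
    return copy;
}

/* Consumes 'p', much as do_output() consumes the skb. */
static void toy_output(struct toy_pkt *p, int port, struct toy_counters *c)
{
    if (p && port >= 0 && port < TOY_MAX_PORTS) {
        c->vlan_seen[port] = p->vlan;
        c->sent++;
    }
    free(p);
}

/* Takes ownership of 'p'. */
static void toy_execute(struct toy_pkt *p, const struct toy_action *acts,
                        size_t n, struct toy_counters *c)
{
    int prev_port = -1;

    for (size_t i = 0; i < n; i++) {
        /* An earlier output is still pending and more actions follow,
         * so it must go out on a clone before 'p' changes further. */
        if (prev_port != -1) {
            toy_output(toy_clone(p, c), prev_port, c);
            prev_port = -1;
        }
        switch (acts[i].type) {
        case TOY_OUTPUT:
            prev_port = acts[i].arg;    /* defer: it may be the last */
            break;
        case TOY_SET_VLAN:
            p->vlan = acts[i].arg;
            break;
        }
    }
    /* The final pending output can use the original packet itself. */
    if (prev_port != -1)
        toy_output(p, prev_port, c);
    else
        free(p);
}
```

With one output action no clone is made at all; with two outputs exactly one clone is made, and the first port still sees the packet as it was before the later modification.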
=========================vport.c=======================================
/**
* vport_send - send a packet on a device
*
* @vport: vport on which to send the packet
* @skb: skb to send
*
* Sends the given packet and returns the length of data sent. Either RTNL
* lock or rcu_read_lock must be held.
*/
int vport_send(struct vport *vport, struct sk_buff *skb)
{
	int mtu;
	int sent;

	mtu = vport_get_mtu(vport);
	if (unlikely(packet_length(skb) > mtu && !skb_is_gso(skb))) {
		if (net_ratelimit())
			pr_warn("%s: dropped over-mtu packet: %d > %d\n",
				dp_name(vport->dp), packet_length(skb), mtu);
		goto error;
	}

	/* The send function supplied by this vport's type. */
	sent = vport->ops->send(vport, skb);

	if (vport->ops->flags & VPORT_F_GEN_STATS && sent > 0) {
		struct vport_percpu_stats *stats;

		local_bh_disable();
		stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id());

		write_seqcount_begin(&stats->seqlock);
		stats->tx_packets++;
		stats->tx_bytes += sent;
		write_seqcount_end(&stats->seqlock);

		local_bh_enable();
	}

	return sent;

error:
	kfree_skb(skb);
	vport_record_error(vport, VPORT_E_TX_DROPPED);
	return 0;
}
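vport_send is a thin generic wrapper: it rejects packets that exceed the MTU (unless they are GSO packets), delegates to the type's send callback, and counts only what was actually sent. A stripped-down user-space version of that wrapper might look like the sketch below. The `toy_*` names are hypothetical, `len` stands for whatever packet_length() returns in the original, and the per-CPU seqcount-protected statistics are reduced to plain counters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the generic send wrapper. */
struct toy_port_stats {
    unsigned long tx_packets;
    unsigned long tx_bytes;
    unsigned long tx_dropped;
};

struct toy_tx_port {
    int mtu;
    /* Type-specific send; returns the number of bytes sent. */
    int (*send)(struct toy_tx_port *port, int len);
    struct toy_port_stats stats;
};

static int toy_port_send(struct toy_tx_port *port, int len, int is_gso)
{
    int sent;

    /* Over-MTU packets are dropped unless segmentation offload will
     * split them up later. */
    if (len > port->mtu && !is_gso) {
        port->stats.tx_dropped++;
        return 0;
    }

    sent = port->send(port, len);

    /* Only count what the type-specific code really accepted. */
    if (sent > 0) {
        port->stats.tx_packets++;
        port->stats.tx_bytes += (unsigned long)sent;
    }
    return sent;
}

/* A trivial send callback that accepts everything. */
static int toy_wire_send(struct toy_tx_port *port, int len)
{
    (void)port;
    return len;
}
```

Keeping the MTU check and the accounting in the wrapper means each port type only has to implement the actual transmission.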
======================vport-netdev.c==================
Here is the send function of our net_device-attached vport type:
static int netdev_send(struct vport *vport, struct sk_buff *skb)
{
	struct netdev_vport *netdev_vport = netdev_vport_priv(vport);
	int len = skb->len;

	skb->dev = netdev_vport->dev;
	forward_ip_summed(skb);

	/* Put the packet on the device's transmit queue; the net_device
	 * sends it out to the external network. */
	dev_queue_xmit(skb);

	return len;
}
=============================vport-internal_dev.c=====================================
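With the netdev port type above, we can see that once a network device is used by the vswitch, it can no longer hand packets to the upper protocol layers in the normal way. For example, once eth0 has been taken over, the Linux kernel no longer receives eth0's packets directly. Open vSwitch gets them instead, and according to its rules it may forward them straight to the net_device of another vport. That other net_device may be the interface of a virtual machine, such as a vif network device in Xen, in which case the packet travels through the vif to the VM. The host itself therefore never sees the packets arriving on eth0. Open vSwitch also implements another vport type, however: internal_dev. This kind of vport registers a network device of its own. Through that device the host can send packets to the vswitch, and when the vport receives a packet from the vswitch it passes the packet up to the kernel protocol stack.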
static int internal_dev_recv(struct vport *vport, struct sk_buff *skb)
{
	struct net_device *netdev = netdev_vport_priv(vport)->dev;
	int len;

	skb->dev = netdev;	/* Hand the packet to the vport's network device. */
	len = skb->len;
	skb->pkt_type = PACKET_HOST;
	skb->protocol = eth_type_trans(skb, netdev);

	/* The net_device "receives" the packet and passes it up the stack. */
	if (in_interrupt())
		netif_rx(skb);
	else
		netif_rx_ni(skb);

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,29)
	netdev->last_rx = jiffies;
#endif

	return len;
}

const struct vport_ops internal_vport_ops = {
	.type = "internal",
	.flags = VPORT_F_REQUIRED | VPORT_F_GEN_STATS | VPORT_F_FLOW,
	.create = internal_dev_create,	/* Creates the vport. */
	.destroy = internal_dev_destroy,
	.set_mtu = netdev_set_mtu,
	.set_addr = netdev_set_addr,
	.get_name = netdev_get_name,
	.get_addr = netdev_get_addr,
	.get_kobj = netdev_get_kobj,
	.get_dev_flags = netdev_get_dev_flags,
	.is_running = netdev_is_running,
	.get_operstate = netdev_get_operstate,
	.get_ifindex = netdev_get_ifindex,
	.get_iflink = netdev_get_iflink,
	.get_mtu = netdev_get_mtu,
	/* Called when the vswitch wants to send a packet to this vport. */
	.send = internal_dev_recv,
};
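Note the inversion in the table above: the internal port's .send slot is filled with a function named internal_dev_recv, because a packet the switch sends to that port is a packet the host receives. The sketch below models the two port types side by side in user space. It is only an illustration with made-up `toy_*` names, and two counters stand in for the physical NIC and the host protocol stack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; the counters stand in for a NIC and the host stack. */
struct toy_world {
    int wire_tx;        /* frames put on the physical wire */
    int host_rx;        /* frames handed up to the host's protocol stack */
};

struct toy_vp;

struct toy_vp_ops {
    const char *type;
    int (*send)(struct toy_vp *vp, int len);
};

struct toy_vp {
    const struct toy_vp_ops *ops;
    struct toy_world *world;
};

/* "netdev" type: sending means transmitting on the attached device. */
static int toy_netdev_send(struct toy_vp *vp, int len)
{
    vp->world->wire_tx++;
    return len;
}

/* "internal" type: sending to the port means the host receives. */
static int toy_internal_recv(struct toy_vp *vp, int len)
{
    vp->world->host_rx++;
    return len;
}

static const struct toy_vp_ops toy_netdev_ops = {
    .type = "netdev",
    .send = toy_netdev_send,
};

static const struct toy_vp_ops toy_internal_ops = {
    .type = "internal",
    .send = toy_internal_recv,
};

/* The switch core never needs to know which kind of port it has. */
static int toy_vp_send(struct toy_vp *vp, int len)
{
    assert(vp && vp->ops && vp->ops->send);
    return vp->ops->send(vp, len);
}
```

From the switch's side both calls look identical; only the ops table decides whether the frame ends up on the wire or in the host's own stack.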
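Summary: after this quick read, the flow and rough implementation of the vswitch are a bit clearer. It works through the kernel's net_device structure, hooking the transmit and receive points of network devices, and then forwards packets between different net_devices and changes where they go. That is what a virtual switch does. The control logic inside it of course still takes a lot of work, but these techniques for moving skbs around between net_devices are worth learning from.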
================vport.h============================
/**
* struct vport - one port within a datapath
* @port_no: Index into @dp's @ports array.
* @dp: Datapath to which this port belongs.
* @kobj: Represents /sys/class/net/<devname>/brport.
* @linkname: The name of the link from /sys/class/net/<datapath>/brif to this
* &struct vport. (We keep this around so that we can delete it if the
* device gets renamed.) Set to the null string when no link exists.
* @node: Element in @dp's @port_list.
* @sflow_pool: Number of packets that were candidates for sFlow sampling,
* regardless of whether they were actually chosen and sent down to userspace.
* @hash_node: Element in @dev_table hash table in vport.c.
* @ops: Class structure.
* @percpu_stats: Points to per-CPU statistics used and maintained by the vport
* code if %VPORT_F_GEN_STATS is set to 1 in @ops flags, otherwise unused.
* @stats_lock: Protects @err_stats and @offset_stats.
* @err_stats: Points to error statistics used and maintained by the vport code
* if %VPORT_F_GEN_STATS is set to 1 in @ops flags, otherwise unused.
* @offset_stats: Added to actual statistics as a sop to compatibility with
* XAPI for Citrix XenServer. Deprecated.
*/
struct vport {
u16 port_no;
struct datapath *dp;
struct kobject kobj;
char linkname[IFNAMSIZ];
struct list_head node;
atomic_t sflow_pool;
struct hlist_node hash_node;
const struct vport_ops *ops;
struct vport_percpu_stats *percpu_stats;
spinlock_t stats_lock;
struct vport_err_stats err_stats;
struct rtnl_link_stats64 offset_stats;
};
#define VPORT_F_REQUIRED (1 << 0) /* If init fails, module loading fails. */
#define VPORT_F_GEN_STATS (1 << 1) /* Track stats at the generic layer. */
#define VPORT_F_FLOW (1 << 2) /* Sets OVS_CB(skb)->flow. */
#define VPORT_F_TUN_ID (1 << 3) /* Sets OVS_CB(skb)->tun_id. */
/**
* struct vport_parms - parameters for creating a new vport
*
* @name: New vport's name.
* @type: New vport's type.
* @config: Kernel copy of 'config' member of &struct odp_port describing
* configuration for new port. Exactly %VPORT_CONFIG_SIZE bytes.
* @dp: New vport's datapath.
* @port_no: New vport's port number.
*/
struct vport_parms {
const char *name;
const char *type;
const void *config;
/* For vport_alloc(). */
struct datapath *dp;
u16 port_no;
};
/**
* struct vport_ops - definition of a type of virtual port
*
* @type: Name of port type, such as "netdev" or "internal" to be matched
* against the device type when a new port needs to be created.
* @flags: Flags of type VPORT_F_* that influence how the generic vport layer
* handles this vport.
* @init: Called at module initialization. If VPORT_F_REQUIRED is set then the
* failure of this function will cause the module to not load. If the flag is
* not set and initialzation fails then no vports of this type can be created.
* @exit: Called at module unload.
* @create: Create a new vport configured as specified. On success returns
* a new vport allocated with vport_alloc(), otherwise an ERR_PTR() value.
* @modify: Modify the configuration of an existing vport. May be null if
* modification is not supported.
* @destroy: Detach and destroy a vport.
* @set_mtu: Set the device's MTU. May be null if not supported.
* @set_addr: Set the device's MAC address. May be null if not supported.
* @set_stats: Provides stats as an offset to be added to the device stats.
* May be null if not supported.
* @get_name: Get the device's name.
* @get_addr: Get the device's MAC address.
* @get_kobj: Get the kobj associated with the device (may return null).
* @get_stats: Fill in the transmit/receive stats. May be null if stats are
* not supported or if generic stats are in use. If defined and
* VPORT_F_GEN_STATS is also set, the error stats are added to those already
* collected.
* @get_dev_flags: Get the device's flags.
* @is_running: Checks whether the device is running.
* @get_operstate: Get the device's operating state.
* @get_ifindex: Get the system interface index associated with the device.
* May be null if the device does not have an ifindex.
* @get_iflink: Get the system interface index associated with the device that
* will be used to send packets (may be different than ifindex for tunnels).
* May be null if the device does not have an iflink.
* @get_mtu: Get the device's MTU.
* @send: Send a packet on the device. Returns the length of the packet sent.
*/
struct vport_ops {
const char *type;
u32 flags;
/* Called at module init and exit respectively. */
int (*init)(void);
void (*exit)(void);
/* Called with RTNL lock. */
struct vport *(*create)(const struct vport_parms *);
int (*modify)(struct vport *, struct odp_port *);
int (*destroy)(struct vport *);
int (*set_mtu)(struct vport *, int mtu);
int (*set_addr)(struct vport *, const unsigned char *);
int (*set_stats)(const struct vport *, struct rtnl_link_stats64 *);
/* Called with rcu_read_lock or RTNL lock. */
const char *(*get_name)(const struct vport *);
const unsigned char *(*get_addr)(const struct vport *);
struct kobject *(*get_kobj)(const struct vport *);
int (*get_stats)(const struct vport *, struct rtnl_link_stats64 *);
unsigned (*get_dev_flags)(const struct vport *);
int (*is_running)(const struct vport *);
unsigned char (*get_operstate)(const struct vport *);
int (*get_ifindex)(const struct vport *);
int (*get_iflink)(const struct vport *);
int (*get_mtu)(const struct vport *);
int (*send)(struct vport *, struct sk_buff *);
};
========================================================
是 可以自己實現vport埠,然後往相應的datapath上面的註冊的吧,關鍵是要實現vport_ops 這個介面的各個函數,比如int (*send)(struct vport *, struct sk_buff *);這個是vswitch用來往某個port上發送資料包的。然後你自己的vport的實現裡面調用 vport_receive這個函數通知vswitch核心你這個port有包要從switch通過了。在vswitch的角度看來,他就是關注一個埠 上面的發送和接受兩個資料流程向,其他的他不管了吧。
我們再看看把一個net_device 掛載為一個 vswitch的vport埠後的,資料包的流向是怎麼樣的,他這種 net_device 的vport是怎麼實現的。
首先,vswitch根據你傳過來的interface的名字,比如eth0,用dev_get_by_name找到對應的 net_device結構。然後給這個net_device註冊
rx_handler函數。這樣linux系統就會在這個net-device收到資料包的時候調用我們的rx_handler函數了。
====================vportnetdev.c======================================
static struct vport *netdev_create(const struct vport_parms *parms)
{
struct vport *vport;
struct netdev_vport *netdev_vport;
int err;
vport = vport_alloc(sizeof(struct netdev_vport), &netdev_vport_ops, parms);
if (IS_ERR(vport)) {
err = PTR_ERR(vport);
goto error;
}
netdev_vport = netdev_vport_priv(vport);
netdev_vport->dev = dev_get_by_name(&init_net, parms->name);
if (!netdev_vport->dev) {
err = -ENODEV;
goto error_free_vport;
}
if (netdev_vport->dev->flags & IFF_LOOPBACK ||
netdev_vport->dev->type != ARPHRD_ETHER ||
is_internal_dev(netdev_vport->dev)) {
err = -EINVAL;
goto error_put;
}
/* If we are using the vport stats layer initialize it to the current
* values so we are roughly consistent with the device stats. */
if (USE_VPORT_STATS) {
struct rtnl_link_stats64 stats;
err = netdev_get_stats(vport, &stats);
if (!err)
vport_set_stats(vport, &stats);
}
err = netdev_rx_handler_register(netdev_vport->dev, netdev_frame_hook,
vport);
if (err)
goto error_put;
dev_set_promiscuity(netdev_vport->dev, 1);
dev_disable_lro(netdev_vport->dev);
netdev_vport->dev->priv_flags |= IFF_OVS_DATAPATH;
return vport;
error_put:
dev_put(netdev_vport->dev);
error_free_vport:
vport_free(vport);
error:
return ERR_PTR(err);
}
================================================================
rx_handler 應該是2.6.36裡面改動後才有的,看樣子是專門用於brigde的橋接器實現而做的, 之前的都是直接在內核裡面匯出一個br_handle_frame_hook函數,然後內核在網路資料包的收包的地方調用這個函數來處理橋接器相關的邏輯。 不過看現在的代碼只能一個net-device註冊一個rx_handler函數的。之前看的cisco vpn用戶端,其實也可以用這種辦法來實現,輕鬆掛鉤某個網路設備的收包點,然後如果這個rx_handler消耗了某個skb,內核的代碼也是不會往下 繼續傳的。
看看內核裡明註冊處理函數相關的代碼:
2722/**
2723 * netdev_rx_handler_register - register receive handler
2724 * @dev: device to register a handler for
2725 * @rx_handler: receive handler to register
2726 * @rx_handler_data: data pointer that is used by rx handler
2727 *
2728 * Register a receive hander for a device. This handler will then be
2729 * called from __netif_receive_skb. A negative errno code is returned
2730 * on a failure.
2731 *
2732 * The caller must hold the rtnl_mutex.
2733 */
2734int netdev_rx_handler_register(struct net_device *dev,
2735 rx_handler_func_t *rx_handler,
2736 void *rx_handler_data)
2737{
2738 ASSERT_RTNL();
2739
2740 if (dev->rx_handler)
2741 return -EBUSY;
2742
2743 rcu_assign_pointer(dev->rx_handler_data, rx_handler_data);
2744 rcu_assign_pointer(dev->rx_handler, rx_handler);
2745
2746 return 0;
2747}
2748EXPORT_SYMBOL_GPL(netdev_rx_handler_register);
2749
2750/**
2751 * netdev_rx_handler_unregister - unregister receive handler
2752 * @dev: device to unregister a handler from
2753 *
2754 * Unregister a receive hander from a device.
2755 *
2756 * The caller must hold the rtnl_mutex.
2757 */
2758void netdev_rx_handler_unregister(struct net_device *dev)
2759{
2760
2761 ASSERT_RTNL();
2762 rcu_assign_pointer(dev->rx_handler, NULL);
2763 rcu_assign_pointer(dev->rx_handler_data, NULL);
2764}
2765EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
2817static int __netif_receive_skb(struct sk_buff *skb)
2818{
2894 /* Handle special case of bridge or macvlan */
2895 rx_handler = rcu_dereference(skb->dev->rx_handler); ///////__netif_receive_skb函數裡面會調用註冊的處理函數的
2896 if (rx_handler) {
2897 if (pt_prev) {
2898 ret = deliver_skb(skb, pt_prev, orig_dev);
2899 pt_prev = NULL;
2900 }
2901 skb = rx_handler(skb);
2902 if (!skb)
2903 goto out;
2904 }
=======================vport-netdev.c================
static int netdev_init(void)
{
/* Hook into callback used by the bridge to intercept packets.
* Parasites we are. */
br_handle_frame_hook = netdev_frame_hook; /////以前久版本內核,還是採用直接替換內核匯出的bridge的處理函數的辦法
return 0;
}
static struct sk_buff *netdev_frame_hook(struct sk_buff *skb)
{
struct vport *vport;
if (unlikely(skb->pkt_type == PACKET_LOOPBACK))
return skb;
vport = netdev_get_vport(skb->dev);
netdev_port_receive(vport, skb);
return NULL;
}
/* Must be called with rcu_read_lock. */
static void netdev_port_receive(struct vport *vport, struct sk_buff *skb)
{
/* Make our own copy of the packet. Otherwise we will mangle the
* packet for anyone who came before us (e.g. tcpdump via AF_PACKET).
* (No one comes after us, since we tell handle_bridge() that we took
* the packet.) */
skb = skb_share_check(skb, GFP_ATOMIC);
if (unlikely(!skb))
return;
skb_warn_if_lro(skb);
skb_push(skb, ETH_HLEN);
compute_ip_summed(skb, false);
vport_receive(vport, skb); ////////調用vport_receive 通知核心,我們這個埠有資料進來了
}
===========================vport.c=============================
/**
* vport_receive - pass up received packet to the datapath for processing
*
* @vport: vport that received the packet
* @skb: skb that was received
*
* Must be called with rcu_read_lock. The packet cannot be shared and
* skb->data should point to the Ethernet header. The caller must have already
* called compute_ip_summed() to initialize the checksumming fields.
*/
void vport_receive(struct vport *vport, struct sk_buff *skb)
{
if (vport->ops->flags & VPORT_F_GEN_STATS) {
struct vport_percpu_stats *stats;
local_bh_disable();
stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id());
write_seqcount_begin(&stats->seqlock);
stats->rx_packets++;
stats->rx_bytes += skb->len;
write_seqcount_end(&stats->seqlock);
local_bh_enable();
}
if (!(vport->ops->flags & VPORT_F_FLOW))
OVS_CB(skb)->flow = NULL;
if (!(vport->ops->flags & VPORT_F_TUN_ID))
OVS_CB(skb)->tun_id = 0;
dp_process_received_packet(vport, skb); //////////進去datapath核心裡面處理////////
}
============================datapath.c=========================================
這個函數裡面會進行處理邏輯判斷了,判斷netflow流類型,然後執行相應的控制規則action等等,根據你的配置來進行的吧。這裡面才是open vswitch的控制核心所在。
/* Must be called with rcu_read_lock. */
void dp_process_received_packet(struct vport *p, struct sk_buff *skb)
{
struct datapath *dp = p->dp;
struct dp_stats_percpu *stats;
int stats_counter_off;
struct sw_flow_actions *acts;
struct loop_counter *loop;
int error;
OVS_CB(skb)->vport = p;
if (!OVS_CB(skb)->flow) {
struct odp_flow_key key;
struct tbl_node *flow_node;
bool is_frag;
/* Extract flow from 'skb' into 'key'. */
error = flow_extract(skb, p ? p->port_no : ODPP_NONE, &key, &is_frag);
if (unlikely(error)) {
kfree_skb(skb);
return;
}
if (is_frag && dp->drop_frags) {
kfree_skb(skb);
stats_counter_off = offsetof(struct dp_stats_percpu, n_frags);
goto out;
}
/* Look up flow. */ /* find the matching flow entry, e.g. whether the packet belongs to a particular TCP connection */
flow_node = tbl_lookup(rcu_dereference(dp->table), &key,
flow_hash(&key), flow_cmp);
if (unlikely(!flow_node)) {
dp_output_control(dp, skb, _ODPL_MISS_NR, OVS_CB(skb)->tun_id);
stats_counter_off = offsetof(struct dp_stats_percpu, n_missed);
goto out;
}
OVS_CB(skb)->flow = flow_cast(flow_node);
}
stats_counter_off = offsetof(struct dp_stats_percpu, n_hit);
flow_used(OVS_CB(skb)->flow, skb);
acts = rcu_dereference(OVS_CB(skb)->flow->sf_acts);
/* Check whether we've looped too much. */
loop = loop_get_counter();
if (unlikely(++loop->count > MAX_LOOPS))
loop->looping = true;
if (unlikely(loop->looping)) {
loop_suppress(dp, acts);
kfree_skb(skb);
goto out_loop;
}
/* Execute actions. */
execute_actions(dp, skb, &OVS_CB(skb)->flow->key, acts->actions,
acts->actions_len); /* run the actions attached to the matched flow */
/* Check whether sub-actions looped too much. */
if (unlikely(loop->looping))
loop_suppress(dp, acts);
out_loop:
/* Decrement loop counter. */
if (!--loop->count)
loop->looping = false;
loop_put_counter();
out:
/* Update datapath statistics. */
local_bh_disable();
stats = per_cpu_ptr(dp->stats_percpu, smp_processor_id());
write_seqcount_begin(&stats->seqlock);
(*(u64 *)((u8 *)stats + stats_counter_off))++;
write_seqcount_end(&stats->seqlock);
local_bh_enable();
}
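The lookup-then-count pattern in `dp_process_received_packet` can be sketched as follows. This is a simplified illustration under several assumptions: the reduced `struct flow_key`, the fixed-size linear-probing table, and the names `flow_insert`, `flow_lookup`, and `process_packet` are all hypothetical, and the real datapath uses its own `tbl` hash table and RCU. What the sketch keeps is the shape of the logic: extract a key, look it up, treat a miss as "ask userspace", and pick the statistics counter by `offsetof()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, much-reduced flow key. */
struct flow_key {
    uint16_t in_port;
    uint16_t tp_dst;
    uint32_t nw_dst;
};

struct flow {
    struct flow_key key;
    int out_port;       /* the only "action" in this sketch */
    int used;           /* is this slot occupied? */
};

struct dp_stats {
    uint64_t n_hit;
    uint64_t n_missed;
};

#define TABLE_SIZE 64   /* must be a power of two */

struct datapath {
    struct flow table[TABLE_SIZE];
    struct dp_stats stats;
};

/* FNV-1a style mix over the key fields; any reasonable hash would do. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;
    h = (h ^ k->in_port) * 16777619u;
    h = (h ^ k->tp_dst) * 16777619u;
    h = (h ^ k->nw_dst) * 16777619u;
    return h;
}

static int flow_equal(const struct flow_key *a, const struct flow_key *b)
{
    return a->in_port == b->in_port && a->tp_dst == b->tp_dst &&
           a->nw_dst == b->nw_dst;
}

/* Linear-probing insert; returns 0 on success, -1 if the table is full. */
static int flow_insert(struct datapath *dp, const struct flow_key *k, int out_port)
{
    uint32_t i, start = flow_hash(k) & (TABLE_SIZE - 1);

    for (i = 0; i < TABLE_SIZE; i++) {
        struct flow *f = &dp->table[(start + i) & (TABLE_SIZE - 1)];
        if (!f->used || flow_equal(&f->key, k)) {
            f->key = *k;
            f->out_port = out_port;
            f->used = 1;
            return 0;
        }
    }
    return -1;
}

static const struct flow *flow_lookup(const struct datapath *dp, const struct flow_key *k)
{
    uint32_t i, start = flow_hash(k) & (TABLE_SIZE - 1);

    for (i = 0; i < TABLE_SIZE; i++) {
        const struct flow *f = &dp->table[(start + i) & (TABLE_SIZE - 1)];
        if (!f->used)
            return NULL;        /* empty slot ends the probe sequence */
        if (flow_equal(&f->key, k))
            return f;
    }
    return NULL;
}

/* Returns the output port on a hit, or -1 on a miss (where the real code
 * would pass the packet up to userspace). */
static int process_packet(struct datapath *dp, const struct flow_key *k)
{
    const struct flow *f;
    size_t counter_off;
    int out_port = -1;

    assert(dp != NULL && k != NULL);
    f = flow_lookup(dp, k);
    if (!f) {
        counter_off = offsetof(struct dp_stats, n_missed);
    } else {
        counter_off = offsetof(struct dp_stats, n_hit);
        out_port = f->out_port;
    }
    /* One shared update path: bump whichever counter the offset selects. */
    (*(uint64_t *)((unsigned char *)&dp->stats + counter_off))++;
    return out_port;
}
```

Recording an offset instead of incrementing in each branch lets every exit path share one update at the end, which is why the original function funnels everything through the `out:` label.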
=========================actions.c============================
/* Execute a list of actions against 'skb'. */
int execute_actions(struct datapath *dp, struct sk_buff *skb,
const struct odp_flow_key *key,
const struct nlattr *actions, u32 actions_len)
{
/* Every output action needs a separate clone of 'skb', but the common
* case is just a single output action, so that doing a clone and
* then freeing the original skbuff is wasteful. So the following code
* is slightly obscure just to avoid that. */
int prev_port = -1;
u32 priority = skb->priority;
const struct nlattr *a;
int rem, err;
if (dp->sflow_probability) {
struct vport *p = OVS_CB(skb)->vport;
if (p) {
atomic_inc(&p->sflow_pool);
if (dp->sflow_probability == UINT_MAX ||
net_random() < dp->sflow_probability)
sflow_sample(dp, skb, actions, actions_len, p);
}
}
OVS_CB(skb)->tun_id = 0;
for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
if (prev_port != -1) {
do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port); /* send a clone out of the port chosen by the rule */
prev_port = -1;
}
switch (nla_type(a)) {
case ODPAT_OUTPUT:
prev_port = nla_get_u32(a);
break;
case ODPAT_CONTROLLER:
err = output_control(dp, skb, nla_get_u64(a));
if (err) {
kfree_skb(skb);
return err;
}
break;
case ODPAT_SET_TUNNEL:
OVS_CB(skb)->tun_id = nla_get_be64(a);
break;
case ODPAT_SET_DL_TCI:
skb = modify_vlan_tci(dp, skb, key, a, rem);
if (IS_ERR(skb))
return PTR_ERR(skb);
break;
case ODPAT_STRIP_VLAN:
skb = strip_vlan(skb);
break;
case ODPAT_SET_DL_SRC:
skb = make_writable(skb, 0);
if (!skb)
return -ENOMEM;
memcpy(eth_hdr(skb)->h_source, nla_data(a), ETH_ALEN);
break;
case ODPAT_SET_DL_DST:
skb = make_writable(skb, 0);
if (!skb)
return -ENOMEM;
memcpy(eth_hdr(skb)->h_dest, nla_data(a), ETH_ALEN);
break;
case ODPAT_SET_NW_SRC:
case ODPAT_SET_NW_DST:
skb = set_nw_addr(skb, key, a);
break;
case ODPAT_SET_NW_TOS:
skb = set_nw_tos(skb, key, nla_get_u8(a));
break;
case ODPAT_SET_TP_SRC:
case ODPAT_SET_TP_DST:
skb = set_tp_port(skb, key, a);
break;
case ODPAT_SET_PRIORITY:
skb->priority = nla_get_u32(a);
break;
case ODPAT_POP_PRIORITY:
skb->priority = priority;
break;
case ODPAT_DROP_SPOOFED_ARP:
if (unlikely(is_spoofed_arp(skb, key)))
goto exit;
break;
}
if (!skb)
return -ENOMEM;
}
exit:
if (prev_port != -1)
do_output(dp, skb, prev_port);
else
kfree_skb(skb);
return 0;
}
static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port)
{
struct vport *p;
if (!skb)
goto error;
p = rcu_dereference(dp->ports[out_port]);
if (!p)
goto error;
vport_send(p, skb); /* send out through the port */
return;
error:
kfree_skb(skb);
}
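The "slightly obscure" clone avoidance in `execute_actions` is easier to see in a reduced form. The sketch below is an assumption-laden illustration, not the kernel code: `struct packet`, `struct action`, the two action types, and the bookkeeping arrays are hypothetical. It keeps the one idea that matters: an output is only remembered in `prev_port`, a clone is made only if another action follows, and the final output gets the original packet.

```c
#include <assert.h>
#include <stdlib.h>

enum action_type { ACT_OUTPUT, ACT_SET_PRIORITY };

struct action {
    enum action_type type;
    int arg;                /* port number or priority value */
};

struct packet {
    int priority;
};

#define MAX_PORTS 8         /* assumed small fixed port count */

static int clones_made;                 /* copies the loop had to allocate */
static int sent_on[MAX_PORTS];          /* packets sent per port */
static int sent_priority[MAX_PORTS];    /* priority of the last packet sent */

static struct packet *packet_clone(const struct packet *pkt)
{
    struct packet *copy = malloc(sizeof(*copy));

    if (copy) {
        *copy = *pkt;
        clones_made++;
    }
    return copy;
}

/* Consumes 'pkt', just as do_output() consumes its skb. */
static void do_output(struct packet *pkt, int port)
{
    if (pkt && port >= 0 && port < MAX_PORTS) {
        sent_on[port]++;
        sent_priority[port] = pkt->priority;
    }
    free(pkt);
}

/* Takes ownership of 'pkt', which must be heap-allocated. */
static void execute_actions(struct packet *pkt, const struct action *acts, size_t n)
{
    int prev_port = -1;
    size_t i;

    assert(pkt != NULL);
    for (i = 0; i < n; i++) {
        /* Another action follows a pending output, so the original is
         * still needed: send a clone and keep the original. */
        if (prev_port != -1) {
            do_output(packet_clone(pkt), prev_port);
            prev_port = -1;
        }
        switch (acts[i].type) {
        case ACT_OUTPUT:
            prev_port = acts[i].arg;    /* defer the actual send */
            break;
        case ACT_SET_PRIORITY:
            pkt->priority = acts[i].arg;
            break;
        }
    }
    /* The last pending output gets the original, saving one clone. */
    if (prev_port != -1)
        do_output(pkt, prev_port);
    else
        free(pkt);
}
```

With a single output action no clone is made at all; with N output actions only N-1 clones are needed, and each clone captures the packet as modified up to that point.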
=========================vport.c=======================================
/**
* vport_send - send a packet on a device
*
* @vport: vport on which to send the packet
* @skb: skb to send
*
* Sends the given packet and returns the length of data sent. Either RTNL
* lock or rcu_read_lock must be held.
*/
int vport_send(struct vport *vport, struct sk_buff *skb)
{
int mtu;
int sent;
mtu = vport_get_mtu(vport);
if (unlikely(packet_length(skb) > mtu && !skb_is_gso(skb))) {
if (net_ratelimit())
pr_warn("%s: dropped over-mtu packet: %d > %d\n",
dp_name(vport->dp), packet_length(skb), mtu);
goto error;
}
sent = vport->ops->send(vport, skb); /* the send function supplied when the vport type was registered */
if (vport->ops->flags & VPORT_F_GEN_STATS && sent > 0) {
struct vport_percpu_stats *stats;
local_bh_disable();
stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id());
write_seqcount_begin(&stats->seqlock);
stats->tx_packets++;
stats->tx_bytes += sent;
write_seqcount_end(&stats->seqlock);
local_bh_enable();
}
return sent;
error:
kfree_skb(skb);
vport_record_error(vport, VPORT_E_TX_DROPPED);
return 0;
}
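The bookkeeping in `vport_send` reduces to three steps: refuse an oversized packet unless it is GSO, hand the rest to the port's own send function, and count only what was actually sent. A minimal sketch, assuming a hypothetical `struct port` that carries a plain length instead of an skb (`fake_send` and `port_send` are illustrative names as well):

```c
#include <assert.h>
#include <stddef.h>

struct port_stats {
    unsigned long tx_packets;
    unsigned long tx_bytes;
    unsigned long tx_dropped;
};

struct port {
    int mtu;
    /* Device-specific transmit; returns bytes sent, 0 on failure. */
    int (*send)(struct port *port, int len);
    struct port_stats stats;
};

/* Hypothetical driver that always succeeds. */
static int fake_send(struct port *port, int len)
{
    (void)port;
    return len;
}

/* 'len' is the length compared against the MTU; 'is_gso' marks a packet
 * that will be segmented later and may therefore exceed the MTU. */
static int port_send(struct port *port, int len, int is_gso)
{
    int sent;

    assert(port != NULL && port->send != NULL);
    if (len > port->mtu && !is_gso) {
        port->stats.tx_dropped++;   /* like VPORT_E_TX_DROPPED */
        return 0;
    }
    sent = port->send(port, len);
    if (sent > 0) {
        port->stats.tx_packets++;
        port->stats.tx_bytes += (unsigned long)sent;
    }
    return sent;
}
```

The real function additionally protects the per-CPU counters with a seqcount and disables bottom halves around the update; the sketch leaves that out.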
======================vport-netdev.c==================
Now look at the send handler for a vport of the type that attaches to an existing net_device.
static int netdev_send(struct vport *vport, struct sk_buff *skb)
{
struct netdev_vport *netdev_vport = netdev_vport_priv(vport);
int len = skb->len;
skb->dev = netdev_vport->dev;
forward_ip_summed(skb);
dev_queue_xmit(skb); /* put the packet on the device's transmit queue, so it leaves through the net_device to the external network */
return len;
}
=============================vport-internal_dev.c=====================================
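As the netdev port type above shows, once a network device is used by the vswitch it can no longer hand its packets to the host's upper protocol layers in the normal way. For example, once eth0 is taken over by the vswitch, the Linux kernel no longer receives eth0's packets directly. Open vSwitch receives them instead, and it may forward them by rule straight to the net_device of another vport. That other net_device may be a virtual machine's interface, such as a vif device in Xen, and the packet then reaches the virtual machine through the vif. The host itself therefore never sees the packets that arrive on eth0. Open vSwitch also implements a second vport type, internal_dev, however. A vport of this type registers a network device of its own. Through that device the host can send packets into the vswitch, and when the vport receives a packet from the vswitch it passes the packet up to the kernel protocol stack.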
static int internal_dev_recv(struct vport *vport, struct sk_buff *skb)
{
struct net_device *netdev = netdev_vport_priv(vport)->dev;
int len;
skb->dev = netdev; /* hand the packet to the vport's own network device */
len = skb->len;
skb->pkt_type = PACKET_HOST;
skb->protocol = eth_type_trans(skb, netdev);
if (in_interrupt())
netif_rx(skb); /* the net_device "receives" the packet and passes it up the stack */
else
netif_rx_ni(skb);
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,29)
netdev->last_rx = jiffies;
#endif
return len;
}
const struct vport_ops internal_vport_ops = {
.type = "internal",
.flags = VPORT_F_REQUIRED | VPORT_F_GEN_STATS | VPORT_F_FLOW,
.create = internal_dev_create, /* function that creates the vport */
.destroy = internal_dev_destroy,
.set_mtu = netdev_set_mtu,
.set_addr = netdev_set_addr,
.get_name = netdev_get_name,
.get_addr = netdev_get_addr,
.get_kobj = netdev_get_kobj,
.get_dev_flags = netdev_get_dev_flags,
.is_running = netdev_is_running,
.get_operstate = netdev_get_operstate,
.get_ifindex = netdev_get_ifindex,
.get_iflink = netdev_get_iflink,
.get_mtu = netdev_get_mtu,
.send = internal_dev_recv, /* called when the vswitch sends a packet to this vport */
};
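The difference between the two vport types comes down to what sits behind the `.send` pointer. The following sketch illustrates that dispatch with hypothetical names (`struct demo_port`, `struct demo_port_ops`, `wire_send`, `host_deliver`, `demo_send`); it is a model of the idea, not of the real `struct vport_ops`.

```c
#include <assert.h>
#include <stddef.h>

struct demo_port;

struct demo_port_ops {
    const char *type;
    int (*send)(struct demo_port *port, int len);
};

struct demo_port {
    const struct demo_port_ops *ops;
    int wire_bytes;     /* bytes queued for transmission on a device */
    int host_bytes;     /* bytes handed up to the host's own stack */
};

/* Like netdev_send(): the packet leaves through the attached device. */
static int wire_send(struct demo_port *port, int len)
{
    port->wire_bytes += len;
    return len;
}

/* Like internal_dev_recv(): "sending" to this port means the host receives. */
static int host_deliver(struct demo_port *port, int len)
{
    port->host_bytes += len;
    return len;
}

static const struct demo_port_ops netdev_ops = { "netdev", wire_send };
static const struct demo_port_ops internal_ops = { "internal", host_deliver };

/* The switch core calls this without knowing the port type. */
static int demo_send(struct demo_port *port, int len)
{
    assert(port != NULL && port->ops != NULL && port->ops->send != NULL);
    return port->ops->send(port, len);
}
```

This is why `internal_vport_ops` can plug a function named `internal_dev_recv` into `.send`: from the switch's point of view it is sending, while from the host's point of view the device is receiving.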
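Summary: after this quick read, the flow and rough implementation of the vswitch are somewhat clearer. It works through the kernel's net_device structure, hooking the points where a network device transmits and receives packets. It then forwards packets between different net_devices and changes where they go, which is exactly what a virtual switch does. Its control logic still takes a lot of work, of course. The technique of moving skb packets around between net_devices is also worth learning.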