[OVN] VLAN networks for North / South Traffic Broken

Bug #2035332 reported by Graeme Moss
This bug affects 3 people
Affects: neutron | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

## Environment

### Deployment

- Ubuntu 22.04 LTS
- OpenStack release Zed
- Kolla-ansible - stable/zed repo
- Kolla - stable/zed repo
- Containers built with ubuntu 22.04 LTS
- Containers built on 2023-08-23
- OVN+DVR+VLAN tenant networks.
- We have three controllers: occ00001, occ00002, occ00003
- Neutron version neutron-21.1.3.dev34 commit d6ee668cc32725cb7d15d2e08fdb50a761f91fe4
- ovn-nbctl 22.09.1
- Open vSwitch Library 3.0.3
- DB Schema 6.3.0

1. New provider network deployed into OpenStack, on VLAN 504.
2. Router connected to this provider network.
3. Instance connected to the provider network, no FIP (example commands below).
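
For reference, the setup can be reproduced with commands along these lines (a sketch only: network/router/instance names, the physical network, image and flavor are placeholders, and the internal 192.168.0.0/24 subnet is inferred from the LRP dump further down):

```
# Provider (external) network on VLAN 504; names and physnet are assumptions.
openstack network create --external --provider-network-type vlan \
    --provider-physical-network physnet1 --provider-segment 504 vlan504-ext
openstack subnet create --network vlan504-ext \
    --subnet-range <external-cidr> --no-dhcp vlan504-ext-subnet   # /25 range, redacted in this report

# Router using the provider network as its gateway.
openstack router create new-r1-test
openstack router set --external-gateway vlan504-ext new-r1-test

# Tenant VLAN network behind the router (subnet taken from the internal LRP below).
openstack network create tenant-vlan
openstack subnet create --network tenant-vlan --subnet-range 192.168.0.0/24 tenant-vlan-subnet
openstack router add subnet new-r1-test tenant-vlan-subnet

# Instance on the tenant network, no FIP attached.
openstack server create --network tenant-vlan --image <image> --flavor <flavor> test-vm
```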

## Issues

Attempting to send north/south traffic (e.g. `ping 8.8.8.8`) results in the following symptoms: the first two pings succeed and all subsequent pings fail. If the ping is cancelled and a couple of minutes pass, two pings succeed again, then it goes back to failing.

New routers with VLAN networks attached do not get all three ports created on the controllers.

Even after fixing the router's localnet ports so that all three exist (by changing the priorities), north/south traffic is still limited to two pings when a FIP is attached.

Only when setting `reside-on-redirect-chassis` to `true` can we get the VLAN network to work with FIPs and have bare-metal nodes use FIPs.

## Diagnostics

Looking at the ovn-controller logs on the control nodes, we can see that it tries to claim the port on occ00001, which matches the gateway chassis on the router's LRP port.

```
2023-09-06T14:13:32.454Z|00718|binding|INFO|Claiming lport cr-lrp-1a089d8f-d7a3-4116-a496-94cb87abe57f for this chassis.
2023-09-06T14:13:32.454Z|00719|binding|INFO|cr-lrp-1a089d8f-d7a3-4116-a496-94cb87abe57f: Claiming fa:16:3e:fc:ba:cf 1xx.xx.xxx.xxx/25
```
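
For cross-checking, the chassis that currently holds the binding can also be read directly from the southbound database (a sketch; the port name is taken from the log lines above):

```
# Show the Port_Binding row for the chassisredirect port; the "chassis" column
# points at the chassis that has claimed it.
ovn-sbctl find Port_Binding logical_port=cr-lrp-1a089d8f-d7a3-4116-a496-94cb87abe57f
```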

Gateway chassis of the LRP port.

```
ovn-nbctl list Gateway_Chassis | grep -A2 -B4 lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1

_uuid : cf26be06-206d-443c-b224-25cc06ef2094
chassis_name : occ00002
external_ids : {}
name : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00002
options : {}
priority : 2
--

_uuid : 1d9e8314-ed00-4694-8974-0328b78d34e1
chassis_name : occ00001
external_ids : {}
name : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00001
options : {}
priority : 3
--

_uuid : b1e41ceb-ca2d-42eb-a896-b3551ea1b32f
chassis_name : occ00003
external_ids : {}
name : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00003
options : {}
priority : 1
```

We see nothing about `occ00002` or `occ00003` trying to claim the LRP port, but when changing the priorities around to try to resolve this, we found that the port is not on `occ00001` but on `occ00002`.
We change occ00001 to priority 1 and occ00003 to priority 3, which means `occ00003` will become the highest-priority gateway:

```
ovn-nbctl set gateway_chassis 1d9e8314-ed00-4694-8974-0328b78d34e1 priority=1
ovn-nbctl set gateway_chassis b1e41ceb-ca2d-42eb-a896-b3551ea1b32f priority=3
```
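
The resulting priority order can be verified with `lrp-get-gateway-chassis`, which lists the gateway chassis for the LRP in priority order:

```
ovn-nbctl lrp-get-gateway-chassis lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1
```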

The logs then show the following.

occ00001

```
2023-09-06T14:10:06.134Z|00667|binding|INFO|Releasing lport cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1 from this chassis (sb_readonly=0)
2023-09-06T14:10:06.134Z|00668|if_status|WARN|Trying to release unknown interface cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1
```

occ00002

```
2023-09-06T14:10:14.883Z|00444|binding|INFO|Releasing lport cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1 from this chassis (sb_readonly=0)
2023-09-06T14:10:14.883Z|00445|if_status|WARN|Trying to release unknown interface cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1
```

occ00003

```
2023-09-06T14:10:14.789Z|00459|binding|INFO|Changing chassis for lport cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1 from occ00002 to occ00003.
2023-09-06T14:10:14.789Z|00460|binding|INFO|cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1: Claiming fa:16:3e:71:df:71 1xx.xx.xxx.xxx/25
```

On `occ00003` we can see that `occ00002` had the gateway, not `occ00001`, which should have had it. This happens when creating new routers on the VLAN provider network. All existing routers from before the upgrade are working, and they have the same priorities.

## Second diagnostics

Looking at each Logical Router we can see that, when the router is first created, only two of the three ports are created.
Broken router:

```
_uuid : 773bb527-f193-4b47-8685-e62c9325dd7b
copp : []
enabled : true
external_ids : {"neutron:availability_zone_hints"="", "neutron:gw_network_id"="c9d130bc-301d-45c0-9328-a6964af65579", "neutron:gw_port_id"="1a089d8f-d7a3-4116-a496-94cb87abe57f", "neutron:revision_number"="4", "neutron:router_name"=new-r1-test}
load_balancer : []
load_balancer_group : []
name : neutron-2b51e12e-5505-477e-9720-e5db31a05790
nat : [f22e6004-ad69-4b12-9445-7006a03495f5]
options : {always_learn_from_arp_request="false", dynamic_neigh_routers="true"}
policies : []
ports : [c59b5f9e-707e-43eb-912a-ea2679f1f723, c8f8ba72-64b4-4209-8209-128c93b157bc]
static_routes : [36ad39c0-c3f0-4842-b9b8-b4e986147624]
```

The working router has all three ports after we make the priority change; this means that the change forces the missing port to be created.
Working Router:

```
_uuid : 8734ea01-21e7-4e69-8649-b05b125ce36e
copp : []
enabled : true
external_ids : {"neutron:availability_zone_hints"="", "neutron:gw_network_id"="c9d130bc-301d-45c0-9328-a6964af65579", "neutron:gw_port_id"="dbe08713-97e1-4bea-880b-70910e05180d", "neutron:revision_number"="16", "neutron:router_name"=R2-test-demo2}
load_balancer : []
load_balancer_group : []
name : neutron-cbabcf4c-08a3-4e31-9485-a456237ef427
nat : [4bba0f50-6937-47cc-8771-2caef2aee7e6, 51f7f8fc-3b07-4a75-8dc3-32b0e2c4e02a, 663f6c59-4cc1-4802-b0ff-5ae34e83210e]
options : {always_learn_from_arp_request="false", dynamic_neigh_routers="true"}
policies : []
ports : [a9590024-feb2-4724-be7a-8bdb5fe3f9af, c1b94349-d320-4573-a2d5-2b1d3e91f679, ccae3d63-7203-4e39-8960-1e17df22fb31]
static_routes : [8e89f98e-cf75-4ae4-bbb6-e459e6ae9a6c]
```
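
The difference in port count between the two routers can also be seen quickly with `lrp-list` (router names taken from the two dumps above):

```
ovn-nbctl lrp-list neutron-2b51e12e-5505-477e-9720-e5db31a05790   # broken router, 2 ports
ovn-nbctl lrp-list neutron-cbabcf4c-08a3-4e31-9485-a456237ef427   # working router, 3 ports
```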

## Resolution

When we look at the Logical Router Port of the internal interface (the one attached to the VLAN network), we can see that its options contain the following.

```
name : lrp-d6e063e5-d209-43ec-9da2-4ac9f9e8ccbc
networks : ["192.168.0.1/24"]
options : {reside-on-redirect-chassis="false"}
```

And on the External LRP we have the following.

```
mac : "fa:16:3e:fc:ba:cf"
name : lrp-1a089d8f-d7a3-4116-a496-94cb87abe57f
networks : ["1xx.xx.2xx.2xx/25"]
options : {redirect-type=bridged, reside-on-redirect-chassis="false"}
```

My understanding is that `reside-on-redirect-chassis` is meant to force traffic through the gateway chassis rather than being distributed (DVR). For VLAN networks this should be `true`, because VLAN traffic needs to go through the chassis gateway for everything, whereas Geneve networks can leave it as `false` to allow DVR.
When I change this to `true` on the VLAN LRP with `ovn-nbctl set logical_router_port lrp-d6e063e5-d209-43ec-9da2-4ac9f9e8ccbc options:reside-on-redirect-chassis=true`, packets flow through the chassis and I can ping outwards. FIPs can now be attached to the VLAN network and we can connect with no problem.
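
For reference, the workaround expressed as commands (port name taken from the dump above; note this edits the OVN northbound database directly, outside of Neutron):

```
# Centralise traffic for the tenant-VLAN LRP on the gateway chassis.
ovn-nbctl set logical_router_port lrp-d6e063e5-d209-43ec-9da2-4ac9f9e8ccbc \
    options:reside-on-redirect-chassis=true

# Confirm the option was applied.
ovn-nbctl get logical_router_port lrp-d6e063e5-d209-43ec-9da2-4ac9f9e8ccbc options
```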

Looking at the merged fix https://review.opendev.org/c/openstack/neutron/+/879296, I don't understand what is meant to happen, but the VLAN LRP is not being set to true, which causes problems. The external LRP is being set correctly, but VLAN networks need to be centralised.

Tags: ovn
Revision history for this message
Steven Relf (srelf) wrote :

Hey all,
Do we need anymore information on this, or a discussion about a way forward?

Revision history for this message
yatin (yatinkarel) wrote :

Just to update: for the second part of the issue, this was discussed over IRC on Dec 14th and Luis pointed out the issue is possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=2007120. I checked and the fix is missing from the OVN version used in the environment: the patch has been available since 22.09.2+, while the reported version here is 22.09.1.

Also, I didn't reproduce the second part of the issue (FIP connectivity and N/S without FIP also worked) with an Antelope OpenStack with ovn23.06. Related references:

External router port:
_uuid : 7e328b13-5e75-4f58-8c0c-2849b6e1c206
enabled : []
external_ids : {"neutron:network_name"=neutron-05738145-1ebe-4286-ad60-3236c73c5b4d, "neutron:revision_number"="7", "neutron:router_name"="757ac803-6856-4293-9723-6cdd5cd05ad6", "neutron:subnet_ids"="c7003b60-d89e-4f0b-9383-534659dd913b"}
gateway_chassis : [2ec37a79-3bdf-43a5-89c1-191c4dc39590, 533f41d7-756a-4b19-b89b-56a7fbe1975e, e83c98a9-7593-4c67-8a25-144762323493]
ha_chassis_group : []
ipv6_prefix : []
ipv6_ra_configs : {}
mac : "fa:16:3e:e4:07:86"
name : lrp-b8c79af5-0cec-471b-b641-c75ae13d809c
networks : ["172.30.0.205/24"]
options : {redirect-type=bridged, reside-on-redirect-chassis="false"}
peer : []

Internal router port:
_uuid : 86707480-6538-4255-b6cc-04dd191ea9ae
enabled : []
external_ids : {"neutron:network_name"=neutron-48e66011-fd12-41d3-9b98-58eda3fc9390, "neutron:revision_number"="3", "neutron:router_name"="757ac803-6856-4293-9723-6cdd5cd05ad6", "neutron:subnet_ids"="098375eb-ac3a-449d-88c2-d08b64f04ea9"}
gateway_chassis : []
ha_chassis_group : []
ipv6_prefix : []
ipv6_ra_configs : {}
mac : "fa:16:3e:15:7e:9e"
name : lrp-f7ec59b1-5ebb-4627-9547-b304ff764e24
networks : ["192.19.0.1/24"]
options : {reside-on-redirect-chassis="false"}
peer : []

Revision history for this message
Sven Kieske (s-kieske) wrote :

We are currently discussing enabling OVN by default in https://review.opendev.org/c/openstack/kolla-ansible/+/904959 and I was linked to this bug report.

From the last message I take it that the second part of this issue is fixed with a newer OVN release and that the first part (north/south traffic being unreliable/dropped) is not reproducible.

Is that a correct understanding? Can maybe the person who opened the bug report confirm this? Did anybody test this on a newer release?

Thank you all for your effort debugging this.

Revision history for this message
Sven Kieske (s-kieske) wrote :

During the discussion of this topic in #openstack-neutron the possibility was raised that the issue with north/south traffic not working could be related to MTU fragmentation from an external network to OVN (the other direction is handled, but not external->OVN).
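
A quick way to check for an MTU/fragmentation problem from the instance, assuming a standard Linux `ping`, is to probe with the don't-fragment bit set at sizes around the expected path MTU:

```
ping -M do -s 1472 8.8.8.8   # 1472 + 28 bytes of headers = 1500, should work on a 1500 MTU path
ping -M do -s 1473 8.8.8.8   # fails or gets no reply if the path MTU is 1500
```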

Could you (the original reporter) or maybe someone else verify this?

This bug is now also on the list for the next neutron meeting, thanks to @ralonsoh for that.

Thanks

Revision history for this message
Steven Relf (srelf) wrote :

I shall attempt to get upgraded and test.

Revision history for this message
Graeme Moss (gramimoss) wrote :

Due to a sudden change we no longer have access to hardware/stack and I'm unable to do any upgrades and testing.

However, I have confirmation that a similar setup with the Antelope version of kolla-ansible on Ubuntu cannot reproduce the first issue. This would indicate that the updated version of OVN in the Antelope kolla builds resolves it, i.e. the bug lies in the OVN version shipped with the Zed release.

I think anyone who comes across this problem should look at upgrading to Antelope.
