Nov 22 2011
 

ESXi 5 changed its iSCSI initiator, which caused quite a few problems for some existing open source targets.

I ended up trying FreeNAS (you want version 8). It works okay and performance is good enough, but there’s one catch: it runs on FreeBSD.

It’s not that I don’t like FreeBSD; I just haven’t had much exposure to it, and it lacks some features in surprising ways. For example, there’s no PnP support?! You can’t just add a network card and expect it to show up; you have to reboot the machine. WTF? The same applies to SCSI controllers. It also doesn’t support vmxnet3, and vmxnet2 seems to lock itself up in the kernel. Sigh. It might work on physical machines, but I want to use it in an ESXi environment.

However, the underlying iSCSI daemon, istgt, is not only for FreeBSD. It claims to be *nix compatible, so I tested it, and it works! It has been running fine on my Ubuntu Lucid box with vmxnet3 and local software RAID 10 (md), with top-notch performance.

First, to compile it, you need libssl-dev:

sudo apt-get install libssl-dev

And then it’s very simple:

wget http://shell.peach.ne.jp/~aoyama/wordpress/download/istgt-20111008.tar.gz
tar zxvf istgt-20111008.tar.gz
cd istgt-20111008
./configure
make
make install

It installs to /usr/local/ by default, which you might want to change. Everything compiles to a single binary, istgt, plus some config files. Create a target pointing to /dev/md_d0 and it runs perfectly!
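As a rough sketch of what “create a target pointing to /dev/md_d0” can look like, here is a small shell helper that prints a minimal istgt.conf. The key names are written from memory of the sample config shipped with istgt, so check them against the istgt.conf sample in the tarball before relying on them. The NodeBase, the target name disk1, and the Netmask are placeholders I made up, and the helper name istgt_conf is hypothetical.

```shell
#!/bin/sh
# Print a minimal istgt.conf on stdout (a sketch, not a complete config).
#   $1: block device to export, e.g. /dev/md_d0
#   $2: IP address to listen on (default 0.0.0.0, meaning all interfaces)
# NodeBase, the target name "disk1" and the Netmask below are placeholders.
istgt_conf() {
    device="$1"
    portal_ip="${2:-0.0.0.0}"
    cat <<EOF
[Global]
  NodeBase "iqn.2011-11.com.example.istgt"
  PidFile /var/run/istgt.pid
  AuthFile /usr/local/etc/istgt/auth.conf
  MediaDirectory /var/istgt
  LogFacility "local7"
  DiscoveryAuthMethod Auto

[PortalGroup1]
  Portal DA1 ${portal_ip}:3260

[InitiatorGroup1]
  # "ALL" matches any initiator name; Netmask still limits source addresses
  InitiatorName "ALL"
  Netmask 192.168.0.0/16

[LogicalUnit1]
  TargetName disk1
  Mapping PortalGroup1 InitiatorGroup1
  AuthMethod Auto
  UnitType Disk
  # "Auto" asks istgt to detect the size of the device
  LUN0 Storage ${device} Auto
EOF
}

# Usage (as root, assuming the default /usr/local prefix):
#   istgt_conf /dev/md_d0 > /usr/local/etc/istgt/istgt.conf
```

Generating the file from a function keeps the device path and listen address in one place, which is handy if you export several md devices from the same box.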

Compared to the various other software I’ve tried, this is good enough, with only some very small problems:

I also tried some other choices including those from di

  • IET http://iscsitarget.sourceforge.net/
    IET doesn’t support SCSI persistent reservations. A patch has been submitted, but a new version hasn’t been released, so it’s hard to say how good it will be. I may try it once they hash that out.
  • StarWind iSCSI free version http://www.starwindsoftware.com/downloads
    This is Windows only. I’ve used it for about half a year now and it has been pretty stable. The MPIO support seems problematic, so I ended up using LACP instead. The read/write caches are okay, but you can’t change them on the fly; you have to remove the target and recreate it (WTF?!).
    It also doesn’t support raw devices under the free license, which is a weird restriction I don’t understand.
  • EMC Celerra Uber VSA http://nickapedia.com/
    I installed it. The so-called Unisphere UI needs a Java plugin, which in turn doesn’t even launch correctly on my Windows machine. Why am I surprised? It’s from EMC. Maybe I’ll give it another try someday.
  • FalconStor iSCSI http://freeiscsisan.falconstor.com/
    This is better: CentOS with some custom kernel modules. The UI is also Java, but it works nonetheless. However, this software is burdened by its idea of SAN Clients: in order to manage their customized iSCSI client, they require you to either put all your initiators (WTF?) into an ACL, or create a CHAP user/password that each of your connections uses to authenticate.
    This is just way too much trouble. I am not going to figure out all the initiator names, which could change anyway. At least give me something like “ALLOW ALL”.
Oct 03 2011
 

China constantly amuses me. Today I was paged by some users saying the website was unreachable. Okay, I thought, I’m using SSL on my website (free cert from StartCom), which should keep it from being GFWed in the first place, so it must be a server problem. But the servers were all fine.

Then, sniffing on a user’s machine, I saw that the machine was getting fake RSTs (a typical symptom of being blocked by the GFW). I then tried some other domains with certs from StartCom, and they were blocked too! Damn. Luckily I had a GoDaddy cert as a backup for the domain, so I loaded it on the server and everything started working again! It’s like black magic!!!
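If you want to look for the resets yourself, a capture along these lines may help. This is only a sketch: the helper name rst_filter is hypothetical, and eth0 and 203.0.113.10 are placeholders for your interface and your server’s address. Run the same capture on the client and on the server; a reset that one side receives but the other side never sent was probably injected somewhere in between.

```shell
#!/bin/sh
# Build a tcpdump/pcap filter that matches TCP resets exchanged with one server.
#   $1: server IP address
#   $2: TCP port (default 443)
rst_filter() {
    printf 'host %s and tcp port %s and tcp[tcpflags] & (tcp-rst) != 0' "$1" "${2:-443}"
}

# Usage (needs root; eth0 and the address are placeholders):
#   tcpdump -n -v -i eth0 "$(rst_filter 203.0.113.10)"
# -v also prints the IP TTL of each packet, which may differ between
# genuine packets from the peer and forged resets.
```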

OK, some rationale here: I think it is very possible that the GFW is detecting StartCom certificates by sniffing the session parameters, as Iran demonstrated when blocking Tor. It then sends fake RSTs to both server and client to interrupt the connection.

https://blog.torproject.org/blog/iran-blocks-tor-tor-releases-same-day-fix

 

Okay, what to do now? We need to figure out *which* part of the certificate they are using as a keyword, unless they are blocking the whole CA (which I fear, but it might be true). Hopefully StartCom can do something to change the keyword in the cert too.

Oct 01 2011
 

I have a web application running off https://mydomain.com, which issues AJAX calls. Recently some of my users complained that AJAX calls time out or fail from time to time, even though the main UI loads fine.

I looked at the server log and asked a user to try again, and found that those calls never reach the server!

Luckily I was able to reproduce the problem remotely on the user’s machine. I opened Chrome’s developer console and found this:

XXX mydomain.com/api GET (cancel) text/plain 30B 0B 0ms 0ms 0ms

Why are my calls canceled? In this case, my callback gets status_code = 0. Okay, so I should trace this call.

I opened a new tab with chrome://net-internals and navigated to the Events tab. Filtering by “mydomain.com” showed a few red entries, indicating problems.

One of the entries shows:

ssl/mydomain.com:443

Start Time: Sun Oct 02 2011 00:46:24 GMT-0700 (Pacific Daylight Time)

t=1317541584591 [st= 0] +SOCKET_POOL_CONNECT_JOB             [dt=10]
t=1317541584591 [st= 0]    +SOCKET_POOL_CONNECT_JOB_CONNECT  [dt=10]
                            --> group_name = "ssl/mydomain.com:443"
t=1317541584591 [st= 0]        HOST_RESOLVER_IMPL            [dt= 1]
t=1317541794095 [st=  0]       +SOCKET_POOL                   [dt= 22]
t=1317541794117 [st= 22]           SOCKET_POOL_BOUND_TO_CONNECT_JOB  
                                   --> source_dependency = {"id":1485081,"type":4}
t=1317541794117 [st= 22]           SOCKET_POOL_BOUND_TO_SOCKET  
                                   --> source_dependency = {"id":1485083,"type":5}
t=1317541794117 [st= 22]       -SOCKET_POOL                   
t=1317541794216 [st=121]        CONNECT_JOB_SET_SOCKET        
                                --> source_dependency = {"id":1485083,"type":5}
t=1317541794216 [st=121]    -SOCKET_POOL_CONNECT_JOB_CONNECT  
                             --> net_error = -200 (CERT_UNABLE_TO_CHECK_REVOCATION)
t=1317541794216 [st=121] -SOCKET_POOL_CONNECT_JOB

Wait, what? I then opened my website directly in a tab, clicked the padlock icon, opened the certificate, and found the URL of GoDaddy’s CRL service. Then I tried to fetch that in Chrome. No wonder! It stalls!

Okay, now it is clear: Chrome won’t fire my request because it is not able to check the CRL for my certificate. That is probably for security reasons, but it creates a very hard dependency on GoDaddy’s CRL service being up. What can I do about that?!
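The same check can be done without a browser. Here is a sketch, assuming openssl, awk and curl are installed; the helper names crl_urls and check_crl are hypothetical, and mydomain.com is the same placeholder as above. It pulls the CRL URLs out of the certificate the server presents and tries to download each one with a hard timeout, so a stalled CRL service shows up quickly.

```shell
#!/bin/sh
# Read `openssl x509 -noout -text` output on stdin and print each URL listed
# under "CRL Distribution Points" (other URIs, such as OCSP, are skipped).
crl_urls() {
    awk '/CRL Distribution Points/ { in_crl = 1; next }
         /X509v3|Authority Information Access|Signature Algorithm/ { in_crl = 0 }
         in_crl && /URI:/ { sub(/^.*URI:/, ""); print }'
}

# Try to download one CRL with a time limit and report what happened.
#   $1: CRL URL
#   $2: timeout in seconds (default 10)
check_crl() {
    if code="$(curl --silent --output /dev/null --max-time "${2:-10}" \
                    --write-out '%{http_code}' "$1")"; then
        echo "$1 -> HTTP $code"
    else
        echo "$1 -> no response within ${2:-10}s (or connection failed)"
    fi
}

# Usage:
#   openssl s_client -connect mydomain.com:443 -servername mydomain.com \
#       </dev/null 2>/dev/null |
#     openssl x509 -noout -text | crl_urls |
#     while read -r url; do check_crl "$url"; done
```

Running this from the affected user’s network rather than from the server matters, because the CRL host may be reachable from one place and stall from another.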

If you hit the same problem, please star/comment on http://code.google.com/p/chromium/issues/detail?id=98794

Sep 27 2011
 

I have been using the Auto Deploy feature for a while on my small ESXi clusters.

Here’s a tutorial: http://osdude.wordpress.com/2011/09/20/diskless-esxi-vsphere-5/

Here’s some other stuff I found extremely useful.

1) Where do I get a “VMware image depot file”?

You can get VMware patches as depot files here:

http://www.vmware.com/patchmgr/download.portal

Select ESXi 5 and download the patch as an offline depot. Because ESXi patches always replace the whole firmware, you always get the latest one.

But a better way to do this: VMware has an online depot! Use the following when following the instructions to create your own image profile:

Add-EsxSoftwareDepot https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml

2) Add the VMware HA depot too. It contains the HA agent, and I wonder why they didn’t include it in the first place.

Add-EsxSoftwareDepot http://<vcenter server address>/vSphere-HA-depot

Remember to create your own image profile and add the package to it, otherwise your newly booted machine will have to install it on every reboot.

Add-EsxSoftwarePackage -ImageProfile "MyProfile" -SoftwarePackage vmware-fdm

3) Now, the fun stuff. Say a new patch comes along, and you updated your image profile, your deploy rule, and your deploy rule set, then rebooted the machine and found nothing changed. Right, you need to fix the cache! Do this in PowerShell:

Get-VMHost <hostname>  | Test-DeployRuleSetCompliance | Repair-DeployRuleSetCompliance

4) Host Profile nightmare
Host Profiles have an issue with RAID controllers, where the local disks get included in the host profile: http://kb.vmware.com/kb/2002488

There are also other issues with Host Profiles, for example:
* It doesn’t support ScratchConfig or passthrough config.
http://kb.vmware.com/kb/2003473
* The DNS config dialog has a bug where you can’t set it to DHCP correctly. My advice is to set it to static and provide the hostname in the answer file.
* Also, in the host profile dialog you can’t adjust the ordering of elements, so vmnicX ends up numbered by its order in the host profile. Make sure you don’t get surprised.

5) Useful settings:
* Serial logging: I am finally able to set up serial logging on my PXE-booted ESXi 5. Add this to the Advanced settings section of your host profile:

VMKernel.Boot.logPort = com2

I use Dells, where com2 is mapped to IPMI-over-LAN, which makes it very easy for me to see the logs. There are other entries too; make sure you don’t map two things to the same COM port, or you will get a nice and confusing

u'Handler in use'

error when trying to apply it.

* Syslog forwarding and verbosity

Syslog.global.logHost = udp://host:514
Config.HostAgent.log.level = info
Vpx.Vpxa.config.log.level = info

Also, when you finish tweaking your host profile, try applying it to one host first. If it fails, fix the problem before rebooting other hosts; otherwise you will get a very confusing result where the host is put in maintenance mode but the host profile is never applied. Also, try restarting the “VMware Auto Deploy Waiter” service; it seems to get stuck when you muck with the image profiles.

Leave comments below and I will try to help you!